Thread: moving CSV data into table?
I need to ask about best practices. I now have periodic bulk CSV data needing to be added to an existing table. I assume it is best to first move the data into a temporary table & scrub it before moving into the final table. However, the incoming data does not have the same schema as the permanent target as additional attributes (columns) have been added. Is it better to massage the temporary table into the same target's schema first, & use COPY to copy all temporary contents into the permanent table, or is scripting a better solution? Is there a better method?
Your insights would be appreciated.
Thanks.
Your insights would be appreciated.
Thanks.
On 09/02/2011 10:08 AM, James Hartley wrote: > I need to ask about best practices. I now have periodic bulk CSV data > needing to be added to an existing table. I assume it is best to > first move the data into a temporary table & scrub it before moving > into the final table. However, the incoming data does not have the > same schema as the permanent target as additional attributes (columns) > have been added. Is it better to massage the temporary table into the > same target's schema first, & use COPY to copy all temporary contents > into the permanent table, or is scripting a better solution? Is there > a better method? > > Your insights would be appreciated. > > Thanks. As always the answer is, "it depends". Is performance a concern? What type of scrubbing must be done? Does scrubbing require comparison to existing database data?Is lights-out automation required? A combination of the above? You may find it preferable to do pre-scrubbing (sed/grep/awk type of stuff that doesn't require access to existing data in the database) prior to initial import. I would typically create a temporary table that matches the structure of the data you are bringing in then craft a query or set of queries to scrub and import the data. Notes: You don't need to alter the temporary table structure to copy data - just select the data you want. Suppose your temporary import table has 5 columns a,b,c,d and e but your permanent table has only a, b and c. The import is basically: insert into permtable (a, b, c) select a, b, c from temptable; The select can be as complicated as you want. You can eliminate duplicates, perform calculations, add static data like a batch number, etc. For example, the following would import "a" unchanged, put b+d from the temp table into b in the permanent table, import "c" unchanged and put the integer 1 into batchnum. It also eliminates records in the import table where e is zero and where a would create a duplicate a in the permanent table: insert into permtable (a, b, c, batchnum) select a, b+d, c, 1::int from temptable t where e !=0 and not exists (select 1 from permtable p where p.a = t.a); Also note that for performance, the temporary table should really be temporary (i.e. create temporary table foo...). Temporary tables are "unlogged", visible only to the current connection and deleted when the connection closes. Since they are unlogged, you eliminate the performance cost of maintaining the write-ahead logs. (One upcoming feature is that you can specify a permanent table is unlogged - useful for certain tables where data-loss is not a problem.) Cheers, Steve
On Fri, Sep 2, 2011 at 11:16 AM, Steve Crawford <scrawford@pinpointresearch.com> wrote:
Thank you for these gems! Temporary tables are "unlogged", visible only to the current connection and deleted when the connection closes.
One question. Assuming I run psql on the same machine as the server itself, temporary tables will be deleted when psql exits?
On 09/02/2011 12:07 PM, James Hartley wrote:
That is correct - temporary tables disappear as soon as the connection closes. This is true whether the connection is local or remote and regardless of the client be it psql, Perl, Python, PHP, Prolog or even clients that don't start with "p".
Cheers,
Steve
On Fri, Sep 2, 2011 at 11:16 AM, Steve Crawford <scrawford@pinpointresearch.com> wrote:Thank you for these gems!Temporary tables are "unlogged", visible only to the current connection and deleted when the connection closes.
One question. Assuming I run psql on the same machine as the server itself, temporary tables will be deleted when psql exits?
That is correct - temporary tables disappear as soon as the connection closes. This is true whether the connection is local or remote and regardless of the client be it psql, Perl, Python, PHP, Prolog or even clients that don't start with "p".
Cheers,
Steve