Re: Extending copy_expert - Mailing list psycopg
From | Adrian Klaver |
---|---|
Subject | Re: Extending copy_expert |
Date | |
Msg-id | 543BD762.4070009@aklaver.com Whole thread Raw |
In response to | Extending copy_expert (Andrea Riciputi <andrea.riciputi@gmail.com>) |
Responses |
Re: Extending copy_expert
|
List | psycopg |
On 10/12/2014 02:28 PM, Andrea Riciputi wrote: > Hi all, > a couple of weeks ago at work we had to produce a quite big CSV data file which should be used as input by another pieceof software. > > Since the file must be produced on a daily basis, is big (let say half a TB), and it contains data stored in our PG database,letting PG produce the file itself seemed the right approach. Thanks to psycopg the whole operation is performedin C, resulting fast enough for our purpose. > > However the target software for which the file is produced, is, let say, “legacy” software and can only accept CRLF asEOL character. However by calling COPY TO STDOUT from psycopg ends up in a CSV file with LF as EOL forcing us to pass thefile a second time to convert EOL, which is inconvenient. Plus, doing it in Python, make it a little bit to slow. > > My first attempt was to ask the pgsql-hackers ML for extending the COPY TO syntax to allow a “FORCE_EOL” parameter, butthey kindly rejected my proposal. They also suggested to me to use the result of PQgetCopyData() and convert there theLF character with whatever is suitable for me. > > So I studied the psycopg codebase and spotted out where and how to change it to allow such an use case. My intent was toadd a new keyword argument to the copy_expert() method, let me call it “eol” with a default of “\n”. If the user decidesto override it using a different EOL (i.e. “\r\n” or “\r”) every EOL returned by PQgetCopyData() in _pq_copy_out_v3()can be converted. > > However I’m a little bit concerned with this solution, and before going on with a pull request, I’d like to have your feedbackhere. My main concern is that extending the copy_expert() method in psycopg leaves the user completely alone aboutusing this new keyword argument in the right way. > > We can easily allow only CR, LF, and CRLF as the values for that argument, but what if the user uses the “eol” kwarg andfor example issues a “COPY TO … AS BINARY” query? In that case the resulting output file can end up being corrupted withoutthe user can even notice that. Of course psycopg can parse the “COPY TO” query (by means of PG’s ProcessCopyOptions())and check if the “eol” kwarg is consistent with the issued query. But, frankly this is seems to becomea little bit too complex to me. > > So I’m asking to you, what’s your take on this, what do you think about that? Do you see any better way to get it done?Anyone here also involved in pgsql-hackers ML can support my idea to extend the COPY TO syntax directly in PG? > > Thanks for you help, and apologies for the long email. Alright to follow up on my previous post about open. In Python 2 newline is available in the io module, so a simple example: f = io.open('io_newline.csv', 'w', newline='\r\n') cur = con.cursor() cur.copy_expert("COPY cell_per TO STDOUT WITH CSV HEADER", f) f.close() aklaver@panda:~/software_projects> file io_newline.csv io_newline.csv: ASCII text, with CRLF line terminators > a. > > > -- Adrian Klaver adrian.klaver@aklaver.com