Re: [psycopg] Turbo ODBC - Mailing list psycopg
From: Jim Nasby
Subject: Re: [psycopg] Turbo ODBC
Msg-id: 57df076e-d215-a962-94b5-496169401452@BlueTreble.com
In response to: Re: [psycopg] Turbo ODBC (Adrian Klaver <adrian.klaver@aklaver.com>)
Responses: Re: [psycopg] Turbo ODBC
List: psycopg
On 1/15/17 3:25 PM, Adrian Klaver wrote:
> Have you looked at asyncpg:

I'm pulling Matt Rocklin in, who I've been working with on some plpython improvements. Here are his offlist replies to me:

> On 1/15/17 5:15 PM, Matthew Rocklin wrote:
>> You might also contact Wes McKinney and check out
>> Arrow: http://pyarrow.readthedocs.io/en/latest/
>>
>> This project was designed for inter-system data interchange. It would
>> help with conversion to Spark and other popular tools as well.
>>
>> On Sun, Jan 15, 2017 at 6:01 PM, Matthew Rocklin <mrocklin@continuum.io> wrote:
>>
>> I haven't taken a look at asyncpg.
>>
>> Quick note: you might want to be careful about mixing up HDFS and
>> HDF5. Postgres's competition for data storage is HDF5, not HDFS
>> (which stands for the Hadoop File System).
>>
>> I still think that the best thing to do here would be to dump out
>> Python arrays for each of the columns in a result set. I suspect
>> that you could beat /any/ system that doesn't do this handily. This
>> would avoid any performance pain of building up Python objects, and
>> can continue to use just the standard library.
>>
>> You can stick to pure Python and still send arrays encoded as
>> bytes. We can convert these to NumPy or Pandas trivially with zero
>> overhead.
>>
>> In [1]: from array import array  # this is in the standard library
>> In [2]: b = b'\x00' * 16         # 16 zeroes as a bytestring
>> In [3]: a = array('L', b)        # eight-byte unsigned integers
>> In [4]: a
>> Out[4]: array('L', [0, 0])
>>
>> We would only be bound by
>>
>> 1. The cost in Postgres to transpose the data from row-major to
>>    column-major
>> 2. The cost to move bytes across a socket
>>
>> This should run at gigabytes per second. Everything else is likely
>> to be competing in the 50-100MB/s range at top speed, is my guess.
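Rocklin's snippet above can be taken one step further: the same bytes convert to a NumPy array with no copy at all. A minimal sketch (note the stdlib 'L' typecode used above is platform-dependent, so 'Q', which is explicitly 8 bytes, is used here instead):

```python
import numpy as np
from array import array

# 16 zero bytes, standing in for a column's worth of data off a socket
buf = b'\x00' * 16

# Standard-library view: 'Q' is explicitly an 8-byte unsigned integer
# ('L' is platform-dependent and may be only 4 bytes on some systems)
a = array('Q', buf)
print(list(a))  # [0, 0]

# Zero-copy NumPy view: frombuffer wraps the bytes without copying them
nd = np.frombuffer(buf, dtype=np.uint64)
print(nd.tolist())  # [0, 0]
```

Because `np.frombuffer` only wraps the existing buffer, the per-value cost of building Python objects disappears entirely, which is the point Rocklin is making.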
My guess is that the big gains for external communication would come from:

1) reducing latency effects
2) reducing data marshaling

#1 could be accomplished in various ways, and some changes to the FE-BE protocol might be helpful. If we can demonstrate a clear win we might be able to get protocol changes. I do suspect this needs to be tune-able though. My 4x improvement to plpy.execute came from changing the equivalent of one large batch into single-row operations. Moving data to Python over a local filesystem socket would be hurt by a large batch size, while something on a high-latency connection would presumably benefit from moderately large batches, since that reduces overall TCP overhead.

I think #2 would necessitate FE-BE changes, although some of that could be tested without them. Right now, everything is going to get marshaled into a simple text format (i.e., ints become a string of digits), but for test purposes there are ways you could run that through the binary output functions. In the case of int4, you'd get 4 bytes in network order, which should be faster to handle (and could be passed directly to something like ndarray).

The two problems you run into with the array type are handling NULLs and building columns from rows. You could allow the option of specifying that NULLs won't be allowed in specific fields, or you could use something like a Pandas Series that provides other ways of handling the equivalent of NULL. For the row-to-column conversion, there's simply no way to get around the pattern of appending items one at a time to an array-like container (list, ndarray, Series, etc.) and dealing with the memory allocation problem. I've looked at Python's list code and it's certainly no worse at that than what Postgres would do. The big reason I would do that work in Python, though, is that you then have the ability to use any number of types for it; they just have to support the equivalent of append().
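A rough sketch of that binary path (the byte layout here is illustrative only, not the actual FE-BE wire format): int4 values in network order can be handed straight to ndarray, a separate validity mask is one way to represent NULLs, and the row-to-column conversion is the plain append() loop described above.

```python
import struct
import numpy as np

# Hypothetical payload: three int4 values in network (big-endian) order,
# roughly what Postgres's binary output function for int4 produces per value
raw = struct.pack('!3i', 1, 2, -3)

# Column-at-a-time decode: '>i4' is a big-endian 4-byte signed int;
# frombuffer builds no per-value Python objects
col = np.frombuffer(raw, dtype='>i4')
print(col.tolist())  # [1, 2, -3]

# One option for NULLs without an in-band sentinel: a separate validity
# mask, similar in spirit to what a Pandas Series or Arrow array provides
nulls = np.array([False, True, False])  # True = NULL
masked = np.ma.masked_array(col, mask=nulls)

# The row-to-column conversion: appending one item at a time to anything
# that supports the equivalent of append()
rows = [(1, 'a'), (2, 'b'), (3, 'c')]
ints, texts = [], []
for i, t in rows:
    ints.append(i)
    texts.append(t)
print(ints)  # [1, 2, 3]
```

The append-based loop is the memory-allocation pattern the text refers to; any container with append() semantics can slot in for the lists here.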
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)