Re: Simple (hopefully) throughput question? - Mailing list pgsql-performance
From: Andy Colson
Subject: Re: Simple (hopefully) throughput question?
Date:
Msg-id: 4CD19E84.7070204@squeakycode.net
In response to: Simple (hopefully) throughput question? (Nick Matheson <Nick.D.Matheson@noaa.gov>)
Responses: Re: Simple (hopefully) throughput question?
           Re: Simple (hopefully) throughput question?
List: pgsql-performance
On 11/3/2010 10:52 AM, Nick Matheson wrote:
> Hello
>
> We have an application that needs to do bulk reads of ENTIRE
> Postgres tables very quickly (i.e. select * from table). We have
> observed that such sequential scans run two orders of magnitude slower
> than observed raw disk reads (5 MB/s versus 100 MB/s). Part of this is
> due to the storage overhead we have observed in Postgres. In the
> example below, it takes 1 GB to store 350 MB of nominal data. However
> that suggests we would expect to get 35 MB/s bulk read rates.
>
> Observations using iostat and top during these bulk reads suggest
> that the queries are CPU bound, not I/O bound. In fact, repeating the
> queries yields similar response times. Presumably if it were an I/O
> issue the response times would be much shorter the second time through
> with the benefit of caching.
>
> We have tried these simple queries using psql, JDBC, pl/java stored
> procedures, and libpq. In all cases the client code ran on the same
> box as the server. We have experimented with Postgres 8.1, 8.3 and 9.0.
>
> We also tried playing around with some of the server tuning parameters
> such as shared_buffers to no avail.
>
> Here is uname -a for a machine we have tested on:
>
> Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20
> 07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> A sample dataset that reproduces these results looks like the following
> (there are no indexes):
>
> Table "bulk_performance.counts"
>  Column |  Type   | Modifiers
> --------+---------+-----------
>  i1     | integer |
>  i2     | integer |
>  i3     | integer |
>  i4     | integer |
>
> There are 22 million rows in this case.
>
> We HAVE observed that summation queries run considerably faster. In
> this case,
>
> select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts
>
> runs at 35 MB/s.
>
> Our business logic does operations on the resulting data such that
> the output is several orders of magnitude smaller than the input. So
> we had hoped that by putting our business logic into stored procedures
> (and thus drastically reducing the amount of data flowing to the
> client) our throughput would go way up. This did not happen.
>
> So our questions are as follows:
>
> Is there any way, using stored procedures (maybe C code that calls
> SPI directly) or some other approach, to get close to the expected 35
> MB/s doing these bulk reads? Or is this the price we have to pay for
> using SQL instead of some NoSQL solution? (We actually tried Tokyo
> Cabinet and found it to perform quite well. However it does not measure
> up to Postgres in terms of replication, data interrogation, community
> support, acceptance, etc.)
>
> Thanks
>
> Dan Schaffer
> Paul Hamer
> Nick Matheson
>

I have no idea if this would be helpful or not, never tried it, but when
you fire off "select * from bigtable", pg will create the entire result
set in memory (and maybe swap?) and then send it all to the client in
one big lump. You might try a cursor and fetch 100-1000 rows at a time
from it. No idea if it would be faster or slower.

-Andy
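A rough sketch of the cursor approach in plain SQL, assuming a cursor
named bulk_cur and a batch size of 1000 (both are placeholders, not
tested against the dataset above):

    BEGIN;
    -- Open a cursor over the full table; rows are produced as they are fetched.
    DECLARE bulk_cur CURSOR FOR
        SELECT * FROM bulk_performance.counts;
    -- Pull the result in batches; repeat this until a FETCH returns zero rows.
    FETCH 1000 FROM bulk_cur;
    CLOSE bulk_cur;
    COMMIT;

From libpq or JDBC you would loop on the FETCH, processing each batch
before asking for the next one, so the whole result set never has to be
buffered on the client at once.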