Re: Slow count(*) again... - Mailing list pgsql-performance

From:           Reid Thompson
Subject:        Re: Slow count(*) again...
Date:
Msg-id:         4CB1D588.4080503@ateb.com
In response to: Re: Slow count(*) again... (Neil Whelchel <neil.whelchel@gmail.com>)
Responses:      Re: Slow count(*) again...
List:           pgsql-performance
On 10/10/2010 6:29 AM, Neil Whelchel wrote:
> On Saturday 09 October 2010 23:56:15 Craig Ringer wrote:
>> On 10/10/2010 11:02 AM, Neil Whelchel wrote:
>>> On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote:
>>>> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel <neil.whelchel@gmail.com> wrote:
>>>>> I know that there have been many discussions on the slowness of
>>>>> count(*) even when an index is involved, because the visibility of
>>>>> the rows has to be checked. In the past I have seen many suggestions
>>>>> about using triggers and tables to keep track of counts, and while
>>>>> this works fine in a situation where you know what the report is
>>>>> going to be ahead of time, it is simply not an option when an
>>>>> unknown WHERE clause is to be used (dynamically generated). I ran
>>>>> into a fine example of this when I was searching this mailing list:
>>>>> "Searching in 856,646 pages took 13.48202 seconds. Site search
>>>>> powered by PostgreSQL 8.3." Obviously at some point count(*) came
>>>>> into play here, because the site made a list of pages (1 2 3 4 5 6 >
>>>>> next). I very commonly make a list of pages from search results, and
>>>>> the biggest time killer here is the count(*) portion; even worse, I
>>>>> sometimes have to hit the database with two SELECT statements, one
>>>>> with OFFSET and LIMIT to get the page of results I need and another
>>>>> to get the total number of rows, so I can estimate how many pages of
>>>>> results are available. The point I am driving at here is that since
>>>>> building a list of pages of results is such a common thing to do,
>>>>> there need to be some specific high-speed ways to do this in one
>>>>> query. Maybe an estimate(*) that works like count but gives an
>>>>> answer from the index without checking visibility? I am sure that
>>>>> this would be good enough to make a page list; it is really no big
>>>>> deal if it errs on the positive side, maybe the list of pages has an
>>>>> extra page off the end. I can live with that. What I can't live with
>>>>> is taking 13 seconds to get a page of results from 850,000 rows in a
>>>>> table.
>
> Good point, maybe this is turning more into a discussion of how to
> generate a list of pages of results and one page of results with one
> query, so we don't have to do the same painfully slow query twice to do
> a very common task.
>
> On the other hand, I copied a table out of one of my production servers
> that has about 60,000 rows with 6 columns (numeric, numeric, bool, bool,
> timestamp, text). The first numeric column has numbers evenly spread
> between 0 and 100 and it is indexed. I put the table in a pair of
> database servers both running on the same physical hardware. One server
> is Postgres, the other is a popular server (I am not mentioning names
> here). On Postgres, SELECT count(*) FROM table WHERE column > 50; takes
> about 8 seconds to run. The other database server took less than one
> second (about 25 ms), as it is using the index (I assume) to come up
> with the results. It is true that this is not a fair test because both
> servers were tested with their default settings, and the defaults for
> Postgres are much more conservative; however, I don't think that any
> amount of settings tweaking will bring them even into the same ball
> park. There has been discussion about the other server returning an
> incorrect count because all of the indexed rows may not be live at the
> time. This is not a problem for the intended use; that is why I
> suggested another function like estimate(*). Its name suggests that the
> result will be close, not 100% correct, which is plenty good enough for
> generating a list of results pages in most cases. I am faced with a very
> serious problem here. If the query to make a list of pages takes say 6
> seconds, and it takes another 6 seconds to generate a page of results,
> the customer is waiting 12 seconds. This is not going to work. If count
> made a quick estimate, say less than a second, and it took 6 seconds to
> come up with the actual results, I could live with that. Or if coming up
> with the window of results via OFFSET and LIMIT also returned the total
> number of rows that would have matched the query, then I would still
> have everything I need to render the page in a reasonable time. I really
> think that this needs to be addressed somewhere. It's not like I am the
> only one that does this. You see it nearly everywhere a long list of
> results is (expected to be) returned on a web site. Among the people I
> work with, this seems to be the most-mentioned reason they give for not
> using Postgres for their projects.
>
> It would be nice to see how the server comes up with the search results
> and list of links to pages of results for this mailing list.
> (http://search.postgresql.org/search?q=slow+count%28%29&m=1&l=&d=365&s=r)
> I am guessing that it probably uses the count and query method I am
> talking about.
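
For what it's worth, on 8.4 or later the OFFSET/LIMIT query and the count
can be collapsed into a single statement with a window function. A rough
sketch -- "items", "id", and "score" are made-up names, substitute your
own schema:

    SELECT id,
           score,
           count(*) OVER () AS total_matches  -- total rows matching the WHERE clause
      FROM items
     WHERE score > 50
     ORDER BY score
     LIMIT 25 OFFSET 0;

Every row of the page carries total_matches, so one query yields both the
page of results and the number needed for the page list. Note that it is
not an estimate: the server still has to fully execute the query to
produce the count; it just saves the second round trip and second scan.
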
>> I guess there could be *some* way to expose the query planner's cost
>> estimates in a manner useful for result count estimation ... but given
>> how coarse its stats are and how wildly out the estimates can be, I
>> kind of doubt it. It's really intended for query planning decisions and
>> more interested in orders of magnitude, "0, 1, or more than that"
>> measures, etc, and seems to consider 30% here or there to be pretty
>> insignificant most of the time.
>>
>>> It's bad enough that count(*) is slow, then you have to do it all over
>>> again to get the results you need! I have not dug into this much yet,
>>> but would it be possible to return the number of rows that a WHERE
>>> clause would actually return if the LIMIT and OFFSET were not applied?
>>> I.e., when a normal query is executed, the server returns the number
>>> of rows aside from the actual row data. Would it be a big deal to
>>> modify this to allow it to return the number of rows before the LIMIT
>>> and OFFSET are applied as well?
>>
>> It'd force the server to fully execute the query. Then again, it sounds
>> like you're doing that anyway.

How big is your DB?  How fast is your disk access?  Any chance disks/RAM
can be addressed?

My disk access is pitiful... first run, 2.3 million rows: 0m35.38s;
subsequent runs: real 0m2.55s.

rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
  count
---------
 2340704
(1 row)

real    0m35.38s
user    0m0.25s
sys     0m0.03s

Subsequent runs (the count changes due to inserts):

rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
  count
---------
 2363707
(1 row)

real    0m2.70s
user    0m0.27s
sys     0m0.02s

rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
  count
---------
 2363707
(1 row)

real    0m2.55s
user    0m0.26s
sys     0m0.02s

rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
  count
---------
 2363707
(1 row)

real    0m2.50s
user    0m0.26s
sys     0m0.02s

reporting=# SELECT pg_size_pretty(pg_total_relation_size('my_production_table'));
 pg_size_pretty
----------------
 1890 MB
(1 row)
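
Regarding the estimate(*) idea: the planner's guess is already available,
just not conveniently. For a whole table, pg_class.reltuples gives an
instant (if stale) approximation:

    SELECT reltuples::bigint AS estimated_rows
      FROM pg_class
     WHERE relname = 'my_production_table';

For an arbitrary WHERE clause, a rough workaround is to EXPLAIN the query
and pull the row estimate out of the top plan node. A sketch in PL/pgSQL
-- count_estimate is a made-up name, the column in the example is a stand-in,
and the regexp assumes the default text EXPLAIN format:

    CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS $$
    DECLARE
        plan_line text;
        estimate  bigint;
    BEGIN
        -- The first line of the plan carries the top node's row estimate,
        -- e.g. "Seq Scan on foo  (cost=... rows=2363707 width=...)".
        FOR plan_line IN EXECUTE 'EXPLAIN ' || query LOOP
            estimate := substring(plan_line FROM 'rows=([0-9]+)');
            EXIT;  -- only the top node is wanted
        END LOOP;
        RETURN estimate;
    END;
    $$ LANGUAGE plpgsql;

    SELECT count_estimate('SELECT 1 FROM my_production_table WHERE some_col > 50');

The number is only as good as the statistics, but it comes back in
milliseconds no matter how big the table is.
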