Re: Performance Issues with count() - Mailing list pgsql-general
From: S Grannis
Subject: Re: Performance Issues with count()
Date:
Msg-id: 20020425185441.10228.qmail@email.com
In response to: Performance Issues with count() ("asdf asdasfa" <sjg@email.com>)
Responses: Re: Performance Issues with count()
           Re: Performance Issues with count()
           Performance HOWTO - pseudo counter example
List: pgsql-general
Thanks for all of the useful comments. Per recommendation, I set enable_seqscan=0 and re-ran the query. The time to count() yb went from 2 hours 26 minutes to 2 hours 17 minutes. That variation of 9 minutes is likely related to the load on the machine at the time. Others have suggested the "fix" is in the future.

Stephan Szabo wrote:

"I didn't see anything about your settings in postgresql.conf, but increasing the sort_mem parameter may help that really expensive sort step. I think the desired fix for this would probably be the TODO entry on hash based aggregates but that's still in the future..."

The non-default postgresql.conf settings are as follows:

shared_buffers = 240000    # uses ~2GB of shared mem
sort_mem = 512
wal_files = 64
enable_seqscan = 0         # per a recommendation
enable_indexscan = true
enable_tidscan = true
enable_sort = true
enable_nestloop = true
enable_mergejoin = true
enable_hashjoin = true

I think our work-around for now will be to SELECT the column we wish to analyze into a flat file and then run a Perl script to do the actual counting. (Sketches of a session-level sort_mem bump and of the flat-file work-around follow the quoted message below.)

Thanks again for the feedback,

Shaun Grannis

----- Original Message -----
From: Michael Loftis <mloftis@wgops.com>
Date: Wed, 24 Apr 2002 10:14:04 -0700
To: asdf asdasfa <sjg@email.com>
Subject: Re: [GENERAL] Performance Issues with count()

> Can you humour me and
> set enable_seqscan=0
> and retry the query?
>
> Thanks :)
>
> S Grannis wrote:
>
> > Hi,
> >
> > I've found some performance issues with Postgres that I'm hoping
> > people on this list can help resolve. We're working with a
> > 65-million-record table that includes year of birth (data type INT).
> > Counting the frequency of dates in the table takes 2 hours 26
> > minutes. (There's an approximately 100-year range of dates in the
> > 65 million records.)
> >
> > # EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
> >
> > NOTICE: QUERY PLAN:
> > Aggregate (cost=16397434.27..16723548.69 rows=6522288 width=4)
> >   -> Group (cost=16397434.27..16560491.48 rows=65222884 width=4)
> >        -> Sort (cost=16397434.27..16397434.27 rows=65222884 width=4)
> >             -> Seq Scan on data_table (cost=0.00..2368620.84 rows=65222884 width=4)
> >
> > I can count data from the flat text data file with this Perl script:
> >
> > #!/usr/bin/perl
> > # Script to count YB frequencies in a flat data file.
> > open (IN, "$ARGV[0]");
> > open (OUT, ">$ARGV[0]\_cnt");
> > while (<IN>) {
> >     chomp;
> >     $years{$_}++;
> > }
> > foreach $key (keys %years) {
> >     print OUT "$key,$years{$key}\n";
> > }
> >
> > The Perl script takes *1 minute* 31 seconds to run. Why is there
> > such a discrepancy in times? I've noticed that the Postgres count()
> > function takes what seems to be "longer than it should" in other
> > cases as well. For instance, counting the frequency of last names
> > in the same 65-million-record table took *1 hour* and 31 minutes:
> >
> > # EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;
> >
> > NOTICE: QUERY PLAN:
> > Aggregate (cost=19538149.27..19864263.69 rows=6522288 width=19)
> >   -> Group (cost=19538149.27..19701206.48 rows=65222884 width=19)
> >        -> Sort (cost=19538149.27..19538149.27 rows=65222884 width=19)
> >             -> Seq Scan on data_table (cost=0.00..2324610.84 rows=65222884 width=19)
> >
> > The last name (ln) and the year of birth (yb) are indexed, but that
> > shouldn't matter because it's doing a sequential scan, correct? Am I
> > running into the limitations of Postgres?
> > We'd like to eventually get this system into production, but if we
> > can't get Postgres to count() faster, we may not be able to use it.
> >
> > Here's the data_table schema:
> >
> > # \d data_table
> >
> >      Table "data_table"
> >  Column |     Type      | Modifiers
> > --------+---------------+-----------
> >  ss     | character(9)  |
> >  ln     | character(15) |
> >  fn     | character(15) |
> >  mi     | character(1)  |
> >  ns     | character(15) |
> >  lny    | character(15) |
> >  fny    | character(15) |
> >  sny    | character(15) |
> >  g      | character(1)  |
> >  mb     | integer       |
> >  db     | integer       |
> >  yb     | integer       |
> >  md     | integer       |
> >  dd     | integer       |
> >  yd     | integer       |
> > Indexes: ssdi_ss_idx
> >          ssdi_ln_idx
> >
> > We're working with Postgres v7.2. The machine is a dual-processor
> > Athlon MP 1900 (Tyan Tiger board) with 3GB of PC2100 DDR RAM and
> > three 80GB IBM 120GXP hard drives configured in a software RAID 0
> > array, running under Red Hat Linux 7.2.
> >
> > We've VACUUM ANALYZE'd the tables after creating the indices. Is
> > there something I'm missing here?
> >
> > Thanks for your suggestions.
> >
> > Shaun Grannis
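Per Stephan's note above, sort_mem can also be raised for a single session before re-running the query; the configured value of 512 is in kilobytes, so a sort of 65 million rows spills heavily to disk at that setting. A minimal sketch, with an illustrative (untuned) value:

SET sort_mem = 262144;   -- ~256MB, this session only; needs that much free RAM per sort
SELECT yb, count(1) FROM data_table GROUP BY yb;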
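And a minimal sketch of the flat-file work-around described above, assuming the database is named ssdi and the Perl counter quoted in the original message is saved as count_yb.pl (both placeholder names). Since 7.2's COPY exports whole tables only, psql handles the single-column export:

psql -t -A -c "SELECT yb FROM data_table" ssdi > yb.txt   # tuples only, unaligned
perl count_yb.pl yb.txt                                   # writes counts to yb.txt_cnt

This sidesteps the expensive sort step entirely: the Perl hash does in one pass over the data what the hash-based-aggregates TODO item would eventually do inside the server.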