Re: Performance Issues with count() - Mailing list pgsql-general
From: S Grannis
Subject: Re: Performance Issues with count()
Date:
Msg-id: 20020425185441.10228.qmail@email.com
In response to: Performance Issues with count() ("asdf asdasfa" <sjg@email.com>)
Responses: Re: Performance Issues with count()
           Re: Performance Issues with count()
           Performance HOWTO - pseudo counter example
List: pgsql-general
Thanks for all of the useful comments. Per recommendation, I set enable_seqscan=0 and re-ran the query. The time to count() yb went from 2 hours 26 minutes to 2 hours 17 minutes. That variation of 9 minutes is likely related to the load on the machine at the time. Others have suggested the "fix" is in the future.

Stephan Szabo wrote:

"I didn't see anything about your settings in postgresql.conf, but increasing the sort_mem parameter may help that really expensive sort step. I think the desired fix for this would probably be the TODO entry on hash based aggregates but that's still in the future..."

The non-default postgresql.conf settings are as follows:

shared_buffers = 240000    # uses ~2GB of shared mem
sort_mem = 512
wal_files = 64
enable_seqscan = 0         # per a recommendation
enable_indexscan = true
enable_tidscan = true
enable_sort = true
enable_nestloop = true
enable_mergejoin = true
enable_hashjoin = true

I think our work-around for now will be to SELECT the column we wish to analyze into a flat file and then run a Perl script to do the actual counting. (Sketches of a session-level sort_mem bump and of the flat-file work-around follow the quoted message below.)

Thanks again for the feedback,

Shaun Grannis

----- Original Message -----
From: Michael Loftis <mloftis@wgops.com>
Date: Wed, 24 Apr 2002 10:14:04 -0700
To: asdf asdasfa <sjg@email.com>
Subject: Re: [GENERAL] Performance Issues with count()

> Can you humour me and
> set enable_seqscan=0
> and retry the query?
>
> Thanks :)
>
> S Grannis wrote:
>
> > Hi,
> >
> > I've found some performance issues with Postgres that I'm hoping
> > people on this list can help resolve. We're working with a
> > 65-million-record table that includes year of birth (data type INT).
> > Counting the frequency of dates in the table takes 2 hours 26
> > minutes. (There's an approximately 100-year range of dates in the
> > 65 million records.)
> >
> > # EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
> >
> > NOTICE: QUERY PLAN:
> > Aggregate (cost=16397434.27..16723548.69 rows=6522288 width=4)
> >   -> Group (cost=16397434.27..16560491.48 rows=65222884 width=4)
> >        -> Sort (cost=16397434.27..16397434.27 rows=65222884 width=4)
> >             -> Seq Scan on data_table (cost=0.00..2368620.84 rows=65222884 width=4)
> >
> > I can count data from the flat text data file with this Perl script:
> >
> > #!/usr/bin/perl
> > # Script to count YB frequencies in a flat data file.
> > open (IN, "$ARGV[0]");
> > open (OUT, ">$ARGV[0]\_cnt");
> > while (<IN>) {
> >     chomp;
> >     $years{$_}++;
> > }
> > foreach $key (keys %years) {
> >     print OUT "$key,$years{$key}\n";
> > }
> >
> > The Perl script takes *1 minute* 31 seconds to run. Why is there
> > such a discrepancy in times? I've noticed that the Postgres count()
> > function takes what seems to be "longer than it should" in other
> > cases as well. For instance, counting the frequency of last names
> > in the same 65-million-record table took *1 hour* and 31 minutes:
> >
> > # EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;
> >
> > NOTICE: QUERY PLAN:
> > Aggregate (cost=19538149.27..19864263.69 rows=6522288 width=19)
> >   -> Group (cost=19538149.27..19701206.48 rows=65222884 width=19)
> >        -> Sort (cost=19538149.27..19538149.27 rows=65222884 width=19)
> >             -> Seq Scan on data_table (cost=0.00..2324610.84 rows=65222884 width=19)
> >
> > The last name (ln) and the year of birth (yb) are indexed, but that
> > shouldn't matter because it's doing a sequential scan, correct? Am I
> > running into the limitations of Postgres?
> > We'd like to eventually get this system into production, but if we
> > can't get Postgres to count() faster, we may not be able to use it.
> >
> > Here's the data_table schema:
> >
> > # \d data_table
> >
> >      Table "data_table"
> >  Column |     Type      | Modifiers
> > --------+---------------+-----------
> >  ss     | character(9)  |
> >  ln     | character(15) |
> >  fn     | character(15) |
> >  mi     | character(1)  |
> >  ns     | character(15) |
> >  lny    | character(15) |
> >  fny    | character(15) |
> >  sny    | character(15) |
> >  g      | character(1)  |
> >  mb     | integer       |
> >  db     | integer       |
> >  yb     | integer       |
> >  md     | integer       |
> >  dd     | integer       |
> >  yd     | integer       |
> > Indexes: ssdi_ss_idx
> >          ssdi_ln_idx
> >
> > We're working with Postgres v7.2. The machine is a dual-processor
> > Athlon MP 1900 (Tyan Tiger board) with 3GB of PC2100 DDR RAM and
> > three 80GB IBM 120GXP hard drives configured in a software RAID 0
> > array, running under Red Hat Linux 7.2.
> >
> > We've VACUUM ANALYZE'd the tables after creating the indices. Is
> > there something I'm missing here?
> >
> > Thanks for your suggestions.
> >
> > Shaun Grannis
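Per Stephan's note above, sort_mem can also be raised for a single session before re-running the query; the configured value of 512 is in kilobytes, so a sort of 65 million rows spills heavily to disk at that setting. A minimal sketch, with an illustrative (untuned) value:

SET sort_mem = 262144;   -- ~256MB, this session only; needs that much free RAM per sort
SELECT yb, count(1) FROM data_table GROUP BY yb;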
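And a minimal sketch of the flat-file work-around described above, assuming the database is named ssdi and the Perl counter quoted in the original message is saved as count_yb.pl (both placeholder names). Since 7.2's COPY exports whole tables only, psql handles the single-column export:

psql -t -A -c "SELECT yb FROM data_table" ssdi > yb.txt   # tuples only, unaligned
perl count_yb.pl yb.txt                                   # writes counts to yb.txt_cnt

This sidesteps the expensive sort step entirely: the Perl hash does in one pass over the data what the hash-based-aggregates TODO item would eventually do inside the server.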