Performance Issues with count() - Mailing list pgsql-general
| Subject | Performance Issues with count() |
|---|---|
| Msg-id | 20020423220654.6998.qmail@email.com |
| List | pgsql-general |
Hi,
I've found some performance issues with Postgres that I'm hoping people on this list can help resolve. We're working with a 65 million record table that includes year of birth (data type INT). Counting the frequency of dates in the table takes 2 hours 26 minutes to execute. (There's an approximately 100-year range of dates in the 65 million records.)
# EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
NOTICE: QUERY PLAN:
Aggregate (cost=16397434.27..16723548.69 rows=6522288 width=4)
  -> Group (cost=16397434.27..16560491.48 rows=65222884 width=4)
    -> Sort (cost=16397434.27..16397434.27 rows=65222884 width=4)
      -> Seq Scan on data_table (cost=0.00..2368620.84 rows=65222884 width=4)
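For what it's worth, the Sort -> Group -> Aggregate plan shown here is essentially the sort-then-count strategy that `sort | uniq -c` applies to a flat file, whereas a hash-based count (like my Perl script) counts in a single pass without ever sorting all 65 million rows. A minimal sketch with made-up sample values:

```shell
# Sort-then-count, mirroring the planner's Sort -> Group -> Aggregate:
# sort brings equal values together, then uniq -c counts each run.
printf '1950\n1951\n1950\n1950\n1951\n' | sort | uniq -c

# Hash-based count (awk here): one pass, no sort of the full input.
printf '1950\n1951\n1950\n1950\n1951\n' \
    | awk '{ n[$0]++ } END { for (y in n) print y "," n[y] }'
```

Both print the counts 3 for 1950 and 2 for 1951; the difference is that the first approach pays for a full sort first.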
I can count data from the flat text data file with this Perl script:
#!/usr/bin/perl
# Count YB frequencies in a flat data file (one value per line);
# writes "value,count" lines to <input>_cnt.
open(IN, "<", $ARGV[0]) or die "Can't open $ARGV[0]: $!";
open(OUT, ">", "$ARGV[0]_cnt") or die "Can't open $ARGV[0]_cnt: $!";
while (<IN>) {
    chomp;
    $years{$_}++;
}
foreach $key (keys %years) {
    print OUT "$key,$years{$key}\n";
}
close(IN);
close(OUT);
The Perl script takes *1 minute*, 31 seconds to run. Why is there such a discrepancy in times? I've noticed that the Postgres count() function takes what seems to be longer than it should in other cases as well. For instance, counting the frequency of last names in the same 65 million record table took *1 hour* and 31 minutes:
# EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;
NOTICE: QUERY PLAN:
Aggregate (cost=19538149.27..19864263.69 rows=6522288 width=19)
  -> Group (cost=19538149.27..19701206.48 rows=65222884 width=19)
    -> Sort (cost=19538149.27..19538149.27 rows=65222884 width=19)
      -> Seq Scan on data_table (cost=0.00..2324610.84 rows=65222884 width=19)
The last name (ln) and the year of birth (yb) are indexed, but that shouldn't matter because it's doing a sequential scan, correct? Am I running into the limitations of Postgres? We'd like to eventually get this system into production, but if we can't get Postgres to count() faster, we may not be able to use it.
Here's the data_table schema:
# \d data_table
Table "data_table"
Column | Type | Modifiers
--------+---------------+-----------
ss | character(9) |
ln | character(15) |
fn | character(15) |
mi | character(1) |
ns | character(15) |
lny | character(15) |
fny | character(15) |
sny | character(15) |
g | character(1) |
mb | integer |
db | integer |
yb | integer |
md | integer |
dd | integer |
yd | integer |
Indexes: ssdi_ss_idx
ssdi_ln_idx
We're working with Postgres v 7.2. The machine is a dual-processor Athlon MP 1900 (Tyan Tiger board) with 3GB of PC2100 DDR RAM and three 80GB IBM 120GXP hard drives configured in a software RAID 0 array, running under Red Hat Linux v. 7.2. We've VACUUM ANALYZEd the tables after creating the indices. Is there something I'm missing here?
Thanks for your suggestions.
Shaun Grannis