Home > mailing lists

Re: weird execution plan - Mailing list pgsql-performance

From	David G Johnston
Subject	Re: weird execution plan
Date	September 12, 2014 21:34:36
Msg-id	1410557668457-5818905.post@n5.nabble.com Whole thread Raw
In response to	Re: weird execution plan ("Huang, Suya" <Suya.Huang@au.experian.com>)
Responses	Re: weird execution plan
List	pgsql-performance

Tree view

Huang, Suya wrote
> Both queries have been run several times so cache would have same effect
> on both of them?  Below is the plan with buffer information.

Not everyone does so its nice to make certain - especially since I'm not all
that familiar with the code involved.  But since no one else has answered I
will theorize.

SELECT count(*) FROM ( SELECT DISTINCT col FROM tbl )

vs

SELECT count(DISTINCT col) FROM tbl

The code for "SELECT DISTINCT col" is likely highly efficient because it
works on complete sets of records.

The code for "SELECT count(DISTINCT col)" is at a relative disadvantage
since it must evaluate one row at a time and remember whether it had seen
the same value previously before deciding whether to increment a counter.

With a large number of duplicate rows the process of making the row set
smaller before counting the end result will perform better since fewer rows
must be evaluated in the less efficient count(DISTINCT) expression - the
time saved there more than offset by the fact that you are effectively
passing over that subset of the data a second time.

HashAggregate(1M rows) + Aggregate(200k rows) < Aggregate(1M rows)

David J.




--
View this message in context: http://postgresql.1045698.n5.nabble.com/weird-execution-plan-tp5818730p5818905.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

pgsql-performance by date:

From: "Huang, Suya"
Date: 12 September 2014, 03:41:43
Subject: Re: weird execution plan

From: "Huang, Suya"
Date: 15 September 2014, 00:39:58
Subject: Re: weird execution plan

Re: weird execution plan - Mailing list pgsql-performance

Previous

Next