Home > mailing lists

Re: improving GROUP BY estimation - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: improving GROUP BY estimation
Date	March 18, 2016 00:38:05
Msg-id	715c5c66-5fe1-a9aa-e28d-6c4114ba0fd6@2ndquadrant.com Whole thread Raw
In response to	Re: improving GROUP BY estimation (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses	Re: improving GROUP BY estimation
List	pgsql-hackers

Tree view

On 03/13/2016 11:09 PM, Tomas Vondra wrote:
> Hi,
>
> On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:
>> On 4 March 2016 at 13:10, Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> wrote:
>>>
...
 >>>
>>
>> I think that a better formula to use would be
>>
>> reldistinct *= (1 - powl(1 - rel-rows / rel->tuples, rel->tuples /
>> reldistinct)

Attached is a v3 of the patch using this formula instead of the original
one. Interestingly, that apparently reduces the number of regression
tests that get broken to a single one.

I'm not sure whether we need to provide a link to the PDF the formula
comes from - perhaps we should?

I've also repeated the tests for the two tables (dependent and
independent columns), comparing the actual number of groups and
different estimates, and the results look like this (v3 is the formula
used in this patch):


1) independent

                |   10 |   50 |  100 |  500 |  1000 |  5000
    ---------------------------------------------------------
         actual |  919 | 3829 | 6244 | 9944 | 10001 | 10001
        current |   10 |   50 |  102 |  516 |  1018 |  4996
       new (v1) |  973 | 4001 | 6382 | 9897 |  9951 |  9951
       new (v3) | 1117 | 3852 | 6229 | 9943 | 10004 | 10004


2) dependent

                 |  10 |   50 |  100 |  500 |  1000 |  5000
      --------------------------------------------------------
          actual |  10 |   50 |  100 |  500 |  1000 |  5000
         current |  10 |   53 |  105 |  508 |  1016 |  5014
        new (v1) | 880 | 4105 | 6472 | 9955 | 10018 | 10018
        new (v3) | 807 | 3680 | 6050 | 9916 |  9983 |  9983

I only collected numbers for the new estimator, the other numbers are
just a copy from the previous message. So there might be minor
differences due to slightly different ndistinct estimates etc.

Anyway, the numbers are obviously quite close to the formula from v1 of
the patch, plus the formula gives better estimates when scanning nearly
all rows.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

estimate-num-groups-v3.patch

pgsql-hackers by date:

From: Michael Paquier
Date: 18 March 2016, 00:35:24
Subject: Re: [GENERAL] Request - repeat value of \pset title during \watch interations

From: "David G. Johnston"
Date: 18 March 2016, 00:39:32
Subject: Re: Make primnodes.h gender neutral

Re: improving GROUP BY estimation - Mailing list pgsql-hackers

Attachment

Previous

Next