Thread: Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)
From
Jesper Krogh
Date:
Hi All.

The NULL element always surprises me in unpleasant ways; my brain simply can't really understand the logic, so please let me know if this is one of the cases where I should just spend way more effort on fixing that instead.

I have a table with a "null_frac" of 0.5, and I have tested that a where clause that evaluates to NULL isn't included in the result:

testdb=# select id from testtable where null @@ to_tsquery('testterm80');
 id
----
(0 rows)

Then I'd expect the null_frac to be taken into account when computing the estimates for the query:

testdb=# explain select id from testtable where fts @@ to_tsquery('testterm80');
                          QUERY PLAN
---------------------------------------------------------------
 Seq Scan on testtable  (cost=0.00..1985.03 rows=1966 width=4)
   Filter: (fts @@ to_tsquery('testterm80'::text))
(2 rows)

Whereas it actually does so if I explicitly add the "fts is not null" clause to the query:

testdb=# explain select id from testtable where fts @@ to_tsquery('testterm80') and fts is not null;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on testtable  (cost=130.34..1735.19 rows=983 width=4)
   Recheck Cond: ((fts @@ to_tsquery('testterm80'::text)) AND (fts IS NOT NULL))
   ->  Bitmap Index Scan on testtable_fts_idx  (cost=0.00..130.09 rows=983 width=0)
         Index Cond: ((fts @@ to_tsquery('testterm80'::text)) AND (fts IS NOT NULL))
(4 rows)

When something that evaluates to NULL isn't included in the result, shouldn't the query planner take the null_frac into account when computing the estimate? Trying to do the same thing using integers and the < operator does seem to take the null_frac into account.

The snippet below reproduces the dataset.
create table testtable (id serial primary key, document text, fts tsvector);
create index on testtable using gist(fts);

CREATE OR REPLACE FUNCTION public.filltable(rows integer)
 RETURNS boolean
 LANGUAGE plpgsql
AS $function$
DECLARE
    count integer;
BEGIN
    count := 0;
    LOOP
        EXIT WHEN count = rows;
        count := count + 1;
        insert into testtable(document, fts)
        select document, to_tsvector('english', document)
        from (select string_agg(concat, ' ') as document
              from (select concat('testterm' || generate_series(1, floor(random()*100)::integer))) as foo) as bar;
    END LOOP;
    RETURN TRUE;
END;
$function$;

select filltable(10000);

testdb=# update testtable set fts = null where id % 2 = 0;
UPDATE 5001
testdb=# ANALYZE verbose testtable;
INFO:  analyzing "public.testtable"
INFO:  "testtable": scanned 1835 of 1835 pages, containing 10002 live rows and 5001 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# select null_frac from pg_stats where attname = 'fts';
 null_frac
-----------
       0.5
(1 row)
...
Trying with integers:

testdb=# ALTER TABLE testtable add column testint integer;
ALTER TABLE
testdb=# update testtable set testint = floor(random()*100);
UPDATE 10002
testdb=# ANALYZE verbose testtable;
INFO:  analyzing "public.testtable"
INFO:  "testtable": scanned 2186 of 2186 pages, containing 10002 live rows and 10002 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# update testtable set testint = null where id % 2 = 0;
UPDATE 5001
testdb=# ANALYZE verbose testtable;
INFO:  analyzing "public.testtable"
INFO:  "testtable": scanned 2282 of 2282 pages, containing 10002 live rows and 13335 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# explain select id from testtable where testint = 50;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on testtable  (cost=0.00..2407.03 rows=64 width=4)
   Filter: (testint = 50)
(2 rows)

testdb=# explain select id from testtable where testint = 1;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on testtable  (cost=0.00..2407.03 rows=48 width=4)
   Filter: (testint = 1)
(2 rows)

testdb=# explain select id from testtable where testint < 50;
                          QUERY PLAN
---------------------------------------------------------------
 Seq Scan on testtable  (cost=0.00..2407.03 rows=2470 width=4)
   Filter: (testint < 50)
(2 rows)

(Found on 8.4 and reproduced on git-head.)

The attached patch tries to align the behaviour.

Thanks.

-- 
Jesper
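The gap between the two tsvector plans above can be expressed as back-of-envelope arithmetic (a sketch only, not PostgreSQL's actual selectivity code; the 0.1966 match fraction is inferred from the EXPLAIN output):

```python
# Back-of-envelope model of the row estimate (hypothetical, not the planner's
# actual code). Since the stored match frequency is computed over non-null
# rows only, a correct estimate must scale it by (1 - null_frac).

def estimated_rows(reltuples, match_freq, null_frac):
    """Rows matching 'fts @@ to_tsquery(...)' once NULL rows are excluded."""
    return reltuples * match_freq * (1.0 - null_frac)

# Numbers from the EXPLAIN output above: 10002 rows, null_frac = 0.5,
# and an inferred match fraction of about 0.1966.
uncorrected = 10002 * 0.1966                     # what the bare @@ clause shows
corrected = estimated_rows(10002, 0.1966, 0.5)   # matches the IS NOT NULL plan
print(round(uncorrected), round(corrected))      # 1966 983
```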
Attachment
Re: Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)
From
Tom Lane
Date:
Jesper Krogh <jesper@krogh.cc> writes:
> When something evaluates to "null" isn't included in the result,
> shouldn't the query-planner then take the null_frac into account
> when computing the estimate?

The proposed patch seems wrong to me: if we're estimating on the basis of most-common-value fractions, the null_frac is already accounted for, because it's not part of the MCV selectivity fractions.  IOW, aren't you double-counting the null fraction?

			regards, tom lane
Re: Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)
From
Jesper Krogh
Date:
On 2011-02-17 23:20, Tom Lane wrote:
> Jesper Krogh <jesper@krogh.cc> writes:
>> When something evaluates to "null" isn't included in the result,
>> shouldn't the query-planner then take the null_frac into account
>> when computing the estimate?
> The proposed patch seems wrong to me: if we're estimating on the basis
> of most-common-value fractions, the null_frac is already accounted for,
> because it's not part of the MCV selectivity fractions. IOW, aren't you
> double-counting the null fraction?

It might be the wrong place to fix, but here it seems like we're only counting MCE freqs based on non-null elements:

http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/tsearch/ts_typanalyze.c;h=2654d644579fd1959282d83919474f42540ca703;hb=HEAD#l396

And the test data confirms the behaviour.

-- 
Jesper
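To illustrate the mechanism being pointed at (a hypothetical sketch mirroring the behaviour described, not the actual ts_typanalyze.c code): if the stored lexeme frequencies are relative to non-null documents only, a consumer that multiplies them by the total row count overestimates whenever null_frac > 0:

```python
# Sketch of the mismatch (hypothetical numbers). Frequencies are divided by
# the count of non-null documents, as the linked ts_typanalyze code does.

def mce_freq(lexeme_count, nonnull_docs):
    # Frequency among non-null rows only.
    return lexeme_count / nonnull_docs

sample_rows = 10000
null_frac = 0.5
nonnull = int(sample_rows * (1 - null_frac))  # 5000 docs have a tsvector

freq = mce_freq(1000, nonnull)  # a lexeme appearing in 1000 of the 5000 docs

naive_estimate = sample_rows * freq                    # ignores NULLs
fixed_estimate = sample_rows * freq * (1 - null_frac)  # the true count
print(naive_estimate, fixed_estimate)                  # 2000.0 1000.0
```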
Re: Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)
From
Tom Lane
Date:
Jesper Krogh <jesper@krogh.cc> writes:
> On 2011-02-17 23:20, Tom Lane wrote:
>> The proposed patch seems wrong to me: if we're estimating on the basis
>> of most-common-value fractions, the null_frac is already accounted for,
>> because it's not part of the MCV selectivity fractions. IOW, aren't you
>> double-counting the null fraction?

> It might be the wrong place to fix, but here it seems like we're only
> counting MCE freqs based on non-null elements:
> http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/tsearch/ts_typanalyze.c;h=2654d644579fd1959282d83919474f42540ca703;hb=HEAD#l396

Hmm, you're right, and the specification in pg_statistic.h neglects to say that.  This does need work.

			regards, tom lane
Re: Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)
From
Tom Lane
Date:
Jesper Krogh <jesper@krogh.cc> writes:
> Attached patch tries to align the behaviour

Applied with a bit of editorialization.

			regards, tom lane