Row estimation for "var <> const" and for "NOT (...)" queries - Mailing list pgsql-hackers
From:     Nikolay Samokhvalov
Subject:  Row estimation for "var <> const" and for "NOT (...)" queries
Date:
Msg-id:   e431ff4c0804031419q6ec48531j8e8c88c8e0fddca4@mail.gmail.com
Responses: Re: Row estimation for "var <> const" and for "NOT (...)" queries
List:     pgsql-hackers
Hi,

I have a table "table1" with ~100k rows, with a "flag1" column. The value of "flag1" is NULL in 85k+ rows, TRUE in 7k+ rows, and FALSE in 6k rows.

I use EXPLAIN to get approximate row counts for simple SELECT queries, but in the case of "... WHERE NOT flag1" the optimizer is completely wrong:

-- it's OK here, the estimation is fine
test=# EXPLAIN ANALYZE SELECT * FROM table1 WHERE flag1;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on table1  (cost=0.00..9962.84 rows=7875 width=532) (actual time=0.107..134.729 rows=7652 loops=1)
   Filter: flag1
 Total runtime: 139.460 ms
(3 rows)

-- here the optimizer thinks we have 90k+ rows with "flag1 = FALSE",
-- while the real number of rows is 6k+
test=# EXPLAIN ANALYZE SELECT * FROM table1 WHERE NOT flag1;
                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on table1  (cost=0.00..9962.84 rows=91809 width=532) (actual time=0.087..110.596 rows=6243 loops=1)
   Filter: (NOT flag1)
 Total runtime: 114.414 ms
(3 rows)

I've checked the available statistics and found that Postgres actually knows (approximately) how many FALSE values are present in the table:

test=# SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
       FROM pg_stats WHERE tablename='table1' AND attname='flag1';
 null_frac | n_distinct | most_common_vals | most_common_freqs
-----------+------------+------------------+-------------------
  0.864667 |          2 | {t,f}            | {0.079,0.0563333}
(1 row)

So, I've started to think that this is a shortcoming of the optimizer code, which makes Postgres count both FALSEs and NULLs when estimating "var <> const" expressions.

1) In backend/utils/adt/selfuncs.c, in neqsel() we have:

    ...
    result = DatumGetFloat8(DirectFunctionCall4(eqsel, ...
    ...
    result = 1.0 - result;
    PG_RETURN_FLOAT8(result);
    ...

So there is a wrong assumption that for "var <> const" expressions we may simply take the estimate for "var = const" and subtract it from 1. In fact, NULLs are ignored here. According to ternary logic, we must subtract the fraction of NULLs as well (see the numeric sketch at the end of this message). This will improve row estimation for "var <> const" queries (but not when we deal with the boolean datatype; see (2)). If there are no objections, I'll send the patch, which is straightforward.

2) In the case of "WHERE flag1 = FALSE" or "WHERE flag1 <> TRUE", the planner rewrites the query to "WHERE NOT flag1" and then uses the logic defined in backend/optimizer/path/clausesel.c, where, again, we see the same wrong approach that ignores NULLs:

    ...
    else if (not_clause(clause))
    {
        /* inverse of the selectivity of the underlying clause */
        s1 = 1.0 - clause_selectivity(root,
                                      (Node *) get_notclausearg((Expr *) clause),
                                      varRelid,
                                      jointype);
    ...

I have no idea how to improve this. AFAICS, at this level we have no knowledge of the data we're dealing with (am I right?), so I'm afraid that for booleans there is no way to improve the optimizer. If my thoughts described in (1) are correct and we improve the estimation for "<>", we will have a situation where using booleans might decrease performance due to wrong row count estimation.

I'll appreciate any help and ideas that will help improve the situation.

P.S. I use the current HEAD version of Postgres; before running the queries, statistics were updated with ANALYZE.

--
Best regards,
Nikolay
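To make the arithmetic in (1) concrete, here is a minimal standalone C sketch of the two estimates. This is illustrative only, not a patch: the table size of 100,000 rows is a round-number assumption, and the other constants are the pg_stats values quoted in the message above.

    #include <stdio.h>

    int main(void)
    {
        double ntuples   = 100000.0;   /* assumed round table size */
        double null_frac = 0.864667;   /* pg_stats null_frac for flag1 */
        double freq_true = 0.079;      /* most_common_freqs entry for 't' */

        /* current neqsel()/NOT logic: selectivity = 1 - P(flag1 = true) */
        double naive = 1.0 - freq_true;

        /* null-aware logic: NULL rows satisfy neither "=" nor "<>",
         * so the NULL fraction must be subtracted as well */
        double null_aware = 1.0 - freq_true - null_frac;

        printf("naive:      %.4f -> %.0f rows\n", naive, naive * ntuples);
        printf("null-aware: %.4f -> %.0f rows\n", null_aware, null_aware * ntuples);
        return 0;
    }

Compiled and run, the naive figure (0.9210, ~92,100 rows) lands near the planner's rows=91809 estimate, while the null-aware figure (0.0563, ~5,633 rows) lands near the actual rows=6243.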