Re: Unexpected query plan results - Mailing list pgsql-performance
From | Anne Rosset |
---|---|
Subject | Re: Unexpected query plan results |
Date | |
Msg-id | 4A254239.7070306@collab.net Whole thread Raw |
In response to | Re: Unexpected query plan results (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Unexpected query plan results
|
List | pgsql-performance |
Robert Haas wrote: >On Mon, Jun 1, 2009 at 4:53 PM, Anne Rosset <arosset@collab.net> wrote: > > >>>On Mon, Jun 1, 2009 at 2:14 PM, Anne Rosset <arosset@collab.net> wrote: >>> >>> >>>>SELECT SUM(1) FROM item WHERE is_deleted = 'f'; sum --------- 1824592 (1 >>>>row) >>>>SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641 >>>></sf/sfmain/do/go/tracker3641?returnUrlKey=1243878161701>'; sum -------- >>>>122412 (1 row) >>>>SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641 >>>></sf/sfmain/do/go/tracker3641?returnUrlKey=1243878161701>' AND is_deleted >>>>= >>>>'f'; sum ----- 71 (1 row) >>>>SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641 >>>></sf/sfmain/do/go/tracker3641?returnUrlKey=1243878161701>' AND is_deleted >>>>= >>>>'t'; sum -------- 122341 (1 row) >>>> >>>> >>The item table has 2324829 rows >> >> > >So 1824592/2324829 = 78.4% of the rows have is_deleted = false, and >0.06709% of the rows have the relevant folder_id. Therefore the >planner assumes that there will be 2324829 * 78.4% * 0.06709% =~ >96,000 rows that satisfy both criteria (the original explain had >97,000; there's some variability due to the fact that the analyze only >samples a random subset of pages), but the real number is 71, leading >it to make a very bad decision. This is a classic "hidden >correlation" problem, where two columns are correlated but the planner >doesn't notice, and you get a terrible plan. > >Unfortunately, I'm not aware of any real good solution to this >problem. The two obvious approaches are multi-column statistics and >planner hints; PostgreSQL supports neither. There are various >possible hacks that aren't very satisfying, such as: > >1. Redesign the application to put the deleted records in a separate >table from the non-deleted records. But if the deleted records still >have child records in other tables, this won't fly due to foreign key >problems. > >2. Inserting a clause that the optimizer doesn't understand to fool it >into thinking that the scan on the item table is much more selective >than is exactly the case. I think adding (item.id + 0) = (item.id + >0) to the WHERE clause will work; the planner will brilliantly >estimate the selectivity of that expression as one in 200. The >problem with this is that it will likely lead to a better plan in this >particular case, but for other folder_ids it may make things worse. >There's also no guarantee that a future version of PostgreSQL won't be >smart enough to see through this type of sophistry, though I think >you're safe as far as the forthcoming 8.4 release is concerned. > >3. A hack that makes me gag, but it actually seems to work... > >CREATE OR REPLACE FUNCTION item_squash(varchar, boolean) RETURNS varchar[] AS $$ >SELECT array[$1, CASE WHEN $2 THEN 'true' ELSE 'false' END] >$$ LANGUAGE sql IMMUTABLE; > >CREATE INDEX item_squash_idx ON item (item_squash(folder_id, is_deleted)); > >...and then remove "folder_id = XXX AND is_deleted = YYY" from your >query and substitute "item_squash(folder_id, is_deleted) = >item_squash(XXX, YYY)". The expresson index forces the planner to >gather statistics on the distribution of values for that expression, >and if you then write a query using that exact same expression the >planner can take advantage of it. > >...Robert > > Thanks a lot Robert. Not sure how we will tackle this but at least now we have an explanation. From what I read, results won't improved in 8.4. Is that correct? Thanks, Anne
pgsql-performance by date: