Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: [HACKERS] PATCH: multivariate histograms and MCV lists |
Date | |
Msg-id | 11223173-bd90-4fa8-8cf1-49928b615405@2ndquadrant.com Whole thread Raw |
In response to | Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
|
List | pgsql-hackers |
On 1/8/19 3:18 PM, Dean Rasheed wrote: > On Mon, 7 Jan 2019 at 00:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> FWIW the main unsolved issue (at least on the MCV part) is how it >> decides which items to keep in the list. >> >> As explained in [1], in the multivariate case we can't simply look at >> the group frequency and compare it to the average frequency (of the >> non-MCV items), which is what analyze_mcv_list() does in the >> single-column case. In the multivariate case we also case about observed >> vs. base frequency, i.e. we want the MCV list to include groups that are >> present singificantly more/less than product of per-column stats. >> >> I've repeatedly tried to come up with a criteria that would address >> that, but it seems rather difficult because we can't abandon the other >> criteria either. So the MCV list should include groups that match both >> >> (a) items that are statistically more common than the non-MCV part (i.e. >> the rule from per-column analyze_mcv_list) >> >> (b) items that are statistically more/less common than estimated from >> per-column stats (i.e. the new rule) > > Thinking about this some more, I think that it probably isn't > appropriate to use analyze_mcv_list() directly because that's making > specific assumptions about how items not in the list will be estimated > that aren't actually true for groups of values in multivariate stats. > If a group of values isn't in the MCV list, it gets estimated based on > the product of the selectivities from the per-column stats (modulo the > additional logic preventing the selectivity not exceeding the total > non-MCV selectivity). > > So actually, the estimate for a group of values will be either the MCV > item's frequency (if the MCV item is kept), or (roughly) the MCV > item's base_frequency (if the MCV item is not kept). That suggests > that we should simply keep items that are significantly more or less > common than the item's base frequency -- i.e., keep rule (b) and ditch > rule (a). > Hmmm, but won't that interfere with how we with how we extrapolate the MCV estimate to the non-MCV part? Currently the patch does what you proposed, i.e. other_sel = simple_sel - mcv_basesel; I'm worried that if we only include the items that are significantly more or less common than the base frequency, it may skew the other_sel estimate. >> Enforcing rule (a) seems reasonable because it ensures the MCV list >> includes all items more frequent than the last one. Without it, it's >> difficult to decide know whether the absent item is very common (but >> close to base frequency) or very uncommon (so less frequent than the >> last MCV item). > > I'm not sure there's much we can do about that. Keeping the item will > result in keeping a frequency that we know is close to the base > frequency, and not keeping the item will result in per-column stats > being used that we expect to also give an estimate close to the base > frequency. So either way, the result is about the same, and it's > probably better to discard it, leaving more room for other items about > which we may have more information. > > That said, there is a separate benefit to keeping items in the list > even if their frequency is close to the base frequency -- the more > items kept, the larger their total selectivity will be, giving a > better cap on the non-MCV selectivities. So if, after keeping all > items satisfying rule (b), there are free slots available, perhaps > they should be used for the most common remaining values satisfying > rule (a). > Hmm, so essentially we'd use (b) first to bootstrap the MCV list, and then we could do what analyze_mcv_list() does. That could work, I guess. The question is how to define "significantly different from base freq" though. Any ideas? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: