PATCH: AM-specific statistics, with an example implementation for BRIN (WIP) - Mailing list pgsql-hackers
From:      Tomas Vondra
Subject:   PATCH: AM-specific statistics, with an example implementation for BRIN (WIP)
Msg-id:    7cfb21ed-ddd3-821b-a852-75fe42c64d49@enterprisedb.com
Responses: Re: PATCH: AM-specific statistics, with an example implementation for BRIN (WIP)
List:      pgsql-hackers
Hi,

A couple of days ago I posted a WIP patch [1] implementing "BRIN Sort", i.e. a node producing sorted output for BRIN minmax indexes. One of the challenges I mentioned in that thread is costing - and that's actually not specific to that patch, it's an issue affecting BRIN in general, not just the proposed node, because block-range indexes are very different from regular indexes with explicit tuple pointers. I mentioned I have some ideas how to improve this, and that I'd start a separate thread to discuss it. So here we go ...

The traditional estimation problem is roughly this: given a condition, how many rows will match it? That is, given a table with X rows, we need to estimate how many rows will match a WHERE condition, for example. Once we have the row estimate, we can estimate the amount of I/O, the cost of sorting, etc.

We have built a fairly solid capability to calculate these estimates, using per-column statistics, extended statistics, ... The calculated estimates are not always perfect, but in general this works well, and it affects all path types roughly equally - yes, some paths are more sensitive to poor estimates (e.g. runtime may grow exponentially with increasing rowcount).

BRIN indexes however add another layer to this - once we have estimated the number of rows, we need to estimate the number of page ranges this maps to. You may estimate the WHERE condition to match 1000 rows, but then you need to decide whether that's 1 page range, 1000 page ranges, or possibly even all page ranges of the table. It all depends on how "correlated" the data is with the physical position in the table. If the data is perfectly correlated, it may be enough to scan a single page range. If it's random, you may need to scan everything.

The existing costing uses the column correlation statistics, but sadly that's rather insensitive to outlier values. If you have a sequential table, and then set 1% of the data to the min/max values (making the ranges very wide), the correlation will remain very close to 1.0, but you'll have to scan all the ranges (and the costing won't reflect that).

The "BRIN sort" patch needs to estimate a different thing - given a page range, how many other page ranges overlap with it? This is roughly the amount of stuff we'll need to scan and sort in order to produce the first row.

These are all things we currently can't estimate - we have some rough heuristics, but it's pretty easy to confuse those. Therefore, I propose to calculate a couple of new statistics for BRIN indexes (assume minmax indexes, unless mentioned otherwise):


1) average number of overlapping ranges
---------------------------------------

Given a range, how many other ranges does it overlap with? In a perfectly sequential table this will be 0, so if you have a value you know it'll match just one range. In a random table, it'll be pretty close to the number of page ranges.

This can be calculated by simply walking the ranges, sorted by minval (see brin_minmax_count_overlaps).


2) average number of matching ranges for a value
------------------------------------------------

Given a value, how many ranges does it match? This can be calculated by matching sampled rows to ranges (brin_minmax_match_tuples_to_ranges).

For minmax indexes this is somewhat complementary to the average number of overlaps; the relationship is roughly this:

    avg(# of matching ranges) = 1 + avg(# of overlapping ranges) / 2

The intuition is that if a range is randomly overlapped by other ranges, a value falling into it is likely to hit about half of them.
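To pin down what exactly the two averages mean, here's a rough SQL sketch (not what the patch does - the patch computes this in C during ANALYZE), assuming the per-range summaries were extracted into a hypothetical table brin_ranges(blkno, minval, maxval) and the sampled rows into a hypothetical table sample(id, v):

-- hypothetical: brin_ranges(blkno int, minval int, maxval int),
-- one row per summarized page range

-- (1) average number of *other* ranges each range overlaps with
select avg(cnt) as avg_overlaps
  from (select r1.blkno, count(r2.blkno) as cnt
          from brin_ranges r1
          left join brin_ranges r2
            on r1.blkno <> r2.blkno
           and r1.minval <= r2.maxval
           and r2.minval <= r1.maxval
         group by r1.blkno) o;

-- (2) average number of ranges matched by a sampled value
-- (hypothetical: sample(id int, v int), one row per sampled tuple)
select avg(cnt) as avg_matches
  from (select s.id, count(*) as cnt
          from sample s
          join brin_ranges r
            on s.v between r.minval and r.maxval
         group by s.id) m;

The quadratic self-join is of course only there to illustrate the definition; the patch gets the same numbers much more cheaply by walking the ranges sorted by minval (brin_minmax_count_overlaps) and by matching the sampled rows to the ranges (brin_minmax_match_tuples_to_ranges).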
The reason why we want to calculate both (1) and (2) is that for other opclasses the relationship is not that simple. For bloom opclasses we probably can't calculate overlaps at all (or at least not that easily), so the average number of matches is all we have. For minmax-multi, the overlaps will probably use only the min/max values, ignoring the "gaps", while the matches should take the gaps into account.


3) a bunch of other simple statistics
-------------------------------------

These are the numbers of summarized / not-summarized ranges, and of all_nulls and has_nulls ranges, which is useful for estimating IS NULL conditions etc.


The attached patch implements a PoC of this. There's a new GUC (enable_indexam_stats) that can be used to enable/disable this (both the ANALYZE and the costing part). By default it's "off", so make sure to do

SET enable_indexam_stats = true;

The statistics are stored in the pg_statistic catalog, in a new staindexam column (a bytea). Opclasses can implement a new support procedure, similarly to what we do for opclass options. There are a couple of wrinkles (which should be explained in XXX comments), but in principle this works.

The brin_minmax_stats procedure implements this for minmax opclasses, calculating the stuff mentioned above. I've been experimenting with different ways to calculate some of it, and ANALYZE prints info about the calculated values and timings (this can be disabled by removing the STATS_CROSS_CHECK define).

Finally, brincostestimate() loads the statistics and uses them for costing. At the moment it uses only the average number of overlaps.

Trivial example:

create table t (a int) with (fillfactor = 10);

insert into t
select (case when mod(i,22) = 0 then 100000000
             when mod(i,22) = 1 then 0
             else i end)
  from generate_series(1,300000) s(i);

create index on t using brin (a) with (pages_per_range = 1);

The table fits 22 rows per page, and the data is mostly sequential, except that every page has both 0 and 100000000. The correlation however remains fairly high:

# select correlation from pg_stats where tablename = 't';
 correlation
-------------
   0.8303595
(1 row)

Now, let's do a simple query:

# explain (analyze, buffers, timing off) select * from t where a = 500;

                               QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=154.00..254.92 rows=2 width=4) (actual rows=1 loops=1)
   Recheck Cond: (a = 500)
   Rows Removed by Index Recheck: 299999
   Heap Blocks: lossy=13637
   Buffers: shared hit=13695
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..154.00 rows=26 width=0) (actual rows=136370 loops=1)
         Index Cond: (a = 500)
         Buffers: shared hit=58
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.173 ms
 Execution Time: 101.972 ms
(12 rows)

That's pretty poor, because brincostestimate() still thinks it'll be enough to read one or two page ranges (because 1/0.8 = ~1.2).
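As a sanity check independent of the patch: with pages_per_range = 1 each heap block is its own range, so statistic (2) can be approximated straight from the heap by rebuilding the per-page min/max from the rows' ctid block numbers. This is just an illustration, not something the patch does:

-- not part of the patch: how many 1-page ranges does the value 500 fall
-- into, judging by the per-page min/max rebuilt from the heap?
select count(*) as ranges_matching_500
  from (select (ctid::text::point)[0]::int as blkno,
               min(a) as minval,
               max(a) as maxval
          from t
         group by 1) r
 where 500 between r.minval and r.maxval;

On this data set the count should come out at essentially all ~13637 ranges, because every full page contains both 0 and 100000000 - which is exactly what the lossy bitmap heap scan above ends up visiting.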
Now, with the extra statistics:

SET enable_indexam_stats = true;
ANALYZE t;

                               QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=157.41..17544.41 rows=2 width=4) (actual rows=1 loops=1)
   Recheck Cond: (a = 500)
   Rows Removed by Index Recheck: 299999
   Heap Blocks: lossy=13637
   Buffers: shared hit=13695
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..157.41 rows=300000 width=0) (actual rows=136370 loops=1)
         Index Cond: (a = 500)
         Buffers: shared hit=58
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.230 ms
 Execution Time: 104.603 ms
(12 rows)

So in this case we realize we actually have to scan the whole table, all ~13637 ranges, and the cost reflects that.

Feel free to experiment with other data sets.

regards

[1] https://www.postgresql.org/message-id/e70fa091-e338-1598-9de4-6d0ef6b693e2%40enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company