estimate correlation of index separately from table (Re: [PERFORM] index fragmentation on insert-only table with non-unique column) - Mailing list pgsql-performance

From:           Justin Pryzby
Subject:        estimate correlation of index separately from table (Re: [PERFORM] index fragmentation on insert-only table with non-unique column)
Date:
Msg-id:         20170707234119.GN17566@telsasoft.com
In response to: Re: index fragmentation on insert-only table with non-unique column (Justin Pryzby <pryzby@telsasoft.com>)
Responses:      Re: estimate correlation of index separately from table (Re: [PERFORM] index fragmentation on insert-only table with non-unique column)
List:           pgsql-performance
Months ago I reported an issue with very slow index scans of tables with high "correlation" of their indexed column, due to (we concluded at the time) duplicate/repeated values of that column causing many lseek()s. References:

https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520D6610.8040907@emulex.com
https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou1kn@4ax.com

My interim workarounds were a reindex job and daily-granularity partitions (despite having an excessive number of child tables), to encourage seq scans for daily analysis jobs rather than index scans. I've now cobbled together a patch to throw around and see if we can improve on that. I tried several changes hoping to discourage index scans; the logic that seems to most accurately reflect costs has two changes:

Postgres' existing behavior stores table correlation (heap value vs. position) but doesn't look at index correlation, so it can't distinguish between a just-built index and a highly fragmented one, or one with highly non-sequential TIDs. My patch causes ANALYZE to do a TID scan (sampled across the MCVs) to determine the correlation of heap TID vs. index tuple logical location (as opposed to the table correlation, computed as heap TID vs. heap value).

The second change averages separate correlation values of small lists of 300 consecutive TIDs, rather than the coarse-granularity/aggregate correlation of a small sample of pages across the entire index. Postgres' existing sampling is designed to give an even sample across all rows. An issue with this coarse-granularity correlation is that the index can have a broad correlation (to physical heap location) but with many small-scale deviations which don't show up, due to sampling a relatively small fraction of a large table; and/or the effect of the deviations is insignificant/noise and the correlation is still computed as near 1.
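To make that 300-TID windowing concrete, here is a toy SQL sketch (not from the patch; the corr_demo table and the ±300 scatter model are purely illustrative): a synthetic "heap position" with a strong global trend but local scatter scores near 1 on a single global correlation, yet much lower when correlations are computed over windows of 300 consecutive tuples and averaged.

| CREATE TEMP TABLE corr_demo AS
|   SELECT i AS logical_pos,                          -- position in index order
|          i + (random()*600 - 300)::int AS heap_pos  -- strong trend, ±300 local scatter
|   FROM generate_series(1,100000) i;
| SELECT corr(logical_pos, heap_pos) AS global_corr,  -- comes out near 1 (~0.9999)
|        avg(c) AS avg_windowed_corr                  -- comes out much lower (~0.45 here)
| FROM (SELECT corr(logical_pos, heap_pos) AS c
|       FROM corr_demo
|       GROUP BY (logical_pos-1)/300) s;              -- windows of 300 consecutive tuples

The global figure is roughly analogous to what the block sample computes today; the windowed average is closer to what the patch's 300-TID lists measure.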
I believe the "large-scale" correlation that postgres computes from the block sample fails to represent well the small-scale uncorrelated reads, which contribute a large number of random reads but are not included in the planner cost. Not only are the index reads highly random (which the planner already assumes), but the CTIDs referenced within a given btree leaf page are also substantially non-sequential.

It seems to affect INSERTs which, over a short interval of time, have a small-to-moderate range of column values, but which still have a strong overall trend in that column WRT time (and heap location). Think of inserting a "normal" distribution of timestamps centered around now(). My original report shows lseek() for nearly every read on the *heap*.

The original problem was on a table of telecommunications CDRs, indexed by "record opening time" (start time) and with a high correlation value in pg_stats. We import data records from a file, which is probably more or less in order of "end time". That still displays broad correlation on "start time", but individual TIDs are nowhere near sequential. Two phone calls which end in the same one-second interval are not unlikely to have started many minutes apart... but two calls which end within the same second are very likely to have started within an hour of each other, since the typical duration is <1h.

But insert volume is high, and there are substantial repeated keys, so the portion of an index scan returning CTIDs for some certain key value (a timestamp with second resolution in our case) ends up reading heap tuples for a non-negligible fraction of the table: maybe only 1%, but that's 300 seeks across 1% of a table which is tens of GB... which is still 300 seeks, and takes long enough that cache is inadequate to substantially mitigate the cost.

It's not clear to me that there's a good way to evenly *sample* a fraction of the index blocks in a manner which is agnostic to different AMs, and scanning the entirety of all indices on relations during (auto)analyze may be too expensive. So I implemented an index scan of the MCV list. I'm guessing this might cause the correlation to be under-estimated, and prefer bitmap scans somewhat more than justified, due to the btree insertion logic for repeated keys (intended to reduce O(n^2) behavior). The MCV list isn't perfect, since the problem can also occur with e.g. normally distributed floating point values (or timestamps with fractional seconds), which have no repeated keys at all.

I ran pageinspect on a recent child of the table that triggered the original report:

ts=# SELECT itemoffset, ctid FROM bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx',6) LIMIT 22 OFFSET 1;
 itemoffset |  ctid
------------+--------
          2 | (81,4)
          3 | (5,6)
          4 | (6,5)
          5 | (9,1)
          6 | (10,1)
          7 | (14,1)
          8 | (21,8)
          9 | (25,1)
         10 | (30,5)
         11 | (41,1)
         12 | (48,7)
         13 | (49,3)
         14 | (50,2)
         15 | (51,1)
         16 | (54,1)
         17 | (57,7)
         18 | (58,6)
         19 | (59,6)
         20 | (63,5)
         21 | (71,6)
         22 | (72,5)
         23 | (78,7)
(22 rows)

=> 22 adjacent index tuples referencing 22 different heap pages, only 6 of which are sequential => 16 lseek()s, i.e. random page reads.

To generate data demonstrating this problem you can do things like:

| CREATE TABLE t(i int, j int);
| CREATE INDEX ON t(i);
| \! time for a in `seq 99`; do psql -qc 'INSERT INTO t SELECT * FROM generate_series(1,99)'; done

-- or:
| INSERT INTO t SELECT (0.001*a+9*(random()-0.5))::int FROM generate_series(1,99999) a;

-- or this one, perhaps closest to our problem case:
| INSERT INTO t SELECT a FROM generate_series(1,999) a, generate_series(1,999) b ORDER BY a+b/9.9;

Note, I was able to create a case using floats without repeated keys:
| INSERT INTO w SELECT i/99999.0+pow(2,(-random())) FROM generate_series(1,999999) i ORDER BY i

| ANALYZE t;

-- note: the correlation is even higher for larger tables with the same number
-- of repeated keys, which is bad since the cost should go up linearly with the
-- size and associated number of lseek()s. That's one component of the problem,
-- and I think a necessary component of any solution.

postgres=# SELECT tablename, attname, correlation, array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
 tablename | attname | correlation | array_length
-----------+---------+-------------+--------------
 t         | i       |    0.995212 |           87
 t         | j       |             |
 t_i_idx   | i       |   -0.892874 |           87
                        ^^^^^^^^^

... this last row is added by my patch. That's a bad combination: the high table correlation means the planner thinks only a fraction of the heap will be read, and sequentially, so it isn't discouraged from doing an index scan. But the index TIDs are much less well correlated (0.89 vs 0.99).

Note: the negative correlation at the tuple level seems to be related to this comment:

src/backend/access/nbtree/nbtinsert.c- * Once we have chosen the page to put the key on, we'll insert it before
src/backend/access/nbtree/nbtinsert.c: * any existing equal keys because of the way _bt_binsrch() works.
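As an aside, one can quantify the non-sequentiality shown in the bt_page_items output above with a single illustrative query (assuming the pageinspect extension; the tid-to-block conversion via text/point is just a convenience hack):

| SELECT corr(itemoffset::float8, (ctid::text::point)[0])  -- index order vs. heap block number
| FROM bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx', 6)
| WHERE itemoffset > 1;  -- skip the page high key, as with OFFSET 1 above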
Note also:

| postgres=# SELECT leaf_fragmentation FROM pgstatindex('t_i_idx');
| leaf_fragmentation | 67.63

.. but keep in mind that leaf_fragmentation only checks leaf page order, and not CTID order within index pages. The planner already assumes index pages are random reads. Maybe it's a good indicator (?), but it doesn't lend itself to an accurate cost estimate.

For testing purposes, with:

| shared_buffers | 128kB
| public | t | table | pryzbyj | 35 MB |

| SET enable_bitmapscan=off;
| SET enable_seqscan=off;
| SELECT pg_backend_pid();
| SELECT sum(j) FROM t WHERE i<99;
| Time: 3974.205 ms

% sudo strace -p 530 2>/tmp/strace-pg
% sed -r '/^(read|lseek)/!d; s/ .*//' /tmp/strace-pg |sort |uniq -c |sort -nr |head
  39474 lseek(41,
    359 lseek(44,
      8 lseek(18,

=> 40k seeks on a 35MB table; that's 10 seeks per heap page!

open("base/12411/16634", O_RDWR) = 41
open("base/12411/16636", O_RDWR) = 44

postgres=# SELECT relfilenode, relname FROM pg_class WHERE relfilenode>16630;
 16634 | t
 16636 | t_i_idx

2017-07-07 17:45:54.075 CDT [6360] WARNING: HERE 1222: csquared=0.797225 minIO/R-P-C=109.750000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702

With my patch, the index scan is estimated to take ~4400 seeks, rather than the actual 40k, probably because max_IO_cost assumes that a given heap page will be visited only once, whereas in the worst case each of its tuples would require a separate lseek()... I'm not suggesting to change that, since making max_IO 100x higher would probably change query plans more dramatically than desired. But also note that, unpatched, with table correlation >0.99, postgres would have under-estimated min_IO_cost not by a factor of 10x but by 400x.

| postgres=# REINDEX INDEX t_i_idx;
| postgres=# ANALYZE t;
| postgres=# SELECT tablename, attname, correlation, array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
|  tablename | attname | correlation | array_length
| -----------+---------+-------------+--------------
|  t         | i       |    0.995235 |           67
|  t_i_idx   | i       |           1 |           67

=> Correctly distinguishes the freshly reindexed table.

% sudo strace -p 6428 2>/tmp/strace-pg8
% sed -r '/^(read|lseek)/!d; s/ .*//' /tmp/strace-pg8 |sort |uniq -c |sort -nr
     99 lseek(37,

2017-07-07 17:49:47.745 CDT [6428] WARNING: HERE 1222: csquared=1.000000 minIO/R-P-C=108.500000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702

=> csquared=1, so min_IO cost is used instead of something close to max_IO. (This patch doesn't change the computation of those values at all; it only changes the correlation, whose squared value is used to interpolate between the correlated/min cost and the uncorrelated/max cost.) The correlated estimate of 108 seeks vs. the 99 actual is essentially what unpatched postgres would have computed by using the table correlation of 0.9952, implicitly assuming the index to be similarly correlated.
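For reference, the interpolation in question, from cost_index() in costsize.c, is essentially:

| run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

Plugging in the logged values as a sanity check (0.892874^2 ≈ 0.797225, so the logged csquared is indeed the patched index correlation, squared; units appear to be random page reads, per the "/R-P-C" in the log tag):

| SELECT 4416.0 + 0.797225   * (109.75 - 4416.0) AS patched_est,    -- ~983
|        4416.0 + 0.995212^2 * (109.75 - 4416.0) AS unpatched_est;  -- ~151

So the patched estimate lands well away from min_IO, while the table correlation of 0.995212 would pull the unpatched estimate down to ~151 seeks, vs. the ~40k observed.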
I hope that demonstrates the need to distinguish index correlation from table correlation. I'm hoping for comments on the existing patch, specifically whether there's a better way to sample the index than "MCVs or partial index scan". I've left some fragments in place from an earlier implementation involving btree page-level fragmentation (like pgstatindex).

Also: does it make sense to keep an MCV/histogram for a non-expression index which duplicates a table column? The stats lookup in selfuncs.c btcostestimate() would then have to check for correlation from the index, and the rest of the stats from its table.

A bitmap layer adds overhead, which should be avoided if not needed. But it shouldn't be a huge impact, and I think its relative effect is only high when returning a small number of rows. I'm thinking of a few cases:

- unique / low-cardinality index scans, or scans without duplicate keys: these should have a low index_pages_fetched(), so max_IO_cost should not be very different from min_IO_cost, and the new index-based correlation shouldn't have an effect much different from the table correlation;

- unclustered/uncorrelated tables: tables whose heap has low correlation are already discouraged from index scans; this includes tables whose column is UPDATEd and not just INSERTed;

- tables with correlated heap AND index: csquared should still be ~0.99 and not change much;

- correlated table with uncorrelated index: this is the target case, with the intended behavior change.

My apologies: I think that's a bit of a brain dump.

Justin