Stats collector's idx_blks_hit value is highly misleading in practice - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Stats collector's idx_blks_hit value is highly misleading in practice |
Date | |
Msg-id | CAH2-WzmdZqxCS1widYzjDAM+Z-Jz=ejJoaWXDVw9Qy1UsK0tLA@mail.gmail.com Whole thread Raw |
Responses |
Re: Stats collector's idx_blks_hit value is highly misleading in practice
|
List | pgsql-hackers |
It occurs to mean that statistics collector stats such as pg_statio_*_tables.idx_blks_hit are highly misleading in practice because they fail to take account of the difference between internal pages and leaf pages in B-Tree indexes. These two types of pages are in fundamentally different categories, and I think that failing to recognize that at the level of these system views makes them much less useful. Somebody should probably write a patch that makes this difference clear from the system views. Possibly by using some generalized notion of "record" pages instead of leaf pages, and "metadata" pages instead of internal pages. That would even work with hash indexes, I think. Consider the following example, which is based on a standard nbtree index, but could work in almost the same way with other index access methods: We have a pgbench_accounts pkey after initialization by pgbench at scale 1500. It has 409,837 leaf pages and 1,451 internal pages, meaning that about one third of one percent of all pages in the index are internal pages. Occasionally, with indexes on large text strings we might notice that as many as 1% of all index pages are internal pages, but that's very much on the high side. Generally speaking, we're virtually guaranteed to have *all* internal pages in shared_buffers once a steady state has been reached. Once the cache warms up, point lookups (like the queries pgbench performs) will only have to access one leaf page at most, which amounts to only one I/O at most. (This asymmetry is the main reason why B-Trees are generally very effective when buffered in a buffer cache.) If we run the pgbench queries against this database/example index we'll find that we have to access 4 index pages per query execution -- the root, two additional internal pages, plus a leaf page. Based on the reasonable assumptions I'm making, 3 out of 4 of those pages will be hits when steady state is reached with pgbench's SELECT-only workload, regardless of how large shared_buffers is or how bloated the index is (we only need 1451 buffers for that, and those are bound to get hot quickly). The overall effect is idx_blks_hit changes over time in a way that makes no sense -- even to an expert. Let's say we start with this entire 3213 MB pgbench index in shared_buffers. We should only get increments in idx_blks_hit, never increments in idx_blks_read - that much makes sense. If we then iteratively shrink shared_buffers (or equivalently, make the index grow without adding a new level), the proportion of page accesses that increment idx_blks_read (rather than incrementing idx_blks_hit) goes up roughly linearly as misses increase linearly - which also makes sense. But here is the silly part: we cannot really have a hit rate of less than 75% if you compare idx_blks_hit to idx_blks_read, unless and until we can barely even fit 1% of the index in memory (at which point it's hard to distinguish from noise). So if we naively consume the current view we'll see a hit rate that starts at 100%, and very slowly shrinks to 75%, which is where we bottom out (more or less, roughly speaking). This behavior seems pretty hard to defend to me. If somebody fixed this by putting internal pages into their own bucket in the system view, then motivated users would quickly learn that internal page stats aren't really useful -- they are only included for completeness. They're such a small contributor to the overall hit rate that they can simply be ignored completely. The thing that users ought to focus on is leaf page hit rate. Now index hit rate (by which I mean leaf page hit rate) actually makes sense. Note that Heroku promoted simple heuristics like this for many years. I suppose that a change like this could end up affecting other things, such as EXPLAIN ANALYZE statistics. OTOH we only break out index pages separately for bitmap scans at the moment, so maybe it could be fairly well targeted. And, maybe this is unappealing given the current statistics collector limitations. I'm not volunteering to work on it right now, but it would be nice to fix this. Please don't wait for me to do it. -- Peter Geoghegan
pgsql-hackers by date: