On Wed, Oct 8, 2025 at 4:20 PM Tomas Vondra <tomas@vondra.me> wrote:
> What I imagined is more like a process that regularly walks shared
> buffers, counts buffers per relation (or relfilenode), stores the
> aggregated info into some shared memory (so that standby can have its
> own concept of cache contents). And then invalidates plans the same way
> ANALYZE does.
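
(For reference, the per-relation aggregation part is roughly what the
pg_buffercache extension computes today. A sketch of the query shape,
assuming that extension is installed:

    -- approximate buffer counts per relation in the current database
    SELECT c.relname, count(*) AS buffers
      FROM pg_buffercache b
      JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
     WHERE b.reldatabase IN
           (0, (SELECT oid FROM pg_database
                 WHERE datname = current_database()))
     GROUP BY c.relname
     ORDER BY buffers DESC;

so the counting side is well-trodden ground.)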
I'm not saying we couldn't do this, but I have doubts about how well
it would really work. I have a feeling, for example, that it might
sometimes cause the planner to go to great lengths to avoid bringing a
small table fully into memory, based on the knowledge that the table
is not cached. But the right thing to do could easily be to take the
hit of doing an expensive sequential scan on first access, and then
after that the table stays cached. I think it could very possibly be
the case that such a strategy is faster even for the first execution
and also more stable. In my experience, if you run pgbench -S on a
system with a cold cache, it takes just about forever to warm the
cache via random access; a single pg_prewarm can avoid a very long
period of slowness.
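
(For anyone who hasn't used it, the pg_prewarm in question is a
one-liner; the table name below is just illustrative:

    CREATE EXTENSION IF NOT EXISTS pg_prewarm;
    -- read the whole relation into shared buffers in one sequential pass
    SELECT pg_prewarm('pgbench_accounts');

versus waiting for random single-block reads to do the same job.)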
I think Andres's idea of trying to account for things like backward
scans being slower, or prefetching, or AIO, is probably a better
direction. I don't know how well we can take those things into account
or whether it will fully get us where we want to be, but we can model
those events in ways that are stable: the costs are just always
different than they are today, without relying on any new kinds of
stats. I
think that's probably good.
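
Arguably we already have precedent for that in effective_cache_size:
the planner folds an assumption about caching into index costing, but
because it's a constant, the plans it produces are stable. Something
like (values purely illustrative):

    -- stable cost inputs the planner consults today; no per-relation
    -- cache statistics involved
    SET effective_cache_size = '16GB';
    SET random_page_cost = 1.1;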
In my experience, we get complaints all the time about plans changing
when people didn't want them to, but we very rarely get complaints
about plans failing to change when people wanted them to. So I think
trying not to depend on things that can be unstable is probably a good
direction.
--
Robert Haas
EDB: http://www.enterprisedb.com