Re: shared memory stats: high level design decisions: consistency, dropping - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: shared memory stats: high level design decisions: consistency, dropping
Msg-id: 20210321223445.lwvg66t6elymazoz@alap3.anarazel.de
In response to: Re: shared memory stats: high level design decisions: consistency, dropping (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: shared memory stats: high level design decisions: consistency, dropping
           Re: shared memory stats: high level design decisions: consistency, dropping
List: pgsql-hackers
Hi,

On 2021-03-21 12:14:35 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > 1) What kind of consistency do we want from the pg_stats_* views?
>
> That's a hard choice to make. But let me set the record straight:
> when we did the initial implementation, the stats snapshotting behavior
> was considered a FEATURE, not an "efficiency hack required by the old
> storage model".

Oh - sorry for misstating that then. I did try to look for the origins of
the approach, and all that I found was that it'd be too expensive to do
multiple stats file reads.

> If I understand what you are proposing, all stats views would become
> completely volatile, without even within-query consistency. That really
> is not gonna work. As an example, you could get not-even-self-consistent
> results from a join to a stats view if the planner decides to implement
> it as a nestloop with the view on the inside.

I don't really think that's a problem worth incurring this much cost to
prevent. We already have that behaviour for a number of the pg_stat_*
views, e.g. pg_stat_xact_all_tables, pg_stat_replication.

If the cost were low - or if we could find a reasonable way to get the
cost low - I think it'd be worth preserving for backward compatibility's
sake alone.

From an application perspective, I actually rarely want that behaviour
for stats views - I'm querying them to get the most recent information,
not an older snapshot. And in the cases I do want snapshots, I'd want
them for longer than a transaction.

There's just a huge difference between being able to access a table's
stats in O(1) time and having a single stats access be
O(database-objects). And that includes accesses to things like
pg_stat_bgwriter and pg_stat_database (for IO-over-time stats etc.) that
often are monitored at a fairly high frequency - they too pay the price
of reading in all object stats. On my example database with 1M tables it
takes 0.4s to read pg_stat_database.

We currently also fetch the full stats in places like autovacuum.c, where
we don't need repeated access to be consistent - we even explicitly force
the stats to be re-read for every single table that's getting vacuumed.
Even if we were to just cache already-accessed stats, places like
do_autovacuum() would end up with a completely unnecessary cache of all
tables, blowing up memory usage by a large amount on systems with lots of
relations.

> I also believe that the snapshotting behavior has advantages in terms
> of being able to perform multiple successive queries and get consistent
> results from them. Only the most trivial sorts of analysis don't need
> that.

In most cases you'd not do that within a single transaction though, and
then you'd need to create temporary tables with a snapshot of the stats
anyway.

> In short, what you are proposing sounds absolutely disastrous for
> usability of the stats views, and I for one will not sign off on it
> being acceptable.

:( That's why I thought it'd be important to bring this up to a wider
audience. This has been discussed several times in the thread, and nobody
really chimed in wanting the "snapshot" behaviour...
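To spell out what that snapshot behaviour means today: within a
transaction, stats views keep returning the values from the first access
until the snapshot is explicitly discarded. A minimal illustration, using
only existing functions and a placeholder table name 'foo':

  BEGIN;
  -- first access reads *all* stats and caches them for the transaction
  SELECT n_tup_ins FROM pg_stat_user_tables WHERE relname = 'foo';
  -- ... concurrent inserts into foo happen meanwhile ...
  -- same value as above, served from the transaction's stats snapshot
  SELECT n_tup_ins FROM pg_stat_user_tables WHERE relname = 'foo';
  -- discard the snapshot; the next access re-reads current stats
  SELECT pg_stat_clear_snapshot();
  SELECT n_tup_ins FROM pg_stat_user_tables WHERE relname = 'foo';
  COMMIT;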
> I do think we could relax the consistency guarantees a little bit,
> perhaps along the lines of only caching view rows that have already
> been read, rather than grabbing everything up front. But we can't
> just toss the snapshot concept out the window. It'd be like deciding
> that nobody needs MVCC, or even any sort of repeatable read.

I think that'd still be a huge win - caching only what's been accessed
rather than everything would save a lot of memory in very common cases. I
did bring it up as one approach for that reason.

I do think it has a few usability quirks though. The time skew between
stats objects accessed at different times seems like it could be quite
confusing: e.g. imagine looking at table stats and then later joining to
index stats, and seeing table and index stats not matching up at all.

I wonder if a reasonable way out could be to have pg_stat_make_snapshot()
(accompanying the existing pg_stat_clear_snapshot()) do the full eager
data load, but not use any snapshot / caching behaviour without it. It'd
be a fair bit of code to have both, but I think I can see a way to have
it not be too bad. Roughly, usage could look like the sketch below.
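To illustrate (pg_stat_make_snapshot() here is just the proposal above
and does not exist; pg_stat_clear_snapshot() exists today):

  BEGIN;
  -- proposed function: eagerly read and pin a consistent snapshot of
  -- all stats; without calling it, reads would be cheap and volatile
  SELECT pg_stat_make_snapshot();
  SELECT relname, n_tup_ins FROM pg_stat_user_tables;
  SELECT indexrelname, idx_scan FROM pg_stat_user_indexes;
  -- both views above are served from the same consistent snapshot
  SELECT pg_stat_clear_snapshot();  -- back to cheap, uncached reads
  COMMIT;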
Greetings,

Andres Freund