Re: shared-memory based stats collector - Mailing list pgsql-hackers

From:		Andres Freund
Subject:	Re: shared-memory based stats collector
Date:		2020-03-13 03:13:24
Msg-id:		20200313031324.4al3nnwxnzl2j4sb@alap3.anarazel.de
In response to:	Re: shared-memory based stats collector (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses:	Re: shared-memory based stats collector
		Re: shared-memory based stats collector
List:		pgsql-hackers
Hi,

Thomas, could you look at the first two patches here, and my review
questions?

General comments about this series:

- A lot of the review comments feel like I've written them before, a year
  or more ago. I feel this patch ought to be in a much better state.
  There's a lot of IMO fairly obvious stuff here, and things that have
  been mentioned multiple times previously.

- There's a *lot* of typos in here. I realize being an ESL speaker is
  hard, but a lot of these can be found with the simplest spellchecker.
  That's one thing for a patch that just has been hacked up as a POC, but
  this is a multi year thread?

- There's some odd formatting. Consider using pgindent more regularly.

More detailed comments below.

I'm considering rewriting the parts of the patchset that I don't like -
but it'll look quite different afterwards.

On 2020-01-22 17:24:04 +0900, Kyotaro Horiguchi wrote:
> From 5f7946522dc189429008e830af33ff2db435dd42 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Fri, 29 Jun 2018 16:41:04 +0900
> Subject: [PATCH 1/5] sequential scan for dshash
>
> Add sequential scan feature to dshash.

> 	dsa_pointer item_pointer = hash_table->buckets[i];
> @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
> 									LW_EXCLUSIVE));
>
> 	delete_item(hash_table, item);
> -	hash_table->find_locked = false;
> -	hash_table->find_exclusively_locked = false;
> -	LWLockRelease(PARTITION_LOCK(hash_table, partition));
> +
> +	/* We need to keep partition lock while sequential scan */
> +	if (!hash_table->seqscan_running)
> +	{
> +		hash_table->find_locked = false;
> +		hash_table->find_exclusively_locked = false;
> +		LWLockRelease(PARTITION_LOCK(hash_table, partition));
> +	}
> }

This seems like a failure prone API.

> /*
> @@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
> 	Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
> 								hash_table->find_exclusively_locked
> 								? LW_EXCLUSIVE : LW_SHARED));
> +	/* lock is under control of sequential scan */
> +	Assert(!hash_table->seqscan_running);
>
> 	hash_table->find_locked = false;
> 	hash_table->find_exclusively_locked = false;
> @@ -592,6 +610,164 @@ dshash_memhash(const void *v, size_t size, void *arg)
> 	return tag_hash(v, size);
> }
>
> +/*
> + * dshash_seq_init/_next/_term
> + *           Sequentially scan trhough dshash table and return all the
> + *           elements one by one, return NULL when no more.

s/trhough/through/

This uses a different comment style than the other functions in this
file. Why?

> + * dshash_seq_term should be called for incomplete scans and otherwise
> + * shoudln't. Finished scans are cleaned up automatically.

s/shoudln't/shouldn't/

I find the "cleaned up automatically" API terrible. I know you copied it
from dynahash, but I find it to be really failure prone. dynahash isn't an
example of good postgres code, the opposite, I'd say. It's a lot easier to
unconditionally have a terminate call if we need that.

> + * Returned elements are locked as is the case with dshash_find.  However, the
> + * caller must not release the lock.
> + *
> + * Same as dynanash, the caller may delete returned elements midst of a scan.

I think it's a bad idea to refer to dynahash here. That's just going to get
out of date. Also, code should be documented on its own.
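To make the earlier point concrete: with an unconditional terminate call, a
caller would look something like this (just a sketch, using the function
names from the patch; entry_is_interesting() is a made-up predicate):

	dshash_seq_status status;
	void	   *entry;

	dshash_seq_init(&status, hash_table, false, false);
	while ((entry = dshash_seq_next(&status)) != NULL)
	{
		if (entry_is_interesting(entry))
			break;				/* stopping early is fine */
	}
	dshash_seq_term(&status);	/* always required, no special cases */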
> + * If consistent is set for dshash_seq_init, the all hash table partitions are
> + * locked in the requested mode (as determined by the exclusive flag) during
> + * the scan.  Otherwise partitions are locked in one-at-a-time way during the
> + * scan.

Yet delete unconditionally retains locks?

> + */
> +void
> +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
> +				bool consistent, bool exclusive)
> +{

Why does this patch add the consistent mode? There's no users currently?
Without it, it's not clear that we need a separate _term function, I think?

I think we also can get rid of the dshash_delete changes, by instead adding
a dshash_delete_current(dshash_seq_stat *status, void *entry) API or such.

> @@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
>  extern void dshash_detach(dshash_table *hash_table);
>  extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
>  extern void dshash_destroy(dshash_table *hash_table);
> -
>  /* Finding, creating, deleting entries. */
>  extern void *dshash_find(dshash_table *hash_table,
>  						 const void *key, bool exclusive);

There's a number of spurious changes like this.

> From 60da67814fe40fd2a0c1870b15dcf6fcb21c989a Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Thu, 27 Sep 2018 11:15:19 +0900
> Subject: [PATCH 2/5] Add conditional lock feature to dshash
>
> Dshash currently waits for lock unconditionally. This commit adds new
> interfaces for dshash_find and dshash_find_or_insert. The new
> interfaces have an extra parameter "nowait" taht commands not to wait
> for lock.

s/taht/that/

There should be at least a sentence or two explaining why these are
useful.

> +/*
> + * The version of dshash_find, which is allowed to return immediately on lock
> + * failure. Lock status is set to *lock_failed in that case.
> + */

Hm. Not sure I like the *lock_acquired API.

> +void *
> +dshash_find_extended(dshash_table *hash_table, const void *key,
> +					 bool exclusive, bool nowait, bool *lock_acquired)
>  {
>  	dshash_hash hash;
>  	size_t		partition;
>  	dshash_table_item *item;
>
> +	/*
> +	 * No need to return lock resut when !nowait. Otherwise the caller may
> +	 * omit the lock result when NULL is returned.
> +	 */
> +	Assert(nowait || !lock_acquired);
> +
>  	hash = hash_key(hash_table, key);
>  	partition = PARTITION_FOR_HASH(hash);
>
>  	Assert(hash_table->control->magic == DSHASH_MAGIC);
>  	Assert(!hash_table->find_locked);
>
> -	LWLockAcquire(PARTITION_LOCK(hash_table, partition),
> -				  exclusive ? LW_EXCLUSIVE : LW_SHARED);
> +	if (nowait)
> +	{
> +		if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
> +									  exclusive ? LW_EXCLUSIVE : LW_SHARED))
> +		{
> +			if (lock_acquired)
> +				*lock_acquired = false;

Why is the test for lock_acquired needed here? I don't think it's possible
to use nowait correctly without passing in lock_acquired?

Think it'd make sense to document & assert that nowait = true implies
lock_acquired set, and nowait = false implies lock_acquired not being set.

But, uh, why do we even need the lock_acquired parameter? If we couldn't
find an entry, then we should just release the lock, no?

I'm however inclined to think it's better to just have a separate function
for the nowait case, rather than an extended version supporting both (with
an internal helper doing most of the work).

> +/*
> + * The version of dshash_find_or_insert, which is allowed to return immediately
> + * on lock failure.
> + *
> + * Notes above dshash_find_extended() regarding locking and error handling
> + * equally apply here.

They don't, there's no lock_acquired parameter.

> + */
> +void *
> +dshash_find_or_insert_extended(dshash_table *hash_table,
> +							   const void *key,
> +							   bool *found,
> +							   bool nowait)

I think it's absurd to have dshash_find, dshash_find_extended,
dshash_find_or_insert, dshash_find_or_insert_extended. If they're extended
they should also be able to specify whether the entry will get created.
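For example, a single extended variant could subsume all of these - a
sketch, not a worked-out signature:

	extern void *dshash_find_extended(dshash_table *hash_table,
									  const void *key,
									  bool exclusive, bool nowait,
									  bool insert, bool *found);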
> From d10c1117cec77a474dbb2cff001086d828b79624 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Wed, 7 Nov 2018 16:53:49 +0900
> Subject: [PATCH 3/5] Make archiver process an auxiliary process
>
> This is a preliminary patch for shared-memory based stats collector.
> Archiver process must be a auxiliary process since it uses shared
> memory after stats data wes moved onto shared-memory. Make the process
> an auxiliary process in order to make it work.

s/wes/was/
s/onto/into/

> @@ -451,6 +454,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
>  			StartupProcessMain();
>  			proc_exit(1);		/* should never return */
>
> +		case ArchiverProcess:
> +			/* don't set signals, archiver has its own agenda */
> +			PgArchiverMain();
> +			proc_exit(1);		/* should never return */
> +
>  		case BgWriterProcess:
>  			/* don't set signals, bgwriter has its own agenda */
>  			BackgroundWriterMain();

I think I'd rather remove the two comments that are copied to 6 out of 8
cases - they don't add anything.

> /* ------------------------------------------------------------
>  * Local functions called by archiver follow
>  * ------------------------------------------------------------
> @@ -219,8 +148,8 @@ pgarch_forkexec(void)
>  * The argc/argv parameters are valid only in EXEC_BACKEND case. However,
>  * since we don't use 'em, it hardly matters...
>  */
> -NON_EXEC_STATIC void
> -PgArchiverMain(int argc, char *argv[])
> +void
> +PgArchiverMain(void)
> {
> 	/*
> 	 * Ignore all signals usually bound to some action in the postmaster,
> @@ -252,8 +181,27 @@ PgArchiverMain(int argc, char *argv[])
>  static void
>  pgarch_exit(SIGNAL_ARGS)
>  {
> -	/* SIGQUIT means curl up and die ... */
> -	exit(1);
> +	PG_SETMASK(&BlockSig);
> +
> +	/*
> +	 * We DO NOT want to run proc_exit() callbacks -- we're here because
> +	 * shared memory may be corrupted, so we don't want to try to clean up our
> +	 * transaction. Just nail the windows shut and get out of town. Now that
> +	 * there's an atexit callback to prevent third-party code from breaking
> +	 * things by calling exit() directly, we have to reset the callbacks
> +	 * explicitly to make this work as intended.
> +	 */
> +	on_exit_reset();
> +
> +	/*
> +	 * Note we do exit(2) not exit(0). This is to force the postmaster into a
> +	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
> +	 * process. This is necessary precisely because we don't clean up our
> +	 * shared memory state. (The "dead man switch" mechanism in pmsignal.c
> +	 * should ensure the postmaster sees this as a crash, too, but no harm in
> +	 * being doubly sure.)
> +	 */
> +	exit(2);
> }

This seems to be a copy of code & comments from other signal handlers that
predates

commit 8e19a82640d3fa2350db146ec72916856dd02f0a
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   2018-08-08 19:08:10 +0300

    Don't run atexit callbacks in quickdie signal handlers.

I think this just should use SignalHandlerForCrashExit().

I think we can even commit that separately - there's not really a reason to
not do that today, as far as I can tell?
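I.e. the whole pgarch_exit() handler could presumably go away in favor of
something like this (a sketch, assuming the archiver has no local state
that needs special handling; the declaration lives in
postmaster/interrupt.h as for the other aux processes):

	#include "postmaster/interrupt.h"

	/* in PgArchiverMain(), instead of installing pgarch_exit: */
	pqsignal(SIGQUIT, SignalHandlerForCrashExit);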
> /* SIGUSR1 signal handler for archiver process */

Hm - this currently doesn't set up a correct sigusr1 handler for a shared
memory backend - needs to invoke procsignal_sigusr1_handler somewhere.

We can probably just convert to using normal latches here, and remove the
current 'wakened' logic? That'll remove the indirection via postmaster too,
which is nice.

> @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
>
>  	switch (backendType)
>  	{
> +		case B_ARCHIVER:
> +			backendDesc = "archiver";
> +			break;

should imo include 'WAL' or such.

> From 5079583c447c3172aa0b4f8c0f0a46f6e1512812 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Thu, 21 Feb 2019 12:44:56 +0900
> Subject: [PATCH 4/5] Shared-memory based stats collector
>
> Previously activity statistics is shared via files on disk. Every
> backend sends the numbers to the stats collector process via a socket.
> It makes snapshots as a set of files on disk with a certain interval
> then every backend reads them as necessary. It worked fine for
> comparatively small set of statistics but the set is under the
> pressure to growing up and the file size has reached the order of
> megabytes. To deal with larger statistics set, this patch let backends
> directly share the statistics via shared memory.

This spends a fair bit of space describing the old state, but very little
describing the new state.

> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 0bfd6151c4..a6b0bdec12 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -53,7 +53,6 @@ postgres 15554  0.0  0.0  57536  1184 ?  Ss  18:02  0:00 postgres: back
>  postgres 15555  0.0  0.0  57536   916 ?  Ss  18:02  0:00 postgres: checkpointer
>  postgres 15556  0.0  0.0  57536   916 ?  Ss  18:02  0:00 postgres: walwriter
>  postgres 15557  0.0  0.0  58504  2244 ?  Ss  18:02  0:00 postgres: autovacuum launcher
> -postgres 15558  0.0  0.0  17512  1068 ?  Ss  18:02  0:00 postgres: stats collector
>  postgres 15582  0.0  0.0  58772  3080 ?  Ss  18:04  0:00 postgres: joe runbug 127.0.0.1 idle
>  postgres 15606  0.0  0.0  58772  3052 ?  Ss  18:07  0:00 postgres: tgl regression [local] SELECT waiting
>  postgres 15610  0.0  0.0  58772  3056 ?  Ss  18:07  0:00 postgres: tgl regression [local] idle in transaction
> @@ -65,9 +64,8 @@ postgres 15610  0.0  0.0  58772  3056 ?  Ss  18:07  0:00 postgres: tgl
>     master server process.  The command arguments
>     shown for it are the same ones used when it was launched.  The next five
>     processes are background worker processes automatically launched by the
> -   master process.  (The <quote>stats collector</quote> process will not be present
> -   if you have set the system not to start the statistics collector; likewise
> -   the <quote>autovacuum launcher</quote> process can be disabled.)
> +   master process.  (The <quote>autovacuum launcher</quote> process will not
> +   be present if you have set the system not to start it.)
>     Each of the remaining
>     processes is a server process handling one client connection.  Each such
>     process sets its command line display in the form

There's more references to the stats collector than this... E.g. in
catalogs.sgml

   <xref linkend="view-table"/> lists the system views described here.
   More detailed documentation of each view follows below.
   There are some additional views that provide access to the results of
   the statistics collector; they are described in <xref
   linkend="monitoring-stats-views-table"/>.
  </para>

> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 6d1f28c327..8dcb0fb7f7 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -1956,15 +1956,15 @@ do_autovacuum(void)
>  								  ALLOCSET_DEFAULT_SIZES);
>  	MemoryContextSwitchTo(AutovacMemCxt);
>
> +	/* Start a transaction so our commands have one to play into. */
> +	StartTransactionCommand();
> +
>  	/*
>  	 * may be NULL if we couldn't find an entry (only happens if we are
>  	 * forcing a vacuum for anti-wrap purposes).
>  	 */
>  	dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
>
> -	/* Start a transaction so our commands have one to play into. */
> -	StartTransactionCommand();
> -
>  	/*
>  	 * Clean up any dead statistics collector entries for this DB. We always
>  	 * want to do this exactly once per DB-processing cycle, even if we find
> @@ -2747,12 +2747,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
>  	if (isshared)
>  	{
>  		if (PointerIsValid(shared))
> -			tabentry = hash_search(shared->tables, &relid,
> -								   HASH_FIND, NULL);
> +			tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
>  	}
>  	else if (PointerIsValid(dbentry))
> -		tabentry = hash_search(dbentry->tables, &relid,
> -							   HASH_FIND, NULL);
> +		tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
>
>  	return tabentry;
>  }

Why is pgstat_fetch_stat_tabentry_extended called "_extended"? Outside the
stats subsystem there's exactly one caller for the non extended version, as
far as I can see. That's index_concurrently_swap() - and imo that's code
that should live in the stats subsystem, rather than open coded in index.c.
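Sketch of what that could look like (pgstat_copy_index_counters() is a
hypothetical name, not something in the patch):

	/* in pgstat: copy accumulated counters from one index to another */
	extern void pgstat_copy_index_counters(Oid fromrelid, Oid torelid);

	/* index_concurrently_swap() then shrinks to */
	pgstat_copy_index_counters(oldIndexId, newIndexId);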
> diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> index ca5c6376e5..1ffe073a1f 100644
> --- a/src/backend/postmaster/pgstat.c
> +++ b/src/backend/postmaster/pgstat.c
> @@ -1,15 +1,23 @@
>  /* ----------
>   * pgstat.c
>   *
> - *	All the statistics collector stuff hacked up in one big, ugly file.
> + *	Statistics collector facility.
>   *
> - *	TODO:	- Separate collector, postmaster and backend stuff
> - *			  into different files.
> + *	Collects per-table and per-function usage statistics of all backends on
> + *	shared memory. pg_count_*() and friends are the interface to locally
> + *	store backend activities during a transaction. Then pgstat_flush_stat()
> + *	is called at the end of a transaction to pulish the local stats on
> + *	shared memory.
>   *

I'd rather not exhaustively list the different objects this handles - it'll
either be annoying to maintain, or just get out of date.

> - *			- Add some automatic call for pgstat vacuuming.
> + *	To avoid congestion on the shared memory, we update shared stats no more
> + *	often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case
> + *	where all the local numbers cannot be flushed immediately, we postpone
> + *	updates and try the next chance after the interval of
> + *	PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
> + *	PGSTAT_STAT_MAX_INTERVAL(1000ms).

I'm not convinced by this backoff logic. The basic interval seems quite
high for something going through shared memory, and the max retry seems
pretty low.

> +/*
> + * Operation mode and return code of pgstat_get_db_entry.
> + */
> +#define	PGSTAT_SHARED		0

This is unreferenced.

> +#define	PGSTAT_EXCLUSIVE	1
> +#define	PGSTAT_NOWAIT		2

And these should imo rather be parameters.
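I.e. something along these lines (sketch only, ignoring the status return
for the moment), instead of or-ing flags into an int:

	static PgStat_StatDBEntry *
	pgstat_get_db_entry(Oid databaseid, bool exclusive, bool nowait);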
> +typedef enum PgStat_TableLookupResult
> +{
> +	NOT_FOUND,
> +	FOUND,
> +	LOCK_FAILED
> +} PgStat_TableLookupResult;

This seems like a seriously bad idea to me. These are very generic names.
There's also basically no references except setting them to the first two?

> +#define StatsLock (&StatsShmem->StatsMainLock)
>
> -static time_t last_pgstat_start_time;
> +/* Shared stats bootstrap information */
> +typedef struct StatsShmemStruct
> +{
> +	LWLock		StatsMainLock;		/* lock to protect this struct */
> +	dsa_handle	stats_dsa_handle;	/* DSA handle for stats data */
> +	dshash_table_handle db_hash_handle;
> +	dsa_pointer	global_stats;
> +	dsa_pointer	archiver_stats;
> +	int			refcount;
> +} StatsShmemStruct;

Why isn't this an lwlock in lwlocknames.h, rather than allocated here?

> +/*
> + * BgWriter global statistics counters. The name cntains a remnant from the
> + * time when the stats collector was a dedicate process, which used sockets
> + * to send it.
> + */
> +PgStat_MsgBgWriter BgWriterStats = {0};

I am strongly against keeping the 'Msg' prefix. That seems extremely
confusing going forward.

> +/* common header of snapshot entry in reader snapshot hash */
> +typedef struct PgStat_snapshot
> +{
> +	Oid		key;
> +	bool	negative;
> +	void   *body;		/* end of header part: to keep alignment */
> +} PgStat_snapshot;

> +/* context struct for snapshot_statentry */
> +typedef struct pgstat_snapshot_param
> +{
> +	char	   *hash_name;			/* name of the snapshot hash */
> +	int			hash_entsize;		/* element size of hash entry */
> +	dshash_table_handle dsh_handle;	/* dsh handle to attach */
> +	const dshash_parameters *dsh_params;/* dshash params */
> +	HTAB	  **hash;				/* points to variable to hold hash */
> +	dshash_table **dshash;			/* ditto for dshash */
> +} pgstat_snapshot_param;

Why does this exist? The struct contents are actually constant across
calls, yet you have declared them inside functions (as static - static on
function scope isn't really the same as global static).

If we want it, I think we should separate the naming more meaningfully. The
important difference between 'hash' and 'dshash' isn't the hashing module,
it's that one is a local copy, the other a shared hashtable!

> +/*
> + * Backends store various database-wide info that's waiting to be flushed out
> + * to shared memory in these variables.
> + *
> + * checksum_failures is the exception in that it is cluster-wide value.
> + */
> +typedef struct BackendDBStats
> +{
> +	int		n_conflict_tablespace;
> +	int		n_conflict_lock;
> +	int		n_conflict_snapshot;
> +	int		n_conflict_bufferpin;
> +	int		n_conflict_startup_deadlock;
> +	int		n_deadlocks;
> +	size_t	n_tmpfiles;
> +	size_t	tmpfilesize;
> +	HTAB   *checksum_failures;
> +} BackendDBStats;

Why is this a separate struct from PgStat_StatDBEntry? We shouldn't have
these fields in multiple places.

> +	if (StatsShmem->refcount > 0)
> +		StatsShmem->refcount++;

What prevents us from leaking the refcount here? We could e.g. error out
while attaching, no? Which'd mean we'd leak the refcount.

To me it looks like there's a lot of added complexity just because you want
to be able to reset stats via

	void
	pgstat_reset_all(void)
	{
		/*
		 * We could directly remove files and recreate the shared memory
		 * area. But detach then attach for simplicity.
		 */
		pgstat_detach_shared_stats(false);	/* Don't write */
		pgstat_attach_shared_stats();

Without that you'd not need the complexity of attaching, detaching to the
same degree - every backend could just cache lookup data during
initialization, instead of having to constantly re-compute that.

Nor would the dynamic re-creation of the db dshash table be needed.

> +/* ----------
> + * pgstat_report_stat() -
> + *
> + *	Must be called by processes that performs DML: tcop/postgres.c, logical
> + *	receiver processes, SPI worker, etc. to apply the so far collected
> + *	per-table and function usage statistics to the shared statistics hashes.
> + *
> + *	Updates are applied not more frequent than the interval of
> + *	PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
> + *	failure if force is false and there's no pending updates longer than
> + *	PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
> + *	succeeding calls of this function.
> + *
> + *	Returns the time until the next timing when updates are applied in
> + *	milliseconds if there are no updates holded for more than
> + *	PGSTAT_STAT_MIN_INTERVAL milliseconds.
> + *
> + *	Note that this is called only out of a transaction, so it is fine to use
> + *	transaction stop time as an approximation of current time.
> + *	----------
> + */

Inconsistent indentation.

> +long
> +pgstat_report_stat(bool force)
>  {

> +	/* Flush out table stats */
> +	if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> +		pending_stats = true;
> +
> +	/* Flush out function stats */
> +	if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> +		pending_stats = true;

This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate on
pgStatTabList/pgStatFunctions, but don't actually reset it? Besides being
confusing while reading the code, it also made the diff much harder to
read.

> -		snprintf(fname, sizeof(fname), "%s/%s", directory,
> -				 entry->d_name);
> -		unlink(fname);
> +	/* Flush out database-wide stats */
> +	if (HAVE_PENDING_DBSTATS())
> +	{
> +		if (!pgstat_flush_dbstats(&cxt, !force))
> +			pending_stats = true;
> 	}

Linearly checking a number of stats doesn't seem like the right way going
forward. Also seems fairly omission prone.

Why does this check live in pgstat_report_stat(), rather than
pgstat_flush_dbstats()?

> /*
>  * snapshot_statentry() - Common routine for functions
>  *							 pgstat_fetch_stat_*entry()
>  *

Why has this function been added between the closely linked
pgstat_report_stat() and pgstat_flush_stat() etc?

>  * Returns the pointer to a snapshot of a shared entry for the key or NULL if
>  * not found. Returned snapshots are stable during the current transaction or
>  * until pgstat_clear_snapshot() is called.
>  *
>  * The snapshots are stored in a hash, pointer to which is stored in the
>  * *HTAB variable pointed by cxt->hash. If not created yet, it is created
>  * using hash_name, hash_entsize in cxt.
>  *
>  * cxt->dshash points to dshash_table for dbstat entries. If not yet
>  * attached, it is attached using cxt->dsh_handle.

Why do we still have this? A hashtable lookup is cheap, compared to
fetching a file - so it's not to save time. Given how infrequent the
pgstat_fetch_* calls are, it's not to avoid contention either.

At first one could think it's for consistency - but no, that's not it
either, because snapshot_statentry() refetches the snapshot without control
from the outside:

> 	/*
> 	 * We don't want so frequent update of stats snapshot. Keep it at least
> 	 * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
> 	 */
> 	if (clear_snapshot)
> 	{
> 		clear_snapshot = false;
>
> 		if (pgStatSnapshotContext &&
> 			snapshot_globalStats.stats_timestamp <
> 			GetCurrentStatementStartTimestamp() -
> 			PGSTAT_STAT_MIN_INTERVAL * 1000)
> 		{
> 			MemoryContextReset(pgStatSnapshotContext);
>
> 			/* Reset variables */
> 			global_snapshot_is_valid = false;
> 			pgStatSnapshotContext = NULL;
> 			pgStatLocalHash = NULL;
>
> 			pgstat_setup_memcxt();
> 		}
> 	}

I think we should just remove this entire local caching snapshot layer for
lookups.
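A lookup could then be about as simple as this sketch
(fetch_tabentry_copy() is a hypothetical helper, and tabhash stands in for
however we get at the right shared table hash):

	static PgStat_StatTabEntry *
	fetch_tabentry_copy(dshash_table *tabhash, Oid relid)
	{
		static PgStat_StatTabEntry copy;
		PgStat_StatTabEntry *shent;

		shent = dshash_find(tabhash, &relid, false);	/* shared lock */
		if (shent == NULL)
			return NULL;
		copy = *shent;			/* copy out while holding the lock */
		dshash_release_lock(tabhash, shent);
		return &copy;
	}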
> /*
>  * pgstat_flush_stat: Flushes table stats out to shared statistics.
>  *

Why is this named pgstat_flush_stat, rather than pgstat_flush_tabstats or
such? Given that the code for dealing with an individual table's entry is
named pgstat_flush_tabstat() that's very confusing.

>  * If nowait is true, returns false if required lock was not acquired
>  * immediately. In that case, unapplied table stats updates are left alone in
>  * TabStatusArray to wait for the next chance. cxt holds some dshash related
>  * values that we want to carry around while updating shared stats.
>  *
>  * Returns true if all stats info are flushed. Caller must detach dshashes
>  * stored in cxt after use.
>  */
> static bool
> pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
> {
> 	static const PgStat_TableCounts all_zeroes;
> 	TabStatusArray *tsa;
> 	HTAB	   *new_tsa_hash = NULL;
> 	TabStatusArray *dest_tsa = pgStatTabList;
> 	int			dest_elem = 0;
> 	int			i;
>
> 	/* nothing to do, just return */
> 	if (pgStatTabHash == NULL)
> 		return true;
>
> 	/*
> 	 * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
> 	 * entries it points to.
> 	 */
> 	hash_destroy(pgStatTabHash);
> 	pgStatTabHash = NULL;
>
> 	/*
> 	 * Scan through the TabStatusArray struct(s) to find tables that actually
> 	 * have counts, and try flushing it out to shared stats. We may fail on
> 	 * some entries in the array. Leaving the entries being packed at the
> 	 * beginning of the array.
> 	 */
> 	for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
> 	{

It seems odd that there's tabstat specific code in pgstat_flush_stat (also
note the singular while it's processing all stats, whereas you're below
treating pgstat_flush_tabstat as only affecting one table).

> 		for (i = 0; i < tsa->tsa_used; i++)
> 		{
> 			PgStat_TableStatus *entry = &tsa->tsa_entries[i];
>
> 			/* Shouldn't have any pending transaction-dependent counts */
> 			Assert(entry->trans == NULL);
>
> 			/*
> 			 * Ignore entries that didn't accumulate any actual counts, such
> 			 * as indexes that were opened by the planner but not used.
> 			 */
> 			if (memcmp(&entry->t_counts, &all_zeroes,
> 					   sizeof(PgStat_TableCounts)) == 0)
> 				continue;
>
> 			/* try to apply the tab stats */
> 			if (!pgstat_flush_tabstat(cxt, nowait, entry))
> 			{
> 				/*
> 				 * Failed. Move it to the beginning in TabStatusArray and
> 				 * leave it.
> 				 */
> 				TabStatHashEntry *hash_entry;
> 				bool		found;
>
> 				if (new_tsa_hash == NULL)
> 					new_tsa_hash = create_tabstat_hash();
>
> 				/* Create hash entry for this entry */
> 				hash_entry = hash_search(new_tsa_hash, &entry->t_id,
> 										 HASH_ENTER, &found);
> 				Assert(!found);
>
> 				/*
> 				 * Move insertion pointer to the next segment if the segment
> 				 * is filled up.
> 				 */
> 				if (dest_elem >= TABSTAT_QUANTUM)
> 				{
> 					Assert(dest_tsa->tsa_next != NULL);
> 					dest_tsa = dest_tsa->tsa_next;
> 					dest_elem = 0;
> 				}
>
> 				/*
> 				 * Pack the entry at the begining of the array. Do nothing if
> 				 * no need to be moved.
> 				 */
> 				if (tsa != dest_tsa || i != dest_elem)
> 				{
> 					PgStat_TableStatus *new_entry;
>
> 					new_entry = &dest_tsa->tsa_entries[dest_elem];
> 					*new_entry = *entry;
>
> 					/* use new_entry as entry hereafter */
> 					entry = new_entry;
> 				}
>
> 				hash_entry->tsa_entry = entry;
> 				dest_elem++;
> 			}

This seems like too much code. Why is this entirely different from the way
funcstats works? The difference was already too big before, but this made
it *way* worse.

One goal of this project, as I understand it, is to make it easier to add
additional stats. As is, this seems to make it harder from the code level.

> bool
> pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
> 					 PgStat_TableStatus *entry)
> {
> 	Oid			dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> 	int			table_mode = PGSTAT_EXCLUSIVE;
> 	bool		updated = false;
> 	dshash_table *tabhash;
> 	PgStat_StatDBEntry *dbent;
> 	int			generation;
>
> 	if (nowait)
> 		table_mode |= PGSTAT_NOWAIT;
>
> 	/* Attach required table hash if not yet. */
> 	if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
> 	{
> 		/*
> 		 * Return if we don't have corresponding dbentry. It would've been
> 		 * removed.
> 		 */
> 		dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
> 		if (!dbent)
> 			return false;
>
> 		/*
> 		 * We don't hold lock on the dbentry since it cannot be dropped while
> 		 * we are working on it.
> 		 */
> 		generation = pin_hashes(dbent);
> 		tabhash = attach_table_hash(dbent, generation);

This again is just cost incurred by insisting on destroying hashtables
instead of keeping them around as long as necessary.

> 		if (entry->t_shared)
> 		{
> 			cxt->shgeneration = generation;
> 			cxt->shdbentry = dbent;
> 			cxt->shdb_tabhash = tabhash;
> 		}
> 		else
> 		{
> 			cxt->mygeneration = generation;
> 			cxt->mydbentry = dbent;
> 			cxt->mydb_tabhash = tabhash;
>
> 			/*
> 			 * We come here once per database. Take the chance to update
> 			 * database-wide stats
> 			 */
> 			LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
> 			dbent->n_xact_commit += pgStatXactCommit;
> 			dbent->n_xact_rollback += pgStatXactRollback;
> 			dbent->n_block_read_time += pgStatBlockReadTime;
> 			dbent->n_block_write_time += pgStatBlockWriteTime;
> 			LWLockRelease(&dbent->lock);
> 			pgStatXactCommit = 0;
> 			pgStatXactRollback = 0;
> 			pgStatBlockReadTime = 0;
> 			pgStatBlockWriteTime = 0;
> 		}
> 	}
> 	else if (entry->t_shared)
> 	{
> 		dbent = cxt->shdbentry;
> 		tabhash = cxt->shdb_tabhash;
> 	}
> 	else
> 	{
> 		dbent = cxt->mydbentry;
> 		tabhash = cxt->mydb_tabhash;
> 	}
>
>
> 	/*
> 	 * Local table stats should be applied to both dbentry and tabentry at
> 	 * once. Update dbentry only if we could update tabentry.
> 	 */
> 	if (pgstat_update_tabentry(tabhash, entry, nowait))
> 	{
> 		pgstat_update_dbentry(dbent, entry);
> 		updated = true;
> 	}

At this point we're very deeply nested:
pgstat_report_stat()
-> pgstat_flush_stat()
-> pgstat_flush_tabstat()
-> pgstat_update_tabentry().

That's way over the top imo.

I don't think it makes much sense that pgstat_update_dbentry() is called
separately for each table. Why would we want to constantly lock that entry?
It seems to be much more sensible to instead have pgstat_flush_stat()
transfer the stats it reported to the pending database wide counters, and
then report that to shared memory *once* per pgstat_report_stat() with
pgstat_flush_dbstats()?
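Sketch of the shape I mean (pending_db_stats being a hypothetical
backend-local variable):

	/* in pgstat_flush_stat(), per table, without touching shared memory: */
	pending_db_stats.n_tuples_inserted += entry->t_counts.t_tuples_inserted;
	pending_db_stats.n_tuples_updated += entry->t_counts.t_tuples_updated;
	pending_db_stats.n_tuples_deleted += entry->t_counts.t_tuples_deleted;

	/* and then once, from pgstat_flush_dbstats(), under the dbentry lock: */
	dbent->n_tuples_inserted += pending_db_stats.n_tuples_inserted;
	dbent->n_tuples_updated += pending_db_stats.n_tuples_updated;
	dbent->n_tuples_deleted += pending_db_stats.n_tuples_deleted;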
> /*
>  * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
>  *
>  * If nowait is true, returns with false on lock failure on dbentry.
>  *
>  * Returns true if all stats are flushed out.
>  */
> static bool
> pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
> {
> 	/* get dbentry if not yet */
> 	if (cxt->mydbentry == NULL)
> 	{
> 		int			op = PGSTAT_EXCLUSIVE;
>
> 		if (nowait)
> 			op |= PGSTAT_NOWAIT;
>
> 		cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
>
> 		/* return if lock failed. */
> 		if (cxt->mydbentry == NULL)
> 			return false;
>
> 		/* we use this generation of table /function stats in this turn */
> 		cxt->mygeneration = pin_hashes(cxt->mydbentry);
> 	}
>
> 	LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
> 	if (HAVE_PENDING_CONFLICTS())
> 		pgstat_flush_recovery_conflict(cxt->mydbentry);
> 	if (BeDBStats.n_deadlocks != 0)
> 		pgstat_flush_deadlock(cxt->mydbentry);
> 	if (BeDBStats.n_tmpfiles != 0)
> 		pgstat_flush_tempfile(cxt->mydbentry);
> 	if (BeDBStats.checksum_failures != NULL)
> 		pgstat_flush_checksum_failure(cxt->mydbentry);
> 	LWLockRelease(&cxt->mydbentry->lock);

What's the point of having all these sub-functions? I see that you, for an
undocumented reason, have pgstat_report_recovery_conflict() flush conflict
stats immediately:

> 	dbentry = pgstat_get_db_entry(MyDatabaseId,
> 								  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
> 								  &status);
>
> 	if (status == LOCK_FAILED)
> 		return;
>
> 	/* We had a chance to flush immediately */
> 	pgstat_flush_recovery_conflict(dbentry);
>
> 	dshash_release_lock(pgStatDBHash, dbentry);

But I don't understand why? Nor why we'd not just report all pending
database wide changes in that case?

The fact that you're locking the per-database entry unconditionally once
for each table almost guarantees contention - and you're not using the
'conditional lock' approach for that. I don't understand.

> /* ----------
>  * pgstat_vacuum_stat() -
>  *
>  *	Remove objects we can get rid of.
>  * ----------
>  */
> void
> pgstat_vacuum_stat(void)
> {
> 	HTAB	   *oidtab;
> 	dshash_seq_status dshstat;
> 	PgStat_StatDBEntry *dbentry;
>
> 	/* we don't collect stats under standalone mode */
> 	if (!IsUnderPostmaster)
> 		return;
>
> 	/*
> 	 * Read pg_database and make a list of OIDs of all existing databases
> 	 */
> 	oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
>
> 	/*
> 	 * Search the database hash table for dead databases and drop them
> 	 * from the hash.
> 	 */
>
> 	dshash_seq_init(&dshstat, pgStatDBHash, false, true);
> 	while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
> 	{
> 		Oid			dbid = dbentry->databaseid;
>
> 		CHECK_FOR_INTERRUPTS();
>
> 		/* the DB entry for shared tables (with InvalidOid) is never dropped */
> 		if (OidIsValid(dbid) &&
> 			hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
> 			pgstat_drop_database(dbid);
> 	}
>
> 	/* Clean up */
> 	hash_destroy(oidtab);

So, uh, pgstat_drop_database() again does a *separate* lookup in the
dshash, locking the entry. Which only works because you added this dirty
hack:

	/* We need to keep partition lock while sequential scan */
	if (!hash_table->seqscan_running)
	{
		hash_table->find_locked = false;
		hash_table->find_exclusively_locked = false;
		LWLockRelease(PARTITION_LOCK(hash_table, partition));
	}

to dshash_delete_entry(). This seems insane to me. There's not even a
comment explaining this?
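With the dshash_delete_current() API suggested earlier, the scan could drop
dead entries directly, without the re-lookup or the lock hack (sketch,
leaving out what pgstat_drop_database() does with the sub-hashes):

	dshash_seq_init(&dshstat, pgStatDBHash, false, true);
	while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
	{
		Oid			dbid = dbentry->databaseid;

		CHECK_FOR_INTERRUPTS();

		if (OidIsValid(dbid) &&
			hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
			dshash_delete_current(&dshstat, dbentry);
	}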
> 	/*
> 	 * Similarly to above, make a list of all known relations in this DB.
> 	 */
> 	oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
>
> 	/*
> 	 * Check for all tables listed in stats hashtable if they still exist.
> 	 * Stats cache is useless here so directly search the shared hash.
> 	 */
> 	pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
>
> 	/*
> 	 * Repeat the above but we needn't bother in the common case where no
> 	 * function stats are being collected.
> 	 */
> 	if (dbentry->functions != DSM_HANDLE_INVALID)
> 	{
> 		oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
>
> 		pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
> 									  oidtab);
> 	}
> 	dshash_release_lock(pgStatDBHash, dbentry);

Wait, why are we holding the database partition lock across all this? Again
without any comments explaining why?

> +void
> +pgstat_send_archiver(const char *xlog, bool failed)

Why do we still have functions named pgstat_send*?

Greetings,

Andres Freund