Re: shared-memory based stats collector - Mailing list pgsql-hackers

From:		Andres Freund
Subject:	Re: shared-memory based stats collector
Date:		2020-03-13 03:13:24
Msg-id:		20200313031324.4al3nnwxnzl2j4sb@alap3.anarazel.de
In response to:	Re: shared-memory based stats collector (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses:	Re: shared-memory based stats collector
		Re: shared-memory based stats collector
List:		pgsql-hackers
Hi,

Thomas, could you look at the first two patches here, and my review
questions?

General comments about this series:

- A lot of the review comments feel like I've written them before, a year
  or more ago. I feel this patch ought to be in a much better state.
  There's a lot of IMO fairly obvious stuff here, and things that have
  been mentioned multiple times previously.

- There's a *lot* of typos in here. I realize being an ESL speaker is
  hard, but a lot of these can be found with the simplest spellchecker.
  That's one thing for a patch that just has been hacked up as a POC, but
  this is a multi year thread?

- There's some odd formatting. Consider using pgindent more regularly.

More detailed comments below.

I'm considering rewriting the parts of the patchset that I don't like -
but it'll look quite different afterwards.

On 2020-01-22 17:24:04 +0900, Kyotaro Horiguchi wrote:
> From 5f7946522dc189429008e830af33ff2db435dd42 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Fri, 29 Jun 2018 16:41:04 +0900
> Subject: [PATCH 1/5] sequential scan for dshash
>
> Add sequential scan feature to dshash.

> 	dsa_pointer item_pointer = hash_table->buckets[i];
> @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
> 									LW_EXCLUSIVE));
>
> 	delete_item(hash_table, item);
> -	hash_table->find_locked = false;
> -	hash_table->find_exclusively_locked = false;
> -	LWLockRelease(PARTITION_LOCK(hash_table, partition));
> +
> +	/* We need to keep partition lock while sequential scan */
> +	if (!hash_table->seqscan_running)
> +	{
> +		hash_table->find_locked = false;
> +		hash_table->find_exclusively_locked = false;
> +		LWLockRelease(PARTITION_LOCK(hash_table, partition));
> +	}
> }

This seems like a failure prone API.

> /*
> @@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
> 	Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
> 								hash_table->find_exclusively_locked
> 								? LW_EXCLUSIVE : LW_SHARED));
> +	/* lock is under control of sequential scan */
> +	Assert(!hash_table->seqscan_running);
>
> 	hash_table->find_locked = false;
> 	hash_table->find_exclusively_locked = false;
> @@ -592,6 +610,164 @@ dshash_memhash(const void *v, size_t size, void *arg)
> 	return tag_hash(v, size);
> }
>
> +/*
> + * dshash_seq_init/_next/_term
> + *           Sequentially scan trhough dshash table and return all the
> + *           elements one by one, return NULL when no more.

s/trhough/through/

This uses a different comment style than the other functions in this
file. Why?

> + * dshash_seq_term should be called for incomplete scans and otherwise
> + * shoudln't. Finished scans are cleaned up automatically.

s/shoudln't/shouldn't/

I find the "cleaned up automatically" API terrible. I know you copied it
from dynahash, but I find it to be really failure prone. dynahash isn't an
example of good postgres code, the opposite, I'd say. It's a lot easier to
unconditionally have a terminate call if we need that.

> + * Returned elements are locked as is the case with dshash_find.  However, the
> + * caller must not release the lock.
> + *
> + * Same as dynanash, the caller may delete returned elements midst of a scan.

I think it's a bad idea to refer to dynahash here. That's just going to get
out of date. Also, code should be documented on its own.
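To make the earlier point concrete: with an unconditional terminate call, a
caller would look something like this (just a sketch, using the function
names from the patch; entry_is_interesting() is a made-up predicate):

	dshash_seq_status status;
	void	   *entry;

	dshash_seq_init(&status, hash_table, false, false);
	while ((entry = dshash_seq_next(&status)) != NULL)
	{
		if (entry_is_interesting(entry))
			break;				/* stopping early is fine */
	}
	dshash_seq_term(&status);	/* always required, no special cases */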
> + * If consistent is set for dshash_seq_init, the all hash table partitions are
> + * locked in the requested mode (as determined by the exclusive flag) during
> + * the scan.  Otherwise partitions are locked in one-at-a-time way during the
> + * scan.

Yet delete unconditionally retains locks?

> + */
> +void
> +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
> +				bool consistent, bool exclusive)
> +{

Why does this patch add the consistent mode? There's no users currently?
Without it, it's not clear that we need a separate _term function, I think?

I think we also can get rid of the dshash_delete changes, by instead adding
a dshash_delete_current(dshash_seq_stat *status, void *entry) API or such.

> @@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
>  extern void dshash_detach(dshash_table *hash_table);
>  extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
>  extern void dshash_destroy(dshash_table *hash_table);
> -
>  /* Finding, creating, deleting entries. */
>  extern void *dshash_find(dshash_table *hash_table,
>  						 const void *key, bool exclusive);

There's a number of spurious changes like this.

> From 60da67814fe40fd2a0c1870b15dcf6fcb21c989a Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Thu, 27 Sep 2018 11:15:19 +0900
> Subject: [PATCH 2/5] Add conditional lock feature to dshash
>
> Dshash currently waits for lock unconditionally. This commit adds new
> interfaces for dshash_find and dshash_find_or_insert. The new
> interfaces have an extra parameter "nowait" taht commands not to wait
> for lock.

s/taht/that/

There should be at least a sentence or two explaining why these are
useful.

> +/*
> + * The version of dshash_find, which is allowed to return immediately on lock
> + * failure. Lock status is set to *lock_failed in that case.
> + */

Hm. Not sure I like the *lock_acquired API.

> +void *
> +dshash_find_extended(dshash_table *hash_table, const void *key,
> +					 bool exclusive, bool nowait, bool *lock_acquired)
>  {
>  	dshash_hash hash;
>  	size_t		partition;
>  	dshash_table_item *item;
>
> +	/*
> +	 * No need to return lock resut when !nowait. Otherwise the caller may
> +	 * omit the lock result when NULL is returned.
> +	 */
> +	Assert(nowait || !lock_acquired);
> +
>  	hash = hash_key(hash_table, key);
>  	partition = PARTITION_FOR_HASH(hash);
>
>  	Assert(hash_table->control->magic == DSHASH_MAGIC);
>  	Assert(!hash_table->find_locked);
>
> -	LWLockAcquire(PARTITION_LOCK(hash_table, partition),
> -				  exclusive ? LW_EXCLUSIVE : LW_SHARED);
> +	if (nowait)
> +	{
> +		if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
> +									  exclusive ? LW_EXCLUSIVE : LW_SHARED))
> +		{
> +			if (lock_acquired)
> +				*lock_acquired = false;

Why is the test for lock_acquired needed here? I don't think it's possible
to use nowait correctly without passing in lock_acquired?

Think it'd make sense to document & assert that nowait = true implies
lock_acquired set, and nowait = false implies lock_acquired not being set.

But, uh, why do we even need the lock_acquired parameter? If we couldn't
find an entry, then we should just release the lock, no?

I'm however inclined to think it's better to just have a separate function
for the nowait case, rather than an extended version supporting both (with
an internal helper doing most of the work).

> +/*
> + * The version of dshash_find_or_insert, which is allowed to return immediately
> + * on lock failure.
> + *
> + * Notes above dshash_find_extended() regarding locking and error handling
> + * equally apply here.

They don't, there's no lock_acquired parameter.

> + */
> +void *
> +dshash_find_or_insert_extended(dshash_table *hash_table,
> +							   const void *key,
> +							   bool *found,
> +							   bool nowait)

I think it's absurd to have dshash_find, dshash_find_extended,
dshash_find_or_insert, dshash_find_or_insert_extended. If they're extended
they should also be able to specify whether the entry will get created.
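For example, a single extended variant could subsume all of these - a
sketch, not a worked-out signature:

	extern void *dshash_find_extended(dshash_table *hash_table,
									  const void *key,
									  bool exclusive, bool nowait,
									  bool insert, bool *found);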
> From d10c1117cec77a474dbb2cff001086d828b79624 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Wed, 7 Nov 2018 16:53:49 +0900
> Subject: [PATCH 3/5] Make archiver process an auxiliary process
>
> This is a preliminary patch for shared-memory based stats collector.
> Archiver process must be a auxiliary process since it uses shared
> memory after stats data wes moved onto shared-memory. Make the process
> an auxiliary process in order to make it work.

s/wes/was/
s/onto/into/

> @@ -451,6 +454,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
>  			StartupProcessMain();
>  			proc_exit(1);		/* should never return */
>
> +		case ArchiverProcess:
> +			/* don't set signals, archiver has its own agenda */
> +			PgArchiverMain();
> +			proc_exit(1);		/* should never return */
> +
>  		case BgWriterProcess:
>  			/* don't set signals, bgwriter has its own agenda */
>  			BackgroundWriterMain();

I think I'd rather remove the two comments that are copied to 6 out of 8
cases - they don't add anything.

> /* ------------------------------------------------------------
>  * Local functions called by archiver follow
>  * ------------------------------------------------------------
> @@ -219,8 +148,8 @@ pgarch_forkexec(void)
>  * The argc/argv parameters are valid only in EXEC_BACKEND case. However,
>  * since we don't use 'em, it hardly matters...
>  */
> -NON_EXEC_STATIC void
> -PgArchiverMain(int argc, char *argv[])
> +void
> +PgArchiverMain(void)
> {
> 	/*
> 	 * Ignore all signals usually bound to some action in the postmaster,
> @@ -252,8 +181,27 @@ PgArchiverMain(int argc, char *argv[])
>  static void
>  pgarch_exit(SIGNAL_ARGS)
>  {
> -	/* SIGQUIT means curl up and die ... */
> -	exit(1);
> +	PG_SETMASK(&BlockSig);
> +
> +	/*
> +	 * We DO NOT want to run proc_exit() callbacks -- we're here because
> +	 * shared memory may be corrupted, so we don't want to try to clean up our
> +	 * transaction. Just nail the windows shut and get out of town. Now that
> +	 * there's an atexit callback to prevent third-party code from breaking
> +	 * things by calling exit() directly, we have to reset the callbacks
> +	 * explicitly to make this work as intended.
> +	 */
> +	on_exit_reset();
> +
> +	/*
> +	 * Note we do exit(2) not exit(0). This is to force the postmaster into a
> +	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
> +	 * process. This is necessary precisely because we don't clean up our
> +	 * shared memory state. (The "dead man switch" mechanism in pmsignal.c
> +	 * should ensure the postmaster sees this as a crash, too, but no harm in
> +	 * being doubly sure.)
> +	 */
> +	exit(2);
> }

This seems to be a copy of code & comments from other signal handlers that
predates

commit 8e19a82640d3fa2350db146ec72916856dd02f0a
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   2018-08-08 19:08:10 +0300

    Don't run atexit callbacks in quickdie signal handlers.

I think this just should use SignalHandlerForCrashExit().

I think we can even commit that separately - there's not really a reason to
not do that today, as far as I can tell?
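I.e. the whole pgarch_exit() handler could presumably go away in favor of
something like this (a sketch, assuming the archiver has no local state
that needs special handling; the declaration lives in
postmaster/interrupt.h as for the other aux processes):

	#include "postmaster/interrupt.h"

	/* in PgArchiverMain(), instead of installing pgarch_exit: */
	pqsignal(SIGQUIT, SignalHandlerForCrashExit);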
> /* SIGUSR1 signal handler for archiver process */

Hm - this currently doesn't set up a correct sigusr1 handler for a shared
memory backend - needs to invoke procsignal_sigusr1_handler somewhere.

We can probably just convert to using normal latches here, and remove the
current 'wakened' logic? That'll remove the indirection via postmaster too,
which is nice.

> @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
>
>  	switch (backendType)
>  	{
> +		case B_ARCHIVER:
> +			backendDesc = "archiver";
> +			break;

should imo include 'WAL' or such.

> From 5079583c447c3172aa0b4f8c0f0a46f6e1512812 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Thu, 21 Feb 2019 12:44:56 +0900
> Subject: [PATCH 4/5] Shared-memory based stats collector
>
> Previously activity statistics is shared via files on disk. Every
> backend sends the numbers to the stats collector process via a socket.
> It makes snapshots as a set of files on disk with a certain interval
> then every backend reads them as necessary. It worked fine for
> comparatively small set of statistics but the set is under the
> pressure to growing up and the file size has reached the order of
> megabytes. To deal with larger statistics set, this patch let backends
> directly share the statistics via shared memory.

This spends a fair bit of space describing the old state, but very little
describing the new state.

> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 0bfd6151c4..a6b0bdec12 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -53,7 +53,6 @@ postgres 15554  0.0  0.0  57536  1184 ?  Ss  18:02  0:00 postgres: back
>  postgres 15555  0.0  0.0  57536   916 ?  Ss  18:02  0:00 postgres: checkpointer
>  postgres 15556  0.0  0.0  57536   916 ?  Ss  18:02  0:00 postgres: walwriter
>  postgres 15557  0.0  0.0  58504  2244 ?  Ss  18:02  0:00 postgres: autovacuum launcher
> -postgres 15558  0.0  0.0  17512  1068 ?  Ss  18:02  0:00 postgres: stats collector
>  postgres 15582  0.0  0.0  58772  3080 ?  Ss  18:04  0:00 postgres: joe runbug 127.0.0.1 idle
>  postgres 15606  0.0  0.0  58772  3052 ?  Ss  18:07  0:00 postgres: tgl regression [local] SELECT waiting
>  postgres 15610  0.0  0.0  58772  3056 ?  Ss  18:07  0:00 postgres: tgl regression [local] idle in transaction
> @@ -65,9 +64,8 @@ postgres 15610  0.0  0.0  58772  3056 ?  Ss  18:07  0:00 postgres: tgl
>     master server process.  The command arguments
>     shown for it are the same ones used when it was launched.  The next five
>     processes are background worker processes automatically launched by the
> -   master process.  (The <quote>stats collector</quote> process will not be present
> -   if you have set the system not to start the statistics collector; likewise
> -   the <quote>autovacuum launcher</quote> process can be disabled.)
> +   master process.  (The <quote>autovacuum launcher</quote> process will not
> +   be present if you have set the system not to start it.)
>     Each of the remaining
>     processes is a server process handling one client connection.  Each such
>     process sets its command line display in the form

There's more references to the stats collector than this... E.g. in
catalogs.sgml

   <xref linkend="view-table"/> lists the system views described here.
   More detailed documentation of each view follows below.
   There are some additional views that provide access to the results of
   the statistics collector; they are described in <xref
   linkend="monitoring-stats-views-table"/>.
  </para>

> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 6d1f28c327..8dcb0fb7f7 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -1956,15 +1956,15 @@ do_autovacuum(void)
>  								  ALLOCSET_DEFAULT_SIZES);
>  	MemoryContextSwitchTo(AutovacMemCxt);
>
> +	/* Start a transaction so our commands have one to play into. */
> +	StartTransactionCommand();
> +
>  	/*
>  	 * may be NULL if we couldn't find an entry (only happens if we are
>  	 * forcing a vacuum for anti-wrap purposes).
>  	 */
>  	dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
>
> -	/* Start a transaction so our commands have one to play into. */
> -	StartTransactionCommand();
> -
>  	/*
>  	 * Clean up any dead statistics collector entries for this DB. We always
>  	 * want to do this exactly once per DB-processing cycle, even if we find
> @@ -2747,12 +2747,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
>  	if (isshared)
>  	{
>  		if (PointerIsValid(shared))
> -			tabentry = hash_search(shared->tables, &relid,
> -								   HASH_FIND, NULL);
> +			tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
>  	}
>  	else if (PointerIsValid(dbentry))
> -		tabentry = hash_search(dbentry->tables, &relid,
> -							   HASH_FIND, NULL);
> +		tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
>
>  	return tabentry;
>  }

Why is pgstat_fetch_stat_tabentry_extended called "_extended"? Outside the
stats subsystem there's exactly one caller for the non extended version, as
far as I can see. That's index_concurrently_swap() - and imo that's code
that should live in the stats subsystem, rather than open coded in index.c.
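Sketch of what that could look like (pgstat_copy_index_counters() is a
hypothetical name, not something in the patch):

	/* in pgstat: copy accumulated counters from one index to another */
	extern void pgstat_copy_index_counters(Oid fromrelid, Oid torelid);

	/* index_concurrently_swap() then shrinks to */
	pgstat_copy_index_counters(oldIndexId, newIndexId);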
> diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> index ca5c6376e5..1ffe073a1f 100644
> --- a/src/backend/postmaster/pgstat.c
> +++ b/src/backend/postmaster/pgstat.c
> @@ -1,15 +1,23 @@
>  /* ----------
>   * pgstat.c
>   *
> - *	All the statistics collector stuff hacked up in one big, ugly file.
> + *	Statistics collector facility.
>   *
> - *	TODO:	- Separate collector, postmaster and backend stuff
> - *			  into different files.
> + *	Collects per-table and per-function usage statistics of all backends on
> + *	shared memory. pg_count_*() and friends are the interface to locally
> + *	store backend activities during a transaction. Then pgstat_flush_stat()
> + *	is called at the end of a transaction to pulish the local stats on
> + *	shared memory.
>   *

I'd rather not exhaustively list the different objects this handles - it'll
either be annoying to maintain, or just get out of date.

> - *			- Add some automatic call for pgstat vacuuming.
> + *	To avoid congestion on the shared memory, we update shared stats no more
> + *	often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case
> + *	where all the local numbers cannot be flushed immediately, we postpone
> + *	updates and try the next chance after the interval of
> + *	PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
> + *	PGSTAT_STAT_MAX_INTERVAL(1000ms).

I'm not convinced by this backoff logic. The basic interval seems quite
high for something going through shared memory, and the max retry seems
pretty low.

> +/*
> + * Operation mode and return code of pgstat_get_db_entry.
> + */
> +#define	PGSTAT_SHARED		0

This is unreferenced.

> +#define	PGSTAT_EXCLUSIVE	1
> +#define	PGSTAT_NOWAIT		2

And these should imo rather be parameters.
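I.e. something along these lines (sketch only, ignoring the status return
for the moment), instead of or-ing flags into an int:

	static PgStat_StatDBEntry *
	pgstat_get_db_entry(Oid databaseid, bool exclusive, bool nowait);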
> +typedef enum PgStat_TableLookupResult
> +{
> +	NOT_FOUND,
> +	FOUND,
> +	LOCK_FAILED
> +} PgStat_TableLookupResult;

This seems like a seriously bad idea to me. These are very generic names.
There's also basically no references except setting them to the first two?

> +#define StatsLock (&StatsShmem->StatsMainLock)
>
> -static time_t last_pgstat_start_time;
> +/* Shared stats bootstrap information */
> +typedef struct StatsShmemStruct
> +{
> +	LWLock		StatsMainLock;		/* lock to protect this struct */
> +	dsa_handle	stats_dsa_handle;	/* DSA handle for stats data */
> +	dshash_table_handle db_hash_handle;
> +	dsa_pointer	global_stats;
> +	dsa_pointer	archiver_stats;
> +	int			refcount;
> +} StatsShmemStruct;

Why isn't this an lwlock in lwlocknames.h, rather than allocated here?

> +/*
> + * BgWriter global statistics counters. The name cntains a remnant from the
> + * time when the stats collector was a dedicate process, which used sockets
> + * to send it.
> + */
> +PgStat_MsgBgWriter BgWriterStats = {0};

I am strongly against keeping the 'Msg' prefix. That seems extremely
confusing going forward.

> +/* common header of snapshot entry in reader snapshot hash */
> +typedef struct PgStat_snapshot
> +{
> +	Oid		key;
> +	bool	negative;
> +	void   *body;		/* end of header part: to keep alignment */
> +} PgStat_snapshot;

> +/* context struct for snapshot_statentry */
> +typedef struct pgstat_snapshot_param
> +{
> +	char	   *hash_name;			/* name of the snapshot hash */
> +	int			hash_entsize;		/* element size of hash entry */
> +	dshash_table_handle dsh_handle;	/* dsh handle to attach */
> +	const dshash_parameters *dsh_params;/* dshash params */
> +	HTAB	  **hash;				/* points to variable to hold hash */
> +	dshash_table **dshash;			/* ditto for dshash */
> +} pgstat_snapshot_param;

Why does this exist? The struct contents are actually constant across
calls, yet you have declared them inside functions (as static - static on
function scope isn't really the same as global static).

If we want it, I think we should separate the naming more meaningfully. The
important difference between 'hash' and 'dshash' isn't the hashing module,
it's that one is a local copy, the other a shared hashtable!

> +/*
> + * Backends store various database-wide info that's waiting to be flushed out
> + * to shared memory in these variables.
> + *
> + * checksum_failures is the exception in that it is cluster-wide value.
> + */
> +typedef struct BackendDBStats
> +{
> +	int		n_conflict_tablespace;
> +	int		n_conflict_lock;
> +	int		n_conflict_snapshot;
> +	int		n_conflict_bufferpin;
> +	int		n_conflict_startup_deadlock;
> +	int		n_deadlocks;
> +	size_t	n_tmpfiles;
> +	size_t	tmpfilesize;
> +	HTAB   *checksum_failures;
> +} BackendDBStats;

Why is this a separate struct from PgStat_StatDBEntry? We shouldn't have
these fields in multiple places.

> +	if (StatsShmem->refcount > 0)
> +		StatsShmem->refcount++;

What prevents us from leaking the refcount here? We could e.g. error out
while attaching, no? Which'd mean we'd leak the refcount.

To me it looks like there's a lot of added complexity just because you want
to be able to reset stats via

	void
	pgstat_reset_all(void)
	{
		/*
		 * We could directly remove files and recreate the shared memory
		 * area. But detach then attach for simplicity.
		 */
		pgstat_detach_shared_stats(false);	/* Don't write */
		pgstat_attach_shared_stats();

Without that you'd not need the complexity of attaching, detaching to the
same degree - every backend could just cache lookup data during
initialization, instead of having to constantly re-compute that.

Nor would the dynamic re-creation of the db dshash table be needed.

> +/* ----------
> + * pgstat_report_stat() -
> + *
> + *	Must be called by processes that performs DML: tcop/postgres.c, logical
> + *	receiver processes, SPI worker, etc. to apply the so far collected
> + *	per-table and function usage statistics to the shared statistics hashes.
> + *
> + *	Updates are applied not more frequent than the interval of
> + *	PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
> + *	failure if force is false and there's no pending updates longer than
> + *	PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
> + *	succeeding calls of this function.
> + *
> + *	Returns the time until the next timing when updates are applied in
> + *	milliseconds if there are no updates holded for more than
> + *	PGSTAT_STAT_MIN_INTERVAL milliseconds.
> + *
> + *	Note that this is called only out of a transaction, so it is fine to use
> + *	transaction stop time as an approximation of current time.
> + *	----------
> + */

Inconsistent indentation.

> +long
> +pgstat_report_stat(bool force)
>  {

> +	/* Flush out table stats */
> +	if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> +		pending_stats = true;
> +
> +	/* Flush out function stats */
> +	if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> +		pending_stats = true;

This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate on
pgStatTabList/pgStatFunctions, but don't actually reset it? Besides being
confusing while reading the code, it also made the diff much harder to
read.

> -		snprintf(fname, sizeof(fname), "%s/%s", directory,
> -				 entry->d_name);
> -		unlink(fname);
> +	/* Flush out database-wide stats */
> +	if (HAVE_PENDING_DBSTATS())
> +	{
> +		if (!pgstat_flush_dbstats(&cxt, !force))
> +			pending_stats = true;
> 	}

Linearly checking a number of stats doesn't seem like the right way going
forward. Also seems fairly omission prone.

Why does this check live in pgstat_report_stat(), rather than
pgstat_flush_dbstats()?

> /*
>  * snapshot_statentry() - Common routine for functions
>  *							 pgstat_fetch_stat_*entry()
>  *

Why has this function been added between the closely linked
pgstat_report_stat() and pgstat_flush_stat() etc?

>  * Returns the pointer to a snapshot of a shared entry for the key or NULL if
>  * not found. Returned snapshots are stable during the current transaction or
>  * until pgstat_clear_snapshot() is called.
>  *
>  * The snapshots are stored in a hash, pointer to which is stored in the
>  * *HTAB variable pointed by cxt->hash. If not created yet, it is created
>  * using hash_name, hash_entsize in cxt.
>  *
>  * cxt->dshash points to dshash_table for dbstat entries. If not yet
>  * attached, it is attached using cxt->dsh_handle.

Why do we still have this? A hashtable lookup is cheap, compared to
fetching a file - so it's not to save time. Given how infrequent the
pgstat_fetch_* calls are, it's not to avoid contention either.

At first one could think it's for consistency - but no, that's not it
either, because snapshot_statentry() refetches the snapshot without control
from the outside:

> 	/*
> 	 * We don't want so frequent update of stats snapshot. Keep it at least
> 	 * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
> 	 */
> 	if (clear_snapshot)
> 	{
> 		clear_snapshot = false;
>
> 		if (pgStatSnapshotContext &&
> 			snapshot_globalStats.stats_timestamp <
> 			GetCurrentStatementStartTimestamp() -
> 			PGSTAT_STAT_MIN_INTERVAL * 1000)
> 		{
> 			MemoryContextReset(pgStatSnapshotContext);
>
> 			/* Reset variables */
> 			global_snapshot_is_valid = false;
> 			pgStatSnapshotContext = NULL;
> 			pgStatLocalHash = NULL;
>
> 			pgstat_setup_memcxt();
> 		}
> 	}

I think we should just remove this entire local caching snapshot layer for
lookups.
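A lookup could then be about as simple as this sketch
(fetch_tabentry_copy() is a hypothetical helper, and tabhash stands in for
however we get at the right shared table hash):

	static PgStat_StatTabEntry *
	fetch_tabentry_copy(dshash_table *tabhash, Oid relid)
	{
		static PgStat_StatTabEntry copy;
		PgStat_StatTabEntry *shent;

		shent = dshash_find(tabhash, &relid, false);	/* shared lock */
		if (shent == NULL)
			return NULL;
		copy = *shent;			/* copy out while holding the lock */
		dshash_release_lock(tabhash, shent);
		return &copy;
	}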
> /*
>  * pgstat_flush_stat: Flushes table stats out to shared statistics.
>  *

Why is this named pgstat_flush_stat, rather than pgstat_flush_tabstats or
such? Given that the code for dealing with an individual table's entry is
named pgstat_flush_tabstat() that's very confusing.

>  * If nowait is true, returns false if required lock was not acquired
>  * immediately. In that case, unapplied table stats updates are left alone in
>  * TabStatusArray to wait for the next chance. cxt holds some dshash related
>  * values that we want to carry around while updating shared stats.
>  *
>  * Returns true if all stats info are flushed. Caller must detach dshashes
>  * stored in cxt after use.
>  */
> static bool
> pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
> {
> 	static const PgStat_TableCounts all_zeroes;
> 	TabStatusArray *tsa;
> 	HTAB	   *new_tsa_hash = NULL;
> 	TabStatusArray *dest_tsa = pgStatTabList;
> 	int			dest_elem = 0;
> 	int			i;
>
> 	/* nothing to do, just return */
> 	if (pgStatTabHash == NULL)
> 		return true;
>
> 	/*
> 	 * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
> 	 * entries it points to.
> 	 */
> 	hash_destroy(pgStatTabHash);
> 	pgStatTabHash = NULL;
>
> 	/*
> 	 * Scan through the TabStatusArray struct(s) to find tables that actually
> 	 * have counts, and try flushing it out to shared stats. We may fail on
> 	 * some entries in the array. Leaving the entries being packed at the
> 	 * beginning of the array.
> 	 */
> 	for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
> 	{

It seems odd that there's tabstat specific code in pgstat_flush_stat (also
note the singular while it's processing all stats, whereas you're below
treating pgstat_flush_tabstat as only affecting one table).

> 		for (i = 0; i < tsa->tsa_used; i++)
> 		{
> 			PgStat_TableStatus *entry = &tsa->tsa_entries[i];
>
> 			/* Shouldn't have any pending transaction-dependent counts */
> 			Assert(entry->trans == NULL);
>
> 			/*
> 			 * Ignore entries that didn't accumulate any actual counts, such
> 			 * as indexes that were opened by the planner but not used.
> 			 */
> 			if (memcmp(&entry->t_counts, &all_zeroes,
> 					   sizeof(PgStat_TableCounts)) == 0)
> 				continue;
>
> 			/* try to apply the tab stats */
> 			if (!pgstat_flush_tabstat(cxt, nowait, entry))
> 			{
> 				/*
> 				 * Failed. Move it to the beginning in TabStatusArray and
> 				 * leave it.
> 				 */
> 				TabStatHashEntry *hash_entry;
> 				bool		found;
>
> 				if (new_tsa_hash == NULL)
> 					new_tsa_hash = create_tabstat_hash();
>
> 				/* Create hash entry for this entry */
> 				hash_entry = hash_search(new_tsa_hash, &entry->t_id,
> 										 HASH_ENTER, &found);
> 				Assert(!found);
>
> 				/*
> 				 * Move insertion pointer to the next segment if the segment
> 				 * is filled up.
> 				 */
> 				if (dest_elem >= TABSTAT_QUANTUM)
> 				{
> 					Assert(dest_tsa->tsa_next != NULL);
> 					dest_tsa = dest_tsa->tsa_next;
> 					dest_elem = 0;
> 				}
>
> 				/*
> 				 * Pack the entry at the begining of the array. Do nothing if
> 				 * no need to be moved.
> 				 */
> 				if (tsa != dest_tsa || i != dest_elem)
> 				{
> 					PgStat_TableStatus *new_entry;
>
> 					new_entry = &dest_tsa->tsa_entries[dest_elem];
> 					*new_entry = *entry;
>
> 					/* use new_entry as entry hereafter */
> 					entry = new_entry;
> 				}
>
> 				hash_entry->tsa_entry = entry;
> 				dest_elem++;
> 			}

This seems like too much code. Why is this entirely different from the way
funcstats works? The difference was already too big before, but this made
it *way* worse.

One goal of this project, as I understand it, is to make it easier to add
additional stats. As is, this seems to make it harder from the code level.

> bool
> pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
> 					 PgStat_TableStatus *entry)
> {
> 	Oid			dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> 	int			table_mode = PGSTAT_EXCLUSIVE;
> 	bool		updated = false;
> 	dshash_table *tabhash;
> 	PgStat_StatDBEntry *dbent;
> 	int			generation;
>
> 	if (nowait)
> 		table_mode |= PGSTAT_NOWAIT;
>
> 	/* Attach required table hash if not yet. */
> 	if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
> 	{
> 		/*
> 		 * Return if we don't have corresponding dbentry. It would've been
> 		 * removed.
> 		 */
> 		dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
> 		if (!dbent)
> 			return false;
>
> 		/*
> 		 * We don't hold lock on the dbentry since it cannot be dropped while
> 		 * we are working on it.
> 		 */
> 		generation = pin_hashes(dbent);
> 		tabhash = attach_table_hash(dbent, generation);

This again is just cost incurred by insisting on destroying hashtables
instead of keeping them around as long as necessary.

> 		if (entry->t_shared)
> 		{
> 			cxt->shgeneration = generation;
> 			cxt->shdbentry = dbent;
> 			cxt->shdb_tabhash = tabhash;
> 		}
> 		else
> 		{
> 			cxt->mygeneration = generation;
> 			cxt->mydbentry = dbent;
> 			cxt->mydb_tabhash = tabhash;
>
> 			/*
> 			 * We come here once per database. Take the chance to update
> 			 * database-wide stats
> 			 */
> 			LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
> 			dbent->n_xact_commit += pgStatXactCommit;
> 			dbent->n_xact_rollback += pgStatXactRollback;
> 			dbent->n_block_read_time += pgStatBlockReadTime;
> 			dbent->n_block_write_time += pgStatBlockWriteTime;
> 			LWLockRelease(&dbent->lock);
> 			pgStatXactCommit = 0;
> 			pgStatXactRollback = 0;
> 			pgStatBlockReadTime = 0;
> 			pgStatBlockWriteTime = 0;
> 		}
> 	}
> 	else if (entry->t_shared)
> 	{
> 		dbent = cxt->shdbentry;
> 		tabhash = cxt->shdb_tabhash;
> 	}
> 	else
> 	{
> 		dbent = cxt->mydbentry;
> 		tabhash = cxt->mydb_tabhash;
> 	}
>
>
> 	/*
> 	 * Local table stats should be applied to both dbentry and tabentry at
> 	 * once. Update dbentry only if we could update tabentry.
> 	 */
> 	if (pgstat_update_tabentry(tabhash, entry, nowait))
> 	{
> 		pgstat_update_dbentry(dbent, entry);
> 		updated = true;
> 	}

At this point we're very deeply nested:
pgstat_report_stat()
-> pgstat_flush_stat()
-> pgstat_flush_tabstat()
-> pgstat_update_tabentry().

That's way over the top imo.

I don't think it makes much sense that pgstat_update_dbentry() is called
separately for each table. Why would we want to constantly lock that entry?
It seems to be much more sensible to instead have pgstat_flush_stat()
transfer the stats it reported to the pending database wide counters, and
then report that to shared memory *once* per pgstat_report_stat() with
pgstat_flush_dbstats()?
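Sketch of the shape I mean (pending_db_stats being a hypothetical
backend-local variable):

	/* in pgstat_flush_stat(), per table, without touching shared memory: */
	pending_db_stats.n_tuples_inserted += entry->t_counts.t_tuples_inserted;
	pending_db_stats.n_tuples_updated += entry->t_counts.t_tuples_updated;
	pending_db_stats.n_tuples_deleted += entry->t_counts.t_tuples_deleted;

	/* and then once, from pgstat_flush_dbstats(), under the dbentry lock: */
	dbent->n_tuples_inserted += pending_db_stats.n_tuples_inserted;
	dbent->n_tuples_updated += pending_db_stats.n_tuples_updated;
	dbent->n_tuples_deleted += pending_db_stats.n_tuples_deleted;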
> /*
>  * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
>  *
>  * If nowait is true, returns with false on lock failure on dbentry.
>  *
>  * Returns true if all stats are flushed out.
>  */
> static bool
> pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
> {
> 	/* get dbentry if not yet */
> 	if (cxt->mydbentry == NULL)
> 	{
> 		int			op = PGSTAT_EXCLUSIVE;
>
> 		if (nowait)
> 			op |= PGSTAT_NOWAIT;
>
> 		cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
>
> 		/* return if lock failed. */
> 		if (cxt->mydbentry == NULL)
> 			return false;
>
> 		/* we use this generation of table /function stats in this turn */
> 		cxt->mygeneration = pin_hashes(cxt->mydbentry);
> 	}
>
> 	LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
> 	if (HAVE_PENDING_CONFLICTS())
> 		pgstat_flush_recovery_conflict(cxt->mydbentry);
> 	if (BeDBStats.n_deadlocks != 0)
> 		pgstat_flush_deadlock(cxt->mydbentry);
> 	if (BeDBStats.n_tmpfiles != 0)
> 		pgstat_flush_tempfile(cxt->mydbentry);
> 	if (BeDBStats.checksum_failures != NULL)
> 		pgstat_flush_checksum_failure(cxt->mydbentry);
> 	LWLockRelease(&cxt->mydbentry->lock);

What's the point of having all these sub-functions? I see that you, for an
undocumented reason, have pgstat_report_recovery_conflict() flush conflict
stats immediately:

> 	dbentry = pgstat_get_db_entry(MyDatabaseId,
> 								  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
> 								  &status);
>
> 	if (status == LOCK_FAILED)
> 		return;
>
> 	/* We had a chance to flush immediately */
> 	pgstat_flush_recovery_conflict(dbentry);
>
> 	dshash_release_lock(pgStatDBHash, dbentry);

But I don't understand why? Nor why we'd not just report all pending
database wide changes in that case?

The fact that you're locking the per-database entry unconditionally once
for each table almost guarantees contention - and you're not using the
'conditional lock' approach for that. I don't understand.

> /* ----------
>  * pgstat_vacuum_stat() -
>  *
>  *	Remove objects we can get rid of.
>  * ----------
>  */
> void
> pgstat_vacuum_stat(void)
> {
> 	HTAB	   *oidtab;
> 	dshash_seq_status dshstat;
> 	PgStat_StatDBEntry *dbentry;
>
> 	/* we don't collect stats under standalone mode */
> 	if (!IsUnderPostmaster)
> 		return;
>
> 	/*
> 	 * Read pg_database and make a list of OIDs of all existing databases
> 	 */
> 	oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
>
> 	/*
> 	 * Search the database hash table for dead databases and drop them
> 	 * from the hash.
> 	 */
>
> 	dshash_seq_init(&dshstat, pgStatDBHash, false, true);
> 	while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
> 	{
> 		Oid			dbid = dbentry->databaseid;
>
> 		CHECK_FOR_INTERRUPTS();
>
> 		/* the DB entry for shared tables (with InvalidOid) is never dropped */
> 		if (OidIsValid(dbid) &&
> 			hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
> 			pgstat_drop_database(dbid);
> 	}
>
> 	/* Clean up */
> 	hash_destroy(oidtab);

So, uh, pgstat_drop_database() again does a *separate* lookup in the
dshash, locking the entry. Which only works because you added this dirty
hack:

	/* We need to keep partition lock while sequential scan */
	if (!hash_table->seqscan_running)
	{
		hash_table->find_locked = false;
		hash_table->find_exclusively_locked = false;
		LWLockRelease(PARTITION_LOCK(hash_table, partition));
	}

to dshash_delete_entry(). This seems insane to me. There's not even a
comment explaining this?
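With the dshash_delete_current() API suggested earlier, the scan could drop
dead entries directly, without the re-lookup or the lock hack (sketch,
leaving out what pgstat_drop_database() does with the sub-hashes):

	dshash_seq_init(&dshstat, pgStatDBHash, false, true);
	while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
	{
		Oid			dbid = dbentry->databaseid;

		CHECK_FOR_INTERRUPTS();

		if (OidIsValid(dbid) &&
			hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
			dshash_delete_current(&dshstat, dbentry);
	}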
> 	/*
> 	 * Similarly to above, make a list of all known relations in this DB.
> 	 */
> 	oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
>
> 	/*
> 	 * Check for all tables listed in stats hashtable if they still exist.
> 	 * Stats cache is useless here so directly search the shared hash.
> 	 */
> 	pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
>
> 	/*
> 	 * Repeat the above but we needn't bother in the common case where no
> 	 * function stats are being collected.
> 	 */
> 	if (dbentry->functions != DSM_HANDLE_INVALID)
> 	{
> 		oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
>
> 		pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
> 									  oidtab);
> 	}
> 	dshash_release_lock(pgStatDBHash, dbentry);

Wait, why are we holding the database partition lock across all this? Again
without any comments explaining why?

> +void
> +pgstat_send_archiver(const char *xlog, bool failed)

Why do we still have functions named pgstat_send*?

Greetings,

Andres Freund