Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible - Mailing list pgsql-hackers

From Jacob Champion
Subject Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
Date
Msg-id CAOYmi+kUkCkSV13-a0b0rwmtge_uZnPKQx-yLFQJdfoceWFH2g@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible  (Andres Freund <andres@anarazel.de>)
Responses Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
List pgsql-hackers
On Thu, Nov 7, 2024 at 10:12 AM Andres Freund <andres@anarazel.de> wrote:
> I don't understand why the pgstat_bestart()/pgstat_bestart_pre_auth() split
> makes sense. The latter is going to redo most of the work that the former
> did. What's the point of that?
>
> Why not have a new function that initializes just the missing additional
> information? Or for that matter, why not move most of what pgstat_bestart()
> does into pgstat_beinit()?

I talk about that up above [1]. I agree that this is all complicated
and fragile, but at the moment, I think splitting things apart is not
going to reduce the complexity in any way. I'm all ears for a
different approach, though (and it sounds like Michael is taking a
stab at it too).

> This doesn't really seem like it's actually using wait events to describe
> waits. The new wait events cover stuff like memory allocations etc, see
> e.g. pg_SSPI_make_upn().

I've also asked about the "scope" of the waits in the OP [2]. I can
move them downwards in the stack, if you'd prefer.

All of these are intended to cover parts of the code that can actually
hang, but for things like SSPI I'm just working off of inspection and
Win32 documentation. So if it's not actually true that some of these
call points can hang, let me know and I can remove them. (For the
particular example you called out, I'm just trying to cover both calls
to TranslateName() in a maintainable place. The documentation says
"TranslateName fails if it cannot bind to Active Directory on a domain
controller." which seemed pretty wait-worthy to me.)

> This isn't just pedantry - all the relevant code really needs to be rewritten
> to allow the blocking to happen in an interruptible way, otherwise
> authentication timeout etc can't realiably work. Once that's done you can
> actually define useful wait events too.

I agree that would be amazing! I'm not about to tackle reliable
interrupts for all of the current blocking auth code for v18, though.
I'm just trying to make it observable when we do something that
blocks.

--Jacob

[1] https://www.postgresql.org/message-id/CAOYmi%2BkLzSWrDHZbJg8bWZ94oP_K98mkoEvetgupOBVoy5H_ag%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAOYmi%2B%3D60deN20WDyCoHCiecgivJxr%3D98s7s7-C8SkXwrCfHXg%40mail.gmail.com



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Next
From: Robert Haas
Date:
Subject: Re: magical eref alias names