Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers
From | torikoshia |
---|---|
Subject | Re: RFC: Logging plan of the running query |
Date | |
Msg-id | cbe6eb783affb3e9b5068f5cf019ee62@oss.nttdata.com Whole thread Raw |
In response to | Re: RFC: Logging plan of the running query (James Coleman <jtc331@gmail.com>) |
Responses |
Re: RFC: Logging plan of the running query
Re: RFC: Logging plan of the running query |
List | pgsql-hackers |
On 2023-09-06 11:17, James Coleman wrote: >> > I've never been able to reproduce it (haven't tested the new version, >> > but v28 at least) on my M1 Mac; where I've reproduced it is on Debian >> > (first buster and now bullseye). >> > >> > I'm attaching several stacktraces in the hope that they provide some >> > clues. These all match the ps output I sent earlier, though note in >> > that output there is both the regress instance and my test instance >> > (pid 3213249) running (different ports, of course, and they are from >> > the exact same compilation run). I've attached ps output for the >> > postgres processes under the make check process to simplify cross >> > referencing. >> > >> > A few interesting things: >> > - There's definitely a lock on a relation that seems to be what's >> > blocking the processes. >> > - When I try to connect with psql the process forks but then hangs >> > (see the ps output with task names stuck in "authentication"). I've >> > also included a trace from one of these. >> >> Thanks for sharing them! >> >> Many processes are waiting to acquire the LW lock, including the >> process >> trying to output the plan(select1.trace). >> >> I suspect that this is due to a lock that was acquired prior to being >> interrupted by ProcessLogQueryPlanInterrupt(), but have not been able >> to >> reproduce the same situation.. >> > > I don't have time immediately to dig into this, but perhaps loading up > the core dumps would allow us to see what query is running in each > backend process (if it hasn't already been discarded by that point) > and thereby determine what point in each test process led to the error > condition? Thanks for the suggestion. I am concerned that core dumps may not be readable on different operating systems or other environments. (Unfortunately, I do not have Debian on hand) It seems that we can know what queries were running from the stack traces you shared. As described above, I suspect a lock which was acquired prior to ProcessLogQueryPlanInterrupt() caused the issue. I will try a little more to see if I can devise a way to create the same situation. > Alternatively we might be able to apply the same trick to the test > client instead... > > BTW, for my own easy reference in this thread: relid 1259 is pg_class > if I'm not mistaken. Yeah, and I think it's strange that the lock to 1259 appears twice and must be avoided. #10 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at lmgr.c:117 .. #49 0x0000559d61b4989d in ProcessLogQueryPlanInterrupt () at explain.c:5158 .. #53 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at lmgr.c:117 -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
pgsql-hackers by date: