Re: "ERROR: latch already owned" on gharial - Mailing list pgsql-hackers
From | Soumyadeep Chakraborty |
---|---|
Subject | Re: "ERROR: latch already owned" on gharial |
Date | |
Msg-id | CAE-ML+_CL3TfhLo6MjCSufinyugqSJWr8qEoWL8oAc-oT+P67g@mail.gmail.com |
In response to | Re: "ERROR: latch already owned" on gharial (Alvaro Herrera <alvherre@alvh.no-ip.org>) |
Responses |
Re: "ERROR: latch already owned" on gharial
|
List | pgsql-hackers |
Hey hackers,

I wanted to report that we have seen this issue (with the procLatch) a few times very sporadically on Greenplum 6X (based on 9.4), with relatively newer versions of GCC.

I realize that 9.4 is out of support, so this email is purely to add on to the existing thread, in case the info can help fix/reveal something in supported versions.

Unfortunately, we don't have a core to share as we don't have the benefit of commit [1] in Greenplum 6X, but we do possess commit [2] which gives us an elog ERROR as opposed to PANIC.

Instance 1:

Event 1: 2023-11-13 10:01:31.927168 CET..., pY, ..."LOG","00000","disconnection: session time: ..."

Event 2: 2023-11-13 10:01:32.049135 CET...,pX,,,,,"FATAL","XX000","latch already owned by pid Y (is_set: 0) (pg_latch.c:159)",,,,,,,0,, "pg_latch.c",159,"Stack trace:
    1 0xbde8b8 postgres errstart (elog.c:567)
    2 0xbe0768 postgres elog_finish (discriminator 7)
    3 0xa08924 postgres <symbol not found> (pg_latch.c:158) <---------- OwnLatch
    4 0xa7f179 postgres InitProcess (proc.c:523)
    5 0xa94ac3 postgres PostgresMain (postgres.c:4874)
    6 0xa1e2ed postgres <symbol not found> (postmaster.c:2860)
    7 0xa1f295 postgres PostmasterMain (discriminator 5)
...
"LOG","00000","server process (PID Y) exited with exit code 1",,,,,,,0,,"postmaster.c",3987,

Instance 2 (was reported with (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)):

Exactly the same as Instance 1 with identical log, ordering of events and stack trace, except this time (is_set: 1) when the ERROR is logged.

A possible ordering of events:

(1) DisownLatch() is called by pid Y during ProcKill() and the write for latch->owner_pid = 0 is NOT yet flushed to shmem.

(2) The PGPROC object for pid Y is returned to the free list.

(3) Pid X sees the same PGPROC object on the free list and grabs it.

(4) Pid X does sanity check inside OwnLatch during InitProcess and still sees the old value of latch->owner_pid = Y (and not = 0), and trips the ERROR.
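As an illustration of step (4), here is a minimal single-threaded sketch of the ownership check. The names ModelLatch, model_own_latch and model_disown_latch are hypothetical stand-ins and are not PostgreSQL's actual definitions. The "stale" state is modeled simply by attempting the claim before the disown has taken effect; it does not reproduce a real cross-CPU visibility race.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a shared latch; not PostgreSQL's real struct. */
typedef struct ModelLatch
{
    int  owner_pid;   /* 0 means "not owned" */
    bool is_set;
} ModelLatch;

/*
 * Models the sanity check: refuse to take a latch whose owner_pid is
 * still nonzero. Returns false where the real code would raise the
 * "latch already owned" error.
 */
static bool
model_own_latch(ModelLatch *latch, int my_pid)
{
    if (latch->owner_pid != 0)
    {
        fprintf(stderr, "latch already owned by pid %d (is_set: %d)\n",
                latch->owner_pid, (int) latch->is_set);
        return false;
    }
    latch->owner_pid = my_pid;
    return true;
}

/* Models the previous owner giving the latch up. */
static void
model_disown_latch(ModelLatch *latch)
{
    latch->owner_pid = 0;
}
```

If pid X calls model_own_latch() while owner_pid still holds Y, the check fails; once the reset to 0 is observed, the same call succeeds.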
The above sequence of operations should apply to PG HEAD as well.

Suggestion:

Should we do a pg_memory_barrier() at the end of DisownLatch(), like in ResetLatch(), like the one introduced in [3]? This would ensure that the write latch->owner_pid = 0; is flushed to shmem. The attached patch does this.

I'm not sure why we didn't introduce a memory barrier in DisownLatch() in [3]. I didn't find anything in the associated hackers thread [4] either. Was it the performance impact, or was it just because SetLatch and ResetLatch were more racy and this is way less likely to happen?

This is out of my wheelhouse, but would one additional barrier in a process' lifecycle be that bad for performance?

Appendix:

Build details: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)

CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -fno-aggressive-loop-optimizations -Wno-unused-but-set-variable -Wno-address -Werror=implicit-fallthrough=3 -Wno-format-truncation -Wno-stringop-truncation -m64 -O3 -fargument-noalias-global -fno-omit-frame-pointer -g -std=gnu99 -Werror=uninitialized -Werror=implicit-function-declaration

Regards,
Soumyadeep (VMware)

[1] https://github.com/postgres/postgres/commit/12e28aac8e8eb76cab13a4e9b696e3dab17f1c99
[2] https://github.com/greenplum-db/gpdb/commit/81fdd6c5219af865e9dc41f4087e0405d6616050
[3] https://github.com/postgres/postgres/commit/14e8803f101a54d99600683543b0f893a2e3f529
[4] https://www.postgresql.org/message-id/flat/20150112154026.GB2092%40awork2.anarazel.de
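To illustrate where the suggested barrier would sit, here is a rough sketch using C11 atomics. It is not the attached patch. SketchProcSlot, sketch_release_slot and sketch_claim_slot are hypothetical names, the on_free_list flag is a simplified stand-in for the PGPROC free list, and atomic_thread_fence(memory_order_seq_cst) is used as an assumed analogue of pg_memory_barrier(). Strictly speaking, a fence orders this thread's memory accesses rather than flushing anything: the reset of the owner becomes visible no later than the slot's return to the free list.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a PGPROC slot and its latch owner field. */
typedef struct SketchProcSlot
{
    atomic_int  latch_owner_pid;  /* 0 means "latch not owned" */
    atomic_bool on_free_list;     /* simplified stand-in for the free list */
} SketchProcSlot;

/* Exiting process: disown the latch, then hand the slot back. */
static void
sketch_release_slot(SketchProcSlot *slot)
{
    /* The "DisownLatch" step. */
    atomic_store_explicit(&slot->latch_owner_pid, 0, memory_order_relaxed);

    /* Proposed barrier: keep the store above ahead of the store below. */
    atomic_thread_fence(memory_order_seq_cst);

    /* Return the slot to the (modeled) free list. */
    atomic_store_explicit(&slot->on_free_list, true, memory_order_relaxed);
}

/* Starting process: take a free slot, then run the ownership check. */
static bool
sketch_claim_slot(SketchProcSlot *slot, int my_pid)
{
    if (!atomic_load_explicit(&slot->on_free_list, memory_order_relaxed))
        return false;           /* slot is not available */

    /* Pairs with the fence on the releasing side. */
    atomic_thread_fence(memory_order_seq_cst);

    if (atomic_load_explicit(&slot->latch_owner_pid, memory_order_relaxed) != 0)
        return false;           /* would be "latch already owned" */

    atomic_store_explicit(&slot->on_free_list, false, memory_order_relaxed);
    atomic_store_explicit(&slot->latch_owner_pid, my_pid, memory_order_relaxed);
    return true;
}
```

In this model the cost is one fence per slot release, which matches the "one additional barrier in a process' lifecycle" framing of the question above.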