Re: backup manifests and contemporaneous buildfarm failures - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: backup manifests and contemporaneous buildfarm failures
Date:
Msg-id: 20200404030628.rlyvz2n76g7g26ea@alap3.anarazel.de
In response to: Re: backup manifests and contemporaneous buildfarm failures (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: backup manifests and contemporaneous buildfarm failures
           Re: backup manifests and contemporaneous buildfarm failures
List: pgsql-hackers
Hi,

Peter, Petr, CCed you because it's probably a bug somewhere around the
initial copy code for logical replication.

On 2020-04-03 20:48:09 -0400, Robert Haas wrote:
> 'serinus' is also failing. This is less obviously related:

Hm. Tests passed once since then.

> 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG: received
> replication command: CREATE_REPLICATION_SLOT
> "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
> 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR: replication
> slot "tap_sub_16390_sync_16384" already exists

That already seems suspicious. I checked the following (successful) run
and did not see that in that stage's logs.

Looking at the failing log, it fails because for some reason there are
multiple rounds (once due to a refresh, once due to an intentional
replication failure) of copying the relation. Each creates its own
temporary slot.

first time:
2020-04-04 02:08:57.276 CEST [5e87d019.506bd:1] LOG: connection received: host=[local]
2020-04-04 02:08:57.278 CEST [5e87d019.506bd:4] LOG: received replication command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2020-04-04 02:08:57.282 CEST [5e87d019.506bd:9] LOG: statement: COPY public.tab_rep TO STDOUT
2020-04-04 02:08:57.284 CEST [5e87d019.506bd:10] LOG: disconnection: session time: 0:00:00.007 user=bf database=postgres host=[local]

second time:
2020-04-04 02:08:57.288 CEST [5e87d019.506bf:1] LOG: connection received: host=[local]
2020-04-04 02:08:57.289 CEST [5e87d019.506bf:4] LOG: received replication command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2020-04-04 02:08:57.293 CEST [5e87d019.506bf:9] LOG: statement: COPY public.tab_rep TO STDOUT

third time:
2020-04-04 02:08:57.297 CEST [5e87d019.506c1:1] LOG: connection received: host=[local]
2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG: received replication command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR: replication slot "tap_sub_16390_sync_16384" already exists

Note that the connection from the second attempt has not yet
disconnected. Hence the error about the replication slot already
existing - it's a temporary replication slot that'd otherwise already
have been dropped.

Seems the logical rep code needs to do something about this race?
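To make the race concrete, here is a sketch of the kind of retry that
would sidestep it. This is purely illustrative, not the actual tablesync
code: create_sync_slot() is a hypothetical stand-in for whatever the sync
worker really calls, while pg_usleep(), elog() and WalReceiverConn are
real backend facilities.

/*
 * Hypothetical mitigation sketch: since a temporary slot is only dropped
 * once the creating session has fully disconnected, a new sync worker
 * reusing the same slot name can simply retry for a short while.
 */
static void
create_sync_slot_with_retry(WalReceiverConn *conn, const char *slotname)
{
    for (int attempt = 0; attempt < 50; attempt++)
    {
        if (create_sync_slot(conn, slotname))   /* hypothetical helper */
            return;                             /* created it, carry on */

        /*
         * "replication slot ... already exists" here most likely means
         * the prior session's temporary slot hasn't been dropped yet; it
         * disappears as soon as that disconnect completes, so wait a bit
         * and retry.
         */
        pg_usleep(100 * 1000L);                 /* 100ms between tries */
    }

    elog(ERROR, "could not create replication slot \"%s\"", slotname);
}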
About the assertion failure:

TRAP: FailedAssertion("owner->bufferarr.nitems == 0", File: "/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/../pgsql/src/backend/utils/resowner/resowner.c", Line: 718)
postgres: publisher: walsender bf [local] idle in transaction (aborted)(ExceptionalCondition+0x5c)[0x9a13ac]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(ResourceOwnerDelete+0x295)[0x9db8e5]
postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x54c61f]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(AbortOutOfAnyTransaction+0x122)[0x550e32]
postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x9b3bc9]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(shmem_exit+0x35)[0x80db45]
postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x80dc77]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(proc_exit+0x8)[0x80dd08]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(PostgresMain+0x59f)[0x83bd0f]
postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x7a0264]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(PostmasterMain+0xbfc)[0x7a2b8c]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(main+0x6fb)[0x49749b]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7fc52d83bbbb]
postgres: publisher: walsender bf [local] idle in transaction (aborted)(_start+0x2a)[0x49753a]
2020-04-04 02:08:57.302 CEST [5e87d018.5066b:4] LOG: server process (PID 329409) was terminated by signal 6: Aborted

Due to the log_line_prefix used, I was at first not entirely sure the
backend that crashed was the one with the ERROR. But it appears we print
the pid as hex for '%c' (why?), so it indeed is the one.

I, again, have to say that the amount of stuff that was done as part of

commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
Author: Peter Eisentraut <peter_e@gmx.net>
Date:   2017-03-23 08:36:36 -0400

    Logical replication support for initial data copy

is insane. Adding support for running sql over replication connections
and extending CREATE_REPLICATION_SLOT with new options (without even
mentioning that in the commit message!) as part of a commit described as
"Logical replication support for initial data copy" shouldn't happen.

It's not obvious to me what buffer pins could be held at this point. I
wonder if this could be somehow related to

commit 3cb646264e8ced9f25557ce271284da512d92043
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2018-07-18 12:15:16 -0400

    Use a ResourceOwner to track buffer pins in all cases.
    ...
    In passing, remove some other ad-hoc resource owner creations that had
    gotten cargo-culted into various other places.  As far as I can tell
    that was all unnecessary, and if it had been necessary it was
    incomplete, due to lacking any provision for clearing those resowners
    later.  (Also worth noting in this connection is that a process that
    hasn't called InitBufferPoolBackend has no business accessing buffers;
    so there's more to do than just add the resowner if we want to touch
    buffers in processes not covered by this patch.)

which removed the resowner previously used in walsender.

At the very least we should remove the SavedResourceOwnerDuringExport
dance that's still done in snapbuild.c. But it can't really be at fault
here, because the crashing backend won't have used that. So I'm a bit
confused here.
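For context, here is a minimal illustration of the ResourceOwner
lifecycle that assertion guards. The APIs below are the real ones from
resowner.c, but this is a generic demo, not the walsender code path that
crashed.

/*
 * ResourceOwnerDelete() insists that the owner no longer tracks any
 * resources, buffer pins included; all three release phases must have
 * run first.
 */
static void
resowner_lifecycle_demo(void)
{
    ResourceOwner owner = CreateResourceOwner(NULL, "demo owner");

    /* ... buffer pins etc. get registered with the owner while it is
     * CurrentResourceOwner ... */

    /* all three phases must run and release what they track ... */
    ResourceOwnerRelease(owner, RESOURCE_RELEASE_BEFORE_LOCKS, false, true);
    ResourceOwnerRelease(owner, RESOURCE_RELEASE_LOCKS, false, true);
    ResourceOwnerRelease(owner, RESOURCE_RELEASE_AFTER_LOCKS, false, true);

    /*
     * ... because deletion asserts nothing is tracked anymore.  A buffer
     * pin still registered at this point is exactly what trips
     * Assert(owner->bufferarr.nitems == 0) in ResourceOwnerDelete().
     */
    ResourceOwnerDelete(owner);
}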
The best approach is probably to try to reproduce this by adding an
artificial delay into backend shutdown (a sketch of what that could look
like is below).

> (I still really dislike the fact that we have this evil hack allowing
> one connection to mix and match those sets of commands...)

FWIW, I think the opposite. We should get rid of the difference as much
as possible.

Greetings,

Andres Freund
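For illustration, the shutdown delay mentioned above could be a
throwaway hack along these lines. The placement at the top of
proc_exit() and the am_walsender gate are assumptions chosen for this
sketch; pg_usleep() and am_walsender themselves are real backend
symbols.

/*
 * Reproduction aid only, not a fix: stall walsender backends during
 * shutdown so their temporary replication slots survive long enough for
 * the next sync worker to collide with them.
 */
void
proc_exit(int code)
{
    if (am_walsender)
        pg_usleep(500 * 1000L);    /* 500ms: widen the race window */

    /* ... existing proc_exit() body continues unchanged ... */
}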