Re: Build-farm - intermittent error in 031_column_list.pl - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Build-farm - intermittent error in 031_column_list.pl |
Date | |
Msg-id | CAA4eK1Lc=NDV1HrY2gNasFK90MtysnA575a+rd0p+POjXN+Spw@mail.gmail.com |
In response to | Re: Build-farm - intermittent error in 031_column_list.pl (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Responses |
Re: Build-farm - intermittent error in 031_column_list.pl
|
List | pgsql-hackers |
On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in
> > Hi hackers.
> >
> > FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
> > [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> >
> > The error happened during "subscriptionCheck" phase in the TAP test
> > t/031_column_list.pl
> > This test file was added by this [2] commit.
> > [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
>
> What is happening for all of them looks like that the name of a
> publication created by CREATE PUBLICATION without a failure report is
> missing for a walsender came later. It seems like CREATE PUBLICATION
> can silently fail to create a publication, or walsender somehow failed
> to find existing one.
>

Do you see anything in LOGS which indicates CREATE SUBSCRIPTION has failed?

>
> > ~~
> >
>
> 2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG: statement: CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
> 2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG: disconnection: session time: 0:00:00.002 user=bf database=postgres host=[local]
>
> "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
> the session has been disconnected. But the following request for the
> same publication fails due to the absence of the publication.
>
> 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', publication_names '"pub9"')
> 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR: publication "pub9" does not exist
>

This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9".
The probable theory is that ALTER SUBSCRIPTION leads to a restart of the apply worker (which we can see in the LOGS as well), and after the restart the apply worker uses the existing slot and replication origin corresponding to the subscription. Now, it is possible that the origin had not been updated before the restart, so the WAL start location points to a location prior to where PUBLICATION pub9 exists, which can lead to such an error. Once this error occurs, the apply worker will never be able to proceed and will always return the same error. Does this make sense?

Unless you or others see a different theory, this seems to be an existing problem in logical replication that is manifested by this test. If we just want to fix these test failures, we can create a new subscription instead of altering the existing subscription to point to the new publication.

Note: Added Tomas to get his views, as he committed this test.

-- 
With Regards,
Amit Kapila.
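
To make the theory above concrete, here is a toy model in Python. It is only a sketch and not PostgreSQL code: the names Change, decode and PublicationMissing are hypothetical, the LSNs are made-up integers, and it assumes the walsender looks up the publication as of the LSN of each change it decodes.

```python
from dataclasses import dataclass


class PublicationMissing(Exception):
    """Stands in for: ERROR: publication "..." does not exist."""


@dataclass(frozen=True)
class Change:
    lsn: int    # position of the change in the toy WAL
    table: str


def decode(changes, start_lsn, publication, created_at):
    """Replay changes from start_lsn, checking the publication at each one.

    created_at maps a publication name to the LSN at which it became
    visible. The lookup is done as of the change's own LSN, which is the
    assumption this sketch makes about the walsender's view of the catalog.
    """
    sent = []
    for change in sorted(changes, key=lambda c: c.lsn):
        if change.lsn < start_lsn:
            continue  # already confirmed by the subscriber
        if created_at.get(publication, float("inf")) > change.lsn:
            raise PublicationMissing(
                f'publication "{publication}" does not exist '
                f"(decoding LSN {change.lsn})"
            )
        sent.append(change)
    return sent


if __name__ == "__main__":
    wal = [Change(100, "test_part_d"), Change(200, "test_part_d"),
           Change(300, "test_part_d")]
    catalog = {"pub9": 250}  # pub9 becomes visible at toy LSN 250

    # Origin not advanced before the restart: decoding starts before pub9.
    try:
        decode(wal, start_lsn=150, publication="pub9", created_at=catalog)
    except PublicationMissing as err:
        print(err)

    # New subscription, new slot: decoding starts after pub9 was created.
    print(decode(wal, start_lsn=260, publication="pub9", created_at=catalog))
```

In this model a retry with the same start_lsn raises the same error every time, which matches the apply worker never being able to proceed. Starting from a position after the publication was created, as a newly created subscription would, avoids it.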