Thread: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17055
Logged by:          Sergey Bernikov
Email address:      sbernikov@gmail.com
PostgreSQL version: 13.3
Operating system:   Ubuntu 18.04.4
Description:

This error happens when the logical replication target is a partitioned table.

Steps to reproduce:

1. In the source DB: create a table and add it to a publication

create table test_replication (
    id int not null,
    value varchar(100),
    primary key (id)
);
create publication test_publication for table test_replication;

2. In the target DB: create a partitioned table and start replication

create table test_replication (
    id int not null,
    value varchar(100),
    primary key (id)
) partition by range (id);
create table test_replication_p_1 partition of test_replication
    for values from (0) to (10);
create table test_replication_p_2 partition of test_replication
    for values from (10) to (20);
create subscription test_subscription CONNECTION '...'
    publication test_publication;

3. In the source DB: insert data

insert into test_replication(id, value) values (1, 'a1');

4. In the target DB: delete the row

delete from test_replication where id = 1;

5. In the source DB: update the same row

update test_replication set value = 'a2' where id = 1;

Result: the logical replication worker on the target server crashes, with a different error message each time.

Fragment of log output:

2021-06-07 19:51:13.480 +06 [82800] DEBUG:  logical replication did not find row for update in replication target relation "test_replication_p_1"
2021-06-07 19:51:13.491 +06 [82800] ERROR:  could not access status of transaction 1226709024
2021-06-07 19:51:13.491 +06 [82800] DETAIL:  Could not open file "pg_subtrans/491E": No such file or directory.
2021-06-07 19:51:13.493 +06 [82797] DEBUG:  starting logical replication worker for subscription "test_subscription"
2021-06-07 19:51:13.493 +06 [80587] DEBUG:  registering background worker "logical replication worker for subscription 781420"
2021-06-07 19:51:13.493 +06 [80587] DEBUG:  starting background worker process "logical replication worker for subscription 781420"
2021-06-07 19:51:13.501 +06 [28356] LOG:  logical replication apply worker for subscription "test_subscription" has started
2021-06-07 19:51:13.501 +06 [28356] DEBUG:  connecting to publisher using connection string "....."
2021-06-07 19:51:13.508 +06 [80587] DEBUG:  unregistering background worker "logical replication worker for subscription 781420"
2021-06-07 19:51:13.508 +06 [80587] LOG:  background worker "logical replication worker" (PID 82800) exited with exit code 1
2021-06-07 19:51:13.563 +06 [28356] DEBUG:  logical replication did not find row for update in replication target relation "test_replication_p_1"
2021-06-07 19:52:31.575 +06 [80587] LOG:  background worker "logical replication worker" (PID 28356) was terminated by signal 11: Segmentation fault
2021-06-07 19:52:31.575 +06 [80587] LOG:  terminating any other active server processes
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> This error happens when the logical replication target is a partitioned table.

Thanks for the report!  This one reproduces on HEAD as well as v13.

Also, people should be aware that this one causes the logrep worker
to crash repeatedly, which will fill your disk with core files
quite quickly.

			regards, tom lane
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Amit Langote
Date:
On Fri, Jun 11, 2021 at 9:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
> > This error happens when the logical replication target is a partitioned table.
>
> Thanks for the report!  This one reproduces on HEAD as well as v13.

+1, confirmed here too.

> Also, people should be aware that this one causes the logrep worker
> to crash repeatedly, which will fill your disk with core files
> quite quickly.

Thanks for this tip.

The problem is that apply_handle_tuple_routing() marches on as if it
could go ahead and update a tuple that it was just told doesn't exist.

Attached fixes the problem for me.  Not sure if adding a test case
makes sense.

--
Amit Langote
EDB: http://www.enterprisedb.com
Attachment
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Michael Paquier
Date:
On Fri, Jun 11, 2021 at 02:56:58PM +0900, Amit Langote wrote:
> Attached fixes the problem for me.  Not sure if adding a test case
> makes sense.

It seems to me that it would make sense to have some coverage.  We've
caught issues in the past in this area, particularly with buildfarm
animals running CLOBBER_CACHE_ALWAYS.
--
Michael
Attachment
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Amit Langote
Date:
On Fri, Jun 11, 2021 at 3:26 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Jun 11, 2021 at 02:56:58PM +0900, Amit Langote wrote:
> > Attached fixes the problem for me.  Not sure if adding a test case
> > makes sense.
>
> It seems to me that it would make sense to have some coverage.  We've
> caught issues in the past in this area, particularly with buildfarm
> animals running CLOBBER_CACHE_ALWAYS.

I thought about it for a minute when writing the patch I posted, but
wasn't really sure what the test case would look like.  The only
external proof that the case worked correctly, AFAICS, is the DEBUG1
message that's written to the log.  How do we add a test case with
that?  I know you mentioned an idea before to check/grep the log for
such cases.  Is that how?

--
Amit Langote
EDB: http://www.enterprisedb.com
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> On Fri, Jun 11, 2021 at 3:26 PM Michael Paquier <michael@paquier.xyz> wrote:
>> It seems to me that it would make sense to have some coverage.  We've
>> caught issues in the past in this area, particularly with buildfarm
>> animals running CLOBBER_CACHE_ALWAYS.

> I thought about it for a minute when writing the patch I posted, but
> wasn't really sure what the test case would look like.  The only
> external proof that the case worked correctly, AFAICS, is the DEBUG1
> message that's written to the log.  How do we add a test case with
> that?  I know you mentioned an idea before to check/grep the log for
> such cases.  Is that how?

I see from the coverage report that *none* of the did-not-find-tuple
code paths in worker.c are exercised.  This does not seem OK.  I agree
that probably the only way to have a test case is to kick up the debug
level to DEBUG1 and grep the log to confirm that the message is there.
IIRC, we already have infrastructure for grepping the log, so this
shouldn't be that hard.  Will work on it today.

(Meanwhile, thanks for the diagnosis and patch!)

			regards, tom lane
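[For readers following along: the approach Tom describes looks roughly like the sketch below, in the style of the existing src/test/subscription TAP scripts (which import TestLib and Test::More). This is a minimal illustrative sketch, not the committed test; it assumes $node_publisher and $node_subscriber are PostgresNode instances already set up with the publication and subscription from the bug report, and it matches the pre-rewording DEBUG text from the original report (Tom rewords these messages later in the thread).]

    # Turn up the log level so the DEBUG1 message is emitted at all.
    $node_subscriber->append_conf('postgresql.conf',
        "log_min_messages = debug1");
    $node_subscriber->reload;

    # Remove the row on the subscriber only, then update it on the
    # publisher, so the apply worker receives an UPDATE for a row it
    # cannot find in the target partition.
    $node_subscriber->safe_psql('postgres',
        "DELETE FROM test_replication WHERE id = 1");
    $node_publisher->safe_psql('postgres',
        "UPDATE test_replication SET value = 'a2' WHERE id = 1");
    $node_publisher->wait_for_catchup('test_subscription');

    # With the fix, the worker logs a DEBUG message and moves on
    # instead of crashing; grep the subscriber's log for it.
    my $logfile = slurp_file($node_subscriber->logfile);
    ok( $logfile =~
          qr/logical replication did not find row for update in replication target relation/,
        'update of missing row in partition is skipped');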
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Tom Lane
Date:
I wrote:
> I see from the coverage report that *none* of the did-not-find-tuple
> code paths in worker.c are exercised.  This does not seem OK.  I agree
> that probably the only way to have a test case is to kick up the debug
> level to DEBUG1 and grep the log to confirm that the message is there.
> IIRC, we already have infrastructure for grepping the log, so this
> shouldn't be that hard.  Will work on it today.

Done now.  It was a highly worthwhile exercise, too, because I stumbled
over two *other* bugs in this code while I was at it.

BTW, I intentionally reworded the DEBUG messages to make sure we could
tell the partitioned case apart from the non-partitioned one.

			regards, tom lane
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Michael Paquier
Date:
On Fri, Jun 11, 2021 at 04:15:20PM -0400, Tom Lane wrote:
> Done now.  It was a highly worthwhile exercise, too, because I stumbled
> over two *other* bugs in this code while I was at it.
>
> BTW, I intentionally reworded the DEBUG messages to make sure we could
> tell the partitioned case apart from the non-partitioned one.

I have two comments about the TAP portion of ab55d74.

+$node_subscriber->append_conf('postgresql.conf', "log_min_messages = debug1");
+$node_subscriber->reload;

Wouldn't it be better to make sure that the change is effective before
moving on to the end of the test?

+my $logfile = slurp_file($node_subscriber->logfile());
+ok( $logfile =~
+      qr/logical replication did not find row to be updated in replication target relation "tab_full_pk"/,
+    'update target row is missing');

I think that it would be better here to grab the log location before
running the queries that will be matched with the logs, then pass down
the location to slurp_file().
--
Michael
Attachment
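[Michael's second suggestion amounts to something like the following sketch, assuming, as later messages in this thread indicate, that TestLib's slurp_file() accepts an optional starting offset:]

    # Remember where the log ends before running the interesting
    # queries, so the later pattern match cannot pick up output
    # written by earlier tests.
    my $log_location = -s $node_subscriber->logfile;

    # ... run the queries expected to trigger the DEBUG message ...

    # Read only what was appended after the recorded position.
    my $logfile = slurp_file($node_subscriber->logfile, $log_location);
    ok( $logfile =~
          qr/logical replication did not find row to be updated in replication target relation "tab_full_pk"/,
        'update target row is missing');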
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> +$node_subscriber->append_conf('postgresql.conf', "log_min_messages = debug1");
> +$node_subscriber->reload;
> Wouldn't it be better to make sure that the change is effective before
> moving on to the end of the test?

AFAICS, none of the other callers of sub reload are doing anything
special to wait for it to take effect.  Not sure why this one
would need more protection.  If we do need a fix for that, likely
sub reload itself had better do it.

> +my $logfile = slurp_file($node_subscriber->logfile());
> +ok( $logfile =~
> +      qr/logical replication did not find row to be updated in replication target relation "tab_full_pk"/,
> +    'update target row is missing');
> I think that it would be better here to grab the log location before
> running the queries that will be matched with the logs, then pass down
> the location to slurp_file().

This coding is also stolen verbatim from other test scripts.
What is your concern about it exactly?  It's not like the
tests have such varying amounts of log output that an unexpected
log rotation would be likely.  (Does the default TAP setup even
allow log rotation at all?  I doubt it.)

			regards, tom lane
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Michael Paquier
Date:
On Fri, Jun 11, 2021 at 08:31:30PM -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> AFAICS, none of the other callers of sub reload are doing anything
> special to wait for it to take effect.  Not sure why this one
> would need more protection.  If we do need a fix for that, likely
> sub reload itself had better do it.

Perhaps.

>> +my $logfile = slurp_file($node_subscriber->logfile());
>> +ok( $logfile =~
>> +      qr/logical replication did not find row to be updated in replication target relation "tab_full_pk"/,
>> +    'update target row is missing');
>> I think that it would be better here to grab the log location before
>> running the queries that will be matched with the logs, then pass down
>> the location to slurp_file().
>
> This coding is also stolen verbatim from other test scripts.
> What is your concern about it exactly?

My main concern is future changes in this test suite that could cause
more queries to generate logs that overlap with this sequence of
tests, falsifying what this test is checking.

> It's not like the
> tests have such varying amounts of log output that an unexpected
> log rotation would be likely.  (Does the default TAP setup even
> allow log rotation at all?  I doubt it.)

The default TAP configuration does not use log rotation, because it
complicates the test code: wait phases become necessary to make sure
that the logs have actually rotated.  That's also more costly than
just getting the log position, running the queries, and slurp()'ing
the file from the previous position.
--
Michael
Attachment
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Fri, Jun 11, 2021 at 08:31:30PM -0400, Tom Lane wrote:
>> AFAICS, none of the other callers of sub reload are doing anything
>> special to wait for it to take effect.  Not sure why this one
>> would need more protection.  If we do need a fix for that, likely
>> sub reload itself had better do it.

> Perhaps.

I looked around a bit, and I think that we don't need to sweat about
this.  Certainly we've not identified any buildfarm failures related
to reload not happening fast enough.  I think the sequence of events
is:

* pg_ctl will deliver the SIGHUP signal to the postmaster before
returning.

* Although surely the postmaster might not process that signal
instantly, we can expect that it will do so before it accepts another
connection request.  Since the tests cannot detect whether the reload
has happened without issuing a new command (which, in all these tests,
involves launching psql or pgbench or pg_basebackup or what-have-you,
which needs a new connection), the reload will appear to have happened
so far as the postmaster itself is concerned, and therefore also in
any new backend it spawns.

* This does seem to leave in question whether a pre-existing background
process such as a logrep worker will get the word fast enough.
However, I think the same principle applies there.  The postmaster
will have turned around and signaled the worker process while it
processes the SIGHUP for itself.  While that again doesn't ensure
instantaneous update of the worker's state, we can expect that it
will notice the SIGHUP before noticing any newly-arrived replication
data, which is Good Enough.  This argument does assume that we know
the worker is idle when we issue the reload ... but we already waited
for it to be idle.

Perhaps the buildfarm will prove me wrong, but I think we don't
have to do anything.  And if we did want to do something, what
would it be?  There is no feedback mechanism to let us observe
when the worker processed the signal.

>> This coding is also stolen verbatim from other test scripts.
>> What is your concern about it exactly?

> My main concern is future changes in this test suite that could cause
> more queries to generate logs that overlap with this sequence of
> tests, falsifying what this test is checking.

Oh, I see, you wonder if there could already be entries like these
in the file.  But since we only turned the log level up for the
duration of the specific test, it doesn't seem like a problem.
(I did think of this to the extent of making sure the second test
group in 013_partition.pl looks for different messages than the
first group does.)

			regards, tom lane
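[To make Tom's point concrete: a test that really wanted confirmation could poll a fresh backend for the new GUC value, as in the hypothetical snippet below (not part of the committed tests). Note, per his caveat, that this only proves the postmaster, and hence any newly spawned backend, has seen the reload; it says nothing about the already-running apply worker.]

    # Each poll opens a new connection, so success means the
    # postmaster has processed the SIGHUP from the reload.
    $node_subscriber->poll_query_until('postgres',
        "SELECT current_setting('log_min_messages') = 'debug1'")
      or die "reload of log_min_messages did not take effect";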
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Michael Paquier
Date:
On Fri, Jun 11, 2021 at 09:14:01PM -0400, Tom Lane wrote:
> Perhaps the buildfarm will prove me wrong, but I think we don't
> have to do anything.  And if we did want to do something, what
> would it be?  There is no feedback mechanism to let us observe
> when the worker processed the signal.

I am fine with doing nothing here unless the buildfarm says something.

> Oh, I see, you wonder if there could already be entries like these
> in the file.  But since we only turned the log level up for the
> duration of the specific test, it doesn't seem like a problem.

That's not a problem, now.

> (I did think of this to the extent of making sure the second test
> group in 013_partition.pl looks for different messages than the
> first group does.)

Still, the current style is not a pattern I think we should spread
around: a lot of copy-paste gets done when new tests are added, and
one could easily mix up queries with pattern checks.  The practice I'd
like to encourage is something like the attached, to make clear that
there is no risk of fetching incorrect logs.
--
Michael
Attachment
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Amit Langote
Date:
On Sat, Jun 12, 2021 at 5:15 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > I see from the coverage report that *none* of the did-not-find-tuple
> > code paths in worker.c are exercised.  This does not seem OK.  I agree
> > that probably the only way to have a test case is to kick up the debug
> > level to DEBUG1 and grep the log to confirm that the message is there.
> > IIRC, we already have infrastructure for grepping the log, so this
> > shouldn't be that hard.  Will work on it today.
>
> Done now.  It was a highly worthwhile exercise, too, because I stumbled
> over two *other* bugs in this code while I was at it.

Oh, thanks for taking care of that.

> BTW, I intentionally reworded the DEBUG messages to make sure we could
> tell the partitioned case apart from the non-partitioned one.

That looks useful.

--
Amit Langote
EDB: http://www.enterprisedb.com
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> Still, the current style is not a pattern I think we should spread
> around: a lot of copy-paste gets done when new tests are added, and
> one could easily mix up queries with pattern checks.  The practice I'd
> like to encourage is something like the attached, to make clear that
> there is no risk of fetching incorrect logs.

If you prefer.  However, I'd note that this patch as it stands
sort of breaks the argument I just made, as now you are interrogating
the postmaster's state after sending SIGHUP and before doing anything
that will wait for that to take effect.  I think you should move the
location-capture commands down by at least one safe_psql call.

			regards, tom lane
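[A sketch of the ordering Tom is suggesting, using the DELETE that the next message identifies as the natural anchor point; the exact surrounding test code is assumed:]

    $node_subscriber->append_conf('postgresql.conf',
        "log_min_messages = debug1");
    $node_subscriber->reload;

    # This command opens a new connection, so by the time it completes
    # the postmaster has processed the SIGHUP from the reload above.
    $node_subscriber->safe_psql('postgres', "DELETE FROM tab_full_pk");

    # Only now record the position that later slurp_file() calls will
    # read from; anything logged before this point is ignored.
    my $log_location = -s $node_subscriber->logfile;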
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Michael Paquier
Date:
On Fri, Jun 11, 2021 at 09:37:05PM -0400, Tom Lane wrote:
> If you prefer.  However, I'd note that this patch as it stands
> sort of breaks the argument I just made, as now you are interrogating
> the postmaster's state after sending SIGHUP and before doing anything
> that will wait for that to take effect.  I think you should move the
> location-capture commands down by at least one safe_psql call.

Ah, right.  That would be after "DELETE FROM tab_full_pk" in
001_rep_changes.pl and after "DELETE FROM tab1" in 013_partition.pl,
run on the subscribers, then.  Would you prefer doing the change?  If
you think it's not worth it, I won't fight hard on it, but while we're
dealing with it..
--
Michael
Attachment
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> Ah, right.  That would be after "DELETE FROM tab_full_pk" in
> 001_rep_changes.pl and after "DELETE FROM tab1" in 013_partition.pl,
> run on the subscribers, then.  Would you prefer doing the change?

No, go right ahead.

			regards, tom lane
Re: BUG #17055: Logical replication worker crashes when applying update of row that does not exist in target partition
From
Michael Paquier
Date:
On Fri, Jun 11, 2021 at 09:51:13PM -0400, Tom Lane wrote:
> No, go right ahead.

Okay, thanks.  Done this way for all three tests, with explanations of
the how and the why.
--
Michael