Thread: BUG #17072: Assert for clogGroupNext failed due to a race condition in TransactionGroupUpdateXidStatus()
BUG #17072: Assert for clogGroupNext failed due to a race condition in TransactionGroupUpdateXidStatus()
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17072
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 13.3
Operating system:   Ubuntu 20.04
Description:

When running multiple installchecks (100-200) in a loop I get the following
fail:

Core was generated by `postgres: postgres regress119 127.0.0.1(60410) in'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f1c7203d535 in __GI_abort () at abort.c:79
#2  0x00005645576b41a2 in ExceptionalCondition (
    conditionName=conditionName@entry=0x564557840fc8 "pg_atomic_read_u32(&MyProc->clogGroupNext) == INVALID_PGPROCNO",
    errorType=errorType@entry=0x56455771001d "FailedAssertion",
    fileName=fileName@entry=0x56455774676d "proc.c",
    lineNumber=lineNumber@entry=498) at assert.c:67
#3  0x000056455757b062 in InitProcess () at proc.c:498
#4  0x00005645575893df in PostgresMain (argc=1, argv=argv@entry=0x5645581d5820,
    dbname=<optimized out>, username=0x5645581d5168 "postgres") at postgres.c:4393
#5  0x00005645574e6f91 in BackendRun (port=0x5645581c73d0, port=0x5645581c73d0)
    at postmaster.c:5784
#6  BackendStartup (pool=pool@entry=0x0, port=port@entry=0x5645581c73d0)
    at postmaster.c:5440
#7  0x00005645574e769b in ServerLoop () at postmaster.c:2403
#8  0x00005645574e8e4b in PostmasterMain (argc=<optimized out>,
    argv=<optimized out>) at postmaster.c:1870
#9  0x00005645571db177 in main (argc=3, argv=0x56455819ad30) at main.c:211
(gdb) print MyProc->clogGroupNext
$1 = {value = 985}
(gdb) print &MyProc->clogGroupNext
$2 = (pg_atomic_uint32 *) 0x7f1c64806a0c

With some diagnostic output added I've got in the server log:

2021-06-24 16:44:16.018 MSK|postgres|regress184|60d48c2d.ea21|WARNING: TransactionGroupUpdateXidStatus: 819, 2147483647
...
2021-06-24 16:44:16.018 MSK|postgres|regress54|60d48c27.df96|WARNING: TransactionGroupUpdateXidStatus: 982, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress33|60d48c2f.efdc|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648dd8ec, 974
2021-06-24 16:44:16.018 MSK|postgres|regress44|60d48c2a.e5a7|WARNING: TransactionGroupUpdateXidStatus: 981, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress72|60d48c2d.ea64|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c64880824, 974
2021-06-24 16:44:16.021 MSK|postgres|regress32|60d48c2f.ee50|WARNING: TransactionGroupUpdateXidStatus: after while: 974
2021-06-24 16:44:16.018 MSK|postgres|regress144|60d48c2e.ebc5|WARNING: TransactionGroupUpdateXidStatus: 963, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress184|60d48c2d.ea21|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c64806a0c, 2147483647
2021-06-24 16:44:16.018 MSK|postgres|regress149|60d48c2a.e674|WARNING: TransactionGroupUpdateXidStatus: 799, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress194|60d48c2f.eed2|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648610e4, 944
2021-06-24 16:44:16.018 MSK|postgres|regress114|60d48c29.e3e1|WARNING: TransactionGroupUpdateXidStatus: 816, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress33|60d48c2f.efdc|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648dd8ec, 944
2021-06-24 16:44:16.021 MSK|postgres|regress186|60d48c2f.eea9|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648e95a4, 2147483647
2021-06-24 16:44:16.018 MSK|postgres|regress147|60d48c2f.efb7|ERROR: cannot change routine kind
2021-06-24 16:44:16.018 MSK|postgres|regress147|60d48c2f.efb7|DETAIL: "functest1" is a function.
2021-06-24 16:44:16.021 MSK|postgres|regress72|60d48c2d.ea64|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c64880824, 944
2021-06-24 16:44:16.021 MSK|postgres|regress50|60d48c2b.e710|ERROR: relation "does_not_exist" does not exist
2021-06-24 16:44:16.020 MSK|postgres|regress22|60d48c2f.efed|WARNING: TransactionGroupUpdateXidStatus: after while: 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress54|60d48c27.df96|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648dc3f4, 2147483647
2021-06-24 16:44:16.020 MSK|postgres|regress36|60d48c2f.eff6|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 INVALID_PGPROCNO: 0x7f1c64828b3c, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress34|60d48c2f.ee8b|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648e02dc, 2147483647
2021-06-24 16:44:16.020 MSK|postgres|regress11|60d48c2f.ef8e|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c647f6e6c, 792
2021-06-24 16:44:16.021 MSK|postgres|regress44|60d48c2a.e5a7|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648daefc, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress184|60d48c2d.ea21|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c64806a0c, 944
2021-06-24 16:44:16.021 MSK|postgres|regress194|60d48c2f.eed2|WARNING: TransactionGroupUpdateXidStatus: after while: 944
2021-06-24 16:44:16.020 MSK|postgres|regress155|60d48c2f.efe1|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648ad114, 792
2021-06-24 16:44:16.021 MSK|postgres|regress33|60d48c2f.efdc|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648dd8ec, 888
2021-06-24 16:44:16.021 MSK|postgres|regress186|60d48c2f.eea9|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648e95a4, 888
2021-06-24 16:44:16.021 MSK|postgres|regress72|60d48c2d.ea64|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c64880824, 888
2021-06-24 16:44:16.021 MSK|postgres|regress54|60d48c27.df96|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648dc3f4, 888
2021-06-24 16:44:16.021 MSK|postgres|regress144|60d48c2e.ebc5|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648c358c, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress186|60d48c2f.eea9|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648e95a4, 985
2021-06-24 16:44:16.021 MSK|postgres|regress34|60d48c2f.ee8b|WARNING: TransactionGroupUpdateXidStatus: after while: 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress149|60d48c2a.e674|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c647ec6ac, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress44|60d48c2a.e5a7|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648daefc, 985
2021-06-24 16:44:16.021 MSK|postgres|regress22|60d48c2f.efed|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 INVALID_PGPROCNO: 0x7f1c648610e4, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress36|60d48c2f.eff6|WARNING: TransactionGroupUpdateXidStatus: end
2021-06-24 16:44:16.021 MSK|postgres|regress184|60d48c2d.ea21|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c64806a0c, 985
2021-06-24 16:44:16.021 MSK|postgres|regress114|60d48c29.e3e1|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c64802b24, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress11|60d48c2f.ef8e|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c647f6e6c, 985
2021-06-24 16:44:16.021 MSK|postgres|regress33|60d48c2f.efdc|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648dd8ec, 985
2021-06-24 16:44:16.021 MSK|postgres|regress50|60d48c2b.e710|ERROR: syntax error at or near "arg" at character 22
2021-06-24 16:44:16.021 MSK|postgres|regress155|60d48c2f.efe1|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648ad114, 985
2021-06-24 16:44:16.021 MSK|postgres|regress72|60d48c2d.ea64|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c64880824, 985
2021-06-24 16:44:16.021 MSK|postgres|regress54|60d48c27.df96|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx: 0x7f1c648dc3f4, 985
2021-06-24 16:44:16.021 MSK|postgres|regress186|60d48c2f.eea9|WARNING: TransactionGroupUpdateXidStatus: after while: 985
2021-06-24 16:44:16.021 MSK|postgres|regress34|60d48c2f.ee8b|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 INVALID_PGPROCNO: 0x7f1c648e95a4, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress144|60d48c2e.ebc5|WARNING: TransactionGroupUpdateXidStatus: after while: 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress44|60d48c2a.e5a7|WARNING: TransactionGroupUpdateXidStatus: return false (963)
2021-06-24 16:44:16.021 MSK|postgres|regress22|60d48c2f.efed|WARNING: TransactionGroupUpdateXidStatus: pg_atomic_write_u32 INVALID_PGPROCNO: 0x7f1c648aa724, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress184|60d48c2d.ea21|WARNING: TransactionGroupUpdateXidStatus: return false (963)
2021-06-24 16:44:16.021 MSK|postgres|regress194|60d48c2f.eed2|WARNING: TransactionGroupUpdateXidStatus: return true (944)
2021-06-24 16:44:16.021 MSK|postgres|regress11|60d48c2f.ef8e|WARNING: TransactionGroupUpdateXidStatus: return false (963)
2021-06-24 16:44:16.021 MSK|postgres|regress33|60d48c2f.efdc|WARNING: TransactionGroupUpdateXidStatus: return false (963)

The offending process (the one that left a "valid" clogGroupNext) is
60d48c2d.ea21. It looks like it got from pg_atomic_compare_exchange_u32()
the nextidx value that had been written into clogGroupFirst by the process
60d48c2e.ebc5, and exited just after that.
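A note on the compare-exchange semantics this analysis relies on: when the
exchange fails, pg_atomic_compare_exchange_u32() overwrites its "expected"
argument (nextidx here) with the value currently stored, so a backend that
loses the race ends up holding a pgprocno that another backend has just
published. The following minimal, self-contained sketch shows the same
behaviour with C11 atomics; it is not PostgreSQL code, and the variable
names and the values 963/985 are purely illustrative, echoing the log above.

#include <stdatomic.h>
#include <stdio.h>

#define INVALID_PGPROCNO 2147483647u	/* the value seen in the log */

int
main(void)
{
	atomic_uint clogGroupFirst = INVALID_PGPROCNO;

	/* This backend reads the list head and remembers it in nextidx. */
	unsigned int nextidx = atomic_load(&clogGroupFirst);

	/* Meanwhile another backend (pgprocno 963, say) installs itself. */
	unsigned int expected = INVALID_PGPROCNO;
	atomic_compare_exchange_strong(&clogGroupFirst, &expected, 963u);

	/*
	 * Our own compare-exchange now fails and, like
	 * pg_atomic_compare_exchange_u32(), silently updates nextidx to the
	 * value the other backend stored.
	 */
	if (!atomic_compare_exchange_strong(&clogGroupFirst, &nextidx, 985u))
		printf("CAS failed, nextidx updated to %u\n", nextidx);	/* 963 */

	return 0;
}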
Re: BUG #17072: Assert for clogGroupNext failed due to a race condition in TransactionGroupUpdateXidStatus()
From
Amit Kapila
Date:
On Fri, Jun 25, 2021 at 12:20 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The offending process (the one that left a "valid" clogGroupNext) is
> 60d48c2d.ea21. It looks like it got from
> pg_atomic_compare_exchange_u32() the nextidx value that had been written
> into clogGroupFirst by the process 60d48c2e.ebc5, and exited just after
> that.
>

Your analysis seems to be in the right direction. Can you try setting
clogGroupNext to INVALID_PGPROCNO
(pg_atomic_write_u32(&proc->clogGroupNext, INVALID_PGPROCNO);) before we
return false in the first while (true) loop in
TransactionGroupUpdateXidStatus()?

I think this should be reproducible on all branches from HEAD to v11. Have
you tried any other branch? I'll also try to reproduce it.

--
With Regards,
Amit Kapila.
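For reference, a paraphrased sketch of the first while (true) loop in
TransactionGroupUpdateXidStatus() (src/backend/access/transam/clog.c) with
the suggested one-line change marked. The surrounding code is abridged and
reconstructed from memory of the function, so it may not match any given
branch exactly; only the pg_atomic_write_u32() reset on the bail-out path
is what the message above proposes.

	nextidx = pg_atomic_read_u32(&ProcGlobal->clogGroupFirst);

	while (true)
	{
		/*
		 * An existing group is updating a different clog page, so don't
		 * join it; this backend will update its XID status on its own.
		 */
		if (nextidx != INVALID_PGPROCNO &&
			ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
		{
			proc->clogGroupMember = false;

			/*
			 * Suggested fix: reset clogGroupNext before bailing out.  An
			 * earlier iteration may already have stored a valid pgprocno
			 * into it (see the write below), and without this reset the
			 * stale value survives until the Assert in InitProcess() sees
			 * it when the PGPROC slot is reused.
			 */
			pg_atomic_write_u32(&proc->clogGroupNext, INVALID_PGPROCNO);

			return false;
		}

		/* Link ourselves in front of the current head ... */
		pg_atomic_write_u32(&proc->clogGroupNext, nextidx);

		/* ... and try to publish ourselves as the new head. */
		if (pg_atomic_compare_exchange_u32(&ProcGlobal->clogGroupFirst,
										   &nextidx,
										   (uint32) proc->pgprocno))
			break;

		/* On failure, nextidx now holds the head some other backend set. */
	}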
Re: BUG #17072: Assert for clogGroupNext failed due to a race condition in TransactionGroupUpdateXidStatus()
From
Alexander Lakhin
Date:
24.06.2021 19:00, PG Bug reporting form wrote:
> When running multiple installchecks (100-200) in a loop I get the following
> fail:
If it could be of any use, please look at the attached draft script that
executes multiple installchecks on one instance. (Sorry for the dirty play
with the makefile and its neighbors — it's not intended for production use.)
I started to develop this script to discover a hardware or software issue on
a powerful server (192 cores, 4 TB RAM, ...). A single installcheck is a
needle in a haystack for such hardware, so I tried to run multiple instances
and multiple installchecks for each. Some tests are not ready to run in such
a configuration, but most of them work and bring some findings.

Best regards,
Alexander
Attachment
Re: BUG #17072: Assert for clogGroupNext failed due to a race condition in TransactionGroupUpdateXidStatus()
From
Alexander Lakhin
Date:
Hello Amit,

25.06.2021 12:55, Amit Kapila wrote:
> On Fri, Jun 25, 2021 at 12:20 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> The offending process (the one that left a "valid" clogGroupNext) is
>> 60d48c2d.ea21. It looks like it got from
>> pg_atomic_compare_exchange_u32() the nextidx value that had been written
>> into clogGroupFirst by the process 60d48c2e.ebc5, and exited just after
>> that.
>>
> Your analysis seems to be in the right direction. Can you try setting
> clogGroupNext to INVALID_PGPROCNO
> (pg_atomic_write_u32(&proc->clogGroupNext, INVALID_PGPROCNO);) before we
> return false in the first while (true) loop in
> TransactionGroupUpdateXidStatus()?
With this modification the assert is no longer triggered; all 100 iterations
pass fine (triple-checked).
> I think this should be reproducible on all branches from HEAD to v11. Have
> you tried any other branch? I'll also try to reproduce it.
I've reproduced it on REL_11_STABLE, REL_12_STABLE, REL_13_STABLE, and
master.

Best regards,
Alexander
Re: BUG #17072: Assert for clogGroupNext failed due to a race condition in TransactionGroupUpdateXidStatus()
From
Amit Kapila
Date:
On Fri, Jun 25, 2021 at 4:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Amit,
> 25.06.2021 12:55, Amit Kapila wrote:
> > On Fri, Jun 25, 2021 at 12:20 AM PG Bug reporting form
> > <noreply@postgresql.org> wrote:
> >> The offending process (the one that left a "valid" clogGroupNext) is
> >> 60d48c2d.ea21. It looks like it got from
> >> pg_atomic_compare_exchange_u32() the nextidx value that had been written
> >> into clogGroupFirst by the process 60d48c2e.ebc5, and exited just after
> >> that.
> >>
> > Your analysis seems to be in the right direction. Can you try setting
> > clogGroupNext to INVALID_PGPROCNO
> > (pg_atomic_write_u32(&proc->clogGroupNext, INVALID_PGPROCNO);) before we
> > return false in the first while (true) loop in
> > TransactionGroupUpdateXidStatus()?
> With this modification the assert is no longer triggered; all 100 iterations
> pass fine (triple-checked).
>

Okay, please find the patch for the same attached.

> > I think this should be reproducible on all branches from HEAD to v11. Have
> > you tried any other branch? I'll also try to reproduce it.
> I've reproduced it on REL_11_STABLE, REL_12_STABLE, REL_13_STABLE, and
> master.
>

Please see if you can verify whether the attached patch fixes it in all the
branches.

I have also reproduced it in a slightly different way by using a debugger,
basically by having three sessions trying to commit at the same time. After
the first session has become the first group member, allow the second session
to check whether it can become a member and stop it via the debugger just
before it does. Then allow the first session to complete its transaction and
allow the third session to become the group leader (or first group member).
After that, when the second session tries to become a member, it will notice
that the leader has changed and will again try to become a member under the
new leader. At that point I forced the second session, via the debugger, to
return false and perform the commit by itself. Next, disconnect and reconnect
in the second session, and we see the assertion failure you reported. The
attached patch fixes the assertion failure.

--
With Regards,
Amit Kapila.
Attachment
Re: BUG #17072: Assert for clogGroupNext failed due to a race condition in TransactionGroupUpdateXidStatus()
From
Alexander Lakhin
Date:
Hello Amit,

26.06.2021 12:57, Amit Kapila wrote:
>
>>> I think this should be reproducible on all branches from HEAD to v11. Have
>>> you tried any other branch? I'll also try to reproduce it.
>> I've reproduced it on REL_11_STABLE, REL_12_STABLE, REL_13_STABLE, and
>> master.
>>
> Please see if you can verify whether the attached patch fixes it in all the
> branches.
Yes, the attached patch fixes the issue in all the branches.
Thanks for the fix!

Best regards,
Alexander
Re: BUG #17072: Assert for clogGroupNext failed due to a race condition in TransactionGroupUpdateXidStatus()
From
Amit Kapila
Date:
On Mon, Jun 28, 2021 at 12:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Amit,
> 26.06.2021 12:57, Amit Kapila wrote:
> >
> >>> I think this should be reproducible on all branches from HEAD to v11. Have
> >>> you tried any other branch? I'll also try to reproduce it.
> >> I've reproduced it on REL_11_STABLE, REL_12_STABLE, REL_13_STABLE, and
> >> master.
> >>
> > Please see if you can verify whether the attached patch fixes it in all the
> > branches.
> Yes, the attached patch fixes the issue in all the branches.
> Thanks for the fix!
>

Pushed, thanks for the report and all the testing.

--
With Regards,
Amit Kapila.