Thread: BUG #16036: Segmentation fault while doing an update
The following bug has been logged on the website:

Bug reference:      16036
Logged by:          Антон Власов
Email address:      druidvav@gmail.com
PostgreSQL version: 12.0
Operating system:   PostgreSQL 12.0 (Debian 12.0-1.pgdg90+1) on x86_64
Description:

I am getting a segmentation fault like this:

2019-10-04 02:30:02 MSK [16919-6] LOG:  server process (PID 18415) was terminated by signal 11: Segmentation fault
2019-10-04 02:30:02 MSK [16919-7] DETAIL:  Failed process was running: update tracking.test_session set list_position = list_position + 1 where id = $1
2019-10-04 02:30:02 MSK [16919-8] LOG:  terminating any other active server processes

It happens when several queries are running simultaneously. I tried to get some info from the coredump, but this is all I have:

(gdb) bt
#0  GetMemoryChunkContext (pointer=0x0) at ./build/../src/include/utils/memutils.h:127
#1  pfree (pointer=0x0) at ./build/../src/backend/utils/mmgr/mcxt.c:1033
Backtrace stopped: Cannot access memory at address 0x7ffc990e7218
Got a normal backtrace:

#0  GetMemoryChunkContext (pointer=0x0) at ./build/../src/include/utils/memutils.h:127
#1  pfree (pointer=0x0) at ./build/../src/backend/utils/mmgr/mcxt.c:1033
#2  0x0000564c3f007aca in heap_freetuple (htup=<optimized out>) at ./build/../src/backend/access/common/heaptuple.c:1340
#3  0x0000564c3f1b5889 in tts_buffer_heap_clear (slot=0x564c40650298) at ./build/../src/backend/executor/execTuples.c:652
#4  0x0000564c3f1b5cbe in ExecClearTuple (slot=0x564c40650298) at ./build/../src/include/executor/tuptable.h:428
#5  ExecResetTupleTable (tupleTable=0x564c4068a6b0, shouldFree=false) at ./build/../src/backend/executor/execTuples.c:1165
#6  0x0000564c3f1abc40 in ExecEndPlan (estate=0x564c40689740, planstate=<optimized out>) at ./build/../src/backend/executor/execMain.c:1560
#7  standard_ExecutorEnd (queryDesc=0x564c40663d80) at ./build/../src/backend/executor/execMain.c:496
#8  0x0000564c3f30a6e8 in ProcessQuery (plan=<optimized out>, sourceText=0x564c40663c60 "update tracking.test_session set list_position = list_position + 1 where id = $1", params=0x564c40663cf0, queryEnv=0x0, dest=0x564c3f8b1c00 <donothingDR>, completionTag=0x7ffd3557c720 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:204
#9  0x0000564c3f30a8c3 in PortalRunMulti (portal=portal@entry=0x564c405cb3b0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x564c3f8b1c00 <donothingDR>, dest@entry=0x564c4052dc90, altdest=0x564c3f8b1c00 <donothingDR>, altdest@entry=0x564c4052dc90, completionTag=completionTag@entry=0x7ffd3557c720 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:1283
#10 0x0000564c3f30b4db in PortalRun (portal=portal@entry=0x564c405cb3b0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=<optimized out>, dest=dest@entry=0x564c4052dc90, altdest=altdest@entry=0x564c4052dc90, completionTag=0x7ffd3557c720 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:796
#11 0x0000564c3f308d05 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x564c4052d880 "") at ./build/../src/backend/tcop/postgres.c:2090
#12 PostgresMain (argc=<optimized out>, argv=argv@entry=0x564c40578e80, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4297
#13 0x0000564c3effcbe1 in BackendRun (port=0x564c40574ba0, port=0x564c40574ba0) at ./build/../src/backend/postmaster/postmaster.c:4431
#14 BackendStartup (port=0x564c40574ba0) at ./build/../src/backend/postmaster/postmaster.c:4122
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1704
#16 0x0000564c3f28fe81 in PostmasterMain (argc=5, argv=0x564c40524b70) at ./build/../src/backend/postmaster/postmaster.c:1377
#17 0x0000564c3effdea8 in main (argc=5, argv=0x564c40524b70) at ./build/../src/backend/main/main.c:228

> On Oct 4, 2019, at 3:29, PG Bug reporting form <noreply@postgresql.org> wrote:
> [...]
Hi,

On October 3, 2019 5:29:44 PM PDT, PG Bug reporting form <noreply@postgresql.org> wrote:
> I am having segmentation fault like this: [...]
> When there are several queries running simultaneously. Tried to get some
> info from coredump, but that's all i have: [...]

Hm, can you reproduce this? If so, any chance to package this up in a reproducible manner?

If not, could you check whether the problem vanishes if you set jit=off? If so, could you try getting a backtrace after setting jit_debugging_support=1?

Regards,

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hello,

Looks like disabling jit didn't affect the backtrace at all:

#0  GetMemoryChunkContext (pointer=0x0) at ./build/../src/include/utils/memutils.h:127
#1  pfree (pointer=0x0) at ./build/../src/backend/utils/mmgr/mcxt.c:1033
#2  0x0000555d7276aaca in heap_freetuple (htup=<optimized out>) at ./build/../src/backend/access/common/heaptuple.c:1340
#3  0x0000555d72918889 in tts_buffer_heap_clear (slot=0x555d73dbb118) at ./build/../src/backend/executor/execTuples.c:652
#4  0x0000555d72918cbe in ExecClearTuple (slot=0x555d73dbb118) at ./build/../src/include/executor/tuptable.h:428
#5  ExecResetTupleTable (tupleTable=0x555d73db5a10, shouldFree=false) at ./build/../src/backend/executor/execTuples.c:1165
#6  0x0000555d7290ec40 in ExecEndPlan (estate=0x555d73db4aa0, planstate=<optimized out>) at ./build/../src/backend/executor/execMain.c:1560
#7  standard_ExecutorEnd (queryDesc=0x555d73ce1f40) at ./build/../src/backend/executor/execMain.c:496
#8  0x0000555d72a6d6e8 in ProcessQuery (plan=<optimized out>, sourceText=0x555d73ce1e20 "update tracking.test_session set list_position = list_position + 1 where id = $1", params=0x555d73ce1eb0, queryEnv=0x0, dest=0x555d73014c00 <donothingDR>, completionTag=0x7ffdf0913200 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:204
#9  0x0000555d72a6d8c3 in PortalRunMulti (portal=portal@entry=0x555d73c4d3f0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x555d73014c00 <donothingDR>, dest@entry=0x555d73bafc90, altdest=0x555d73014c00 <donothingDR>, altdest@entry=0x555d73bafc90, completionTag=completionTag@entry=0x7ffdf0913200 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:1283
#10 0x0000555d72a6e4db in PortalRun (portal=portal@entry=0x555d73c4d3f0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=<optimized out>, dest=dest@entry=0x555d73bafc90, altdest=altdest@entry=0x555d73bafc90, completionTag=0x7ffdf0913200 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:796
#11 0x0000555d72a6bd05 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x555d73baf880 "") at ./build/../src/backend/tcop/postgres.c:2090
#12 PostgresMain (argc=<optimized out>, argv=argv@entry=0x555d73bfb2e0, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4297
#13 0x0000555d7275fbe1 in BackendRun (port=0x555d73bf6be0, port=0x555d73bf6be0) at ./build/../src/backend/postmaster/postmaster.c:4431
#14 BackendStartup (port=0x555d73bf6be0) at ./build/../src/backend/postmaster/postmaster.c:4122
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1704
#16 0x0000555d729f2e81 in PostmasterMain (argc=5, argv=0x555d73ba6b70) at ./build/../src/backend/postmaster/postmaster.c:1377
#17 0x0000555d72760ea8 in main (argc=5, argv=0x555d73ba6b70) at ./build/../src/backend/main/main.c:228

> On Oct 4, 2019, at 4:09, Andres Freund <andres@anarazel.de> wrote:
> [...]
Hi,

On October 3, 2019 6:14:33 PM PDT, "Антон Власов" <druidvav@gmail.com> wrote:
> Looks like disabling jit didn't affect backtrace at all:
> [...]

That's good - I was only looking at that because of the truncated backtrace. Did you do anything to make that work better this time?

Are there any details that you can provide? Schema? Any extensions in use?

Does the problem happen always, or just under concurrency?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hello,

This test script always causes a crash:

postgres@gp-app01:~$ cat test.sh
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &

create table tracking.test_result
(
    id serial not null
        constraint test_result_pkey
            primary key,
    session_id integer
        constraint test_result_session_id_fkey
            references tracking.test_session,
    reported_at timestamp(0) default now(),
    courier varchar(25) not null,
    tracking_number varchar(60) not null,
    status varchar(25) not null,
    fail_combo integer default 0 not null,
    response_test text,
    response_real text,
    response_time real,
    code smallint,
    message text,
    retry integer
);

create index test_result_session_id_idx
    on tracking.test_result (session_id);

I didn't finish isolating the bug from my database, so I'm not sure whether it will reproduce for you. Single queries work as usual; only concurrent ones cause the crash.

> On Oct 4, 2019, at 4:20, Andres Freund <andres@anarazel.de> wrote:
> [...]
Tried with only two concurrent queries and the crash is still present, so it's not about the total query count.

> On Oct 4, 2019, at 4:24, Антон Власов <druidvav@gmail.com> wrote:
> [...]
Tried with the default Debian configuration; the problem persists. I got a full backtrace after reducing shared_buffers from 18G to 1G. I tried to simulate the problem in an artificial environment but cannot repeat it. On real data it is 100% repeatable.
> On Oct 4, 2019, at 4:20, Andres Freund <andres@anarazel.de> wrote:
> [...]
At last, I've found the source of the problem.
I have a trigger:
create function test_session_bu() returns trigger
language plpgsql
as
$$
begin
if new.session_end is not null and old.session_end is null then
new.is_finished := true;
new.success_count := (select count(*) from tracking.test_result where session_id = new.id and status = 'success');
new.failing_count := (select count(*) from tracking.test_result where session_id = new.id and status = 'failing');
new.fail_count := (select count(*) from tracking.test_result where session_id = new.id and status = 'fail');
end if;
return new;
end;
$$;
create trigger test_session_bu
before update
on tracking.test_session
for each row
execute procedure tracking.test_session_bu();
Everything works fine without it; when I put the trigger back, the problem returns too. As you can see, session_end is not affected by these queries, so I tried simplifying the trigger to:
create or replace function test_session_bu() returns trigger
language plpgsql
as
$$
begin
return new;
end;
$$;
And the problem was still in place.
> On Oct 4, 2019, at 4:20, Andres Freund <andres@anarazel.de> wrote:
> [...]
Hello! Reproducible schema:
create table test_table (
id serial, field int default 0 not null
);
create or replace function test_table_bu() returns trigger
language plpgsql
as
$$
begin
return new;
end;
$$;
create trigger test_table_bu
before update
on test_table
for each row
execute procedure test_table_bu();
insert into test_table (field) values (0);
test.sh:
psql -d gdeposylka -c "update test_table set field = field + 1 where id = 1;" &
psql -d gdeposylka -c "update test_table set field = field + 1 where id = 1;" &
> On Oct 4, 2019, at 4:20, Andres Freund <andres@anarazel.de> wrote:
> [...]
Hi,

On 2019-10-04 05:11:16 +0300, Антон Власов wrote:
> Reproducible schema:
> [...]

Thanks, that's very helpful! I've started to narrow down the problem:

bool
ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
                     ResultRelInfo *relinfo,
                     ItemPointer tupleid,
                     HeapTuple fdw_trigtuple,
                     TupleTableSlot *newslot)
{
...
	/*
	 * In READ COMMITTED isolation level it's possible that target tuple
	 * was changed due to concurrent update. In that case we have a raw
	 * subplan output tuple in newSlot, and need to run it through the
	 * junk filter to produce an insertable tuple.
	 *
	 * Caution: more than likely, the passed-in slot is the same as the
	 * junkfilter's output slot, so we are clobbering the original value
	 * of slottuple by doing the filtering. This is OK since neither we
	 * nor our caller have any more interest in the prior contents of that
	 * slot.
	 */
	if (newSlot != NULL)
	{
		TupleTableSlot *slot = ExecFilterJunk(relinfo->ri_junkFilter, newSlot);

		ExecCopySlot(newslot, slot);
	}

The problem is that 'newslot' and 'slot' are the same. As copying first clears the target slot, that leaves the tuple slot in a bogus state: BufferHeapTupleTableSlot->base.tuple = NULL, but with TTS_FLAG_EMPTY not set and TTS_FLAG_SHOULDFREE set.

The only reason that doesn't run into the assert in ExecCopySlotHeapTuple() (that the slot had better not be empty) is that we've already cleared that flag on the destination slot (which is also the source slot) at that point:

	dstslot->tts_flags &= ~TTS_FLAG_EMPTY;

	oldContext = MemoryContextSwitchTo(dstslot->tts_mcxt);
	bdstslot->base.tuple = ExecCopySlotHeapTuple(srcslot);
	MemoryContextSwitchTo(oldContext);

So I think what we need to do is:

1) Only call ExecCopySlot() if slot != newslot - this fixes the crash.
2) Add an assert to ExecCopySlot() ensuring source and target slot are distinct.
3) Figure out why our tests didn't catch this; afaict this should be a tested scenario.
4) Fix the confusing variable naming around newSlot/newslot in ExecBRUpdateTriggers.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> So I think what we need to do is:
> 1) only call ExecCopySlot() if slot != newslot - this fixes the crash
> 2) Add an assert to ExecCopySlot() ensuring source and target slot are
>    distinct
> [...]

Maybe instead of 1+2, change ExecCopySlot like this:

-	dstslot->tts_ops->copyslot(dstslot, srcslot);
+	if (dstslot != srcslot)
+		dstslot->tts_ops->copyslot(dstslot, srcslot);

That would fix other similar instances, not just this one caller.

			regards, tom lane
Hi,

On 2019-10-03 23:26:01 -0400, Tom Lane wrote:
> Maybe instead of 1+2, change ExecCopySlot like this:
>
> -	dstslot->tts_ops->copyslot(dstslot, srcslot);
> +	if (dstslot != srcslot)
> +		dstslot->tts_ops->copyslot(dstslot, srcslot);
>
> That would fix other similar instances, not just this one caller.

Yea, I was wondering about that too - but asking for a slot to be copied into itself seems to indicate a logic bug more often than not, the caller expecting the slots to be independent afterwards. So I'm a bit hesitant to do that.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> Yea, I was wondering about that too - but asking for a slot to be copied
> into itself seems to indicate a logic bug more often than not? Expecting
> the slots to be independent, afterwards. So I'm a bit hesitant doing so.

Oh, good point. The Assert sounds better then.

			regards, tom lane
Hi,

On 2019-10-03 23:26:01 -0400, Tom Lane wrote:
> Maybe instead of 1+2, change ExecCopySlot like this:
> [...]

Here's 1, 2 & 4; working on fixing 3). While we now have OK-ish coverage for nodeModifyTable.c- and nodeLockRows.c-triggered EPQ, there's not much for the trigger case.

Greetings,

Andres Freund
Attachment
Hi,

On 2019-10-03 21:30:21 -0700, Andres Freund wrote:
> Here's 1, 2 & 4, working on fixing 3). While we now have OK-ish
> coverage for nodeModifyTable.c and nodeLockRows.c triggered EPQ, there's
> not much for the trigger case.

Here's both the fix and tests. I think the test adds a good bit of coverage. What's lacking is primarily testing of foreign-table related paths, but that seems separate enough, and would need to be done in a separate contrib/ test anyway.

I think there's a good case to be made for backpatching the tests further than 12, but I'm not sure about it. They do pass (with one error message about a failure to delete changed to a failure to update; we didn't use to be able to discern) back to 9.6, requiring:

commit 38f8bdcac4982215beb9f65a19debecaf22fd470
Author: Robert Haas <rhaas@postgresql.org>
Date:   2016-02-11 08:36:30 -0500

    Modify the isolation tester so that multiple sessions can wait.

Greetings,

Andres Freund
Attachment
Hi,

Thanks again for the bug report! I've pushed the fix and tests. They'll be included in 12.1 (which I assume will happen relatively soon), or you can compile a slightly newer version yourself.

Regards,

Andres

On 2019-10-04 12:28:48 -0700, Andres Freund wrote:
> I think there's a good case to be made to backpatch the tests further
> than 12, but I'm not sure about it? They do pass (with one error message
> about a failure to delete changed to a failure to update, we didn't use
> to be able to discern) back to 9.6, requiring [...]

I did push the tests back to 9.6.

Greetings,

Andres Freund
Hi,

On 2019-10-04 14:43:00 -0700, Andres Freund wrote:
> I did push the tests back to 9.6.

There's a few ordering violations in the tests, e.g.:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-10-04%2021%3A51%3A13

 key-a           val-a-s1-ups1-ups1
 step s1_c: COMMIT;
-s3: NOTICE:  upd: text key-a = text key-a: t
-s3: NOTICE:  upk: text val-a-s1-ups1-ups1 <> text mismatch: t
-s3: NOTICE:  trigger: name rep_b_u; when: BEFORE; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-ups3)
-s3: NOTICE:  trigger: name rep_a_u; when: AFTER; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-ups3)
-step s3_upd_a_data: <... completed>
+s2: NOTICE:  upd: text key-a = text key-a: t
+s2: NOTICE:  upk: text val-a-s1-ups1-ups1 <> text mismatch: t
+s2: NOTICE:  trigger: name rep_b_u; when: BEFORE; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-upserts2)
+s2: NOTICE:  trigger: name rep_a_u; when: AFTER; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-upserts2)
+step s2_upsert_a_data: <... completed>
 key             data

I was under the assumption that it'd be deterministic who gets to continue with a speculative insertion, but that ain't so.

Peter, do you see an easy way around that? Do you consider that a problem? ISTM we ought to make this fair, but doing so would require somewhat bigger changes than we'd want to make in the backbranches?

Unless somebody comes up with an idea, I'll disable (or rewrite) the "three way" tests involving two waiters for one insertion.

Regards,

Andres
On Fri, Oct 4, 2019 at 3:24 PM Andres Freund <andres@anarazel.de> wrote:
> I was under the assumption that it'd be deterministic who gets to
> continue with a speculative insertion, but that ain't so.

There is no heap_lock_tuple() style lock arbitration built in to
speculative insertions:

https://wiki.postgresql.org/wiki/UPSERT#Theoretical_lock_starvation_hazards

I do not consider this to be a problem.

> Peter, do you see an easy way around that? Do you consider that a
> problem? ISTM we ought to make this fair, but that doing so would
> require a bit bigger changes that we'd want to make in the backbranches?

It would also require formally defining "fair", which doesn't seem
straightforward to me. The MVCC snapshot is barely involved at all.

--
Peter Geoghegan
Hi,

On 2019-10-04 15:24:37 -0700, Andres Freund wrote:
> On 2019-10-04 14:43:00 -0700, Andres Freund wrote:
> > > I think there's a good case to be made to backpatch the tests further
> > > than 12, but I'm not sure about it? They do pass (with one error message
> > > about a failure to delete changed to a failure to update, we didn't use
> > > to be able to discern) back to 9.6, requiring
> >
> > I did push the tests back to 9.6.
>
> There's a few ordering violations in the tests, e.g.:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-10-04%2021%3A51%3A13

I've now disabled that portion of the test.

It looks to me like this might be a subtle upsert bug. What happens is
that in the isolationtester sequence (all operating on one row, and all
already in a transaction):

  "s1_upd_a_data" "s3_upd_a_data" "s2_upsert_a_data" "s1_upd_a_data" "s1_c" "s3_del_a" "s3_c" "s2_c"

sometimes the s2_upsert_a_data gets to update the row before the
s3_upd_a_data. That does *not* happen with a plain update, even though
there's at that point always a target row to operate on (i.e. this
happens before s3_del_a is reached).

I think the cause for that might be that the
check_exclusion_or_unique_constraint() check done for upsert (via
ExecCheckIndexConstraints() in ExecInsert()) does a XactLockTableWait()
without, like heapam.c would, ordering via a tuple lock.

Since it's clearly not 12 specific, and not really related to triggers /
EPQ, I think disabling it for now is a sensible choice.

>  key-a          val-a-s1-ups1-ups1
>  step s1_c: COMMIT;
> -s3: NOTICE: upd: text key-a = text key-a: t
> -s3: NOTICE: upk: text val-a-s1-ups1-ups1 <> text mismatch: t
> -s3: NOTICE: trigger: name rep_b_u; when: BEFORE; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-ups3)
> -s3: NOTICE: trigger: name rep_a_u; when: AFTER; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-ups3)
> -step s3_upd_a_data: <... completed>
> +s2: NOTICE: upd: text key-a = text key-a: t
> +s2: NOTICE: upk: text val-a-s1-ups1-ups1 <> text mismatch: t
> +s2: NOTICE: trigger: name rep_b_u; when: BEFORE; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-upserts2)
> +s2: NOTICE: trigger: name rep_a_u; when: AFTER; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-upserts2)
> +step s2_upsert_a_data: <... completed>
>  key            data
>
> I was under the assumption that it'd be deterministic who gets to
> continue with a speculative insertion, but that ain't so.

This was a preliminary theory, a wrong one: speculative insertions
aren't actually involved, see above...

Greetings,

Andres Freund
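[Editor's note: to make the failure mode above concrete, here is a minimal simulation in plain Python threads. It is not PostgreSQL code; the session names are merely borrowed from the test spec. It models two sessions blocked in a XactLockTableWait()-style wait with no tuple lock queued in front of it: when the blocking transaction commits, both waiters wake at once and nothing defines who reaches the row first.]

```python
import threading

# Illustrative simulation only (not PostgreSQL code): two sessions both
# block on the same in-progress transaction, analogous to a bare
# XactLockTableWait() with no tuple-lock arbitration in front of it.

def simulate_commit_release():
    xact_done = threading.Event()   # released when the updater commits
    completed = []
    mutex = threading.Lock()

    def session(name):
        xact_done.wait()            # both sessions wake "at once" here
        with mutex:
            completed.append(name)  # arrival order is scheduler-dependent

    waiters = [threading.Thread(target=session, args=(name,))
               for name in ("s3_upd_a_data", "s2_upsert_a_data")]
    for t in waiters:
        t.start()
    xact_done.set()                 # s1_c: COMMIT releases both waiters
    for t in waiters:
        t.join()
    return completed
```

Both sessions always complete, but the order in which they get at the row varies from run to run, which is exactly the kind of output-ordering instability the buildfarm diff shows.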
Andres Freund <andres@anarazel.de> writes:
> I've now disabled that portion of the test.

There's still a large fraction of the buildfarm that's unhappy :-(

			regards, tom lane
Hi,

On 2019-10-05 11:30:39 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I've now disabled that portion of the test.
>
> There's still a large fraction of the buildfarm that's unhappy :-(

I pushed a second patch disabling two more permutations, I've not seen
any related failures since.

It's frustrating that we currently don't seem to have deterministic
behaviour with two sessions waiting for a tuple lock. Doesn't look like
that's new though. It seems to me that most of the XactLockTableWait()
users ought to follow the pattern outlined at the top of
README.tuplock, but don't.

Greetings,

Andres Freund
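[Editor's note: for contrast with the nondeterministic case, here is a hedged sketch of why the README.tuplock-style arbitration gives a deterministic outcome. A FIFO queue stands in for the heavyweight tuple lock that waiters acquire before waiting on the updater's transaction; waiters are then granted the row strictly in arrival order. This is a Python simulation with hypothetical names, not PostgreSQL code.]

```python
import threading

# Illustrative simulation only (not PostgreSQL code): the pattern at the
# top of README.tuplock takes a lock on the *tuple* before waiting on the
# updating transaction, so concurrent waiters queue in arrival order.

class TupleLockQueue:
    """FIFO arbitration: waiters acquire the tuple in arrival order."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._waiters = []              # FIFO: (name, event) per blocked session

    def wait_for_tuple(self, name, granted):
        ev = threading.Event()
        with self._mutex:
            self._waiters.append((name, ev))   # queue position = arrival order
        ev.wait()                              # analogous to XactLockTableWait()
        granted.append(name)                   # this session now owns the row

    def release_next(self):
        """Hand the tuple to the longest-waiting session, if any."""
        with self._mutex:
            if self._waiters:
                _, ev = self._waiters.pop(0)
                ev.set()

def demo():
    q = TupleLockQueue()
    granted = []
    threads = []
    for name in ("s3_upd_a_data", "s2_upsert_a_data"):
        t = threading.Thread(target=q.wait_for_tuple, args=(name, granted))
        t.start()
        threads.append(t)
        while len(q._waiters) < len(threads):  # ensure deterministic arrival
            pass
    for t in threads:                          # commit grants the row FIFO
        q.release_next()
        t.join()
    return granted
```

With the queue in place, s3 (the first arrival) always gets the row before s2, matching the order the disabled test permutations expected.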