Thread: BUG #16036: Segmentation fault while doing an update
The following bug has been logged on the website:

Bug reference:      16036
Logged by:          Антон Власов
Email address:      druidvav@gmail.com
PostgreSQL version: 12.0
Operating system:   PostgreSQL 12.0 (Debian 12.0-1.pgdg90+1) on x86_64
Description:

I am getting a segmentation fault like this:

2019-10-04 02:30:02 MSK [16919-6] LOG:  server process (PID 18415) was terminated by signal 11: Segmentation fault
2019-10-04 02:30:02 MSK [16919-7] DETAIL:  Failed process was running: update tracking.test_session set list_position = list_position + 1 where id = $1
2019-10-04 02:30:02 MSK [16919-8] LOG:  terminating any other active server processes

It happens when several queries are running simultaneously. I tried to get some info from the coredump, but this is all I have:

(gdb) bt
#0  GetMemoryChunkContext (pointer=0x0) at ./build/../src/include/utils/memutils.h:127
#1  pfree (pointer=0x0) at ./build/../src/backend/utils/mmgr/mcxt.c:1033
Backtrace stopped: Cannot access memory at address 0x7ffc990e7218
Got a normal backtrace:

#0  GetMemoryChunkContext (pointer=0x0) at ./build/../src/include/utils/memutils.h:127
#1  pfree (pointer=0x0) at ./build/../src/backend/utils/mmgr/mcxt.c:1033
#2  0x0000564c3f007aca in heap_freetuple (htup=<optimized out>) at ./build/../src/backend/access/common/heaptuple.c:1340
#3  0x0000564c3f1b5889 in tts_buffer_heap_clear (slot=0x564c40650298) at ./build/../src/backend/executor/execTuples.c:652
#4  0x0000564c3f1b5cbe in ExecClearTuple (slot=0x564c40650298) at ./build/../src/include/executor/tuptable.h:428
#5  ExecResetTupleTable (tupleTable=0x564c4068a6b0, shouldFree=false) at ./build/../src/backend/executor/execTuples.c:1165
#6  0x0000564c3f1abc40 in ExecEndPlan (estate=0x564c40689740, planstate=<optimized out>) at ./build/../src/backend/executor/execMain.c:1560
#7  standard_ExecutorEnd (queryDesc=0x564c40663d80) at ./build/../src/backend/executor/execMain.c:496
#8  0x0000564c3f30a6e8 in ProcessQuery (plan=<optimized out>, sourceText=0x564c40663c60 "update tracking.test_session set list_position = list_position + 1 where id = $1", params=0x564c40663cf0, queryEnv=0x0, dest=0x564c3f8b1c00 <donothingDR>, completionTag=0x7ffd3557c720 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:204
#9  0x0000564c3f30a8c3 in PortalRunMulti (portal=portal@entry=0x564c405cb3b0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x564c3f8b1c00 <donothingDR>, dest@entry=0x564c4052dc90, altdest=0x564c3f8b1c00 <donothingDR>, altdest@entry=0x564c4052dc90, completionTag=completionTag@entry=0x7ffd3557c720 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:1283
#10 0x0000564c3f30b4db in PortalRun (portal=portal@entry=0x564c405cb3b0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=<optimized out>, dest=dest@entry=0x564c4052dc90, altdest=altdest@entry=0x564c4052dc90, completionTag=0x7ffd3557c720 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:796
#11 0x0000564c3f308d05 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x564c4052d880 "") at ./build/../src/backend/tcop/postgres.c:2090
#12 PostgresMain (argc=<optimized out>, argv=argv@entry=0x564c40578e80, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4297
#13 0x0000564c3effcbe1 in BackendRun (port=0x564c40574ba0, port=0x564c40574ba0) at ./build/../src/backend/postmaster/postmaster.c:4431
#14 BackendStartup (port=0x564c40574ba0) at ./build/../src/backend/postmaster/postmaster.c:4122
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1704
#16 0x0000564c3f28fe81 in PostmasterMain (argc=5, argv=0x564c40524b70) at ./build/../src/backend/postmaster/postmaster.c:1377
#17 0x0000564c3effdea8 in main (argc=5, argv=0x564c40524b70) at ./build/../src/backend/main/main.c:228

> On Oct 4, 2019, at 3:29, PG Bug reporting form <noreply@postgresql.org> wrote:
> [...]
Hi,

On October 3, 2019 5:29:44 PM PDT, PG Bug reporting form <noreply@postgresql.org> wrote:
> I am having segmentation fault like this: [...]
> When there are several queries running simultaneously. Tried to get some
> info from coredump, but that's all i have: [...]

Hm, can you reproduce this? If so, any chance to package this up in a reproducible manner?

If not, could you check whether the problem vanishes if you set jit=off? If so, could you try getting a backtrace after setting jit_debugging_support=1?

Regards,

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hello,

Looks like disabling jit didn't affect the backtrace at all:

#0  GetMemoryChunkContext (pointer=0x0) at ./build/../src/include/utils/memutils.h:127
#1  pfree (pointer=0x0) at ./build/../src/backend/utils/mmgr/mcxt.c:1033
#2  0x0000555d7276aaca in heap_freetuple (htup=<optimized out>) at ./build/../src/backend/access/common/heaptuple.c:1340
#3  0x0000555d72918889 in tts_buffer_heap_clear (slot=0x555d73dbb118) at ./build/../src/backend/executor/execTuples.c:652
#4  0x0000555d72918cbe in ExecClearTuple (slot=0x555d73dbb118) at ./build/../src/include/executor/tuptable.h:428
#5  ExecResetTupleTable (tupleTable=0x555d73db5a10, shouldFree=false) at ./build/../src/backend/executor/execTuples.c:1165
#6  0x0000555d7290ec40 in ExecEndPlan (estate=0x555d73db4aa0, planstate=<optimized out>) at ./build/../src/backend/executor/execMain.c:1560
#7  standard_ExecutorEnd (queryDesc=0x555d73ce1f40) at ./build/../src/backend/executor/execMain.c:496
#8  0x0000555d72a6d6e8 in ProcessQuery (plan=<optimized out>, sourceText=0x555d73ce1e20 "update tracking.test_session set list_position = list_position + 1 where id = $1", params=0x555d73ce1eb0, queryEnv=0x0, dest=0x555d73014c00 <donothingDR>, completionTag=0x7ffdf0913200 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:204
#9  0x0000555d72a6d8c3 in PortalRunMulti (portal=portal@entry=0x555d73c4d3f0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x555d73014c00 <donothingDR>, dest@entry=0x555d73bafc90, altdest=0x555d73014c00 <donothingDR>, altdest@entry=0x555d73bafc90, completionTag=completionTag@entry=0x7ffdf0913200 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:1283
#10 0x0000555d72a6e4db in PortalRun (portal=portal@entry=0x555d73c4d3f0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=<optimized out>, dest=dest@entry=0x555d73bafc90, altdest=altdest@entry=0x555d73bafc90, completionTag=0x7ffdf0913200 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:796
#11 0x0000555d72a6bd05 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x555d73baf880 "") at ./build/../src/backend/tcop/postgres.c:2090
#12 PostgresMain (argc=<optimized out>, argv=argv@entry=0x555d73bfb2e0, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4297
#13 0x0000555d7275fbe1 in BackendRun (port=0x555d73bf6be0, port=0x555d73bf6be0) at ./build/../src/backend/postmaster/postmaster.c:4431
#14 BackendStartup (port=0x555d73bf6be0) at ./build/../src/backend/postmaster/postmaster.c:4122
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1704
#16 0x0000555d729f2e81 in PostmasterMain (argc=5, argv=0x555d73ba6b70) at ./build/../src/backend/postmaster/postmaster.c:1377
#17 0x0000555d72760ea8 in main (argc=5, argv=0x555d73ba6b70) at ./build/../src/backend/main/main.c:228

> On Oct 4, 2019, at 4:09, Andres Freund <andres@anarazel.de> wrote:
> [...]
Hi,

On October 3, 2019 6:14:33 PM PDT, "Антон Власов" <druidvav@gmail.com> wrote:
> Looks like disabling jit didn't affect backtrace at all:
> [...]

That's good - I was only looking at that because of the truncated backtrace. Did you do anything to make that work better this time?

Are there any details that you can provide? Schema? Any extensions in use?

Does the problem happen always, or just under concurrency?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hello,

This test script always causes a crash:

postgres@gp-app01:~$ cat test.sh
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &

create table tracking.test_result
(
    id serial not null
        constraint test_result_pkey
            primary key,
    session_id integer
        constraint test_result_session_id_fkey
            references tracking.test_session,
    reported_at timestamp(0) default now(),
    courier varchar(25) not null,
    tracking_number varchar(60) not null,
    status varchar(25) not null,
    fail_combo integer default 0 not null,
    response_test text,
    response_real text,
    response_time real,
    code smallint,
    message text,
    retry integer
);

create index test_result_session_id_idx
    on tracking.test_result (session_id);

I didn't finish isolating the bug from my database, so I'm not sure whether it will reproduce for you. Single queries work as usual; only concurrent ones cause the crash.

> On Oct 4, 2019, at 4:20, Andres Freund <andres@anarazel.de> wrote:
> [...]
Tried with only two concurrent queries and the crash is still present, so it's not about the total query count.

> On Oct 4, 2019, at 4:24, Антон Власов <druidvav@gmail.com> wrote:
> [...]
Tried with the default Debian configuration; the problem persists. I got a full backtrace after reducing shared_buffers from 18G to 1G. I tried to simulate the problem in an artificial environment but cannot repeat it. On real data it is 100% repeatable.
> On Oct 4, 2019, at 4:20, Andres Freund <andres@anarazel.de> wrote:
> [...]
At last, I've found the source of the problem.
I have a trigger:
create function test_session_bu() returns trigger
language plpgsql
as
$$
begin
if new.session_end is not null and old.session_end is null then
new.is_finished := true;
new.success_count := (select count(*) from tracking.test_result where session_id = new.id and status = 'success');
new.failing_count := (select count(*) from tracking.test_result where session_id = new.id and status = 'failing');
new.fail_count := (select count(*) from tracking.test_result where session_id = new.id and status = 'fail');
end if;
return new;
end;
$$;
create trigger test_session_bu
before update
on tracking.test_session
for each row
execute procedure tracking.test_session_bu();
Everything works fine without it; when I put the trigger back, the problem returns too. As you can see, session_end is not affected by these queries, so I tried simplifying the trigger to:
create or replace function test_session_bu() returns trigger
language plpgsql
as
$$
begin
return new;
end;
$$;
And the problem was still in place.
> On Oct 4, 2019, at 4:20, Andres Freund <andres@anarazel.de> wrote:
> [...]
Hello! Reproducible schema:
create table test_table (
id serial, field int default 0 not null
);
create or replace function test_table_bu() returns trigger
language plpgsql
as
$$
begin
return new;
end;
$$;
create trigger test_table_bu
before update
on test_table
for each row
execute procedure test_table_bu();
insert into test_table (field) values (0);
test.sh:
psql -d gdeposylka -c "update test_table set field = field + 1 where id = 1;" &
psql -d gdeposylka -c "update test_table set field = field + 1 where id = 1;" &
> On Oct 4, 2019, at 4:20, Andres Freund <andres@anarazel.de> wrote:
> [...]
Hi,

On 2019-10-04 05:11:16 +0300, Антон Власов wrote:
> Reproducible schema:
> [...]

Thanks, that's very helpful! I've started to narrow down the problem:

bool
ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
                     ResultRelInfo *relinfo,
                     ItemPointer tupleid,
                     HeapTuple fdw_trigtuple,
                     TupleTableSlot *newslot)
{
...
	/*
	 * In READ COMMITTED isolation level it's possible that target tuple
	 * was changed due to concurrent update. In that case we have a raw
	 * subplan output tuple in newSlot, and need to run it through the
	 * junk filter to produce an insertable tuple.
	 *
	 * Caution: more than likely, the passed-in slot is the same as the
	 * junkfilter's output slot, so we are clobbering the original value
	 * of slottuple by doing the filtering. This is OK since neither we
	 * nor our caller have any more interest in the prior contents of that
	 * slot.
	 */
	if (newSlot != NULL)
	{
		TupleTableSlot *slot = ExecFilterJunk(relinfo->ri_junkFilter, newSlot);

		ExecCopySlot(newslot, slot);
	}

The problem is that 'newslot' and 'slot' are the same. As copying first clears the target slot, that leaves the tuple slot in a bogus state: BufferHeapTupleTableSlot->base.tuple = NULL, but with TTS_FLAG_EMPTY not set and TTS_FLAG_SHOULDFREE set.

The only reason that doesn't run into the assert in ExecCopySlotHeapTuple() (that the slot had better not be empty) is that we've already cleared that flag on the destination slot (which is also the source slot) at that point:

	dstslot->tts_flags &= ~TTS_FLAG_EMPTY;

	oldContext = MemoryContextSwitchTo(dstslot->tts_mcxt);
	bdstslot->base.tuple = ExecCopySlotHeapTuple(srcslot);
	MemoryContextSwitchTo(oldContext);

So I think what we need to do is:

1) Only call ExecCopySlot() if slot != newslot - this fixes the crash.
2) Add an assert to ExecCopySlot() ensuring source and target slot are distinct.
3) Figure out why our tests didn't catch this; afaict this should be a tested scenario.
4) Fix the confusing variable naming around newSlot/newslot in ExecBRUpdateTriggers.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> So I think what we need to do is:
> 1) only call ExecCopySlot() if slot != newslot - this fixes the crash
> 2) Add an assert to ExecCopySlot() ensuring source and target slot are
>    distinct
> [...]

Maybe instead of 1+2, change ExecCopySlot like this:

-	dstslot->tts_ops->copyslot(dstslot, srcslot);
+	if (dstslot != srcslot)
+		dstslot->tts_ops->copyslot(dstslot, srcslot);

That would fix other similar instances, not just this one caller.

			regards, tom lane
Hi,

On 2019-10-03 23:26:01 -0400, Tom Lane wrote:
> Maybe instead of 1+2, change ExecCopySlot like this:
>
> -	dstslot->tts_ops->copyslot(dstslot, srcslot);
> +	if (dstslot != srcslot)
> +		dstslot->tts_ops->copyslot(dstslot, srcslot);
>
> That would fix other similar instances, not just this one caller.

Yea, I was wondering about that too - but asking for a slot to be copied into itself seems to indicate a logic bug more often than not, the caller expecting the slots to be independent afterwards. So I'm a bit hesitant to do that.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> Yea, I was wondering about that too - but asking for a slot to be copied
> into itself seems to indicate a logic bug more often than not? Expecting
> the slots to be independent, afterwards. So I'm a bit hesitant doing so.

Oh, good point. The Assert sounds better then.

			regards, tom lane
Hi,

On 2019-10-03 23:26:01 -0400, Tom Lane wrote:
> Maybe instead of 1+2, change ExecCopySlot like this:
> [...]

Here's 1, 2 & 4; working on fixing 3). While we now have OK-ish coverage for nodeModifyTable.c- and nodeLockRows.c-triggered EPQ, there's not much for the trigger case.

Greetings,

Andres Freund
Attachment
Hi,

On 2019-10-03 21:30:21 -0700, Andres Freund wrote:
> Here's 1, 2 & 4, working on fixing 3). While we now have OK-ish
> coverage for nodeModifyTable.c and nodeLockRows.c triggered EPQ, there's
> not much for the trigger case.

Here's both the fix and tests. I think the test adds a good bit of coverage. What's lacking is primarily testing of foreign-table related paths, but that seems separate enough, and would need to be done in a separate contrib/ test anyway.

I think there's a good case to be made for backpatching the tests further than 12, but I'm not sure about it. They do pass (with one error message about a failure to delete changed to a failure to update; we didn't use to be able to discern) back to 9.6, requiring:

commit 38f8bdcac4982215beb9f65a19debecaf22fd470
Author: Robert Haas <rhaas@postgresql.org>
Date:   2016-02-11 08:36:30 -0500

    Modify the isolation tester so that multiple sessions can wait.

Greetings,

Andres Freund
Attachment
Hi,

Thanks again for the bug report! I've pushed the fix and tests. They'll be included in 12.1 (which I assume will happen relatively soon), or you can compile a slightly newer version yourself.

Regards,

Andres

On 2019-10-04 12:28:48 -0700, Andres Freund wrote:
> I think there's a good case to be made to backpatch the tests further
> than 12, but I'm not sure about it? They do pass (with one error message
> about a failure to delete changed to a failure to update, we didn't use
> to be able to discern) back to 9.6, requiring [...]

I did push the tests back to 9.6.

Greetings,

Andres Freund
Hi,

On 2019-10-04 14:43:00 -0700, Andres Freund wrote:
> I did push the tests back to 9.6.

There's a few ordering violations in the tests, e.g.:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-10-04%2021%3A51%3A13

 key-a           val-a-s1-ups1-ups1
 step s1_c: COMMIT;
-s3: NOTICE:  upd: text key-a = text key-a: t
-s3: NOTICE:  upk: text val-a-s1-ups1-ups1 <> text mismatch: t
-s3: NOTICE:  trigger: name rep_b_u; when: BEFORE; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-ups3)
-s3: NOTICE:  trigger: name rep_a_u; when: AFTER; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-ups3)
-step s3_upd_a_data: <... completed>
+s2: NOTICE:  upd: text key-a = text key-a: t
+s2: NOTICE:  upk: text val-a-s1-ups1-ups1 <> text mismatch: t
+s2: NOTICE:  trigger: name rep_b_u; when: BEFORE; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-upserts2)
+s2: NOTICE:  trigger: name rep_a_u; when: AFTER; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-upserts2)
+step s2_upsert_a_data: <... completed>
 key             data

I was under the assumption that it'd be deterministic who gets to continue with a speculative insertion, but that ain't so.

Peter, do you see an easy way around that? Do you consider that a problem? ISTM we ought to make this fair, but doing so would require somewhat bigger changes than we'd want to make in the backbranches?

Unless somebody comes up with an idea, I'll disable (or rewrite) the "three way" tests involving two waiters for one insertion.

Regards,

Andres
On Fri, Oct 4, 2019 at 3:24 PM Andres Freund <andres@anarazel.de> wrote:
> I was under the assumption that it'd be deterministic who gets to
> continue with a speculative insertion, but that ain't so.

There is no heap_lock_tuple() style lock arbitration built in to
speculative insertions:

https://wiki.postgresql.org/wiki/UPSERT#Theoretical_lock_starvation_hazards

I do not consider this to be a problem.

> Peter, do you see an easy way around that? Do you consider that a
> problem? ISTM we ought to make this fair, but that doing so would
> require a bit bigger changes that we'd want to make in the backbranches?

It would also require formally defining "fair", which doesn't seem
straightforward to me. The MVCC snapshot is barely involved at all.

--
Peter Geoghegan
Hi,

On 2019-10-04 15:24:37 -0700, Andres Freund wrote:
> On 2019-10-04 14:43:00 -0700, Andres Freund wrote:
> > > I think there's a good case to be made to backpatch the tests further
> > > than 12, but I'm not sure about it? They do pass (with one error message
> > > about a failure to delete changed to a failure to update, we didn't use
> > > to be able to discern) back to 9.6, requiring
> >
> > I did push the tests back to 9.6.
>
> There's a few ordering violations in the tests, e.g.:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-10-04%2021%3A51%3A13

I've now disabled that portion of the test.

It looks to me like this might be a subtle upsert bug. What happens is
that in the isolationtester sequence (all operating on one row, and all
already in a transaction):

  "s1_upd_a_data" "s3_upd_a_data" "s2_upsert_a_data" "s1_upd_a_data" "s1_c" "s3_del_a" "s3_c" "s2_c"

sometimes the s2_upsert_a_data gets to update the row before the
s3_upd_a_data. That does *not* happen with a plain update, even though
there's at that point always a target row to operate on (i.e. this
happens before s3_del_a is reached).

I think the cause for that might be that the
check_exclusion_or_unique_constraint() check done for upsert (via
ExecCheckIndexConstraints() in ExecInsert()) does a XactLockTableWait()
without, like heapam.c would, ordering via a tuple lock.

Since it's clearly not 12 specific, and not really related to triggers /
EPQ, I think disabling it for now is a sensible choice.

>  key-a          val-a-s1-ups1-ups1
>  step s1_c: COMMIT;
> -s3: NOTICE: upd: text key-a = text key-a: t
> -s3: NOTICE: upk: text val-a-s1-ups1-ups1 <> text mismatch: t
> -s3: NOTICE: trigger: name rep_b_u; when: BEFORE; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-ups3)
> -s3: NOTICE: trigger: name rep_a_u; when: AFTER; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-ups3)
> -step s3_upd_a_data: <... completed>
> +s2: NOTICE: upd: text key-a = text key-a: t
> +s2: NOTICE: upk: text val-a-s1-ups1-ups1 <> text mismatch: t
> +s2: NOTICE: trigger: name rep_b_u; when: BEFORE; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-upserts2)
> +s2: NOTICE: trigger: name rep_a_u; when: AFTER; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-upserts2)
> +step s2_upsert_a_data: <... completed>
>  key            data
>
> I was under the assumption that it'd be deterministic who gets to
> continue with a speculative insertion, but that ain't so.

This was a preliminary theory, a wrong one: speculative insertions
aren't actually involved, see above...

Greetings,

Andres Freund
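[Editor's note: to make the failure mode above concrete, here is a minimal simulation in plain Python threads. It is not PostgreSQL code; the session names are merely borrowed from the test spec. It models two sessions blocked in a XactLockTableWait()-style wait with no tuple lock queued in front of it: when the blocking transaction commits, both waiters wake at once and nothing defines who reaches the row first.]

```python
import threading

# Illustrative simulation only (not PostgreSQL code): two sessions both
# block on the same in-progress transaction, analogous to a bare
# XactLockTableWait() with no tuple-lock arbitration in front of it.

def simulate_commit_release():
    xact_done = threading.Event()   # released when the updater commits
    completed = []
    mutex = threading.Lock()

    def session(name):
        xact_done.wait()            # both sessions wake "at once" here
        with mutex:
            completed.append(name)  # arrival order is scheduler-dependent

    waiters = [threading.Thread(target=session, args=(name,))
               for name in ("s3_upd_a_data", "s2_upsert_a_data")]
    for t in waiters:
        t.start()
    xact_done.set()                 # s1_c: COMMIT releases both waiters
    for t in waiters:
        t.join()
    return completed
```

Both sessions always complete, but the order in which they get at the row varies from run to run, which is exactly the kind of output-ordering instability the buildfarm diff shows.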
Andres Freund <andres@anarazel.de> writes:
> I've now disabled that portion of the test.

There's still a large fraction of the buildfarm that's unhappy :-(

			regards, tom lane
Hi,

On 2019-10-05 11:30:39 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I've now disabled that portion of the test.
>
> There's still a large fraction of the buildfarm that's unhappy :-(

I pushed a second patch disabling two more permutations, I've not seen
any related failures since.

It's frustrating that we currently don't seem to have deterministic
behaviour with two sessions waiting for a tuple lock. Doesn't look like
that's new though. It seems to me that most of the XactLockTableWait()
users ought to follow the pattern outlined at the top of
README.tuplock, but don't.

Greetings,

Andres Freund
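[Editor's note: for contrast with the nondeterministic case, here is a hedged sketch of why the README.tuplock-style arbitration gives a deterministic outcome. A FIFO queue stands in for the heavyweight tuple lock that waiters acquire before waiting on the updater's transaction; waiters are then granted the row strictly in arrival order. This is a Python simulation with hypothetical names, not PostgreSQL code.]

```python
import threading

# Illustrative simulation only (not PostgreSQL code): the pattern at the
# top of README.tuplock takes a lock on the *tuple* before waiting on the
# updating transaction, so concurrent waiters queue in arrival order.

class TupleLockQueue:
    """FIFO arbitration: waiters acquire the tuple in arrival order."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._waiters = []              # FIFO: (name, event) per blocked session

    def wait_for_tuple(self, name, granted):
        ev = threading.Event()
        with self._mutex:
            self._waiters.append((name, ev))   # queue position = arrival order
        ev.wait()                              # analogous to XactLockTableWait()
        granted.append(name)                   # this session now owns the row

    def release_next(self):
        """Hand the tuple to the longest-waiting session, if any."""
        with self._mutex:
            if self._waiters:
                _, ev = self._waiters.pop(0)
                ev.set()

def demo():
    q = TupleLockQueue()
    granted = []
    threads = []
    for name in ("s3_upd_a_data", "s2_upsert_a_data"):
        t = threading.Thread(target=q.wait_for_tuple, args=(name, granted))
        t.start()
        threads.append(t)
        while len(q._waiters) < len(threads):  # ensure deterministic arrival
            pass
    for t in threads:                          # commit grants the row FIFO
        q.release_next()
        t.join()
    return granted
```

With the queue in place, s3 (the first arrival) always gets the row before s2, matching the order the disabled test permutations expected.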