Thread: Re: PostgreSQL 17 Segmentation Fault

Re: PostgreSQL 17 Segmentation Fault

From: Tomas Vondra
Date: 04 October 2024, 22:11:15

Hi,

Thanks for the provided information. Per the backtrace, the failure
happens in the LLVM JIT code in nestloop/seqscan, so it has to be in
this part of the plan:

  ->  Nested Loop  (cost=0.42..6074.84 rows=117 width=641)
        ->  Parallel Seq Scan on tasks__projects  (cost=0.00..2201.62 rows=745 width=16)
              Filter: (gid = '1138791545416725'::text)
        ->  Index Scan using tasks_pkey on tasks tasks_1  (cost=0.42..5.20 rows=1 width=102)
              Index Cond: (gid = tasks__projects._sdc_source_key_gid)
              Filter: ((NOT completed) AND (name <> ''::text))

It's not clear why this should consume a lot of memory, though. It's
possible the memory is consumed elsewhere, and this is simply the straw
that breaks the camel's back ...

Presumably it takes a while for the query to consume a lot of memory and
crash - can you attach a debugger to it after it allocates a lot
of memory (but before the crash), and do this:

  call MemoryContextStats(TopMemoryContext)

That should write memory context stats to the server log. Perhaps that
will tell us which part of the query allocates memory.
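
The debugger step above can also be run non-interactively. The sketch
below only builds the gdb command line; the helper name and the example
PID are hypothetical, and the backend PID is assumed to come from
SELECT pg_backend_pid() in the session that runs the query. On
PostgreSQL 14 and later, SELECT pg_log_backend_memory_contexts(pid) from
another session should produce a similar dump without a debugger.

```python
def gdb_memory_stats_command(backend_pid):
    """Build a gdb invocation that attaches to one backend, dumps the
    memory context statistics to the server log, and detaches again."""
    if backend_pid <= 0:
        raise ValueError("backend_pid must be a positive process id")
    return [
        "gdb",
        "-p", str(backend_pid),   # attach to the running backend
        "-batch",                 # exit once the commands have run
        "-ex", "call MemoryContextStats(TopMemoryContext)",
    ]


# Hypothetical PID, for illustration only.
command = gdb_memory_stats_command(12345)
```

The resulting list can be passed to subprocess.run() by a user allowed
to ptrace the postgres process.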

Next, try running the query with jit=off. If that resolves the problem,
maybe it's another JIT issue. But if it completes with lower shared
buffers, that doesn't seem likely.

The plan has a bunch of hash joins. I wonder if that might be causing
issues, because the hash tables may be kept until the end of the query,
and each may be up to 64MB (you have work_mem=32MB, but there's also the
hash_mem_multiplier, added in PG13 and 2x by default since PG15). The row
estimates are pretty low, but could it be that the real row counts are
much higher? Did you run analyze after the upgrade? Maybe try with lower
work_mem?
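
The arithmetic behind the 64MB figure can be sketched as follows. The
function names and the count of five hash nodes are hypothetical (the
plan's exact number of hash joins is not shown here), and the result is
only the budget at which a hash table starts spilling to batches, not a
hard cap on what a backend can allocate.

```python
def hash_mem_limit_kb(work_mem_kb, hash_mem_multiplier=2.0):
    """Per-hash-table memory budget: work_mem times hash_mem_multiplier."""
    return int(work_mem_kb * hash_mem_multiplier)


def total_hash_mem_kb(work_mem_kb, hash_nodes, hash_mem_multiplier=2.0):
    """Rough budget if every hash table in the plan is alive at once."""
    return hash_nodes * hash_mem_limit_kb(work_mem_kb, hash_mem_multiplier)


# work_mem=32MB with the 2x multiplier gives 64MB per hash table.
per_table_kb = hash_mem_limit_kb(32 * 1024)

# Five hash nodes is an assumed count, for illustration only.
five_tables_kb = total_hash_mem_kb(32 * 1024, hash_nodes=5)
```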

One last thing you should check is memory overcommit. Chances are the
limit is set just low enough for the query to hit it with
shared_buffers=4GB, but not with shared_buffers=3GB. In that case you may
need to tune this a bit (see /proc/meminfo and /proc/sys/vm/overcommit_*).
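
One way to read the relevant counters out of /proc/meminfo is sketched
below. The helper names are hypothetical; CommitLimit and Committed_AS
are the kernel's own field names. The headroom is only meaningful when
the kernel actually enforces the limit (overcommit_memory=2).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integers.

    Most fields are reported in kB; a few, such as HugePages_Total,
    are plain counts without a unit."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(fields):
    """Address space (kB) that can still be committed before CommitLimit."""
    return fields["CommitLimit"] - fields["Committed_AS"]
```

On a Linux host this would be called as
commit_headroom_kb(parse_meminfo(open("/proc/meminfo").read())).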


regards

-- 
Tomas Vondra

Re: PostgreSQL 17 Segmentation Fault

From: Cameron Vogt
Date: 05 October 2024, 01:17:47

The query crashes less than a second after I run it, so there isn't much time for it to consume memory or for me to attach GDB mid-query. I tried decreasing work_mem from 32MB to 128kB, but I still get the error. I've also run vacuum and analyze, to no avail. When the query is successful, it yields only 68 rows, so I don't think the row estimates are too far off. I checked the files you mentioned for memory overcommit:

/proc/sys/vm/overcommit_memory = 0

/proc/sys/vm/overcommit_kbytes = 0

/proc/sys/vm/overcommit_ratio = 50
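
For context, these settings can be interpreted roughly as follows,
based on the Linux overcommit accounting rules: mode 0 is heuristic,
mode 1 always overcommits, and only mode 2 enforces a commit limit. The
function name and the 16GB RAM size in the example are hypothetical;
the machine's total RAM is not stated in this thread.

```python
def commit_limit_kb(mode, mem_total_kb, swap_total_kb,
                    overcommit_ratio=50, overcommit_kbytes=0, hugetlb_kb=0):
    """Return the enforced commit limit in kB, or None if not enforced.

    Only overcommit_memory=2 enforces a limit. A nonzero
    overcommit_kbytes takes the place of the ratio-based RAM share."""
    if mode != 2:
        return None  # modes 0 and 1 do not enforce CommitLimit
    if overcommit_kbytes > 0:
        ram_part = overcommit_kbytes
    else:
        ram_part = (mem_total_kb - hugetlb_kb) * overcommit_ratio // 100
    return swap_total_kb + ram_part


# Reported settings (mode 0, ratio 50) with an assumed 16GB of RAM.
reported = commit_limit_kb(0, 16 * 1024 * 1024, 0)

# The same assumed machine under strict accounting, for comparison.
strict = commit_limit_kb(2, 16 * 1024 * 1024, 0)
```

With overcommit_memory = 0, as reported here, no fixed limit applies,
so a strict commit limit is unlikely to be what the query is hitting.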

The free RAM on the system starts at about 8GB and stays around that level while the crashing query executes.

The only two things that have fixed the issue so far are turning JIT off and decreasing shared_buffers. I suppose then that it might be a JIT issue?

Cameron Vogt | Software Developer
Direct: 314-756-2302 | Cell: 636-388-2050
1585 Fencorp Drive | Fenton, MO 63026
Automatic Controls Equipment Systems, Inc.

Re: PostgreSQL 17 Segmentation Fault

From: Thomas Munro
Date: 05 October 2024, 04:40:17

On Sat, Oct 5, 2024 at 11:30 AM Cameron Vogt
<cvogt@automaticcontrols.net> wrote:
> I suppose then that it might be a JIT issue?

I see from your info.txt file that this is aarch64.  Could it be an
instance of LLVM's ARM relocation bug[1]?  I'm planning to push that
fix, taken from the LLVM project, soon.  I have just been waiting to see
if a more polished version would land in LLVM's main branch first, but
I'm about to give up waiting for that so we get some testing time
in-tree before our next minor release.

[1] https://www.postgresql.org/message-id/flat/CAO6_Xqr63qj%3DSx7HY6ZiiQ6R_JbX%2B-p6sTPwDYwTWZjUmjsYBg%40mail.gmail.com
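
A minimal sketch of why a relocation bug of this kind could depend on
memory layout, assuming (as I understand the linked report) that the
failing relocations are AArch64 ADRP-style page-relative references.
ADRP encodes a signed 21-bit count of 4kB pages, so it can only reach
targets within roughly +/-4GB of the instruction. The function name and
the example addresses are hypothetical.

```python
ADRP_IMM_BITS = 21   # signed immediate width of the ADRP instruction
PAGE_SHIFT = 12      # ADRP addresses 4kB pages


def adrp_in_range(instr_addr, target_addr):
    """True if target_addr's page is reachable by ADRP from instr_addr."""
    page_delta = (target_addr >> PAGE_SHIFT) - (instr_addr >> PAGE_SHIFT)
    limit = 1 << (ADRP_IMM_BITS - 1)   # 2**20 pages, i.e. 4GB
    return -limit <= page_delta < limit


# Code and data sections mapped 3GB apart: still reachable.
near = adrp_in_range(0x1000, 0x1000 + (3 << 30))

# Sections mapped 5GB apart: the page offset no longer fits.
far = adrp_in_range(0x1000, 0x1000 + (5 << 30))
```

If a larger shared memory mapping pushes the JIT-allocated sections
further apart, a reference that used to fit could fall out of range,
which would be consistent with the crash depending on shared_buffers.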

Re: PostgreSQL 17 Segmentation Fault

From: Cameron Vogt
Date: 06 October 2024, 01:11:54

On 2024-10-05 01:40:17 Thomas Munro
<thomas(dot)munro(at)gmail(dot)com> wrote:

> Could it be an instance of LLVM's ARM relocation bug?

After reading about the bug, I believe you are likely correct. That would explain the behavior I'm seeing with JIT and shared_buffers. When I migrated PostgreSQL versions, I also moved to a new aarch64 machine. The old machine was not aarch64, so that may explain the timing of the issue as well.
