Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker - Mailing list pgsql-bugs

From Craig Milhiser
Subject Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date
Msg-id CA+wnhO3kT7mxbadbteppSZVcs3FRPvGL5xTY2DbrWSGohmhyVw@mail.gmail.com
Whole thread Raw
In response to Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
List pgsql-bugs
> Since you're building from source, you could try applying the patch
>posted by Andrei Lephikov:
https://www.postgresql.org/message-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb%40postgrespro.ru

This did not work for me. I am running out of memory.

I applied the patch, make clean, make, make check, sudo make install. I am running out of the box Postgres configuration.

Memory below uses "free -m".

Before loading Postgres
               total        used        free      shared  buff/cache   available
Mem:           31388         669       30467           2         639       30719
Swap:              0           0           0

After loading
               total        used        free      shared  buff/cache   available
Mem:           31388         672       30464          14         651       30715
Swap:              0           0           0

I go into psql
set max_parallel_workers_per_gather = 0;
run the query multiple times, takes 9.5 seconds at steady state, returns 20 rows.

Memory is still available

               total        used        free      shared  buff/cache   available
Mem:           31388         921       22547         142        8460       30466
Swap:              0           0           0

In the same psql session, set max_parallel_workers_per_gather = 2; then run the query again. This runs for 1 minute then:

2024-10-01 18:28:45.883 UTC [2586] LOG:  background worker "parallel worker" (PID 4465) was terminated by signal 9: Killed
2024-10-01 18:28:45.883 UTC [2586] DETAIL:  Failed process was running: SELECT
      ...          
2024-10-01 18:28:45.883 UTC [2586] LOG:  terminating any other active server processes
2024-10-01 18:28:46.620 UTC [2586] LOG:  all server processes terminated; reinitializing

I got this as close to the end as I could
               total        used        free      shared  buff/cache   available
Mem:           31388       31014         535        1955        2156         373
Swap:              0           0           0

Though OOM conditions often means all bets are off for behavior, I tried something different. I rebooted, started Postgres then run the query. I do not set parallel_... = 0 and run the query which populated the cache. The machine exhausts memory again but usually "hangs". I need to restart.  Below is the frozen screen
               total        used        free      shared  buff/cache   available
Mem:           31388       31317         240        1955        2140          70
Swap:              0           0           0

I ran these sequences multiple times. I also analyzed the data again just to make sure.

I reverted the patch to make sure I am reproducing the issue. I get the same 1.8GB allocation failure with parallel. Without parallel the query takes ~10 seconds. The patch increased the single worker performance for this query for out of the box configuration by 5%. 

Thanks

On Sun, Sep 29, 2024 at 9:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Sep 30, 2024 at 12:03 PM Craig Milhiser <craig@milhiser.com> wrote:
> I reproduced the issue on v17. I downloaded the source tarball, built it, passed tests, put my production database, analyzed and ran the query. As you expected, the same issue occurred. I have opened the incident with the AWS team as well.

Since you're building from source, you could try applying the patch
posted by Andrei Lephikov:

https://www.postgresql.org/message-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb%40postgrespro.ru

I suspect we may want to limit it to a smaller number than that, as
mentioned already, and I think we should also apply the same cap to
the initial estimate (Andrei's patch only caps it when it decides to
increase it, not for the initial nbatch number).  I can write a patch
like that in a few days when I return from travelling, and we can aim
to get it into the November release, but I suspect Andrei's patch
might already avoid the error for your case.

pgsql-bugs by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: Linux OOM killer
Next
From: Andrei Lepikhov
Date:
Subject: Re: BUG #18643: EXPLAIN estimated rows mismatch