Thread: Urgent: Segmentation Fault in PostgreSQL postmaster Process
Check for any disk issues.
Is the storage local or SAN? If SAN, check for any hiccups communicating with the SAN.
Check disk space usage.
From: Veerendra Pulapa <veerendra.pulapa@ashnik.com>
Sent: Sunday, June 16, 2024 7:49 PM
To: pgsql-admin <pgsql-admin@postgresql.org>; pgsql-admin@lists.postgresql.org
Subject: Urgent: Segmentation Fault in PostgreSQL postmaster Process
Dear Team,
I am experiencing a segmentation fault issue with the postmaster process of our PostgreSQL database, and I am seeking your assistance. Below are the details of the error and our system configuration.
Issue Description
We encountered a segmentation fault in the postmaster process of our PostgreSQL instance. The relevant system log entry is as follows:
DB Logs:
LOG: restartpoint starting: time
LOG: startup process (PID 21704) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
LOG: database system is shut down
LOG: starting PostgreSQL 13.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
LOG: listening on IPv4 address "0.0.0.0", port 5432
LOG: listening on IPv6 address "::", port 5432
LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
LOG: database system was interrupted while in recovery at log time 2024-06-16 19:37:48 +08
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
OS logs:
kernel: postmaster[7844]: segfault at 2dd736b ip 00000000004ddc75 sp 00007ffe0e9c3490 error 6 in postgres[400000+73e000]
PostgreSQL Version and Environment:
PostgreSQL Version: 13.2
Operating System: rhel 7.9
Kernel Version: 3.10.0-1160.118.1.el7.x86_64
System Specifications: [32 vCPU, 250GB]
______________________________________________________________________________________
This email may contain confidential, privileged or copyright material and is solely for the use of the intended recipient(s). If you are not the rightful recipient of this email, please delete this email immediately and inform the recipient.
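Editorial aside, not part of the original message: the kernel line in the OS logs above can be unpacked with a small Python sketch. It assumes the usual Linux x86 segfault message layout, in which the value after "error" is the hardware page-fault error code in hex (bit 0: page was present, bit 1: write access, bit 2: user mode) and the bracketed pair is the load address and size of the mapped object. The function name and dictionary keys are illustrative only.

```python
import re

# Layout of the Linux x86 "segfault at ..." kernel message quoted in the thread.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a kernel segfault line into its fields and decode the error bits."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "object": m["obj"],
        "fault_address": hex(int(m["addr"], 16)),
        "ip": hex(ip),
        # Offset of the faulting instruction inside the mapped object.
        "offset_in_object": hex(ip - base),
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }


line = ("kernel: postmaster[7844]: segfault at 2dd736b ip 00000000004ddc75 "
        "sp 00007ffe0e9c3490 error 6 in postgres[400000+73e000]")
info = decode_segfault(line)
```

For the line from this thread, the sketch reports a user-mode write to a page that was not mapped, at offset 0xddc75 into the postgres binary. This narrows down where the crash happened but not why; a stack trace from a core dump is still needed for that.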
This runs as a primary, right? The logs indicate that the postmaster had already crashed and was trying to recover when it threw another segfault.
Give us some historical background on any events during the life of your installation that might be of interest: how often you get the problem, how long you have been having it, and what changed (software/hardware) before it started to appear.
IME, such errors could be due to either a faulty CPU or faulty memory. I'd run a memtest and hardware diagnostics (BIOS), check temperatures, etc.
-- Achilleas Mantzios IT DEV - HEAD IT DEPT Dynacom Tankers Mgmt (as agents only)
On Mon, Jun 17, 2024, 5:19 AM Veerendra Pulapa <veerendra.pulapa@ashnik.com> wrote:
> OS logs:
> kernel: postmaster[7844]: segfault at 2dd736b ip 00000000004ddc75 sp 00007ffe0e9c3490 error 6 in postgres[400000+73e000]
> PostgreSQL Version and Environment:
> PostgreSQL Version: 13.2
> Operating System: rhel 7.9
> Kernel Version: 3.10.0-1160.118.1.el7.x86_64
> System Specifications: [32 vCPU, 250GB]
Is it possible for you to do a minor upgrade to the latest 13.12 version?
It should be a minor upgrade; an upgrade and restart might be all you need.
IIRC there were bugs in older 13.x versions that were fixed in later versions.
On Mon, 2024-06-17 at 18:30 +0530, Veerendra Pulapa wrote:
> LOG: startup process (PID 21704) was terminated by signal 11: Segmentation fault
If you want any support for that, you need to collect a core dump and
generate a stack trace. Make sure that you have the debugging symbols installed.
Also, we'd need to see the log messages *before* the lines you show. Everything
since the start of the server.
A segmentation fault is caused by a software bug. Since it happens during
recovery, perhaps the WAL that is being replayed is corrupted and makes
PostgreSQL fail.
My recommendation is to restore your last good backup.
Yours,
Laurenz Albe
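Editorial aside: the stack trace requested above is typically produced by running gdb against the postgres executable and the core file. The sketch below only builds such a command line and does not run it. Both paths are hypothetical placeholders, not taken from the thread; the real locations depend on the installation and on the system's core dump settings.

```python
import shlex


def gdb_backtrace_command(postgres_binary, core_file):
    """Return the argv list for a non-interactive gdb run that prints a backtrace."""
    return ["gdb", "--batch", "-ex", "bt full", postgres_binary, core_file]


# Hypothetical paths, for illustration only.
argv = gdb_backtrace_command("/path/to/bin/postgres", "/path/to/core.21704")

# Shell-quoted form that can be pasted into a terminal.
print(shlex.join(argv))
```

Without the debugging symbols mentioned above, the trace will show raw addresses instead of function names and line numbers.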
On Wed, 2024-06-19 at 12:57 +0530, Veerendra Pulapa wrote:
> Dear,
>
> I have run coredump in my current system but I want to understand the coredump
> file and what went wrong with the signal 11 error(segmentation fault)?
>
> below are the coredump output:
>
> Program terminated with signal 11, Segmentation fault.
> #0 0x00000000004ddc75 in _bt_swap_posting (newitem=newitem@entry=0x254bee8, oposting=oposting@entry=0x7fc6c1b3ee20, postingoff=1) at nbtdedup.c:800
That is revealing. Line number 800 has been a comment since version 13.4,
so you must be running 13.3 or lower.
The only commit that happened to the file between 13.3 and 13.4 is fa675af59f,
which added a check to defend against a crash in connection with corrupted indexes.
So I suggest that you update to 13.latest, as you should always do.
See if the crash turns into an error message.
Then you should try to rebuild the index with REINDEX. See if that gets rid
of the problem.
However, your server log suggests that you hit the crash while in crash recovery.
In that case you won't get far enough to rebuild any indexes.
Your options are probably to restore a backup or to venture "pg_resetwal" to
get the system up. But "pg_resetwal" will destroy data and further corrupt
your database, so take a backup before you do that.
The goal of "pg_resetwal" is to get the server up so you can try to "pg_dump"
the database and restore it somewhere else.
Yours,
Laurenz Albe
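Editorial aside: the version reasoning above can be written down as a small check. This Python sketch parses the version from a server log line and reports whether a 13.x release is at or past 13.4, the release said above to contain the added check. It deliberately covers only the 13 branch; the function names are illustrative.

```python
import re


def parse_pg_version(text):
    """Extract (major, minor) from text such as 'starting PostgreSQL 13.2 on ...'."""
    m = re.search(r"PostgreSQL (\d+)\.(\d+)", text)
    if m is None:
        raise ValueError("no PostgreSQL version found")
    return int(m.group(1)), int(m.group(2))


def has_posting_split_check(version):
    """True if a 13.x release is 13.4 or later, per the reply above."""
    major, minor = version
    if major != 13:
        raise ValueError("this sketch only covers the 13.x branch")
    return minor >= 4


log_line = ("LOG: starting PostgreSQL 13.2 on x86_64-pc-linux-gnu, "
            "compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit")
version = parse_pg_version(log_line)
needs_update = not has_posting_split_check(version)
```

For the 13.2 server in this thread, needs_update is True, which matches the advice to move to the latest 13.x minor release before attempting a REINDEX.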
On Wed, 2024-06-19 at 13:32 +0530, Veerendra Pulapa wrote:
> How do we check code 13.3 and 13.4 nbtdedup.c:800?
>
> Regarding this issue can we get any relevant information? Where can we find bug information?
Huh? PostgreSQL is open source.
I told you it is commit fa675af59f, so you can look at
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fa675af59f
It is also listed in the release notes of 13.4:
https://www.postgresql.org/docs/13/release-13-4.html
- Harden B-tree posting list split code against corrupt data (Peter Geoghegan)
  Throw an error, rather than crashing, for an attempt to insert an item with a
  TID identical to an existing entry. While that shouldn't ever happen, it has
  been reported to happen when the index is inconsistent with its table.
Yours,
Laurenz Albe
Hi,
On Wed, Jun 19, 2024 at 10:08:14AM +0200, Laurenz Albe wrote:
> On Wed, 2024-06-19 at 13:32 +0530, Veerendra Pulapa wrote:
> > How do we check code 13.3 and 13.4 nbtdedup.c:800?
> >
> > Regarding this issue can we get any relevant information? Where can we find bug information?
>
> Huh? PostgreSQL is open source.
>
> I told you it is commit fa675af59f, so you can look at
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fa675af59f
>
> It is also listed in the release notes of 13.4:
> https://www.postgresql.org/docs/13/release-13-4.html
>
> - Harden B-tree posting list split code against corrupt data (Peter Geoghegan)
>
> Throw an error, rather than crashing, for an attempt to insert an item with a
> TID identical to an existing entry. While that shouldn't ever happen, it has
> been reported to happen when the index is inconsistent with its table.
Right, and the reason why the index is inconsistent with its table is
probably due to the ill-fated OS update you mentioned; if that was
in-place and unless you REINDEXed all the text-column-based indexes,
this might have led to index corruption, so REINDEX your database after
you have upgraded to the latest minor release of version 13.
Michael
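Editorial aside: once the server is up on the latest 13.x release, the REINDEX step suggested above could be scripted. The Python sketch below only generates the SQL text and does not connect to a database. The schema and index names are hypothetical, and the CONCURRENTLY option (available for REINDEX since PostgreSQL 12) is a choice made here for illustration, not something the thread prescribes.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def reindex_statements(indexes, concurrently=True):
    """Build one REINDEX INDEX statement per (schema, index) pair."""
    keyword = "REINDEX INDEX CONCURRENTLY" if concurrently else "REINDEX INDEX"
    return [
        f"{keyword} {quote_ident(schema)}.{quote_ident(index)};"
        for schema, index in indexes
    ]


# Hypothetical index names, for illustration only.
statements = reindex_statements([("public", "customers_name_idx")])
for statement in statements:
    print(statement)
```

Rebuilding every index with a single REINDEX DATABASE command is the simpler route when a maintenance window is available; a per-index list like this mainly helps when only the text-column-based indexes are to be rebuilt.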
Dear All,
I hope this message finds you well.
I am reaching out to discuss an issue we recently encountered with our PostgreSQL setup, where a bug triggered on our standby servers before it affected the master. I am seeking insights into whether the resource differences between our servers could have played a role in this sequence of events.
Issue Overview:
- We observed a signal 11 (segmentation fault) error that first appeared on our standby servers and subsequently affected the master server.
- Our setup consists of a master server with higher resources and multiple standby servers with relatively lower resources.
Concerns:
- The standby servers have fewer resources than the master, which may have contributed to the bug being triggered on them first.
- We are considering whether the disparity in resources could lead to performance bottlenecks or stability issues, causing the standby servers to encounter the bug earlier than the master.
Request for Insights:
- Has anyone else experienced similar issues where bugs or faults are observed on standby servers before the master?
- Could the resource differences between the master and standby servers play a significant role in this behavior?
- Are there best practices for ensuring stability across servers with different resource allocations, especially in a High Availability (HA) setup?
I would greatly appreciate any insights, experiences, or suggestions you might have regarding this issue. Understanding the underlying reasons will help us optimize our setup and prevent future occurrences.
Thank you for your time and expertise.
Yes, you can.
On Sat, Jun 22, 2024 at 11:38 AM Veerendra Pulapa <veerendra.pulapa@ashnik.com> wrote:
> Hi All,
> Is there any way to reproduce the issue on a different OS and different DB versions?
> On Wed, Jun 19, 2024 at 1:42 PM Michael Banck <mbanck@gmx.net> wrote:
> > Hi,
> > [...]
Veerendra Pulapa wrote:
> I am reaching out to discuss an issue we recently encountered with our PostgreSQL
> setup, where a bug triggered on our standby servers before it affected the master.
> I am seeking insights into whether the resource differences between our servers
> could have played a role in this sequence of events.
>
> Issue Overview:
>
> We observed a signal 11 (segmentation fault) error that first appeared on our
> standby servers and subsequently affected the master server.
Please don't hijack another thread, start your own.
You should tell us the exact PostgreSQL version and operating system.
Also, tell us exactly what you did to trigger the problem.
Collect a core dump and get a stack trace (you need debugging symbols):
https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend
Yours,
Laurenz Albe