Re: Postgres 9.2.13 on AIX 7.1 - Mailing list pgsql-bugs
From:           Rainer Tammer
Subject:        Re: Postgres 9.2.13 on AIX 7.1
Date:
Msg-id:         5e4f9356-26cc-bd75-4f82-92d26ce575f7@spg.schulergroup.com
In response to: Re: Postgres 9.2.13 on AIX 7.1 (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:      Re: Postgres 9.2.13 on AIX 7.1
List:           pgsql-bugs
Hello,
I ran the server with autovacuum disabled for ~24 h - no server shutdown.
After re-enabling autovacuum, the server died in less than 9 hours:
Started: 2021-08-25 08:12:29
Died: 2021-08-25 16:22:33
At the time of the shutdown there was no access to the server:
no running applications and no psql CLI sessions.
I will let it run overnight and see if the server goes down again.
There is no software installed on this AIX LPAR that uses this instance or sends signals to the server.
I only interacted with the server occasionally during the day to check that it was working correctly.
Unfortunately, I cannot really see from the main process which other process sent the SIGINT.
This is the only correlation I see:
2021-08-25 16:22:27 CEST DEBUG: server process (PID 19005776) exited with exit code 0
2021-08-25 16:22:33 CEST DEBUG: postmaster received signal 2
2021-08-25 16:22:33 CEST LOG: received fast shutdown request
2021-08-25 16:22:33 CEST LOG: aborting any active transactions
2021-08-25 16:22:33 CEST LOG: autovacuum launcher shutting down
The time gap is 6 s, so it might be a bit too far from the last process exit to be related.
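Since I cannot see the sender from inside the postmaster: if the SIGINT handler were installed with SA_SIGINFO, the kernel would at least report the sender's PID and UID. A minimal stand-alone sketch of that technique (this is not PostgreSQL code, just the general idea):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-alone sketch: catch SIGINT with SA_SIGINFO so the kernel
   reports who sent it.  fprintf() inside a signal handler is only
   acceptable for a throwaway test like this one. */
static void
sigint_handler(int signo, siginfo_t *info, void *context)
{
    fprintf(stderr, "got signal %d from pid %ld (uid %ld)\n",
            signo, (long) info->si_pid, (long) info->si_uid);
}

int
main(void)
{
    struct sigaction sa;

    sa.sa_sigaction = sigint_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    for (;;)
        pause();                /* wait for the signal */
}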
I could migrate the test DB to 9.6.23 and see if the problem persists.
Would it be worth adding additional code before every signal send, to log the source PID and the target PID as well as the source/target process names?
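Something like this rough sketch is what I have in mind (the names here are made up, it is not actual PostgreSQL code): route every kill() through a small wrapper that logs both sides first.

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical logging wrapper around kill(): log the sender PID,
   the target PID and a caller-supplied name before sending. */
static int
logged_kill(pid_t target, int signo, const char *target_name)
{
    fprintf(stderr, "pid %ld sends signal %d to pid %ld (%s)\n",
            (long) getpid(), signo, (long) target, target_name);
    return kill(target, signo);
}

/* e.g. logged_kill(autovac_pid, SIGUSR2, "autovacuum launcher");
   where autovac_pid is whatever variable holds that PID. */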
The OS is at the latest patch level.
The compiler is at the latest patch level.
PostgreSQL 9.2.x is at the latest patch level.
I can also run a trace tomorrow; that would give me some information.
Sample output (shortened):
Wed Aug 25 17:58:51 2021
System: AIX 7.2 Node: host
Machine: 000000000000
Internet Address: 00000000 1.1.1.1
At trace startup, the system contained 16 cpus, of which 16 were traced.
Buffering: Kernel Heap
This is from a 64-bit kernel.
Tracing only these hooks, 14e0
ID PROCESS NAME PID TID I SYSTEM CALL ELAPSED_SEC DELTA_MSEC APPL SYSCALL KERNEL INTERRUPT
001 trace 23789978 87687537 0.000000000 0.000000 TRACE ON channel 0
Wed Aug 25 17:58:51 2021
14E postgres: 18743746 85000571 7.903995939 2994.175459 kill: signal SIGUSR1 to process 25166296 postgres
14E --1- -1 85393753 7.904962367 0.966428 kill: signal SIGUSR2 to process 18743746 postgres:
14E --1- -1 85393753 7.946566507 41.604140 kill: signal SIGUSR2 to process 18743746 postgres:
14E postgres: 18743746 85000571 17.902007437 2992.131623 kill: signal SIGUSR1 to process 25166296 postgres
14E --1- -1 94437835 17.903004949 0.997512 kill: signal SIGUSR2 to process 18743746 postgres:
14E --1- -1 94437835 17.935897005 32.892056 kill: signal SIGUSR2 to process 18743746 postgres:
14E postgres: 18743746 85000571 28.001327251 3091.401199 kill: signal SIGUSR1 to process 25166296 postgres
14E --1- -1 40042983 28.002307781 0.980530 kill: signal SIGUSR2 to process 18743746 postgres:
14E --1- -1 40042983 28.032432646 30.124865 kill: signal SIGUSR2 to process 18743746 postgres:
14E postgres: 18743746 85000571 37.901060572 2991.083160 kill: signal SIGUSR1 to process 25166296 postgres
14E --1- -1 88539511 37.902072470 1.011898 kill: signal SIGUSR2 to process 18743746 postgres:
14E --1- -1 88539511 37.936426058 34.353588 kill: signal SIGUSR2 to process 18743746 postgres:
I do not observe this with V8.x servers.
This stupid problem is getting on my nerves!!
Bye
Rainer
On 25.08.2021 17:13, Tom Lane wrote:
Rainer Tammer <pgsql@spg.schulergroup.com> writes:
> 2021-08-25 16:22:33 CEST DEBUG: postmaster received signal 2
> 2021-08-25 16:22:33 CEST LOG: received fast shutdown request

Well, something sent the postmaster SIGINT. There isn't any mechanism within Postgres itself that would do that; you need to look for outside causes.

regards, tom lane