BUG #16199: pg_restore stuck on interrupts - Mailing list pgsql-bugs
From | PG Bug reporting form |
---|---|
Subject | BUG #16199: pg_restore stuck on interrupts |
Date | |
Msg-id | 16199-cb2f121146a96f9b@postgresql.org Whole thread Raw |
Responses |
Re: BUG #16199: pg_restore stuck on interrupts
|
List | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 16199 Logged by: Raúl Marín Email address: git@rmr.ninja PostgreSQL version: 11.5 Operating system: Linux Description: We are seeing stuck pg_restore processes in several of our CI servers, both with PG10 (10.2) and PG11 (11.5). These servers test an external application which uses pg_restore quite often to set up test environments. A sample call might look like this: ``` /usr/lib/postgresql/10/bin/pg_restore -e --verbose -j4 --disable-triggers -Fc /tmp/user_imports/1113d139-22c6-4dd9-8ae6-836343ad6854/data/org_cd2b2fb0-96f9-476f-9589-b1d5e0b93957.dump -U postgres -h localhost -p 5432 -d cartodb_test_user_684a0497-9358-49ae-89b7-5feea5a6b7d7_db --section=post-data --use-list=/tmp/user_imports/1113d139-22c6-4dd9-8ae6-836343ad6854/data/user_org44000054-admin.list ``` With the appropriate symbols I can attach to the process and extract the callstack (not always fully available): ``` (gdb) bt #0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95 #1 0x00007f4b03e152bd in _int_free (av=0x7f4b04159b20 <main_arena>, p=0x55d43590e3b0, have_lock=0) at malloc.c:3962 #2 0x00007f4b03e1953c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968 #3 0x00007f4b01bbf464 in _gnutls_ext_deinit () at gnutls_extensions.c:405 #4 0x00007f4b01bc6720 in _gnutls_global_deinit (destructor=1) at gnutls_global.c:389 #5 0x00007f4b047eede7 in _dl_fini () at dl-fini.c:235 #6 0x00007f4b03dceff8 in __run_exit_handlers (status=status@entry=1, listp=0x7f4b041595f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82 #7 0x00007f4b03dcf045 in __GI_exit (status=status@entry=1) at exit.c:104 #8 0x000055d4337dbb5e in sigTermHandler (postgres_signal_arg=<optimized out>) at /build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/parallel.c:613 #9 <signal handler called> #10 _int_free (av=0x7f4b04159b20 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:4017 #11 0x00007f4b03e1953c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968 #12 0x00007f4b03e1054b in __GI__IO_setb (f=f@entry=0x55d435914240, b=b@entry=0x0, eb=eb@entry=0x0, a=a@entry=0) at genops.c:382 #13 0x00007f4b03e0e8ee in _IO_new_file_close_it (fp=fp@entry=0x55d435914240) at fileops.c:185 #14 0x00007f4b03e023ef in _IO_new_fclose (fp=0x55d435914240) at iofclose.c:58 #15 0x000055d4337d798e in _ReopenArchive (AH=0x55d435aa94e0) at /build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/pg_backup_custom.c:763 #16 0x000055d4337dc3af in RunWorker (slot=0x55d435ada4b8, AH=0x55d435aa94e0) at /build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/parallel.c:866 #17 ParallelBackupStart (AH=AH@entry=0x55d43590fe00) at /build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/parallel.c:1027 #18 0x000055d4337d4418 in RestoreArchive (AHX=0x55d43590fe00) at /build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/pg_backup_archiver.c:649 #19 0x000055d4337ce9ca in main (argc=17, argv=<optimized out>) at /build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/pg_restore.c:429 ``` I have a some extra processes with the same issue (7 full stacks out of 20, the others are garbage) and, from what I see, they all have in common that the process has received a signal while it was doing a memory operation, either a malloc or a free: * Malloc example: ``` #8 0x00005634a8ebeb5e in sigTermHandler (postgres_signal_arg=<optimized out>) at /build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/parallel.c:613 #9 <signal handler called> #10 _int_malloc (av=av@entry=0x7fce15142b20 <main_arena>, bytes=bytes@entry=17584) at malloc.c:3644 #11 0x00007fce14e02184 in __GI___libc_malloc (bytes=17584) at malloc.c:2913 #12 0x00007fce14732e78 in CRYPTO_malloc () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #13 0x00007fce14b3dafd in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #14 0x00007fce14b3db68 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #15 0x00007fce14b3fb6c in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #16 0x00007fce155a4fb5 in open_client_SSL (conn=0x5634a9a1a370) ``` * Free example: ``` #8 0x0000563455b60e7e in sigTermHandler (postgres_signal_arg=<optimized out>) at /build/postgresql-11-3orpls/postgresql-11-11.5.2+carto/build/../src/bin/pg_dump/parallel.c:615 #9 <signal handler called> #10 0x00007fbba21bb8e4 in _int_free (av=0x7fbba24ffb20 <main_arena>, p=0x5634579993b0, have_lock=0) at malloc.c:3976 #11 0x00007fbba21bf53c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968 #12 0x0000563455b6140e in set_archive_cancel_info (AH=AH@entry=0x5634579ad350, conn=conn@entry=0x0) ``` After that, the process gives control to `sigTermHandler` which calls `exit(1)` to die, but that gives control to whoever has setup atexit functions, in our case `gnutls`. Once gnutls tries to call `free` as part of its cleanup process, it will try to acquire the lock and block, but since the lock is being held by the function interrupted by the signal the process self locks and hangs forever. I think if would be safer to use a similar approach to other processes, that is use the handler to only enable a global flag and check that in the main loop, but I'm having a hard time locating what the proper place to check the flag would be. Regards, Raúl Marín.
pgsql-bugs by date: