Thread: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
The following bug has been logged online: Bug reference: 5321 Logged by: Jerry Gamache Email address: jerry.gamache@idilia.com PostgreSQL version: 8.4.2 Operating system: Linux Description: Parallel restore temporarily deadlocked by autovacuum analyze Details: While restoring a database in 8.4.2, pg_restore stopped for more than 30 minutes doing nothing: pg_restore -d database1 -j 8 -v database1.pg_dump The parallel restore showed very impressive performance as the verbose output went by, but the restore suddenly stopped. The output of "ps" after 15 minutes of 0% CPU activity showed 4 process in "Ss" state: postgres: xxx database1 [local] CREATE INDEX waiting postgres: xxx database1 [local] CREATE INDEX waiting postgres: autovacuum worker process database1 postgres: autovacuum worker process database1 The output of pg_stat_activity for database1 showed: database1=# select procpid,datname,current_query from pg_stat_activity where datname='database1' ORDER BY procpid; procpid | datname | current_query ---------+------------------+----------------------------------------------- ---------------------------------------------------------------- 6146 | database1 | autovacuum: ANALYZE public.table_x 6270 | database1 | : CREATE UNIQUE INDEX index1 ON table_x USING btree (col_a, col_b); 6271 | database1 | : CREATE UNIQUE INDEX index2 ON table_x USING btree (col_b, col_c, col_d); 6307 | database1 | autovacuum: ANALYZE public.table_y 6597 | database1 | <IDLE> 6626 | database1 | select procpid,datname,current_query from pg_stat_activity where datname='database1' ORDER BY procpid; (6 rows) The output of pg_locks for the 4 stuck PIDs showed: database1=# select * from pg_locks where pid IN (SELECT procpid FROM pg_stat_activity WHERE datname='database1') ORDER BY pid; locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted ------------+----------+----------+------+-------+------------+------------- --+---------+-------+----------+--------------------+------+---------------- ----------+--------- relation | 16993 | 17597 | | | | | | | | 11/57 | 6146 | AccessShareLock | t relation | 16993 | 17616 | | | | | | | | 11/57 | 6146 | AccessShareLock | t relation | 16993 | 17663 | | | | | | | | 11/57 | 6146 | AccessShareLock | t virtualxid | | | | | 11/57 | | | | | 11/57 | 6146 | ExclusiveLock | t relation | 16993 | 17165 | | | | | | | | 11/57 | 6146 | ShareUpdateExclusiveLock | t virtualxid | | | | | 13/100 | | | | | 13/100 | 6270 | ExclusiveLock | t relation | 16993 | 17165 | | | | | | | | 13/100 | 6270 | ShareLock | f relation | 16993 | 17165 | | | | | | | | 14/50 | 6271 | ShareLock | f virtualxid | | | | | 14/50 | | | | | 14/50 | 6271 | ExclusiveLock | t relation | 16993 | 17188 | | | | | | | | 3/844 | 6307 | ShareUpdateExclusiveLock | t virtualxid | | | | | 3/844 | | | | | 3/844 | 6307 | ExclusiveLock | t relation | 16993 | 17660 | | | | | | | | 3/844 | 6307 | AccessShareLock | t relation | 16993 | 17590 | | | | | | | | 3/844 | 6307 | AccessShareLock | t relation | 0 | 2677 | | | | | | | | 7/187 | 6626 | AccessShareLock | t relation | 0 | 2676 | | | | | | | | 7/187 | 6626 | AccessShareLock | t relation | 16993 | 11042 | | | | | | | | 7/187 | 6626 | AccessShareLock | t relation | 0 | 2671 | | | | | | | | 7/187 | 6626 | AccessShareLock | t relation | 0 | 2672 | | | | | | | | 7/187 | 6626 | AccessShareLock | t virtualxid | | | | | 7/187 | | | | | 7/187 | 6626 | ExclusiveLock | t relation | 16993 | 10969 | | | | | | | | 7/187 | 6626 | AccessShareLock | t relation | 0 | 1262 | | | | | | | | 7/187 | 6626 | AccessShareLock | t relation | 0 | 1260 | | | | | | | | 7/187 | 6626 | AccessShareLock | t (22 rows) The restore resumed while I was writing this report, and I saw these new entries in the logs: ERROR: canceling autovacuum task CONTEXT: automatic analyze of table "database1.public.table_y" ERROR: canceling autovacuum task CONTEXT: automatic analyze of table "database1.public.table_y"
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From
Alvaro Herrera
Date:
Jerry Gamache wrote: > The restore resumed while I was writing this report, and I saw these new > entries in the logs: > > ERROR: canceling autovacuum task > CONTEXT: automatic analyze of table "database1.public.table_y" > ERROR: canceling autovacuum task > CONTEXT: automatic analyze of table "database1.public.table_y" This is what I would have expected to happen almost immediately ... what are your autovacuum settings? And what's your deadlock timeout? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Jerry Gamache wrote: >> The restore resumed while I was writing this report, and I saw these new >> entries in the logs: >> >> ERROR: canceling autovacuum task >> CONTEXT: automatic analyze of table "database1.public.table_y" >> ERROR: canceling autovacuum task >> CONTEXT: automatic analyze of table "database1.public.table_y" > This is what I would have expected to happen almost immediately ... what > are your autovacuum settings? And what's your deadlock timeout? The other details showed the blockage as being on table_x though. Was that a typo in obscuring the details, or did the cancel really happen on an unrelated table? regards, tom lane
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From
Alvaro Herrera
Date:
Jerry Gamache wrote: > I was also surprised that table_y seemed to be involved. This is not > a typo. Might be caused by a FK constraint between table_y and > table_x. From the logs, the autovacuum on table_x was canceled > before the one on table_y, but the restore only resumed after the > autovacuum on table_y was canceled. It is possible (but I cannot > confirm) that the autovacuum thread on table_x was blocked for a > while after the cancellation message was written to the log. I added > timestamps to log_line_prefix to be able to give more details if > this happens again. Could you try to restore the whole dump again and see if it you can reproduce it? Maybe decreasing autovacuum_naptime makes it more probable. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From
Alvaro Herrera
Date:
Jerry Gamache wrote: > I was not able to repro with default parameters, or at 15s naptime, > and at 1s naptime I got only 1deadlock in 3 tests. > > This time the deadlock was with table_a, table_b and table_c > (table_x and table_y were not involved). > > 18395 | database1 | autovacuum: ANALYZE public.table_a > 18406 | database1 | autovacuum: ANALYZE public.table_b > 18510 | database1 | > : CREATE UNIQUE INDEX index_bg ON table_b > USING btree (col_g); > 18567 | database1 | autovacuum: ANALYZE public.table_c > 18802 | database1 | select procpid,datname,current_query from > pg_stat_activity where datname='database1' ORDER BY procpid; > > There is a FK constraint between table_a and table_b, but table_c > does not have any direct constraint relation with the other 2 > tables. > > The logs show that the autovacuum of table_b was canceled 20 minutes > ago, but the thread is still alive and blocked. That's pretty strange. Can we see a pg_locks snapshot? (Please attach as a text file so that it doesn't get word-wrapped) -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From
Alvaro Herrera
Date:
Jerry Gamache wrote: > Here is the pg_locks output. > > Alvaro Herrera wrote: > >Jerry Gamache wrote: > >>The logs show that the autovacuum of table_b was canceled 20 minutes > >>ago, but the thread is still alive and blocked. Well, it's clearly locked on table_b, and that autovac is still running ... maybe it was a previous run that was cancelled? Do you have the PID in log_line_prefix? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From
Alvaro Herrera
Date:
Jerry Gamache wrote: > Yes, I have PID in the logs now. Problem was observed around 13h30, > and there was no log output between 13h18 and 13h30. > There are messages for table_b (pid 18350) and table_c (pid 18406), > but none for table_a. Eh, but according to the pg_locks snap you posted, the PID holding the table_b lock is (was?) 20490. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From
Jerry Gamache
Date:
All autovacuum and deadlock_timeout settings are at their default values (lines are commented out in postgresql.conf). Alvaro Herrera wrote: > Jerry Gamache wrote: > > >> The restore resumed while I was writing this report, and I saw these new >> entries in the logs: >> >> ERROR: canceling autovacuum task >> CONTEXT: automatic analyze of table "database1.public.table_y" >> ERROR: canceling autovacuum task >> CONTEXT: automatic analyze of table "database1.public.table_y" >> > > This is what I would have expected to happen almost immediately ... what > are your autovacuum settings? And what's your deadlock timeout? > >
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From
Jerry Gamache
Date:
I was also surprised that table_y seemed to be involved. This is not a typo. Might be caused by a FK constraint between table_y and table_x. From the logs, the autovacuum on table_x was canceled before the one on table_y, but the restore only resumed after the autovacuum on table_y was canceled. It is possible (but I cannot confirm) that the autovacuum thread on table_x was blocked for a while after the cancellation message was written to the log. I added timestamps to log_line_prefix to be able to give more details if this happens again. Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > >> Jerry Gamache wrote: >> >>> The restore resumed while I was writing this report, and I saw these new >>> entries in the logs: >>> >>> ERROR: canceling autovacuum task >>> CONTEXT: automatic analyze of table "database1.public.table_y" >>> ERROR: canceling autovacuum task >>> CONTEXT: automatic analyze of table "database1.public.table_y" >>> > > >> This is what I would have expected to happen almost immediately ... what >> are your autovacuum settings? And what's your deadlock timeout? >> > > The other details showed the blockage as being on table_x though. > Was that a typo in obscuring the details, or did the cancel really > happen on an unrelated table? > > regards, tom lane > >
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From
Jerry Gamache
Date:
I was not able to repro with default parameters, or at 15s naptime, and at 1s naptime I got only 1deadlock in 3 tests. This time the deadlock was with table_a, table_b and table_c (table_x and table_y were not involved). 18395 | database1 | autovacuum: ANALYZE public.table_a 18406 | database1 | autovacuum: ANALYZE public.table_b 18510 | database1 | : CREATE UNIQUE INDEX index_bg ON table_b USING btree (col_g); 18567 | database1 | autovacuum: ANALYZE public.table_c 18802 | database1 | select procpid,datname,current_query from pg_stat_activity where datname='database1' ORDER BY procpid; There is a FK constraint between table_a and table_b, but table_c does not have any direct constraint relation with the other 2 tables. The logs show that the autovacuum of table_b was canceled 20 minutes ago, but the thread is still alive and blocked. Alvaro Herrera wrote: > Jerry Gamache wrote: > >> I was also surprised that table_y seemed to be involved. This is not >> a typo. Might be caused by a FK constraint between table_y and >> table_x. From the logs, the autovacuum on table_x was canceled >> before the one on table_y, but the restore only resumed after the >> autovacuum on table_y was canceled. It is possible (but I cannot >> confirm) that the autovacuum thread on table_x was blocked for a >> while after the cancellation message was written to the log. I added >> timestamps to log_line_prefix to be able to give more details if >> this happens again. >> > > Could you try to restore the whole dump again and see if it you can > reproduce it? Maybe decreasing autovacuum_naptime makes it more > probable. > >
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From
Jerry Gamache
Date:
Here is the pg_locks output. Alvaro Herrera wrote: > Jerry Gamache wrote: > >> I was not able to repro with default parameters, or at 15s naptime, >> and at 1s naptime I got only 1deadlock in 3 tests. >> >> This time the deadlock was with table_a, table_b and table_c >> (table_x and table_y were not involved). >> >> 18395 | database1 | autovacuum: ANALYZE public.table_a >> 18406 | database1 | autovacuum: ANALYZE public.table_b >> 18510 | database1 | >> : CREATE UNIQUE INDEX index_bg ON table_b >> USING btree (col_g); >> 18567 | database1 | autovacuum: ANALYZE public.table_c >> 18802 | database1 | select procpid,datname,current_query from >> pg_stat_activity where datname='database1' ORDER BY procpid; >> >> There is a FK constraint between table_a and table_b, but table_c >> does not have any direct constraint relation with the other 2 >> tables. >> >> The logs show that the autovacuum of table_b was canceled 20 minutes >> ago, but the thread is still alive and blocked. >> > > That's pretty strange. Can we see a pg_locks snapshot? (Please attach > as a text file so that it doesn't get word-wrapped) > > database1=# select locktype,database,relation,virtualxid,virtualtransaction,pid,mode,granted from pg_locks where pid IN (SELECTprocpid FROM pg_stat_activity WHERE datname='database1') ORDER BY pid; locktype | database | relation | virtualxid | virtualtransaction | pid | mode | granted ------------+----------+----------+------------+--------------------+-------+--------------------------+--------- relation | 20162 | 20893 | | 4/72 | 18395 | AccessShareLock | t virtualxid | | | 4/72 | 4/72 | 18395 | ExclusiveLock | t relation | 20162 | 20829 | | 4/72 | 18395 | AccessShareLock | t relation | 20162 | 20894 | | 4/72 | 18395 | AccessShareLock | t relation | 20162 | 20892 | | 4/72 | 18395 | AccessShareLock | t relation | 20162 | 20515 | | 4/72 | 18395 | ShareUpdateExclusiveLock | t relation | 20162 | 20891 | | 3/53 | 18406 | AccessShareLock | t relation | 20162 | 20813 | | 3/53 | 18406 | AccessShareLock | t relation | 20162 | 20490 | | 3/53 | 18406 | ShareUpdateExclusiveLock | t virtualxid | | | 3/53 | 3/53 | 18406 | ExclusiveLock | t relation | 20162 | 20490 | | 8/162 | 18510 | ShareLock | f virtualxid | | | 8/162 | 8/162 | 18510 | ExclusiveLock | t relation | 20162 | 20247 | | 7/238 | 18567 | ShareUpdateExclusiveLock | t relation | 20162 | 20872 | | 7/238 | 18567 | AccessShareLock | t relation | 20162 | 20881 | | 7/238 | 18567 | AccessShareLock | t relation | 20162 | 20880 | | 7/238 | 18567 | AccessShareLock | t relation | 20162 | 20878 | | 7/238 | 18567 | AccessShareLock | t relation | 20162 | 20873 | | 7/238 | 18567 | AccessShareLock | t virtualxid | | | 7/238 | 7/238 | 18567 | ExclusiveLock | t relation | 20162 | 20876 | | 7/238 | 18567 | AccessShareLock | t relation | 20162 | 20882 | | 7/238 | 18567 | AccessShareLock | t relation | 20162 | 20879 | | 7/238 | 18567 | AccessShareLock | t relation | 20162 | 20797 | | 7/238 | 18567 | AccessShareLock | t relation | 20162 | 20877 | | 7/238 | 18567 | AccessShareLock | t relation | 0 | 2676 | | 1/611 | 18802 | AccessShareLock | t relation | 0 | 2672 | | 1/611 | 18802 | AccessShareLock | t virtualxid | | | 1/611 | 1/611 | 18802 | ExclusiveLock | t relation | 0 | 1262 | | 1/611 | 18802 | AccessShareLock | t relation | 20162 | 11042 | | 1/611 | 18802 | AccessShareLock | t relation | 0 | 2671 | | 1/611 | 18802 | AccessShareLock | t relation | 0 | 2677 | | 1/611 | 18802 | AccessShareLock | t relation | 20162 | 10969 | | 1/611 | 18802 | AccessShareLock | t relation | 0 | 1260 | | 1/611 | 18802 | AccessShareLock | t (33 rows) database1=# select procpid,datname,current_query from pg_stat_activity where datname='database1' ORDER BY procpid; procpid | datname | current_query ---------+------------------+--------------------------------------------------------------------------------------------------------------- 18395 | database1 | autovacuum: ANALYZE public.table_a 18406 | database1 | autovacuum: ANALYZE public.table_b 18510 | database1 | : CREATE UNIQUE INDEX index_bg ON table_b USING btree (col_g); 18567 | database1 | autovacuum: ANALYZE public.table_c 18802 | database1 | select procpid,datname,current_query from pg_stat_activity where datname='database1' ORDER BY procpid; (5 rows)
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From
Jerry Gamache
Date:
Yes, I have PID in the logs now. Problem was observed around 13h30, and there was no log output between 13h18 and 13h30. There are messages for table_b (pid 18350) and table_c (pid 18406), but none for table_a. Alvaro Herrera wrote: >>>> The logs show that the autovacuum of table_b was canceled 20 minutes >>>> ago, but the thread is still alive and blocked. >>>> > > Well, it's clearly locked on table_b, and that autovac is still running > ... maybe it was a previous run that was cancelled? Do you have the PID > in log_line_prefix? > > 2010-02-10 13:10:26 EST::@:[18301]:ERROR: canceling autovacuum task 2010-02-10 13:10:26 EST::@:[18301]:CONTEXT: automatic analyze of table "database1.public.table_d" 2010-02-10 13:10:42 EST::@:[18350]:ERROR: canceling autovacuum task 2010-02-10 13:10:42 EST::@:[18350]:CONTEXT: automatic analyze of table "database1.public.table_e" 2010-02-10 13:10:58 EST::@:[18320]:ERROR: canceling autovacuum task 2010-02-10 13:10:58 EST::@:[18320]:CONTEXT: automatic analyze of table "database1.public.table_f" 2010-02-10 13:11:00 EST::@:[18350]:ERROR: canceling autovacuum task 2010-02-10 13:11:00 EST::@:[18350]:CONTEXT: automatic analyze of table "database1.public.table_g" 2010-02-10 13:11:08 EST::@:[18320]:ERROR: canceling autovacuum task 2010-02-10 13:11:08 EST::@:[18320]:CONTEXT: automatic analyze of table "database1.public.table_h" 2010-02-10 13:11:17 EST::@:[18350]:ERROR: canceling autovacuum task 2010-02-10 13:11:17 EST::@:[18350]:CONTEXT: automatic analyze of table "database1.public.table_i" 2010-02-10 13:11:22 EST::@:[18320]:ERROR: canceling autovacuum task 2010-02-10 13:11:22 EST::@:[18320]:CONTEXT: automatic analyze of table "database1.public.table_j" 2010-02-10 13:11:42 EST::@:[18350]:ERROR: canceling autovacuum task 2010-02-10 13:11:42 EST::@:[18350]:CONTEXT: automatic analyze of table "database1.public.table_b" 2010-02-10 13:13:39 EST::@:[18406]:ERROR: canceling autovacuum task 2010-02-10 13:13:39 EST::@:[18406]:CONTEXT: automatic analyze of table "database1.public.table_e" 2010-02-10 13:13:39 EST::@:[18395]:ERROR: canceling autovacuum task 2010-02-10 13:13:39 EST::@:[18395]:CONTEXT: automatic analyze of table "database1.public.table_g" 2010-02-10 13:14:07 EST::@:[18406]:ERROR: canceling autovacuum task 2010-02-10 13:14:07 EST::@:[18406]:CONTEXT: automatic analyze of table "database1.public.table_c" 2010-02-10 13:14:49 EST::@:[18395]:ERROR: canceling autovacuum task 2010-02-10 13:14:49 EST::@:[18395]:CONTEXT: automatic analyze of table "database1.public.table_h" 2010-02-10 13:14:49 EST::@:[18406]:ERROR: canceling autovacuum task 2010-02-10 13:14:49 EST::@:[18406]:CONTEXT: automatic analyze of table "database1.public.table_k" 2010-02-10 13:14:49 EST::@:[18350]:ERROR: canceling autovacuum task 2010-02-10 13:14:49 EST::@:[18350]:CONTEXT: automatic analyze of table "database1.public.table_l" 2010-02-10 13:16:25 EST::@:[18350]:ERROR: canceling autovacuum task 2010-02-10 13:16:25 EST::@:[18350]:CONTEXT: automatic analyze of table "database1.public.table_m" 2010-02-10 13:17:13 EST::@:[18395]:ERROR: canceling autovacuum task 2010-02-10 13:17:13 EST::@:[18395]:CONTEXT: automatic analyze of table "database1.public.table_i" 2010-02-10 13:18:20 EST::@:[18567]:ERROR: canceling autovacuum task 2010-02-10 13:18:20 EST::@:[18567]:CONTEXT: automatic analyze of table "database1.public.table_e" 2010-02-10 13:18:31 EST::@:[18395]:ERROR: canceling autovacuum task 2010-02-10 13:18:31 EST::@:[18395]:CONTEXT: automatic analyze of table "database1.public.table_l"
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From
Jerry Gamache
Date:
Alvaro Herrera wrote: > Jerry Gamache wrote: > >> Yes, I have PID in the logs now. Problem was observed around 13h30, >> and there was no log output between 13h18 and 13h30. >> There are messages for table_b (pid 18350) and table_c (pid 18406), >> but none for table_a. >> > > Eh, but according to the pg_locks snap you posted, the PID holding the > table_b lock is (was?) 20490. > > 20490 is the relid of table_b. The PID holding the lock on table_b is 18406.