Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid` - Mailing list pgsql-bugs
From | Brian Ghidinelli |
---|---|
Subject | Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid` |
Date | |
Msg-id | F7A6B776-6DD9-466E-822D-C28CBC3A8A2B@vfive.com Whole thread Raw |
Responses |
Re: BUG #13970: Vacuum hangs on particular table; cannot be
terminated - requires `kill -QUIT pid`
|
List | pgsql-bugs |
=C3=81lvaro Herrera wrote: > This smells like it's looping waiting for a multixact to be fully > written out ... except that the uninterruptibility part of that was = fixed in > time for 9.4.0, >=20 > Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> > Branch: master Release: REL9_5_BR [51f9ea25d] 2014-11-14 15:14:01 = -0300 > Branch: REL9_4_STABLE Release: REL9_4_0 [137e4da6d] 2014-11-14 = 15:14:02 -0300 > Branch: REL9_3_STABLE Release: REL9_3_6 [d45e8dc52] 2014-11-14 = 15:14:02 -0300 >=20 > = http://git.postgresql.org/gitweb/?p=3Dpostgresql.git;a=3Dcommit;h=3D137e4d= a6d >=20 >=20 > Can you attach to the looping process with gdb when it's doing the > select() dance, and obtain a backtrace? You need debug symbols > installed. Alvaro, thanks for the quick response. I=E2=80=99ve done some additional = research and found your involvement on other threads in 2014 and 2015 = that have allowed me to collect more supporting details. In a nutshell, = it appears to be multixid related and /maybe/ pg_upgrade related; I = don=E2=80=99t know if we used it on the 9.3 branch or not. I don=E2=80=99t= have the debug symbols but I did attach gdb and could see it was = working on multixid. I don=E2=80=99t have the ability to compile from = source on this production server unfortunately but I=E2=80=99m happy to = do whatever else I can to help the team find a solution. First, I checked the pg_upgrade changelog note in 9.4.4 and here=E2=80=99s= what I found in my install: * pg_controldata DOES report =E2=80=9CLatest checkpoint=E2=80=99s = oldestMultiXid=E2=80=9D =3D 1 * There IS a file named 0000 in PGDATA/pg_multixact/offsets pg_controldata: # /usr/pgsql-9.4/bin/pg_controldata ~postgres/9.4/data/ pg_control version number: 942 Catalog version number: 201409291 Database system identifier: 6123001933830237052 Database cluster state: in production pg_control last modified: Thu 18 Feb 2016 09:53:29 AM PST Latest checkpoint location: 130/6FBD9E58 Prior checkpoint location: 130/6F9ED950 Latest checkpoint's REDO location: 130/6FAEEFC0 Latest checkpoint's REDO WAL file: 00000001000001300000006F Latest checkpoint's TimeLineID: 1 Latest checkpoint's PrevTimeLineID: 1 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0/90358137 Latest checkpoint's NextOID: 5523460 Latest checkpoint's NextMultiXactId: 84342 Latest checkpoint's NextMultiOffset: 167545 Latest checkpoint's oldestXID: 654 Latest checkpoint's oldestXID's DB: 16416 Latest checkpoint's oldestActiveXID: 90358137 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 16416 Time of latest checkpoint: Thu 18 Feb 2016 09:52:31 AM PST Fake LSN counter for unlogged rels: 0/1 Minimum recovery ending location: 0/0 Min recovery ending loc's timeline: 0 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no Current wal_level setting: hot_standby Current wal_log_hints setting: off Current max_connections setting: 100 Current max_worker_processes setting: 8 Current max_prepared_xacts setting: 0 Current max_locks_per_xact setting: 64 Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers Float4 argument passing: by value Float8 argument passing: by value Data page checksum version: 0 the autovacuum log status: # select pg_class.relname, pg_locks.transactionid, pg_locks.mode, pg_locks.granted as "g", left(pg_stat_activity.query, 60) as query, pg_stat_activity.query_start, age(now(),pg_stat_activity.query_start) as "age", pg_stat_activity.pid=20 from pg_stat_activity,pg_locks left outer join pg_class on (pg_locks.relation =3D pg_class.oid) =20 where pg_locks.pid=3Dpg_stat_activity.pid and pg_stat_activity.pid =3D 14578 order by query_start; relname | transactionid | = mode | g | query | = query_start | age | pid =20 = -----------------------------------------+---------------+----------------= ----------+---+-----------------------------------------------+-----------= --------------------+-----------------+------- idx_clubmember_memberid | | = RowExclusiveLock | t | autovacuum: VACUUM = public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | = 06:55:04.883172 | 14578 idx_lookupclubmember_uidmember | | = RowExclusiveLock | t | autovacuum: VACUUM = public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | = 06:55:04.883172 | 14578 idx_tbllookupclubmember_uidclubofrecord | | = RowExclusiveLock | t | autovacuum: VACUUM = public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | = 06:55:04.883172 | 14578 xpktbllookupclubmember_mem | | = RowExclusiveLock | t | autovacuum: VACUUM = public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | = 06:55:04.883172 | 14578 tbllookupclubmember_pkey | | = RowExclusiveLock | t | autovacuum: VACUUM = public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | = 06:55:04.883172 | 14578 | | ExclusiveLock = | t | autovacuum: VACUUM public.tbllookupclubmember | = 2016-02-18 10:36:30.360324+00 | 06:55:04.883172 | 14578 tbllookupclubmember | | = ShareUpdateExclusiveLock | t | autovacuum: VACUUM = public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | = 06:55:04.883172 | 14578 gdb bt for pid 14578: Program received signal SIGINT, Interrupt. 0x000000376bace123 in __select_nocancel () from /lib64/libc.so.6 (gdb) bt #0 0x000000376bace123 in __select_nocancel () from /lib64/libc.so.6 #1 0x000000000077aa9a in pg_usleep () #2 0x00000000004b66af in GetMultiXactIdMembers () #3 0x0000000000487b93 in heap_prepare_freeze_tuple () #4 0x0000000000585e07 in lazy_vacuum_rel () #5 0x0000000000583a75 in ?? () #6 0x0000000000583e37 in vacuum () #7 0x0000000000616a06 in ?? () #8 0x00000000006171db in ?? () #9 0x0000000000617289 in StartAutoVacWorker () #10 0x0000000000624cf2 in ?? () #11 <signal handler called> #12 0x000000376bace123 in __select_nocancel () from /lib64/libc.so.6 #13 0x0000000000621f76 in ?? () #14 0x000000000062619a in PostmasterMain () #15 0x00000000005bd52d in main () (gdb) cont Continuing. A couple other things I saw you request in other threads: # select oid::regclass, relfrozenxid, relminmxid from pg_class where = relname =3D 'tbllookupclubmember'; oid | relfrozenxid | relminmxid=20 ---------------------+--------------+------------ tbllookupclubmember | 22915818 | 616 (1 row) # select datname, datfrozenxid, datminmxid from pg_database where = datname =3D 'prod'; datname | datfrozenxid | datminmxid=20 --------------------+--------------+------------ prod | 654 | 1 # ls -al ~postgres/9.4/data/pg_multixact/offsets/ total 368 drwx------ 2 postgres postgres 4096 Mar 5 2015 . drwx------ 4 postgres postgres 4096 Mar 5 2015 .. -rw------- 1 postgres postgres 262144 Mar 5 2015 0000 -rw------- 1 postgres postgres 81920 Feb 18 09:47 0001 In this thread: = http://postgresql.nabble.com/BUG-11264-Auto-vacuum-wraparound-job-blocking= -everything-td5816237.html, I saw you recommend this fix: > > How do we fix the current issue with this one server? Is there an = easy=20 > >fix?=20 > > Thanks.=20 >=20 > As far as I am aware, you should=20 > UPDATE pg_database SET datminmxid=3D20783=20 >=20 > and that should fix it. The oldestMulti value in pg_control would get=20= > updated by itself some time later. If you experience stalls before=20 > oldestMulti fixes itself, you could stop the server (cleanly!) and = then=20 > pg_resetxlog -m x,y where x is the correct nextMulti value from=20 > pg_controldata and y is 20783.=20 Is that something I should/could do here? I=E2=80=99ve temporarily = disabled autovacuum on the offending table until we can figure it out. Thank you! Brian
pgsql-bugs by date: