BUG #4692: VACUUM: write to WAL gets very slow and seems redundant - Mailing list pgsql-bugs
From | Peter Much |
---|---|
Subject | BUG #4692: VACUUM: write to WAL gets very slow and seems redundant |
Date | |
Msg-id | 200903041628.n24GSfHL064596@wwwmaster.postgresql.org Whole thread Raw |
List | pgsql-bugs |
The following bug has been logged online: Bug reference: 4692 Logged by: Peter Much Email address: pmc@citylink.dinoex.sub.org PostgreSQL version: 8.2.7 Operating system: FreeBSD 6.3 Description: VACUUM: write to WAL gets very slow and seems redundant Details: While most of the time the system does run very fine, sometimes (i perceived it only during VACUUM) the throughput breaks down dramatically: while normally the VACUUM fills multiple WAL logs per minute, it then takes 10 minutes to fill each WAL log, while the system is rather idle: Normal: Mar 4 03:18:29 <local0.info> edge postgres[1781]: [11-1] :[] LOG: archived transaction log file "000000020000004200000097" Mar 4 03:19:09 <local0.info> edge postgres[1781]: [12-1] :[] LOG: archived transaction log file "000000020000004200000098" Mar 4 03:20:26 <local0.info> edge postgres[1781]: [13-1] :[] LOG: archived transaction log file "000000020000004200000099" Mar 4 03:21:11 <local0.info> edge postgres[1781]: [14-1] :[] LOG: archived transaction log file "00000002000000420000009A" Defect: Mar 4 10:16:16 <local0.info> edge postgres[1781]: [84-1] :[] LOG: archived transaction log file "0000000200000042000000E0" Mar 4 10:25:19 <local0.info> edge postgres[1781]: [85-1] :[] LOG: archived transaction log file "0000000200000042000000E1" Mar 4 10:37:20 <local0.info> edge postgres[1781]: [86-1] :[] LOG: archived transaction log file "0000000200000042000000E2" Mar 4 10:44:04 <local0.info> edge postgres[1781]: [87-1] :[] LOG: archived transaction log file "0000000200000042000000E3" Looking closer at this. 1.) The only process being active on the system is the VACUUM (FULL) task: 6322 ?? Ds 17:16.52 postgres: pgsql bacula 192.168.98.6(50283) VACUUM (po 2.) the system load is near idle. 3.) "top" shows the vacuum process waiting in "bo_waa" (or "biowr") nearly all the time: PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND 6322 pgsql 1 -5 0 76596K 44324K bo_wwa 17:22 1.03% postgres 4.) The disk with the WAL logs is 98% busy, and the only file that is read or written there is the current WAL log: [root@edge /var/pgsql/data]# find . -type f -exec ls -lu {} \; | grep "Mar 4 12:4" [root@edge /var/pgsql/data]# find . -type f -exec ls -l {} \; | grep "Mar 4 12:4" -rw------- 1 pgsql pgsql 16777216 Mar 4 12:41 ./pg_xlog/0000000200000042000000F0 5.) lsof shows the file-pointer in the WAL log proceeding: [root@edge /var/pgsql/data]# lsof -o pg_xlog/0000000200000042000000F0 ; sleep 10; lsof -o pg_xlog/0000000200000042000000F0 COMMAND PID USER FD TYPE DEVICE OFFSET NODE NAME postgres 1780 pgsql 9u VREG 0,167 0t8585216 14344 pg_xlog/0000000200000042000000F0 postgres 6322 pgsql 11u VREG 0,167 0t14090240 14344 pg_xlog/0000000200000042000000F0 COMMAND PID USER FD TYPE DEVICE OFFSET NODE NAME postgres 1780 pgsql 9u VREG 0,167 0t8585216 14344 pg_xlog/0000000200000042000000F0 postgres 6322 pgsql 11u VREG 0,167 0t14524416 14344 pg_xlog/0000000200000042000000F0 So, during 10 seconds the 6322 process has proceeded 434176 bytes, giving 43 kBps. 6.) iostat shows the WAL disk processing ~60 accesses per second, 16kB each (while the database disk does nearly nothing): WAL KB/t tps MB/s 16.70 69 1.12 16.76 63 1.03 16.00 45 0.70 16.00 58 0.90 16.71 68 1.10 This shows about 1000 kBps. iostat does not show me which process does this activity, but since the VACUUM process 6322 is the only active process, and since this process is in "biowr" state all the time, the activity could hardly come from an other process. Evaluating, we have ~1000 kBps thruput but only 43 kBps write advance on the WAL log, this gives about factor 1:20. And from the logs we see: about 30secs per WAL log in normal operation, and about 10mins in the defect state, this is also 1:20. So it seems I am looking at the right thing, and there is no problem on the system. The only question is: what the **** is this VACUUM process doing? When killing the 6322 process, the next task behaves normal: WAL DB KB/t tps MB/s KB/t tps MB/s 82.67 3 0.24 22.79 197 4.39 31.25 8 0.24 21.16 154 3.19 26.00 8 0.20 21.43 164 3.44 44.80 5 0.22 24.44 143 3.42 22.40 5 0.11 23.85 162 3.78 24.00 4 0.09 22.45 195 4.28 The behaviour occurs only occasionally; as we can see from the postgres log above, at 3AM the VACUUM was running fine, only later it started with this behaviour - for no apparent reason.
pgsql-bugs by date: