Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues - Mailing list pgsql-bugs
From | Jeff Frost |
---|---|
Subject | Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues |
Date | |
Msg-id | D2DE538F-32D2-4F70-9260-1FDAAE0F219D@pgexperts.com Whole thread Raw |
In response to | Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues (Jeff Frost <jeff@pgexperts.com>) |
Responses |
Re: BUG #7902: lazy cleanup of extraneous WAL files can cause
out of disk issues
|
List | pgsql-bugs |
On Feb 24, 2013, at 1:05 PM, Jeff Frost <jeff@pgexperts.com> wrote: >=20 > On Feb 24, 2013, at 7:16 AM, Rafael Martinez Guerrero = <r.m.guerrero@usit.uio.no> wrote: >=20 >> We reported this back in 2011, but we did not get to any conclusion: >> http://www.postgresql.org/message-id/4DE89072.7070305@usit.uio.no >>=20 >> In our case, we had this problem when creating a GIN index. >>=20 >> I think the problem has something to do with checkpoints, I think the >> number of WAL files will grow beyond the limit defined because the >> system can not be finish with checkpoints. A good start to try to >> explain what it is happening would be to try to identify or describe = the >> situations where checkpoints can take very long to complete or fail >> altogether. >>=20 >> We are interested in this subject because we have had this problem a = few >> times. But we have not been able to identify the reason that triggers = this. >=20 > Rafael, did you do a followup post on pgsql-performance, or did the = thread die at the end of the one you post above? This is how the log_checkpoint output looks during the run: 2013-02-24 21:06:31.156 UTC,,,1624,,51282598.658,114,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: immediate force = wait",,,,,,,,,"" 2013-02-24 21:06:31.216 UTC,,,1624,,51282598.658,115,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 108 buffers = (0.0%); 0 transaction log file(s) added, 0 removed, 4 recycled; = write=3D0.054 s, sync=3D0.002 s, total=3D0.059 s; sync files=3D10, = longest=3D0.000 s, average=3D0.000 s",,,,,,,,,"" So, this ^^^^^ is the manual checkpoint prior to the pgbench run and = everything else is part of the pgbench run. 2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,116,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoints are occurring too frequently (14 = seconds apart)",,"Consider increasing the configuration parameter = ""checkpoint_segments"".",,,,,,,"" 2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,117,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,"" 2013-02-24 21:08:35.239 UTC,,,1624,,51282598.658,118,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 266935 buffers = (12.7%); 0 transaction log file(s) added, 0 removed, 1 recycled; = write=3D10.889 s, sync=3D99.079 s, total=3D110.011 s; sync files=3D29, = longest=3D12.888 s, average=3D3.416 s",,,,,,,,,"" 2013-02-24 21:08:35.241 UTC,,,1624,,51282598.658,119,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,"" 2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,120,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 801517 buffers = (38.2%); 0 transaction log file(s) added, 0 removed, 128 recycled; = write=3D15.484 s, sync=3D148.739 s, total=3D164.266 s; sync files=3D32, = longest=3D11.773 s, average=3D4.648 s",,,,,,,,,"" 2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,121,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,"" 2013-02-24 21:14:20.486 UTC,,,1624,,51282598.658,122,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 1009466 buffers = (48.1%); 0 transaction log file(s) added, 140 removed, 257 recycled; = write=3D13.690 s, sync=3D167.245 s, total=3D180.980 s; sync files=3D33, = longest=3D14.442 s, average=3D5.067 s",,,,,,,,,"" 2013-02-24 21:14:20.487 UTC,,,1624,,51282598.658,123,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,"" 2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,124,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 1055059 buffers = (50.3%); 0 transaction log file(s) added, 259 removed, 257 recycled; = write=3D22.731 s, sync=3D168.239 s, total=3D191.020 s; sync files=3D32, = longest=3D13.650 s, average=3D5.257 s",,,,,,,,,"" 2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,125,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,"" 2013-02-24 21:23:36.201 UTC,,,1624,,51282598.658,126,,2013-02-23 = 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 814767 buffers = (38.9%); 0 transaction log file(s) added, 546 removed, 0 recycled; = write=3D269.553 s, sync=3D95.042 s, total=3D364.693 s; sync files=3D31, = longest=3D11.211 s, average=3D3.065 s",,,,,,,,,"" They're all number of xlog based and the first one doesn't remove any, = that's not surprising, but I'm surprised that the second one doesn't = remove any. It does recycle quite a few. I guess the issue is that it = can generate WAL files way faster than it can checkpoint and cleanup. I = think I need to write a script to output the number of WAL files each = minute so we can better correlate what's going on.=
pgsql-bugs by date: