Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation) - Mailing list pgsql-hackers
From: Craig Ringer
Subject: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date:
Msg-id: 5005FF9F.4040706@ringerc.id.au
In response to: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation) (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
           Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
List: pgsql-hackers
On 07/18/2012 06:56 AM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> BTW, while we are on the subject: hasn't this split completely broken
>>> the statistics about backend-initiated writes?
>> Yes, it seems to have done just that.
> So I went to fix this in the obvious way (attached), but while testing
> it I found that the number of buffers_backend events reported during
> a regression test run barely changed; which surprised the heck out of
> me, so I dug deeper. The cause turns out to be extremely scary:
> ForwardFsyncRequest isn't getting called at all in the bgwriter process,
> because the bgwriter process has a pendingOpsTable. So it just queues
> its fsync requests locally, and then never acts on them, since it never
> runs any checkpoints anymore.
>
> This implies that nobody has done pull-the-plug testing on either HEAD
> or 9.2 since the checkpointer split went in (2011-11-01)

That makes me wonder whether, on top of the buildfarm, we need to extend some buildfarm machines into a "crashfarm":

- Keep kvm instances with copy-on-write snapshot disks and the build env on them.
- Fire up the VM, do a build, and start the server.
- From outside the VM, have the test controller connect to the server and start a test run.
- Hard-kill the OS instance at a random point in time.
- Start the OS instance back up.
- Start Pg back up and connect to it again.
- From the test controller, test the Pg install for possible corruption by reading the indexes and tables, doing some test UPDATEs, etc.

The main challenge would be coming up with suitable tests to run, ones that could then be checked to make sure nothing was broken. The test controller would know which test it was running and how far it got before the OS was killed, so it'd be able to check for the expected data if provided with appropriate test metadata. Use of enable_ flags should permit scans of indexes and table heaps to be forced. (A rough sketch of such a controller loop, including enable_-forced checks, is below.) What else should be checked?

The main thing that comes to mind for me is something I've worried about for a while: that Pg might not always handle out-of-disk-space anywhere near as gracefully as it's often claimed to. There's no automated testing for that, so it's hard to really know. A harnessed VM could be used to test that too. Instead of virtual plug-pull tests, it could generate a virtual disk of constrained random size, run its tests until out-of-disk caused a failure, stop Pg, expand the disk, restart Pg, and run its checks. Variants where WAL is on a separate disk and only WAL, or only the main non-WAL disk, runs out of space would also make sense and be easy to produce with such a harness (also sketched below).
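To make the plug-pull loop above concrete, here is a minimal sketch of what the controller side could look like, under a pile of assumptions: a libvirt/kvm guest named "pg-crashtest" that boots from a copy-on-write snapshot and starts Pg on boot, a reachable DSN, and a throwaway table t. All of those names are invented for illustration, and psycopg2 is just the driver I'd reach for. The one substantive idea in it is that the controller counts acknowledged commits, which is exactly the "how far did the test get" metadata mentioned above:

    #!/usr/bin/env python
    # Sketch only: a crashfarm controller loop. Guest name, DSN, table
    # name and timings are illustrative, not a real harness.
    import random
    import subprocess
    import time

    import psycopg2

    GUEST = "pg-crashtest"    # libvirt domain name (assumed)
    DSN = "host=192.168.122.50 dbname=crashtest user=postgres"

    def wait_for_pg(timeout=300):
        # Poll until the guest has booted and Pg accepts connections.
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                return psycopg2.connect(DSN)
            except psycopg2.OperationalError:
                time.sleep(2)
        raise RuntimeError("guest never came back up")

    def run_until_kill(conn):
        # Generate WAL traffic, then hard-kill the whole OS at a random
        # moment ("virsh destroy" is the moral equivalent of a plug pull).
        # Returns the number of acknowledged commits: exactly the rows
        # that must survive the crash.
        kill_at = time.time() + random.uniform(5.0, 120.0)
        cur = conn.cursor()
        n = 0
        try:
            while True:
                cur.execute("INSERT INTO t(v) VALUES (%s)", (n,))
                conn.commit()
                n += 1
                if time.time() >= kill_at:
                    subprocess.check_call(["virsh", "destroy", GUEST])
        except psycopg2.OperationalError:
            return n  # connection died with the guest, as expected

    def check_consistency(conn, expected):
        # Every acknowledged commit must still be present, and an index
        # scan and a sequential heap scan over the same data must agree;
        # the enable_ flags force each access path in turn.
        cur = conn.cursor()
        cur.execute("SET enable_seqscan = off")
        cur.execute("SELECT count(*) FROM t WHERE v >= 0")
        by_index = cur.fetchone()[0]
        cur.execute("SET enable_seqscan = on")
        cur.execute("SET enable_indexscan = off")
        cur.execute("SET enable_bitmapscan = off")
        cur.execute("SELECT count(*) FROM t WHERE v >= 0")
        by_heap = cur.fetchone()[0]
        assert by_index == by_heap == expected, (by_index, by_heap, expected)

    if __name__ == "__main__":
        while True:
            subprocess.check_call(["virsh", "start", GUEST])
            conn = wait_for_pg()
            cur = conn.cursor()
            cur.execute("DROP TABLE IF EXISTS t")
            cur.execute("CREATE TABLE t (v int PRIMARY KEY)")
            conn.commit()
            committed = run_until_kill(conn)
            subprocess.check_call(["virsh", "start", GUEST])
            check_consistency(wait_for_pg(), committed)
            # Reset for the next round; a real harness would revert the
            # guest to its pristine snapshot here.
            subprocess.check_call(["virsh", "destroy", GUEST])

A real harness would run many different workloads and revert to the pristine snapshot between rounds rather than reusing guest state, but the shape of the loop would be the same.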
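The out-of-disk variant could reuse most of that machinery. Another rough sketch, reusing GUEST, wait_for_pg() and check_consistency() from above; DATA_IMG and the sizes are invented, and it glosses over initialising a fresh PGDATA on the new volume and growing the filesystem inside the guest to match the host-side resize:

    DATA_IMG = "/var/lib/libvirt/images/pg-data.qcow2"  # guest data volume (assumed)

    def out_of_disk_cycle():
        # Constrain the data volume to a random size so ENOSPC strikes
        # at an unpredictable point in the workload.
        size_mb = random.randint(64, 512)
        subprocess.check_call(["qemu-img", "create", "-f", "qcow2",
                               DATA_IMG, "%dM" % size_mb])
        subprocess.check_call(["virsh", "start", GUEST])
        conn = wait_for_pg()
        cur = conn.cursor()
        cur.execute("DROP TABLE IF EXISTS t")
        cur.execute("CREATE TABLE t (v int PRIMARY KEY)")
        conn.commit()
        n = 0
        try:
            while True:       # fill until the disk runs out
                cur.execute("INSERT INTO t(v) VALUES (%s)", (n,))
                conn.commit()
                n += 1
        except psycopg2.DatabaseError:
            # Expect "No space left on device" here; a real harness would
            # distinguish a dead connection from a clean ENOSPC error.
            conn.rollback()
        subprocess.check_call(["virsh", "shutdown", GUEST])   # clean stop
        subprocess.check_call(["qemu-img", "resize", DATA_IMG, "+1G"])
        subprocess.check_call(["virsh", "start", GUEST])
        check_consistency(wait_for_pg(), n)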
I've written some automated kvm test harnesses, so I could have a play with this idea. I would probably need some help with the test design, though, and the guest OS would be Linux, Linux, or Linux, at least to start with.

Opinions?

--
Craig Ringer