Re: amcheck (B-Tree integrity checking tool) - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: amcheck (B-Tree integrity checking tool) |
Date | |
Msg-id | CAM3SWZQ-vuss3ttYOHjv86s+zp2oaj6A4CrQDjmXBQfbCGoZsA@mail.gmail.com Whole thread Raw |
In response to | Re: amcheck (B-Tree integrity checking tool) (Peter Geoghegan <pg@heroku.com>) |
Responses |
Re: amcheck (B-Tree integrity checking tool)
|
List | pgsql-hackers |
On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan <pg@heroku.com> wrote: > That's right - I have a small number of feedback items to work > through. I also determined myself that there could be a very low > probability race condition when checking the key space across sibling > pages, and will work to address that. I attach a V2 revision. Behavioral changes: * Fixes very low probability race condition when verifying ordering across sibling pages with bt_index_check()/AccessShareLock. Notably, the fix does not involve using the high key from the next page instead of the first item, despite my recent suggestion that it would; I changed my mind. In this revision, bt_right_page_check_scankey() continues to return the first item on the next/right page (as a scankey), just as in the first version. The locking considerations here are more complicated than before, and hopefully correct. I abandoned the high key idea when I realized that a concurrent page split could happen in the right sibling, making the high key generally no more safe in the face of concurrent target page deletion than continuing to use the first data item. Using the first data item from the right page is better than using the high key item from the right page verification-wise anyway, so that's good. I found a new way of locking down and recovering from any race conditions (the low probability bug in V1 for bt_index_check()/AccessShareLock calls to the C function bt_right_page_check_scankey()). I won't go into the gory details here, but I will mention that my new approach is inspired by how backwards scans work already. * Adds DEBUG1 traces that indicate the level and page currently being checked. Advanced users will probably find these traces useful on occasion. It's nice to be able to see how amcheck walks the B-Tree in practice. Non-behavioral changes: * Explains why an ExclusiveLock is needed on the target index by bt_index_parent_check() (and a ShareLock on its heap relation). This new explanation talks about why an ExclusiveLock *is* necessary when checking downlinks against child pages, per Tomas Vondra's request. (Note that nothing changes about child/downlink verification itself). Before now, I focused on why they were not necessary, but Tomas was right to point out that that isn't enough information. * No longer uses the word "comport" in an error message, per Amit Langote. Similarly, minor typos fixed, per Tomas. * Adds a variety of pg_regress-based tests for external sorting. (Via various CREATE INDEX statements involving a variety of datatypes. Data is generated pseudo-randomly.) Notably, the regression tests *never* test external sorting today. ISTM that amcheck is a convenient, reliable, and portable way of testing the sorting of collatable types like text, so this is the best way of testing external sorting. External sorting uses abbreviated keys for runs, but discards them as runs are written out, and so the merging of runs does not use abbreviated keys. Abbreviated comparisons and authoritative comparisons are therefore reliably used within the same sort. One particular goal here is that something like the recent strxfrm() debacle might be caught by this testing on some platforms (if we ever get back to trusting strxfrm() at all, that is). The style of these new tests is unorthodox, because a variety of collations are tested, which varies from system to system; we should discuss all this. The tests can take a minute or two to run on my laptop, but that could vary significantly. And so, whether or not that needs to be targeted by something other than "check" and "installcheck" is a discussion that needs to happen. I imagine a lot of buildfarm animals have slow storage, but on the other hand I don't think hackers run the contrib regression tests so often. I prefer to not follow the example of test_decoding if possible; test_decoding's Makefile only runs tests when wal_level=logical is expected (that's how the relevant Makefile targets are scoped), but it's a complicated special case. Review ------ As already noted after the first bullet point, there are some tricky considerations on fixing the race when the user calls bt_index_check(). Reviewers should pay close attention to this; could I be wrong about that being race-safe even now? This seems to me to be by far the best place for reviewers to look for problems in this patch, as it's the only bt_index_check()/AccessShareLock case that compares anything *across* pages. The measures taken to prevent false positives described within bt_target_page_check() require expert review, especially because this is the one case where a faulty rationale may result in under-protection from false positives, and not just harmless over-protection. The rationale for why we can reliably recover from a race described within the C function bt_right_page_check_scankey() is *complicated*. I may have failed to make it as simple as possible despite significant effort. It's hard to describe these things in an accessible way. Maybe someone can try and understand what I've said there, and let me know how I might have fallen short. I have used a new term, "cousin page" in this explanation (cf. sibling page). This seems like useful terminology that makes these difficult explanations a bit more accessible. -- Peter Geoghegan
Attachment
pgsql-hackers by date: