Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum - Mailing list pgsql-bugs
From: Andres Freund
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Msg-id: 20211122223436.ov3qa27pakxay6z7@alap3.anarazel.de
In response to: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
List: pgsql-bugs
Hi,

On 2021-11-22 12:41:32 -0800, Peter Geoghegan wrote:
> On Mon, Nov 22, 2021 at 9:59 AM Andres Freund <andres@anarazel.de> wrote:
> > I'm not quite sure what you're referring to with "convincing me"? Whether to go
> > for something comprehensive or more minimal? I'm kind of agnostic on what the
> > right approach is.
>
> I thought that you were clearly leaning in the direction of a
> minimal-only fix for backpatch. I just meant that there is no point in
> saying much more about it. It's not like I have a problem with your
> approach. My personal opinion is that we'd be somewhat better off if
> we went with my more comprehensive approach, even on 14. That's not
> how you see it - which is fair enough (reasoning about these risks is
> hard, and it tends to be quite subjective).

I think I'm actually just not sure what the better approach for the back
branches is. I'm worried about the size of your patch, and I'm worried about
the craziness of the current architecture (if you can call it that) of
heap_page_prune() with my patch.

I think either way your approach should incorporate the dedicated loop to
check the visibility - for performance reasons alone.

> > I think it's worth cleaning up the regression test I wrote and using it
> > regardless of which version of the fix we end up choosing. But I'm a bit
> > on the fence - it's quite complicated.
>
> +1 to the idea of keeping it somewhere. Without necessarily running it
> on the BF.
>
> > OTOH, I think with the framework in place it'd not be too hard to write a few
> > more tests for odd pruning scenarios...
>
> Can we commit a test like this without running it by default, at all?
> It's not like it has never been done before.

Is there a reason not to run it if we commit it? IME tests not run by
default tend to bitrot very quickly. The only issue I can think of is that
it might be hard to make the test useful and robust in the presence of
debug_discard_caches != 0.
> The nice thing about using amcheck for something like this (compared
> to pageinspect) is that we don't have to specify exact details about
> what it looks like when the test passes. Because passing the test just
> means the absence of any error.

Unless amcheck gets more in-depth, there's just too much it doesn't show :(

> How hard would it be to just add amcheck coverage on HEAD? Something
> that goes just a bit further with LP_REDIRECT validation (currently
> verify_heapam does practically nothing like that). We should add
> comprehensive HOT chain validation too, but that can wait a little
> while longer.

It's a lot more complicated. What exactly do you mean by "amcheck coverage"
in this context?

> Another thing is that the new assertions themselves are part of our
> overall testing strategy. I see that you've gone further with that
> than I did in your v3-0003-* patch. I should adapt that code to my
> more comprehensive fix for HEAD, while still keeping around the
> existing heap_prune_chain() assertions (perhaps just as
> documentation/comments to make the code easier to follow).

I found that style of check helpful because it notices problems sooner.
What took the longest while debugging this issue was that simple assertions
(e.g. during heap_page_prune_execute() or heap_prune_record_unused()) would
trigger well after the actual bug. E.g. marking an item unused that is
referenced by a pre-existing LP_REDIRECT item can't easily be detected at
that time. It'd trigger an assertion at the time the LP_REDIRECT item is
marked dead, but that doesn't help much to pinpoint the origin of the
problem.

I'm inclined to think that the costs of the check are worth incurring by
default for cassert builds.

I wish I knew of a good way to have pgbench check the validity of its
results. There's a huge difference between not crashing and returning
correct results. And there are just a lot of bugs that will result in
valid-enough on-disk/in-buffer data which is nonetheless not correct.
Greetings,

Andres Freund