Re: Make relfile tombstone files conditional on WAL level - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Make relfile tombstone files conditional on WAL level |
Date | |
Msg-id | CA+Tgmoa0dBdYhm2EWWLmu9Kh9qhibPun1zLa-aCg+p6q+AQAHA@mail.gmail.com Whole thread Raw |
In response to | Re: Make relfile tombstone files conditional on WAL level (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: Make relfile tombstone files conditional on WAL level
|
List | pgsql-hackers |
On Thu, Jun 10, 2021 at 6:47 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > It would indeed be nice to have some other mechanism to prevent the > issue with wal_level=minimal, the tombstone files feel hacky and > complicated. Maybe a new shared memory hash table to track the > relfilenodes of dropped tables. Just to summarize the issue here as I understand it, if a relfilenode is used for two unrelated relations during the same checkpoint cycle with wal_level=minimal, and if the WAL-skipping optimization is applied to the second of those but not to the first, then crash recovery will lose our only copy of the new relation's data, because we'll replay the removal of the old relfilenode but will not have logged the new data. Furthermore, we've wondered about writing an end-of-recovery record in all cases rather than sometimes writing an end-of-recovery record and sometimes a checkpoint record. That would allow another version of this same problem, since a single checkpoint cycle could now span multiple server lifetimes. At present, we dodge all this by keeping the first segment of the main fork around as a zero-length file for the rest of the checkpoint cycle, which I think prevents the problem in both cases. Now, apparently that caused some problem with the AIO patch set so Thomas is curious about getting rid of it, and Heikki concurs that it's a hack. I guess my concern about this patch is that it just seems to be reducing the number of cases where that hack is used without actually getting rid of it. Rarely-taken code paths are more likely to have undiscovered bugs, and that seems particularly likely in this case, because this is a low-probability scenario to begin with. A lot of clusters probably never have an OID counter wraparound ever, and even in those that do, getting an OID collision with just the right timing followed by a crash before a checkpoint can intervene has got to be super-unlikely. Even as things are today, if this mechanism has subtle bugs, it seems entirely possible that they could have escaped notice up until now. So I spent some time thinking about the question of getting rid of tombstone files altogether. I don't think that Heikki's idea of a shared memory hash table to track dropped relfilenodes can work. The hash table will have to be of some fixed size N, and whatever the value of N, the approach will break down if N+1 relfilenodes are dropped in the same checkpoint cycle. The two most principled solutions to this problem that I can see are (1) remove wal_level=minimal and (2) use 64-bit relfilenodes. I have been reluctant to support #1 because it's hard for me to believe that there aren't cases where being able to skip a whole lot of WAL-logging doesn't work out to a nice performance win, but I realize opinions on that topic vary. And I'm pretty sure that Andres, at least, will hate #2 because he's unhappy with the width of buffer tags already. So I don't really have a good idea. I agree this tombstone system is a bit of a wart, but I'm not sure that this patch really makes anything any better, and I'm not really seeing another idea that seems better either. Maybe I am missing something... -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: