Re: SimpleLruTruncate() mutual exclusion - Mailing list pgsql-hackers
From: Noah Misch
Subject: Re: SimpleLruTruncate() mutual exclusion
Msg-id: 20190801065117.GA2355684@rfd.leadboat.com
In response to: Re: SimpleLruTruncate() mutual exclusion (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: SimpleLruTruncate() mutual exclusion; Re: SimpleLruTruncate() mutual exclusion
List: pgsql-hackers
On Mon, Jul 29, 2019 at 12:58:17PM -0400, Tom Lane wrote:
> > On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
> >>> b. Arrange so only one backend runs vac_truncate_clog() at a time.  Other than
> >>> AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
> >>> checkpoint, or in the startup process.  Hence, also arrange for only one
> >>> backend to call SimpleLruTruncate(AsyncCtl) at a time.
>
> >> More specifically, restrict vac_update_datfrozenxid() to one backend per
> >> database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail() to one
> >> backend per cluster.  This, attached, was rather straightforward.

> I'm pretty sure it was intentional that multiple backends
> could run truncation directory scans concurrently, and I don't really
> want to give that up if possible.

I want to avoid a design that requires painstaking concurrency analysis.  Such
analysis is worth it for heap_update(), but vac_truncate_clog() isn't enough of
a hot path.  If there's a way to make vac_truncate_clog() easy to analyze and
still somewhat concurrent, fair.

> Also, if I understand the data-loss hazard properly, it's what you
> said in the other thread: the latest_page_number could advance after
> we make our decision about what to truncate, and then maybe we could
> truncate newly-added data.  We surely don't want to lock out the
> operations that can advance last_page_number, so how does serializing
> vac_truncate_clog help?
>
> I wonder whether what we need to do is add some state in shared
> memory saying "it is only safe to create pages before here", and
> make SimpleLruZeroPage or its callers check that.  The truncation
> logic would advance that value, but only after completing a scan.
> (Oh ..., hmm, maybe the point of serializing truncations is to
> ensure that we know that we can safely advance that?)

vac_truncate_clog() already records "it is only safe to create pages before
here" in ShmemVariableCache->xidWrapLimit, which it updates after any unlinks.
The trouble comes when two vac_truncate_clog() run in parallel and you get a
sequence of events like this:

  vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
  vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
  vac_truncate_clog() instance 1 unlinks segment ABCD
  vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
  vac_truncate_clog() instance 1 finishes
  some backend calls SimpleLruZeroPage(), creating segment ABCD
  vac_truncate_clog() instance 2 unlinks segment ABCD

Serializing vac_truncate_clog() fixes that.

> Can you post whatever script you used to detect/reproduce the problem
> in the first place?

I've attached it, but it's pretty hackish.  Apply the patch on commit 7170268,
make sure your --bindir is in $PATH, copy "conf-test-pg" to your home
directory, and run "make trunc-check".  This incorporates xid-burn acceleration
code that Jeff Janes shared with -hackers at some point.

> PS: also, re-reading this code, it's worrisome that we are not checking
> for failure of the unlink calls.  I think the idea was that it didn't
> really matter because if we did re-use an existing file we'd just
> re-zero it; hence failing the truncate is an overreaction.  But probably
> some comments about that are in order.

That's my understanding.  We'll zero any page before reusing it.  Failing the
whole vac_truncate_clog() (and therefore not advancing the wrap limit) would do
more harm than the bit of wasted disk space.  Still, a LOG or WARNING message
would be fine, I think.
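To make the serialization point above concrete, here's a minimal standalone
sketch (plain C with pthreads, not backend code; truncate_lock, segment_exists
and wrap_limit are invented stand-ins for the proposed cluster-wide lock, a
pg_xact segment, and xidWrapLimit).  With the whole decide/unlink/advance
sequence under one lock, the interleaving above can't happen:

/*
 * Two "vac_truncate_clog-like" workers race to decide a segment is eligible
 * and unlink it.  Holding one lock across the decide/unlink/advance-limit
 * sequence means the second worker re-checks eligibility after the first has
 * finished, so it never unlinks a segment recreated in the meantime.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t truncate_lock = PTHREAD_MUTEX_INITIALIZER;
static bool segment_exists = true;	/* stands in for segment ABCD */
static int	wrap_limit = 0;			/* stands in for xidWrapLimit */

static void *
truncate_clog_like(void *arg)
{
	int			worker = *(int *) arg;

	/* Serialize the whole decide/unlink/advance sequence. */
	pthread_mutex_lock(&truncate_lock);
	if (segment_exists)			/* "considers segment ABCD eligible to unlink" */
	{
		segment_exists = false; /* "unlinks segment ABCD" */
		wrap_limit++;			/* "calls SetTransactionIdLimit()" */
		printf("worker %d truncated; wrap limit now %d\n", worker, wrap_limit);
	}
	pthread_mutex_unlock(&truncate_lock);
	return NULL;
}

int
main(void)
{
	pthread_t	t1,
				t2;
	int			one = 1,
				two = 2;

	pthread_create(&t1, NULL, truncate_clog_like, &one);
	pthread_create(&t2, NULL, truncate_clog_like, &two);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}

Build with "cc -pthread"; only one worker ever reports truncating the segment,
regardless of scheduling.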
Thanks, nm