Nasty VACUUM/bgwriter/segmentation bug - Mailing list pgsql-hackers
From: Tom Lane
Subject: Nasty VACUUM/bgwriter/segmentation bug
Msg-id: 3636.1163896491@sss.pgh.pa.us
List: pgsql-hackers
Responses: Re: Nasty VACUUM/bgwriter/segmentation bug
While working on fixing the recently reported hash-index problem, I was using a test build with a very small RELSEG_SIZE (128K), so that I could trigger the reported bug with a reasonably small amount of data. And I started finding some unexpected data corruption. I eventually reduced it to this test case:

	checkpoint;
	create table foo (f1 int);
	insert into foo select x from generate_series(1,100000) x;
	-- wait 30 seconds
	delete from foo;
	vacuum verbose foo;
	insert into foo select x from generate_series(1,100000) x;
	\q

Stop and restart the postmaster, then:

	vacuum verbose foo;

This vacuum will generate a whole lot of

	WARNING:  relation "foo" page 16 is uninitialized --- fixing
	WARNING:  relation "foo" page 17 is uninitialized --- fixing
	...

and when the dust settles, most of the second batch of 100000 rows is gone.

What is happening is that during that 30-second wait, the bgwriter is dumping out all the dirty pages, and acquiring open file references to each segment of table "foo" as it does so. The VACUUM then truncates "foo" back to zero size, since it contains no data after the DELETE, and then the second INSERT bulks it up again. The problem is that the bgwriter still has open file references to the deleted segments after the first one, and that's where it will write the data if given a chance. So the updates disappear into the ether as far as any other process is concerned, for each segment except the first.

There's a rather indirect mechanism that's supposed to prevent similar problems between two backends: a file truncation is supposed to be associated with a forced relcache flush, which causes an smgrclose(), so other backends will be forced to reopen the file(s) before they can do any more work with the truncated relation.
On reflection I don't trust this, though, because in the case where a backend writes a dirty buffer in order to reclaim a buffer, it doesn't try to open the relation at the relcache level (it cannot, because the rel might be in a different database). So it could use a stale smgr relation, same as the bgwriter. And the bgwriter does not participate in shared cache inval at all, not having a relcache in the first place, so this mechanism doesn't help it anyway.

This is a fairly low-probability bug in real-world cases, because it can only happen when a relation is truncated and then re-expanded across a 1GB segment boundary. Moreover, because the bgwriter flushes all its open files after each checkpoint, the window for trouble only extends to the next checkpoint. But it definitely could happen, and it might explain some heretofore irreproducible corruption reports.

I think the easiest fix might be to not remove no-longer-used segment files during a truncate, but simply reduce them to zero size rather than deleting them. Then any open file pointers aren't invalidated. The only alternative I can see is to invent some new signaling mechanism to force closure of open files, but that seems ugly, complex, and perhaps subject to race conditions.

Thoughts?

			regards, tom lane