Nasty VACUUM/bgwriter/segmentation bug - Mailing list pgsql-hackers
From: Tom Lane
Subject: Nasty VACUUM/bgwriter/segmentation bug
Msg-id: 3636.1163896491@sss.pgh.pa.us
List: pgsql-hackers
Responses: Re: Nasty VACUUM/bgwriter/segmentation bug
While working on fixing the recently reported hash-index problem, I was using a test build with a very small RELSEG_SIZE (128K), so that I could trigger the reported bug with a reasonably small amount of data. And I started finding some unexpected data corruption. I eventually reduced it to this test case:

	checkpoint;
	create table foo (f1 int);
	insert into foo select x from generate_series(1,100000) x;
	-- wait 30 seconds
	delete from foo;
	vacuum verbose foo;
	insert into foo select x from generate_series(1,100000) x;
	\q

Stop and restart the postmaster, then:

	vacuum verbose foo;

This vacuum will generate a whole lot of

	WARNING:  relation "foo" page 16 is uninitialized --- fixing
	WARNING:  relation "foo" page 17 is uninitialized --- fixing
	...

and when the dust settles, most of the second batch of 100000 rows is gone.

What is happening is that during that 30-second wait, the bgwriter is dumping out all the dirty pages, and acquiring open file references to each segment of table "foo" as it does so. The VACUUM then truncates "foo" back to zero size, since it contains no data after the DELETE, and then the second INSERT bulks it up again. The problem is that the bgwriter still has open file references to the deleted segments after the first one, and that's where it will write the data if given a chance. So the updates disappear into the ether as far as any other process is concerned, for each segment except the first.

There's a rather indirect mechanism that's supposed to prevent similar problems between two backends: a file truncation is supposed to be associated with a forced relcache flush, which causes an smgrclose(), so other backends will be forced to reopen the file(s) before they can do any more work with the truncated relation.
On reflection I don't trust this, though, because in the case where a backend writes a dirty buffer in order to reclaim a buffer, it doesn't try to open the relation at the relcache level (it cannot, because the rel might be in a different database). So it could use a stale smgr relation, same as the bgwriter. And the bgwriter does not participate in shared cache inval at all, not having a relcache in the first place, so this mechanism doesn't help it anyway.

This is a fairly low-probability bug in real-world cases, because it can only happen when a relation is truncated and then re-expanded across a 1GB segment boundary. Moreover, because the bgwriter flushes all its open files after each checkpoint, the window for trouble only extends to the next checkpoint. But it definitely could happen, and it might explain some heretofore irreproducible corruption reports.

I think the easiest fix might be to not remove no-longer-used segment files during a truncate, but simply reduce them to zero size rather than deleting them. Then any open file pointers aren't invalidated. The only alternative I can see is to invent some new signaling mechanism to force closure of open files, but that seems ugly, complex, and perhaps subject to race conditions.

Thoughts?

			regards, tom lane