Thread: Undiagnosed bug in Bloom index
I am getting corrupted Bloom indexes in which a tuple in the table heap is not in the index.

I see it as early as commit a9284849b48b, with commit e13ac5586c49c cherry-picked onto it. I don't see it before a9284849b48b because the test case segfaults before anything interesting can happen, so I think this is an ab initio bug, either in the bloom contrib module or in the core index AM. I see it as recently as 371b572, which is as new as I have tested.

The symptom is that an UPDATE which should update exactly one row instead updates zero rows. It takes 5 to 16 hours to reproduce when run as 8 clients on 8 cores. I suspect it is some kind of race condition, and that testing with more clients on more cores would make it happen faster. Injecting crash/recovery cycles into the system seems to make it happen sooner, but crash/recovery cycles are not necessary.

If you use the attached do_nocrash.sh script, the error shows up as a message like:

    child abnormal exit update did not update 1 row: key 6838 updated 0E0 at count.pl line 189.\n at count.pl line 197.

(I've added code so that once this is detected, the script will soon terminate.)

To run do_nocrash.sh, change the first few lines to hardcode the correct paths for the binaries and the temp data directory (which will be ruthlessly deleted). It will run on an unpatched server, since crash injection is turned off. To make it fork more clients, change the 8 in 'perl count.pl 8 0|| on_error;'.

I have preserved a large tarball (215M) of a corrupt data directory, produced with the a928484 build with e13ac5586 cherry-picked; it is at https://drive.google.com/open?id=0Bzqrh1SO9FcEci1FQTkwZW9ZU1U. Alternatively, I could look for myself if you can tell me how (pageinspect doesn't offer much for Bloom). With that tarball, the first query below, which uses the index, returns nothing, while the second, which forces a seq scan, returns a row:

    select * from foo where bloom = md5('6838');
    select * from foo where bloom||'' = md5('6838');

The machinery posted here is probably much more elaborate than necessary to detect the problem. You could probably detect it with pgbench -N, except that that doesn't check the results to make sure the expected number of rows were actually selected/updated.

Cheers,

Jeff
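[A hedged, slightly more general version of that per-key check, not taken from count.pl; the table and column names are the ones used in the queries above, and you should confirm with EXPLAIN that the first probe really uses the bloom index while the second falls back to a sequential scan:]

    -- Probe one key through the index and again with the index deliberately
    -- defeated by the ||'' trick; on a healthy index both counts match.
    SELECT count(*) FROM foo WHERE bloom = md5('6838');        -- index probe
    SELECT count(*) FROM foo WHERE bloom || '' = md5('6838');  -- forced heap scan

    -- Whole-table variant: report heap values the index fails to find.
    -- The planner may flatten this into a hash anti-join, which would bypass
    -- the index; check the plan with EXPLAIN, and if necessary discourage the
    -- competing paths with SET enable_seqscan = off / SET enable_hashjoin = off.
    SELECT f1.bloom
    FROM foo f1
    WHERE NOT EXISTS (SELECT 1 FROM foo f2 WHERE f2.bloom = f1.bloom);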
Jeff Janes <jeff.janes@gmail.com> writes:
> I am getting corrupted Bloom indexes in which a tuple in the table
> heap is not in the index.

Hmm. I can trivially reproduce a problem, but I'm not entirely sure whether it matches yours. Same basic test case as the bloom regression test:

    regression=# CREATE TABLE tst ( i int4, t text );
    CREATE TABLE
    regression=# CREATE INDEX bloomidx ON tst USING bloom (i, t) WITH (col1 = 3);
    CREATE INDEX
    regression=# INSERT INTO tst SELECT i%10, substr(md5(i::text), 1, 1) FROM generate_series(1,2000) i;
    INSERT 0 2000
    regression=# vacuum verbose tst;
    ...
    INFO:  index "bloomidx" now contains 2000 row versions in 5 pages
    ...
    regression=# delete from tst;
    DELETE 2000
    regression=# vacuum verbose tst;
    ...
    INFO:  index "bloomidx" now contains 0 row versions in 5 pages
    DETAIL:  2000 index row versions were removed.
    ...
    regression=# INSERT INTO tst SELECT i%10, substr(md5(i::text), 1, 1) FROM generate_series(1,2000) i;
    INSERT 0 2000
    regression=# vacuum verbose tst;
    ...
    INFO:  index "bloomidx" now contains 1490 row versions in 5 pages
    ...

Ooops.

(Note: this is done with some fixes already in place to make blvacuum.c return correct tuple counts for VACUUM VERBOSE; right now it tends to double-count during a VACUUM.)

The problem seems to be that:

(1) blvacuum marks all the index pages as BLOOM_DELETED, but doesn't bother to clear the notFullPage array on the metapage;

(2) blinsert uses a page from notFullPage, failing to notice that it's inserting data into a BLOOM_DELETED page;

(3) once we ask the FSM for a page, it returns a BLOOM_DELETED page that we've already put tuples into, which we happily blow away by reinit'ing the page.

A race-condition variant of this could be that after an autovacuum has marked a page BLOOM_DELETED, but before it's reached the point of updating the metapage, blinsert could stick data into the deleted page. That would make it possible to reach the problem without requiring the extreme edge case that blvacuum finds no partly-full pages to put into the metapage. If this does explain your problem, it's probably that variant.

Will push a fix in a bit.

			regards, tom lane
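[On "how to look for myself": pageinspect has no bloom-specific decoding, but its generic get_raw_page() and page_header() functions do work on bloom pages. A hedged sketch, using the index name from the test case above (substitute the real index name when poking at the corrupt cluster); decoding the bloom special space by hand requires the BloomPageOpaqueData layout from contrib/bloom/bloom.h:]

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- Block 0 is the bloom metapage (which holds the notFullPage cache);
    -- higher block numbers are data pages.
    SELECT * FROM page_header(get_raw_page('bloomidx', 0));
    SELECT * FROM page_header(get_raw_page('bloomidx', 1));

    -- The BLOOM_DELETED flag lives in each page's special space.  page_header
    -- reports where that area starts ("special"); the raw bytes can be pulled
    -- out for hand-decoding against BloomPageOpaqueData:
    SELECT substr(get_raw_page('bloomidx', 1),
                  (SELECT special::int + 1
                   FROM page_header(get_raw_page('bloomidx', 1))),
                  16) AS special_space;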
On Sat, Aug 13, 2016 at 6:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> I am getting corrupted Bloom indexes in which a tuple in the table
>> heap is not in the index.
> ....
> Will push a fix in a bit.

After 36 hours of successful running on two different machines (one with crash injection turned on, one without), I am pretty confident that your fix is working.

Thanks,

Jeff