Re: amcheck assert failure - Mailing list pgsql-bugs
From | Peter Geoghegan |
---|---|
Subject | Re: amcheck assert failure |
Date | |
Msg-id | CAH2-WzmkurhCqnyLHxk0VkOZqd49+ZZsp1xAJOg7j2x7dmp_XQ@mail.gmail.com Whole thread Raw |
In response to | Re: amcheck assert failure (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: amcheck assert failure
Re: amcheck assert failure |
List | pgsql-bugs |
On Mon, Apr 22, 2019 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Nonetheless ... do we really want an assertion failure rather than > some more-mundane error report? Seems like the point of amcheck > is to detect data corruption, so I'd rather get an ERROR than an > assertion (which would not fire in production builds anyway). (Apologies for sending this a second time, Tom -- somehow failed to CC the list the first time) I think that the size cross-check must have failed to catch the problem: /* * lp_len should match the IndexTuple reported length exactly, since * lp_len is completely redundant in indexes, and both sources of * tuple length are MAXALIGN()'d. nbtree does not use lp_len all that * frequently, and is surprisingly tolerant of corrupt lp_len fields. */ if (tupsize != ItemIdGetLength(itemid)) ereport(ERROR, (errcode(ERRCODE_INDEX_CORRUPTED), errmsg("index tuple size does not equal lp_len in index \"%s\"", RelationGetRelationName(state->rel)), errdetail_internal("Index tid=(%u,%u) tuple size=%zu lp_len=%u page lsn=%X/%X.", state->targetblock, offset, tupsize, ItemIdGetLength(itemid), (uint32) (state->targetlsn >> 32), (uint32) state->targetlsn), errhint("This could be a torn page problem."))); It seems like the length from tupsize (IndexTupleSize()) happened to match the redundant ItemIdGetLength()-reported size from the line pointer. No idea how that could actually be possible, since the value returned by ItemIdGetLength() would be 12592. What are the chances of that accidentally matching what the wild itup pointer reported as its IndexTupleSize()? The only explanation I can think of is that the assertion failure was from the core code -- _bt_check_natts() is itself called within core code assertions, so it's just about possible that that was how this happened, before amcheck was even called. Doesn't seem like a very good explanation. Do you think that that's possible, Grigory? In any case, you're clearly right that amcheck could be more defensive here. My pg_hexedit tool detected that there was an LP_UNUSED item pointer with storage (i.e. whose lp_len > 0), which is a check that amcheck can and should perform, but doesn't. Especially because it would prevent a deference of a likely-wild pointer (IndexTuple). We could: * Throw an error when any line item reports lp_len == 0. * Throw an error when there is a line item that's either LP_UNUSED, or LP_REDIRECT. The corrupt page sent by Grigory had both. What do you think of the idea of committing some simple checks to do this with contrib/amcheck on v12? -- Peter Geoghegan
pgsql-bugs by date: