Thread: Bug in abbreviated keys abort handling (found with amcheck)
I found another bug as a result of using amcheck on Heroku customer databases. This time, the bug is in core Postgres. It's one of mine.

There was a thinko in tuplesort's abbreviation abort logic, causing certain SortTuples to be spuriously marked NULL (and so, subsequently sorted as a NULL tuple, despite not actually changing anything about the representation of caller tuples). The attached patch fixes this bug.

I noticed this following a complaint by amcheck about a tuple in the wrong order on a leaf page in some random text index. The leaf page was entirely full of NULL values, aside from this one tuple at some seemingly random position. All non-NULL index tuples were of the kind that you'd expect to trigger abbreviation to abort (many distinct values, but with little entropy at the beginning).

I believe that this particular problem has been observed on a tiny fraction of all databases tested, so I don't think it's very common in the wild.

I'd be surprised if amcheck does not bring more bugs like this to my attention before too long. We should work on improving it, so that we have greater visibility into problems that occur in the field.

--
Peter Geoghegan
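For readers who want a concrete picture of the failure mode described in this message, the following toy Python model may help. It is only a sketch under stated assumptions, not the tuplesort.c code and not the attached patch: `SortTuple` here is a simplified stand-in, `abbreviate`, `abort_abbreviation`, `sort_tuples`, and `first_order_violation` are hypothetical helpers, and the specific mistake modeled (writing the restored NULL flag to the wrong tuple while aborting) is an assumed example of a thinko that produces the reported symptom.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SortTuple:
    value: Optional[str]   # caller's tuple; the sort never modifies it
    datum1: Optional[str]  # cached first key (abbreviated or full)
    isnull1: bool          # cached NULL flag for the first key


def abbreviate(value: str) -> str:
    # Toy abbreviation (assumption): keep only the first 4 characters.
    return value[:4]


def abort_abbreviation(memtuples: List[SortTuple], incoming: SortTuple,
                       buggy: bool) -> None:
    """Restore full keys on tuples already held in memory.

    Hypothetical bug: the NULL flag is stored on the incoming tuple
    instead of on the in-memory tuple currently being restored.
    """
    for mtup in memtuples:
        mtup.datum1 = mtup.value
        target = incoming if buggy else mtup
        target.isnull1 = mtup.value is None


def sort_tuples(tuples: List[SortTuple]) -> List[SortTuple]:
    # NULLS LAST, trusting the cached flag rather than the real value.
    return sorted(tuples, key=lambda t: (t.isnull1, "" if t.isnull1 else t.datum1))


def first_order_violation(ordered: List[SortTuple]) -> Optional[int]:
    # amcheck-like idea: compare neighbours using the real values.
    def real_key(t: SortTuple) -> Tuple[bool, str]:
        return (t.value is None, t.value or "")

    for i in range(len(ordered) - 1):
        if real_key(ordered[i]) > real_key(ordered[i + 1]):
            return i + 1
    return None


def run(buggy: bool):
    # Many values sharing a prefix, so abbreviation is not useful.
    values = ["aaaa-3", None, "aaaa-1", None]
    memtuples = [
        SortTuple(v, None if v is None else abbreviate(v), v is None)
        for v in values
    ]
    incoming = SortTuple("aaaa-2", abbreviate("aaaa-2"), False)

    abort_abbreviation(memtuples, incoming, buggy)
    incoming.datum1 = incoming.value  # incoming tuple gets its full key too

    ordered = sort_tuples(memtuples + [incoming])
    return [t.value for t in ordered], first_order_violation(ordered)


if __name__ == "__main__":
    print(run(buggy=True))
    print(run(buggy=False))
```

In this model the buggy run leaves the caller's value untouched but sorts the non-NULL tuple among the NULLs, which a neighbour-by-neighbour order check then flags; the non-buggy run produces a clean order.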
On Fri, Aug 19, 2016 at 6:07 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I found another bug as a result of using amcheck on Heroku customer
> databases. This time, the bug is in core Postgres. It's one of mine.
>
> There was a thinko in tuplesort's abbreviation abort logic, causing
> certain SortTuples to be spuriously marked NULL (and so, subsequently
> sorted as a NULL tuple, despite not actually changing anything about
> the representation of caller tuples). The attached patch fixes this
> bug.

Ugh, that sucks. Thanks for the report and patch. Committed and back-patched to 9.5.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Aug 22, 2016 at 12:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Ugh, that sucks. Thanks for the report and patch. Committed and
> back-patched to 9.5.

Thanks.

Within Heroku, there is a lot of enthusiasm for the idea of sharing hard data about the prevalence of problems like this. I hope to be able to share figures in the next few weeks, when I finish working through the backlog.

Separately, I would like amcheck to play a role in how we direct users to REINDEX, as issues like this come to light. It would be much more helpful if we didn't have to be so conservative. I hesitate to say that amcheck will detect cases where this bug led to corruption with 100% reliability, but I think that any case that one can imagine in which amcheck fails here is unlikely in the extreme. The same applies to the glibc abbreviated keys issue.

I actually didn't find any glibc strxfrm() issues yet, even though any instances of corruption of text indexes I've seen originated before the point release in which strxfrm() became distrusted. I guess that not that many Heroku users use the "C" locale, which would still be affected with the latest point release.

--
Peter Geoghegan