Thread: Intermittent "cache lookup failed for type" buildfarm failures
There is something rotten in the state of Denmark. Here are four recent
runs that failed with unexpected "cache lookup failed for type nnnn"
errors:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grouse&dt=2016-08-16%2008%3A39%3A03
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nudibranch&dt=2016-08-13%2009%3A55%3A09
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-08-09%2001%3A46%3A17
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-08-09%2000%3A44%3A18

The first two are on HEAD, the second two on 9.5, which seems to rule out
my first thought that this has something to do with parallel query. It's
notable though that all the failing machines are PPC or S/390 ... maybe
big-endian related?

I grepped through the buildfarm logs and determined that there are exactly
zero similar failures going back as far as 2016-04-01. Now that we've had
four in a week, it seems certain that this indicates a bug introduced at
most a few days before Aug 9. A quick trawl through the git logs finds
no obvious candidates, though.

Any ideas?

			regards, tom lane
On Tue, Aug 16, 2016 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There is something rotten in the state of Denmark. Here are four recent
> runs that failed with unexpected "cache lookup failed for type nnnn"
> errors:
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grouse&dt=2016-08-16%2008%3A39%3A03
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nudibranch&dt=2016-08-13%2009%3A55%3A09
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-08-09%2001%3A46%3A17
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-08-09%2000%3A44%3A18
>
> The first two are on HEAD, the second two on 9.5, which seems to rule out
> my first thought that this has something to do with parallel query. It's
> notable though that all the failing machines are PPC or S/390 ... maybe
> big-endian related?
>
> I grepped through the buildfarm logs and determined that there are exactly
> zero similar failures going back as far as 2016-04-01. Now that we've had
> four in a week, it seems certain that this indicates a bug introduced at
> most a few days before Aug 9. A quick trawl through the git logs finds
> no obvious candidates, though.

Well, it would have to be something that was back-patched to 9.5, right?
That doesn't leave too many candidates.

[rhaas pgsql]$ git log --format=oneline --before='Aug 10' --after='Aug 6' REL9_5_STABLE src/backend/
04cee8f835bcf95ff80b734c335927aaf6551d2d Fix several one-byte buffer over-reads in to_number
4da812fa8adb22874a937f1b000253fecf526cb0 Translation updates
98b0c6280667ce1efae763340fb2c13c81e4d706 Fix two errors with nested CASE/WHEN constructs.
cb5c14984ad327e52dfb470fde466a5aca7d50a1 Fix misestimation of n_distinct for a nearly-unique column with many nulls.
71dca408c0030ad76044c6b17367c9fbeac511ec Don't propagate a null subtransaction snapshot up to parent transaction.

Obviously, the third and fourth of those seem like the most likely
candidates, but I don't have any theory on how either of them could be
causing this.

It would sure be nice if those cache lookup failure messages printed
the file and line number. I wonder if we could teach psql to always
treat the VERBOSITY as verbose when the error code is XX000.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
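[Editor's note: the setting Robert refers to already exists as a manual
psql knob; his suggestion is about psql flipping it automatically for
XX000-class errors. A minimal sketch of using it by hand, for readers who
want the extra detail today (illustrative, not taken from these buildfarm
runs):]

    -- make subsequent error reports include the SQLSTATE and, when the
    -- server supplies it, the reporting source file, line, and function
    \set VERBOSITY verbose

    -- in 9.6-era and later psql, the most recent error can also be
    -- redisplayed in verbose form after the fact
    \errverbose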
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Aug 16, 2016 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I grepped through the buildfarm logs and determined that there are exactly
>> zero similar failures going back as far as 2016-04-01. Now that we've had
>> four in a week, it seems certain that this indicates a bug introduced at
>> most a few days before Aug 9. A quick trawl through the git logs finds
>> no obvious candidates, though.

> Well, it would have to be something that was back-patched to 9.5,
> right? That doesn't leave too many candidates.

It's possible that the bug existed longer and was only exposed by a
seemingly unrelated change (eg, a test timing change, since it certainly
looks like it might be timing dependent). That's little help though :-(

> It would sure be nice if those cache lookup failure messages printed
> the file and line number. I wonder if we could teach psql to always
> treat the VERBOSITY as verbose when the error code is XX000.

I looked around when I saw the earlier ones of these, and had more or less
convinced myself that the errors were probably coming from one of the
lsyscache.c convenience subroutines. If that's true, we'd need a stack
trace to have much hope of identifying the cause.

			regards, tom lane
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > It would sure be nice if those cache lookup failure messages printed
> > the file and line number. I wonder if we could teach psql to always
> > treat the VERBOSITY as verbose when the error code is XX000.
>
> I looked around when I saw the earlier ones of these, and had more or less
> convinced myself that the errors were probably coming from one of the
> lsyscache.c convenience subroutines. If that's true, we'd need a stack
> trace to have much hope of identifying the cause.

Maybe we can have a code path that calls backtrace() somewhere in
errfinish, for platforms that support that. At least grouse uses gcc,
so I suppose it must also use glibc.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wrote:
> There is something rotten in the state of Denmark. Here are four recent
> runs that failed with unexpected "cache lookup failed for type nnnn"
> errors:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grouse&dt=2016-08-16%2008%3A39%3A03
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nudibranch&dt=2016-08-13%2009%3A55%3A09
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-08-09%2001%3A46%3A17
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-08-09%2000%3A44%3A18

I believe I've figured this out. I realized that all the possible
instances of "cache lookup failed for type" are reporting failures of
SearchSysCache1(TYPEOID, ...) or related calls, and therefore I could
narrow this down by setting a breakpoint there on the combination of
cacheId = TYPEOID and key1 > 16384 (since the OIDs reported for the
failures are clearly for some non-builtin type).

After a bit of logging it became clear that the only such calls occurring
in the statements that are failing in the buildfarm are coming from the
parser's attempts to resolve an operator name. And then it was blindingly
obvious what changed recently: commits f0c7b789a et al added a test case
in case.sql that creates and then drops both an '=' operator and the type
it's for. And that runs in parallel with the failing tests, which all
need to resolve operators named '='. So in the other sessions, the parser
is seeing that transient '=' operator as a possible candidate, but then
when it goes to test whether that operator could match the actual inputs,
the type is already gone (causing a failure in getBaseType or
get_element_type or possibly other places).

The best short-term fix, and the only one I'd consider back-patching, is
to band-aid the test to prevent this problem, probably by wrapping that
whole test case in BEGIN ... ROLLBACK so that concurrent tests never see
the transient '=' operator.

In the long run, it'd be nice if we were more robust about such
situations, but I have to admit I have no idea how to go about making
that so. Certainly, just letting the parser ignore catalog lookup
failures doesn't sound attractive.

			regards, tom lane
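[Editor's note: a minimal sketch of the band-aid Tom describes, i.e.
keeping the transient objects inside a transaction that is rolled back so
that concurrent regression sessions never see them. The object names below
are invented placeholders, not the actual contents of case.sql:]

    BEGIN;

    -- transient type and '=' operator, visible only to this transaction
    CREATE DOMAIN casetest_dom AS text;
    CREATE FUNCTION casetest_eq(casetest_dom, casetest_dom) RETURNS boolean
      AS 'select true' LANGUAGE sql;
    CREATE OPERATOR = (leftarg = casetest_dom, rightarg = casetest_dom,
                       procedure = casetest_eq);

    -- ... the queries exercising the operator would go here ...

    -- nothing is ever committed, so operator-name resolution in other
    -- sessions never sees the extra '=' candidate or its vanishing type
    ROLLBACK;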