Thread: Intermittent "cache lookup failed for type" buildfarm failures
There is something rotten in the state of Denmark. Here are four recent
runs that failed with unexpected "cache lookup failed for type nnnn"
errors:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grouse&dt=2016-08-16%2008%3A39%3A03
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nudibranch&dt=2016-08-13%2009%3A55%3A09
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-08-09%2001%3A46%3A17
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-08-09%2000%3A44%3A18

The first two are on HEAD, the second two on 9.5, which seems to rule out
my first thought that this has something to do with parallel query. It's
notable though that all the failing machines are PPC or S/390 ... maybe
big-endian related?

I grepped through the buildfarm logs and determined that there are exactly
zero similar failures going back as far as 2016-04-01. Now that we've had
four in a week, it seems certain that this indicates a bug introduced at
most a few days before Aug 9. A quick trawl through the git logs finds
no obvious candidates, though.

Any ideas?

			regards, tom lane
On Tue, Aug 16, 2016 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There is something rotten in the state of Denmark. Here are four recent
> runs that failed with unexpected "cache lookup failed for type nnnn"
> errors:
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grouse&dt=2016-08-16%2008%3A39%3A03
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nudibranch&dt=2016-08-13%2009%3A55%3A09
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-08-09%2001%3A46%3A17
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-08-09%2000%3A44%3A18
>
> The first two are on HEAD, the second two on 9.5, which seems to rule out
> my first thought that this has something to do with parallel query. It's
> notable though that all the failing machines are PPC or S/390 ... maybe
> big-endian related?
>
> I grepped through the buildfarm logs and determined that there are exactly
> zero similar failures going back as far as 2016-04-01. Now that we've had
> four in a week, it seems certain that this indicates a bug introduced at
> most a few days before Aug 9. A quick trawl through the git logs finds
> no obvious candidates, though.

Well, it would have to be something that was back-patched to 9.5, right?
That doesn't leave too many candidates.

[rhaas pgsql]$ git log --format=oneline --before='Aug 10' --after='Aug 6' REL9_5_STABLE src/backend/
04cee8f835bcf95ff80b734c335927aaf6551d2d Fix several one-byte buffer over-reads in to_number
4da812fa8adb22874a937f1b000253fecf526cb0 Translation updates
98b0c6280667ce1efae763340fb2c13c81e4d706 Fix two errors with nested CASE/WHEN constructs.
cb5c14984ad327e52dfb470fde466a5aca7d50a1 Fix misestimation of n_distinct for a nearly-unique column with many nulls.
71dca408c0030ad76044c6b17367c9fbeac511ec Don't propagate a null subtransaction snapshot up to parent transaction.

Obviously, the third and fourth of those seem like the most likely
candidates, but I don't have any theory on how either of them could be
causing this.

It would sure be nice if those cache lookup failure messages printed
the file and line number. I wonder if we could teach psql to always
treat the VERBOSITY as verbose when the error code is XX000.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
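[Editor's note: the setting Robert refers to already exists as a manual
psql knob; his suggestion is about psql flipping it automatically for
XX000-class errors. A minimal sketch of using it by hand, for readers who
want the extra detail today (illustrative, not taken from these buildfarm
runs):]

    -- make subsequent error reports include the SQLSTATE and, when the
    -- server supplies it, the reporting source file, line, and function
    \set VERBOSITY verbose

    -- in 9.6-era and later psql, the most recent error can also be
    -- redisplayed in verbose form after the fact
    \errverbose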
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Aug 16, 2016 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I grepped through the buildfarm logs and determined that there are exactly
>> zero similar failures going back as far as 2016-04-01. Now that we've had
>> four in a week, it seems certain that this indicates a bug introduced at
>> most a few days before Aug 9. A quick trawl through the git logs finds
>> no obvious candidates, though.

> Well, it would have to be something that was back-patched to 9.5,
> right? That doesn't leave too many candidates.

It's possible that the bug existed longer and was only exposed by a
seemingly unrelated change (eg, a test timing change, since it certainly
looks like it might be timing dependent). That's little help though :-(

> It would sure be nice if those cache lookup failure messages printed
> the file and line number. I wonder if we could teach psql to always
> treat the VERBOSITY as verbose when the error code is XX000.

I looked around when I saw the earlier ones of these, and had more or less
convinced myself that the errors were probably coming from one of the
lsyscache.c convenience subroutines. If that's true, we'd need a stack
trace to have much hope of identifying the cause.

			regards, tom lane
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > It would sure be nice if those cache lookup failure messages printed
> > the file and line number. I wonder if we could teach psql to always
> > treat the VERBOSITY as verbose when the error code is XX000.
>
> I looked around when I saw the earlier ones of these, and had more or less
> convinced myself that the errors were probably coming from one of the
> lsyscache.c convenience subroutines. If that's true, we'd need a stack
> trace to have much hope of identifying the cause.

Maybe we can have a code path that calls backtrace() somewhere in
errfinish, for platforms that support that. At least grouse uses gcc,
so I suppose it must also use glibc.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wrote:
> There is something rotten in the state of Denmark. Here are four recent
> runs that failed with unexpected "cache lookup failed for type nnnn"
> errors:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grouse&dt=2016-08-16%2008%3A39%3A03
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nudibranch&dt=2016-08-13%2009%3A55%3A09
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-08-09%2001%3A46%3A17
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-08-09%2000%3A44%3A18

I believe I've figured this out. I realized that all the possible
instances of "cache lookup failed for type" are reporting failures of
SearchSysCache1(TYPEOID, ...) or related calls, and therefore I could
narrow this down by setting a breakpoint there on the combination of
cacheId = TYPEOID and key1 > 16384 (since the OIDs reported for the
failures are clearly for some non-builtin type).

After a bit of logging it became clear that the only such calls occurring
in the statements that are failing in the buildfarm are coming from the
parser's attempts to resolve an operator name. And then it was blindingly
obvious what changed recently: commits f0c7b789a et al added a test case
in case.sql that creates and then drops both an '=' operator and the type
it's for. And that runs in parallel with the failing tests, which all
need to resolve operators named '='. So in the other sessions, the parser
is seeing that transient '=' operator as a possible candidate, but then
when it goes to test whether that operator could match the actual inputs,
the type is already gone (causing a failure in getBaseType or
get_element_type or possibly other places).

The best short-term fix, and the only one I'd consider back-patching, is
to band-aid the test to prevent this problem, probably by wrapping that
whole test case in BEGIN ... ROLLBACK so that concurrent tests never see
the transient '=' operator.

In the long run, it'd be nice if we were more robust about such
situations, but I have to admit I have no idea how to go about making
that so. Certainly, just letting the parser ignore catalog lookup
failures doesn't sound attractive.

			regards, tom lane
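[Editor's note: a minimal sketch of the band-aid Tom describes, i.e.
keeping the transient objects inside a transaction that is rolled back so
that concurrent regression sessions never see them. The object names below
are invented placeholders, not the actual contents of case.sql:]

    BEGIN;

    -- transient type and '=' operator, visible only to this transaction
    CREATE DOMAIN casetest_dom AS text;
    CREATE FUNCTION casetest_eq(casetest_dom, casetest_dom) RETURNS boolean
      AS 'select true' LANGUAGE sql;
    CREATE OPERATOR = (leftarg = casetest_dom, rightarg = casetest_dom,
                       procedure = casetest_eq);

    -- ... the queries exercising the operator would go here ...

    -- nothing is ever committed, so operator-name resolution in other
    -- sessions never sees the extra '=' candidate or its vanishing type
    ROLLBACK;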