Thread: Cache lookup failed for relation

Cache lookup failed for relation

From

David Clymer

Date:

11 February 2013, 17:09:39

I've been seeing the following error in one database of ours:

  "cache lookup failed for relation 7640518"

The SQL that apparently triggers this is:

   drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;

Unfortunately, manual attempts to reproduce the issue have failed. In
normal operation, this statement is run as one of several parallel queries,
and the tables are by nature, short lived. That said, they are not
temporary tables.

This is one of two very similar databases, and we are running the same
software (same version) on top of each. The databases are in different
versions of postgresql. Db #1 is postgresql 9.2.3, and db #2 (the one
exhibiting the above behavior) is postgresql 9.0.11.

One other item of note: db #2 has recently had an OID wrap-around, which
makes me suspect that plays some part in this behavior. I've looked at the
caching code in postgresql, and though I cannot claim to have a thorough
understanding of how it works, I have a theory.

How the cache works (as I understand it):

    When a cache lookup is performed, tuples are looked up by OID. The
cache contents are hashed into buckets. If an item is found in the cache,
it is promoted to the top of the bucket so that subsequent searches are
faster. If an item is not in the cache, it is looked up in the system
catalog, and an entry is inserted into the cache. If a lookup in the
catalog fails, a negative entry is added to the cache for the tuple.
Multiple entries can exist for the same tuple. The latest one is just
promoted to the top of the bucket, and the other gets aged out of the
cache, since it is never again accessed.

Theory:

    Given that we have wrapped around our OID counter, it is possible to
have multiple entries in the cache for the same OID. If one relation is
deleted, and a negative entry inserted into the cache, attempts to look up
the other may erroneously produce a negative cache hit, yielding our "cache
lookup failed for relation" error.


Is this a possibility? Are there any other obvious explanation for this?
The results from google related to this error seem to point to catalog
corruption, or a postgres bug.

Any pointers/enlightenment would be appreciated.

-davidc

--
*David Clymer*
VistaShare
 866-828-4782, ext. 828
www.VistaShare.com <http://www.vistashare.com/>

[image: Facebook]   www.facebook.com/vistashare
[image: Twitter]   www.twitter.com/vistashare

Re: Cache lookup failed for relation

From

Tom Lane

Date:

11 February 2013, 17:20:36

David Clymer <david.clymer@vistashare.com> writes:
> I've been seeing the following error in one database of ours:
>   "cache lookup failed for relation 7640518"

Always the same OID, or does it change?

> The SQL that apparently triggers this is:
>    drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;

> Unfortunately, manual attempts to reproduce the issue have failed. In
> normal operation, this statement is run as one of several parallel queries,
> and the tables are by nature, short lived. That said, they are not
> temporary tables.

Hm ... what are the parallel queries exactly?  If you're doing something
like dropping both ends of a foreign-key linkage in parallel, I'd not be
very astonished by an error like this, especially not in 9.0.x.  It'd be
basically a race condition between two sessions both locking the same
table, but by the time the second one gets the lock, the first one has
dropped the table.  (Robert Haas has done some great work towards
eliminating this type of race condition lately, but it's sure not in
9.0.x.)

> One other item of note: db #2 has recently had an OID wrap-around, which
> makes me suspect that plays some part in this behavior.

I don't believe that theory at all.

            regards, tom lane

Re: Cache lookup failed for relation

From

Péter Kovács

Date:

11 February 2013, 17:47:15

<p dir="ltr">I'm confused by the error message. Is a cache miss an error condition?<p dir="ltr">Thanks<br /> Peter<div
class="gmail_quote">OnFeb 11, 2013 6:22 PM, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>>
wrote:<brtype="attribution" /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">David Clymer <<a
href="mailto:david.clymer@vistashare.com">david.clymer@vistashare.com</a>>writes:<br /> > I've been seeing the
followingerror in one database of ours:<br /> >   "cache lookup failed for relation 7640518"<br /><br /> Always the
sameOID, or does it change?<br /><br /> > The SQL that apparently triggers this is:<br /> >    drop table if
existsns_e5461ae570429d0b7863cce9ef4d4ead;<br /><br /> > Unfortunately, manual attempts to reproduce the issue have
failed.In<br /> > normal operation, this statement is run as one of several parallel queries,<br /> > and the
tablesare by nature, short lived. That said, they are not<br /> > temporary tables.<br /><br /> Hm ... what are the
parallelqueries exactly?  If you're doing something<br /> like dropping both ends of a foreign-key linkage in parallel,
I'dnot be<br /> very astonished by an error like this, especially not in 9.0.x.  It'd be<br /> basically a race
conditionbetween two sessions both locking the same<br /> table, but by the time the second one gets the lock, the
firstone has<br /> dropped the table.  (Robert Haas has done some great work towards<br /> eliminating this type of
racecondition lately, but it's sure not in<br /> 9.0.x.)<br /><br /> > One other item of note: db #2 has recently
hadan OID wrap-around, which<br /> > makes me suspect that plays some part in this behavior.<br /><br /> I don't
believethat theory at all.<br /><br />                         regards, tom lane<br /><br /><br /> --<br /> Sent via
pgsql-generalmailing list (<a href="mailto:pgsql-general@postgresql.org">pgsql-general@postgresql.org</a>)<br /> To
makechanges to your subscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-general"
target="_blank">http://www.postgresql.org/mailpref/pgsql-general</a><br/></blockquote></div>

Re: Cache lookup failed for relation

From

Pavel Stehule

Date:

11 February 2013, 17:47:47

2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>:
> David Clymer <david.clymer@vistashare.com> writes:
>> I've been seeing the following error in one database of ours:
>>   "cache lookup failed for relation 7640518"
>
> Always the same OID, or does it change?
>
>> The SQL that apparently triggers this is:
>>    drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;
>
>> Unfortunately, manual attempts to reproduce the issue have failed. In
>> normal operation, this statement is run as one of several parallel queries,
>> and the tables are by nature, short lived. That said, they are not
>> temporary tables.
>
> Hm ... what are the parallel queries exactly?  If you're doing something
> like dropping both ends of a foreign-key linkage in parallel, I'd not be
> very astonished by an error like this, especially not in 9.0.x.  It'd be
> basically a race condition between two sessions both locking the same
> table, but by the time the second one gets the lock, the first one has
> dropped the table.  (Robert Haas has done some great work towards
> eliminating this type of race condition lately, but it's sure not in
> 9.0.x.)

we can see same behave in 9.1

when you try drop some tables in parallel sessions

Regards

Pavel Stehule

>
>> One other item of note: db #2 has recently had an OID wrap-around, which
>> makes me suspect that plays some part in this behavior.
>
> I don't believe that theory at all.
>
>                         regards, tom lane
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

Re: Cache lookup failed for relation

From

Tom Lane

Date:

11 February 2013, 17:55:00

Péter Kovács <peter.dunay.kovacs@gmail.com> writes:
> I'm confused by the error message. Is a cache miss an error condition?

Well, this isn't a "cache miss", it's more of a "there's no such OID in
the pg_class catalog" condition.  Normally you see something more
user-friendly; but in the case of going to remove a cross-table linkage,
the code isn't expecting the other table to not be there, so you get a
pretty low-level error.
        regards, tom lane

Re: Cache lookup failed for relation

From

David Clymer

Date:

11 February 2013, 18:11:40

On Mon, Feb 11, 2013 at 12:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> David Clymer <david.clymer@vistashare.com> writes:
> > I've been seeing the following error in one database of ours:
> >   "cache lookup failed for relation 7640518"
>
> Always the same OID, or does it change?
>

It appears that almost all instances reference a different OID.


>
> > The SQL that apparently triggers this is:
> >    drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;
>
> > Unfortunately, manual attempts to reproduce the issue have failed. In
> > normal operation, this statement is run as one of several parallel
> queries,
> > and the tables are by nature, short lived. That said, they are not
> > temporary tables.
>
> Hm ... what are the parallel queries exactly?


Sorry, that's our application level terminology. As far as postgres is
concerned they are just individual queries running at the roughly same time.


> If you're doing something
> like dropping both ends of a foreign-key linkage in parallel, I'd not be
> very astonished by an error like this, especially not in 9.0.x.  It'd be
> basically a race condition between two sessions both locking the same
> table, but by the time the second one gets the lock, the first one has
> dropped the table.  (Robert Haas has done some great work towards
> eliminating this type of race condition lately, but it's sure not in
> 9.0.x.)
>

I don't think we are doing that, but it may be that two queries are
attempting to drop the same table "if exists". I'll have to look at that a
bit more.

The SERIALIZABLE isolation mode is being used in 9.0, and REPEATABLE READ
in 9.2, which should be the same thing, correct (eg. 9.0 serializable ~ 9.2
repeatable read)?


> > One other item of note: db #2 has recently had an OID wrap-around, which
> > makes me suspect that plays some part in this behavior.
>
> I don't believe that theory at all.
>

Good to know.

-davidc

--
*David Clymer*
VistaShare
 866-828-4782, ext. 828
www.VistaShare.com <http://www.vistashare.com/>

[image: Facebook]   www.facebook.com/vistashare
[image: Twitter]   www.twitter.com/vistashare

Re: Cache lookup failed for relation

From

David Clymer

Date:

11 February 2013, 18:13:07

On Mon, Feb 11, 2013 at 12:47 PM, Pavel Stehule <pavel.stehule@gmail.com>wrote:

> 2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>:
> > David Clymer <david.clymer@vistashare.com> writes:
> >> I've been seeing the following error in one database of ours:
> >>   "cache lookup failed for relation 7640518"
> >
> > Always the same OID, or does it change?
> >
> >> The SQL that apparently triggers this is:
> >>    drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;
> >
> >> Unfortunately, manual attempts to reproduce the issue have failed. In
> >> normal operation, this statement is run as one of several parallel
> queries,
> >> and the tables are by nature, short lived. That said, they are not
> >> temporary tables.
> >
> > Hm ... what are the parallel queries exactly?  If you're doing something
> > like dropping both ends of a foreign-key linkage in parallel, I'd not be
> > very astonished by an error like this, especially not in 9.0.x.  It'd be
> > basically a race condition between two sessions both locking the same
> > table, but by the time the second one gets the lock, the first one has
> > dropped the table.  (Robert Haas has done some great work towards
> > eliminating this type of race condition lately, but it's sure not in
> > 9.0.x.)
>
> we can see same behave in 9.1
>
> when you try drop some tables in parallel sessions
>
>
OK,
--
*David Clymer*
VistaShare
 866-828-4782, ext. 828
www.VistaShare.com <http://www.vistashare.com/>

[image: Facebook]   www.facebook.com/vistashare
[image: Twitter]   www.twitter.com/vistashare

Re: Cache lookup failed for relation

From

David Clymer

Date:

11 February 2013, 18:18:02

On Mon, Feb 11, 2013 at 1:13 PM, David Clymer
<david.clymer@vistashare.com>wrote:

> On Mon, Feb 11, 2013 at 12:47 PM, Pavel Stehule <pavel.stehule@gmail.com>wrote:
>
>> 2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>:
>> > David Clymer <david.clymer@vistashare.com> writes:
>> >> I've been seeing the following error in one database of ours:
>> >>   "cache lookup failed for relation 7640518"
>> >
>> > Always the same OID, or does it change?
>> >
>> >> The SQL that apparently triggers this is:
>> >>    drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;
>> >
>> >> Unfortunately, manual attempts to reproduce the issue have failed. In
>> >> normal operation, this statement is run as one of several parallel
>> queries,
>> >> and the tables are by nature, short lived. That said, they are not
>> >> temporary tables.
>> >
>> > Hm ... what are the parallel queries exactly?  If you're doing something
>> > like dropping both ends of a foreign-key linkage in parallel, I'd not be
>> > very astonished by an error like this, especially not in 9.0.x.  It'd be
>> > basically a race condition between two sessions both locking the same
>> > table, but by the time the second one gets the lock, the first one has
>> > dropped the table.  (Robert Haas has done some great work towards
>> > eliminating this type of race condition lately, but it's sure not in
>> > 9.0.x.)
>>
>> we can see same behave in 9.1
>>
>> when you try drop some tables in parallel sessions
>>
>>
OK, so perhaps the difference is purely due to the use of postgres < 9.2 on
one db.

-davidc


--
*David Clymer*
VistaShare
 866-828-4782, ext. 828
www.VistaShare.com <http://www.vistashare.com/>

[image: Facebook]   www.facebook.com/vistashare
[image: Twitter]   www.twitter.com/vistashare

Re: Cache lookup failed for relation

From

Pavel Stehule

Date:

11 February 2013, 18:22:58

2013/2/11 David Clymer <david.clymer@vistashare.com>

> On Mon, Feb 11, 2013 at 1:13 PM, David Clymer <david.clymer@vistashare.com
> > wrote:
>
>> On Mon, Feb 11, 2013 at 12:47 PM, Pavel Stehule <pavel.stehule@gmail.com>wrote:
>>
>>> 2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>:
>>> > David Clymer <david.clymer@vistashare.com> writes:
>>> >> I've been seeing the following error in one database of ours:
>>> >>   "cache lookup failed for relation 7640518"
>>> >
>>> > Always the same OID, or does it change?
>>> >
>>> >> The SQL that apparently triggers this is:
>>> >>    drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead;
>>> >
>>> >> Unfortunately, manual attempts to reproduce the issue have failed. In
>>> >> normal operation, this statement is run as one of several parallel
>>> queries,
>>> >> and the tables are by nature, short lived. That said, they are not
>>> >> temporary tables.
>>> >
>>> > Hm ... what are the parallel queries exactly?  If you're doing
>>> something
>>> > like dropping both ends of a foreign-key linkage in parallel, I'd not
>>> be
>>> > very astonished by an error like this, especially not in 9.0.x.  It'd
>>> be
>>> > basically a race condition between two sessions both locking the same
>>> > table, but by the time the second one gets the lock, the first one has
>>> > dropped the table.  (Robert Haas has done some great work towards
>>> > eliminating this type of race condition lately, but it's sure not in
>>> > 9.0.x.)
>>>
>>> we can see same behave in 9.1
>>>
>>> when you try drop some tables in parallel sessions
>>>
>>>
> OK, so perhaps the difference is purely due to the use of postgres < 9.2
> on one db.
>

yes, I have not 9.2 now, but on 9.3 you get user friendly message

NOTICE:  table "foo" does not exist, skipping
DROP TABLE

Regards

Pavel


>
> -davidc
>
>
> --
> *David Clymer*
> VistaShare
>  866-828-4782, ext. 828
> www.VistaShare.com <http://www.vistashare.com/>
>
> [image: Facebook]   www.facebook.com/vistashare
> [image: Twitter]   www.twitter.com/vistashare
>

Re: Cache lookup failed for relation

From

Kevin Grittner

Date:

11 February 2013, 18:50:18

David Clymer <david.clymer@vistashare.com> wrote:=0A=0A> The SERIALIZABLE i=
solation mode is being used in 9.0, and=0A> REPEATABLE READ in 9.2, which s=
hould be the same thing, correct=0A> (eg. 9.0 serializable ~ 9.2 repeatable=
 read)?=0A=0ACorrect.=0A=0AIn 9.0 SERIALIZABLE and REPEATABLE READ are exac=
tly same.=A0 In 9.1=0Aand later REPEATABLE READ has not changed from 9.0, b=
ut=0ASERIALIZABLE has basically become REPEATABLE READ with extra=0Acheckin=
g for serialization failures.=0A=0A-Kevin=0A

Re: Cache lookup failed for relation

From

Péter Kovács

Date:

11 February 2013, 20:14:39

Thank you, Tom!

Recently, as I wanted to drop a schema (PG 8.2), I had to wade through a number of such messages and keep dropping rows in the pg_dependency table with the OID specified in the messages. When I was finally able to drop the schema, I wanted to drop the user (who previously owned the schema), but I got the same message again. I can imagine that something may have gone wrong in the schema at the application level, but how come I am getting the same messages when trying to drop a user?

Thanks,

Peter

On Mon, Feb 11, 2013 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Péter Kovács <peter.dunay.kovacs@gmail.com> writes:
> I'm confused by the error message. Is a cache miss an error condition?

Well, this isn't a "cache miss", it's more of a "there's no such OID in
the pg_class catalog" condition. Normally you see something more
user-friendly; but in the case of going to remove a cross-table linkage,
the code isn't expecting the other table to not be there, so you get a
pretty low-level error.

regards, tom lane