Thread: Cache lookup failed for relation
I've been seeing the following error in one database of ours: "cache lookup failed for relation 7640518" The SQL that apparently triggers this is: drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead; Unfortunately, manual attempts to reproduce the issue have failed. In normal operation, this statement is run as one of several parallel queries, and the tables are by nature, short lived. That said, they are not temporary tables. This is one of two very similar databases, and we are running the same software (same version) on top of each. The databases are in different versions of postgresql. Db #1 is postgresql 9.2.3, and db #2 (the one exhibiting the above behavior) is postgresql 9.0.11. One other item of note: db #2 has recently had an OID wrap-around, which makes me suspect that plays some part in this behavior. I've looked at the caching code in postgresql, and though I cannot claim to have a thorough understanding of how it works, I have a theory. How the cache works (as I understand it): When a cache lookup is performed, tuples are looked up by OID. The cache contents are hashed into buckets. If an item is found in the cache, it is promoted to the top of the bucket so that subsequent searches are faster. If an item is not in the cache, it is looked up in the system catalog, and an entry is inserted into the cache. If a lookup in the catalog fails, a negative entry is added to the cache for the tuple. Multiple entries can exist for the same tuple. The latest one is just promoted to the top of the bucket, and the other gets aged out of the cache, since it is never again accessed. Theory: Given that we have wrapped around our OID counter, it is possible to have multiple entries in the cache for the same OID. If one relation is deleted, and a negative entry inserted into the cache, attempts to look up the other may erroneously produce a negative cache hit, yielding our "cache lookup failed for relation" error. Is this a possibility? Are there any other obvious explanation for this? The results from google related to this error seem to point to catalog corruption, or a postgres bug. Any pointers/enlightenment would be appreciated. -davidc -- *David Clymer* VistaShare 866-828-4782, ext. 828 www.VistaShare.com <http://www.vistashare.com/> [image: Facebook] www.facebook.com/vistashare [image: Twitter] www.twitter.com/vistashare
David Clymer <david.clymer@vistashare.com> writes: > I've been seeing the following error in one database of ours: > "cache lookup failed for relation 7640518" Always the same OID, or does it change? > The SQL that apparently triggers this is: > drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead; > Unfortunately, manual attempts to reproduce the issue have failed. In > normal operation, this statement is run as one of several parallel queries, > and the tables are by nature, short lived. That said, they are not > temporary tables. Hm ... what are the parallel queries exactly? If you're doing something like dropping both ends of a foreign-key linkage in parallel, I'd not be very astonished by an error like this, especially not in 9.0.x. It'd be basically a race condition between two sessions both locking the same table, but by the time the second one gets the lock, the first one has dropped the table. (Robert Haas has done some great work towards eliminating this type of race condition lately, but it's sure not in 9.0.x.) > One other item of note: db #2 has recently had an OID wrap-around, which > makes me suspect that plays some part in this behavior. I don't believe that theory at all. regards, tom lane
<p dir="ltr">I'm confused by the error message. Is a cache miss an error condition?<p dir="ltr">Thanks<br /> Peter<div class="gmail_quote">OnFeb 11, 2013 6:22 PM, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>> wrote:<brtype="attribution" /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">David Clymer <<a href="mailto:david.clymer@vistashare.com">david.clymer@vistashare.com</a>>writes:<br /> > I've been seeing the followingerror in one database of ours:<br /> > "cache lookup failed for relation 7640518"<br /><br /> Always the sameOID, or does it change?<br /><br /> > The SQL that apparently triggers this is:<br /> > drop table if existsns_e5461ae570429d0b7863cce9ef4d4ead;<br /><br /> > Unfortunately, manual attempts to reproduce the issue have failed.In<br /> > normal operation, this statement is run as one of several parallel queries,<br /> > and the tablesare by nature, short lived. That said, they are not<br /> > temporary tables.<br /><br /> Hm ... what are the parallelqueries exactly? If you're doing something<br /> like dropping both ends of a foreign-key linkage in parallel, I'dnot be<br /> very astonished by an error like this, especially not in 9.0.x. It'd be<br /> basically a race conditionbetween two sessions both locking the same<br /> table, but by the time the second one gets the lock, the firstone has<br /> dropped the table. (Robert Haas has done some great work towards<br /> eliminating this type of racecondition lately, but it's sure not in<br /> 9.0.x.)<br /><br /> > One other item of note: db #2 has recently hadan OID wrap-around, which<br /> > makes me suspect that plays some part in this behavior.<br /><br /> I don't believethat theory at all.<br /><br /> regards, tom lane<br /><br /><br /> --<br /> Sent via pgsql-generalmailing list (<a href="mailto:pgsql-general@postgresql.org">pgsql-general@postgresql.org</a>)<br /> To makechanges to your subscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-general" target="_blank">http://www.postgresql.org/mailpref/pgsql-general</a><br/></blockquote></div>
2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>: > David Clymer <david.clymer@vistashare.com> writes: >> I've been seeing the following error in one database of ours: >> "cache lookup failed for relation 7640518" > > Always the same OID, or does it change? > >> The SQL that apparently triggers this is: >> drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead; > >> Unfortunately, manual attempts to reproduce the issue have failed. In >> normal operation, this statement is run as one of several parallel queries, >> and the tables are by nature, short lived. That said, they are not >> temporary tables. > > Hm ... what are the parallel queries exactly? If you're doing something > like dropping both ends of a foreign-key linkage in parallel, I'd not be > very astonished by an error like this, especially not in 9.0.x. It'd be > basically a race condition between two sessions both locking the same > table, but by the time the second one gets the lock, the first one has > dropped the table. (Robert Haas has done some great work towards > eliminating this type of race condition lately, but it's sure not in > 9.0.x.) we can see same behave in 9.1 when you try drop some tables in parallel sessions Regards Pavel Stehule > >> One other item of note: db #2 has recently had an OID wrap-around, which >> makes me suspect that plays some part in this behavior. > > I don't believe that theory at all. > > regards, tom lane > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general
Péter Kovács <peter.dunay.kovacs@gmail.com> writes: > I'm confused by the error message. Is a cache miss an error condition? Well, this isn't a "cache miss", it's more of a "there's no such OID in the pg_class catalog" condition. Normally you see something more user-friendly; but in the case of going to remove a cross-table linkage, the code isn't expecting the other table to not be there, so you get a pretty low-level error. regards, tom lane
On Mon, Feb 11, 2013 at 12:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > David Clymer <david.clymer@vistashare.com> writes: > > I've been seeing the following error in one database of ours: > > "cache lookup failed for relation 7640518" > > Always the same OID, or does it change? > It appears that almost all instances reference a different OID. > > > The SQL that apparently triggers this is: > > drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead; > > > Unfortunately, manual attempts to reproduce the issue have failed. In > > normal operation, this statement is run as one of several parallel > queries, > > and the tables are by nature, short lived. That said, they are not > > temporary tables. > > Hm ... what are the parallel queries exactly? Sorry, that's our application level terminology. As far as postgres is concerned they are just individual queries running at the roughly same time. > If you're doing something > like dropping both ends of a foreign-key linkage in parallel, I'd not be > very astonished by an error like this, especially not in 9.0.x. It'd be > basically a race condition between two sessions both locking the same > table, but by the time the second one gets the lock, the first one has > dropped the table. (Robert Haas has done some great work towards > eliminating this type of race condition lately, but it's sure not in > 9.0.x.) > I don't think we are doing that, but it may be that two queries are attempting to drop the same table "if exists". I'll have to look at that a bit more. The SERIALIZABLE isolation mode is being used in 9.0, and REPEATABLE READ in 9.2, which should be the same thing, correct (eg. 9.0 serializable ~ 9.2 repeatable read)? > > One other item of note: db #2 has recently had an OID wrap-around, which > > makes me suspect that plays some part in this behavior. > > I don't believe that theory at all. > Good to know. -davidc -- *David Clymer* VistaShare 866-828-4782, ext. 828 www.VistaShare.com <http://www.vistashare.com/> [image: Facebook] www.facebook.com/vistashare [image: Twitter] www.twitter.com/vistashare
On Mon, Feb 11, 2013 at 12:47 PM, Pavel Stehule <pavel.stehule@gmail.com>wrote: > 2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>: > > David Clymer <david.clymer@vistashare.com> writes: > >> I've been seeing the following error in one database of ours: > >> "cache lookup failed for relation 7640518" > > > > Always the same OID, or does it change? > > > >> The SQL that apparently triggers this is: > >> drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead; > > > >> Unfortunately, manual attempts to reproduce the issue have failed. In > >> normal operation, this statement is run as one of several parallel > queries, > >> and the tables are by nature, short lived. That said, they are not > >> temporary tables. > > > > Hm ... what are the parallel queries exactly? If you're doing something > > like dropping both ends of a foreign-key linkage in parallel, I'd not be > > very astonished by an error like this, especially not in 9.0.x. It'd be > > basically a race condition between two sessions both locking the same > > table, but by the time the second one gets the lock, the first one has > > dropped the table. (Robert Haas has done some great work towards > > eliminating this type of race condition lately, but it's sure not in > > 9.0.x.) > > we can see same behave in 9.1 > > when you try drop some tables in parallel sessions > > OK, -- *David Clymer* VistaShare 866-828-4782, ext. 828 www.VistaShare.com <http://www.vistashare.com/> [image: Facebook] www.facebook.com/vistashare [image: Twitter] www.twitter.com/vistashare
On Mon, Feb 11, 2013 at 1:13 PM, David Clymer <david.clymer@vistashare.com>wrote: > On Mon, Feb 11, 2013 at 12:47 PM, Pavel Stehule <pavel.stehule@gmail.com>wrote: > >> 2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>: >> > David Clymer <david.clymer@vistashare.com> writes: >> >> I've been seeing the following error in one database of ours: >> >> "cache lookup failed for relation 7640518" >> > >> > Always the same OID, or does it change? >> > >> >> The SQL that apparently triggers this is: >> >> drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead; >> > >> >> Unfortunately, manual attempts to reproduce the issue have failed. In >> >> normal operation, this statement is run as one of several parallel >> queries, >> >> and the tables are by nature, short lived. That said, they are not >> >> temporary tables. >> > >> > Hm ... what are the parallel queries exactly? If you're doing something >> > like dropping both ends of a foreign-key linkage in parallel, I'd not be >> > very astonished by an error like this, especially not in 9.0.x. It'd be >> > basically a race condition between two sessions both locking the same >> > table, but by the time the second one gets the lock, the first one has >> > dropped the table. (Robert Haas has done some great work towards >> > eliminating this type of race condition lately, but it's sure not in >> > 9.0.x.) >> >> we can see same behave in 9.1 >> >> when you try drop some tables in parallel sessions >> >> OK, so perhaps the difference is purely due to the use of postgres < 9.2 on one db. -davidc -- *David Clymer* VistaShare 866-828-4782, ext. 828 www.VistaShare.com <http://www.vistashare.com/> [image: Facebook] www.facebook.com/vistashare [image: Twitter] www.twitter.com/vistashare
2013/2/11 David Clymer <david.clymer@vistashare.com> > On Mon, Feb 11, 2013 at 1:13 PM, David Clymer <david.clymer@vistashare.com > > wrote: > >> On Mon, Feb 11, 2013 at 12:47 PM, Pavel Stehule <pavel.stehule@gmail.com>wrote: >> >>> 2013/2/11 Tom Lane <tgl@sss.pgh.pa.us>: >>> > David Clymer <david.clymer@vistashare.com> writes: >>> >> I've been seeing the following error in one database of ours: >>> >> "cache lookup failed for relation 7640518" >>> > >>> > Always the same OID, or does it change? >>> > >>> >> The SQL that apparently triggers this is: >>> >> drop table if exists ns_e5461ae570429d0b7863cce9ef4d4ead; >>> > >>> >> Unfortunately, manual attempts to reproduce the issue have failed. In >>> >> normal operation, this statement is run as one of several parallel >>> queries, >>> >> and the tables are by nature, short lived. That said, they are not >>> >> temporary tables. >>> > >>> > Hm ... what are the parallel queries exactly? If you're doing >>> something >>> > like dropping both ends of a foreign-key linkage in parallel, I'd not >>> be >>> > very astonished by an error like this, especially not in 9.0.x. It'd >>> be >>> > basically a race condition between two sessions both locking the same >>> > table, but by the time the second one gets the lock, the first one has >>> > dropped the table. (Robert Haas has done some great work towards >>> > eliminating this type of race condition lately, but it's sure not in >>> > 9.0.x.) >>> >>> we can see same behave in 9.1 >>> >>> when you try drop some tables in parallel sessions >>> >>> > OK, so perhaps the difference is purely due to the use of postgres < 9.2 > on one db. > yes, I have not 9.2 now, but on 9.3 you get user friendly message NOTICE: table "foo" does not exist, skipping DROP TABLE Regards Pavel > > -davidc > > > -- > *David Clymer* > VistaShare > 866-828-4782, ext. 828 > www.VistaShare.com <http://www.vistashare.com/> > > [image: Facebook] www.facebook.com/vistashare > [image: Twitter] www.twitter.com/vistashare >
David Clymer <david.clymer@vistashare.com> wrote:=0A=0A> The SERIALIZABLE i= solation mode is being used in 9.0, and=0A> REPEATABLE READ in 9.2, which s= hould be the same thing, correct=0A> (eg. 9.0 serializable ~ 9.2 repeatable= read)?=0A=0ACorrect.=0A=0AIn 9.0 SERIALIZABLE and REPEATABLE READ are exac= tly same.=A0 In 9.1=0Aand later REPEATABLE READ has not changed from 9.0, b= ut=0ASERIALIZABLE has basically become REPEATABLE READ with extra=0Acheckin= g for serialization failures.=0A=0A-Kevin=0A
Thank you, Tom!
Recently, as I wanted to drop a schema (PG 8.2), I had to wade through a number of such messages and keep dropping rows in the pg_dependency table with the OID specified in the messages. When I was finally able to drop the schema, I wanted to drop the user (who previously owned the schema), but I got the same message again. I can imagine that something may have gone wrong in the schema at the application level, but how come I am getting the same messages when trying to drop a user?
Thanks,
Peter
On Mon, Feb 11, 2013 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Péter Kovács <peter.dunay.kovacs@gmail.com> writes:Well, this isn't a "cache miss", it's more of a "there's no such OID in
> I'm confused by the error message. Is a cache miss an error condition?
the pg_class catalog" condition. Normally you see something more
user-friendly; but in the case of going to remove a cross-table linkage,
the code isn't expecting the other table to not be there, so you get a
pretty low-level error.
regards, tom lane