Thread: Re: [GENERAL] cache lookup of relation 165058647 failed
> I'v find out that this error occurs in: > dependency.c file > > 2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of relation > 149064743 failed > 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist > 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist > > in getRelationDescription(StringInfo buffer, Oid relid) function. > > Any ideas what can cause this errors. <aol>Me too.</aol> But, I am suspecting that it's a race condition with the new background writer code. I've started testing a new database design and was able to reproduce this on my laptop nearly 90% of the time, but could only reproduce it about 10% of the time on my production databases until I figured out what the difference was, fsync. fsync was causing enough of a slow down that SearchSysCache() was finding the tuple, whereas with fsync = false, it wasn't able to find it. But, in search of proving that it wasn't fsync (I use fsync = false on my laptop to save my pour drive), I threw in a sleep in between my tests, and I'm able to get things to work 100% of the time by adding a sleep. The following fails to work with fsync = false, 90% of the time and with fsync = true, only 10% of the time. % psql -f test-begin.sql template1 && psql -f test_enterprise_class.sql && psql -f test-end1.sql template1 && psql -f test-end2.sql template1 But, if I change the command to: % psql -f test-begin.sql template1 && psql -f test_enterprise_class.sql && psql -f test-end1.sql template1 && sleep 1 && psql -f test-end2.sql template1 I have no problems with cache relation misses. As for what happens in those commands, I'm: -- 1) Dropping the test database and re-creating it -- 2) In a different connection, load a rather large schema as the dba -- 3) Connect again and create a temp table -- 4) Connect a second time, and check to see if the temp table exists The sleep comes at step 3.5 in the above sequence of operations. *boom* Here's a snippet of my terminal (the first thing I do after BEGINning a transaction is create a temp table if it doesn't exist): ## BEGIN ## [snip] [...] COMMIT You are now connected to database "test" as user "usr". BEGIN psql:test-end2.sql:3: ERROR: cache lookup failed for relation 398033 CONTEXT: SQL query "SELECT TRUE FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname = 'tmptbl'::TEXT AND c.relkind = 'r'::TEXT AND pg_catalog.pg_table_is_visible(c.oid)" PL/pgSQL function "create_tmptbl" line 2 at perform PL/pgSQL function "check_or_populate_func" line 8 at assignment PL/pgSQL function "setuid_wrapper_func" line 5 at return ## END ## What's really bothering me is I can push the up arrow on the console, run the exact same thing (including dropping the database), and it'll work sometimes. Very disturbing. As I said, I'm *very* suspicious of the background writer goo that Jan added simply because I can't think of anything else that'd have this problem. I've run each of those commands 100 times now, with and without the sleep 1. With the sleep 1, it's worked 100% of the time. Jan, any bit of code that comes to mind? All of my bgwriter_* settings are set to their default. -sc -- Sean Chittenden
Sean Chittenden <sean@chittenden.org> writes: > But, I am suspecting that it's a race condition with the new background > writer code. Why? Have you demonstrated that the failure does not occur in 7.4? > psql:test-end2.sql:3: ERROR: cache lookup failed for relation 398033 > CONTEXT: SQL query "SELECT TRUE FROM pg_catalog.pg_class c LEFT JOIN > pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname = > 'tmptbl'::TEXT AND c.relkind = 'r'::TEXT AND > pg_catalog.pg_table_is_visible(c.oid)" I think that pg_table_is_visible() will inspect the catalogs using SnapshotNow semantics, while the above query will feed it with OIDs that were valid under a start-of-query snapshot. So I'd expect failures in any recent PG version, if tables are being dynamically created/deleted by concurrent transactions. regards, tom lane
[Renames thread from "The Pepsi Challenge" to "The PostgreSQL Challenge"] >> But, I am suspecting that it's a race condition with the new >> background >> writer code. > > Why? Have you demonstrated that the failure does not occur in 7.4? What other operations have been added to HEAD that would allow for successful operation of sequential use or testing of temp tables? More importantly, yes, I can confirm that this behavior doesn't exist in REL7_4 as of 40min ago. I've been running the exact same tests repeatedly with nearly identical configs (as close as they can be given the config changes) on REL7_4 as I have HEAD and only HEAD is giving me problems. >> psql:test-end2.sql:3: ERROR: cache lookup failed for relation 398033 >> CONTEXT: SQL query "SELECT TRUE FROM pg_catalog.pg_class c LEFT JOIN >> pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname = >> 'tmptbl'::TEXT AND c.relkind = 'r'::TEXT AND >> pg_catalog.pg_table_is_visible(c.oid)" > > I think that pg_table_is_visible() will inspect the catalogs using > SnapshotNow semantics, while the above query will feed it with OIDs > that > were valid under a start-of-query snapshot. So I'd expect failures in > any recent PG version, if tables are being dynamically created/deleted > by concurrent transactions. There is no concurrency in the test I gave: it's all sequential. 0 | -- connect 1 | BEGIN; 2 | SELECT setuid_wrapper(); 3 | -- Do other things; 4 | COMMIT; 5 | -- disconnect 6 | -- connect again 7 | BEGIN; 8 | SELECT setuid_wrapper(); -- This fails some % of the time If I add step 5.5 that is a sleep, step 8 will work, without fail. In 7.4, I have no problems, however. That said, the bgwriter code is the only think that I can think of that would muck with caching. -sc PS Other comments temp schema permission patch? -- Sean Chittenden
>>> I'v find out that this error occurs in: >>> dependency.c file >>> >>> 2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of >>> relation >>> 149064743 failed >>> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist >>> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist >>> >>> in getRelationDescription(StringInfo buffer, Oid relid) function. >>> >>> Any ideas what can cause this errors. >> <aol>Me too.</aol> >> But, I am suspecting that it's a race condition with the new >> background writer code. I've started testing a new database design >> and was able to reproduce this on my laptop nearly 90% of the time, >> but could only reproduce it about 10% of the time on my production >> databases until I figured out what the difference was, fsync. > > temp tables don't use the shared buffer cache, how can this be related > to the BG writer? Don't the system catalogs use the shared buffer cache? BEGIN; SELECT create_temp_table_func(); -- Inserts a row into pg_class via CREATE TEMP TABLE -- Do other stuff COMMIT; -- After the commit, the row is now visible to other backends -- disconnect -- If the delay between the disconnect and reconnect is small enough -- reconnect -- It's as though there is a race condition that allows the function -- pg_table_is_visible() to assert the "cache lookup of relation" -- error. BEGIN; SELECT create_temp_table_func(); -- Before the CREATE TEMP TABLE, I call /* SELECT TRUE FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname = ''footmp''::TEXT AND c.relkind = ''r''::TEXT AND pg_catalog.pg_table_is_visible(c.oid); */ -- But the query fails My guess was that the series of events went something like: proc 0) COMMIT's and the row in pg_class is committed proc 1) bgwriter writer code removes a page for the cache proc 2) queries for the page [*] proc 1) writes it to disk proc 2) queries for the page [*] proc 1) sync's the fd [*] proc 2 queries for the page at either of these points In 7.4, there is no bgwriter or background process mucking with cache, which is why this works 100% of the time. In 7.5, however, there's a 200ms gap where a race condition appears and pg_table_is_visible() fails its PointerIsValid() check. If I put a sleep in, the sleep gives the bgwriter enough time to commit the pages to disk so that the queries for the page happen after the fd's been sync()'ed. I have no other clue as to why this would be happening though, so believe me when I say, I could very well be quite wrong.... but this is my best, quasi-educated/grep(1)'ed guess. -sc -- Sean Chittenden
>>> temp tables don't use the shared buffer cache, how can this be >>> related to the BG writer? >> Don't the system catalogs use the shared buffer cache? >> BEGIN; >> SELECT create_temp_table_func(); -- Inserts a row into pg_class via >> CREATE TEMP TABLE >> -- Do other stuff >> COMMIT; -- After the commit, the row is now visible to other >> backends >> -- disconnect -- If the delay between the disconnect and reconnect >> is small enough >> -- reconnect -- It's as though there is a race condition that allows >> the function >> -- pg_table_is_visible() to assert the "cache lookup of relation" >> -- error. >> BEGIN; >> SELECT create_temp_table_func(); -- Before the CREATE TEMP TABLE, I >> call >> /* SELECT TRUE FROM pg_catalog.pg_class c >> LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace >> WHERE c.relname = ''footmp''::TEXT AND >> c.relkind = ''r''::TEXT AND >> pg_catalog.pg_table_is_visible(c.oid); */ >> -- But the query fails >> My guess was that the series of events went something like: >> proc 0) COMMIT's and the row in pg_class is committed >> proc 1) bgwriter writer code removes a page for the cache >> proc 2) queries for the page [*] >> proc 1) writes it to disk >> proc 2) queries for the page [*] >> proc 1) sync's the fd >> [*] proc 2 queries for the page at either of these points >> In 7.4, there is no bgwriter or background process mucking with cache, > > Except for the checkpoint process, which does exactly the same as the > bgwriter does, and ALL concurrent backends whenever they feel the need > to evict a dirty buffer. Hrm... well, haven't the slightest idea what would be causing this then. About all I can say is that some problem does exist in HEAD that doesn't exist in REL7_4 that I'm able to tickle via temp tables. :-/ Because this is time sensitive, what debugging foo could I insert to get some useful diagnostic output? > If it makes a difference if a pg_class page is dirty in the buffer or > copied out to disk with respect to visibility rules of the tuples > contained in it, then the whole thing is a way larger bug than the one > in MIB. First of all, committed or not, a temp object from one session > should NEVER be visible in any other. Hrm... well, I'm going to take my test scripts and reduce them down to a test case. For sure, there's something different in HEAD that's causing problems that are time sensitive. I've even thought about grabbing my camera and making a low res 320x200 movie of the test sequence. I went and ran script(1) on one of the runs for the sake of something. http://sean.chittenden.org/postgresql/pgsql-create-temp-bug- typescript.txt Any help or assistance is greatly appreciated, I'm not sure where to go from here. -sc -- Sean Chittenden
Sean Chittenden wrote: >> I'v find out that this error occurs in: >> dependency.c file >> >> 2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of relation >> 149064743 failed >> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist >> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist >> >> in getRelationDescription(StringInfo buffer, Oid relid) function. >> >> Any ideas what can cause this errors. > > <aol>Me too.</aol> > > But, I am suspecting that it's a race condition with the new background > writer code. I've started testing a new database design and was able > to reproduce this on my laptop nearly 90% of the time, but could only > reproduce it about 10% of the time on my production databases until I > figured out what the difference was, fsync. temp tables don't use the shared buffer cache, how can this be related to the BG writer? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Sean Chittenden wrote: >>>> I'v find out that this error occurs in: >>>> dependency.c file >>>> >>>> 2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of >>>> relation >>>> 149064743 failed >>>> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist >>>> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist >>>> >>>> in getRelationDescription(StringInfo buffer, Oid relid) function. >>>> >>>> Any ideas what can cause this errors. >>> <aol>Me too.</aol> >>> But, I am suspecting that it's a race condition with the new >>> background writer code. I've started testing a new database design >>> and was able to reproduce this on my laptop nearly 90% of the time, >>> but could only reproduce it about 10% of the time on my production >>> databases until I figured out what the difference was, fsync. >> >> temp tables don't use the shared buffer cache, how can this be related >> to the BG writer? > > Don't the system catalogs use the shared buffer cache? > > BEGIN; > SELECT create_temp_table_func(); -- Inserts a row into pg_class via > CREATE TEMP TABLE > -- Do other stuff > COMMIT; -- After the commit, the row is now visible to other > backends > -- disconnect -- If the delay between the disconnect and reconnect is > small enough > -- reconnect -- It's as though there is a race condition that allows > the function > -- pg_table_is_visible() to assert the "cache lookup of relation" > -- error. > BEGIN; > SELECT create_temp_table_func(); -- Before the CREATE TEMP TABLE, I > call > /* SELECT TRUE FROM pg_catalog.pg_class c > LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace > WHERE c.relname = ''footmp''::TEXT AND > c.relkind = ''r''::TEXT AND > pg_catalog.pg_table_is_visible(c.oid); */ > -- But the query fails > > My guess was that the series of events went something like: > > proc 0) COMMIT's and the row in pg_class is committed > proc 1) bgwriter writer code removes a page for the cache > proc 2) queries for the page [*] > proc 1) writes it to disk > proc 2) queries for the page [*] > proc 1) sync's the fd > > [*] proc 2 queries for the page at either of these points > > In 7.4, there is no bgwriter or background process mucking with cache, Except for the checkpoint process, which does exactly the same as the bgwriter does, and ALL concurrent backends whenever they feel the need to evict a dirty buffer. If it makes a difference if a pg_class page is dirty in the buffer or copied out to disk with respect to visibility rules of the tuples contained in it, then the whole thing is a way larger bug than the one in MIB. First of all, committed or not, a temp object from one session should NEVER be visible in any other. Jan > which is why this works 100% of the time. In 7.5, however, there's a > 200ms gap where a race condition appears and pg_table_is_visible() > fails its PointerIsValid() check. If I put a sleep in, the sleep gives > the bgwriter enough time to commit the pages to disk so that the > queries for the page happen after the fd's been sync()'ed. > > I have no other clue as to why this would be happening though, so > believe me when I say, I could very well be quite wrong.... but this is > my best, quasi-educated/grep(1)'ed guess. > > -sc > -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #