Thread: Re: [GENERAL] cache lookup of relation 165058647 failed

Re: [GENERAL] cache lookup of relation 165058647 failed

From

Sean Chittenden

Date:

04 May 2004, 02:33:45

> I'v find out that this error occurs in:
>  dependency.c file
>
> 2004-04-26 11:09:34 ERROR:  dependency.c 1621: cache lookup of relation
> 149064743 failed
> 2004-04-26 11:09:34 ERROR:  Relation "tmp_table1" does not exist
> 2004-04-26 11:09:34 ERROR:  Relation "tmp_table1" does not exist
>
> in getRelationDescription(StringInfo buffer, Oid relid) function.
>
> Any ideas what can cause this errors.

<aol>Me too.</aol>

But, I am suspecting that it's a race condition with the new background
writer code.  I've started testing a new database design and was able
to reproduce this on my laptop nearly 90% of the time, but could only
reproduce it about 10% of the time on my production databases until I
figured out what the difference was, fsync.

fsync was causing enough of a slow down that SearchSysCache() was
finding the tuple, whereas with fsync = false, it wasn't able to find
it.  But, in search of proving that it wasn't fsync (I use fsync =
false on my laptop to save my pour drive), I threw in a sleep in
between my tests, and I'm able to get things to work 100% of the time
by adding a sleep.  The following fails to work with fsync = false, 90%
of the time and with fsync = true, only 10% of the time.

% psql -f test-begin.sql template1 && psql -f test_enterprise_class.sql
&& psql -f test-end1.sql template1 && psql -f test-end2.sql template1

But, if I change the command to:

% psql -f test-begin.sql template1 && psql -f test_enterprise_class.sql
&& psql -f test-end1.sql template1 && sleep 1 && psql -f test-end2.sql
template1

I have no problems with cache relation misses.  As for what happens in
those commands, I'm:

-- 1) Dropping the test database and re-creating it
-- 2) In a different connection, load a rather large schema as the dba
-- 3) Connect again and create a temp table
-- 4) Connect a second time, and check to see if the temp table exists

The sleep comes at step 3.5 in the above sequence of operations.

*boom*  Here's a snippet of my terminal (the first thing I do after
BEGINning a transaction is create a temp table if it doesn't exist):

## BEGIN ##
[snip]
[...]
COMMIT
You are now connected to database "test" as user "usr".
BEGIN
psql:test-end2.sql:3: ERROR:  cache lookup failed for relation 398033
CONTEXT:  SQL query "SELECT  TRUE FROM pg_catalog.pg_class c LEFT JOIN
pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname =
'tmptbl'::TEXT AND c.relkind = 'r'::TEXT AND
pg_catalog.pg_table_is_visible(c.oid)"
PL/pgSQL function "create_tmptbl" line 2 at perform
PL/pgSQL function "check_or_populate_func" line 8 at assignment
PL/pgSQL function "setuid_wrapper_func" line 5 at return
## END ##

What's really bothering me is I can push the up arrow on the console,
run the exact same thing (including dropping the database), and it'll
work sometimes.  Very disturbing.  As I said, I'm *very* suspicious of
the background writer goo that Jan added simply because I can't think
of anything else that'd have this problem.

I've run each of those commands 100 times now, with and without the
sleep 1.  With the sleep 1, it's worked 100% of the time.  Jan, any bit
of code that comes to mind?

All of my bgwriter_* settings are set to their default.

-sc

--
Sean Chittenden

Re: [GENERAL] cache lookup of relation 165058647 failed

From

Tom Lane

Date:

04 May 2004, 02:57:06

Sean Chittenden <sean@chittenden.org> writes:
> But, I am suspecting that it's a race condition with the new background
> writer code.

Why?  Have you demonstrated that the failure does not occur in 7.4?

> psql:test-end2.sql:3: ERROR:  cache lookup failed for relation 398033
> CONTEXT:  SQL query "SELECT  TRUE FROM pg_catalog.pg_class c LEFT JOIN
> pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname =
> 'tmptbl'::TEXT AND c.relkind = 'r'::TEXT AND
> pg_catalog.pg_table_is_visible(c.oid)"

I think that pg_table_is_visible() will inspect the catalogs using
SnapshotNow semantics, while the above query will feed it with OIDs that
were valid under a start-of-query snapshot.  So I'd expect failures in
any recent PG version, if tables are being dynamically created/deleted
by concurrent transactions.

            regards, tom lane

Re: [GENERAL] cache lookup of relation 165058647 failed

From

Sean Chittenden

Date:

04 May 2004, 04:07:25

[Renames thread from "The Pepsi Challenge" to "The PostgreSQL
Challenge"]

>> But, I am suspecting that it's a race condition with the new
>> background
>> writer code.
>
> Why?  Have you demonstrated that the failure does not occur in 7.4?

What other operations have been added to HEAD that would allow for
successful operation of sequential use or testing of temp tables?

More importantly, yes, I can confirm that this behavior doesn't exist
in REL7_4 as of 40min ago.  I've been running the exact same tests
repeatedly with nearly identical configs (as close as they can be given
the config changes) on REL7_4 as I have HEAD and only HEAD is giving me
problems.

>> psql:test-end2.sql:3: ERROR:  cache lookup failed for relation 398033
>> CONTEXT:  SQL query "SELECT  TRUE FROM pg_catalog.pg_class c LEFT JOIN
>> pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname =
>> 'tmptbl'::TEXT AND c.relkind = 'r'::TEXT AND
>> pg_catalog.pg_table_is_visible(c.oid)"
>
> I think that pg_table_is_visible() will inspect the catalogs using
> SnapshotNow semantics, while the above query will feed it with OIDs
> that
> were valid under a start-of-query snapshot.  So I'd expect failures in
> any recent PG version, if tables are being dynamically created/deleted
> by concurrent transactions.

There is no concurrency in the test I gave: it's all sequential.

0 | -- connect
1 | BEGIN;
2 | SELECT setuid_wrapper();
3 | -- Do other things;
4 | COMMIT;
5 | -- disconnect
6 | -- connect again
7 | BEGIN;
8 | SELECT setuid_wrapper();  -- This fails some % of the time

If I add step 5.5 that is a sleep, step 8 will work, without fail.  In
7.4, I have no problems, however.  That said, the bgwriter code is the
only think that I can think of that would muck with caching.  -sc


PS  Other comments temp schema permission patch?

--
Sean Chittenden

Re: [GENERAL] cache lookup of relation 165058647 failed

From

Sean Chittenden

Date:

05 May 2004, 17:44:28

>>> I'v find out that this error occurs in:
>>>  dependency.c file
>>>
>>> 2004-04-26 11:09:34 ERROR:  dependency.c 1621: cache lookup of
>>> relation
>>> 149064743 failed
>>> 2004-04-26 11:09:34 ERROR:  Relation "tmp_table1" does not exist
>>> 2004-04-26 11:09:34 ERROR:  Relation "tmp_table1" does not exist
>>>
>>> in getRelationDescription(StringInfo buffer, Oid relid) function.
>>>
>>> Any ideas what can cause this errors.
>> <aol>Me too.</aol>
>> But, I am suspecting that it's a race condition with the new
>> background writer code.  I've started testing a new database design
>> and was able to reproduce this on my laptop nearly 90% of the time,
>> but could only reproduce it about 10% of the time on my production
>> databases until I figured out what the difference was, fsync.
>
> temp tables don't use the shared buffer cache, how can this be related
> to the BG writer?

Don't the system catalogs use the shared buffer cache?

BEGIN;
SELECT create_temp_table_func();  -- Inserts a row into pg_class via
CREATE TEMP TABLE
-- Do other stuff
COMMIT;              -- After the commit, the row is now visible to other
backends
-- disconnect      -- If the delay between the disconnect and reconnect is
small enough
-- reconnect        -- It's as though there is a race condition that allows
the function
                -- pg_table_is_visible() to assert the "cache lookup of relation"
                -- error.
BEGIN;
SELECT create_temp_table_func();  -- Before the CREATE TEMP TABLE, I
call
                             /* SELECT TRUE FROM pg_catalog.pg_class c
                                LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
                                WHERE c.relname = ''footmp''::TEXT AND
                                c.relkind = ''r''::TEXT AND
                                pg_catalog.pg_table_is_visible(c.oid); */
                            -- But the query fails

My guess was that the series of events went something like:

proc 0) COMMIT's and the row in pg_class is committed
proc 1) bgwriter writer code removes a page for the cache
proc 2) queries for the page  [*]
proc 1) writes it to disk
proc 2) queries for the page  [*]
proc 1) sync's the fd

[*] proc 2 queries for the page at either of these points

In 7.4, there is no bgwriter or background process mucking with cache,
which is why this works 100% of the time.  In 7.5, however, there's a
200ms gap where a race condition appears and pg_table_is_visible()
fails its PointerIsValid() check.  If I put a sleep in, the sleep gives
the bgwriter enough time to commit the pages to disk so that the
queries for the page happen after the fd's been sync()'ed.

I have no other clue as to why this would be happening though, so
believe me when I say, I could very well be quite wrong.... but this is
my best, quasi-educated/grep(1)'ed guess.

-sc

--
Sean Chittenden

Re: [GENERAL] cache lookup of relation 165058647 failed

From

Sean Chittenden

Date:

06 May 2004, 02:51:26

>>> temp tables don't use the shared buffer cache, how can this be
>>> related to the BG writer?
>> Don't the system catalogs use the shared buffer cache?
>> BEGIN;
>> SELECT create_temp_table_func();  -- Inserts a row into pg_class via
>> CREATE TEMP TABLE
>> -- Do other stuff
>> COMMIT;              -- After the commit, the row is now visible to other
>> backends
>> -- disconnect      -- If the delay between the disconnect and reconnect
>> is small enough
>> -- reconnect        -- It's as though there is a race condition that allows
>> the function
>>                 -- pg_table_is_visible() to assert the "cache lookup of relation"
>>                 -- error.
>> BEGIN;
>> SELECT create_temp_table_func();  -- Before the CREATE TEMP TABLE, I
>> call
>>                              /* SELECT TRUE FROM pg_catalog.pg_class c
>>                                 LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
>>                                 WHERE c.relname = ''footmp''::TEXT AND
>>                                 c.relkind = ''r''::TEXT AND
>>                                 pg_catalog.pg_table_is_visible(c.oid); */
>>                             -- But the query fails
>> My guess was that the series of events went something like:
>> proc 0) COMMIT's and the row in pg_class is committed
>> proc 1) bgwriter writer code removes a page for the cache
>> proc 2) queries for the page  [*]
>> proc 1) writes it to disk
>> proc 2) queries for the page  [*]
>> proc 1) sync's the fd
>> [*] proc 2 queries for the page at either of these points
>> In 7.4, there is no bgwriter or background process mucking with cache,
>
> Except for the checkpoint process, which does exactly the same as the
> bgwriter does, and ALL concurrent backends whenever they feel the need
> to evict a dirty buffer.

Hrm...  well, haven't the slightest idea what would be causing this
then.  About all I can say is that some problem does exist in HEAD that
doesn't exist in REL7_4 that I'm able to tickle via temp tables.  :-/
Because this is time sensitive, what debugging foo could I insert to
get some useful diagnostic output?

> If it makes a difference if a pg_class page is dirty in the buffer or
> copied out to disk with respect to visibility rules of the tuples
> contained in it, then the whole thing is a way larger bug than the one
> in MIB. First of all, committed or not, a temp object from one session
> should NEVER be visible in any other.

Hrm...  well, I'm going to take my test scripts and reduce them down to
a test case.  For sure, there's something different in HEAD that's
causing problems that are time sensitive.  I've even thought about
grabbing my camera and making a low res 320x200 movie of the test
sequence.  I went and ran script(1) on one of the runs for the sake of
something.

http://sean.chittenden.org/postgresql/pgsql-create-temp-bug-
typescript.txt

Any help or assistance is greatly appreciated, I'm not sure where to go
from here.  -sc

--
Sean Chittenden

Re: [GENERAL] cache lookup of relation 165058647 failed

From

Jan Wieck

Date:

06 May 2004, 20:12:18

Sean Chittenden wrote:
>> I'v find out that this error occurs in:
>>  dependency.c file
>>
>> 2004-04-26 11:09:34 ERROR:  dependency.c 1621: cache lookup of relation
>> 149064743 failed
>> 2004-04-26 11:09:34 ERROR:  Relation "tmp_table1" does not exist
>> 2004-04-26 11:09:34 ERROR:  Relation "tmp_table1" does not exist
>>
>> in getRelationDescription(StringInfo buffer, Oid relid) function.
>>
>> Any ideas what can cause this errors.
>
> <aol>Me too.</aol>
>
> But, I am suspecting that it's a race condition with the new background
> writer code.  I've started testing a new database design and was able
> to reproduce this on my laptop nearly 90% of the time, but could only
> reproduce it about 10% of the time on my production databases until I
> figured out what the difference was, fsync.

temp tables don't use the shared buffer cache, how can this be related
to the BG writer?


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

Re: [GENERAL] cache lookup of relation 165058647 failed

From

Jan Wieck

Date:

06 May 2004, 20:24:34

Sean Chittenden wrote:

>>>> I'v find out that this error occurs in:
>>>>  dependency.c file
>>>>
>>>> 2004-04-26 11:09:34 ERROR:  dependency.c 1621: cache lookup of
>>>> relation
>>>> 149064743 failed
>>>> 2004-04-26 11:09:34 ERROR:  Relation "tmp_table1" does not exist
>>>> 2004-04-26 11:09:34 ERROR:  Relation "tmp_table1" does not exist
>>>>
>>>> in getRelationDescription(StringInfo buffer, Oid relid) function.
>>>>
>>>> Any ideas what can cause this errors.
>>> <aol>Me too.</aol>
>>> But, I am suspecting that it's a race condition with the new
>>> background writer code.  I've started testing a new database design
>>> and was able to reproduce this on my laptop nearly 90% of the time,
>>> but could only reproduce it about 10% of the time on my production
>>> databases until I figured out what the difference was, fsync.
>>
>> temp tables don't use the shared buffer cache, how can this be related
>> to the BG writer?
>
> Don't the system catalogs use the shared buffer cache?
>
> BEGIN;
> SELECT create_temp_table_func();  -- Inserts a row into pg_class via
> CREATE TEMP TABLE
> -- Do other stuff
> COMMIT;              -- After the commit, the row is now visible to other
> backends
> -- disconnect      -- If the delay between the disconnect and reconnect is
> small enough
> -- reconnect        -- It's as though there is a race condition that allows
> the function
>                 -- pg_table_is_visible() to assert the "cache lookup of relation"
>                 -- error.
> BEGIN;
> SELECT create_temp_table_func();  -- Before the CREATE TEMP TABLE, I
> call
>                              /* SELECT TRUE FROM pg_catalog.pg_class c
>                                 LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
>                                 WHERE c.relname = ''footmp''::TEXT AND
>                                 c.relkind = ''r''::TEXT AND
>                                 pg_catalog.pg_table_is_visible(c.oid); */
>                             -- But the query fails
>
> My guess was that the series of events went something like:
>
> proc 0) COMMIT's and the row in pg_class is committed
> proc 1) bgwriter writer code removes a page for the cache
> proc 2) queries for the page  [*]
> proc 1) writes it to disk
> proc 2) queries for the page  [*]
> proc 1) sync's the fd
>
> [*] proc 2 queries for the page at either of these points
>
> In 7.4, there is no bgwriter or background process mucking with cache,

Except for the checkpoint process, which does exactly the same as the
bgwriter does, and ALL concurrent backends whenever they feel the need
to evict a dirty buffer.

If it makes a difference if a pg_class page is dirty in the buffer or
copied out to disk with respect to visibility rules of the tuples
contained in it, then the whole thing is a way larger bug than the one
in MIB. First of all, committed or not, a temp object from one session
should NEVER be visible in any other.


Jan

> which is why this works 100% of the time.  In 7.5, however, there's a
> 200ms gap where a race condition appears and pg_table_is_visible()
> fails its PointerIsValid() check.  If I put a sleep in, the sleep gives
> the bgwriter enough time to commit the pages to disk so that the
> queries for the page happen after the fd's been sync()'ed.
>
> I have no other clue as to why this would be happening though, so
> believe me when I say, I could very well be quite wrong.... but this is
> my best, quasi-educated/grep(1)'ed guess.
>
> -sc
>


--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #