Thread: error: could not find pg_class tuple for index 2662
My client has been seeing regular instances of the following sort of problem:

...
03:06:09.453 exec_simple_query, postgres.c:900
03:06:12.042 XX000: could not find pg_class tuple for index 2662 at character 13
03:06:12.042 RelationReloadIndexInfo, relcache.c:1740
03:06:12.042 INSERT INTO zzz_k(k) SELECT ...
03:06:12.045 00000: statement: ABORT
03:06:12.045 exec_simple_query, postgres.c:900
03:06:12.045 00000: duration: 0.100 ms
03:06:12.045 exec_simple_query, postgres.c:1128
03:06:12.046 00000: statement: INSERT INTO temp_807 VALUES (...)
03:06:12.046 exec_simple_query, postgres.c:900
03:06:12.046 XX000: could not find pg_class tuple for index 2662 at character 13
03:06:12.046 RelationReloadIndexInfo, relcache.c:1740
03:06:12.046 INSERT INTO temp_807 VALUES (...)
03:06:12.096 08P01: unexpected EOF on client connection
03:06:12.096 SocketBackend, postgres.c:348
03:06:12.096 XX000: could not find pg_class tuple for index 2662
03:06:12.096 RelationReloadIndexInfo, relcache.c:1740
03:06:12.121 00000: disconnection: session time: 0:06:08.537 user=ZZZ database=ZZZ_01
03:06:12.121 log_disconnections, postgres.c:4339

The above happens regularly (but not completely predictably), coinciding with a daily cronjob that checks the catalogs for bloat and does vacuum full and/or reindex as needed. Since some of the applications make very heavy use of temp tables, this usually means pg_class and pg_index get a vacuum full and reindex.

Sometimes queries will fail due to being unable to open a table's backing file. On investigation the file is absent from both the catalogs and the filesystem, so I don't know what table it refers to:

20:41:19.063 ERROR: could not open file "pg_tblspc/16401/PG_9.0_201008051/16413/1049145092": No such file or directory
20:41:19.063 STATEMENT: insert into r_ar__30 select aid, mid, pid, sum(wdata) as wdata, ...
20:41:19.430 ERROR: could not open file "pg_tblspc/16401/PG_9.0_201008051/16413/1049145092": No such file or directory
20:41:19.430 STATEMENT: SELECT nextval('j_id_seq')

Finally, I have seen several instances of failure to read data by vacuum full itself:

03:05:45.699 00000: statement: vacuum full pg_catalog.pg_index;
03:05:45.699 exec_simple_query, postgres.c:900
03:05:46.142 XX001: could not read block 65 in file "pg_tblspc/16401/PG_9.0_201008051/16416/1049146489": read only 0 of 8192 bytes
03:05:46.142 mdread, md.c:656
03:05:46.142 vacuum full pg_catalog.pg_index;

This occurs on postgresql 9.0.4 on 32 core 512GB Dell boxes. We have identical systems still running 8.4.8 that do not have this issue, so I'm assuming it is related to the vacuum full work done for 9.0. Oddly, we don't see this on the smaller hosts (8 core, 64GB, slower cpus) running 9.0.4, so it may be timing related.

This seems possibly related to the issues in:

Bizarre buildfarm failure on baiji: can't find pg_class_oid_index
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02038.php

Broken HOT chains in system catalogs
http://archives.postgresql.org/pgsql-hackers/2011-04/msg00777.php

As far as I can tell from the logs I have, once a session sees one of these errors any subsequent query will hit it again until the session exits. However, it does not seem to harm other sessions or leave any persistent damage (crossing fingers and hoping here).

I'm ready to do any testing/investigation/instrumented builds etc that may be helpful in resolving this.

Regards

-dg

--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
On Wed, Jul 27, 2011 at 8:28 PM, daveg <daveg@sonic.net> wrote: > My client has been seeing regular instances of the following sort of problem: On what version of PostgreSQL? > If simplicity worked, the world would be overrun with insects. I thought it was... :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 09:46:41AM -0400, Robert Haas wrote: > On Wed, Jul 27, 2011 at 8:28 PM, daveg <daveg@sonic.net> wrote: > > My client has been seeing regular instances of the following sort of problem: > On what version of PostgreSQL? 9.0.4. I previously said: > > This occurs on postgresql 9.0.4. on 32 core 512GB Dell boxes. We have > > identical systems still running 8.4.8 that do not have this issue, so I'm > > assuming it is related to the vacuum full work done for 9.0. Oddly, we don't > > see this on the smaller hosts (8 core, 64GB, slower cpus) running 9.0.4, > > so it may be timing related. -dg -- David Gould daveg@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects.
On Thu, Jul 28, 2011 at 5:46 PM, daveg <daveg@sonic.net> wrote: > On Thu, Jul 28, 2011 at 09:46:41AM -0400, Robert Haas wrote: >> On Wed, Jul 27, 2011 at 8:28 PM, daveg <daveg@sonic.net> wrote: >> > My client has been seeing regular instances of the following sort of problem: >> On what version of PostgreSQL? > > 9.0.4. > > I previously said: >> > This occurs on postgresql 9.0.4. on 32 core 512GB Dell boxes. We have >> > identical systems still running 8.4.8 that do not have this issue, so I'm >> > assuming it is related to the vacuum full work done for 9.0. Oddly, we don't >> > see this on the smaller hosts (8 core, 64GB, slower cpus) running 9.0.4, >> > so it may be timing related. Ah, OK, sorry. Well, in 9.0, VACUUM FULL is basically CLUSTER, which means that a REINDEX is happening as part of the same operation. In 9.0, there's no point in doing VACUUM FULL immediately followed by REINDEX. My guess is that this is happening either right around the time the VACUUM FULL commits or right around the time the REINDEX commits. It'd be helpful to know which, if you can figure it out. If there's not a hardware problem causing those read errors, maybe a backend is somehow ending up with a stale or invalid relcache entry. I'm not sure exactly how that could be happening, though... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote: > On Thu, Jul 28, 2011 at 5:46 PM, daveg <daveg@sonic.net> wrote: > > On Thu, Jul 28, 2011 at 09:46:41AM -0400, Robert Haas wrote: > >> On Wed, Jul 27, 2011 at 8:28 PM, daveg <daveg@sonic.net> wrote: > >> > My client has been seeing regular instances of the following sort of problem: > >> On what version of PostgreSQL? > > > > 9.0.4. > > > > I previously said: > >> > This occurs on postgresql 9.0.4. on 32 core 512GB Dell boxes. We have > >> > identical systems still running 8.4.8 that do not have this issue, so I'm > >> > assuming it is related to the vacuum full work done for 9.0. Oddly, we don't > >> > see this on the smaller hosts (8 core, 64GB, slower cpus) running 9.0.4, > >> > so it may be timing related. > > Ah, OK, sorry. Well, in 9.0, VACUUM FULL is basically CLUSTER, which > means that a REINDEX is happening as part of the same operation. In > 9.0, there's no point in doing VACUUM FULL immediately followed by > REINDEX. My guess is that this is happening either right around the > time the VACUUM FULL commits or right around the time the REINDEX > commits. It'd be helpful to know which, if you can figure it out. I'll update my vacuum script to skip reindexes after vacuum full for 9.0 servers and see if that makes the problem go away. Thanks for reminding me that they are not needed. However, I suspect it is the vacuum, not the reindex causing the problem. I'll update when I know. > If there's not a hardware problem causing those read errors, maybe a > backend is somehow ending up with a stale or invalid relcache entry. > I'm not sure exactly how that could be happening, though... It does not appear to be a hardware problem. I also suspect it is a stale relcache. -dg -- David Gould daveg@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects.
daveg <daveg@sonic.net> writes: > On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote: >> Ah, OK, sorry. Well, in 9.0, VACUUM FULL is basically CLUSTER, which >> means that a REINDEX is happening as part of the same operation. In >> 9.0, there's no point in doing VACUUM FULL immediately followed by >> REINDEX. My guess is that this is happening either right around the >> time the VACUUM FULL commits or right around the time the REINDEX >> commits. It'd be helpful to know which, if you can figure it out. > I'll update my vacuum script to skip reindexes after vacuum full for 9.0 > servers and see if that makes the problem go away. The thing that was bizarre about the one instance in the buildfarm was that the error was persistent, ie, once a session had failed all its subsequent attempts to access pg_class failed too. I gather from Dave's description that it's working that way for him too. I can think of ways that there might be a transient race condition against a REINDEX, but it's very unclear why the failure would persist across multiple attempts. The best idea I can come up with is that the session has somehow cached a wrong commit status for the reindexing transaction, causing it to believe that both old and new copies of the index's pg_class row are dead ... but how could that happen? The underlying state in the catalog is not wrong, because no concurrent sessions are upset (at least not in the buildfarm case ... Dave, do you see more than one session doing this at a time?). regards, tom lane
On Fri, Jul 29, 2011 at 9:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > daveg <daveg@sonic.net> writes: >> On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote: >>> Ah, OK, sorry. Well, in 9.0, VACUUM FULL is basically CLUSTER, which >>> means that a REINDEX is happening as part of the same operation. In >>> 9.0, there's no point in doing VACUUM FULL immediately followed by >>> REINDEX. My guess is that this is happening either right around the >>> time the VACUUM FULL commits or right around the time the REINDEX >>> commits. It'd be helpful to know which, if you can figure it out. > >> I'll update my vacuum script to skip reindexes after vacuum full for 9.0 >> servers and see if that makes the problem go away. > > The thing that was bizarre about the one instance in the buildfarm was > that the error was persistent, ie, once a session had failed all its > subsequent attempts to access pg_class failed too. I gather from Dave's > description that it's working that way for him too. I can think of ways > that there might be a transient race condition against a REINDEX, but > it's very unclear why the failure would persist across multiple > attempts. The best idea I can come up with is that the session has > somehow cached a wrong commit status for the reindexing transaction, > causing it to believe that both old and new copies of the index's > pg_class row are dead ... but how could that happen? The underlying > state in the catalog is not wrong, because no concurrent sessions are > upset (at least not in the buildfarm case ... Dave, do you see more than > one session doing this at a time?). I was thinking more along the lines of a failure while processing a sinval message emitted by the REINDEX. The sinval message doesn't get fully processed and therefore we get confused about what the relfilenode is for pg_class. If that happened for any other relation, we could recover by scanning pg_class. 
But if it happens for pg_class or pg_class_oid_index, we're toast, because we can't scan them without knowing what relfilenode to open. Now that can't be quite right, because of course those are mapped relations... and I'm pretty sure there are some other holes in my line of reasoning too. Just thinking out loud... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
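Robert's lost-sinval hypothesis can be sketched with a toy model (Python, purely illustrative; none of these names or mechanics are PostgreSQL internals, and as he notes above the real story must be subtler since pg_class is a mapped relation). The sketch only shows why pg_class is special: a stale cache entry for any other relation can be repaired by rescanning pg_class, but repairing pg_class's own entry requires the very mapping that went stale, so the session stays wedged.

```python
# Toy model of a relcache whose filenode mapping goes stale when an
# invalidation ("sinval") message is lost.  Illustrative only.

PG_CLASS = 1259  # OID of the pg_class catalog itself

class ToyBackend:
    def __init__(self, catalog_filenodes):
        self.disk = dict(catalog_filenodes)      # authoritative relid -> filenode
        self.relcache = dict(catalog_filenodes)  # this backend's cached copy

    def apply_sinval(self, relid):
        # a *processed* invalidation refreshes the cached filenode
        self.relcache[relid] = self.disk[relid]

    def lookup_filenode(self, relid):
        if self.relcache[relid] == self.disk[relid]:
            return self.relcache[relid]
        # Stale entry: recover by "rescanning pg_class" -- but that scan
        # itself needs pg_class's filenode from the cache.  If *that*
        # entry is the stale one, recovery is impossible.
        if self.relcache[PG_CLASS] != self.disk[PG_CLASS]:
            raise RuntimeError("could not find pg_class tuple")  # wedged
        self.relcache[relid] = self.disk[relid]
        return self.relcache[relid]

backend = ToyBackend({PG_CLASS: 100, 16413: 200})

# VACUUM FULL rewrites an ordinary table; the sinval message is lost.
backend.disk[16413] = 201
assert backend.lookup_filenode(16413) == 201  # recovered via pg_class scan

# VACUUM FULL rewrites pg_class itself; the sinval message is lost.
backend.disk[PG_CLASS] = 101
try:
    backend.lookup_filenode(PG_CLASS)
except RuntimeError as e:
    print(e)  # every subsequent lookup fails the same way
```

Nothing in the model ever refreshes the wedged entry, which matches the reported symptom that the error persists until the session exits.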
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jul 29, 2011 at 9:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The thing that was bizarre about the one instance in the buildfarm was >> that the error was persistent, ie, once a session had failed all its >> subsequent attempts to access pg_class failed too. > I was thinking more along the lines of a failure while processing a > sinval message emitted by the REINDEX. The sinval message doesn't get > fully processed and therefore we get confused about what the > relfilenode is for pg_class. If that happened for any other relation, > we could recover by scanning pg_class. But if it happens for pg_class > or pg_class_oid_index, we're toast, because we can't scan them without > knowing what relfilenode to open. Well, no, because the ScanPgRelation call is not failing internally. It's performing a seqscan of pg_class and not finding a matching tuple. You could hypothesize about maybe an sinval message got missed leading us to scan the old (pre-VAC-FULL) copy of pg_class, but that still wouldn't explain how come we can't find a valid-looking entry for pg_class_oid_index in it. Tis a puzzlement. regards, tom lane
On Fri, Jul 29, 2011 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Jul 29, 2011 at 9:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> The thing that was bizarre about the one instance in the buildfarm was >>> that the error was persistent, ie, once a session had failed all its >>> subsequent attempts to access pg_class failed too. > >> I was thinking more along the lines of a failure while processing a >> sinval message emitted by the REINDEX. The sinval message doesn't get >> fully processed and therefore we get confused about what the >> relfilenode is for pg_class. If that happened for any other relation, >> we could recover by scanning pg_class. But if it happens for pg_class >> or pg_class_oid_index, we're toast, because we can't scan them without >> knowing what relfilenode to open. > > Well, no, because the ScanPgRelation call is not failing internally. > It's performing a seqscan of pg_class and not finding a matching tuple. SnapshotNow race? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jul 29, 2011 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, no, because the ScanPgRelation call is not failing internally. >> It's performing a seqscan of pg_class and not finding a matching tuple. > SnapshotNow race? That's what I would have guessed to start with, except that then it wouldn't be a persistent failure. You'd get one, but then the next query would succeed. regards, tom lane
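The transience Tom describes can be sketched with a toy version of the classic SnapshotNow hazard (Python, illustrative only; the visibility rules are simplified stand-ins). Because each row version's visibility is judged at the instant it is visited rather than against a fixed snapshot, one scan can miss both the old and the new version of a row being updated, but only during the update window; a retry afterwards finds the new version, which is why a persistent failure is hard to explain this way.

```python
# Toy SnapshotNow hazard: per-tuple, visit-time visibility checks.

committed = set()

def visible(version):
    # xmin must be committed, xmax (if any) must not be -- evaluated
    # at visit time, independently for each tuple.
    if version["xmin"] not in committed:
        return False
    xmax = version.get("xmax")
    return xmax is None or xmax not in committed

def scan(heap, events=None):
    found = []
    for i, version in enumerate(heap):
        if events and i in events:
            events[i]()            # concurrent activity between visits
        if visible(version):
            found.append(version)
    return found

committed.add(1)
old = {"xmin": 1, "xmax": 2, "val": "old"}
new = {"xmin": 2, "val": "new"}
heap = [new, old]   # updated version happens to come first in scan order

# The scan visits `new` before xact 2 commits (invisible), then xact 2
# commits between visits, making `old` invisible too: the row "vanishes".
race = scan(heap, events={1: lambda: committed.add(2)})
assert race == []

# A fresh scan after the commit sees the new version: the failure is
# transient, unlike the persistent errors reported in this thread.
assert scan(heap) == [new]
```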
On Fri, Jul 29, 2011 at 09:55:46AM -0400, Tom Lane wrote:
> The thing that was bizarre about the one instance in the buildfarm was
> that the error was persistent, ie, once a session had failed all its
> subsequent attempts to access pg_class failed too. I gather from Dave's
> description that it's working that way for him too. I can think of ways
> that there might be a transient race condition against a REINDEX, but
> it's very unclear why the failure would persist across multiple
> attempts. The best idea I can come up with is that the session has
> somehow cached a wrong commit status for the reindexing transaction,
> causing it to believe that both old and new copies of the index's
> pg_class row are dead ... but how could that happen? The underlying

It is definitely persistent. Once triggered, the error occurs for any new statement until the session exits.

> state in the catalog is not wrong, because no concurrent sessions are
> upset (at least not in the buildfarm case ... Dave, do you see more than
> one session doing this at a time?).

It looks like it happens to multiple sessions, so far as one can tell from the timestamps of the errors:

timestamp     sessionid      error
------------  -------------  ----------------------------------------------
03:05:37.434  4e26a861.4a6d  could not find pg_class tuple for index 2662
03:05:37.434  4e26a861.4a6f  could not find pg_class tuple for index 2662
03:06:12.041  4e26a731.438e  could not find pg_class tuple for index 2662
03:06:12.042  4e21b6a3.629b  could not find pg_class tuple for index 2662
03:06:12.042  4e26a723.42ec  could not find pg_class tuple for index 2662 at character 13

-dg

--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
On Thu, Jul 28, 2011 at 11:31:31PM -0700, daveg wrote:
> On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote:
> > REINDEX. My guess is that this is happening either right around the
> > time the VACUUM FULL commits or right around the time the REINDEX
> > commits. It'd be helpful to know which, if you can figure it out.
>
> I'll update my vacuum script to skip reindexes after vacuum full for 9.0
> servers and see if that makes the problem go away. Thanks for reminding
> me that they are not needed. However, I suspect it is the vacuum, not the
> reindex causing the problem. I'll update when I know.

Here is the update: the problem happens with vacuum full alone; no reindex is needed to trigger it. I updated the script to avoid reindexing after vacuum. Over the past two days there are still many occurrences of this error coincident with the vacuum.

-dg

--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
daveg <daveg@sonic.net> writes:
> Here is the update: the problem happens with vacuum full alone, no reindex
> is needed to trigger it. I updated the script to avoid reindexing after
> vacuum. Over the past two days there are still many occurrences of this
> error coincident with the vacuum.

Well, that jives with the assumption that the one case we saw in the buildfarm was the same thing, because the regression tests were certainly only doing a VACUUM FULL and not a REINDEX of pg_class. But it doesn't get us much closer to understanding what's happening. In particular, it seems to knock out most ideas associated with race conditions, because the VAC FULL should hold exclusive lock on pg_class until it's completely done (including index rebuilds).

I think we need to start adding some instrumentation so we can get a better handle on what's going on in your database. If I were to send you a source-code patch for the server that adds some more logging printout when this happens, would you be willing/able to run a patched build on your machine?

(BTW, just to be perfectly clear ... the "could not find pg_class tuple" errors always mention index 2662, right, never any other number?)

regards, tom lane
On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote:
> daveg <daveg@sonic.net> writes:
> > Here is the update: the problem happens with vacuum full alone, no reindex
> > is needed to trigger it. I updated the script to avoid reindexing after
> > vacuum. Over the past two days there are still many occurrences of this
> > error coincident with the vacuum.
>
> Well, that jives with the assumption that the one case we saw in
> the buildfarm was the same thing, because the regression tests were
> certainly only doing a VACUUM FULL and not a REINDEX of pg_class.
> But it doesn't get us much closer to understanding what's happening.
> In particular, it seems to knock out most ideas associated with race
> conditions, because the VAC FULL should hold exclusive lock on pg_class
> until it's completely done (including index rebuilds).
>
> I think we need to start adding some instrumentation so we can get a
> better handle on what's going on in your database. If I were to send
> you a source-code patch for the server that adds some more logging
> printout when this happens, would you be willing/able to run a patched
> build on your machine?

Yes, we can run an instrumented server so long as the instrumentation does not interfere with normal operation. However, scheduling downtime to switch binaries is difficult and generally needs to happen on a weekend, but sometimes can be expedited. I'll look into that.

> (BTW, just to be perfectly clear ... the "could not find pg_class tuple"
> errors always mention index 2662, right, never any other number?)

Yes, only index 2662, never any other.

I'm attaching a somewhat redacted log for two different databases on the same instance around the time of the vacuum full of pg_class in each database. My observations so far are:

- the error occurs at commit of vacuum full of pg_class
- in these cases the error hits autovacuum after it waited for a lock on pg_class
- in these two cases there was a new process startup while the vacuum was running; don't know if this is relevant
- while these hit autovacuum, the error does hit other processes (just not in these sessions); unknown if autovacuum is a required component

-dg

--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
Attachment
daveg <daveg@sonic.net> writes: > On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote: >> I think we need to start adding some instrumentation so we can get a >> better handle on what's going on in your database. If I were to send >> you a source-code patch for the server that adds some more logging >> printout when this happens, would you be willing/able to run a patched >> build on your machine? > Yes we can run an instrumented server so long as the instrumentation does > not interfere with normal operation. However, scheduling downtime to switch > binaries is difficult, and generally needs to be happen on a weekend, but > sometimes can be expedited. I'll look into that. OK, attached is a patch against 9.0 branch that will re-scan pg_class after a failure of this sort occurs, and log what it sees in the tuple header fields for each tuple for the target index. This should give us some useful information. It might be worthwhile for you to also log the results of select relname,pg_relation_filenode(oid) from pg_class where relname like 'pg_class%'; in your script that does VACUUM FULL, just before and after each time it vacuums pg_class. That will help in interpreting the relfilenodes in the log output. > My observations so far are: > - the error occurs at commit of vacuum full of pg_class > - in these cases error hits autovacuum after it waited for a lock on pg_class > - in these two cases there was a new process startup while the vacuum was > running. Don't know if this is relevant. Interesting. We'll want to know whether that happens every time. > - while these hit autovacuum, the error does hit other processs (just not in > these sessions). Unknown if autovacuum is a required component. Good question. Please consider setting log_autovacuum_min_duration = 0 so that the log also traces all autovacuum activity. 
regards, tom lane

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 81cea8b60406dffa7b5d278697ab5ad6cef6a3d8..d7957ba4e0055a4f9f7721e06b72cbeb0ddb350e 100644
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
***************
*** 32,37 ****
--- 32,38 ----
  #include "access/genam.h"
  #include "access/reloptions.h"
+ #include "access/relscan.h"
  #include "access/sysattr.h"
  #include "access/transam.h"
  #include "access/xact.h"
***************
*** 64,69 ****
--- 65,71 ----
  #include "optimizer/var.h"
  #include "rewrite/rewriteDefine.h"
  #include "storage/fd.h"
+ #include "storage/bufmgr.h"
  #include "storage/lmgr.h"
  #include "storage/smgr.h"
  #include "utils/array.h"
*************** ScanPgRelation(Oid targetRelId, bool ind
*** 310,315 ****
--- 312,379 ----
  }
  
+ /*
+  * ScanPgRelationDetailed
+  *
+  * Try to figure out why we failed to locate row for relation.
+  */
+ static HeapTuple
+ ScanPgRelationDetailed(Oid targetRelId)
+ {
+     HeapTuple   pg_class_tuple;
+     Relation    pg_class_desc;
+     SysScanDesc pg_class_scan;
+     ScanKeyData key[1];
+ 
+     /*
+      * form a scan key
+      */
+     ScanKeyInit(&key[0],
+                 ObjectIdAttributeNumber,
+                 BTEqualStrategyNumber, F_OIDEQ,
+                 ObjectIdGetDatum(targetRelId));
+ 
+     /*
+      * Open pg_class and fetch tuples, forcing heap scan and disabling
+      * visibility checks.
+      */
+     pg_class_desc = heap_open(RelationRelationId, AccessShareLock);
+     pg_class_scan = systable_beginscan(pg_class_desc, ClassOidIndexId,
+                                        false,
+                                        SnapshotAny,
+                                        1, key);
+ 
+     while (HeapTupleIsValid((pg_class_tuple = systable_getnext(pg_class_scan))))
+     {
+         Buffer      buf = pg_class_scan->scan->rs_cbuf;
+         bool        valid;
+ 
+         /* need buffer lock to call HeapTupleSatisfiesVisibility */
+         LockBuffer(buf, BUFFER_LOCK_SHARE);
+         valid = HeapTupleSatisfiesVisibility(pg_class_tuple,
+                                              SnapshotNow,
+                                              buf);
+         LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ 
+         elog(LOG, "searching %u for pg_class tuple for index %u: found ctid (%u,%u), xmin %u, xmax %u, flags 0x%4x 0x%4x, valid %d",
+              pg_class_desc->rd_node.relNode,
+              targetRelId,
+              ItemPointerGetBlockNumber(&(pg_class_tuple->t_self)),
+              ItemPointerGetOffsetNumber(&(pg_class_tuple->t_self)),
+              HeapTupleHeaderGetXmin(pg_class_tuple->t_data),
+              HeapTupleHeaderGetXmax(pg_class_tuple->t_data),
+              pg_class_tuple->t_data->t_infomask,
+              pg_class_tuple->t_data->t_infomask2,
+              valid);
+     }
+ 
+     /* all done */
+     systable_endscan(pg_class_scan);
+     heap_close(pg_class_desc, AccessShareLock);
+ 
+     return NULL;
+ }
+ 
  /*
   * AllocateRelationDesc
   *
   * This is used to allocate memory for a new relation descriptor
*************** RelationReloadIndexInfo(Relation relatio
*** 1737,1744 ****
      indexOK = (RelationGetRelid(relation) != ClassOidIndexId);
      pg_class_tuple = ScanPgRelation(RelationGetRelid(relation), indexOK);
      if (!HeapTupleIsValid(pg_class_tuple))
!         elog(ERROR, "could not find pg_class tuple for index %u",
!              RelationGetRelid(relation));
      relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
      memcpy(relation->rd_rel, relp, CLASS_TUPLE_SIZE);
      /* Reload reloptions in case they changed */
--- 1801,1815 ----
      indexOK = (RelationGetRelid(relation) != ClassOidIndexId);
      pg_class_tuple = ScanPgRelation(RelationGetRelid(relation), indexOK);
      if (!HeapTupleIsValid(pg_class_tuple))
!     {
!         pg_class_tuple = ScanPgRelationDetailed(RelationGetRelid(relation));
!         if (!HeapTupleIsValid(pg_class_tuple))
!             elog(ERROR, "could not find pg_class tuple for index %u",
!                  RelationGetRelid(relation));
!         else
!             elog(LOG, "could not find pg_class tuple for index %u, but succeeded on second try",
!                  RelationGetRelid(relation));
!     }
      relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
      memcpy(relation->rd_rel, relp, CLASS_TUPLE_SIZE);
      /* Reload reloptions in case they changed */
On Mon, Aug 01, 2011 at 01:23:49PM -0400, Tom Lane wrote: > daveg <daveg@sonic.net> writes: > > On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote: > >> I think we need to start adding some instrumentation so we can get a > >> better handle on what's going on in your database. If I were to send > >> you a source-code patch for the server that adds some more logging > >> printout when this happens, would you be willing/able to run a patched > >> build on your machine? > > > Yes we can run an instrumented server so long as the instrumentation does > > not interfere with normal operation. However, scheduling downtime to switch > > binaries is difficult, and generally needs to be happen on a weekend, but > > sometimes can be expedited. I'll look into that. > > OK, attached is a patch against 9.0 branch that will re-scan pg_class > after a failure of this sort occurs, and log what it sees in the tuple > header fields for each tuple for the target index. This should give us > some useful information. It might be worthwhile for you to also log the > results of > > select relname,pg_relation_filenode(oid) from pg_class > where relname like 'pg_class%'; > > in your script that does VACUUM FULL, just before and after each time it > vacuums pg_class. That will help in interpreting the relfilenodes in > the log output. We have installed the patch and have encountered the error as usual. However there is no additional output from the patch. I'm speculating that the pg_class scan in ScanPgRelationDetailed() fails to return tuples somehow. I have also been trying to trace it further by reading the code, but have not got any solid hypothesis yet. In the absence of any debugging output I've been trying to deduce the call tree leading to the original failure. 
So far it looks like this:

RelationReloadIndexInfo(Relation)                  // Relation is 2662 and !rd_isvalid
  pg_class_tuple = ScanPgRelation(2662, indexOK=false)   // returns NULL
    pg_class_desc = heap_open(1259, ACC_SHARE)
      r = relation_open(1259, ACC_SHARE)           // locks oid, ensures RelationIsValid(r)
        r = RelationIdGetRelation(1259)
          r = RelationIdCacheLookup(1259)          // assume success
          if !rd_isvalid:
            RelationClearRelation(r, true)
              RelationInitPhysicalAddr(r)          // r is pg_class relcache

-dg

--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
daveg <daveg@sonic.net> writes:
> We have installed the patch and have encountered the error as usual.
> However there is no additional output from the patch. I'm speculating
> that the pg_class scan in ScanPgRelationDetailed() fails to return
> tuples somehow.

Evidently not, if it's not logging anything, but now the question is why. One possibility is that for some reason RelationGetNumberOfBlocks is persistently lying about the file size. (We've seen kernel bugs before that resulted in transiently wrong values, so this isn't totally beyond the realm of possibility.) Please try the attached patch, which extends the previous one to add a summary line including the number of blocks physically scanned by the seqscan.

regards, tom lane

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 81cea8b60406dffa7b5d278697ab5ad6cef6a3d8..589513a2d35987b2202ce1162b326db2d358b6fa 100644
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
***************
*** 32,37 ****
--- 32,38 ----
  #include "access/genam.h"
  #include "access/reloptions.h"
+ #include "access/relscan.h"
  #include "access/sysattr.h"
  #include "access/transam.h"
  #include "access/xact.h"
***************
*** 64,69 ****
--- 65,71 ----
  #include "optimizer/var.h"
  #include "rewrite/rewriteDefine.h"
  #include "storage/fd.h"
+ #include "storage/bufmgr.h"
  #include "storage/lmgr.h"
  #include "storage/smgr.h"
  #include "utils/array.h"
*************** ScanPgRelation(Oid targetRelId, bool ind
*** 310,315 ****
--- 312,388 ----
  }
  
+ /*
+  * ScanPgRelationDetailed
+  *
+  * Try to figure out why we failed to locate row for relation.
+  */
+ static HeapTuple
+ ScanPgRelationDetailed(Oid targetRelId)
+ {
+     HeapTuple   pg_class_tuple;
+     Relation    pg_class_desc;
+     SysScanDesc pg_class_scan;
+     ScanKeyData key[1];
+     int         count = 0;
+ 
+     /*
+      * form a scan key
+      */
+     ScanKeyInit(&key[0],
+                 ObjectIdAttributeNumber,
+                 BTEqualStrategyNumber, F_OIDEQ,
+                 ObjectIdGetDatum(targetRelId));
+ 
+     /*
+      * Open pg_class and fetch tuples, forcing heap scan and disabling
+      * visibility checks.
+      */
+     pg_class_desc = heap_open(RelationRelationId, AccessShareLock);
+     pg_class_scan = systable_beginscan(pg_class_desc, ClassOidIndexId,
+                                        false,
+                                        SnapshotAny,
+                                        1, key);
+ 
+     while (HeapTupleIsValid((pg_class_tuple = systable_getnext(pg_class_scan))))
+     {
+         Buffer      buf = pg_class_scan->scan->rs_cbuf;
+         bool        valid;
+ 
+         count++;
+ 
+         /* need buffer lock to call HeapTupleSatisfiesVisibility */
+         LockBuffer(buf, BUFFER_LOCK_SHARE);
+         valid = HeapTupleSatisfiesVisibility(pg_class_tuple,
+                                              SnapshotNow,
+                                              buf);
+         LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ 
+         elog(LOG, "searching %u for pg_class tuple for index %u: found ctid (%u,%u), xmin %u, xmax %u, flags 0x%4x 0x%4x, valid %d",
+              pg_class_desc->rd_node.relNode,
+              targetRelId,
+              ItemPointerGetBlockNumber(&(pg_class_tuple->t_self)),
+              ItemPointerGetOffsetNumber(&(pg_class_tuple->t_self)),
+              HeapTupleHeaderGetXmin(pg_class_tuple->t_data),
+              HeapTupleHeaderGetXmax(pg_class_tuple->t_data),
+              pg_class_tuple->t_data->t_infomask,
+              pg_class_tuple->t_data->t_infomask2,
+              valid);
+     }
+ 
+     elog(LOG, "ScanPgRelationDetailed: found %d tuples with OID %u in %u blocks of filenode %u",
+          count,
+          targetRelId,
+          pg_class_scan->scan->rs_nblocks,
+          pg_class_desc->rd_node.relNode);
+ 
+     /* all done */
+     systable_endscan(pg_class_scan);
+     heap_close(pg_class_desc, AccessShareLock);
+ 
+     return NULL;
+ }
+ 
  /*
   * AllocateRelationDesc
   *
   * This is used to allocate memory for a new relation descriptor
*************** RelationReloadIndexInfo(Relation relatio
*** 1737,1744 ****
      indexOK = (RelationGetRelid(relation) != ClassOidIndexId);
      pg_class_tuple = ScanPgRelation(RelationGetRelid(relation), indexOK);
      if (!HeapTupleIsValid(pg_class_tuple))
!         elog(ERROR, "could not find pg_class tuple for index %u",
!              RelationGetRelid(relation));
      relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
      memcpy(relation->rd_rel, relp, CLASS_TUPLE_SIZE);
      /* Reload reloptions in case they changed */
--- 1810,1824 ----
      indexOK = (RelationGetRelid(relation) != ClassOidIndexId);
      pg_class_tuple = ScanPgRelation(RelationGetRelid(relation), indexOK);
      if (!HeapTupleIsValid(pg_class_tuple))
!     {
!         pg_class_tuple = ScanPgRelationDetailed(RelationGetRelid(relation));
!         if (!HeapTupleIsValid(pg_class_tuple))
!             elog(ERROR, "could not find pg_class tuple for index %u",
!                  RelationGetRelid(relation));
!         else
!             elog(LOG, "could not find pg_class tuple for index %u, but succeeded on second try",
!                  RelationGetRelid(relation));
!     }
      relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
      memcpy(relation->rd_rel, relp, CLASS_TUPLE_SIZE);
      /* Reload reloptions in case they changed */
On Wed, Aug 03, 2011 at 11:18:20AM -0400, Tom Lane wrote:
> Evidently not, if it's not logging anything, but now the question is
> why. One possibility is that for some reason RelationGetNumberOfBlocks
> is persistently lying about the file size. (We've seen kernel bugs
> before that resulted in transiently wrong values, so this isn't totally
> beyond the realm of possibility.) Please try the attached patch, which
> extends the previous one to add a summary line including the number of
> blocks physically scanned by the seqscan.

Ok, I have results from the latest patch and have attached a redacted
server log with the select relfilenode output added inline. This is the
shorter of the logs and shows the sequence pretty clearly. I have
additional logs if wanted.

Summary: the failing process reads 0 rows from 0 blocks from the OLD
relfilenode.

-dg

--
David Gould       daveg@sonic.net       510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
daveg <daveg@sonic.net> writes:
> Summary: the failing process reads 0 rows from 0 blocks from the OLD
> relfilenode.

Hmm. This seems to mean that we're somehow missing a relation mapping
invalidation message, or perhaps not processing it soon enough during
some complex set of invalidations. I did some testing with that in mind
but couldn't reproduce the failure. It'd be awfully nice to get a look
at the call stack when this happens for you ... what OS are you running?

			regards, tom lane
On Thu, Aug 04, 2011 at 12:28:31PM -0400, Tom Lane wrote:
> daveg <daveg@sonic.net> writes:
> > Summary: the failing process reads 0 rows from 0 blocks from the OLD
> > relfilenode.
>
> Hmm. This seems to mean that we're somehow missing a relation mapping
> invalidation message, or perhaps not processing it soon enough during
> some complex set of invalidations. I did some testing with that in mind
> but couldn't reproduce the failure. It'd be awfully nice to get a look
> at the call stack when this happens for you ... what OS are you running?

cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.5 (Tikanga)

Linux version 2.6.18-194.el5

I can use gdb as well if we can get a core or stop the correct process.
Perhaps a long sleep when it hits this? Or perhaps we could log
invalidate processing for pg_class?

-dg

--
David Gould       daveg@sonic.net       510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
On Thu, Aug 04, 2011 at 12:28:31PM -0400, Tom Lane wrote:
> daveg <daveg@sonic.net> writes:
> > Summary: the failing process reads 0 rows from 0 blocks from the OLD
> > relfilenode.
>
> Hmm. This seems to mean that we're somehow missing a relation mapping
> invalidation message, or perhaps not processing it soon enough during
> some complex set of invalidations. I did some testing with that in mind
> but couldn't reproduce the failure. It'd be awfully nice to get a look
> at the call stack when this happens for you ... what OS are you running?

To recap, a few observations:

- When it happens the victim has recently been waiting on a lock for
  several seconds.
- We create a lot of temp tables, hundreds of thousands a day.
- There are catalog vacuum fulls and reindexes running on 30 odd other
  databases at the same time. The script estimates the amount of bloat
  on each table and index and chooses either reindex on specific
  indexes or vacuum full as needed.
- This is a 32 core (64 with hyperthreading) 512GB host with several
  hundred connections.
- We are seeing "cannot read" and "cannot open" errors too that would
  be consistent with trying to use a vanished file.

-dg

--
David Gould       daveg@sonic.net       510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
daveg <daveg@sonic.net> writes:
> We are seeing "cannot read" and "cannot open" errors too that would be
> consistent with trying to use a vanished file.

Yeah, these all seem consistent with the idea that the failing backend
somehow missed an update for the relation mapping file. You would get
the "could not find pg_class tuple" syndrome if the process was holding
an open file descriptor for the now-deleted file, and otherwise cannot
open/cannot read type errors. And unless it later received another
sinval message for the relation mapping file, the errors would persist.

If this theory is correct then all of the file-related errors ought to
match up to recently-vacuumed mapped catalogs or indexes (those are the
ones with relfilenode = 0 in pg_class). Do you want to expand your
logging of the VACUUM FULL actions and see if you can confirm that idea?

Since the machine is running RHEL, I think we can use glibc's
backtrace() function to get simple stack traces without too much effort.
I'll write and test a patch and send it along in a bit.

			regards, tom lane
On Thu, Aug 04, 2011 at 04:16:08PM -0400, Tom Lane wrote:
> daveg <daveg@sonic.net> writes:
> > We are seeing "cannot read" and "cannot open" errors too that would be
> > consistent with trying to use a vanished file.
>
> Yeah, these all seem consistent with the idea that the failing backend
> somehow missed an update for the relation mapping file. You would get
> the "could not find pg_class tuple" syndrome if the process was holding
> an open file descriptor for the now-deleted file, and otherwise cannot
> open/cannot read type errors. And unless it later received another
> sinval message for the relation mapping file, the errors would persist.
>
> If this theory is correct then all of the file-related errors ought to
> match up to recently-vacuumed mapped catalogs or indexes (those are the
> ones with relfilenode = 0 in pg_class). Do you want to expand your
> logging of the VACUUM FULL actions and see if you can confirm that idea?

At your service, what would you like to see?

> Since the machine is running RHEL, I think we can use glibc's
> backtrace() function to get simple stack traces without too much effort.
> I'll write and test a patch and send it along in a bit.

Great. Any point to try to capture SI events somehow?

-dg

--
David Gould       daveg@sonic.net       510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
daveg <daveg@sonic.net> writes:
> On Thu, Aug 04, 2011 at 04:16:08PM -0400, Tom Lane wrote:
>> If this theory is correct then all of the file-related errors ought to
>> match up to recently-vacuumed mapped catalogs or indexes (those are the
>> ones with relfilenode = 0 in pg_class). Do you want to expand your
>> logging of the VACUUM FULL actions and see if you can confirm that idea?

> At your service, what would you like to see?

I was thinking log the before-and-after filenode values each time you do
a VACUUM FULL, and then go through the logs to see if all the
file-related complaints refer to recently obsoleted filenodes.

			regards, tom lane
Ahh ... you know what, never mind about stack traces, let's just see if
the attached patch doesn't fix it.

I still haven't reproduced the behavior here, but I think I see what
must be happening: we are getting an sinval reset while attempting to
open pg_class_oid_index. The latter condition causes its refcount to be
above 1, which will cause RelationClearRelation to directly call
RelationReloadIndexInfo, which enables the following sequence of calls:
RelationCacheInvalidate -> RelationClearRelation ->
RelationReloadIndexInfo. And the problem is that
RelationCacheInvalidate intentionally forces pg_class_oid_index to be
updated first. That was okay when the code was written, because the
relcache entry for pg_class itself never really needed any updates.
Now it does, so we have to make sure pg_class gets updated first, *then*
pg_class_oid_index (which might result in a seqscan of pg_class), then
everything else (for which we'll try to use pg_class_oid_index to
locate their pg_class tuples).

			regards, tom lane

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 81cea8b..0e4b17c 100644
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
*************** RelationCacheInvalidate(void)
*** 2185,2204 ****
  		{
  			/*
  			 * Add this entry to list of stuff to rebuild in second pass.
! 			 * pg_class_oid_index goes on the front of rebuildFirstList, other
! 			 * nailed indexes on the back, and everything else into
! 			 * rebuildList (in no particular order).
  			 */
! 			if (relation->rd_isnailed &&
! 				relation->rd_rel->relkind == RELKIND_INDEX)
  			{
  				if (RelationGetRelid(relation) == ClassOidIndexId)
- 					rebuildFirstList = lcons(relation, rebuildFirstList);
- 				else
  					rebuildFirstList = lappend(rebuildFirstList, relation);
  			}
  			else
! 				rebuildList = lcons(relation, rebuildList);
  		}
  	}
--- 2185,2207 ----
  		{
  			/*
  			 * Add this entry to list of stuff to rebuild in second pass.
! 			 * pg_class goes on the front of rebuildFirstList,
! 			 * pg_class_oid_index goes to the back of rebuildFirstList, other
! 			 * nailed indexes go on the front of rebuildList, and everything
! 			 * else goes to the back of rebuildList.
  			 */
! 			if (RelationGetRelid(relation) == RelationRelationId)
! 				rebuildFirstList = lcons(relation, rebuildFirstList);
! 			else if (relation->rd_isnailed &&
! 					 relation->rd_rel->relkind == RELKIND_INDEX)
  			{
  				if (RelationGetRelid(relation) == ClassOidIndexId)
  					rebuildFirstList = lappend(rebuildFirstList, relation);
+ 				else
+ 					rebuildList = lcons(relation, rebuildList);
  			}
  			else
! 				rebuildList = lappend(rebuildList, relation);
  		}
  	}
I wrote:
> Ahh ... you know what, never mind about stack traces, let's just see if
> the attached patch doesn't fix it.

On reflection, that patch would only fix the issue for pg_class, and
that's not the only catalog that gets consulted during relcache reloads.
I think we'd better do it as attached, instead.

			regards, tom lane

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 81cea8b..337300f 100644
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
*************** RelationCacheInvalidate(void)
*** 2159,2164 ****
--- 2159,2169 ----
  	List	   *rebuildList = NIL;
  	ListCell   *l;
  
+ 	/*
+ 	 * Reload relation mapping data before starting to reconstruct cache.
+ 	 */
+ 	RelationMapInvalidateAll();
+ 
  	/* Phase 1 */
  	hash_seq_init(&status, RelationIdCache);
*************** RelationCacheInvalidate(void)
*** 2184,2189 ****
--- 2189,2204 ----
  		else
  		{
  			/*
+ 			 * If it's a mapped relation, immediately update its rd_node in
+ 			 * case its relfilenode changed. We must do this during phase 1
+ 			 * in case the relation is consulted during rebuild of other
+ 			 * relcache entries in phase 2. It's safe since consulting the
+ 			 * map doesn't involve any access to relcache entries.
+ 			 */
+ 			if (RelationIsMapped(relation))
+ 				RelationInitPhysicalAddr(relation);
+ 
+ 			/*
  			 * Add this entry to list of stuff to rebuild in second pass.
  			 * pg_class_oid_index goes on the front of rebuildFirstList, other
  			 * nailed indexes on the back, and everything else into
*************** RelationCacheInvalidate(void)
*** 2209,2219 ****
  	 */
  	smgrcloseall();
  
- 	/*
- 	 * Reload relation mapping data before starting to reconstruct cache.
- 	 */
- 	RelationMapInvalidateAll();
- 
  	/* Phase 2: rebuild the items found to need rebuild in phase 1 */
  	foreach(l, rebuildFirstList)
  	{
--- 2224,2229 ----
On Fri, Aug 05, 2011 at 12:10:31PM -0400, Tom Lane wrote:
> I wrote:
> > Ahh ... you know what, never mind about stack traces, let's just see if
> > the attached patch doesn't fix it.
>
> On reflection, that patch would only fix the issue for pg_class, and
> that's not the only catalog that gets consulted during relcache reloads.
> I think we'd better do it as attached, instead.
>
> regards, tom lane

Should this be applied in addition to the earlier patch, or to replace it?

-dg

--
David Gould       daveg@sonic.net       510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
daveg <daveg@sonic.net> writes:
> Should this be applied in addition to the earlier patch, or to replace it?

Apply it instead of the earlier one.

			regards, tom lane
I wrote:
> I still haven't reproduced the behavior here, but I think I see what
> must be happening: we are getting an sinval reset while attempting to
> open pg_class_oid_index.

After a number of false starts, I've managed to reproduce this behavior
locally. The above theory turns out to be wrong, or at least
incomplete. In order to be opening pg_class_oid_index, we must already
have opened and locked pg_class, and we would have absorbed the
relmapping update for pg_class when we did that, so an sinval reset
during a subsequent relation open is too late to provoke the bug.
Rather, the key to the problem is that the sinval reset has to happen
*instead of* seeing the plain relcache inval on pg_class that was
emitted by the "VACUUM FULL pg_class" command. In this case, we enter
RelationCacheInvalidate with a stale value for pg_class's relmapping,
and since that routine is not sufficiently careful about the order in
which it revalidates the nailed-relation cache entries, we may try to
read something from pg_class before we've updated pg_class's
relmapping.

So the reason Dave is seeing the problem a lot must be that he has very
high sinval traffic, leading to lots of resets, and that increases the
probability of a reset replacing just the wrong sinval message.

I can reproduce the problem fairly conveniently with this crude hack:

diff --git a/src/backend/storage/ipc/sinval.c b/src/backend/storage/ipc/sinval.c
index 8499615..5ad2aee 100644
*** a/src/backend/storage/ipc/sinval.c
--- b/src/backend/storage/ipc/sinval.c
*************** ReceiveSharedInvalidMessages(
*** 106,112 ****
  		/* Try to get some more messages */
  		getResult = SIGetDataEntries(messages, MAXINVALMSGS);
  
! 		if (getResult < 0)
  		{
  			/* got a reset message */
  			elog(DEBUG4, "cache state reset");
--- 106,112 ----
  		/* Try to get some more messages */
  		getResult = SIGetDataEntries(messages, MAXINVALMSGS);
  
! 		if (getResult != 0)
  		{
  			/* got a reset message */
  			elog(DEBUG4, "cache state reset");

which forces every occurrence of an incoming sinval message to be
treated as a reset. The serial regression tests still work with that,
but they fall over almost immediately if you run something like this in
parallel:

	while psql -c "vacuum full pg_class" regression; do usleep 100000; done

and the parallel regression tests tend to fall over without any outside
help because of the one occurrence of "vacuum full pg_class" in them.

I'm inclined to think that a robust solution requires both of the ideas
I proposed last week: we should update all the relmapping entries during
pass 1 of RelationCacheInvalidate, *and* make pass 2 do things in a more
robust order. I think pg_class, pg_class_oid_index, other nailed
relations, and then everything else ought to do it.

Anyway, that's easily fixed now that we have a reproducible case.
What's bothering me at the moment is that the CLOBBER_CACHE_ALWAYS hack,
which was meant to expose exactly this sort of problem, failed to do so
--- buildfarm member jaguar has been running with that flag for ages and
never showed this problem. I'm thinking that we should take out the
hack in AcceptInvalidationMessages and instead put in #ifdeffed code
that causes ReceiveSharedInvalidMessages to forcibly always call the
reset function. Any thoughts about that?

			regards, tom lane
On Thu, Aug 11, 2011 at 5:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I can reproduce the problem fairly conveniently with this crude hack:
>
> diff --git a/src/backend/storage/ipc/sinval.c b/src/backend/storage/ipc/sinval.c
> index 8499615..5ad2aee 100644
> *** a/src/backend/storage/ipc/sinval.c
> --- b/src/backend/storage/ipc/sinval.c
> *************** ReceiveSharedInvalidMessages(
> *** 106,112 ****
>   		/* Try to get some more messages */
>   		getResult = SIGetDataEntries(messages, MAXINVALMSGS);
>
> ! 		if (getResult < 0)
>   		{
>   			/* got a reset message */
>   			elog(DEBUG4, "cache state reset");
> --- 106,112 ----
>   		/* Try to get some more messages */
>   		getResult = SIGetDataEntries(messages, MAXINVALMSGS);
>
> ! 		if (getResult != 0)
>   		{
>   			/* got a reset message */
>   			elog(DEBUG4, "cache state reset");
>
> which forces every occurrence of an incoming sinval message to be treated
> as a reset. The serial regression tests still work with that, but they
> fall over almost immediately if you run something like this in parallel:
>
> while psql -c "vacuum full pg_class" regression; do usleep 100000; done
>
> and the parallel regression tests tend to fall over without any outside
> help because of the one occurrence of "vacuum full pg_class" in them.

Nice detective work.

> I'm inclined to think that a robust solution requires both of the ideas
> I proposed last week: we should update all the relmapping entries during
> pass 1 of RelationCacheInvalidate, *and* make pass 2 do things in a more
> robust order. I think pg_class, pg_class_oid_index, other nailed
> relations, and then everything else ought to do it.
>
> Anyway, that's easily fixed now that we have a reproducible case.
> What's bothering me at the moment is that the CLOBBER_CACHE_ALWAYS hack,
> which was meant to expose exactly this sort of problem, failed to do so
> --- buildfarm member jaguar has been running with that flag for ages and
> never showed this problem. I'm thinking that we should take out the
> hack in AcceptInvalidationMessages and instead put in #ifdeffed code
> that causes ReceiveSharedInvalidMessages to forcibly always call the
> reset function. Any thoughts about that?

I'm OK with that, but while we're whacking this around, is there any
chance that we could reduce the number of layers of function calls that
are happening in here? Right now, we call AcceptInvalidationMessages()
...which basically does nothing but call ReceiveSharedInvalidMessages()
...which in turn calls SIGetDataEntries() ...and then, by way of a
function pointer, LocalExecuteInvalidMessage() or perhaps
InvalidateSystemCaches().

It would be nice to move the short-circuit test I recently inserted at
the top of SIGetDataEntries() somewhere higher up in the call stack, but
right now the layers of abstraction are so thick that it's not exactly
clear how to do that. In my ideal world, I'd like to turn
AcceptInvalidationMessages into something like this:

#define AcceptInvalidationMessages() \
	do { if (*flagvar) ReallyDoIt(); } while (0)

...where flagvar is a pointer to myStateP->hasMessages. This isn't
really a huge hotspot any more, but a cycle saved is a cycle earned, and
we call this frequently enough that it seems awfully wasteful to push
and pop three call stack frames every time we check and find no
messages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[adding back hackers so the thread shows the resolution]

On Sun, Aug 14, 2011 at 07:02:55PM -0400, Tom Lane wrote:
> Sounds good. Based on my own testing so far, I think that patch will
> probably make things measurably better for you, though it won't resolve
> every corner case.

The most recent catalog vacuums did vacuum full pg_class in 34 databases
on that instance with no new failures. So it looks like the patch fixes
it. This is not conclusive, but it looks very good so far. I'll send an
update if I see any new errors during the week.

Thanks for your help on this. It looks like it has sent you on a merry
search through all the catcache related program activities. I'm assuming
this patch or some improvement on it will show up in a point release.
Meanwhile, if this works as is for a couple more days we will resume
upgrading the rest of the hosts to 9.0 using this patch.

-dg

--
David Gould       daveg@sonic.net       510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Aug 11, 2011 at 5:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What's bothering me at the moment is that the CLOBBER_CACHE_ALWAYS hack,
>> which was meant to expose exactly this sort of problem, failed to do so
>> --- buildfarm member jaguar has been running with that flag for ages and
>> never showed this problem. I'm thinking that we should take out the
>> hack in AcceptInvalidationMessages and instead put in #ifdeffed code
>> that causes ReceiveSharedInvalidMessages to forcibly always call the
>> reset function. Any thoughts about that?

> I'm OK with that, but while we're whacking this around, is there any
> chance that we could reduce the number of layers of function calls
> that are happening in here?

I'm not excited about that. We have got three layers that each have
significant complexities of their own to deal with. The only way to
reduce the layer count is to make some piece of code deal with more of
those complexities at once. I don't want to go there, especially not in
the name of very marginal cycle-shaving.

> It would be nice to move the short-circuit test I recently inserted at
> the top of SIGetDataEntries() somewhere higher up in the call stack,
> but right now the layers of abstraction are so thick that it's not
> exactly clear how to do that.

Glad you didn't try, because it would be wrong. The fact that there are
no outstanding messages in sinvaladt.c doesn't prove there are none yet
unprocessed inside ReceiveSharedInvalidMessages.

Anyway, to get back to the original point: I'm getting less excited
about redoing the CLOBBER_CACHE_ALWAYS code at a different level. The
only thing that can really promise is that we find places outside the
cache code that are mistakenly holding onto cache entry pointers. It
can't do very much for the sort of problems I've been finding over the
past week, first because you need some other process actively changing
the underlying information (else the use of a stale cache entry isn't a
problem), and second because when you're looking for bugs involving use
of stale cache entries, over-enthusiastic flushing of the cache can
actually mask the bugs.

I'd still like to think of a better test methodology, but I don't think
"force clobbers inside ReceiveSharedInvalidMessages" is it.

			regards, tom lane
On Tue, Aug 16, 2011 at 3:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It would be nice to move the short-circuit test I recently inserted at
>> the top of SIGetDataEntries() somewhere higher up in the call stack,
>> but right now the layers of abstraction are so thick that it's not
>> exactly clear how to do that.
>
> Glad you didn't try, because it would be wrong. The fact that there
> are no outstanding messages in sinvaladt.c doesn't prove there are none
> yet unprocessed inside ReceiveSharedInvalidMessages.

Oh. Good point.

> Anyway, to get back to the original point: I'm getting less excited
> about redoing the CLOBBER_CACHE_ALWAYS code at a different level.
> The only thing that can really promise is that we find places outside
> the cache code that are mistakenly holding onto cache entry pointers.
> It can't do very much for the sort of problems I've been finding over
> the past week, first because you need some other process actively
> changing the underlying information (else the use of a stale cache entry
> isn't a problem), and second because when you're looking for bugs
> involving use of stale cache entries, over-enthusiastic flushing of the
> cache can actually mask the bugs.
>
> I'd still like to think of a better test methodology, but I don't think
> "force clobbers inside ReceiveSharedInvalidMessages" is it.

Makes sense.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company