Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used - Mailing list pgsql-bugs
From | Alexander Lakhin |
---|---|
Subject | Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used |
Date | |
Msg-id | 00e24aa6-5d03-99b6-ab41-495a35fc93f6@gmail.com Whole thread Raw |
In response to | Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
|
List | pgsql-bugs |
Hello Tom, Thank you for your considerations! 21.07.2023 20:20, Tom Lane wrote: > Alexander Lakhin <exclusion@gmail.com> writes: >> I think that we need to determine the level where the problem that should >> be fixed is: >> 1) test xmlmap fails sporadically due to the catalog changes caused by >> parallel tests activity >> 2) schema_to_xmlschemaX() can fail when parallel workers are used >> 3) has_table_privilegeX() can fail sporadically when executed within a >> parallel worker >> 4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot, >> when repeated in a parallel worker > Yeah, that's not immediately obvious. IIUC, the situation we are > looking at is that SearchSysCacheExists can succeed even though the > tuple we found is already dead at the instant that the function > exits (thanks to absorption of inval messages during relation_close). > The fact that that only happens in parallel workers is pure chance > really. It is not okay for has_table_privilegeX to depend on the > fact that the surrounding query already has some lock on pg_class. > So this means that the approach has_table_privilegeX uses of > assuming that successful SearchSysCacheExists means it can call > pg_class_aclcheck without fear is just broken. > > If we suppose that that assumption is only being made in the > has_foo_privilege functions, then one way we could fix it is to extend > the API of pg_class_aclcheck etc to add a no-error-on-not-found flag, > and get rid of the separate SearchSysCacheExists check. However, > I can't avoid the suspicion that we have other places assuming the > same thing. Running through SearchSysCacheExistsX() calls, I've found an interesting (and optimistic) comment in src/backend/catalog/namespace.c: * ... The underlying IsVisible functions * always use an up-to-date snapshot and so might see the object as already * gone when it's still visible to the transaction snapshot. (There is no race * condition in the current coding because we don't accept sinval messages * between the SearchSysCacheExists test and the subsequent lookup.) */ Datum pg_table_is_visible(PG_FUNCTION_ARGS) { Oid oid = PG_GETARG_OID(0); if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(oid))) PG_RETURN_NULL(); PG_RETURN_BOOL(RelationIsVisible(oid)); } ... bool RelationIsVisible(Oid relid) { HeapTuple reltup; Form_pg_class relform; Oid relnamespace; bool visible; reltup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid)); if (!HeapTupleIsValid(reltup)) elog(ERROR, "cache lookup failed for relation %u", relid); So that's exactly the same coding pattern, thus fixing pg_class_aclcheck is not an option, apparently. > So I think what we really ought to be doing is one > of two things: > > 1. Hack SearchSysCacheExists to account for this issue, by making it > loop if it finds a syscache entry but sees that the entry is already > dead. (We have to loop, not just return false, in case the row was > updated rather than deleted.) Maybe all the syscache lookup > functions need to do likewise; it's certainly not intuitively > reasonable for them to return already-known-stale entries. > > 2. Figure out how come we are executing a cache inval on the way > out of syscache entry creation, and stop that from happening. I wrote about LockRelationOid() before, and I still think that invalidation messages are processed in that call (reached via SearchCatCacheMiss() -> systable_beginscan() -> index_open() -> relation_open(), not via relation_close()). So it seems that SearchSysCacheX calls find an entry, that is not dead, but that entry can be dead (not found) for the next call if invalidation messages are processed. > I like #2 better if it's not hard to do cleanly. However, I'm not > quite sure how we are getting to an inval during relation close; > maybe that's not something we want to prevent. Yes, there is a detailed comment in LockRelationOid(), that explains why AcceptInvalidationMessages() is called. (I've tried to remove that call now, just for testing, and get 6 tests failed during `make check`.) Best regards, Alexander
pgsql-bugs by date: