Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS |
Date | |
Msg-id | 20141215154929.GI5023@alap3.anarazel.de Whole thread Raw |
In response to | Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS
|
List | pgsql-hackers |
Hi, On 2014-12-15 10:15:30 -0500, Tom Lane wrote: > The CLOBBER_CACHE_ALWAYS buildfarm members occasionally fail in > contrib/test_decoding with > > TRAP: FailedAssertion("!(((bool)((relation)->rd_refcnt == 0)))", File: "relcache.c", Line: 1981) > > for example here: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-14%2005%3A50%3A09 > and here: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-13%2021%3A26%3A05 Yes, I've been trying to debug that. I've finally managed to reproduce locally. Unfortunately it's not directly reproducible on my laptop... A fair amount of tinkering later I've found out that it's not actually CLOBBER_CACHE_ALWAYS itself that triggers the problem but catchup interrupts. It triggers with CLOBBER_CACHE_ALWAYS because that happens to make parts of the code slow enough to not reach a ordinary AcceptInvalidationMessages() in time. The problem is that in the added regression test there can be a situation where there's a relcache entry that's *not* currently visible but still referenced. Consider: SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding'); CREATE TABLE somechange(id serial primary key); INSERT INTO changeresult SELECT data FROM pg_logical_slot_get_changes(...); While get_changes stores it data into a tuplestore there can be a moment where 'changeresult' has a refcnt > 0 due to the INSERT, but is invisible because we're using a historic snapshot from when changeresult doesn't exist. Without catchup invalidations and/or an outside reference to a relation that's normally not a problem because it won't get reloaded from the caches at an inappropriate time while invisible. Until a few weeks ago there was no regression test covering that case which is why these crashes are only there now. It triggers via: RelationCacheInvalidate() -> RelationClearRelation(relation, true) -> /* Build temporary entry, but don't link it intohashtable */ newrel = RelationBuildDesc(save_relid, false); if (newrel == NULL) { /* Shouldonly get here if relation was deleted */ RelationCacheDelete(relation); RelationDestroyRelation(relation,false); elog(ERROR, "relation %u deleted while still in use", save_relid); } That path is only hit while refcnt > 0 RelationDestroyRelation() has Assert(RelationHasReferenceCountZero(relation)); So that path doesn't really work... Without assertions we'd "just" get a transient ERROR. I think that path should generally loose the RelationDestroyRelation() - if it's referenced that's surely not the right thing to do. I'm not actually convinced logical decoding is the only way that assert can be reached. Since logical decoding happens inside a subtransaction (when called via SQL) and invalidate caches one relatively straightforward way to fix this would be to add something roughly akin to: Assert(!relation->rd_isvalid) /* * This should only happen whenwe're doing logical decoding where * it can happen when [above]. */ if (HistoricSnapshotActive()) return; There's no chance that such a entry could survive too long in an invalid state (minus preexisting bugs independent of decoding) since we hit the same paths that subtransaction abort hits. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: