
Re: "SMgrRelation hashtable corrupted" failure identified - Mailing list pgsql-hackers

From	Marc G. Fournier
Subject	Re: "SMgrRelation hashtable corrupted" failure identified
Date	January 10, 2005 16:37:36
Msg-id	20050110123636.W51884@ganymede.hub.org
In response to	"SMgrRelation hashtable corrupted" failure identified (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: "SMgrRelation hashtable corrupted" failure identified
List	pgsql-hackers


On Mon, 10 Jan 2005, Tom Lane wrote:

> We've seen a few reports of the above-mentioned error message from
> PG 8.0 testers, but up till now no one had come up with a reproducible
> test case.  I've now found a trivial example:
>
> session 1:    create table a1 (f1 varchar(128));
> session 2:    insert into a1 values('abc');
> session 1:    alter table a1 alter column f1 type varchar(256);
> session 2:    insert into a1 values('abcd');
> session 2 fails with ERROR:  SMgrRelation hashtable corrupted
> continued use of session 2 leads to a crash
>
> Many if not all scenarios involving a rewriting ALTER TABLE on a
> table in active use by other backends will fail like this.
> I believe there are probably similar failures involving CLUSTER,
> though a quick try didn't show it.  This seems clearly to be a
> "must fix for 8.0" bug.
>
> The basic problem is that when ALTER TABLE tries to swap the physical
> files associated with the original table and the temp version of the
> table, it sends out relcache inval events for all four combinations
> of table OID and relfilenode.  Because inval.c is a bit cavalier about
> the ordering of inval events, the one that session 2 sees first is the
> one for <temp table OID, old relfilenode>.  It does not find a relcache
> entry for the temp table OID, but it does find an smgr table entry for
> the relfilenode, which it proceeds to drop.  Now there is a dangling
> smgr reference in its relcache, so when it next gets hit with a
> relcache clear event for the original table OID, boom!
>
> I fooled around with trying to patch this by enforcing the "right"
> processing order of inval events, but that doesn't work (it just moves
> the failure into the sending backend, which it turns out would need
> a different processing order to avoid crashing).  It would be a horribly
> fragile solution anyway.
>
> I now think that the only reasonable fix is to directly attack the
> problem of dangling relcache references to smgr table entries.  What we
> can do is add a concept of an "owning pointer" to an smgr entry, that
> is an "SMgrRelation *myowner" field, and have smgrclose do
> something like
>     if (reln->myowner)
>         *(reln->myowner) = NULL;
> For smgr table entries associated with a relcache entry, the relcache
> code would set this field as a back link to its rel->rd_smgr pointer.
> With this setup, an smgr-level clear would correctly unhook from the
> relcache even if the clear did not come directly through the relcache.
> This would simplify RelationCacheInvalidateEntry and
> LocalExecuteInvalidationMessage, which could then treat relcache clear
> and smgr clear as independent operations.
>
> Comments?

Only: Josh, put a hold on those press releases; it looks like an RC5 is
forthcoming ...
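
For readers following the thread, the "owning pointer" idea in the quoted proposal can be sketched in a few lines of C. This is only a toy model under stated assumptions, not the actual PostgreSQL code: the struct layouts are cut down to the fields the message mentions (`myowner`, `rd_smgr`), and the `toy_`-prefixed functions are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins; NOT the real PostgreSQL definitions. */
typedef struct SMgrRelationData *SMgrRelation;

struct SMgrRelationData
{
    unsigned      relfilenode;  /* simplified key (assumption) */
    SMgrRelation *myowner;      /* back link to the owner's pointer, or NULL */
};

typedef struct RelationData
{
    unsigned     relid;         /* simplified table OID (assumption) */
    SMgrRelation rd_smgr;       /* cached smgr reference, or NULL */
} RelationData;

/* Hypothetical helper: create an smgr entry that has no owner yet. */
SMgrRelation
toy_smgropen(unsigned relfilenode)
{
    SMgrRelation reln = calloc(1, sizeof(*reln));

    if (reln != NULL)
        reln->relfilenode = relfilenode;
    return reln;
}

/* Hypothetical helper: register *owner as the one pointer to reln. */
void
toy_smgrsetowner(SMgrRelation *owner, SMgrRelation reln)
{
    assert(owner != NULL && reln != NULL);

    /* Unhook any previous owner so it is not left dangling. */
    if (reln->myowner)
        *(reln->myowner) = NULL;

    reln->myowner = owner;
    *owner = reln;
}

/* The close path from the proposal: clear the owner's pointer first. */
void
toy_smgrclose(SMgrRelation reln)
{
    if (reln->myowner)
        *(reln->myowner) = NULL;
    free(reln);
}
```

In this model, closing the smgr entry directly (as happens when the inval event for the temp table OID arrives first) resets the relation's `rd_smgr` to NULL instead of leaving it dangling, so a later relcache clear for the original table OID has nothing stale to follow.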

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
