Thread: VACUUM FULL versus unsafe order-of-operations in DDL commands
So, as the testing rolls on, I started to see some failures in various ALTER-FOREIGN-thingy commands. The cause proved to be that numerous places in foreigncmds.c do this: tuple = SearchSysCacheCopy(...); ... alter the tuple as needed ... rel = heap_open(target-catalog, RowExclusiveLock); simple_heap_update(rel, &tuple->t_self, tuple); heap_close(rel, RowExclusiveLock); rather than the more common pattern in which the catalog is opened first. I confess to not having realized this myself (or if I ever did know it, I'd forgotten), but *the above coding pattern is not safe*. You must get your lock on the catalog *before* looking up the target tuple, else its TID may be obsoleted by a concurrent vacuum full before you've obtained lock on the catalog. Both update and delete operations are at risk in this way. foreigncmds.c is not hard to fix, but the scary aspect of this is the possibility that we've made the same mistake elsewhere, or might do so again in future. Some desultory examination of simple_heap_update and simple_heap_delete calls didn't find any other instances, but I am not sure I didn't miss anything. And this seems like an easy trap to fall into when refactoring (the current work to try to unify operations like ALTER OWNER could easily get into this kind of problem, for instance). I tried to think of some practical way to mechanically test for this type of error, but came up with nothing. Any ideas? regards, tom lane
On Sun, Aug 14, 2011 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > So, as the testing rolls on, I started to see some failures in various > ALTER-FOREIGN-thingy commands. The cause proved to be that numerous > places in foreigncmds.c do this: > > tuple = SearchSysCacheCopy(...); > > ... alter the tuple as needed ... > > rel = heap_open(target-catalog, RowExclusiveLock); > > simple_heap_update(rel, &tuple->t_self, tuple); > > heap_close(rel, RowExclusiveLock); > > rather than the more common pattern in which the catalog is opened > first. Interesting. I vaguely recall flipping some of those around (to put the lock acquisition first) before committing the 9.1-era foreign table patch; it didn't seem like an entirely healthy thing to do. But I didn't really have any concrete notion of why it might be dangerous. > foreigncmds.c is not hard to fix, but the scary aspect of this is the > possibility that we've made the same mistake elsewhere, or might do so > again in future. Some desultory examination of simple_heap_update and > simple_heap_delete calls didn't find any other instances, but I am not > sure I didn't miss anything. And this seems like an easy trap to fall > into when refactoring (the current work to try to unify operations like > ALTER OWNER could easily get into this kind of problem, for instance). > > I tried to think of some practical way to mechanically test for this > type of error, but came up with nothing. Any ideas? Hmm. How about setting the TID to an illegal value of some kind when a catcache tuple is extracted without a table lock? Then any subsequent update or delete using that tuple would blow up. I think that'd be way too expensive to do in normal running but perhaps we could have a #define... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company