Thread: VACUUM FULL versus unsafe order-of-operations in DDL commands

VACUUM FULL versus unsafe order-of-operations in DDL commands

From

Tom Lane

Date:

14 August 2011, 15:22:03

So, as the testing rolls on, I started to see some failures in various
ALTER-FOREIGN-thingy commands.  The cause proved to be that numerous
places in foreigncmds.c do this:
tuple = SearchSysCacheCopy(...);
... alter the tuple as needed ...
rel = heap_open(target-catalog, RowExclusiveLock);
simple_heap_update(rel, &tuple->t_self, tuple);
heap_close(rel, RowExclusiveLock);

rather than the more common pattern in which the catalog is opened
first.  I confess to not having realized this myself (or if I ever did
know it, I'd forgotten), but *the above coding pattern is not safe*.
You must get your lock on the catalog *before* looking up the target
tuple, else its TID may be obsoleted by a concurrent vacuum full before
you've obtained lock on the catalog.  Both update and delete operations
are at risk in this way.

foreigncmds.c is not hard to fix, but the scary aspect of this is the
possibility that we've made the same mistake elsewhere, or might do so
again in future.  Some desultory examination of simple_heap_update and
simple_heap_delete calls didn't find any other instances, but I am not
sure I didn't miss anything.  And this seems like an easy trap to fall
into when refactoring (the current work to try to unify operations like
ALTER OWNER could easily get into this kind of problem, for instance).

I tried to think of some practical way to mechanically test for this
type of error, but came up with nothing.  Any ideas?
        regards, tom lane

Re: VACUUM FULL versus unsafe order-of-operations in DDL commands

From

Robert Haas

Date:

14 August 2011, 17:32:01

On Sun, Aug 14, 2011 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So, as the testing rolls on, I started to see some failures in various
> ALTER-FOREIGN-thingy commands.  The cause proved to be that numerous
> places in foreigncmds.c do this:
>
>        tuple = SearchSysCacheCopy(...);
>
>        ... alter the tuple as needed ...
>
>        rel = heap_open(target-catalog, RowExclusiveLock);
>
>        simple_heap_update(rel, &tuple->t_self, tuple);
>
>        heap_close(rel, RowExclusiveLock);
>
> rather than the more common pattern in which the catalog is opened
> first.

Interesting.  I vaguely recall flipping some of those around (to put
the lock acquisition first) before committing the 9.1-era foreign
table patch; it didn't seem like an entirely healthy thing to do.  But
I didn't really have any concrete notion of why it might be dangerous.

> foreigncmds.c is not hard to fix, but the scary aspect of this is the
> possibility that we've made the same mistake elsewhere, or might do so
> again in future.  Some desultory examination of simple_heap_update and
> simple_heap_delete calls didn't find any other instances, but I am not
> sure I didn't miss anything.  And this seems like an easy trap to fall
> into when refactoring (the current work to try to unify operations like
> ALTER OWNER could easily get into this kind of problem, for instance).
>
> I tried to think of some practical way to mechanically test for this
> type of error, but came up with nothing.  Any ideas?

Hmm.  How about setting the TID to an illegal value of some kind when
a catcache tuple is extracted without a table lock?  Then any
subsequent update or delete using that tuple would blow up.  I think
that'd be way too expensive to do in normal running but perhaps we
could have a #define...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company