The reason for loris' intermittent prepared_xacts failures - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | The reason for loris' intermittent prepared_xacts failures |
Date | |
Msg-id | 23211.1121643484@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: The reason for loris' intermittent prepared_xacts failures
|
List | pgsql-hackers |
I have what seems a pretty solid theory about the intermittent check failures on buildfarm member loris. Every non-Windows platform executes the parallel regression tests using C locale ... but not Windows, because of this hack in pg_regress.sh: unset LC_COLLATE LC_CTYPE LC_MONETARY LC_MESSAGES LC_NUMERIC LC_TIME LC_ALL LANG LANGUAGE # On Windows the default locale may not be English, so force it case $host_platform in *-*-cygwin*|*-*-mingw32*)LANG=enexport LANG;; esac Because of this, Windows and only Windows runs the parallel regression tests in a database in which LIKE index optimization is disabled. My hypothesis is that when this happens: -- Commit table creation COMMIT PREPARED 'regress-one'; \d pxtest2 ! ERROR: cache lookup failed for relation 27240 SELECT * FROM pxtest2; a --- the OID being complained of is not actually pxtest2's at all, but that of one of the relations that is created and later dropped by one of the tests that runs in parallel with prepared_xacts. And the reason we can see the problem is evident when you look at the query emitted by psql for the \d command: SELECT c.oid, n.nspname, c.relname FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE pg_catalog.pg_table_is_visible(c.oid) AND c.relname ~ '^pxtest2$' ORDER BY 2, 3; With LIKE optimization active, the ~ operator is optimized into an index constraint "c.relname = 'pxtest2'", and so pg_table_is_visible() will only actually be applied to pg_class rows with relname = 'pxtest2', and so there are no race conditions. With LIKE optimization disabled, there is no index constraint, and if pg_table_is_visible() is applied before the other part of the WHERE condition (which it is) then the function is executed on every row of pg_class not only pxtest2's. So there is a race condition wherein pg_table_is_visible() can be applied to a relation that's already been dropped --- and the fact that it uses SnapshotNow rules rather than MVCC makes the window for failure not all that narrow. We've seen this complained of before, of course, so we ought to think in terms of a general-purpose solution not only a fix for the regression test. I can think of several possible, not necessarily mutually exclusive responses: 1. Tweak psql so that it puts the is_visible condition last instead of first. This is pretty fragile since it depends on the assumption that the planner won't reorder the WHERE clauses without need. However, that is true today, and it seems worth doing if only to reduce the execution cost of a lot of is_visible calls during a \d command. 2. Tweak the is_visible family of functions to (a) use the surrounding query's snapshot instead of SnapshotNow, or (b) silently return FALSE instead of raising error when given a bad OID. I dislike (b), though, and feel that (a) should only be undertaken as part of a comprehensive rethinking of the catalog access rules. 3. Don't run the prepared_xacts test in parallel with other tests, or change it to not use \d. Definitely a band-aid, but very low risk. 4. Fix the Windows special case so that we test in C locale not en_US locale (is this even possible on that platform?) Should we be using --no-locale in pg_regress ... or wouldn't that help? Comments? regards, tom lane PS: it seems striking that there is not much variability in the OID complained of in the loris failure logs. You'd expect it to jump around more based on the relative timing of the several parallel tests. (While investigating this problem I added "select 'pxtest2'::regclass::oid" to the test to see what OID pxtest2 had, and it was different on every run on my Linux box.) I think this implies that the Windows port has much more repeatability of inter-process timing than we usually see on Unix platforms. Not sure if this is good, bad, or indifferent, but it's an interesting factoid.
pgsql-hackers by date: