Thread: Isolation tests still falling over routinely
The buildfarm is still showing isolation test failures more days than not, eg http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pika&dt=2011-09-17%2012%3A43%3A11 and I've personally seen such failures when testing with CLOBBER_CACHE_ALWAYS. Could we please fix those tests to not have such fragile timing assumptions? regards, tom lane
Tom Lane wrote:
> The buildfarm is still showing isolation test failures more days
> than not, eg
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pika&dt=2011-09-17%2012%3A43%3A11
> and I've personally seen such failures when testing with
> CLOBBER_CACHE_ALWAYS. Could we please fix those tests to not have
> such fragile timing assumptions?

I went back over two months, and only found one failure related to an SSI test, and that was because the machine ran out of disk space. There should never be any timing-related failures on the SSI tests, as there is no blocking or deadlocking. If you have seen any failures on isolation tests other than the fk-* tests, I'd be very interested in details.

The rest are not related to SSI but test deadlock conditions related to foreign keys. My only involvement with these was to provide alternate result files for REPEATABLE READ and SERIALIZABLE isolation levels. (I test the installcheck-world target and the isolation tests in those modes frequently, and the fk-deadlock tests were failing every time at those levels.)

If I remember right, Alvaro chose these timings to balance run time against chance of failure. Unless we want to remove these deadlock handling tests or ignore failures (which both seem like bad ideas to me), I think we need to bump the long timings by an order of magnitude and just concede that those tests run for a while.

-Kevin
Excerpts from Kevin Grittner's message of Tue Sep 20 22:51:39 -0300 2011:
> If I remember right, Alvaro chose these timings to balance run time
> against chance of failure. Unless we want to remove these deadlock
> handling tests or ignore failures (which both seem like bad ideas to
> me), I think we need to bump the long timings by an order of
> magnitude and just concede that those tests run for a while.

The main problem I have is that I haven't found a way to reproduce the problems on my machine. I was playing with modifying the way the error messages are reported, but that ended up unfinished in a local branch. I'll give it a go once more and see if I can commit it, so that the buildfarm tells us whether it works.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> The main problem I have is that I haven't found a way to reproduce the
> problems on my machine.

Try -DCLOBBER_CACHE_ALWAYS.

regards, tom lane
Excerpts from Tom Lane's message of Tue Sep 20 21:30:42 -0300 2011:
> The buildfarm is still showing isolation test failures more days than
> not, eg
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pika&dt=2011-09-17%2012%3A43%3A11
> and I've personally seen such failures when testing with
> CLOBBER_CACHE_ALWAYS. Could we please fix those tests to not have such
> fragile timing assumptions?

The fix has now been installed for two weeks and no new failure has occurred. The only failure in the IsolationCheck phase since then was caused by a disk filling up (and it wasn't in the fk-* tests anyway).

I think we can consider this issue fixed.

--
Álvaro Herrera <alvherre@commandprompt.com>
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Tom Lane's message of Tue Sep 20 21:30:42 -0300 2011:
>> Could we please fix those tests to not have such
>> fragile timing assumptions?
> The fix has now been installed for two weeks and no new failure has
> occurred. The only failure in the IsolationCheck phase since then was
> caused by a disk filling up (and it wasn't in the fk-* tests anyway).
> I think we can consider this issue fixed.

Yeah, it looks good. Thanks!

regards, tom lane