Re: [HACKERS] Vacuum analyze bug CAUGHT - Mailing list pgsql-hackers
From:            Tom Lane
Subject:         Re: [HACKERS] Vacuum analyze bug CAUGHT
Date:
Msg-id:          7113.936888336@sss.pgh.pa.us
In response to:  Vacuum analyze bug CAUGHT  (Michael Simms <grim@argh.demon.co.uk>)
Responses:       Re: [HACKERS] Vacuum analyze bug CAUGHT
                 Re: [HACKERS] Vacuum analyze bug CAUGHT
List:            pgsql-hackers

Michael Simms <grim@argh.demon.co.uk> writes:
> Okee, I have caught the vacuum analyse crash that was giving me a load
> of grief

I spent a good while last night trying to duplicate your report, but couldn't on either current sources or the latest REL6_5 branch. Maybe that means we fixed the problem already --- but I think it's more likely that there's a platform dependency involved, or some additional enabling condition that you forgot to mention. So you'll have to keep poking at it. However, I do have some comments and suggestions.

Starting at the end, the final crash:

> Program received signal SIGSEGV, Segmentation fault.
> AllocSetReset (set=0x0) at aset.c:159
> aset.c:159: No such file or directory.
> (gdb) where
> #0  AllocSetReset (set=0x0) at aset.c:159
> #1  0x810ac12 in EndPortalAllocMode () at portalmem.c:938
> #2  0x8078833 in AtAbort_Memory () at xact.c:800
> #3  0x80789ff in AbortTransaction () at xact.c:1026
> #4  0x8078aef in AbortCurrentTransaction () at xact.c:1243
> #5  0x80deed6 in PostgresMain (argc=-1073742288, argv=0xbffff7a0, real_argc=7,
>     real_argv=0xbffffd04) at postgres.c:1550

appears to be the same bug I alluded to before: memory allocation isn't cleaned up properly if elog(ERROR) is executed outside a transaction. I know how to fix this, and will do so before 6.5.2, but fixing it will just prevent a coredump after an error has already occurred. What we need to be looking for in your setup is the reason why an error is being reported at all.

The first thing you can do is look to see what the error message is --- it should be in the postmaster's stderr logfile, even though 6.5.* libpq omits to display it because of the crash. Another useful thing would be to set a breakpoint at elog() so that you can examine the context of the error report. (Actually, since vacuum generates lots of elog(NOTICE) and elog(DEBUG) calls, the breakpoint had better be further down in elog, perhaps where it's about to longjmp back to PostgresMain.)

BTW, VACUUM runs a separate transaction for each table it works on, so although most of its work is done inside a transaction, there are short intervals between tables where it's not in one. The error must be getting reported during one of these intervals. That narrows things down a lot.

Now, about the SIGUSR2. That's mighty suggestive, but it's not necessarily an indication of a bug. There are two reasons why a SIGUSR2 would be delivered to a backend. One is LISTEN/NOTIFY, which I assume we can discount, since you'd have to have explicitly done it. The other is that the SI message buffer manager sends a broadcast SIGUSR2 to all the backends if it thinks its message buffer is getting dangerously full --- presumably that's what you saw happening. So the questions raised are (a) why is the buffer getting full, and (b) could it have actually overflowed later, and if so did the overflow recovery code work?

Background: the SI message buffer is a scheme for notifying backends to discard cached copies of tuples from the system tables. Whenever a backend modifies a tuple from the system tables, it has to send out an SI ("shared cache invalidation") message telling the other backends it has changed tuple X in table Y, so that they discard their no-longer-accurate cached copies, if any. Messages in the buffer can be discarded only after all active backends have read them. Stuff like VACUUM tends to produce a lot of SI messages, but table creation/deletion causes some too.
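To make that buffering rule concrete, here is a toy sketch in plain C --- emphatically *not* the actual sinval code; every name, size, and threshold in it is made up --- of a queue whose slots can be recycled only after the slowest reader has consumed them, and whose writer raises an alarm once the queue gets dangerously full:

    /*
     * Illustrative sketch only --- not PostgreSQL source.  A shared
     * invalidation queue: every reader must see every message, so a slot
     * can be reused only after the slowest reader has consumed it, and
     * the writer raises an alarm (in the real system, a SIGUSR2
     * broadcast) when the queue is getting dangerously full.
     */
    #include <stdio.h>

    #define QUEUE_SIZE   16
    #define ALARM_LEVEL  12                 /* "dangerously full" threshold */
    #define MAX_READERS  4

    typedef struct {
        char msg[QUEUE_SIZE][32];           /* pending invalidation messages */
        long head;                          /* total messages ever written   */
        long read_pos[MAX_READERS];         /* messages consumed per reader  */
        int  nreaders;
    } SIQueueSketch;

    static long slowest_reader(const SIQueueSketch *q)
    {
        long min = q->read_pos[0];
        for (int i = 1; i < q->nreaders; i++)
            if (q->read_pos[i] < min)
                min = q->read_pos[i];
        return min;
    }

    /* Append a message; returns -1 on overflow (readers must reset caches). */
    static int si_write(SIQueueSketch *q, const char *m)
    {
        long used = q->head - slowest_reader(q);

        if (used >= QUEUE_SIZE)
            return -1;
        if (used >= ALARM_LEVEL)
            printf("queue nearly full --- kick the idle readers\n");
        snprintf(q->msg[q->head % QUEUE_SIZE], sizeof q->msg[0], "%s", m);
        q->head++;
        return 0;
    }

    /* A reader consumes everything it has not yet seen. */
    static void si_read(SIQueueSketch *q, int reader)
    {
        while (q->read_pos[reader] < q->head) {
            printf("reader %d invalidates: %s\n",
                   reader, q->msg[q->read_pos[reader] % QUEUE_SIZE]);
            q->read_pos[reader]++;
        }
    }

    int main(void)
    {
        SIQueueSketch q = { .nreaders = 2 };
        char m[32];

        /* Reader 1 never reads (an "idle backend"), so it alone determines
         * how fast the queue fills up. */
        for (int i = 0; i < 20; i++) {
            snprintf(m, sizeof m, "tuple %d changed", i);
            if (si_write(&q, m) < 0)
                printf("overflow --- tell everyone to discard ALL cached tuples\n");
        }
        si_read(&q, 0);                     /* reader 0 drains its backlog */
        return 0;
    }

In the real backend the "alarm" is the SIGUSR2 broadcast, and the overflow branch corresponds to the "discard everything" recovery discussed below.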
The SI buffer is global across an installation (one postmaster and its children), *not* per-database, so even if you only had these two backends connected to your crashtest database, they could have been affected by anything that was going on in other databases belonging to the same postmaster. Were there other backends alive at the time, and if so what were they doing, or not doing?

If all the backends are busy, the SI buffer really shouldn't get anywhere near full, although I suppose it could happen under extreme conditions. The case where the buffer tends to fill is when one or more backends are sitting idle, waiting for a client command. They'll only read SI messages during transaction start (or, in 6.6, after acquiring a lock), so an idle backend blocks the SI buffer from emptying. The point of the SIGUSR2 broadcast is to kick-start idle backends so that they will execute a transaction and receive their SI messages (there's a toy sketch of that kick-start at the end of this message). I am guessing that you had an idle backend connected to some other database of the same postmaster, so that the SI buffer was getting full of the messages being generated by the VACUUM and table create/delete processes. If you had no idle backends, then we ought to look for the reason for the SI buffer filling far enough to cause SIGUSR2; but if you did, then it's a normal condition.

BTW, if you have a backend stopped due to a gdb breakpoint or trap, it's certainly not consuming SI messages. So when you let it sit after observing the SIGUSR2, you altered the behavior. The VACUUM process was still generating SI messages, and even though the hypothetical idle backend would have eaten its messages thanks to SIGUSR2, the one you had blocked with gdb stopped doing so. So eventually there would have been an SI overflow condition. I think that would take longer than "3-5 seconds", though, unless your machine is way faster than mine.

If the SI buffer does overflow because someone isn't eating his messages promptly, it's not supposed to be fatal. Rather, the SI manager reports to all the backends "Sorry, I've lost track of what's going on, so I suggest you discard *all* your cached tuples." However, 6.5.* has some bugs in the shared cache reset process. (I think I have fixed these for 6.6, but there are enough poorly-tested revisions in the affected modules that we decided not to risk trying to back-patch 6.5.*.)

Anyway, back to the point: this is all very suggestive that what you are seeing may be SI overflow and a consequent failure. But I'm not convinced of it, partly because of the timing issue and partly because the vacuum process would have issued a "NOTICE: SIReadEntryData: cache state reset" message before entering the buggy code; I didn't see one in your trace. However, if you find instances of this message in your postmaster logfile, then it's definitely a possible cause. (I would still wonder why SI overflow is occurring in the normal case where you're not using gdb to block a backend from responding to SIGUSR2.)

I hope this gives you enough info to poke at the problem more intelligently. Lastly, did you build with --enable-cassert? The assert checks slow things down a little, but are often really helpful when looking for backend bugs.
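Since the kick-start idea comes up a couple of times above, here is an equally toy sketch of it --- again just illustrative C with invented names, not backend source: a process that sits idle installs a SIGUSR2 handler that merely sets a flag, and its main loop does the actual catching-up on the invalidation queue when it gets prodded.

    /*
     * Illustrative sketch only --- not backend source code.  An "idle
     * backend" normally reads SI messages only at transaction start; the
     * SIGUSR2 broadcast exists to wake it up so the shared queue can
     * drain even though no client command is arriving.
     */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_sigusr2 = 0;

    static void handle_sigusr2(int signo)
    {
        (void) signo;
        got_sigusr2 = 1;                /* only set a flag; work is done later */
    }

    /* Stand-in for reading the shared invalidation queue. */
    static void drain_invalidation_queue(void)
    {
        printf("consuming pending SI messages, discarding stale cache entries\n");
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = handle_sigusr2;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR2, &sa, NULL);

        printf("idle, waiting... kick me with: kill -USR2 %d\n", (int) getpid());

        for (;;) {
            pause();                    /* sit idle until some signal arrives */
            if (got_sigusr2) {
                got_sigusr2 = 0;
                /* the kick: behave as if a transaction were starting,
                 * i.e. catch up on the invalidation queue right now */
                drain_invalidation_queue();
            }
        }
    }

Note that a backend stopped at a gdb breakpoint is in the worst position of all: the signal may well be delivered, but the process never returns to its main loop to act on it, which is why leaving one stopped under gdb eventually leads to the overflow case.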
			regards, tom lane