Re: Postgresql 8.4.1 segfault, backtrace - Mailing list pgsql-bugs
From | Tom Lane |
---|---|
Subject | Re: Postgresql 8.4.1 segfault, backtrace |
Date | |
Msg-id | 730.1253833254@sss.pgh.pa.us |
In response to | Postgresql 8.4.1 segfault, backtrace (Richard Neill <rn214@cam.ac.uk>) |
Responses |
Re: Postgresql 8.4.1 segfault, backtrace
Re: Postgresql 8.4.1 segfault, backtrace Re: Postgresql 8.4.1 segfault, backtrace |
List | pgsql-bugs |
Michael Brown <mbrown@fensystems.co.uk> writes:
>> ... (If you have a spare machine with the same OS and
>> the same postgres executables, maybe you could put the core file on that
>> and let me ssh in to have a look?)

[ ssh details ]

Thanks for letting me poke around. What I found out is that the
hash_seq_search loop in RelationCacheInitializePhase2 is crashing
because it's attempting to examine a hashtable entry that is on the
hashtable's freelist!? Given that information I think the cause of
the bug is fairly clear:

1. RelationCacheInitializePhase2 loads the rules or trigger descriptions
for some system catalog (actually it must be the latter; we haven't got
any catalogs with rules attached).

2. By chance, a shared-cache-inval flush comes through while it's doing
that, causing all non-open, non-nailed relcache entries to be discarded.
Including, in particular, the one that is "next" according to the
hash_seq_search's status.

3. Now the loop iterates into the freelist, and kaboom. It will
probably fail to fail on entries that are actually discarded, because
they still have valid pointers in them ... but as soon as it gets to
a never-yet-used freelist entry, it'll do a null dereference.

RelationCacheInitializePhase2 is breaking the rules by assuming that it
can continue to iterate the hash_seq_search after doing something that
might cause a hash entry other than the current one to be discarded.
We can probably fix that without too much trouble, eg by restarting
the loop after an update. But: the question at this point is why
we've never seen such a report before 8.4. If this theory is correct,
it's been broken for a *long* time. I can think of a couple of
possible explanations:

A: the problem can only manifest if this loop has work to do for
a relcache entry that is not the last one in its bucket chain.
8.4 might have added more preloaded relcache entries than were there
before. Or the 8.4 changes in the hash functions might have shuffled
the entries' bucket placement around so that the problem can happen
when it couldn't before.

B: the 8.4 changes in the shared-cache-inval mechanism might have
made it more likely that a freshly started backend could get hit with a
relcache flush request. I should think that those changes would have
made this *less* likely not more so, so maybe there is an additional
bug lurking in that area.

I shall go and do some further investigation, but at least it's now
clear where to look. Thanks for the report, and for being so helpful
in providing information!

regards, tom lane
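
For readers following along, here is a minimal toy sketch of the "restart the loop after an update" idea mentioned above. It is not PostgreSQL code and does not use dynahash: ToyEntry, toy_flush, toy_load and toy_load_all are hypothetical names, and a single linked chain plus a freelist stands in for the real hash table. The only property it borrows from the description above is that a discarded entry's link field gets reused for the freelist, so a "next" pointer saved before a load can end up pointing into the freelist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NENTRIES 5

/* Hypothetical stand-in for a cache entry; not a PostgreSQL struct. */
typedef struct ToyEntry {
    struct ToyEntry *next;   /* link in the live chain OR in the freelist */
    bool nailed;             /* nailed entries survive a flush */
    bool needs_load;         /* true until toy_load() has processed it */
} ToyEntry;

static ToyEntry pool[NENTRIES];
static ToyEntry *chain;      /* live entries */
static ToyEntry *freelist;   /* discarded entries */

/* Build the chain pool[0] -> pool[1] -> ... -> pool[4]; only pool[0] is nailed. */
void toy_init(void)
{
    chain = NULL;
    freelist = NULL;
    for (int i = NENTRIES - 1; i >= 0; i--) {
        pool[i].nailed = (i == 0);
        pool[i].needs_load = true;
        pool[i].next = chain;
        chain = &pool[i];
    }
}

/* Discard every entry that is neither nailed nor the one currently open. */
static void toy_flush(const ToyEntry *open)
{
    ToyEntry **link = &chain;
    while (*link != NULL) {
        ToyEntry *e = *link;
        if (e != open && !e->nailed) {
            *link = e->next;      /* unlink from the live chain */
            e->next = freelist;   /* link field now belongs to the freelist */
            freelist = e;
        } else {
            link = &e->next;
        }
    }
}

/* Simulated load step; optionally a flush arrives while it runs. */
static void toy_load(ToyEntry *e, bool flush_arrives)
{
    assert(e != NULL);
    if (flush_arrives)
        toy_flush(e);
    e->needs_load = false;
}

/* Count the entries reachable from a list head. */
int toy_count(const ToyEntry *head)
{
    int n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}

/*
 * Load every live entry that still needs it.  After each load the scan
 * restarts from the head of the chain instead of following a pointer
 * saved before the load, since that pointer may now be on the freelist.
 * flush_on_nth_load == 0 means no flush ever arrives.
 */
int toy_load_all(int flush_on_nth_load)
{
    int loaded = 0;
    ToyEntry *e = chain;
    while (e != NULL) {
        if (!e->needs_load) {
            e = e->next;          /* safe: nothing was discarded since the restart */
            continue;
        }
        loaded++;
        toy_load(e, loaded == flush_on_nth_load);
        e = chain;                /* restart the scan */
    }
    return loaded;
}
```

In this toy, calling toy_init() and then toy_load_all(1) loads only the nailed entry, because the simulated flush discards the other four before the scan reaches them. The scan never walks into the freelist. Restarting costs extra passes over entries that are already loaded, which seems acceptable for a one-time initialization loop but is an assumption of this sketch rather than a statement about the actual fix.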