Fixing Simms' vacuum problems - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Fixing Simms' vacuum problems |
Date | |
Msg-id | 15133.937072666@sss.pgh.pa.us |
Responses |
Re: Fixing Simms' vacuum problems
Re: [HACKERS] Fixing Simms' vacuum problems Re: [HACKERS] Fixing Simms' vacuum problems |
List | pgsql-hackers |
Michael Simms was kind enough to give me login privileges on his system to poke at his problems with vacuum running concurrently with table create/drop operations. I am not sure why his setup seems to display the problem easier than mine does, but it's certainly true that crashes occur very easily there, whereas it often takes many tries for me.

Anyway, I am now convinced that his symptoms are indeed explained by the locking and cache-invalidation problems we have been discussing. I saw a number of different failures, but they all seemed to trace back to one of two common themes:

(1) The non-vacuuming backend crashes because of accessing a system-relation tuple that isn't in the same place anymore: the tuple is found in the local syscache, but the item location recorded there is stale because vacuum has moved the tuple, and the non-vacuum process hasn't noticed the SI update message for it yet.

(2) The vacuuming backend can fail because of trying to vacuum a relation that's already been deleted. This can be blamed on the known bug that DROP TABLE releases its exclusive lock on the target table before end of transaction.

I expect there are also failures due to the lack-of-lock problems that Hiroshi recently identified, but I didn't happen to see any of those in the limited number of cases that I watched with the debugger.

So, it looks like a solution involves two components: first, being more careful to lock system relations appropriately, and second, being sure that SI messages are seen soon enough. I think the read-SI-messages-at-lock-time code that's already in place for 6.6 will be sufficient for the second point, if we are religious about acquiring appropriate locks. (BTW, I think that in most cases an appropriate lock on a system table will be less strong than AccessExclusiveLock --- Vadim, do you agree?)

Once we have the changes, the next question is do we want to risk back-patching them into 6.5.2? I can see several ways that we could proceed:

1. Back-patch into REL6_5, and postpone 6.5.2 release for a while for beta-testing.

2. Put out 6.5.2 now (since it already has several other useful fixes), then back-patch, and release 6.5.3 after a beta-testing interval.

3. Leave these changes out of 6.5.*, and try to get 6.6 out the door soon instead.

I am not eager to hurry 6.6 along --- I have a lot of half-done work in the planner/optimizer that I'd like to finish for 6.6. Perhaps choice #2 is the way to go. Comments?

regards, tom lane
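
For readers following along, failure theme (1) and the read-SI-messages-at-lock-time remedy can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not PostgreSQL code: `Heap`, `SIQueue`, `Backend`, `vacuum_move`, and `demo` are all hypothetical names invented for the illustration, and the stale access raises a clean `LookupError` here where a real backend would crash.

```python
# Toy model only: none of these classes or functions are PostgreSQL APIs.

class Heap:
    """A stand-in system relation: slot number -> (key, payload), or None if empty."""
    def __init__(self, slots):
        self.slots = dict(slots)

    def fetch(self, slot):
        value = self.slots.get(slot)
        if value is None:
            raise LookupError(f"no tuple at slot {slot}")
        return value


class SIQueue:
    """A stand-in shared-invalidation queue: an append-only log of stale keys."""
    def __init__(self):
        self.messages = []

    def send(self, key):
        self.messages.append(key)


class Backend:
    """A backend with a local cache mapping key -> slot (the 'item location')."""
    def __init__(self, heap, si_queue):
        self.heap = heap
        self.si_queue = si_queue
        self.next_msg = 0   # index of the first SI message not yet processed
        self.cache = {}

    def process_si_messages(self):
        # Drop every cache entry named by a message we have not seen yet.
        for key in self.si_queue.messages[self.next_msg:]:
            self.cache.pop(key, None)
        self.next_msg = len(self.si_queue.messages)

    def lock_relation(self):
        # The remedy being modelled: catch up on SI messages at lock time.
        self.process_si_messages()

    def find_slot(self, key):
        # Cache miss path: scan the relation for the tuple's current slot.
        for slot, value in self.heap.slots.items():
            if value is not None and value[0] == key:
                return slot
        raise LookupError(key)

    def lookup(self, key, take_lock):
        if take_lock:
            self.lock_relation()
        if key not in self.cache:
            self.cache[key] = self.find_slot(key)
        return self.heap.fetch(self.cache[key])


def vacuum_move(heap, si_queue, key, old_slot, new_slot):
    """Move a tuple to a new slot and announce the move on the SI queue."""
    heap.slots[new_slot] = heap.slots[old_slot]
    heap.slots[old_slot] = None
    si_queue.send(key)


def demo():
    heap = Heap({1: ("foo", "row for foo")})
    queue = SIQueue()
    backend = Backend(heap, queue)

    backend.lookup("foo", take_lock=True)      # caches foo -> slot 1
    vacuum_move(heap, queue, "foo", 1, 2)      # tuple now lives in slot 2

    try:
        backend.lookup("foo", take_lock=False)  # uses the stale slot
        stale_failed = False
    except LookupError:
        stale_failed = True

    row = backend.lookup("foo", take_lock=True)  # invalidates, then rescans
    return stale_failed, row
```

In this model, the lookup that skips the lock trusts the cached slot and fails, while the lookup that takes the lock first discards the stale entry and finds the tuple at its new location. That mirrors why the remedy depends on being consistent about acquiring locks before touching system relations.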