Re: pgsql: Improve gist XLOG code to follow the coding - Mailing list pgsql-committers
From | Tom Lane |
---|---|
Subject | Re: pgsql: Improve gist XLOG code to follow the coding |
Date | |
Msg-id | 8471.1143819326@sss.pgh.pa.us Whole thread Raw |
In response to | Re: pgsql: Improve gist XLOG code to follow the coding (Teodor Sigaev <teodor@sigaev.ru>) |
Responses |
Re: pgsql: Improve gist XLOG code to follow the coding
|
List | pgsql-committers |
Teodor Sigaev <teodor@sigaev.ru> writes: > Here I am. Oleg now is in expedition to solar eclipse (should return soon). Cool ... I hope to see one someday ... > * gistxlog.c:gistContinueInsert > /* > * XXX fall out to avoid making LOG message at bottom of routine. > * I think the logic for when to emit that message is all wrong... > */ > At line 543 itup vector was filled by invalid tuple, so newly filled root > will contains only invalid tuples. Hence, reindex/vacuum full is required. Hmm. That probably needs to be redesigned then. The problem is that as the thing is written, you *must* reinit the buffer here --- the code I removed that checked the page LSN was unsafe. If you want to depend on the prior page contents during replay, you have to give XLogInsert the opportunity to save the whole page when the xlog entry is made. > Sorry, I missed something, what is torn-page problem? The torn-page problem is where a page only gets partially written to disk during a power failure. When we come back up, the LSN might look like the page is up to date, when in reality the data is old/inconsistent. We deal with this by having the first WAL record that touches a given page after a checkpoint contain a copy of the whole page, and then we can rewrite the page from that copy instead of trying to replay the incremental update. I've fixed the code to do this correctly for the gistRedoPageUpdateRecord case, but I'm not sure what we do for gistContinueInsert. BTW, there's another problem with gistContinueInsert: it's not making WAL entries for the actions it takes. It needs to do so --- consider for example the PITR case, where the log will be shipped somewhere else and then followed verbatim. So you've got to add WAL entries for any recovery actions you take after reading the existing WAL entries. The critical-section problem is not related to that; it's about being sure that we don't make any changes to shared buffers that don't have an associated WAL record. (See "WAL dirty-buffer management bug" thread on -hackers.) I fixed the gist code so that it holds a critical section throughout doing an insertion or vacuum operation, but that's really far too wide --- there's too much chance of an elog() somewhere in the process. In particular, we really don't want to call any user-defined datatype functions inside the critical section. So the process needs to be to compute all the required changes *but not make them*, then start the critical section, then apply the changes and make the WAL record. This seemed like a large enough change that I figured I'd better talk with you about it. One idea I had was to generate the WAL record as the output of the decision-making code, and then the critical section would use the WAL record as its guide to what to do to the buffers. BTW, I'm confused about gistadjscans: seems to me that's either broken or unnecessary. Since we no longer hold an exclusive lock on a gist index while inserting into it, the comment at gistscan.c line 33 is certainly wrong now. If the adjustment is still necessary then some other mechanism needs to be devised to get the information to other backends. If it's not necessary then I think we should take it out. I'm not totally familiar with the gist logic, but it looks to me like gistadjscans is only called during an insert into a non-leaf page, which makes me think it might be unnecessary --- do gist index scans ever stop on non-leaf pages? regards, tom lane
pgsql-committers by date: