Thread: Bug #613: Sequence values fall back to previously checkpointed value after crash
Bug #613: Sequence values fall back to previously checkpointed value after crash
From
pgsql-bugs@postgresql.org
Date:
Ben Grimm (bgrimm@zaeon.com) reports a bug with a severity of 1 The lower the number the more severe it is. Short Description Sequence values fall back to previously checkpointed value after crash Long Description It's hard to decide if it's devestating or not, since the bug is only apparent when a backend crashes. But when a backenddoes crash the result is pretty awful. When a backend crashes, and subsequently all others are killed off by thepostmaster to avoid shared memory corruption, sequences fall back to whatever value they had the last time the db checkpointed. I say checkpoint because this happens independantly of commits, so you could have a table with a serial columndo 10 committed inserts, crash a backend, and further inserts will fail having duplicate keys. I've tested this with7.2rc2 and 7.2 STABLE using a stock postgresql.conf (all defaults). It seems impossible to me that this is happening. I've looked at the code and seen the comment about how sequences are allocatedin advance. So I figured I'd report it... Steps to reproduce the bug: - Create a sequence, assign it a value - Checkpoint (optional) - Connect to one or more backend - select nextval (on any/all of the connections opened above) from that sequence several times, noting the first and lastvalue returned - kill -9 (or -11) any of the backend processes, the database will automatically kill off all the other backends. - reconnect and select nextval from the sequence and it will be return the first value (from above). Sample Code No file was uploaded with this report
Yikes! I have reproduced this bug. My server logs are: LOG: database system was shut down at 2002-03-08 17:30:03 CET LOG: checkpoint record is at 0/46D018 LOG: redo record is at 0/46D018; undo record is at 0/0; shutdown TRUE LOG: next transaction id: 146; next oid: 16561 LOG: database system is ready ERROR: DefineIndex: operator class "int" not supported by access method "btree" ERROR: Relation 'test' already exists LOG: server process (pid 21627) was terminated by signal 9 LOG: terminating any other active server processes LOG: all server processes terminated; reinitializing shared memory and semaphores LOG: database system was interrupted at 2002-03-11 23:22:50 CET LOG: checkpoint record is at 0/490AB8 LOG: redo record is at 0/490AB8; undo record is at 0/0; shutdown FALSE LOG: next transaction id: 172; next oid: 24753 LOG: database system was not properly shut down; automatic recovery in progress LOG: ReadRecord: record with zero length at 0/490AF8 LOG: redo is not required LOG: database system is ready I find on reconnection after 'kill -9' that the sequence is indeed set at 1. I did a checkpoint after the sequence creation. --------------------------------------------------------------------------- pgsql-bugs@postgresql.org wrote: > Ben Grimm (bgrimm@zaeon.com) reports a bug with a severity of > 1 The lower the number the more severe it is. > > Short Description Sequence values fall back to previously > checkpointed value after crash > > Long Description It's hard to decide if it's devestating or not, > since the bug is only apparent when a backend crashes. But when > a backend does crash the result is pretty awful. When a backend > crashes, and subsequently all others are killed off by the > postmaster to avoid shared memory corruption, sequences fall > back to whatever value they had the last time the db checkpointed. > I say checkpoint because this happens independantly of commits, > so you could have a table with a serial column do 10 committed > inserts, crash a backend, and further inserts will fail having > duplicate keys. I've tested this with 7.2rc2 and 7.2 STABLE > using a stock postgresql.conf (all defaults). > > It seems impossible to me that this is happening. I've looked > at the code and seen the comment about how sequences are allocated > in advance. So I figured I'd report it... > > > Steps to reproduce the bug: - Create a sequence, assign it a > value - Checkpoint (optional) - Connect to one or more backend > - select nextval (on any/all of the connections opened above) > from that sequence several times, noting the first and last > value returned - kill -9 (or -11) any of the backend processes, > the database will automatically kill off all the other backends. > - reconnect and select nextval from the sequence and it will be > return the first value (from above). > > > > > > Sample Code > > > No file was uploaded with this report > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister > command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Yikes! I have reproduced this bug. I believe I see the problem: MyLastRecPtr is being used in incompatible ways. The issue is that sequence operations are logged as "outside transaction control", which I believe is intended to mark XLOG records that should be redone whether or not the generating transaction commits. (Or if we ever do xlog UNDO, records that should not be undone at xact abort.) This classification is clearly right as far as it goes. Now MyLastRecPtr is used to chain together the XLOG records that are *within* xact control, so it doesn't get updated when an outside-the-xact record is written. (At each record insert, MyLastRecPtr is used to fill the previous-record-of-xact backlink.) This is also fine. The trouble is that at xact commit, we test to see if the current xact made any loggable changes by checking MyLastRecPtr != 0. Therefore, if we do an xact consisting ONLY of "select nextval()", this test will mistakenly think that no xlog records were written. It will not generate a commit record --- which is no big problem --- and will not write or flush the xlog --- which is a big problem. An immediately following crash will leave the sequence un-advanced. The "no commit record" part of the logic seems okay to me, but we need an independent test to decide whether to write/flush XLog. If we have reported a nextval() value to the client then it seems to me we'd better be certain that XLOG record is flushed to XLog before we report commit to the client. This is certainly fixable. However, here's the kicker: essentially what this means is that we are not treating *reporting a nextval() value to the client* as a commit-worthy event. I do not think this bug explains the past reports that claim a nextval() value *inserted into the database* has been rolled back. Seems to me that a subsequent tuple insertion would create a normal XLog record which we'd flush before commit, and thereby also flush the sequence-update XLog record. Can anyone see a way that this mechanism explains the prior reports? regards, tom lane
Hi Tom, On Tuesday 12 March 2002 16:17, Tom Lane wrote: <snip> > > Can anyone see a way that this mechanism explains the prior reports? Not sure about that, but I really feel the fix for this should go into 7.2.1, just in case the list of patches for that is still being assembled. Regards and best wishes, Justin Clift > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org
On Tue, 12 Mar 2002, Tom Lane wrote: > > The "no commit record" part of the logic seems okay to me, but we need > an independent test to decide whether to write/flush XLog. If we have > reported a nextval() value to the client then it seems to me we'd better > be certain that XLOG record is flushed to XLog before we report commit > to the client. I think the part I don't understand is why WAL is being used to update sequence values in the first place when sequences exist independantly of transactions. In previous releases a sequence basically just existed on disk in a specific location and updates to it updated the on disk copy directly since there are no concurrency issues. I do realize that running everything through WAL gives other benefits, so it's not likely to revert back to the old way. But it would seem that the only way to fix it is to flush the XLOG record immediately after the XLogInsert is called, just as if the operation took place within its own transaction. > This is certainly fixable. However, here's the kicker: essentially what > this means is that we are not treating *reporting a nextval() value to > the client* as a commit-worthy event. I do not think this bug explains > the past reports that claim a nextval() value *inserted into the > database* has been rolled back. Seems to me that a subsequent tuple > insertion would create a normal XLog record which we'd flush before > commit, and thereby also flush the sequence-update XLog record. > > Can anyone see a way that this mechanism explains the prior reports? > Actually, that doesn't appear to be the case either because in some of my tests I used a serial column type and I was just inserting data into a table. It would appear that if the sequence is in the same tuple as the data you modified then it won't get logged. What I did was create a table with a serial column and a varchar(255). Inserted 100 rows filled with data, committed. Ran a checkpoint. Checked my sequence values, inserted 10 more rows of data, committed, checked the value of the sequence again. Kill -9 the postmaster. Tried to insert into the table, but to no avail... duplicate key. currval of the sequence and it matched the value right after the checkpoint. I've been able to duplicate that scenario several times. -- Ben