Thread: Bug #613: Sequence values fall back to previously checkpointed value after crash

Bug #613: Sequence values fall back to previously checkpointed value after crash

From

pgsql-bugs@postgresql.org

Date:

11 March 2002, 16:59:37

Ben Grimm (bgrimm@zaeon.com) reports a bug with a severity of 1
The lower the number the more severe it is.

Short Description
Sequence values fall back to previously checkpointed value after crash

Long Description
It's hard to decide if it's devestating or not, since the bug is only apparent when a backend crashes. But when a
backenddoes crash the result is pretty awful. When a backend crashes, and subsequently all others are killed off by
thepostmaster to avoid shared memory corruption, sequences fall back to whatever value they had the last time the db
checkpointed. I say checkpoint because this happens independantly of commits, so you could have a table with a serial
columndo 10 committed inserts, crash a backend, and further inserts will fail having duplicate keys. I've tested this
with7.2rc2 and 7.2 STABLE using a stock postgresql.conf (all defaults).

It seems impossible to me that this is happening. I've looked at the code and seen the comment about how sequences are
allocatedin advance. So I figured I'd report it...

Steps to reproduce the bug:
- Create a sequence, assign it a value
- Checkpoint (optional)
- Connect to one or more backend
- select nextval (on any/all of the connections opened above) from that sequence several times, noting the first and
lastvalue returned

- kill -9 (or -11) any of the backend processes, the database will automatically kill off all the other backends.
- reconnect and select nextval from the sequence and it will be return the first value (from above).

Sample Code

No file was uploaded with this report

Re: Bug #613: Sequence values fall back to previously checkpointed

From

Bruce Momjian

Date:

11 March 2002, 18:03:41

Yikes!  I have reproduced this bug.  My server logs are:

LOG:  database system was shut down at 2002-03-08 17:30:03 CET
LOG:  checkpoint record is at 0/46D018
LOG:  redo record is at 0/46D018; undo record is at 0/0; shutdown TRUE
LOG:  next transaction id: 146; next oid: 16561
LOG:  database system is ready
ERROR:  DefineIndex: operator class "int" not supported by access method "btree"
ERROR:  Relation 'test' already exists
LOG:  server process (pid 21627) was terminated by signal 9
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing shared memory and semaphores
LOG:  database system was interrupted at 2002-03-11 23:22:50 CET
LOG:  checkpoint record is at 0/490AB8
LOG:  redo record is at 0/490AB8; undo record is at 0/0; shutdown FALSE
LOG:  next transaction id: 172; next oid: 24753
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  ReadRecord: record with zero length at 0/490AF8
LOG:  redo is not required
LOG:  database system is ready

I find on reconnection after 'kill -9' that the sequence is indeed set
at 1.  I did a checkpoint after the sequence creation.

---------------------------------------------------------------------------

pgsql-bugs@postgresql.org wrote:
> Ben Grimm (bgrimm@zaeon.com) reports a bug with a severity of
> 1 The lower the number the more severe it is.
>
> Short Description Sequence values fall back to previously
> checkpointed value after crash
>
> Long Description It's hard to decide if it's devestating or not,
> since the bug is only apparent when a backend crashes.  But when
> a backend does crash the result is pretty awful.  When a backend
> crashes, and subsequently all others are killed off by the
> postmaster to avoid shared memory corruption, sequences fall
> back to whatever value they had the last time the db checkpointed.
> I say checkpoint because this happens independantly of commits,
> so you could have a table with a serial column do 10 committed
> inserts, crash a backend, and further inserts will fail having
> duplicate keys.  I've tested this with 7.2rc2 and 7.2 STABLE
> using a stock postgresql.conf (all defaults).
>
> It seems impossible to me that this is happening.  I've looked
> at the code and seen the comment about how sequences are allocated
> in advance.  So I figured I'd report it...
>
>
> Steps to reproduce the bug:  - Create a sequence, assign it a
> value - Checkpoint (optional) - Connect to one or more backend
> - select nextval (on any/all of the connections opened above)
> from that sequence several times, noting the first and last
> value returned - kill -9 (or -11) any of the backend processes,
> the database will automatically kill off all the other backends.
> - reconnect and select nextval from the sequence and it will be
> return the first value (from above).
>
>
>
>
>
> Sample Code
>
>
> No file was uploaded with this report
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister
> command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Bug #613: Sequence values fall back to previously checkpointed

From

Tom Lane

Date:

12 March 2002, 00:20:22

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Yikes!  I have reproduced this bug.

I believe I see the problem: MyLastRecPtr is being used in incompatible
ways.

The issue is that sequence operations are logged as "outside transaction
control", which I believe is intended to mark XLOG records that should
be redone whether or not the generating transaction commits.  (Or if we
ever do xlog UNDO, records that should not be undone at xact abort.)
This classification is clearly right as far as it goes.  Now
MyLastRecPtr is used to chain together the XLOG records that are
*within* xact control, so it doesn't get updated when an
outside-the-xact record is written.  (At each record insert,
MyLastRecPtr is used to fill the previous-record-of-xact backlink.)
This is also fine.

The trouble is that at xact commit, we test to see if the current xact
made any loggable changes by checking MyLastRecPtr != 0.  Therefore,
if we do an xact consisting ONLY of "select nextval()", this test will
mistakenly think that no xlog records were written.  It will not
generate a commit record --- which is no big problem --- and will not
write or flush the xlog --- which is a big problem.  An immediately
following crash will leave the sequence un-advanced.

The "no commit record" part of the logic seems okay to me, but we need
an independent test to decide whether to write/flush XLog.  If we have
reported a nextval() value to the client then it seems to me we'd better
be certain that XLOG record is flushed to XLog before we report commit
to the client.

This is certainly fixable.  However, here's the kicker: essentially what
this means is that we are not treating *reporting a nextval() value to
the client* as a commit-worthy event.  I do not think this bug explains
the past reports that claim a nextval() value *inserted into the
database* has been rolled back.  Seems to me that a subsequent tuple
insertion would create a normal XLog record which we'd flush before
commit, and thereby also flush the sequence-update XLog record.

Can anyone see a way that this mechanism explains the prior reports?

            regards, tom lane

Re: Bug #613: Sequence values fall back to previously checkpointed

From

Justin

Date:

12 March 2002, 01:36:21

Hi Tom,

On Tuesday 12 March 2002 16:17, Tom Lane wrote:
<snip>
>
> Can anyone see a way that this mechanism explains the prior reports?

Not sure about that, but I really feel the fix for this should go into 7.2.1,
just in case the list of patches for that is still being assembled.

Regards and best wishes,

Justin Clift

>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org

Re: Bug #613: Sequence values fall back to previously checkpointed

From

bgrimm@zaeon.com

Date:

13 March 2002, 14:22:34

On Tue, 12 Mar 2002, Tom Lane wrote:

>
> The "no commit record" part of the logic seems okay to me, but we need
> an independent test to decide whether to write/flush XLog.  If we have
> reported a nextval() value to the client then it seems to me we'd better
> be certain that XLOG record is flushed to XLog before we report commit
> to the client.

I think the part I don't understand is why WAL is being used to update
sequence values in the first place when sequences exist independantly of
transactions.  In previous releases a sequence basically just existed
on disk in a specific location and updates to it updated the on disk
copy directly since there are no concurrency issues.  I do realize that
running everything through WAL gives other benefits, so it's not likely
to revert back to the old way.  But it would seem that the only way to
fix it is to flush the XLOG record immediately after the XLogInsert is
called, just as if the operation took place within its own transaction.

> This is certainly fixable.  However, here's the kicker: essentially what
> this means is that we are not treating *reporting a nextval() value to
> the client* as a commit-worthy event.  I do not think this bug explains
> the past reports that claim a nextval() value *inserted into the
> database* has been rolled back.  Seems to me that a subsequent tuple
> insertion would create a normal XLog record which we'd flush before
> commit, and thereby also flush the sequence-update XLog record.
>
> Can anyone see a way that this mechanism explains the prior reports?
>

Actually, that doesn't appear to be the case either because in some of
my tests I used a serial column type and I was just inserting data into
a table.  It would appear that if the sequence is in the same tuple as
the data you modified then it won't get logged.   What I did was create
a table with a serial column and a varchar(255).  Inserted 100 rows
filled with data, committed.  Ran a checkpoint.  Checked my sequence
values, inserted 10 more rows of data, committed, checked the value of
the sequence again.  Kill -9 the postmaster.  Tried to insert into the
table, but to no avail... duplicate key.  currval of the sequence and
it matched the value right after the checkpoint.  I've been able to
duplicate that scenario several times.

-- Ben