Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock) - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock) |
Date | |
Msg-id | 4F3BD6E1.40904@enterprisedb.com |
In response to | Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock) (Fujii Masao <masao.fujii@gmail.com>) |
Responses | Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock) |
List | pgsql-hackers |
On 13.02.2012 19:13, Fujii Masao wrote:
> On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> On 13.02.2012 01:04, Jeff Janes wrote:
>>>
>>> Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
>>> I have no idea if I did it correctly, in particular if calling
>>> GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
>>> effects that make that a bad thing to do. I'm not proposing it as the
>>> real fix, I just wanted to get around this problem in order to do more
>>> testing.
>>
>>
>> Thanks. That's basically the right approach. Attached patch contains a
>> cleaned up version of that.
>>
>>
>>> It does get rid of the "there is no contrecord flag" errors, but
>>> recover still does not work.
>>>
>>> Now the count of tuples in the table is always correct (I never
>>> provoke a crash during the initial table load), but sometimes updates
>>> to those tuples that were reported to have been committed are lost.
>>>
>>> This is more subtle, it does not happen on every crash.
>>>
>>> It seems that when recovery ends on "record with zero length at...",
>>> that recovery is correct.
>>>
>>> But when it ends on "invalid magic number 0000 in log file.." then the
>>> recovery is screwed up.
>>
>>
>> Can you write a self-contained test case for that? I've been trying to
>> reproduce that by running the regression tests and pgbench with a streaming
>> replication standby, which should be pretty much the same as crash recovery.
>> No luck this far.
>
> Probably I could reproduce the same problem as Jeff got.
> Here is the test case:
>
> $ initdb -D data
> $ pg_ctl -D data start
> $ psql -c "create table t (i int); insert into t
> values(generate_series(1,10000)); delete from t"
> $ pg_ctl -D data stop -m i
> $ pg_ctl -D data start
>
> The crash recovery emitted the following server logs:
>
> LOG: database system was interrupted; last known up at 2012-02-14 02:07:01 JST
> LOG: database system was not properly shut down; automatic recovery in progress
> LOG: redo starts at 0/179CC90
> LOG: invalid magic number 0000 in log file 0, segment 1, offset 8060928
> LOG: redo done at 0/17AD858
> LOG: database system is ready to accept connections
> LOG: autovacuum launcher started
>
> After recovery, I could not see the table "t" which I created before:
>
> $ psql -c "select count(*) from t"
> ERROR: relation "t" does not exist

Are you still seeing this failure with the latest patch I posted
(http://archives.postgresql.org/message-id/4F38F5E5.8050203@enterprisedb.com)?
That includes Jeff's fix for the original crash you and Jeff saw. With
that version, I can't get a crash anymore. I also can't reproduce the
inconsistency that Jeff still saw with his fix
(http://archives.postgresql.org/message-id/CAMkU=1zGWp2QnTjiyFe0VMu4gc+MoEexXYaVC2u=+ORfiYj6ow@mail.gmail.com).
Jeff, can you clarify if you're still seeing an issue with the latest
version of the patch? If so, can you give a self-contained test case for
that?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com