Re: buffer assertion tripping under repeat pgbench load - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: buffer assertion tripping under repeat pgbench load |
Date | |
Msg-id | 20121223091533.GA1653@alap2.fritz.box Whole thread Raw |
In response to | buffer assertion tripping under repeat pgbench load (Greg Smith <greg@2ndQuadrant.com>) |
Responses |
Re: buffer assertion tripping under repeat pgbench load
|
List | pgsql-hackers |
Hi, On 2012-12-23 02:36:42 -0500, Greg Smith wrote: > I'm testing a checkout from a few days ago and trying to complete a day long > pgbench stress test, with assertions and debugging on. I want to make sure > the base code works as expected before moving on to testing checksums. It's > crashing before finishing though. Here's a sample: > > 2012-12-20 20:36:11 EST [26613]: LOG: checkpoint complete: wrote 32 buffers > (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=3.108 > s, sync=0.978 s, total=4.187 s; sync files=7, longest=0.832 s, average=0.139 > s > TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: > 1703) > 2012-12-20 20:36:44 EST [26611]: LOG: server process (PID 4834) was > terminated by signal 6: Aborted > 2012-12-20 20:36:44 EST [26611]: DETAIL: Failed process was running: END; > > Running the same test set again gave the same crash, so this looks > repeatable. It's not trivial to trigger given the average runtime to crash > I'm seeing. Second sample: > > 2012-12-22 21:27:17 EST [6006]: LOG: checkpoint complete: wrote 34 buffers > (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=3.310 > s, sync=2.906 s, total=6.424 s; sync files=7, longest=2.648 s, average=0.415 > s > TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: > 1703) > 2012-12-22 21:27:23 EST [26611]: LOG: server process (PID 12143) was > terminated by signal 6: Aborted > 2012-12-22 21:27:23 EST [26611]: DETAIL: Failed process was running: END; > > The important part: > > TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: > 1703) > > That's the check done by "AtEOXact_Buffers - clean up at end of > transaction". It fired while executing the "END;"" statement at the end of > the standard pgbench test. The test looks for loose things that weren't > pinned/unpinned correctly by the transaction. I'm not sure if getting a > stack trace at that point will be useful. The ref count mistake could > easily have been made by an earlier statement in the transaction block, > right? > > This is on a server where shared_buffers=2GB, checkpoint_segments=64. Test > runs were automated with pgbench-tools, using the following configuration: > > MAX_WORKERS="4" > SCRIPT="tpc-b.sql" > SCALES="500 1000" > SETCLIENTS="4 8 16 32 64 96" > SETTIMES=3 > RUNTIME=600 > > All of the crashes so far have been at scale=500, and I haven't seen those > finish yet. It failed 7 hours in the first time, then 4 hours in. > > For code reference, the last commit in the source code tree I built against > was: > > af275a12dfeecd621ed9899a9382f26a68310263 > Thu Dec 20 14:23:31 2012 +0200 > > Looking at recent activity, the only buffer pin related changes I saw was > this one: > > commit 62656617dbe49cca140f3da588a5e311b3fc35ea > Date: Mon Dec 3 18:53:31 2012 +0000 > Avoid holding vmbuffer pin after VACUUM. I don't immediately see anything bad here. > This is my first test like this against 9.3 development though, so the cause > could be an earlier commit. I'm just starting with the most recent work as > the first suspect. Next I think I'll try autovacuum=off and see if the > crash goes away. Other ideas are welcome. Something like the (untested) debug message below might be helpful: diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 7be767b..20f61a1 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1700,7 +1700,13 @@ AtEOXact_Buffers(bool isCommit) for (i = 0; i < NBuffers; i++) { - Assert(PrivateRefCount[i] == 0); + if (PrivateRefCount[i] != 0) + { + BufferDesc *bufHdr = &BufferDescriptors[i]; + elog(PANIC, "refcount of %s is %u should be 0, globally: %u", + relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum), + PrivateRefCount[i], bufHdr->refcount); + } } }#endif Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: