Thread: [HACKERS] standby server crashes hard on out-of-disk-space in HEAD
logfile from a standby server: 2017-06-12 11:43:46.450 EDT [13605] LOG: started streaming WAL from primary at 3/E6000000 on timeline 1 2017-06-12 11:47:46.992 EDT [11261] FATAL: could not extend file "base/47578/54806": No space left on device 2017-06-12 11:47:46.992 EDT [11261] HINT: Check free disk space. 2017-06-12 11:47:46.992 EDT [11261] CONTEXT: WAL redo at 8/EC7E0CF8 for XLOG/FPI: 2017-06-12 11:47:46.992 EDT [11261] WARNING: buffer refcount leak: [1243] (rel=base/47578/54806, blockNum=5249, flags=0x8a000000,refcount=1 1) TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523) 2017-06-12 11:47:47.567 EDT [11259] LOG: startup process (PID 11261) was terminated by signal 6: Aborted 2017-06-12 11:47:47.567 EDT [11259] LOG: terminating any other active server processes 2017-06-12 11:47:47.584 EDT [11259] LOG: database system is shut down The FATAL is fine, but we shouldn't have that WARNING I think, and certainly not the assertion failure. regards, tom lane
(On Mon, Jun 12, 2017 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > logfile from a standby server: > > 2017-06-12 11:43:46.450 EDT [13605] LOG: started streaming WAL from primary at 3/E6000000 on timeline 1 > 2017-06-12 11:47:46.992 EDT [11261] FATAL: could not extend file "base/47578/54806": No space left on device > 2017-06-12 11:47:46.992 EDT [11261] HINT: Check free disk space. > 2017-06-12 11:47:46.992 EDT [11261] CONTEXT: WAL redo at 8/EC7E0CF8 for XLOG/FPI: > 2017-06-12 11:47:46.992 EDT [11261] WARNING: buffer refcount leak: [1243] (rel=base/47578/54806, blockNum=5249, flags=0x8a000000,refcount=1 1) > TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523) > 2017-06-12 11:47:47.567 EDT [11259] LOG: startup process (PID 11261) was terminated by signal 6: Aborted > 2017-06-12 11:47:47.567 EDT [11259] LOG: terminating any other active server processes > 2017-06-12 11:47:47.584 EDT [11259] LOG: database system is shut down > > The FATAL is fine, but we shouldn't have that WARNING I think, and > certainly not the assertion failure. Commit 4b4b680c3d6d8485155d4d4bf0a92d3a874b7a65 (Make backend local tracking of buffer pins memory efficient., vintage 2014) seems like a likely culprit here, but I haven't tested. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017-06-12 15:12:23 -0400, Robert Haas wrote: > (On Mon, Jun 12, 2017 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > logfile from a standby server: > > > > 2017-06-12 11:43:46.450 EDT [13605] LOG: started streaming WAL from primary at 3/E6000000 on timeline 1 > > 2017-06-12 11:47:46.992 EDT [11261] FATAL: could not extend file "base/47578/54806": No space left on device > > 2017-06-12 11:47:46.992 EDT [11261] HINT: Check free disk space. > > 2017-06-12 11:47:46.992 EDT [11261] CONTEXT: WAL redo at 8/EC7E0CF8 for XLOG/FPI: > > 2017-06-12 11:47:46.992 EDT [11261] WARNING: buffer refcount leak: [1243] (rel=base/47578/54806, blockNum=5249, flags=0x8a000000,refcount=1 1) > > TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523) > > 2017-06-12 11:47:47.567 EDT [11259] LOG: startup process (PID 11261) was terminated by signal 6: Aborted > > 2017-06-12 11:47:47.567 EDT [11259] LOG: terminating any other active server processes > > 2017-06-12 11:47:47.584 EDT [11259] LOG: database system is shut down > > > > The FATAL is fine, but we shouldn't have that WARNING I think, and > > certainly not the assertion failure. Just for clarification: It's a WARNING so we print all missed leaks, rather than erroring/asserting at the first leak. We've for a long while Asserted there's not a single pin failure (in earlier releases we asserted out at the first leak). > Commit 4b4b680c3d6d8485155d4d4bf0a92d3a874b7a65 (Make backend local > tracking of buffer pins memory efficient., vintage 2014) seems like a > likely culprit here, but I haven't tested. I'm not that sure. As written above, the Assert isn't new, and given this hasn't been reported before, I'm a bit doubtful that it's a general refcount tracking bug. The FPI code has been whacked around more heavily, so it could well be a bug in it somewhere. - Andres
On Tue, Jun 13, 2017 at 4:21 AM, Andres Freund <andres@anarazel.de> wrote: > On 2017-06-12 15:12:23 -0400, Robert Haas wrote: >> Commit 4b4b680c3d6d8485155d4d4bf0a92d3a874b7a65 (Make backend local >> tracking of buffer pins memory efficient., vintage 2014) seems like a >> likely culprit here, but I haven't tested. > > I'm not that sure. As written above, the Assert isn't new, and given > this hasn't been reported before, I'm a bit doubtful that it's a general > refcount tracking bug. The FPI code has been whacked around more > heavily, so it could well be a bug in it somewhere. Something doing a bisect could just use a VM that puts the standby on a tiny partition. I remember seeing this assertion failure some time ago on a test deployment, and that was really surprising. I think that this may be hiding something, so we should really try to investigate more what's wrong here. -- Michael
On Mon, Aug 28, 2017 at 9:06 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Tue, Jun 13, 2017 at 4:21 AM, Andres Freund <andres@anarazel.de> wrote: >> On 2017-06-12 15:12:23 -0400, Robert Haas wrote: >>> Commit 4b4b680c3d6d8485155d4d4bf0a92d3a874b7a65 (Make backend local >>> tracking of buffer pins memory efficient., vintage 2014) seems like a >>> likely culprit here, but I haven't tested. >> >> I'm not that sure. As written above, the Assert isn't new, and given >> this hasn't been reported before, I'm a bit doubtful that it's a general >> refcount tracking bug. The FPI code has been whacked around more >> heavily, so it could well be a bug in it somewhere. > > Something doing a bisect could just use a VM that puts the standby on > a tiny partition. I remember seeing this assertion failure some time > ago on a test deployment, and that was really surprising. I think that > this may be hiding something, so we should really try to investigate > more what's wrong here. I have been playing a bit with the builds and the attached script triggering out-of-space errors on a standby (adapt to your environment), and while looking for a good commit, I have found that this thing is a bit older than the 2014 vintage... Down to the merge-base of REL9_4_STABLE and REL9_3_STABLE, the assertion failure is still the same. The failure is older than even 9.2, for example by testing at the merge-base of 9.2 and 9.3: CONTEXT: xlog redo insert(init): rel 1663/16384/16385; tid 181441/1 TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1788) But well this assertion got changed in dcafdbcd. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers