Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
Msg-id: CA+TgmoanXMjxnr6whEXMpc9Nts2L_sztDV5ck4pviGQ++TNP5A@mail.gmail.com
In response to: Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
List: pgsql-hackers
On Fri, Mar 10, 2017 at 3:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Finally, I had an unexplained hang during the TAP tests while testing
> out your fix patch.  I haven't been able to reproduce that so it
> might've just been an artifact of something stupid I did, or of some
> unrelated bug, but I think it's best to back up and reconsider a bit
> here.

I was able to reproduce this with the following patch:

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index bff42dc..0546425 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -268,9 +268,11 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
  * has a race condition (see TransactionGroupUpdateXidStatus) but the
  * worst thing that happens if we mess up is a small loss of efficiency;
  * the intent is to avoid having the leader access pages it wouldn't
- * otherwise need to touch.  Finally, we skip it for prepared transactions,
- * which don't have the semaphore we would need for this optimization,
- * and which are anyway probably not all that common.
+ * otherwise need to touch.  We also skip it if the transaction status is
+ * other than commit, because for rollback and rollback to savepoint, the
+ * list of subxids won't be same as subxids array in PGPROC.  Finally, we skip
+ * it for prepared transactions, which don't have the semaphore we would need
+ * for this optimization, and which are anyway probably not all that common.
  */
 static void
 TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
@@ -280,15 +282,20 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 {
 	if (all_xact_same_page && nsubxids < PGPROC_MAX_CACHED_SUBXIDS &&
+		status == TRANSACTION_STATUS_COMMITTED &&
 		!IsGXactActive())
 	{
+		Assert(nsubxids == MyPgXact->nxids);
+		Assert(memcmp(subxids, MyProc->subxids.xids,
+					  nsubxids * sizeof(TransactionId)) == 0);
+
 		/*
 		 * If we can immediately acquire CLogControlLock, we update the status
 		 * of our own XID and release the lock.  If not, try use group XID
 		 * update.  If that doesn't work out, fall back to waiting for the
 		 * lock to perform an update for this transaction only.
 		 */
-		if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
+		if (false && LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
 		{
 			TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
 											   lsn, pageno);
 			LWLockRelease(CLogControlLock);
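The reason the Asserts are the part to watch is that the group-update path has the leader stamp clog using the subxids cached in each member's PGPROC, not the arguments that member originally passed down.  Roughly - this is a from-memory sketch of the leader's loop, not a quote of TransactionGroupUpdateXidStatus, so treat the exact field names as approximate - it does something like:

/*
 * Sketch of the leader's side of the group update (paraphrased, not the
 * real code): for each queued member, the status is written using the
 * subxid array cached in that member's PGPROC, which is why the caller's
 * (nsubxids, subxids) must match MyPgXact->nxids / MyProc->subxids.xids.
 */
while (nextidx != INVALID_PGPROCNO)
{
	PGPROC	   *proc = &ProcGlobal->allProcs[nextidx];
	PGXACT	   *pgxact = &ProcGlobal->allPgXact[nextidx];

	TransactionIdSetPageStatusInternal(proc->clogGroupMemberXid,
									   pgxact->nxids,
									   proc->subxids.xids,
									   proc->clogGroupMemberXidStatus,
									   proc->clogGroupMemberLsn,
									   proc->clogGroupMemberPage);

	/* advance to the next member queued in the group's list */
	nextidx = pg_atomic_read_u32(&proc->clogGroupNext);
}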
make check-world hung here:

t/009_twophase.pl ..........
1..13
ok 1 - Commit prepared transaction after restart
ok 2 - Rollback prepared transaction after restart

[rhaas pgsql]$ ps uxww | grep postgres
rhaas 72255 0.0 0.0 2447996 1684 s000 S+ 3:40PM 0:00.00 /Users/rhaas/pgsql/tmp_install/Users/rhaas/install/dev/bin/psql -XAtq -d port=64230 host=/var/folders/y8/r2ycj_jj2vd65v71rmyddpr40000gn/T/ZVWy0JGbuw dbname='postgres' -f - -v ON_ERROR_STOP=1
rhaas 72253 0.0 0.0 2478532 1548 ?? Ss 3:40PM 0:00.00 postgres: bgworker: logical replication launcher
rhaas 72252 0.0 0.0 2483132 740 ?? Ss 3:40PM 0:00.05 postgres: stats collector process
rhaas 72251 0.0 0.0 2486724 1952 ?? Ss 3:40PM 0:00.02 postgres: autovacuum launcher process
rhaas 72250 0.0 0.0 2477508 880 ?? Ss 3:40PM 0:00.03 postgres: wal writer process
rhaas 72249 0.0 0.0 2477508 972 ?? Ss 3:40PM 0:00.03 postgres: writer process
rhaas 72248 0.0 0.0 2477508 1252 ?? Ss 3:40PM 0:00.00 postgres: checkpointer process
rhaas 72246 0.0 0.0 2481604 5076 s000 S+ 3:40PM 0:00.03 /Users/rhaas/pgsql/tmp_install/Users/rhaas/install/dev/bin/postgres -D /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_master_Ylq1/pgdata
rhaas 72337 0.0 0.0 2433796 688 s002 S+ 4:14PM 0:00.00 grep postgres
rhaas 72256 0.0 0.0 2478920 2984 ?? Ss 3:40PM 0:00.00 postgres: rhaas postgres [local] COMMIT PREPARED waiting for 0/301D0D0

Backtrace of PID 72256:

#0  0x00007fff8ecc85c2 in poll ()
#1  0x00000001078eb727 in WaitEventSetWaitBlock [inlined] () at /Users/rhaas/pgsql/src/backend/storage/ipc/latch.c:1118
#2  0x00000001078eb727 in WaitEventSetWait (set=0x7fab3c8366c8, timeout=-1, occurred_events=0x7fff585e5410, nevents=1, wait_event_info=<value temporarily unavailable, due to optimizations>) at latch.c:949
#3  0x00000001078eb409 in WaitLatchOrSocket (latch=<value temporarily unavailable, due to optimizations>, wakeEvents=<value temporarily unavailable, due to optimizations>, sock=-1, timeout=<value temporarily unavailable, due to optimizations>, wait_event_info=134217741) at latch.c:349
#4  0x00000001078cf077 in SyncRepWaitForLSN (lsn=<value temporarily unavailable, due to optimizations>, commit=<value temporarily unavailable, due to optimizations>) at syncrep.c:284
#5  0x00000001076a2dab in FinishPreparedTransaction (gid=<value temporarily unavailable, due to optimizations>, isCommit=1 '\001') at twophase.c:2110
#6  0x0000000107919420 in standard_ProcessUtility (pstmt=<value temporarily unavailable, due to optimizations>, queryString=<value temporarily unavailable, due to optimizations>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x7fab3c853cf8, completionTag=<value temporarily unavailable, due to optimizations>) at utility.c:452
#7  0x00000001079186f3 in PortalRunUtility (portal=0x7fab3c874a40, pstmt=0x7fab3c853c00, isTopLevel=1 '\001', setHoldSnapshot=<value temporarily unavailable, due to optimizations>, dest=0x7fab3c853cf8, completionTag=0x7fab3c8366f8 "\n") at pquery.c:1165
#8  0x0000000107917cd6 in PortalRunMulti (portal=<value temporarily unavailable, due to optimizations>, isTopLevel=1 '\001', setHoldSnapshot=0 '\0', dest=0x7fab3c853cf8, altdest=0x7fab3c853cf8, completionTag=<value temporarily unavailable, due to optimizations>) at pquery.c:1315
#9  0x0000000107917634 in PortalRun (portal=0x7fab3c874a40, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7fab3c853cf8, altdest=0x7fab3c853cf8, completionTag=0x7fff585e5a30 "") at pquery.c:788
#10 0x000000010791586b in PostgresMain (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>, dbname=<value temporarily unavailable, due to optimizations>, username=<value temporarily unavailable, due to optimizations>) at postgres.c:1101
#11 0x0000000107897a68 in PostmasterMain (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at postmaster.c:4317
#12 0x00000001078124cd in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at main.c:228

debug_query_string is COMMIT PREPARED 'xact_009_1'

end of regress_log_009_twophase looks like this:

ok 2 - Rollback prepared transaction after restart
### Stopping node "master" using mode immediate
# Running: pg_ctl -D /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_master_Ylq1/pgdata -m immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID
### Starting node "master"
# Running: pg_ctl -D /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_master_Ylq1/pgdata -l /Users/rhaas/pgsql/src/test/recovery/tmp_check/log/009_twophase_master.log start
waiting for server to start.... done
server started
# Postmaster PID for node "master" is 72246

The smoking gun was in 009_twophase_slave.log:

TRAP: FailedAssertion("!(nsubxids == MyPgXact->nxids)", File: "clog.c", Line: 288)

...and then the node shuts down, which is why this hangs forever.  (Also... what's up with it hanging forever instead of timing out or failing or something?)

So evidently on a standby it is in fact possible for the procarray contents not to match what got passed down to clog.  Now you might say "well, we shouldn't be using group update on a standby anyway", but it's possible for a hot standby backend to hold a shared lock on CLogControlLock, and then the startup process would be pushed into the group-update path and - boom.
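Spelling that sequence out (a sketch of the normal fast-path/fallback logic, reusing the names from the patch above; the comments about the recovery side, and the exact call shape of TransactionGroupUpdateXidStatus, are my assumptions rather than quotes of the code):

/*
 * The startup process, replaying a commit record, calls
 * TransactionIdSetPageStatus() while a hot standby backend happens to hold
 * CLogControlLock in LW_SHARED mode (e.g. while reading clog).
 */
if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
{
	/* Uncontended case: update our own XID and get out. */
	TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
									   lsn, pageno);
	LWLockRelease(CLogControlLock);
}
else
{
	/*
	 * The shared holder makes the conditional acquire fail, so the startup
	 * process falls through to the group-update path.  That path assumes
	 * (nsubxids, subxids) match MyPgXact->nxids and MyProc->subxids.xids,
	 * but during recovery the subxids come from the WAL record, not from
	 * the startup process's own PGPROC - hence the assertion failure above.
	 */
	TransactionGroupUpdateXidStatus(xid, status, lsn, pageno);
}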
Anyway, this is surely fixable, but I think it's another piece of evidence that the assumption that the transaction status stack will match the procarray is fairly fragile.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company