Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
Msg-id: CA+TgmoanXMjxnr6whEXMpc9Nts2L_sztDV5ck4pviGQ++TNP5A@mail.gmail.com
In response to: Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
List: pgsql-hackers
On Fri, Mar 10, 2017 at 3:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Finally, I had an unexplained hang during the TAP tests while testing
> out your fix patch.  I haven't been able to reproduce that so it
> might've just been an artifact of something stupid I did, or of some
> unrelated bug, but I think it's best to back up and reconsider a bit
> here.

I was able to reproduce this with the following patch:

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index bff42dc..0546425 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -268,9 +268,11 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
  * has a race condition (see TransactionGroupUpdateXidStatus) but the
  * worst thing that happens if we mess up is a small loss of efficiency;
  * the intent is to avoid having the leader access pages it wouldn't
- * otherwise need to touch.  Finally, we skip it for prepared transactions,
- * which don't have the semaphore we would need for this optimization,
- * and which are anyway probably not all that common.
+ * otherwise need to touch.  We also skip it if the transaction status is
+ * other than commit, because for rollback and rollback to savepoint, the
+ * list of subxids won't be same as subxids array in PGPROC.  Finally, we skip
+ * it for prepared transactions, which don't have the semaphore we would need
+ * for this optimization, and which are anyway probably not all that common.
  */
 static void
 TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
@@ -280,15 +282,20 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 {
 	if (all_xact_same_page && nsubxids < PGPROC_MAX_CACHED_SUBXIDS &&
+		status == TRANSACTION_STATUS_COMMITTED &&
 		!IsGXactActive())
 	{
+		Assert(nsubxids == MyPgXact->nxids);
+		Assert(memcmp(subxids, MyProc->subxids.xids,
+					  nsubxids * sizeof(TransactionId)) == 0);
+
 		/*
 		 * If we can immediately acquire CLogControlLock, we update the status
 		 * of our own XID and release the lock.  If not, try use group XID
 		 * update.  If that doesn't work out, fall back to waiting for the
 		 * lock to perform an update for this transaction only.
 		 */
-		if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
+		if (false && LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
 		{
 			TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
 											   lsn, pageno);
 			LWLockRelease(CLogControlLock);
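The reason the Asserts are the part to watch is that the group-update path has the leader stamp clog using the subxids cached in each member's PGPROC, not the arguments that member originally passed down.  Roughly - this is a from-memory sketch of the leader's loop, not a quote of TransactionGroupUpdateXidStatus, so treat the exact field names as approximate - it does something like:

/*
 * Sketch of the leader's side of the group update (paraphrased, not the
 * real code): for each queued member, the status is written using the
 * subxid array cached in that member's PGPROC, which is why the caller's
 * (nsubxids, subxids) must match MyPgXact->nxids / MyProc->subxids.xids.
 */
while (nextidx != INVALID_PGPROCNO)
{
	PGPROC	   *proc = &ProcGlobal->allProcs[nextidx];
	PGXACT	   *pgxact = &ProcGlobal->allPgXact[nextidx];

	TransactionIdSetPageStatusInternal(proc->clogGroupMemberXid,
									   pgxact->nxids,
									   proc->subxids.xids,
									   proc->clogGroupMemberXidStatus,
									   proc->clogGroupMemberLsn,
									   proc->clogGroupMemberPage);

	/* advance to the next member queued in the group's list */
	nextidx = pg_atomic_read_u32(&proc->clogGroupNext);
}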
make check-world hung here:

t/009_twophase.pl ..........
1..13
ok 1 - Commit prepared transaction after restart
ok 2 - Rollback prepared transaction after restart

[rhaas pgsql]$ ps uxww | grep postgres
rhaas 72255 0.0 0.0 2447996 1684 s000 S+ 3:40PM 0:00.00 /Users/rhaas/pgsql/tmp_install/Users/rhaas/install/dev/bin/psql -XAtq -d port=64230 host=/var/folders/y8/r2ycj_jj2vd65v71rmyddpr40000gn/T/ZVWy0JGbuw dbname='postgres' -f - -v ON_ERROR_STOP=1
rhaas 72253 0.0 0.0 2478532 1548 ?? Ss 3:40PM 0:00.00 postgres: bgworker: logical replication launcher
rhaas 72252 0.0 0.0 2483132 740 ?? Ss 3:40PM 0:00.05 postgres: stats collector process
rhaas 72251 0.0 0.0 2486724 1952 ?? Ss 3:40PM 0:00.02 postgres: autovacuum launcher process
rhaas 72250 0.0 0.0 2477508 880 ?? Ss 3:40PM 0:00.03 postgres: wal writer process
rhaas 72249 0.0 0.0 2477508 972 ?? Ss 3:40PM 0:00.03 postgres: writer process
rhaas 72248 0.0 0.0 2477508 1252 ?? Ss 3:40PM 0:00.00 postgres: checkpointer process
rhaas 72246 0.0 0.0 2481604 5076 s000 S+ 3:40PM 0:00.03 /Users/rhaas/pgsql/tmp_install/Users/rhaas/install/dev/bin/postgres -D /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_master_Ylq1/pgdata
rhaas 72337 0.0 0.0 2433796 688 s002 S+ 4:14PM 0:00.00 grep postgres
rhaas 72256 0.0 0.0 2478920 2984 ?? Ss 3:40PM 0:00.00 postgres: rhaas postgres [local] COMMIT PREPARED waiting for 0/301D0D0

Backtrace of PID 72256:

#0  0x00007fff8ecc85c2 in poll ()
#1  0x00000001078eb727 in WaitEventSetWaitBlock [inlined] () at /Users/rhaas/pgsql/src/backend/storage/ipc/latch.c:1118
#2  0x00000001078eb727 in WaitEventSetWait (set=0x7fab3c8366c8, timeout=-1, occurred_events=0x7fff585e5410, nevents=1, wait_event_info=<value temporarily unavailable, due to optimizations>) at latch.c:949
#3  0x00000001078eb409 in WaitLatchOrSocket (latch=<value temporarily unavailable, due to optimizations>, wakeEvents=<value temporarily unavailable, due to optimizations>, sock=-1, timeout=<value temporarily unavailable, due to optimizations>, wait_event_info=134217741) at latch.c:349
#4  0x00000001078cf077 in SyncRepWaitForLSN (lsn=<value temporarily unavailable, due to optimizations>, commit=<value temporarily unavailable, due to optimizations>) at syncrep.c:284
#5  0x00000001076a2dab in FinishPreparedTransaction (gid=<value temporarily unavailable, due to optimizations>, isCommit=1 '\001') at twophase.c:2110
#6  0x0000000107919420 in standard_ProcessUtility (pstmt=<value temporarily unavailable, due to optimizations>, queryString=<value temporarily unavailable, due to optimizations>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x7fab3c853cf8, completionTag=<value temporarily unavailable, due to optimizations>) at utility.c:452
#7  0x00000001079186f3 in PortalRunUtility (portal=0x7fab3c874a40, pstmt=0x7fab3c853c00, isTopLevel=1 '\001', setHoldSnapshot=<value temporarily unavailable, due to optimizations>, dest=0x7fab3c853cf8, completionTag=0x7fab3c8366f8 "\n") at pquery.c:1165
#8  0x0000000107917cd6 in PortalRunMulti (portal=<value temporarily unavailable, due to optimizations>, isTopLevel=1 '\001', setHoldSnapshot=0 '\0', dest=0x7fab3c853cf8, altdest=0x7fab3c853cf8, completionTag=<value temporarily unavailable, due to optimizations>) at pquery.c:1315
#9  0x0000000107917634 in PortalRun (portal=0x7fab3c874a40, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7fab3c853cf8, altdest=0x7fab3c853cf8, completionTag=0x7fff585e5a30 "") at pquery.c:788
#10 0x000000010791586b in PostgresMain (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>, dbname=<value temporarily unavailable, due to optimizations>, username=<value temporarily unavailable, due to optimizations>) at postgres.c:1101
#11 0x0000000107897a68 in PostmasterMain (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at postmaster.c:4317
#12 0x00000001078124cd in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at main.c:228

debug_query_string is COMMIT PREPARED 'xact_009_1'

end of regress_log_009_twophase looks like this:

ok 2 - Rollback prepared transaction after restart
### Stopping node "master" using mode immediate
# Running: pg_ctl -D /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_master_Ylq1/pgdata -m immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID
### Starting node "master"
# Running: pg_ctl -D /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_master_Ylq1/pgdata -l /Users/rhaas/pgsql/src/test/recovery/tmp_check/log/009_twophase_master.log start
waiting for server to start.... done
server started
# Postmaster PID for node "master" is 72246

The smoking gun was in 009_twophase_slave.log:

TRAP: FailedAssertion("!(nsubxids == MyPgXact->nxids)", File: "clog.c", Line: 288)

...and then the node shuts down, which is why this hangs forever.  (Also... what's up with it hanging forever instead of timing out or failing or something?)

So evidently on a standby it is in fact possible for the procarray contents not to match what got passed down to clog.  Now you might say "well, we shouldn't be using group update on a standby anyway", but it's possible for a hot standby backend to hold a shared lock on CLogControlLock, and then the startup process would be pushed into the group-update path and - boom.
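Spelling that sequence out (a sketch of the normal fast-path/fallback logic, reusing the names from the patch above; the comments about the recovery side, and the exact call shape of TransactionGroupUpdateXidStatus, are my assumptions rather than quotes of the code):

/*
 * The startup process, replaying a commit record, calls
 * TransactionIdSetPageStatus() while a hot standby backend happens to hold
 * CLogControlLock in LW_SHARED mode (e.g. while reading clog).
 */
if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
{
	/* Uncontended case: update our own XID and get out. */
	TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
									   lsn, pageno);
	LWLockRelease(CLogControlLock);
}
else
{
	/*
	 * The shared holder makes the conditional acquire fail, so the startup
	 * process falls through to the group-update path.  That path assumes
	 * (nsubxids, subxids) match MyPgXact->nxids and MyProc->subxids.xids,
	 * but during recovery the subxids come from the WAL record, not from
	 * the startup process's own PGPROC - hence the assertion failure above.
	 */
	TransactionGroupUpdateXidStatus(xid, status, lsn, pageno);
}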
Anyway, this is surely fixable, but I think it's another piece of evidence that the assumption that the transaction status stack will match the procarray is fairly fragile.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company