[PATCH] Full support for index LP_DEAD hint bits on standby - Mailing list pgsql-hackers

From:    Michail Nikolaev
Subject: [PATCH] Full support for index LP_DEAD hint bits on standby
Msg-id:  CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7=9Y92+US_WHGOoQevg@mail.gmail.com
List:    pgsql-hackers
Hello, hackers.

[ABSTRACT]

Executing queries on a hot standby is one of the most popular ways to scale an application's workload. Most modern Postgres installations have two standby nodes for high-availability support, so utilizing the replicas' CPU seems like a reasonable idea. At the same time, some queries (index scans) can be much slower on a hot standby than on the primary. This happens because the LP_DEAD index hint bit mechanism is ignored in index scans during recovery. That is done for good reasons, of course [1]:

 * We do this because the xmin on the primary node could easily be
 * later than the xmin on the standby node, so that what the primary
 * thinks is killed is supposed to be visible on standby. So for correct
 * MVCC for queries during recovery we must ignore these hints and check
 * all tuples.

Also, according to [2] and cases like [3], it seems to be a good idea to support "ignore_killed_tuples" on standby.

The goal of this patch is to provide full support for index hint bits on hot standby. The mechanism should be based on well-tested functionality and should not cause a lot of recovery conflicts. This thread is a continuation of (and partly a copy-paste of) the previous one [4].

[PROBLEM]

The standby itself can set and read hint bits during recovery. Such bits are even correct according to the standby's visibility rules. The problem is the full-page-write (FPW) WAL records coming from the primary: such records can carry hint bits that are invalid according to the standby's xmin. So, if we can be sure a scan never sees an invalid hint bit from the primary, the problem is solved, and we can even allow the standby to set its own LP_DEAD bits.

The idea is simple: have the primary WAL-log the hint-bit horizon before a possible FPW. That could generate a lot of additional WAL, but there are ways to avoid it:

1) Send only one `latestRemovedXid` for all tuples marked dead during a page scan.
2) Remember the latest sent `latestRemovedXid` in shared memory.
And optimistically skip WAL records with older xid values [5].

Such WAL records would still cause a lot of recovery conflicts on standbys. But we can be clever here: use the hint bits only if hot_standby_feedback is enabled and effective on the standby. If HSF is effective, conflicts are not possible. If HSF is off, the standby ignores both the hint bits and the additional conflict resolution. The key point is that HSF remains purely an optimization and has nothing to do with MVCC correctness.

[DETAILS]

The patch introduces a new WAL record (named XLOG_INDEX_HINT_BITS_HORIZON) that defines the xmin horizon a standby snapshot must satisfy in order to use LP_DEAD bits in an index scan.

`table_index_fetch_tuple` now returns a `latest_removed_xid` value in addition to `all_dead`. This value is used to advance `killedLatestRemovedXid` at the time `killedItems` is updated (see `IndexHintBitAdvanceLatestRemovedXid`).

The primary sends the value of `killedLatestRemovedXid` in an XLOG_INDEX_HINT_BITS_HORIZON record before it marks the page dirty after setting LP_DEAD bits on the index page (by calling `MarkBufferDirtyIndexHint`). The new WAL record is always sent before a possible FPW. Such a record only needs to be sent if its `latestRemovedXid` is newer than the one previously sent for the current database (see `LogIndexHintBitsHorizonIfNeeded`).

There is a new flag in the PGPROC structure: `indexIgnoreKilledTuples`. If the flag is set to true, standby queries will use LP_DEAD bits in index scans. In that case the snapshot is required to satisfy the horizon pushed by XLOG_INDEX_HINT_BITS_HORIZON records.

From the perspective of correctness it is safe to set `indexIgnoreKilledTuples` to either value, but `true` can cause recovery conflicts. It is a compromise: use LP_DEAD bits and be subject to XLOG_INDEX_HINT_BITS_HORIZON conflicts, or ignore both. What is the right way to decide on this compromise?
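The "send only newer xid" logic behind `LogIndexHintBitsHorizonIfNeeded` could be sketched roughly like this. This is a simplified, self-contained illustration, not the patch's actual code: the names are invented, the shared-memory cell is modeled as a static variable, and the comparison ignores xid wraparound (PostgreSQL would use `TransactionIdFollows`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Stand-in for the shared-memory cell that remembers the newest
 * horizon xid already WAL-logged for the current database. */
static TransactionId latest_logged_horizon = 0;

/* Stand-in for emitting an XLOG_INDEX_HINT_BITS_HORIZON record. */
static int records_emitted = 0;

static void
log_horizon_record(TransactionId xid)
{
    records_emitted++;
    latest_logged_horizon = xid;
}

/* Sketch of the optimistic-skip optimization: emit the horizon WAL
 * record only when the new latestRemovedXid is strictly newer than
 * the one already logged; otherwise skip it entirely.
 * (Plain '<=' here is a simplification; real xid comparison must
 * handle wraparound.) */
bool
log_horizon_if_needed(TransactionId latest_removed_xid)
{
    if (latest_removed_xid <= latest_logged_horizon)
        return false;           /* already covered: skip the record */
    log_horizon_record(latest_removed_xid);
    return true;
}
```

With this scheme, a burst of LP_DEAD settings with the same (or older) horizon produces a single WAL record instead of one per page.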
It is pretty simple: if `hot_standby_feedback` is on and the primary has confirmed the feedback, then set `indexIgnoreKilledTuples` (see `GetSnapshotIndexIgnoreKilledTuples`). While feedback works as expected, the query will never be canceled by XLOG_INDEX_HINT_BITS_HORIZON.

To support cascading standby setups (where the feedback chain may be broken in the middle), an additional byte was added to the keep-alive message of the feedback protocol. This byte is used to make sure our xmin is honored by the primary (see `sender_propagates_feedback_to_primary`). Also, the WAL sender now always sends a keep-alive after receiving a feedback message.

This way, it is safe to use LP_DEAD bits received from the primary when we want to. And, as a result, it is also safe to set LP_DEAD bits on the standby itself. Even if:

* the primary changes vacuum_defer_cleanup_age
* the standby is restarted
* the standby is promoted to primary
* a base backup is taken from the standby
* the standby is serving queries during recovery

nothing can go wrong, because `HeapTupleIsSurelyDead` (and hence an index LP_DEAD bit) requires the *heap* hint bits to be already set on the standby. So the same code that decides to set hint bits on the heap (which standbys have been doing for a long time already) also decides for the index.

[EVALUATION]

It is hard to find an ideal performance test for this kind of optimization, but a possible example is in the attachment. It is the standard pgbench schema with an additional index on balance and randomized balance values. On the primary, the test does the following:

1) transfer some money from one of the top 100 richest accounts (picked at random) to one of the top 100 poorest accounts (picked at random);
2) calculate the amount of money in the top 10 richest and top 10 poorest accounts (selecting an additional column to avoid an index-only scan).

On the standby, only step 2 is used. The patched version is about 9x faster for standby queries: 455 TPS unpatched versus 4192 TPS patched on my system. There is no visible difference on the primary.
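The decision rule described above (use LP_DEAD bits only on the primary, or on a standby whose feedback is confirmed honored) can be sketched as follows. This is an illustrative model, not the patch's code: in the patch these inputs come from GUCs, recovery state, and the walreceiver, and the function is `GetSnapshotIndexIgnoreKilledTuples`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bundle of the inputs the decision depends on. */
typedef struct StandbyState
{
    bool in_recovery;           /* are we a hot standby? */
    bool hot_standby_feedback;  /* is HSF enabled locally? */
    bool feedback_confirmed;    /* has the primary confirmed our xmin
                                 * (the role of the extra keep-alive
                                 * byte / sender_propagates_feedback_to_primary)? */
} StandbyState;

/* On the primary, LP_DEAD bits are always safe to use.  On a standby,
 * use them only while the whole feedback chain guarantees our xmin is
 * honored by the primary; otherwise ignore both the hint bits and the
 * horizon-based conflict resolution. */
bool
snapshot_index_ignore_killed_tuples(const StandbyState *s)
{
    if (!s->in_recovery)
        return true;
    return s->hot_standby_feedback && s->feedback_confirmed;
}
```

Note that the flag is computed per snapshot, so a standby whose feedback stops being confirmed simply falls back to ignoring LP_DEAD bits rather than risking query cancelation.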
To estimate the amount of additional WAL, I counted the records in WAL segments under different conditions (pg_waldump pgdata/pg_wal/XXX | grep INDEX_HINT_BITS_HORIZON | wc -l):

* hot_standby_feedback=off - 5181 of 226274 records, ~2%
* hot_standby_feedback=on (without load on standby) - 70 of 202594 records, ~0.03%
* hot_standby_feedback=on (with load on standby) - 17 of 70504 records, ~0.02%

So, with HSF=on (the default) the WAL increase is not significant. For HSF=off, it should also be possible to radically reduce the number of additional WAL records by reusing `latestRemovedXid` from other records (like Heap2/CLEAN) in the "send only newer xid" optimization (I have skipped this for now for simplicity).

[CONCLUSION]

The only cost is a few additional WAL records and some moderate additional code complexity. In exchange, support for index hint bits on standby is a huge advantage for many workloads: I was able to get more than a 900% performance boost (not surprising, since index hint bits are simply a great optimization), and it works for almost all index types out of the box. Another major point: everything is based on old, well-tested mechanics - query cancelation because of snapshot conflicts, setting heap hint bits on standby, and hot standby feedback.

[REFERENCES]

[1] https://www.postgresql.org/message-id/flat/7067.1529246768%40sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38
[2] https://www.postgresql.org/message-id/flat/12843.1529331619%40sss.pgh.pa.us#6df9694fdfd5d550fbb38e711d162be8
[3] https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org
[4] https://www.postgresql.org/message-id/flat/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ%40mail.gmail.com
[5] https://www.postgresql.org/message-id/flat/CANtu0oigC0%2BH0UkxktyovdLLU67ikM0%2BDw3J4EQqiDDeGhcwsQ%40mail.gmail.com