Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array - Mailing list pgsql-hackers

From Xuneng Zhou
Subject Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array
Date
Msg-id CABPTF7USnHTLLzg+KsNTb0dCONM=crM5tdH5HUZr4kNOq7tqAw@mail.gmail.com
In response to Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
Hi,

On Fri, Oct 31, 2025 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Oct 30, 2025 at 9:21 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> >
> > On Wed, Oct 29, 2025 at 8:17 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > On Sat, Oct 25, 2025 at 2:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Thu, Oct 23, 2025 at 1:17 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > > >
> > > > > Hi Sawada-san, Michael,
> > > > >
> > > > > Thanks for your comments on this patch.
> > > > >
> > > > > On Thu, Oct 23, 2025 at 8:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Oct 20, 2025 at 11:13 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Thanks for looking into this.
> > > > > > >
> > > > > > > On Tue, Oct 21, 2025 at 1:05 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, 21 Oct 2025 at 04:31, Michael Paquier <michael@paquier.xyz> wrote:
> > > > > > > > >
> > > > > > > > > On Sat, Oct 18, 2025 at 01:59:40PM +0500, Kirill Reshke wrote:
> > > > > > > > > > Indeed, these changes look correct.
> > > > > > > > > > I wonder why b89e151054a0 implemented this place this way; I hope we do
> > > > > > > > > > not miss anything here.
> > > > > > > > >
> > > > > > > > > Perhaps a lack of time back in 2014?  It feels like an item where we
> > > > > > > > > would need to research a bit some of the past threads, and see if this
> > > > > > > > > has been discussed, or if there were other potential alternatives
> > > > > > > > > discussed.  This is not saying that what you are doing in this
> > > > > > > > > proposal is actually bad, but it's a bit hard to say what an
> > > > > > > > > "algorithm" should look like in this specific code path with XID
> > > > > > > > > manipulations.  Perhaps since 2014, we may have other places in the
> > > > > > > > > tree that share similar characteristics as what's done here.
> > > > > > > > >
> > > > > > > > > So it feels like this needs a bit more historical investigation first,
> > > > > > > > > rather than saying that your proposal is the best choice on the table.
> > > > >
> > > > > Following these suggestions, I carefully searched the mailing list
> > > > > archives and found no reports of performance issues directly related
> > > > > to this code path. I also examined other parts of the codebase for
> > > > > similar patterns. Components like integerset might share some
> > > > > characteristics with SnapBuildPurgeOlderTxn, but they have constraints
> > > > > that make them not directly applicable here. I am not very familiar
> > > > > with the whole tree, so the investigation might not be exhaustive.
> > > >
> > > > Thank you for looking through the archives. In logical replication,
> > > > performance problems typically show up as replication delays. Since
> > > > logical replication involves many different components and processes,
> > > > it's quite rare for investigations to trace problems back to this
> > > > specific piece of code. However, I still believe it's important to
> > > > optimize the performance of logical decoding itself.
> > > >
> > > > >
> > > > > > >
> > > > > > > A comparable optimization exists in KnownAssignedXidsCompress(), which
> > > > > > > uses the same algorithm to remove stale XIDs without workspace
> > > > > > > allocation. That implementation also adds a lazy compaction heuristic
> > > > > > > that delays compaction until a threshold of removed entries is
> > > > > > > reached, amortizing the O(N) cost across multiple operations.
> > > > > > >
> > > > > > > The comment above the data structure mentions the trade-off of keeping
> > > > > > > the committed.xip array sorted versus unsorted. If the array were
> > > > > > > sorted, we could use a binary search combined with memmove to compact
> > > > > > > it efficiently, achieving O(log n + n) complexity for purging.
> > > > > > > However, that design would increase the complexity of
> > > > > > > SnapBuildAddCommittedTxn from O(1) to O(n) and "more complicated wrt
> > > > > > > wraparound".
> > > > > > >
> > > > > > > /*
> > > > > > > * Array of committed transactions that have modified the catalog.
> > > > > > > *
> > > > > > > * As this array is frequently modified we do *not* keep it in
> > > > > > > * xidComparator order. Instead we sort the array when building &
> > > > > > > * distributing a snapshot.
> > > > > > > *
> > > > > > > * TODO: It's unclear whether that reasoning has much merit. Every
> > > > > > > * time we add something here after becoming consistent will also
> > > > > > > * require distributing a snapshot. Storing them sorted would
> > > > > > > * potentially also make it easier to purge (but more complicated wrt
> > > > > > > * wraparound?). Should be improved if sorting while building the
> > > > > > > * snapshot shows up in profiles.
> > > > > > > */
> > > > > > > TransactionId *xip;
> > > > > > > } committed;
> > > > > > >
> > > > > > > /*
> > > > > > > * Keep track of a new catalog changing transaction that has committed.
> > > > > > > */
> > > > > > > static void
> > > > > > > SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
> > > > > > > {
> > > > > > >     Assert(TransactionIdIsValid(xid));
> > > > > > >
> > > > > > >     if (builder->committed.xcnt == builder->committed.xcnt_space)
> > > > > > >     {
> > > > > > >         builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
> > > > > > >
> > > > > > >         elog(DEBUG1, "increasing space for committed transactions to %u",
> > > > > > >              (uint32) builder->committed.xcnt_space);
> > > > > > >
> > > > > > >         builder->committed.xip = repalloc(builder->committed.xip,
> > > > > > >                                           builder->committed.xcnt_space * sizeof(TransactionId));
> > > > > > >     }
> > > > > > >
> > > > > > >     /*
> > > > > > >      * TODO: It might make sense to keep the array sorted here instead of
> > > > > > >      * doing it every time we build a new snapshot. On the other hand this
> > > > > > >      * gets called repeatedly when a transaction with subtransactions commits.
> > > > > > >      */
> > > > > > >     builder->committed.xip[builder->committed.xcnt++] = xid;
> > > > > > > }
> > > > > > >
> > > > > > > It might be worth profiling this function to evaluate whether
> > > > > > > maintaining a sorted array could bring potential benefits, although
> > > > > > > accurately measuring its end-to-end impact could be difficult if it
> > > > > > > isn’t a known hotspot. I also did a brief search on the mailing list
> > > > > > > and found no reports of performance concerns or related proposals to
> > > > > > > optimize this part of the code.
> > > > > >
> > > > > > It might also be worth researching what kind of workloads would need a
> > > > > > better algorithm in terms of storing/updating the xip and subxip
> > > > > > arrays, since that would be the primary motivation; otherwise we cannot
> > > > > > measure the real-world impact of a new algorithm. Having said that, I
> > > > > > think it should be discussed and developed separately from the
> > > > > > proposed patch on this thread.
> > > > >
> > > > > +1 for researching the workloads that might need a sorted array and
> > > > > more efficient algorithm. This exploration isn’t limited to the scope
> > > > > of SnapBuildPurgeOlderTxn but relates more broadly to the overall
> > > > > snapbuild process, which might be worth discussing in a separate
> > > > > thread as suggested.
> > > > >
> > > > > * TODO: It's unclear whether that reasoning has much merit. Every
> > > > > * time we add something here after becoming consistent will also
> > > > > * require distributing a snapshot. Storing them sorted would
> > > > > * potentially also make it easier to purge (but more complicated wrt
> > > > > * wraparound?). Should be improved if sorting while building the
> > > > > * snapshot shows up in profiles.
> > > > >
> > > > > I also constructed an artificial workload to try to surface the qsort
> > > > > call in SnapBuildBuildSnapshot, though such a scenario seems very
> > > > > unlikely to occur in production.
> > > > >
> > > > >   for ((c=1; c<=DDL_CLIENTS; c++)); do
> > > > >     (
> > > > >       local seq=1
> > > > >       while (( $(date +%s) < tB_end )); do
> > > > >         local tbl="hp_ddl_${c}_$seq"
> > > > >         "$psql" -h 127.0.0.1 -p "$PORT" -d postgres -c "
> > > > >           BEGIN;
> > > > >           CREATE TABLE ${tbl} (id int, data text);
> > > > >           CREATE INDEX idx_${tbl} ON ${tbl} (id);
> > > > >           INSERT INTO ${tbl} VALUES ($seq, 'd');
> > > > >           DROP TABLE ${tbl};
> > > > >           COMMIT;" >/dev/null 2>&1 || true
> > > > >         seq=$((seq+1))
> > > > >       done
> > > > >     )
> > > >
> > > > Interesting. To be honest, I think this scenario might actually occur
> > > > in practice, especially in cases where users frequently use CREATE
> > > > TEMP TABLE ... ON COMMIT DROP.
> > >
> > > The attached file shows the flamegraph for this workload.
> > >
> > > for ((c=1; c<=CLIENTS_DDL; c++)); do
> > >   (
> > >     local seq=1
> > >     while (( $(date +%s) < tB_end )); do
> > >       local tbl="hp_temp_${c}_$seq"
> > >       $psql_cmd -d postgres -c "
> > >         BEGIN;
> > >         CREATE TEMP TABLE ${tbl} (id int, data text) ON COMMIT DROP;
> > >         CREATE INDEX idx_${tbl} ON ${tbl} (id);
> > >         INSERT INTO ${tbl} VALUES ($seq, 'd');
> > >         COMMIT;" >/dev/null 2>&1 || true
> > >       seq=$((seq+1))
> > >     done
> > >   ) &
> > >   pidsB+=($!)
> > > done
> > >
> > > I’ve been thinking about possible ways to maintain a sorted array to
> > > optimize this case and purge function. Below are some preliminary
> > > ideas. If any of them seem reasonable, I’d like to start a new thread
> > > and prepare a patch based on this direction.
> > >
> > > 1) Batch insertion with globally sorted array (raw uint32 order)
> > >
> > > Goal: Maintain committed.xip sorted and deduplicated to eliminate
> > > repeated qsorts.
> > >
> > > Mechanism:
> > > In SnapBuildCommitTxn, collect all XIDs/subxids to be tracked into a
> > > local vector (length m)
> > > Sort and deduplicate the batch: O(m log m)
> > > Reverse-merge the batch into the global array: O(N + m), deduplicating
> > > on the fly
> > > Invariant: committed.xip[0..xcnt) is always raw-sorted (xidComparator
> > > order) and deduplicated
> > >
> > > Complexity improvement:
> > > Before: O((N+m) log (N+m)) per build
> > > After: O(m log m + N) per commit; snapshot build can skip qsort entirely
> > >
> > > Purge:
> > > Remains O(N) linear scan. Raw sorting doesn't enable binary search
> > > because the purge predicate uses wraparound-aware semantics
> > > (NormalTransactionIdPrecedes), and committed.xip can span epochs, so
> > > numeric order ≠ logical XID order.
> > >
> > > 2) Adopt FullTransactionId for sublinear purge (theoretically possible)
> > >
> > > Rationale:
> > > To make purge O(log N), the array needs to be sorted under the same
> > > ordering relation as the purge predicate. Wraparound-aware comparison
> > > of 32-bit XIDs is not a total order when the set spans epochs.
> > > FullTransactionId (epoch<<32 | xid) could provide a true total order.
> > >
> > > Mechanism:
> > > Store FullTransactionId *fxip instead of TransactionId *xip
> > > struct
> > > {
> > >     size_t xcnt;
> > >     size_t xcnt_space;
> > >     bool includes_all_transactions;
> > >     FullTransactionId *fxip; /* Changed from TransactionId *xip */
> > > } committed;
> > >
> > > Insertion: Map each 32-bit XID to FullTransactionId using a snapshot
> > > of nextFullXid (same logic as hot standby feedback); sort/dedup batch;
> > > reverse-merge O(N + m)
> > >
> > > Purge: Compute xmin_full using the same snapshot; binary search for
> > > lower bound O(log N) + memmove suffix
> > >
> > > /*
> > > * xidLogicalComparator
> > > * qsort comparison function for XIDs
> > > *
> > > * This is used to compare only XIDs from the same epoch (e.g. for backends
> > > * running at the same time). So there must be only normal XIDs, so there's
> > > * no issue with triangle inequality.
> > > */
> > > int
> > > xidLogicalComparator(const void *arg1, const void *arg2)
> > >
> > > Trade-offs:
> > > Downcasting a logically sorted FullTransactionId array can break raw
> > > uint32 ordering if the set spans an epoch boundary. In the worst case, we
> > > would still need a qsort on snapshot->xip after copying xids from committed.xip.
> > >
> > > Memory/I/O: 2× bytes for committed array (4→8 bytes per entry)
> > >
> > > Purge improves from O(N) to O(log N + move); merge complexity unchanged
> >
> > I’ve implemented an experimental version that maintains the
> > committed.xip array in numeric sorted order and profiled it on the
> > previous workload. The overhead observed earlier has now been
> > eliminated.
> > Before starting a discussion thread and proposing the patch, I plan to
> > run additional pgbench workloads to verify that there are no
> > performance regressions. Once the results look stable, I’ll polish and
> > share the patch for review.
>
> +1. I've not reviewed the patch yet. I think we need to evaluate how
> much the patch improves logical decoding performance and how much
> potential regression it could have (in the base and worst cases for
> each).
>

I conducted several benchmark tests on this patch, and here are some
observations:

1) Workloads & settings
# Workload MIXED - Realistic mix of DDL and DML
create_mixed_workload() {
  local ROOT="$1"
  cat >"$ROOT/mixed_ddl.sql" <<'SQL'
-- DDL workload (catalog changes)
DO $$
DECLARE
  tbl text := format('t_%s_%s',
                     current_setting('application_name', true),
                     floor(random()*1e9)::int);
BEGIN
  EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT DROP', tbl);
  EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
END$$;

SQL
  cat >"$ROOT/mixed_dml.sql" <<'SQL'
-- DML workload (no catalog changes)
INSERT INTO app_data (id, data)
VALUES (floor(random()*1e6)::int, repeat('x', 100))
ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
SQL
}

# Workload CONTROL - Pure DML, no catalog changes
create_control_workload() {
  local ROOT="$1"
  cat >"$ROOT/control.sql" <<'SQL'

-- Pure DML, no catalog changes
-- Should show no difference between baseline and patched
INSERT INTO control_data (id, data)
VALUES (floor(random()*1e6)::int, repeat('x', 100))
ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
SQL
}

# Start workload 100 clients, duration 40s, 1 run
local pids=()
for ((c=1; c<=CLIENTS; c++)); do
  (
    local end=$(($(date +%s) + DURATION))
    while (( $(date +%s) < end )); do
      "$psql" -h 127.0.0.1 -p "$PORT" -d postgres \
        -v ON_ERROR_STOP=0 \
        -f "$SQL_FILE" >/dev/null 2>&1 || true
    done
  ) &
  pids+=($!)
done

"SELECT pg_create_logical_replication_slot('$SLOT', 'test_decoding',
false, true);" \

"SELECT COALESCE(total_txns, 0), COALESCE(total_bytes, 0) FROM
pg_stat_replication_slots WHERE slot_name='$SLOT';")

shared_buffers = '4GB'
wal_level = logical
max_replication_slots = 10
max_wal_senders = 10
log_min_messages = warning
max_connections = 600
autovacuum = off
checkpoint_timeout = 15min
max_wal_size = 4GB

2) Performance results

=== Workload: mixed ===
Client commits/sec:
  Baseline:  7845.82 commits/sec
  Patched:   7747.88 commits/sec

Decoder throughput (from pg_stat_replication_slots):
  Baseline:  750.10 txns/sec  (646.80 MB/s)
  Patched:   2440.03 txns/sec  (2052.32 MB/s)

Transaction efficiency (decoded vs committed):
  Baseline:  313833 committed  →  30004 decoded  (9.56%)
  Patched:   309915 committed  →  97601 decoded  (31.49%)

Total decoded (all reps):
  Baseline:  30004 txns  (25872.01 MB)
  Patched:   97601 txns  (82092.83 MB)

Decoder improvement: +225.00% (txns/sec)
Decoder improvement: +217.00% (MB/s)
Efficiency improvement: +21.93% points (more transactions decoded per committed)

=== Workload: control ===
Client commits/sec:
  Baseline:  6756.80 commits/sec
  Patched:   6643.95 commits/sec

Decoder throughput (from pg_stat_replication_slots):
  Baseline:  3373.28 txns/sec  (0.29 MB/s)
  Patched:   3316.15 txns/sec  (0.28 MB/s)

Transaction efficiency (decoded vs committed):
  Baseline:  270272 committed  →  134931 decoded  (49.92%)
  Patched:   265758 committed  →  132646 decoded  (49.91%)

Total decoded (all reps):
  Baseline:  134931 txns  (11.56 MB)
  Patched:   132646 txns  (11.37 MB)

3) Potential regression

The potential regression point could be before the slot reaches the
CONSISTENT state, particularly when building_full_snapshot is set to
true. In this phase, all transactions, including those that don’t
modify the catalog, must be added to the committed.xip array. These
XIDs don’t require later snapshot builds or sorting, so the
batch-insert logic increases the per-insert cost from O(1) to O(m + n)
without providing a direct benefit.
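
To make that cost concrete, below is a minimal standalone sketch (not the
actual patch; plain uint32_t stands in for TransactionId, and wraparound
handling and deduplication are omitted) of the reverse-merge step the batch
insertion relies on; the O(m + n) per-commit term comes from this single
backward pass over the existing array:

#include <stdint.h>
#include <stddef.h>

/*
 * Merge a sorted batch of bcnt new XIDs into arr, which already holds xcnt
 * sorted entries and has capacity for xcnt + bcnt.  Walking backwards lets
 * us merge in place without a temporary array.  Returns the new count.
 */
size_t
merge_sorted_batch(uint32_t *arr, size_t xcnt,
                   const uint32_t *batch, size_t bcnt)
{
    size_t      ia = xcnt;              /* unread entries in arr */
    size_t      ib = bcnt;              /* unread entries in batch */
    size_t      out = xcnt + bcnt;      /* one past the next write position */

    while (ib > 0)
    {
        if (ia > 0 && arr[ia - 1] > batch[ib - 1])
            arr[--out] = arr[--ia];
        else
            arr[--out] = batch[--ib];
    }
    /* arr[0 .. ia) is already in place, nothing left to move */
    return xcnt + bcnt;
}

In the real structure the batch would first be sorted and deduplicated
(O(m log m)) before being merged, as described earlier in the thread.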

However, the impact of this regression could be limited. The system
remains in the pre-CONSISTENT phase only briefly during initial
snapshot building, and the building_full_snapshot = true case is rare,
mainly used when creating replication slots with the EXPORT_SNAPSHOT
option.

Once the slot becomes CONSISTENT, only catalog-modifying transactions
are tracked in committed.xip, and the patch reduces overall
snapshot-building overhead by eliminating repeated full-array sorts.

We could also adopt a two-phase approach — keeping the current
behavior before reaching the CONSISTENT state and maintaining a sorted
array only after that point. This would preserve the performance
benefits while avoiding potential regressions. However, it would
introduce additional complexity and potential risks in handling the
state transitions.


if (builder->state < SNAPBUILD_CONSISTENT)
{
    /* ensure that only commits after this are getting replayed */
    if (builder->start_decoding_at <= lsn)
        builder->start_decoding_at = lsn + 1;

    /*
     * If building an exportable snapshot, force xid to be tracked, even
     * if the transaction didn't modify the catalog.
     */
    if (builder->building_full_snapshot)
    {
        needs_timetravel = true;
    }
}
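
For the two-phase approach mentioned above, the transition could amount to
establishing the sorted/deduplicated invariant once when the builder reaches
SNAPBUILD_CONSISTENT. A rough standalone sketch of that one-time step
(hypothetical helper, with plain uint32_t and libc qsort standing in for
TransactionId and the xidComparator machinery):

#include <stdint.h>
#include <stdlib.h>

static int
cmp_uint32(const void *a, const void *b)
{
    uint32_t    x = *(const uint32_t *) a;
    uint32_t    y = *(const uint32_t *) b;

    return (x > y) - (x < y);
}

/*
 * Called once at the transition to the CONSISTENT state: sort the
 * append-only array and squeeze out duplicates, so that later inserts
 * only need to maintain an already-sorted array.  Returns the new count.
 */
size_t
sort_and_dedup(uint32_t *arr, size_t n)
{
    size_t      out = 0;

    if (n == 0)
        return 0;

    qsort(arr, n, sizeof(uint32_t), cmp_uint32);
    for (size_t i = 1; i < n; i++)
    {
        if (arr[i] != arr[out])
            arr[++out] = arr[i];
    }
    return out + 1;
}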

It also occurs to me that, once the array is kept sorted, the purge
operation can be optimized by using two binary searches to locate the
interval of entries to keep.
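
As a rough standalone illustration of that, assuming the array is kept in
the same total order the purge predicate uses (e.g. the FullTransactionId
variant from option 2, modelled here with plain uint64_t), one lower-bound
search plus a memmove replaces the linear scan; in the raw 32-bit ordering a
second search would be needed to bound the other end of the interval to keep
when the set spans an epoch boundary:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* index of the first element >= key in a sorted array */
static size_t
lower_bound_u64(const uint64_t *arr, size_t n, uint64_t key)
{
    size_t      lo = 0;
    size_t      hi = n;

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* drop all entries below cutoff; returns the new element count */
size_t
purge_older_than(uint64_t *arr, size_t n, uint64_t cutoff)
{
    size_t      first_kept = lower_bound_u64(arr, n, cutoff);
    size_t      nkept = n - first_kept;

    memmove(arr, arr + first_kept, nkept * sizeof(uint64_t));
    return nkept;
}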

Best,
Xuneng


