
From Xuneng Zhou
Subject Optimize SnapBuild by maintaining committed.xip in sorted order
Msg-id CABPTF7XiwbA38OZBj5Y2F-q+fZ=03YFN9iFnb_406F4SfE-f4w@mail.gmail.com
List pgsql-hackers
Hi hackers,

I'd like to propose an optimization for logical decoding's snapshot building
mechanism that eliminates repeated sorting overhead once a replication slot
reaches the CONSISTENT state.

1) Problem

Currently, SnapBuildBuildSnapshot() sorts the entire committed.xip array on
every call using qsort(). Once a logical decoding slot reaches CONSISTENT
state, snapshots are built frequently—after every catalog-modifying transaction
commit. This repeated O(n log n) sorting becomes a performance bottleneck as
the committed transaction array grows.
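
For reference, the hot path in SnapBuildBuildSnapshot() looks roughly
like this (paraphrased, not an exact quote of snapbuild.c):

    /* copy the tracked committed XIDs into the snapshot ... */
    snapshot->xcnt = builder->committed.xcnt;
    memcpy(snapshot->xip, builder->committed.xip,
           builder->committed.xcnt * sizeof(TransactionId));

    /* ... and sort them from scratch so they can be bsearch()ed later */
    qsort(snapshot->xip, snapshot->xcnt,
          sizeof(TransactionId), xidComparator);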

The existing code even has a TODO comment acknowledging this inefficiency:
"TODO: It's unclear whether that reasoning has much merit. Every time we add
something here after becoming consistent will also require distributing a
snapshot."

2) Proposed Solution

Maintain sorted order via batch merge on each commit:

1. Collect all relevant XIDs (including subtransactions) into a batch array
2. Sort the batch: O(m log m) where m is typically small (1 + nsubxacts)
3. Merge into the global array: O(n + m) via reverse in-place merge
   (sketched below)
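
A minimal, self-contained sketch of steps 2 and 3 is below (illustration
only, not the patch itself: plain uint32 comparison stands in for
xidComparator/TransactionId semantics, and the caller is assumed to have
already enlarged xip[] to hold nxip + nbatch entries):

#include <stdint.h>
#include <stdlib.h>

static int
cmp_xid(const void *a, const void *b)
{
    uint32_t xa = *(const uint32_t *) a;
    uint32_t xb = *(const uint32_t *) b;

    return (xa > xb) - (xa < xb);
}

/* Returns the new element count of xip[]. */
static int
add_committed_batch(uint32_t *xip, int nxip, uint32_t *batch, int nbatch)
{
    int i = nxip - 1;           /* last element of the sorted array */
    int k = nxip + nbatch - 1;  /* last slot of the merged result */
    int j;

    /* step 2: sort the small batch, O(m log m) */
    qsort(batch, nbatch, sizeof(uint32_t), cmp_xid);

    /* step 3: reverse in-place merge, O(n + m), filling from the back */
    for (j = nbatch - 1; j >= 0;)
    {
        if (i >= 0 && xip[i] > batch[j])
            xip[k--] = xip[i--];
        else
            xip[k--] = batch[j--];
    }

    return nxip + nbatch;
}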

While the per-commit cost increases from O(m) to O(m log m + n), this
is offset by eliminating the O(n log n) sort at each snapshot build.
Since a snapshot is built after every catalog-modifying commit once
the slot is CONSISTENT, the amortized cost is significantly lower.
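
As a rough illustration (numbers picked arbitrarily, not measured):
with n = 10,000 tracked catalog-modifying XIDs and m = 2 XIDs per commit,

    before: qsort at every snapshot build ~ n log2 n ≈ 10,000 * 13.3 ≈ 133,000 comparisons
    after:  merge at every commit         ~ m log2 m + n ≈ 2 + 10,000 element moves,
            with no sort at snapshot-build time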

3) Performance Impact

Expected improvements in CONSISTENT state:
- Eliminates O(n log n) qsort on every snapshot build
- Replaces with O(m log m + n) merge per commit
- For typical cases where m << n and snapshots are frequent, this is a win


The benchmark results for this patch are shown below.

4) Configuration

shared_buffers = '4GB'
wal_level = logical
max_replication_slots = 10
max_wal_senders = 10
log_min_messages = warning
max_connections = 600
autovacuum = off
checkpoint_timeout = 15min
max_wal_size = 4GB


5) Workloads

# Workload 1: DDL - High frequency, small commits
create_ddl_workload() {
  local ROOT="$1"
  cat >"$ROOT/ddl.sql" <<'SQL'
-- High-frequency small catalog commits
-- Each commit triggers SnapBuildBuildSnapshot
DO $$
DECLARE
  tbl text := format('temp_%s_%s',
                     current_setting('application_name', true),
                     floor(random()*1e9)::int);
BEGIN
  EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT DROP', tbl);
  EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
  EXECUTE format('SELECT * FROM %I', tbl);
END$$;
SQL
}

# Workload 2: MIXED - mix of DDL and DML, 50%-50%
create_mixed_workload() {
  local ROOT="$1"
  cat >"$ROOT/mixed_ddl.sql" <<'SQL'
-- DDL workload (catalog changes)
DO $$
DECLARE
  tbl text := format('t_%s_%s',
                     current_setting('application_name', true),
                     floor(random()*1e9)::int);
BEGIN
  EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT DROP', tbl);
  EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
END$$;

SQL
  cat >"$ROOT/mixed_dml.sql" <<'SQL'
-- DML workload (no catalog changes)
INSERT INTO app_data (id, data)
VALUES (floor(random()*1e6)::int, repeat('x', 100))
ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
SQL
}

# Workload 3: CONTROL - Pure DML, no catalog changes
create_control_workload() {
  local ROOT="$1"
  cat >"$ROOT/control.sql" <<'SQL'

-- Pure DML, no catalog changes
-- Should show no difference between baseline and patched
INSERT INTO control_data (id, data)
VALUES (floor(random()*1e6)::int, repeat('x', 100))
ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
SQL
}


6) Test strategy

Clients: 100 concurrent connections per workload
Duration: 40 seconds per run
Repetitions: 1 run per workload type
Replication: Logical replication slot using `test_decoding` plugin with `EXPORT_SNAPSHOT=true`

Workload Types:
1. DDL - Pure catalog churn (temp table create/drop)
2. Mixed - 50% DDL + 50% DML
3. Control - Pure DML (no catalog changes)

Measurement:
- Metrics captured from pg_stat_replication_slots before/after each run
- Primary metrics: total_txns (transactions decoded) and total_bytes (data volume)
- Compared baseline (vanilla PostgreSQL) vs patched (sorted committed.xip optimization)


7) Performance results

DDL Workload: +235% Decoder Improvement
Decoder throughput:  713.76 → 2396.52 txns/sec  (+235%)
Throughput (MB/s):   672.67 → 1747.22 MB/s     (+159%)
Decode efficiency:   9.46% → 33.29%             (+23.83 points)

Mixed Workload: No Change
Decoder throughput:  2751.10 → 2730.00 txns/sec  (0%)
Decode efficiency:   40.47% → 40.47%             (unchanged)

The flamegraphs show that the qsort overhead in SnapBuild is eliminated
in the mixed workload as well, but the throughput gain seen in the DDL
workload does not appear there. My guess is that, for DML-heavy
workloads, ReorderBufferApplyChange becomes the new hotspot.

Control Workload (pure DML): No Change
Decoder throughput:  3062.57 → 3066.37 txns/sec  (0%)
Decode efficiency:   49.97% → 49.97%             (unchanged)

=== Workload: ddl ===
Client commits/sec:
  Baseline:  7545.76 commits/sec
  Patched:   7198.21 commits/sec

Decoder throughput (from pg_stat_replication_slots):
  Baseline:  713.76 txns/sec  (672.67 MB/s)
  Patched:   2396.52 txns/sec  (1747.22 MB/s)

Transaction efficiency (decoded vs committed):
  Baseline:  309376 committed  →  29264 decoded  (9.46%)
  Patched:   302325 committed  →  100654 decoded  (33.29%)

Total decoded (all reps):
  Baseline:  29264 txns  (27579.53 MB)
  Patched:   100654 txns  (73383.10 MB)

  Decoder improvement: +235.00% (txns/sec)
  Decoder improvement: +159.00% (MB/s)
  Efficiency improvement: +23.83 percentage points (more transactions decoded per committed transaction)

=== Workload: mixed ===
Client commits/sec:
  Baseline:  6797.50 commits/sec
  Patched:   6745.26 commits/sec

Decoder throughput (from pg_stat_replication_slots):
  Baseline:  2751.10 txns/sec  (210.35 MB/s)
  Patched:   2730.00 txns/sec  (205.64 MB/s)

Transaction efficiency (decoded vs committed):
  Baseline:  285495 committed  →  115546 decoded  (40.47%)
  Patched:   283301 committed  →  114660 decoded  (40.47%)

Total decoded (all reps):
  Baseline:  115546 txns  (8834.71 MB)
  Patched:   114660 txns  (8636.96 MB)

  ≈  Decoder unchanged: 0.00% (txns/sec)

=== Workload: DML ===
Client commits/sec:
  Baseline:  6129.24 commits/sec
  Patched:   6136.93 commits/sec

Decoder throughput (from pg_stat_replication_slots):
  Baseline:  3062.57 txns/sec  (0.26 MB/s)
  Patched:   3066.37 txns/sec  (0.26 MB/s)

Transaction efficiency (decoded vs committed):
  Baseline:  257428 committed  →  128628 decoded  (49.97%)
  Patched:   251614 committed  →  125721 decoded  (49.97%)

Total decoded (all reps):
  Baseline:  128628 txns  (10.98 MB)
  Patched:   125721 txns  (10.69 MB)

  ≈  Decoder unchanged: 0.00% (txns/sec)

8) Potential regression

The potential regression point is before the slot reaches the
CONSISTENT state, particularly when building_full_snapshot is set to
true. In this phase, all transactions, including those that don't
modify the catalog, must be added to the committed.xip array. These
XIDs don't require later snapshot builds or sorting, so the
batch-insert logic increases the per-insert cost from O(1) to O(m + n)
without providing a direct benefit.

However, the impact of this regression could be limited. The system
remains in the pre-CONSISTENT phase only briefly during initial
snapshot building, and the building_full_snapshot = true case is rare,
mainly used when creating replication slots with the EXPORT_SNAPSHOT
option.

Once the slot becomes CONSISTENT, only catalog-modifying transactions
are tracked in committed.xip, and the patch reduces overall
snapshot-building overhead by eliminating repeated full-array sorts.

We could also adopt a two-phase approach — keeping the current
behavior before reaching the CONSISTENT state and maintaining a sorted
array only after that point. This would preserve the performance
benefits while avoiding potential regressions. However, it would
introduce additional complexity and potential risks in handling the
state transitions.

For reference, this is the relevant pre-CONSISTENT handling in
SnapBuildCommitTxn():

if (builder->state < SNAPBUILD_CONSISTENT)
{
    /* ensure that only commits after this are getting replayed */
    if (builder->start_decoding_at <= lsn)
        builder->start_decoding_at = lsn + 1;

    /*
     * If building an exportable snapshot, force xid to be tracked, even
     * if the transaction didn't modify the catalog.
     */
    if (builder->building_full_snapshot)
    {
        needs_timetravel = true;
    }
}
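
A rough sketch of how that two-phase gating could look (the helper names
here are hypothetical, not existing snapbuild.c functions):

static void
SnapBuildAddCommittedXids(SnapBuild *builder, TransactionId *xids, int nxids)
{
    if (builder->state < SNAPBUILD_CONSISTENT)
    {
        /* pre-CONSISTENT: keep today's cheap unsorted append */
        append_committed_unsorted(builder, xids, nxids);   /* hypothetical */
    }
    else
    {
        /* CONSISTENT: sort the small batch, then merge into the sorted array */
        sort_xid_batch(xids, nxids);                       /* hypothetical */
        merge_committed_sorted(builder, xids, nxids);      /* hypothetical */
    }
}

The open question is the transition itself: whatever accumulated while
inconsistent would need one final qsort when the builder first reaches
SNAPBUILD_CONSISTENT, and that hand-off is where the extra complexity
and risk mentioned above would live.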

9) Additional benefit

With this patch applied, we can also optimize SnapBuildPurgeOlderTxn
to use two binary searches, rather than iterating and copying, to find
the interval to keep in the sorted committed.xip array. [1]
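
As a rough illustration of that follow-up (plain uint32 comparison again
stands in for the real TransactionId ordering, and only the simple
lower-bound case is shown; the exact interval handling is discussed in
the linked thread):

#include <stdint.h>
#include <string.h>

/* first index whose XID is >= xmin in a sorted array (lower bound) */
static int
first_xid_geq(const uint32_t *xip, int nxip, uint32_t xmin)
{
    int lo = 0;
    int hi = nxip;              /* search the half-open range [lo, hi) */

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (xip[mid] < xmin)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* drop all XIDs older than xmin; returns the new element count */
static int
purge_older_xids(uint32_t *xip, int nxip, uint32_t xmin)
{
    int keep = first_xid_geq(xip, nxip, xmin);

    memmove(xip, xip + keep, (nxip - keep) * sizeof(uint32_t));
    return nxip - keep;
}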

Feedback welcome.

[1] https://www.postgresql.org/message-id/flat/CABPTF7V9gcpTLrOY0fG4YontoHjVg8YrbmiH4XB_5PT6K56xhg%40mail.gmail.com

Best,
Xuneng
