Re: Adding REPACK [concurrently] - Mailing list pgsql-hackers

From Antonin Houska
Subject Re: Adding REPACK [concurrently]
Msg-id 35686.1768495019@localhost
In response to Re: Adding REPACK [concurrently]  (Mihail Nikalayeu <mihailnikalayeu@gmail.com>)
Responses Re: Adding REPACK [concurrently]
List pgsql-hackers
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

> Also, there are some crashes of stress tests for v30 (for both single snapshot and multiple snapshot versions).
>
> ---------------------
>
> Looks like something is leaking, but not sure.
>
> https://cirrus-ci.com/task/5577209672368128?logs=test_world#L277 (multiple snapshots)
> https://cirrus-ci.com/task/6439044873191424 (without multiple snapshots)

As the test runs pgbench with --client=30 and the default value of
max_worker_processes is 8, I'm not sure this is a leak. After I increased this
parameter, I couldn't see the error anymore.

> This one showed something goes wrong, the sum of the table is broken. It may be 0 because non-MVCC safe, but I
> checked the logs:
>
> 2026-01-12 18:41:11.656 UTC client backend[76247] 007_repack_concurrently.pl LOG:  statement: SELECT (490588) / 0;

I agree that this is due to the missing MVCC safety feature. I commented out
that check in the script for now.

Besides that, I saw some deadlocks. I think they occurred because multiple
rows are updated per transaction and the keys are random, so two transactions
can try to update the same rows in a different order. I increased the number
of rows in the test table to 10000 and don't see the deadlocks anymore.
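
To make the lock-ordering argument concrete, here is a minimal Python sketch. It is not taken from the test script: the function names, the key lists and the rows-per-transaction count k are hypothetical, and each transaction is assumed to touch k distinct, uniformly random keys. The first function checks whether two transactions lock some pair of shared rows in opposite order; the second gives the fraction of transaction pairs that share at least two keys, which is a precondition for such a deadlock.

```python
from math import comb


def can_deadlock(order_a, order_b):
    """Return True if two transactions lock some pair of shared rows in
    opposite order. Keys within one transaction are assumed distinct."""
    pos_b = {key: idx for idx, key in enumerate(order_b)}
    # Shared keys, listed in the order in which transaction A locks them.
    shared = [key for key in order_a if key in pos_b]
    # If B's positions of those keys are not increasing, some adjacent
    # pair is locked in opposite order by A and B.
    for earlier, later in zip(shared, shared[1:]):
        if pos_b[earlier] > pos_b[later]:
            return True
    return False


def prob_share_two_or_more(n_rows, k):
    """Probability that two random k-key sets drawn from n_rows keys
    have at least two keys in common (hypothetical model)."""
    total = comb(n_rows, k)
    none = comb(n_rows - k, k)            # no common key
    one = k * comb(n_rows - k, k - 1)     # exactly one common key
    return 1 - (none + one) / total


# Opposite order on the shared keys 3 and 1: a deadlock is possible.
print(can_deadlock([3, 7, 1], [1, 9, 3]))                   # True
# Locking in sorted key order removes the conflict.
print(can_deadlock(sorted([3, 7, 1]), sorted([1, 9, 3])))   # False
# A larger table makes overlapping key sets much less likely.
print(prob_share_two_or_more(100, 5) > prob_share_two_or_more(10000, 5))  # True
```

Under these assumptions a bigger table only makes the conflict rarer, whereas updating the rows in a consistent key order would rule it out.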

> backend[54349] 007_repack_concurrently.pl ERROR:  could not create unique index "tbl_pkey_repacknew"
> 2026-01-12 18:41:12.477 UTC client backend[54349] 007_repack_concurrently.pl DETAIL:  Key (i)=(942) is duplicated.
> 2026-01-12 18:41:12.477 UTC client backend[54349] 007_repack_concurrently.pl STATEMENT:  REPACK (CONCURRENTLY) tbl;

This is tricky. I could reproduce the problem on my FreeBSD box a few times,
but never on Linux. I have no idea whether the OS makes the difference, since
the hardware is also quite different, but CI also seemed to fail more often on
FreeBSD.

Something seems to be wrong with UPDATE, but I fail to understand how it
could relate to REPACK. This is an example of a duplicate value, i=6118:

SELECT i, j, xmin, xmax, ctid FROM tbl WHERE i=6118;
  i   |   j    |  xmin  |  xmax  |  ctid
------+--------+--------+--------+---------
 6118 | 445435 | 102317 | 103702 | (1,216)
 6118 | 391135 | 103702 |      0 | (56,62)

According to the log, xid=102317 is the transaction used by REPACK and
xid=103702 is one of the test transactions. pageinspect shows that the old
version has not only HEAP_XMIN_COMMITTED in t_infomask, but also
HEAP_XMAX_INVALID.
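
For readers who want to decode t_infomask by hand, the following Python sketch maps the four hint bits to their names. The bit values are the ones defined in PostgreSQL's htup_details.h; the helper names are hypothetical, and the sample value below is only the combination of the two flags named above, not the full t_infomask of the failing tuple, which is not shown here.

```python
# Hint bits from src/include/access/htup_details.h.
HEAP_XMIN_COMMITTED = 0x0100
HEAP_XMIN_INVALID = 0x0200
HEAP_XMAX_COMMITTED = 0x0400
HEAP_XMAX_INVALID = 0x0800

HINT_BITS = {
    "HEAP_XMIN_COMMITTED": HEAP_XMIN_COMMITTED,
    "HEAP_XMIN_INVALID": HEAP_XMIN_INVALID,
    "HEAP_XMAX_COMMITTED": HEAP_XMAX_COMMITTED,
    "HEAP_XMAX_INVALID": HEAP_XMAX_INVALID,
}


def decode_hint_bits(t_infomask):
    """Return the names of the hint bits set in t_infomask."""
    return [name for name, bit in HINT_BITS.items() if t_infomask & bit]


def hinted_as_live(t_infomask):
    """Simplified check (an assumption, not the real visibility code):
    the inserter is hinted as committed and xmax is hinted as invalid,
    so the stored xmax would be ignored and the tuple treated as live."""
    return bool(t_infomask & HEAP_XMIN_COMMITTED) and bool(
        t_infomask & HEAP_XMAX_INVALID
    )


# Illustrative value: only the two flags reported for the old version.
sample = HEAP_XMIN_COMMITTED | HEAP_XMAX_INVALID
print(decode_hint_bits(sample))  # ['HEAP_XMIN_COMMITTED', 'HEAP_XMAX_INVALID']
print(hinted_as_live(sample))    # True
```

With HEAP_XMAX_INVALID set, the nonzero xmax of the old version is disregarded, which would be consistent with both versions being returned by the query.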

So far I could not reproduce the duplicates with the REPACK (CONCURRENTLY)
command commented out in the test script, but that does not prove much (even
with REPACK, not every run fails). I also noticed that REPACK incorrectly sets
cmin/cmax to 1 instead of 0. That needs to be fixed, but I have no idea why
this bug should cause exactly this weird behavior.

I even added quite a few logging messages to reveal where in the code the
HEAP_XMAX_INVALID flag is set for a particular ctid, but after a failure I
could not find the message for the problematic tuples. Ideas are appreciated.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com


