Thread: BUG #18630: Incorrect memory access inside ReindexIsProcessingIndex() on VACUUM

BUG #18630: Incorrect memory access inside ReindexIsProcessingIndex() on VACUUM

From

PG Bug reporting form

Date:

25 September 2024, 06:00:00

The following bug has been logged on the website:

Bug reference:      18630
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 17rc1
Operating system:   Ubuntu 22.04
Description:

The following script:
psql -c "SELECT pg_sleep(5)" &

echo "
SET lock_timeout = '3s';
CREATE TABLE t(i int, t text);
REINDEX TABLE CONCURRENTLY t;
SELECT pg_sleep(3);
" | psql

psql -c "VACUUM (PROCESS_MAIN FALSE, FULL TRUE) t;"

produces:
WARNING:  cannot reindex invalid index "pg_toast.pg_toast_16384_index_ccnew"
on TOAST table, skipping

and then a Valgrind-detected error:
==00:00:00:10.727 3193327== Invalid read of size 4
==00:00:00:10.727 3193327==    at 0x5A6D80: list_member_oid (list.c:726)
==00:00:00:10.727 3193327==    by 0x33FE2F: ReindexIsProcessingIndex
(index.c:4083)
==00:00:00:10.727 3193327==    by 0x27B43F: systable_beginscan
(genam.c:396)
==00:00:00:10.727 3193327==    by 0x4CE8F9: vac_update_datfrozenxid
(vacuum.c:1723)
==00:00:00:10.727 3193327==    by 0x4CCFAB: vacuum (vacuum.c:691)
==00:00:00:10.727 3193327==    by 0x4CC910: ExecVacuum (vacuum.c:449)
==00:00:00:10.727 3193327==    by 0x7CE082: standard_ProcessUtility
(utility.c:859)
==00:00:00:10.727 3193327==    by 0x7CD61D: ProcessUtility (utility.c:523)
==00:00:00:10.727 3193327==    by 0x7CBE98: PortalRunUtility
(pquery.c:1158)
==00:00:00:10.727 3193327==    by 0x7CC10F: PortalRunMulti (pquery.c:1316)
==00:00:00:10.727 3193327==    by 0x7CB559: PortalRun (pquery.c:791)
==00:00:00:10.727 3193327==    by 0x7C3C7A: exec_simple_query
(postgres.c:1284)
==00:00:00:10.727 3193327==  Address 0x72f4878 is 7,496 bytes inside a
recently re-allocated block of size 8,192 alloc'd
==00:00:00:10.727 3193327==    at 0x4848899: malloc
(vg_replace_malloc.c:381)
==00:00:00:10.727 3193327==    by 0x9FDA95: AllocSetContextCreateInternal
(aset.c:444)
==00:00:00:10.727 3193327==    by 0x2E0BBC: AtStart_Memory (xact.c:1206)
==00:00:00:10.727 3193327==    by 0x2E1C56: StartTransaction (xact.c:2143)
==00:00:00:10.727 3193327==    by 0x2E2CA8: StartTransactionCommand
(xact.c:3050)
==00:00:00:10.727 3193327==    by 0x9DF444: InitPostgres (postinit.c:830)
==00:00:00:10.727 3193327==    by 0x7C8B3A: PostgresMain (postgres.c:4349)
==00:00:00:10.727 3193327==    by 0x7BF5AE: BackendMain
(backend_startup.c:107)
==00:00:00:10.727 3193327==    by 0x6D1E75: postmaster_child_launch
(launch_backend.c:274)
==00:00:00:10.727 3193327==    by 0x6D7CE8: BackendStartup
(postmaster.c:3420)
==00:00:00:10.727 3193327==    by 0x6D539A: ServerLoop (postmaster.c:1653)
==00:00:00:10.727 3193327==    by 0x6D4C92: PostmasterMain
(postmaster.c:1351)
==00:00:00:10.727 3193327==
...
2024-09-25 02:44:16.496 UTC|||66f378f6.30b9b3|LOG:  server process (PID
3193327) exited with exit code 1
2024-09-25 02:44:16.496 UTC|||66f378f6.30b9b3|DETAIL:  Failed process was
running: VACUUM (PROCESS_MAIN FALSE, FULL TRUE) t;

or an assertion failure (when executed without Valgrind):
TRAP: failed Assert("IsOidList(list)"), File: "list.c", Line: 726, PID:
3213057

Reproduced on REL_16_STABLE (starting from 4211fbd84) .. master.

Re: BUG #18630: Incorrect memory access inside ReindexIsProcessingIndex() on VACUUM

From

Tender Wang

Date:

25 September 2024, 13:19:20

Tender Wang <tndrwang@gmail.com> 于2024年9月25日周三 17:28写道：

PG Bug reporting form <noreply@postgresql.org> 于2024年9月25日周三 13:35写道：
The following bug has been logged on the website:

Bug reference: 18630
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 17rc1
Operating system: Ubuntu 22.04
Description:

The following script:
psql -c "SELECT pg_sleep(5)" &

echo "
SET lock_timeout = '3s';
CREATE TABLE t(i int, t text);
REINDEX TABLE CONCURRENTLY t;
SELECT pg_sleep(3);
" | psql

psql -c "VACUUM (PROCESS_MAIN FALSE, FULL TRUE) t;"

produces:
WARNING: cannot reindex invalid index "pg_toast.pg_toast_16384_index_ccnew"
on TOAST table, skipping

and then a Valgrind-detected error:
==00:00:00:10.727 3193327== Invalid read of size 4
==00:00:00:10.727 3193327== at 0x5A6D80: list_member_oid (list.c:726)
==00:00:00:10.727 3193327== by 0x33FE2F: ReindexIsProcessingIndex
(index.c:4083)
==00:00:00:10.727 3193327== by 0x27B43F: systable_beginscan
(genam.c:396)
==00:00:00:10.727 3193327== by 0x4CE8F9: vac_update_datfrozenxid
(vacuum.c:1723)
==00:00:00:10.727 3193327== by 0x4CCFAB: vacuum (vacuum.c:691)
==00:00:00:10.727 3193327== by 0x4CC910: ExecVacuum (vacuum.c:449)
==00:00:00:10.727 3193327== by 0x7CE082: standard_ProcessUtility
(utility.c:859)
==00:00:00:10.727 3193327== by 0x7CD61D: ProcessUtility (utility.c:523)
==00:00:00:10.727 3193327== by 0x7CBE98: PortalRunUtility
(pquery.c:1158)
==00:00:00:10.727 3193327== by 0x7CC10F: PortalRunMulti (pquery.c:1316)
==00:00:00:10.727 3193327== by 0x7CB559: PortalRun (pquery.c:791)
==00:00:00:10.727 3193327== by 0x7C3C7A: exec_simple_query
(postgres.c:1284)
==00:00:00:10.727 3193327== Address 0x72f4878 is 7,496 bytes inside a
recently re-allocated block of size 8,192 alloc'd
==00:00:00:10.727 3193327== at 0x4848899: malloc
(vg_replace_malloc.c:381)
==00:00:00:10.727 3193327== by 0x9FDA95: AllocSetContextCreateInternal
(aset.c:444)
==00:00:00:10.727 3193327== by 0x2E0BBC: AtStart_Memory (xact.c:1206)
==00:00:00:10.727 3193327== by 0x2E1C56: StartTransaction (xact.c:2143)
==00:00:00:10.727 3193327== by 0x2E2CA8: StartTransactionCommand
(xact.c:3050)
==00:00:00:10.727 3193327== by 0x9DF444: InitPostgres (postinit.c:830)
==00:00:00:10.727 3193327== by 0x7C8B3A: PostgresMain (postgres.c:4349)
==00:00:00:10.727 3193327== by 0x7BF5AE: BackendMain
(backend_startup.c:107)
==00:00:00:10.727 3193327== by 0x6D1E75: postmaster_child_launch
(launch_backend.c:274)
==00:00:00:10.727 3193327== by 0x6D7CE8: BackendStartup
(postmaster.c:3420)
==00:00:00:10.727 3193327== by 0x6D539A: ServerLoop (postmaster.c:1653)
==00:00:00:10.727 3193327== by 0x6D4C92: PostmasterMain
(postmaster.c:1351)
==00:00:00:10.727 3193327==
...
2024-09-25 02:44:16.496 UTC|||66f378f6.30b9b3|LOG: server process (PID
3193327) exited with exit code 1
2024-09-25 02:44:16.496 UTC|||66f378f6.30b9b3|DETAIL: Failed process was
running: VACUUM (PROCESS_MAIN FALSE, FULL TRUE) t;

or an assertion failure (when executed without Valgrind):
TRAP: failed Assert("IsOidList(list)"), File: "list.c", Line: 726, PID:
3213057

Reproduced on REL_16_STABLE (starting from 4211fbd84) .. master.

Thanks for reporting. I can reproduce this issue.

When this statement "REINDEX TABLE CONCURRENTLY t;" failed because of lock timeout.
If we do vacuum like this case, when we do toast_relid of table t, we will get two index oids.
pg_toast.pg_toast_16384_index_ccnew is invalid because the REINDEX failed.

Now we only report warings in reindex_relation(). The pg_toast.pg_toast_16384_index_ccnew is
still on the pendingReindexedIndexes list. After finishing the toast_rel of table t, the transatiocn committed,
and the memory of pendingReindexedIndexes was reset but not NIL. So it will trigger assert failure when calling ReindexIsProcessingIndex().

I think we can remove the invalid index oid from the pendingReindexedIndexes instead of reporting warning.
I try this way, and no assert failure again. See the attached patch.

Adding if (flags & REINDEX_REL_SUPPRESS_INDEX_USE) check before calling RemoveReindexPending() may look better.

Any thoughts?

Thanks,

Tender Wang

https://www.openpie.com/

Re: BUG #18630: Incorrect memory access inside ReindexIsProcessingIndex() on VACUUM

From

Tender Wang

Date:

26 September 2024, 10:24:25

Michael Paquier <michael@paquier.xyz> 于2024年9月26日周四 15:02写道：

On Wed, Sep 25, 2024 at 06:19:20PM +0800, Tender Wang wrote:
> Adding if (flags & REINDEX_REL_SUPPRESS_INDEX_USE) check before calling
> RemoveReindexPending() may look better.
> Any thoughts?

The short answer to that is yes, see below.

Alexander's trick presented upthread is funky. I was first confused
of what was the point until I noticed that this is just a way to make
REINDEX CONCURRENTLY fail and produce an invalid toast index. The
first pg_sleep causes REINDEX to fail because we are waiting for the
first transaction to complete. So this provides a controlled way to
get an index that should never be part of the list reported to
SetReindexPending() to begin with. The VACUUM (PROCESS_MAIN FALSE,
FULL) is then disturbed.

Anyway, PROCESS_MAIN is not related to the failure, as it is just a
flavor grammar introduced for the sake of making toast-only rebuilds
easier for administrators. The same failure reproduces if switching
the test case of upthread to do the REINDEX on the toast table, down
to 12 which is as far as I have tested because that's what we support
around here.

Regarding the fix, we should not disturb the list of indexes in a
relation returned by relcache.c, and invalid indexes are part of it.
So I'd agree with your point to just remove the index from the pending
list because we have to skip invalid toast indexes in the
reindex_relation() path as reindex_index() has to generate a hard
failure because we can never ever have two valid toast indexes, and
that's a guarantee we need to be very careful about.

Now, about REINDEX_REL_SUPPRESS_INDEX_USE. Well, that's only used for
the VACUUM FULL/CLUSTER path which is where the indexes are marked as
being processed, so we have to reomve the invalid toast indexes from
the list only if this flag is used. Hence the answer is yes. I'd
rather keep the informative warning. That's useful for monitoring
purposes.

Agree. The attached patch LGTM.

Something worth noting while I have looked at this code.. Non-toast
invalid indexes are rebuilt but still marked as !indisvalid, actually.
That's a waste and we could just skip them for a CLUSTER/VACUUM FULL.
This cannot be true for a REINDEX TABLE, invalid indexes are switched
to valid in this case.

Thanks,

Tender Wang

https://www.openpie.com/