Re: Support tid range scan in parallel? - Mailing list pgsql-hackers

From David Rowley
Subject Re: Support tid range scan in parallel?
Date
Msg-id CAApHDvoWXpDJU6R3A-Xv2MaEm4y233H_m0od6iqX+XGx3En2Vw@mail.gmail.com
In response to Re: Support tid range scan in parallel?  (Cary Huang <cary.huang@highgo.ca>)
List pgsql-hackers
On Sat, 30 Aug 2025 at 05:32, Cary Huang <cary.huang@highgo.ca> wrote:
>
>  > The workers are ending their scan early because
>  > heap_getnextslot_tidrange() returns false on the first call from the
>  > parallel worker.

> Yes, the previous v9 patch missed setting node->trss_mintid and
> node->trss_maxtid, causing the parallel workers to exit early due to
> heap_getnextslot_tidrange() returning false.
>
> With the attached v10 patch, the parallel leader and workers now
> have to evaluate (TidRangeEval()) the given TID ranges and set them
> via ExecTidRangeScanInitializeDSM(),
> ExecTidRangeScanReInitializeDSM() and
> ExecTidRangeScanInitializeWorker().
>
> To prevent the leader and the workers from calling heap_setscanlimits()
> and trying to set phs_startblock and phs_numblock concurrently in
> shared memory, I added a condition to only allow parallel
> leader to set them. Since node->trss_mintid and node->trss_maxtid
> reside in local memory, the workers still have to call
> heap_set_tidrange() to have them set to return correct scan results:

I spent quite a bit of time looking at this. I didn't like the way
heap_setscanlimits() did:

+ /* set the limits in the ParallelBlockTableScanDesc, when present as leader */
+ if (scan->rs_base.rs_parallel != NULL && !IsParallelWorker())

as it wasn't clear to me that this didn't break completely when the
leader didn't make it there first.

I've made quite a few revisions to the v10 patch, which I've attached
as v11-0002. v11-0001 is your v10 rebased atop master.

Here's a summary of the changes:

1. Moved block limiting logic for parallel scans into
table_block_parallelscan_startblock_init(). There's currently a lock
here to ensure only 1 worker can set the shared memory fields at a
time. I've hooked into the same lock to set the startblock and
numblocks.
2. Fixed the chunk size ramp-down code, which is meant to divvy up the
scan into smaller and smaller chunks as it nears completion so that
one worker doesn't get left with too much work while the others have
nothing. That code still thought it was scanning every block in the
table.
3. Changed things around so that the min/max TID for the parallel scan
is specified via table_rescan_tidrange(). This means zero changes to
TidRangeNext, and the only additions to nodeTidrangescan.c are for the
shared memory handling.
4. The rest of the changes are mostly cosmetic.

With this version, table_block_parallelscan_startblock_init() has
grown 2 extra parameters. I considered renaming the function to append
"_with_limit" to its name and adding another function with the
original name that calls the renamed one, passing InvalidBlockNumber
for both new parameters. I didn't do that since we only have a single
call to the existing function, so doing that would only be for the
benefit of extensions that happen to use that function, and it doesn't
seem overly difficult for them to adjust their code. I didn't find any
code using that function on codesearch.debian.net.

I still need to do a bit more testing on this, but in the meantime
thought I'd share what I've done with it so that other people can look
in parallel.

David

