Thread: Questions about logicalrep_worker_launch()

Questions about logicalrep_worker_launch()

From: Fujii Masao

Hi,

While reading the code of logicalrep_worker_launch(), I had two questions:

(1)
When the sync worker limit per subscription is reached, logicalrep_worker_launch()
runs garbage collection to try to free up slots before checking the limit again.
That makes sense.

But should we do the same when the parallel apply worker limit is reached?
Currently, if we've hit the parallel apply worker limit but not the sync worker limit
and we find an unused worker slot, garbage collection doesn't run. Would it
make sense to also run garbage collection in that case?
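
For illustration, here is a simplified, compilable sketch of the check in
question. The names and structure are paraphrased from the description
above, not copied from launcher.c:

#include <stdbool.h>
#include <stddef.h>

/* Paraphrased stand-in; the real slot array lives in shared memory. */
typedef struct WorkerSlot { bool in_use; } WorkerSlot;

static int max_sync_workers_per_subscription = 2;
static int max_parallel_apply_workers_per_subscription = 2;

/*
 * Decide whether to run garbage collection of stale worker slots
 * before re-checking the limits.
 */
static bool
need_gc(WorkerSlot *free_slot, int nsyncworkers,
        int nparallelapplyworkers, bool want_parallel_apply)
{
    /* Current behavior: GC only if no free slot was found or the
     * sync worker limit was reached. */
    if (free_slot == NULL ||
        nsyncworkers >= max_sync_workers_per_subscription)
        return true;

    /* Point (1): also run GC when the parallel apply worker limit
     * is reached, since stale slots may be inflating that count. */
    if (want_parallel_apply &&
        nparallelapplyworkers >= max_parallel_apply_workers_per_subscription)
        return true;

    return false;
}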

(2)
If garbage collection removes at least one worker, logicalrep_worker_launch()
scans all worker slots again to look for a free one. But since we know at least one
slot was freed, this retry might be unnecessary. We could just reuse the freed
slot directly. Is that correct?
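
As an illustrative sketch of this idea (the helper names and stub bodies
here are invented, not from launcher.c), garbage collection could simply
hand back the first slot it frees:

#include <stdbool.h>
#include <stddef.h>

typedef struct WorkerSlot { bool in_use; } WorkerSlot;

/* Invented stubs: the real checks would look at launch time, attach
 * state, etc. */
static bool worker_is_stale(WorkerSlot *w) { (void) w; return false; }
static void cleanup_slot(WorkerSlot *w) { w->in_use = false; }

/* Returns a slot the caller can claim directly, or NULL if none freed. */
static WorkerSlot *
garbage_collect(WorkerSlot *slots, int nslots)
{
    WorkerSlot *freed = NULL;

    for (int i = 0; i < nslots; i++)
    {
        if (slots[i].in_use && worker_is_stale(&slots[i]))
        {
            cleanup_slot(&slots[i]);
            if (freed == NULL)
                freed = &slots[i];   /* reuse this slot; no rescan needed */
        }
    }
    return freed;
}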


The attached patch addresses both points. Since logicalrep_worker_launch()
isn't a performance-critical path, this might not be a high-priority change.
But if my understanding is correct, I'm a bit tempted to apply it as a refactoring.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Re: Questions about logicalrep_worker_launch()

From: Masahiko Sawada

Hi,

On Fri, Apr 25, 2025 at 9:10 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> Hi,
>
> While reading the code of logicalrep_worker_launch(), I had two questions:
>
> (1)
> When the sync worker limit per subscription is reached, logicalrep_worker_launch()
> runs garbage collection to try to free up slots before checking the limit again.
> That makes sense.
>
> But should we do the same when the parallel apply worker limit is reached?
> Currently, if we've hit the parallel apply worker limit but not the sync worker limit
> and we find an unused worker slot, garbage collection doesn't run. Would it
> make sense to also run garbage collection in that case?

Good point. In that case, we would end up unable to launch parallel
apply workers for the subscription until we had either used up all
worker slots or reached the sync worker limit, which is bad.
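
To see this concretely, here is a toy model of that state (all values
are invented): ten worker slots, three of them stale parallel apply
slots, and max_parallel_apply_workers_per_subscription = 3:

#include <stdbool.h>
#include <stdio.h>

typedef struct { bool in_use; bool is_parallel_apply; } Slot;

int main(void)
{
    /* Slots 0-2 are stale parallel apply workers (in_use, but their
     * processes died before attaching); slots 3-9 are free. */
    Slot slots[10] = {
        {true, true}, {true, true}, {true, true},
    };
    int max_parallel_apply_workers_per_subscription = 3;

    int nparallelapply = 0;
    bool have_free_slot = false;

    for (int i = 0; i < 10; i++)
    {
        if (slots[i].in_use && slots[i].is_parallel_apply)
            nparallelapply++;
        if (!slots[i].in_use)
            have_free_slot = true;
    }

    /* A free slot exists and the sync limit is not involved, so GC is
     * skipped -- yet the stale slots already put the parallel apply
     * count at the limit, so the launch is refused. */
    if (have_free_slot &&
        nparallelapply >= max_parallel_apply_workers_per_subscription)
        printf("launch refused; %d stale slots never reclaimed\n",
               nparallelapply);

    return 0;
}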

>
> (2)
> If garbage collection removes at least one worker, logicalrep_worker_launch()
> scans all worker slots again to look for a free one. But since we know at least one
> slot was freed, this retry might be unnecessary. We could just reuse the freed
> slot directly. Is that correct?

Agreed. Since this code is protected by LogicalRepWorkerLock held in
exclusive mode, we should be able to reuse the worker slot that we just
cleaned up.
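
In fragment form (garbage_collect() and claim_slot() are invented names,
not launcher.c functions), the argument is:

/* The scan, the cleanup, and claiming a slot all happen while
 * LogicalRepWorkerLock is held exclusively, so no other backend can
 * grab a slot we just freed before we reuse it. */
LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);

freed = garbage_collect(slots, nslots);  /* frees stale slots */
if (freed != NULL)
    claim_slot(freed);                   /* safe under the same lock */

LWLockRelease(LogicalRepWorkerLock);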

> The attached patch addresses both points. Since logicalrep_worker_launch()
> isn't a performance-critical path, this might not be a high-priority change.
> But if my understanding is correct, I'm a bit tempted to apply it as a refactoring.

I agree with these changes.

I think that while the changes for (2) should be for v19, the changes
for (1) might be treated as a bug fix?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Questions about logicalrep_worker_launch()

From: Fujii Masao

On 2025/04/26 3:03, Masahiko Sawada wrote:
> I agree with these changes.
> 
> I think that while the changes for (2) should be for v19, the changes
> for (1) might be treated as a bug fix?

Agreed. I've split the patch into two parts:

0001 is for (1) and is a bug fix that should be back-patched to v16,
where parallel apply workers were introduced. Since it didn't apply
cleanly to v16, I also created a separate patch specifically for v16.

0002 is for (2) and is intended for v19.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Re: Questions about logicalrep_worker_launch()

From: Amit Kapila

On Mon, Apr 28, 2025 at 2:37 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> On 2025/04/26 3:03, Masahiko Sawada wrote:
> > I agree with these changes.
> >
> > I think that while the changes for (2) should be for v19, the changes
> > for (1) might be treated as a bug fix?
>
> Agreed. I've split the patch into two parts:
>
> 0001 is for (1) and is a bug fix that should be back-patched to v16,
> where parallel apply workers were introduced. Since it didn't apply
> cleanly to v16, I also created a separate patch specifically for v16.
>

The situation for parallel apply workers is not as bad as for
tablesync workers, because even if a parallel apply worker is not
available, the apply worker will apply the changes by itself. OTOH,
if a tablesync worker is not available, the tablesync stays pending
until a worker becomes available. So, I am not sure this is a
clear-cut bug that requires a fix in back-branches.

Additionally, shall we try to reproduce this case for parallel apply workers?

--
With Regards,
Amit Kapila.



Re: Questions about logicalrep_worker_launch()

From: Fujii Masao

On 2025/04/29 21:21, Amit Kapila wrote:
> On Mon, Apr 28, 2025 at 2:37 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>> On 2025/04/26 3:03, Masahiko Sawada wrote:
>>> I agree with these changes.
>>>
>>> I think that while the changes for (2) should be for v19, the changes
>>> for (1) might be treated as a bug fix?
>>
>> Agreed. I've split the patch into two parts:
>>
>> 0001 is for (1) and is a bug fix that should be back-patched to v16,
>> where parallel apply workers were introduced. Since it didn't apply
>> cleanly to v16, I also created a separate patch specifically for v16.
>>
> 
> The situation for parallel apply workers is not as bad as for
> tablesync workers, because even if a parallel apply worker is not
> available, the apply worker will apply the changes by itself. OTOH,
> if a tablesync worker is not available, the tablesync stays pending
> until a worker becomes available. So, I am not sure this is a
> clear-cut bug that requires a fix in back-branches.

I'm fine with treating this as an improvement rather than a bug fix.
In any case, I've registered the patches for the next CommitFest.

The attached patches are the same as before, just rebased for the master branch.

> 
> Additionally, shall we try to reproduce this case for parallel apply workers?

I noticed this issue while reading the code, so I haven't actually reproduced it.
Are you saying it's not possible to reproduce this in practice?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Re: Questions about logicalrep_worker_launch()

From: Amit Kapila

On Thu, May 1, 2025 at 10:27 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> >
> > Additionally, shall we try to reproduce this case for parallel apply workers?
>
> I noticed this issue while reading the code, so I haven't actually reproduced it.
> Are you saying it's not possible to reproduce this in practice?
>

No, the idea was that if we are changing the code, it is better to
verify the change with a test case instead of making changes based on
assumptions.

--
With Regards,
Amit Kapila.