Thread: Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs

Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs

From
Álvaro Herrera
Date:
On 2024-Mar-25, Alexander Korotkov wrote:

> reindexdb: Add the index-level REINDEX with multiple jobs
> 
> Straight-forward index-level REINDEX is not supported with multiple jobs as
> we cannot control the concurrent processing of multiple indexes depending on
> the same relation.  Instead, we dedicate the whole table to certain reindex
> job.  Thus, if indexes in the lists belong to different tables, that gives us
> a fair level of parallelism.

I tested this, because of a refactoring suggestion [1] and I find that
it's rather completely broken.  I ran this to setup a bunch of tables
that I'd want reindexed in parallel:

create table foo (a int);
insert into foo select * from generate_series(1, (10^7)::numeric);
create index foo1 on foo (a);
create index foo2 on foo (a);
create table bar (a int);
insert into bar select * from generate_series(1, (2 * (10^7))::numeric);
create index bar1 on bar (a);
create index bar2 on bar (a);
create table baz (a int);
insert into baz select * from generate_series(1, (5 * (10^6))::numeric);
create index baz1 on baz (a);
create index baz2 on baz (a);
create index baz3 on baz (a);
create index baz4 on baz (a);

I then run this:
reindexdb -j4 --echo -i foo1 -i foo2 -i bar1 -i bar2 -i baz1 -i baz2 -i baz3 -i baz4 | grep REINDEX

Looking at active processes with psql's \watch during the run, I learn
that what happens is that we process the indexes on baz first, without
any other process in parallel, until we get to the last one of baz
table, and we start processing one from baz and one from foo in
parallel.  But when the one in baz is done, we only continue with one
process until the list reaches indexes of 'bar', and we process two in
parallel, and then we ran out of indexes in foo so we complete without
any more paralellism.

This is a waste and surely not what was intended: surely what we want is
that given that we have more parallel jobs available than there are
tables, we would start processing the first index of each table at
roughly the same time, namely right at program start.  That would keep
three processes occupied until we ran out of indexes on one table, so
we'd keep two processes occupied, and so on.  But this is not what
happens.

Am I misunderstanding something?

[1] https://postgr.es/m/CAEudQApP=u5-9PR_fs1DpZToQNrtTFSP+_fjrOgfi73UkrBXKQ@mail.gmail.com
 
-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs

From
Álvaro Herrera
Date:
On 2025-Mar-07, Álvaro Herrera wrote:

> I tested this, because of a refactoring suggestion [1] and I find that
> it's rather completely broken.

I think we need significantly more complex scheduling code if we want
this to actually work, possibly even having to hack the ParallelSlot
API some, so that we can inspect which tables have a running reindex and
know not to schedule the next one on it.  What we're doing now makes no
sense.

We should strike this out from the list of features of 17 and revert
this commit.

If we want this feature in 19, we need another go through the drawing
board.  (There's clearly not enough time to do it for 18.)

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs

From
Alexander Korotkov
Date:
On Sat, Mar 8, 2025 at 11:57 AM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2025-Mar-07, Álvaro Herrera wrote:
>
> > I tested this, because of a refactoring suggestion [1] and I find that
> > it's rather completely broken.
>
> I think we need significantly more complex scheduling code if we want
> this to actually work, possibly even having to hack the ParallelSlot
> API some, so that we can inspect which tables have a running reindex and
> know not to schedule the next one on it.  What we're doing now makes no
> sense.
>
> We should strike this out from the list of features of 17 and revert
> this commit.
>
> If we want this feature in 19, we need another go through the drawing
> board.  (There's clearly not enough time to do it for 18.)

Yes, I also think we need to revert this from 17.  One thing to care
about: it might be already used in some user scripts.  Should we
replace pg_fatal() with some notice and then run in a single job?  So,
user scripts wouldn't error out.

------
Regards,
Alexander Korotkov
Supabase



Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs

From
Alexander Korotkov
Date:
On Fri, Mar 7, 2025 at 8:20 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Mar-25, Alexander Korotkov wrote:
>
> > reindexdb: Add the index-level REINDEX with multiple jobs
> >
> > Straight-forward index-level REINDEX is not supported with multiple jobs as
> > we cannot control the concurrent processing of multiple indexes depending on
> > the same relation.  Instead, we dedicate the whole table to certain reindex
> > job.  Thus, if indexes in the lists belong to different tables, that gives us
> > a fair level of parallelism.
>
> I tested this, because of a refactoring suggestion [1] and I find that
> it's rather completely broken.

The code was written with assumption that running
run_reindex_command() with async == true can schedule a number of
queries for a connection.  But actually that's not true and everything
is broken.

------
Regards,
Alexander Korotkov
Supabase



Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs

From
Alexander Korotkov
Date:
On Sat, Mar 8, 2025 at 12:49 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Fri, Mar 7, 2025 at 8:20 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2024-Mar-25, Alexander Korotkov wrote:
> >
> > > reindexdb: Add the index-level REINDEX with multiple jobs
> > >
> > > Straight-forward index-level REINDEX is not supported with multiple jobs as
> > > we cannot control the concurrent processing of multiple indexes depending on
> > > the same relation.  Instead, we dedicate the whole table to certain reindex
> > > job.  Thus, if indexes in the lists belong to different tables, that gives us
> > > a fair level of parallelism.
> >
> > I tested this, because of a refactoring suggestion [1] and I find that
> > it's rather completely broken.
>
> The code was written with assumption that running
> run_reindex_command() with async == true can schedule a number of
> queries for a connection.  But actually that's not true and everything
> is broken.

The draft patch for revert is attached.  Could you, please, check.

------
Regards,
Alexander Korotkov
Supabase

Attachment

Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs

From
Alexander Korotkov
Date:
On Sun, Mar 9, 2025 at 4:53 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Sat, Mar 8, 2025 at 12:49 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Fri, Mar 7, 2025 at 8:20 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > On 2024-Mar-25, Alexander Korotkov wrote:
> > >
> > > > reindexdb: Add the index-level REINDEX with multiple jobs
> > > >
> > > > Straight-forward index-level REINDEX is not supported with multiple jobs as
> > > > we cannot control the concurrent processing of multiple indexes depending on
> > > > the same relation.  Instead, we dedicate the whole table to certain reindex
> > > > job.  Thus, if indexes in the lists belong to different tables, that gives us
> > > > a fair level of parallelism.
> > >
> > > I tested this, because of a refactoring suggestion [1] and I find that
> > > it's rather completely broken.
> >
> > The code was written with assumption that running
> > run_reindex_command() with async == true can schedule a number of
> > queries for a connection.  But actually that's not true and everything
> > is broken.
>
> The draft patch for revert is attached.  Could you, please, check.

After second thought it's not so hard to fix.  The attached patch does
it by putting REINDEX commands related to the same table into a single
SQL statement.  Possibly, that could be better than revert given we
need to handle compatibility.  What do you think?

------
Regards,
Alexander Korotkov
Supabase

Attachment

Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs

From
Álvaro Herrera
Date:
On 2025-Mar-09, Alexander Korotkov wrote:

> After second thought it's not so hard to fix.  The attached patch does
> it by putting REINDEX commands related to the same table into a single
> SQL statement.  Possibly, that could be better than revert given we
> need to handle compatibility.  What do you think?

Oh, this is an egg of Columbus solution, I like it.  It seems to work as
intended on a quick test.  Please add some commentary to run_reindex_command.

Maybe another, possibly better way to do this would be to use libpq
pipeline mode, sending all the commands for a table in one pipeline
instead of a single command.  The advantage of this would be that
server-side log entries for each command would be separate, and they
wouldn't appear clumped together in pg_stat_activity and so on.
However, this would require more invasive changes, so it might be better
to leave that for a future project -- it's certainly a harder sell for
such a change to be backpatched.  So I'm +1 on your current patch for 17
and master.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"



Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs

From
Alexander Korotkov
Date:
On Mon, Mar 10, 2025 at 1:12 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2025-Mar-09, Alexander Korotkov wrote:
>
> > After second thought it's not so hard to fix.  The attached patch does
> > it by putting REINDEX commands related to the same table into a single
> > SQL statement.  Possibly, that could be better than revert given we
> > need to handle compatibility.  What do you think?
>
> Oh, this is an egg of Columbus solution, I like it.  It seems to work as
> intended on a quick test.  Please add some commentary to run_reindex_command.
>
> Maybe another, possibly better way to do this would be to use libpq
> pipeline mode, sending all the commands for a table in one pipeline
> instead of a single command.  The advantage of this would be that
> server-side log entries for each command would be separate, and they
> wouldn't appear clumped together in pg_stat_activity and so on.
> However, this would require more invasive changes, so it might be better
> to leave that for a future project -- it's certainly a harder sell for
> such a change to be backpatched.  So I'm +1 on your current patch for 17
> and master.

Thank you for your feedback!  I also think that pipelining would be a
better options, but it's too invasive for backpatching. I've written
comments for gen_reindex_command() and run_reindex_command().  I'm
going to push this patch to master and 17 if no objections.

------
Regards,
Alexander Korotkov
Supabase

Attachment

Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs

From
Álvaro Herrera
Date:
On 2025-Mar-11, Alexander Korotkov wrote:

> Thank you for your feedback!  I also think that pipelining would be a
> better options, but it's too invasive for backpatching. I've written
> comments for gen_reindex_command() and run_reindex_command().  I'm
> going to push this patch to master and 17 if no objections.

Looks good.

Actually, I don't understand why is run_reindex_command() calling
PQfinish().  Hmm, also, why is the 'async' parameter always given as
true?  Is that PQfinish() actually dead code?

[... Looks at git history ...]

Ah!  this is actually a problem older than your patch -- it appears that
the last (only?) caller with async=false was removed by commit
2cbc3c17a5c1.  I think it's worth removing that while you're at it.  We
no longer need executeMaintenanceCommand() in reindexdb.c anymore
either, I think.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/