Thread: Conflicting updates of command progress
While working on [1] I realized that some field of pg_stat_progress_cluste has weird value. On closer look I found out that cluster_rel() indirectly calls index_build() and that overwrites the progress parameter. Following are the parameter values in the master branch: /* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */ #define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1 #define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2 #define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3 #define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4 #define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5 #define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6 #define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7 /* Progress parameters for CREATE INDEX */ /* 3, 4 and 5 reserved for "waitfor" metrics */ #define PROGRESS_CREATEIDX_COMMAND 0 #define PROGRESS_CREATEIDX_INDEX_OID 6 #define PROGRESS_CREATEIDX_ACCESS_METHOD_OID 8 #define PROGRESS_CREATEIDX_PHASE 9 /* AM-agnostic phase # */ #define PROGRESS_CREATEIDX_SUBPHASE 10 /* phase # filled by AM */ #define PROGRESS_CREATEIDX_TUPLES_TOTAL 11 #define PROGRESS_CREATEIDX_TUPLES_DONE 12 #define PROGRESS_CREATEIDX_PARTITIONS_TOTAL 13 #define PROGRESS_CREATEIDX_PARTITIONS_DONE 14 /* 15 and 16 reserved for "block number" metrics */ The only conflicting parameters I see here are PROGRESS_CLUSTER_PHASE_REBUILD_INDEX vs PROGRESS_CREATEIDX_INDEX_OID (number 6). Fortunately, index_build() does not set PROGRESS_CREATEIDX_INDEX_OID so there's no live bug. However [1] adds some PROGRESS_CLUSTER_ parameters, thus making conflicts realistic. AFAICS the current design does not consider that one progress-reporting command can be called by another one. Not sure what the correct fix is. We can either ignore update requests from the "nested" commands, or display the progress of both. The latter approach is probably more invasive - is that worth the effort? [1] https://commitfest.postgresql.org/patch/5117/ -- Antonin Houska Web: https://www.cybertec-postgresql.com
> While working on [1] I realized that some field of pg_stat_progress_cluste has > weird value. Is there a repro that you can share that shows the weird values? It sounds like the repro is on top of [1]. Is that right? > AFAICS the current design does not consider that one progress-reporting > command can be called by another one. pgstat_progress_start_command should only be called once by the entry point for the command. In theory, we could end up in a situation where start_command is called multiple times during the same top-level command; > Not sure what the correct fix is. We can > either ignore update requests from the "nested" commands, or display the There is a pattern where we do ``` if (progress) pgstat_progress_update_param ``` cluster_rel can pass down a flag to index_build or others that update progress to not report progress. Therefore, only the top level command is updating progress. what do you think? [1] https://commitfest.postgresql.org/patch/5117/ -- Sami Imseih Amazon Web Services (AWS)
On 2025/04/15 2:13, Sami Imseih wrote: >> While working on [1] I realized that some field of pg_stat_progress_cluste has >> weird value. I ran into the same issue while working on [2], and eventually had to withdraw that patch because of it. > Is there a repro that you can share that shows the weird values? It sounds like > the repro is on top of [1]. Is that right? You can reproduce the similar problem by creating a trigger function that runs a progress-reporting command like COPY, and then COPY data into a table that uses that trigger. Regards, [2] https://commitfest.postgresql.org/patch/5282/ -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Sami Imseih <samimseih@gmail.com> wrote: > > While working on [1] I realized that some field of pg_stat_progress_cluste has > > weird value. > > Is there a repro that you can share that shows the weird values? It sounds like > the repro is on top of [1]. Is that right? Yes. > > AFAICS the current design does not consider that one progress-reporting > > command can be called by another one. > > pgstat_progress_start_command should only be called once by the entry > point for the > command. In theory, we could end up in a situation where start_command > is called multiple times during the same top-level command; Not only in theory - it actually happens when CLUSTER is rebuilding indexes. > > Not sure what the correct fix is. We can > > either ignore update requests from the "nested" commands, or display the > > There is a pattern where we do > > ``` > if (progress) > pgstat_progress_update_param > ``` > > cluster_rel can pass down a flag to index_build or others that update progress > to not report progress. Therefore, only the top level command is > updating progress. > what do you think? That's a possible approach. However, if the index build takes long time in the CLUSTER case, the user will probably be interested in details about the index build. The other approach is that we report only on the index build, but in that case the interesting row of pg_stat_progress_cluster would disappear temporarily. That might also be confusing. Ideally we should report on all the commands in progress, but that is not trivial to implement. > [1] https://commitfest.postgresql.org/patch/5117/ -- Antonin Houska Web: https://www.cybertec-postgresql.com
Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > On 2025/04/15 2:13, Sami Imseih wrote: > >> While working on [1] I realized that some field of pg_stat_progress_cluste has > >> weird value. > > I ran into the same issue while working on [2], and eventually had to > withdraw that patch because of it. Have you considered reporting the progress of each command separately? I think that can be implemented by moving the progress related fields from PgBackendStatus into a new structure and by teaching the backend to insert a new instance of that structure into a shared hash table (dshash.c) whenever a command that needs progress reporting is being started. pgstat.c uses such a hash table for various statistics objects, however it's probably not a good idea to adjust this infrastructure by simply adding a new object kind (e.g. PGSTAT_KIND_COMMAND). I think that pgstat.c is designed for frequent updates of backend-local statistics and less frequent flushes (e.g. at command completion) to the shared memory. That's not suitable for progress reporting. > [2] https://commitfest.postgresql.org/patch/5282/ -- Antonin Houska Web: https://www.cybertec-postgresql.com
>> pgstat_progress_start_command should only be called once by the entry >> point for the >> command. In theory, we could end up in a situation where start_command >> is called multiple times during the same top-level command; > Not only in theory - it actually happens when CLUSTER is rebuilding indexes. In the case of CLUSTER, pgstat_progress_start_command is only called once, but pgstat_progress_update_param is called in the context of both CLUSTER and CREATE INDEX. > That's a possible approach. However, if the index build takes long time in the > CLUSTER case, the user will probably be interested in details about the index > build. I agree, >> Is there a repro that you can share that shows the weird values? It sounds like >> the repro is on top of [1]. Is that right? >> You can reproduce the similar problem by creating a trigger function that >> runs a progress-reporting command like COPY, and then COPY data into >> a table that uses that trigger. >> [2] https://commitfest.postgresql.org/patch/5282/ In this case, pgstat_progress_start_command is actually called twice in the life of a single COPY command; the upper-level COPY command calls pgstat_progress_start_command and then the nested COPY command also does calls pgstat_progress_start_command. > I think that can be implemented by moving the progress related fields from > PgBackendStatus into a new structure and by teaching the backend to insert a > new instance of that structure into a shared hash table (dshash.c) I think this is a good idea in general to move the backend progress to shared memory. and with a new API that will deal with scenarios as described above. 1/ an (explicit) nested command was started by a top-level command, such as the COPY case above. 2/ a top-level command triggered some other progress code implicitly, such as CLUSTER triggering CREATE INDEX code. I also like the shared memory approach because we can then not have to use a message like the one introduced in f1889729dd3ab0 to support parallel index vacuum progress 46ebdfe164c61. -- Sami Imseih Amazon Web Services (AWS)
Sami Imseih <samimseih@gmail.com> wrote: > >> pgstat_progress_start_command should only be called once by the entry > >> point for the > >> command. In theory, we could end up in a situation where start_command > >> is called multiple times during the same top-level command; > > > Not only in theory - it actually happens when CLUSTER is rebuilding indexes. > > In the case of CLUSTER, pgstat_progress_start_command is only called once, > but pgstat_progress_update_param is called in the context of both CLUSTER > and CREATE INDEX. pgstat_progress_start_command() is called twice: First with cmdtype=PROGRESS_COMMAND_CLUSTER, second with PROGRESS_COMMAND_CREATE_INDEX. The first happens in cluster_rel() the second in cluster_rel() -> rebuild_relation() -> finish_heap_swap() -> reindex_relation() -> reindex_index(). It does not matter though, the current design only expects one command. > > That's a possible approach. However, if the index build takes long time in the > > CLUSTER case, the user will probably be interested in details about the index > > build. > > I agree, > > >> Is there a repro that you can share that shows the weird values? It sounds like > >> the repro is on top of [1]. Is that right? > > >> You can reproduce the similar problem by creating a trigger function that > >> runs a progress-reporting command like COPY, and then COPY data into > >> a table that uses that trigger. > > >> [2] https://commitfest.postgresql.org/patch/5282/ > > In this case, pgstat_progress_start_command is actually called > twice in the life of a single COPY command; the upper-level COPY > command calls pgstat_progress_start_command and then the nested COPY > command also does calls pgstat_progress_start_command. > > > I think that can be implemented by moving the progress related fields from > > PgBackendStatus into a new structure and by teaching the backend to insert a > > new instance of that structure into a shared hash table (dshash.c) > > I think this is a good idea in general to move the backend progress to > shared memory. > and with a new API that will deal with scenarios as described above. > 1/ an (explicit) nested > command was started by a top-level command, such as the COPY case above. > 2/ a top-level command triggered some other progress code implicitly, such as > CLUSTER triggering CREATE INDEX code. Yes, I mean a new API. I imagine pgstat_progress_start_command() to initialize the shared memory to track the "nested" command and to put the existing value of MyBEEntry onto a stack. pgstat_progress_end_command() would then restore the original value. However, special care is needed for [2] because that's not necessarily nesting: consider merge-joining two foreign tables, both using file_fdw. In this case the pointers to the progress tracking shared memory would probably have to be passed as function arguments. (Note that the current progress tracking system also uses shared memory, however in a less flexible way.) > I also like the shared memory approach because we can then not have to use > a message like the one introduced in f1889729dd3ab0 to support parallel index > vacuum progress 46ebdfe164c61. I didn't know about these patches. I'm not sure though if this needs to be removed. Even if each worker updated the progress information separately (would users appreciate that?), it should still send the progress information to the leader before it (i.e. the worker) exits. -- Antonin Houska Web: https://www.cybertec-postgresql.com
> pgstat_progress_start_command() is called twice: First with > cmdtype=PROGRESS_COMMAND_CLUSTER, second with > PROGRESS_COMMAND_CREATE_INDEX. The first happens in cluster_rel() the second > in cluster_rel() -> rebuild_relation() -> finish_heap_swap() -> > reindex_relation() -> reindex_index(). > > It does not matter though, the current design only expects one command. When I set a breakpoint on pgstat_progress_start_command and ran a CLUSTER on a table with a few indexes, I only saw start_command being called once for PROGRESS_COMMAND_CLUSTER. Then I went back in the code and realized that reindex_index when called via the CLUSTER command has progress set to false. So as it stands now, only PROGRESS_COMMAND_CLUSTER is effectively started when CLUSTER is called. ``` if (progress) { const int progress_cols[] = { PROGRESS_CREATEIDX_COMMAND, PROGRESS_CREATEIDX_INDEX_OID }; const int64 progress_vals[] = { PROGRESS_CREATEIDX_COMMAND_REINDEX, indexId }; pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId); pgstat_progress_update_multi_param(2, progress_cols, progress_vals); } ``` -- Sami
Sami Imseih <samimseih@gmail.com> wrote: > > pgstat_progress_start_command() is called twice: First with > > cmdtype=PROGRESS_COMMAND_CLUSTER, second with > > PROGRESS_COMMAND_CREATE_INDEX. The first happens in cluster_rel() the second > > in cluster_rel() -> rebuild_relation() -> finish_heap_swap() -> > > reindex_relation() -> reindex_index(). > > > > It does not matter though, the current design only expects one command. > > When I set a breakpoint on pgstat_progress_start_command and ran a > CLUSTER on a table with a few indexes, I only saw start_command being > called once for PROGRESS_COMMAND_CLUSTER. Then I went back in the > code and realized that reindex_index when called via the CLUSTER command > has progress set to false. So as it stands now, only PROGRESS_COMMAND_CLUSTER > is effectively started when CLUSTER is called. > > ``` > if (progress) > { > const int progress_cols[] = { > PROGRESS_CREATEIDX_COMMAND, > PROGRESS_CREATEIDX_INDEX_OID > }; > const int64 progress_vals[] = { > PROGRESS_CREATEIDX_COMMAND_REINDEX, > indexId > }; > > pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, > heapId); > pgstat_progress_update_multi_param(2, progress_cols, progress_vals); > } Ah, got it. So for now the REPACK CONCURRENTLY patch only needs to turn off the progress reporting for PROGRESS_COMMAND_CREATE_INDEX properly. Thanks for checking! -- Antonin Houska Web: https://www.cybertec-postgresql.com