Thread: Progress report of CREATE INDEX for nested partitioned tables
Hi, I have noticed that progress reporting for CREATE INDEX of partitioned tables seems to be working poorly for nested partitioned tables. In particular, it overwrites total and done partitions count when it recurses down to child partitioned tables and it only reports top-level partitions. So it's not hard to see something like this during CREATE INDEX now: postgres=# select partitions_total, partitions_done from pg_stat_progress_create_index ; partitions_total | partitions_done ------------------+----------------- 1 | 2 (1 row) I changed current behaviour to report the total number of partitions in the inheritance tree and fixed recursion in the attached patch. I used a static variable to keep the counter to avoid ABI breakage of DefineIndex, so that we could backpatch this to previous versions. Thanks, Ilya Gladyshev
Attachment
On Sat, Dec 10, 2022 at 12:18:32PM +0400, Ilya Gladyshev wrote: > Hi, > > I have noticed that progress reporting for CREATE INDEX of partitioned > tables seems to be working poorly for nested partitioned tables. In > particular, it overwrites total and done partitions count when it > recurses down to child partitioned tables and it only reports top-level > partitions. So it's not hard to see something like this during CREATE > INDEX now: > > postgres=# select partitions_total, partitions_done from > pg_stat_progress_create_index ; > partitions_total | partitions_done > ------------------+----------------- > 1 | 2 > (1 row) Yeah. I didn't verify, but it looks like this is a bug going to back to v12. As you said, when called recursively, DefineIndex() clobbers the number of completed partitions. Maybe DefineIndex() could flatten the list of partitions. But I don't think that can work easily with iteration rather than recursion. Could you check what I've written as a counter-proposal ? As long as we're changing partitions_done to include nested sub-partitions, it seems to me like we should exclude intermediate "catalog-only" partitioned indexes, and count only physical leaf partitions. Should it alo exclude any children with matching indexes, which will also be catalog-only changes? Probably not. The docs say: |When creating an index on a partitioned table, this column is set to the |total number of partitions on which the index is to be created. This |field is 0 during a REINDEX. > I changed current behaviour to report the total number of partitions in > the inheritance tree and fixed recursion in the attached patch. I used > a static variable to keep the counter to avoid ABI breakage of > DefineIndex, so that we could backpatch this to previous versions. I wrote a bunch of assertions for this, which seems to have uncovered an similar issue with COPY progress reporting, dating to 8a4f618e7. I'm not sure the assertions are okay. I imagine they may break other extensions, as with file_fdw. -- Justin
Attachment
Could you check what I've written as a counter-proposal ?
I think that this might be a good solution to start with, it gives us the opportunity to improve the granularity later without any surprising changes for the end user. We could use this patch for previous versions and make more granular output in the latest. What do you think?
As long as we're changing partitions_done to include nested
sub-partitions, it seems to me like we should exclude intermediate
"catalog-only" partitioned indexes, and count only physical leaf
partitions. Should it alo exclude any children with matching indexes,
which will also be catalog-only changes? Probably not.
The docs say:
|When creating an index on a partitioned table, this column is set to the
|total number of partitions on which the index is to be created. This
|field is 0 during a REINDEX.
I agree with you on catalog-only partitioned indexes, but I think that in the perfect world we should exclude all the relations where the index isn’t actually created, so that means excluding attached indexes as well. However, IMO doing it this way will require too much of a code rewrite for quite a minor feature (but we could do it, ofc). I actually think that the progress view would be better off without the total number of partitions, but I’m not sure we have this option now. With this in mind, I think your proposal to exclude catalog-only indexes sounds reasonable to me, but I feel like the docs are off in this case, because the attached indexes are not created, but we pretend like they are in this metric, so we should fix one or the other.
I changed current behaviour to report the total number of partitions in
the inheritance tree and fixed recursion in the attached patch. I used
a static variable to keep the counter to avoid ABI breakage of
DefineIndex, so that we could backpatch this to previous versions.
I wrote a bunch of assertions for this, which seems to have uncovered an
similar issue with COPY progress reporting, dating to 8a4f618e7. I'm
not sure the assertions are okay. I imagine they may break other
extensions, as with file_fdw.
--
Justin
<0001-fix-progress-reporting-of-nested-partitioned-indexes.patch>
On Mon, Dec 12, 2022 at 11:39:23PM +0400, Ilya Gladyshev wrote: > > > Could you check what I've written as a counter-proposal ? > > I think that this might be a good solution to start with, it gives us the opportunity to improve the granularity laterwithout any surprising changes for the end user. We could use this patch for previous versions and make more granularoutput in the latest. What do you think? Somehow, it hadn't occured to me that my patch "lost granularity" by incrementing the progress bar by more than one... Shoot. > I actually think that the progress view would be better off without the total number of partitions, Just curious - why ? > With this in mind, I think your proposal to exclude catalog-only indexes sounds reasonable to me, but I feel like the docsare off in this case, because the attached indexes are not created, but we pretend like they are in this metric, so weshould fix one or the other. I agree that the docs should indicate whether we're counting "all partitions", "direct partitions", and whether or not that includes partitioned partitions, or just leaf partitions. I have another proposal: since the original patch 3.5 years ago didn't consider or account for sub-partitions, let's not start counting them now. It was never defined whether they were included or not (and I guess that they're not common) so we can take this opportunity to clarify the definition. Alternately, if it's okay to add nparts_done to the IndexStmt, then that's easy. -- Justin
Attachment
On Mon, 2022-12-12 at 22:43 -0600, Justin Pryzby wrote: > On Mon, Dec 12, 2022 at 11:39:23PM +0400, Ilya Gladyshev wrote: > > > > > Could you check what I've written as a counter-proposal ? > > > > I think that this might be a good solution to start with, it gives > > us the opportunity to improve the granularity later without any > > surprising changes for the end user. We could use this patch for > > previous versions and make more granular output in the latest. What > > do you think? > > Somehow, it hadn't occured to me that my patch "lost granularity" by > incrementing the progress bar by more than one... Shoot. > > > I actually think that the progress view would be better off without > > the total number of partitions, > > Just curious - why ? We don't really know how many indexes we are going to create, unless we have some kind of preliminary "planning" stage where we acumulate all the relations that will need to have indexes created (rather than attached). And if someone wants the total, it can be calculated manually without this view, it's less user-friendly, but if we can't do it well, I would leave it up to the user. > > > With this in mind, I think your proposal to exclude catalog-only > > indexes sounds reasonable to me, but I feel like the docs are off > > in this case, because the attached indexes are not created, but we > > pretend like they are in this metric, so we should fix one or the > > other. > > I agree that the docs should indicate whether we're counting "all > partitions", "direct partitions", and whether or not that includes > partitioned partitions, or just leaf partitions. Agree. I think that docs should also be explicit about the attached indexes, if we decide to count them in as "created". > I have another proposal: since the original patch 3.5 years ago > didn't > consider or account for sub-partitions, let's not start counting them > now. It was never defined whether they were included or not (and I > guess that they're not common) so we can take this opportunity to > clarify the definition. I have had this thought initially, but then I thought that it's not what I would want, if I was to track progress of multi-level partitioned tables (but yeah, I guess it's pretty uncommon). In this respect, I like your initial counter-proposal more, because it leaves us room to improve this in the future. Otherwise, if we commit to reporting only top-level partitions now, I'm not sure we will have the opportunity to change this. > Alternately, if it's okay to add nparts_done to the IndexStmt, then > that's easy. Yeah, or we could add another argument to DefineIndex. I don't know if it's ok, or which option is better here in terms of compatibility and interface-wise, so I have tried both of them, see the attached patches.
Attachment
On Tue, Dec 13, 2022 at 10:18:58PM +0400, Ilya Gladyshev wrote: > > > I actually think that the progress view would be better off without > > > the total number of partitions, > > > > Just curious - why ? > > We don't really know how many indexes we are going to create, unless we > have some kind of preliminary "planning" stage where we acumulate all > the relations that will need to have indexes created (rather than > attached). And if someone wants the total, it can be calculated > manually without this view, it's less user-friendly, but if we can't do > it well, I would leave it up to the user. Thanks. One other reason is that the partitions (and sub-partitions) may not be equally sized. Also, I've said before that it's weird to report macroscopic progress about the number of partitions finihed in the same place as reporting microscopic details like the number of blocks done of the relation currently being processed. > > I have another proposal: since the original patch 3.5 years ago > > didn't > > consider or account for sub-partitions, let's not start counting them > > now. It was never defined whether they were included or not (and I > > guess that they're not common) so we can take this opportunity to > > clarify the definition. > > I have had this thought initially, but then I thought that it's not > what I would want, if I was to track progress of multi-level > partitioned tables (but yeah, I guess it's pretty uncommon). In this > respect, I like your initial counter-proposal more, because it leaves > us room to improve this in the future. Otherwise, if we commit to > reporting only top-level partitions now, I'm not sure we will have the > opportunity to change this. We have the common problem of too many patches. https://www.postgresql.org/message-id/a15f904a70924ffa4ca25c3c744cff31e0e6e143.camel%40gmail.com This changes the progress reporting to show indirect children as "total", and adds a global variable to track recursion into DefineIndex(), allowing it to be incremented without the value being lost to the caller. https://www.postgresql.org/message-id/20221211063334.GB27893%40telsasoft.com This also counts indirect children, but only increments the progress reporting in the parent. This has the disadvantage that when intermediate partitions are in use, the done_partitions counter will "jump" from (say) 20 to 30 without ever hitting 21-29. https://www.postgresql.org/message-id/20221213044331.GJ27893%40telsasoft.com This has two alternate patches: - One patch changes to only update progress reporting of *direct* children. This is minimal, but discourages any future plan to track progress involving intermediate partitions with finer granularity. - A alternative patch adds IndexStmt.nparts_done, and allows reporting fine-grained progress involving intermediate partitions. https://www.postgresql.org/message-id/flat/039564d234fc3d014c555a7ee98be69a9e724836.camel@gmail.com This also reports progress of intermediate children. The first patch does it by adding an argument to DefineIndex() (which isn't okay to backpatch). And an alternate patch does it by adding to IndexStmt. @committers: Is it okay to add nparts_done to IndexStmt ? -- Justin
On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote: > We have the common problem of too many patches. > > https://www.postgresql.org/message-id/a15f904a70924ffa4ca25c3c744cff31e0e6e143.camel%40gmail.com > This changes the progress reporting to show indirect children as > "total", and adds a global variable to track recursion into > DefineIndex(), allowing it to be incremented without the value being > lost to the caller. > > https://www.postgresql.org/message-id/20221211063334.GB27893%40telsasoft.com > This also counts indirect children, but only increments the progress > reporting in the parent. This has the disadvantage that when > intermediate partitions are in use, the done_partitions counter will > "jump" from (say) 20 to 30 without ever hitting 21-29. > > https://www.postgresql.org/message-id/20221213044331.GJ27893%40telsasoft.com > This has two alternate patches: > - One patch changes to only update progress reporting of *direct* > children. This is minimal, but discourages any future plan to track > progress involving intermediate partitions with finer granularity. > - A alternative patch adds IndexStmt.nparts_done, and allows reporting > fine-grained progress involving intermediate partitions. > > https://www.postgresql.org/message-id/flat/039564d234fc3d014c555a7ee98be69a9e724836.camel@gmail.com > This also reports progress of intermediate children. The first patch > does it by adding an argument to DefineIndex() (which isn't okay to > backpatch). And an alternate patch does it by adding to IndexStmt. > > @committers: Is it okay to add nparts_done to IndexStmt ? Any hint about this ? This should be resolved before the "CIC on partitioned tables" patch, which I think is otherwise done.
On Sun, 2023-01-08 at 10:48 -0600, Justin Pryzby wrote: > On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote: > > We have the common problem of too many patches. > > > > https://www.postgresql.org/message-id/a15f904a70924ffa4ca25c3c744cff31e0e6e143.camel%40gmail.com > > This changes the progress reporting to show indirect children as > > "total", and adds a global variable to track recursion into > > DefineIndex(), allowing it to be incremented without the value > > being > > lost to the caller. > > > > https://www.postgresql.org/message-id/20221211063334.GB27893%40telsasoft.com > > This also counts indirect children, but only increments the > > progress > > reporting in the parent. This has the disadvantage that when > > intermediate partitions are in use, the done_partitions counter > > will > > "jump" from (say) 20 to 30 without ever hitting 21-29. > > > > https://www.postgresql.org/message-id/20221213044331.GJ27893%40telsasoft.com > > This has two alternate patches: > > - One patch changes to only update progress reporting of *direct* > > children. This is minimal, but discourages any future plan to > > track > > progress involving intermediate partitions with finer > > granularity. > > - A alternative patch adds IndexStmt.nparts_done, and allows > > reporting > > fine-grained progress involving intermediate partitions. > > > > https://www.postgresql.org/message-id/flat/039564d234fc3d014c555a7ee98be69a9e724836.camel@gmail.com > > This also reports progress of intermediate children. The first > > patch > > does it by adding an argument to DefineIndex() (which isn't okay to > > backpatch). And an alternate patch does it by adding to IndexStmt. > > > > @committers: Is it okay to add nparts_done to IndexStmt ? > > Any hint about this ? > > This should be resolved before the "CIC on partitioned tables" patch, > which I think is otherwise done. I suggest that we move on with the IndexStmt patch and see what the committers have to say about it. I have brushed the patch up a bit, fixing TODOs and adding docs as per our discussion above.
Attachment
On 1/9/23 09:44, Ilya Gladyshev wrote: > On Sun, 2023-01-08 at 10:48 -0600, Justin Pryzby wrote: >> On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote: >>> ... >>> >>> @committers: Is it okay to add nparts_done to IndexStmt ? >> >> Any hint about this ? >> AFAIK fields added at the end of a struct is seen as acceptable from the ABI point of view. It's not risk-free, but we did that multiple times when fixing bugs, IIRC. The primary risk is old extensions (built on older minor version) running on new server, getting confused by new fields (and implied shifts in the structs). But fields at the end should be safe - the extension simply ignores the stuff at the end. The one problem would be arrays of structs, because even a field at the end changes the array stride. But I don't think we do that with IndexStmt ... Of course, if the "old" extension itself allocates the struct and passes it to core code, that might still be an issue, because it'll allocate a smaller struct, and core might see bogus data at the end. On the other hand, new extensions on old server may get confused too, because it may try setting a field that does not exist. So ultimately it's about weighing risks vs. benefits - evaluating whether fixing the issue is actually worth it. The question is if/how many such extensions messing with IndexStmt in this way actually exist. That is, allocate IndexStmt (or array of it). I haven't found any, but maybe some extensions for index or partition management do it? Not sure. But ... Do we actually need the new parts_done field? I mean, we already do track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the st_progress_param array. Can't we simply read it from there? Then we would not have ABI issues with the new field added to IndexStmt. >> This should be resolved before the "CIC on partitioned tables" patch, >> which I think is otherwise done. > > I suggest that we move on with the IndexStmt patch and see what the > committers have to say about it. I have brushed the patch up a bit, > fixing TODOs and adding docs as per our discussion above. > I did take a look at the patch, so here are my 2c: 1) num_leaf_partitions says it's "excluding foreign tables" but then it uses RELKIND_HAS_STORAGE() which excludes various others relkinds, e.g. partitioned tables etc. Minor, but perhaps a bit confusing. 2) I'd probably say count_leaf_partitions() instead. 3) The new part in DefineIndex counting leaf partitions should have a comment before if (!OidIsValid(parentIndexId)) { ... } 4) It's a bit confusing that one of the branches in DefineIndex just sets stmt->parts_done without calling pgstat_progress_update_param (while the other one does both). AFAICS the call is not needed because we already updated it during the recursive DefineIndex call, but maybe the comment should mention that? As for the earlier discussion about the "correct" behavior for leaf vs. non-leaf partitions and whether to calculate partitions in advance: * I agree it's desirable to count partitions in advance, instead of adding incrementally. The view is meant to provide "overview" of the CREATE INDEX progress, and imagine you get partitions_total partitions_done 10 9 so that you believe you're ~90% done. But then it jumps to the next child and now you get partitions_total partitions_done 20 10 which makes the view a bit useless for it's primary purpose, IMHO. * I don't care very much about leaf vs. non-leaf partitions. If we exclude non-leaf ones, fine with me. But the number of non-leaf ones should be much smaller than leaf ones, and if the partition already has a matching index that distorts the tracking too. Furthermore the partitions may have different size etc. so the progress is only approximate anyway. I wonder if we could improve this to track the size of partitions for total/done? That'd make leaf/non-leaf distinction unnecessary, because non-leaf partitions have size 0. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 17, 2023 at 08:44:36PM +0100, Tomas Vondra wrote: > On 1/9/23 09:44, Ilya Gladyshev wrote: > > On Sun, 2023-01-08 at 10:48 -0600, Justin Pryzby wrote: > >> On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote: > >>> ... > >>> > >>> @committers: Is it okay to add nparts_done to IndexStmt ? > >> > >> Any hint about this ? > >> > > AFAIK fields added at the end of a struct is seen as acceptable from the > ABI point of view. It's not risk-free, but we did that multiple times > when fixing bugs, IIRC. My question isn't whether it's okay to add a field at the end of a struct in general, but rather whether it's acceptable to add this field at the end of this struct, and not because it's unsafe to do in a minor release, but whether someone is going to say that it's an abuse of the data structure. > Do we actually need the new parts_done field? I mean, we already do > track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the > st_progress_param array. Can't we simply read it from there? Then we > would not have ABI issues with the new field added to IndexStmt. Good idea to try. > As for the earlier discussion about the "correct" behavior for leaf vs. > non-leaf partitions and whether to calculate partitions in advance: > > * I agree it's desirable to count partitions in advance, instead of > adding incrementally. The view is meant to provide "overview" of the > CREATE INDEX progress, and imagine you get > > partitions_total partitions_done > 10 9 > > so that you believe you're ~90% done. But then it jumps to the next > child and now you get > > partitions_total partitions_done > 20 10 > > which makes the view a bit useless for it's primary purpose, IMHO. To be clear, that's the current, buggy behavior, and this thread is about fixing it. The proposed patches all ought to avoid that. But the bug isn't caused by not "calculating partitions in advance". Rather, the issue is that currently, the "total" is overwritten while recursing. That's a separate question from whether indirect partitions are counted or not. > I wonder if we could improve this to track the size of partitions for > total/done? That'd make leaf/non-leaf distinction unnecessary, because > non-leaf partitions have size 0. Maybe, but it's out of scope for this patch. Thanks for looking. -- Justin
On 1/17/23 21:59, Justin Pryzby wrote: > On Tue, Jan 17, 2023 at 08:44:36PM +0100, Tomas Vondra wrote: >> On 1/9/23 09:44, Ilya Gladyshev wrote: >>> On Sun, 2023-01-08 at 10:48 -0600, Justin Pryzby wrote: >>>> On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote: >>>>> ... >>>>> >>>>> @committers: Is it okay to add nparts_done to IndexStmt ? >>>> >>>> Any hint about this ? >>>> >> >> AFAIK fields added at the end of a struct is seen as acceptable from the >> ABI point of view. It's not risk-free, but we did that multiple times >> when fixing bugs, IIRC. > > My question isn't whether it's okay to add a field at the end of a > struct in general, but rather whether it's acceptable to add this field > at the end of this struct, and not because it's unsafe to do in a minor > release, but whether someone is going to say that it's an abuse of the > data structure. > Ah, you mean whether it's the right place for the parameter? I don't think it is, really. IndexStmt is meant to be a description of the CREATE INDEX statement, not something that includes info about how it's processed. But it's the only struct we pass to the DefineIndex for child indexes, so the only alternatives I can think of is a global variable and the (existing) param array field. Nevertheless, ABI compatibility is still relevant for backbranches. >> Do we actually need the new parts_done field? I mean, we already do >> track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the >> st_progress_param array. Can't we simply read it from there? Then we >> would not have ABI issues with the new field added to IndexStmt. > > Good idea to try. > OK >> As for the earlier discussion about the "correct" behavior for leaf vs. >> non-leaf partitions and whether to calculate partitions in advance: >> >> * I agree it's desirable to count partitions in advance, instead of >> adding incrementally. The view is meant to provide "overview" of the >> CREATE INDEX progress, and imagine you get >> >> partitions_total partitions_done >> 10 9 >> >> so that you believe you're ~90% done. But then it jumps to the next >> child and now you get >> >> partitions_total partitions_done >> 20 10 >> >> which makes the view a bit useless for it's primary purpose, IMHO. > > To be clear, that's the current, buggy behavior, and this thread is > about fixing it. The proposed patches all ought to avoid that. > > But the bug isn't caused by not "calculating partitions in advance". > Rather, the issue is that currently, the "total" is overwritten while > recursing. > You're right the issue us about overwriting the total - not sure what I was thinking about when writing this. I guess I got distracted by the discussion about "preliminary planning" etc. Sorry for the confusion. > That's a separate question from whether indirect partitions are counted > or not. > >> I wonder if we could improve this to track the size of partitions for >> total/done? That'd make leaf/non-leaf distinction unnecessary, because >> non-leaf partitions have size 0. > > Maybe, but it's out of scope for this patch. > +1, it was just an idea for future. -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
TBH, I think the best approach is what I did in: 0001-report-top-parent-progress-for-CREATE-INDEX.txt That's a minimal patch, ideal for backpatching. ..which defines/clarifies that the progress reporting is only for *direct* children. That avoids the need to change any data structures, and it's what was probably intended by the original patch, which doesn't seem to have considered intermediate partitioned tables. I think it'd be fine to re-define that in some future release, to allow showing indirect children (probably only "leaves", and not intermediate partitioned tables). Or "total_bytes" or other global progress. -- Justin
On Wed, 18 Jan 2023 at 15:25, Justin Pryzby <pryzby@telsasoft.com> wrote: > > TBH, I think the best approach is what I did in: > 0001-report-top-parent-progress-for-CREATE-INDEX.txt > > That's a minimal patch, ideal for backpatching. > > ..which defines/clarifies that the progress reporting is only for > *direct* children. That avoids the need to change any data structures, > and it's what was probably intended by the original patch, which doesn't > seem to have considered intermediate partitioned tables. > > I think it'd be fine to re-define that in some future release, to allow > showing indirect children (probably only "leaves", and not intermediate > partitioned tables). Or "total_bytes" or other global progress. > Hmm. My expectation as a user is that partitions_total includes both direct and indirect (leaf) child partitions, that it is set just once at the start of the process, and that partitions_done increases from zero to partitions_total as the index-build proceeds. I think that should be achievable with a minimally invasive patch that doesn't change any data structures. I agree with all the review comments Tomas posted. In particular, this shouldn't need any changes to IndexStmt. I think the best approach would be to just add a new function to backend_progress.c that offsets a specified progress parameter by a specified amount, so that you can just increment partitions_done by one or more, at the appropriate points. Regards, Dean
> 17 янв. 2023 г., в 23:44, Tomas Vondra <tomas.vondra@enterprisedb.com> написал(а): > Do we actually need the new parts_done field? I mean, we already do > track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the > st_progress_param array. Can't we simply read it from there? Then we > would not have ABI issues with the new field added to IndexStmt. I think it’s a good approach and it could be useful outside of scope of this patch too. So I have attached a patch, thatintroduces pgstat_progress_incr_param function for this purpose. There’s one thing I am not sure about, IIUC, we canassume that the only process that can write into MyBEEntry of the current backend is the current backend itself, thereforelooping to get consistent reads from this array is not required. Please correct me, if I am wrong here.
Attachment
On Tue, Jan 31, 2023 at 07:32:20PM +0400, Ilya Gladyshev wrote: > > 17 янв. 2023 г., в 23:44, Tomas Vondra <tomas.vondra@enterprisedb.com> написал(а): > > Do we actually need the new parts_done field? I mean, we already do > > track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the > > st_progress_param array. Can't we simply read it from there? Then we > > would not have ABI issues with the new field added to IndexStmt. > > I think it’s a good approach and it could be useful outside of scope of this patch too. So I have attached a patch, thatintroduces pgstat_progress_incr_param function for this purpose. There’s one thing I am not sure about, IIUC, we canassume that the only process that can write into MyBEEntry of the current backend is the current backend itself, thereforelooping to get consistent reads from this array is not required. Please correct me, if I am wrong here. Thanks for the updated patch. I think you're right - pgstat_begin_read_activity() is used for cases involving other backends. It ought to be safe for a process to read its own status bits, since we know they're not also being written. You changed DefineIndex() to update progress for the leaf indexes' when called recursively. The caller updates the progress for "attached" indexes, but not created ones. That allows providing fine-granularity progress updates when using intermediate partitions, right ? (Rather than updating the progress by more than one at a time in the case of intermediate partitioning). If my understanding is right, that's subtle, and adds a bit of complexity to the current code, so could use careful commentary. I suggest: * If the index was attached, update progress for all its direct and * indirect leaf indexes all at once. If the index was built by calling * DefineIndex() recursively, the called function is responsible for * updating the progress report for built indexes. ... * If this is the top-level index, we're done. When called recursively * for child tables, the done partition counter is incremented now, * rather than in the caller. I guess you know that there were compiler warnings (related to your question). https://cirrus-ci.com/task/6571212386598912 pgstat_progress_incr_param() could call pgstat_progress_update_param() rather than using its own Assert() and WRITE_ACTIVITY calls. I'm not sure which I prefer, though. Also, there are whitespace/tab/style issues in pgstat_progress_incr_param(). -- Justin
> 1 февр. 2023 г., в 08:29, Justin Pryzby <pryzby@telsasoft.com> написал(а): > > On Tue, Jan 31, 2023 at 07:32:20PM +0400, Ilya Gladyshev wrote: >>> 17 янв. 2023 г., в 23:44, Tomas Vondra <tomas.vondra@enterprisedb.com> написал(а): >>> Do we actually need the new parts_done field? I mean, we already do >>> track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the >>> st_progress_param array. Can't we simply read it from there? Then we >>> would not have ABI issues with the new field added to IndexStmt. >> >> I think it’s a good approach and it could be useful outside of scope of this patch too. So I have attached a patch, thatintroduces pgstat_progress_incr_param function for this purpose. There’s one thing I am not sure about, IIUC, we canassume that the only process that can write into MyBEEntry of the current backend is the current backend itself, thereforelooping to get consistent reads from this array is not required. Please correct me, if I am wrong here. > > Thanks for the updated patch. > > I think you're right - pgstat_begin_read_activity() is used for cases > involving other backends. It ought to be safe for a process to read its > own status bits, since we know they're not also being written. > > You changed DefineIndex() to update progress for the leaf indexes' when > called recursively. The caller updates the progress for "attached" > indexes, but not created ones. That allows providing fine-granularity > progress updates when using intermediate partitions, right ? (Rather > than updating the progress by more than one at a time in the case of > intermediate partitioning). > > If my understanding is right, that's subtle, and adds a bit of > complexity to the current code, so could use careful commentary. I > suggest: > > * If the index was attached, update progress for all its direct and > * indirect leaf indexes all at once. If the index was built by calling > * DefineIndex() recursively, the called function is responsible for > * updating the progress report for built indexes. > > ... > > * If this is the top-level index, we're done. When called recursively > * for child tables, the done partition counter is incremented now, > * rather than in the caller. Yes, you are correct about the intended behavior, I added your comments to the patch. > I guess you know that there were compiler warnings (related to your > question). > https://cirrus-ci.com/task/6571212386598912 > > pgstat_progress_incr_param() could call pgstat_progress_update_param() > rather than using its own Assert() and WRITE_ACTIVITY calls. I'm not > sure which I prefer, though. > > Also, there are whitespace/tab/style issues in > pgstat_progress_incr_param(). > > -- > Justin Thank you for the review, I fixed the aforementioned issues in the v2.
Attachment
Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache lookup for every single element therein ... this sounds slow. In one of the callsites, we already have the partition descriptor available. We could just scan partdesc->is_leaf[] and add one for each 'true' value we see there. In the other callsite, we had the table open just a few lines before the place you call count_leaf_partitions. Maybe we can rejigger things by examining its state before closing it: if relkind is not partitioned we know leaf_partitions=0, and only if partitioned we count leaf partitions. I think that would save some work. I also wonder if it's worth writing a bespoke function for counting leaf partitions rather than relying on find_all_inheritors. I think there's probably not much point optimizing it further than that. If there was, then we could think about creating a data representation that we can build for the entire partitioning hierarchy in a single pass with the count of leaf partitions that sit below each specific non-leaf; but I think that's just over-engineering. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "After a quick R of TFM, all I can say is HOLY CR** THAT IS COOL! PostgreSQL was amazing when I first started using it at 7.2, and I'm continually astounded by learning new features and techniques made available by the continuing work of the development team." Berend Tober, http://archives.postgresql.org/pgsql-hackers/2007-08/msg01009.php
> 1 февр. 2023 г., в 16:01, Alvaro Herrera <alvherre@alvh.no-ip.org> написал(а): > > Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache > lookup for every single element therein ... this sounds slow. > > In one of the callsites, we already have the partition descriptor > available. We could just scan partdesc->is_leaf[] and add one for each > 'true' value we see there. The problem is that partdesc contains only direct children of the table and we need all the children down the inheritancetree to count the total number of leaf partitions in the first callsite. > In the other callsite, we had the table open just a few lines before the > place you call count_leaf_partitions. Maybe we can rejigger things by > examining its state before closing it: if relkind is not partitioned we > know leaf_partitions=0, and only if partitioned we count leaf partitions. > I think that would save some work. I also wonder if it's worth writing > a bespoke function for counting leaf partitions rather than relying on > find_all_inheritors. Sure, added this condition to avoid the extra work here.
Attachment
On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev <ilya.v.gladyshev@gmail.com> wrote: > > > 1 февр. 2023 г., в 16:01, Alvaro Herrera <alvherre@alvh.no-ip.org> написал(а): > > > > Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache > > lookup for every single element therein ... this sounds slow. > > > > In one of the callsites, we already have the partition descriptor > > available. We could just scan partdesc->is_leaf[] and add one for each > > 'true' value we see there. > > The problem is that partdesc contains only direct children of the table and we need all the children down the inheritancetree to count the total number of leaf partitions in the first callsite. > > > In the other callsite, we had the table open just a few lines before the > > place you call count_leaf_partitions. Maybe we can rejigger things by > > examining its state before closing it: if relkind is not partitioned we > > know leaf_partitions=0, and only if partitioned we count leaf partitions. > > I think that would save some work. I also wonder if it's worth writing > > a bespoke function for counting leaf partitions rather than relying on > > find_all_inheritors. > > Sure, added this condition to avoid the extra work here. > > When creating an index on a partitioned table, this column is set to > - the total number of partitions on which the index is to be created. > + the total number of leaf partitions on which the index is to be created or attached. I think we should also add a note about the (now) non-constant nature of the value, something along the lines of "This value is updated as we're processing and discovering partitioned tables in the partition hierarchy". Kind regards, Matthias van de Meent
On Wed, Feb 01, 2023 at 04:21:35PM +0100, Matthias van de Meent wrote: > On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev <ilya.v.gladyshev@gmail.com> wrote: > > > 1 февр. 2023 г., в 16:01, Alvaro Herrera <alvherre@alvh.no-ip.org> написал(а): > > > Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache > > > lookup for every single element therein ... this sounds slow. > > > > > > In one of the callsites, we already have the partition descriptor > > > available. We could just scan partdesc->is_leaf[] and add one for each > > > 'true' value we see there. > > > > The problem is that partdesc contains only direct children of the table and we need all the children down the inheritancetree to count the total number of leaf partitions in the first callsite. > > > > > In the other callsite, we had the table open just a few lines before the > > > place you call count_leaf_partitions. Maybe we can rejigger things by > > > examining its state before closing it: if relkind is not partitioned we > > > know leaf_partitions=0, and only if partitioned we count leaf partitions. > > > I think that would save some work. I also wonder if it's worth writing > > > a bespoke function for counting leaf partitions rather than relying on > > > find_all_inheritors. > > > > Sure, added this condition to avoid the extra work here. > > > > > When creating an index on a partitioned table, this column is set to > > - the total number of partitions on which the index is to be created. > > + the total number of leaf partitions on which the index is to be created or attached. > > I think we should also add a note about the (now) non-constant nature > of the value, something along the lines of "This value is updated as > we're processing and discovering partitioned tables in the partition > hierarchy". But the TOTAL is constant, right ? Updating the total when being called recursively is the problem these patches fix. -- Justin
On Wed, 1 Feb 2023 at 16:53, Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Wed, Feb 01, 2023 at 04:21:35PM +0100, Matthias van de Meent wrote: > > On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev <ilya.v.gladyshev@gmail.com> wrote: > > > > 1 февр. 2023 г., в 16:01, Alvaro Herrera <alvherre@alvh.no-ip.org> написал(а): > > > > Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache > > > > lookup for every single element therein ... this sounds slow. > > > > > > > > In one of the callsites, we already have the partition descriptor > > > > available. We could just scan partdesc->is_leaf[] and add one for each > > > > 'true' value we see there. > > > > > > The problem is that partdesc contains only direct children of the table and we need all the children down the inheritancetree to count the total number of leaf partitions in the first callsite. > > > > > > > In the other callsite, we had the table open just a few lines before the > > > > place you call count_leaf_partitions. Maybe we can rejigger things by > > > > examining its state before closing it: if relkind is not partitioned we > > > > know leaf_partitions=0, and only if partitioned we count leaf partitions. > > > > I think that would save some work. I also wonder if it's worth writing > > > > a bespoke function for counting leaf partitions rather than relying on > > > > find_all_inheritors. > > > > > > Sure, added this condition to avoid the extra work here. > > > > > > > > When creating an index on a partitioned table, this column is set to > > > - the total number of partitions on which the index is to be created. > > > + the total number of leaf partitions on which the index is to be created or attached. > > > > I think we should also add a note about the (now) non-constant nature > > of the value, something along the lines of "This value is updated as > > we're processing and discovering partitioned tables in the partition > > hierarchy". > > But the TOTAL is constant, right ? Updating the total when being called > recursively is the problem these patches fix. If that's the case, then I'm not seeing the 'fix' part of the patch. I thought this patch was fixing the provably incorrect TOTAL value where DONE > TOTAL due to the recursive operation overwriting the DONE/TOTAL values instead of updating them. In HEAD we set TOTAL to whatever number partitioned table we're currently processing has - regardless of whether we're the top level statement. With the patch we instead add the number of child relations to that count, for which REL_HAS_STORAGE(child) -- or at least, in the v3 posted by Ilya. Approximately immediately after updating that count we recurse to the child relations, and that only returns once it is done creating the indexes, so both TOTAL and DONE go up as we process more partitions in the hierarchy. An example hierarchy: CREATE TABLE parent (a, b) partition by list (a); CREATE TABLE a1 PARTITION OF parent FOR VALUES IN (1) PARTITION BY LIST (b); CREATE TABLE a1bd PARTITION OF a1 DEFAULT; CREATE TABLE a2 PARTITION OF parent FOR VALUES IN (2) PARTITION BY LIST (b); CREATE TABLE a2bd PARTITION OF a2 DEFAULT; INSERT INTO parent (a, b) SELECT * from generate_series(1, 2) a(a) cross join generate_series(1, 100000),b(b); CREATE INDEX ON parent(a,b); This will only discover that a2bd will need to be indexed after a1bd is done (or vice versa, depending on which order a1 and a2 are processed in the ). Kind regards, Matthias van de Meent
1 февр. 2023 г., в 20:27, Matthias van de Meent <boekewurm+postgres@gmail.com> написал(а):On Wed, 1 Feb 2023 at 16:53, Justin Pryzby <pryzby@telsasoft.com> wrote:
On Wed, Feb 01, 2023 at 04:21:35PM +0100, Matthias van de Meent wrote:On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev <ilya.v.gladyshev@gmail.com> wrote:1 февр. 2023 г., в 16:01, Alvaro Herrera <alvherre@alvh.no-ip.org> написал(а):
Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
lookup for every single element therein ... this sounds slow.
In one of the callsites, we already have the partition descriptor
available. We could just scan partdesc->is_leaf[] and add one for each
'true' value we see there.
The problem is that partdesc contains only direct children of the table and we need all the children down the inheritance tree to count the total number of leaf partitions in the first callsite.In the other callsite, we had the table open just a few lines before the
place you call count_leaf_partitions. Maybe we can rejigger things by
examining its state before closing it: if relkind is not partitioned we
know leaf_partitions=0, and only if partitioned we count leaf partitions.
I think that would save some work. I also wonder if it's worth writing
a bespoke function for counting leaf partitions rather than relying on
find_all_inheritors.
Sure, added this condition to avoid the extra work here.When creating an index on a partitioned table, this column is set to
- the total number of partitions on which the index is to be created.
+ the total number of leaf partitions on which the index is to be created or attached.
I think we should also add a note about the (now) non-constant nature
of the value, something along the lines of "This value is updated as
we're processing and discovering partitioned tables in the partition
hierarchy".
But the TOTAL is constant, right ? Updating the total when being called
recursively is the problem these patches fix.
If that's the case, then I'm not seeing the 'fix' part of the patch. I
thought this patch was fixing the provably incorrect TOTAL value where
DONE > TOTAL due to the recursive operation overwriting the DONE/TOTAL
values instead of updating them.
In HEAD we set TOTAL to whatever number partitioned table we're
currently processing has - regardless of whether we're the top level
statement.
With the patch we instead add the number of child relations to that
count, for which REL_HAS_STORAGE(child) -- or at least, in the v3
posted by Ilya. Approximately immediately after updating that count we
recurse to the child relations, and that only returns once it is done
creating the indexes, so both TOTAL and DONE go up as we process more
partitions in the hierarchy.
On Wed, 1 Feb 2023 at 18:51, Ilya Gladyshev <ilya.v.gladyshev@gmail.com> wrote: > > 1 февр. 2023 г., в 20:27, Matthias van de Meent <boekewurm+postgres@gmail.com> написал(а): > >> In HEAD we set TOTAL to whatever number partitioned table we're >> currently processing has - regardless of whether we're the top level >> statement. >> With the patch we instead add the number of child relations to that >> count, for which REL_HAS_STORAGE(child) -- or at least, in the v3 >> posted by Ilya. Approximately immediately after updating that count we >> recurse to the child relations, and that only returns once it is done >> creating the indexes, so both TOTAL and DONE go up as we process more >> partitions in the hierarchy. > > > The TOTAL in the patch is set only when processing the top-level parent and it is not updated when we recurse, so yes,it is constant. From v3: Ugh, I misread the patch, more specifically count_leaf_partitions and the !OidIsValid(parentIndexId) condition changes. You are correct, sorry for the noise. Kind regards, Matthias van de Meent
On Wed, Feb 01, 2023 at 07:24:48PM +0100, Matthias van de Meent wrote: > On Wed, 1 Feb 2023 at 18:51, Ilya Gladyshev <ilya.v.gladyshev@gmail.com> wrote: > > > > 1 февр. 2023 г., в 20:27, Matthias van de Meent <boekewurm+postgres@gmail.com> написал(а): > > > >> In HEAD we set TOTAL to whatever number partitioned table we're > >> currently processing has - regardless of whether we're the top level > >> statement. > >> With the patch we instead add the number of child relations to that > >> count, for which REL_HAS_STORAGE(child) -- or at least, in the v3 > >> posted by Ilya. Approximately immediately after updating that count we > >> recurse to the child relations, and that only returns once it is done > >> creating the indexes, so both TOTAL and DONE go up as we process more > >> partitions in the hierarchy. > > > > > > The TOTAL in the patch is set only when processing the top-level parent and it is not updated when we recurse, so yes,it is constant. From v3: > > Ugh, I misread the patch, more specifically count_leaf_partitions and > the !OidIsValid(parentIndexId) condition changes. > > You are correct, sorry for the noise. That suggests that the comments could've been more clear. I added a comment suggested by Tomas and adjusted some others and wrote a commit message. I even ran pgindent for about the 3rd time ever. 002 are my changes as a separate patch, which you could apply to your local branch. And 003/4 are assertions that I wrote to demonstrate the problem and the verify the fixes, but not being proposed for commit. -- Justin
Attachment
On Thu, Feb 02, 2023 at 09:18:07AM -0600, Justin Pryzby wrote: > On Wed, Feb 01, 2023 at 07:24:48PM +0100, Matthias van de Meent wrote: > > On Wed, 1 Feb 2023 at 18:51, Ilya Gladyshev <ilya.v.gladyshev@gmail.com> wrote: > > > 1 февр. 2023 г., в 20:27, Matthias van de Meent <boekewurm+postgres@gmail.com> написал(а): > > > > > >> In HEAD we set TOTAL to whatever number partitioned table we're > > >> currently processing has - regardless of whether we're the top level > > >> statement. > > >> With the patch we instead add the number of child relations to that > > >> count, for which REL_HAS_STORAGE(child) -- or at least, in the v3 > > >> posted by Ilya. Approximately immediately after updating that count we > > >> recurse to the child relations, and that only returns once it is done > > >> creating the indexes, so both TOTAL and DONE go up as we process more > > >> partitions in the hierarchy. > > > > > > The TOTAL in the patch is set only when processing the top-level parent and it is not updated when we recurse, so yes,it is constant. From v3: > > > > Ugh, I misread the patch, more specifically count_leaf_partitions and > > the !OidIsValid(parentIndexId) condition changes. > > > > You are correct, sorry for the noise. > > That suggests that the comments could've been more clear. I added a > comment suggested by Tomas and adjusted some others and wrote a commit > message. I even ran pgindent for about the 3rd time ever. > > 002 are my changes as a separate patch, which you could apply to your > local branch. > > And 003/4 are assertions that I wrote to demonstrate the problem and the > verify the fixes, but not being proposed for commit. That was probably a confusing way to present it - I should've sent the relative diff as a .txt rather than as patch 002. This squishes together 001/2 as the main patch. I believe it's ready. -- Justin
Attachment
On Wed, Feb 08, 2023 at 04:40:49PM -0600, Justin Pryzby wrote: > This squishes together 001/2 as the main patch. > I believe it's ready. Update to address a compiler warning in the supplementary patches adding assertions.
Attachment
Justin Pryzby <pryzby@telsasoft.com> writes: > Update to address a compiler warning in the supplementary patches adding > assertions. I took a look through this. It seems like basically a good solution, but the count_leaf_partitions() function is bothering me, for two reasons: 1. It seems like a pretty expensive thing to do. Don't we have the info at hand somewhere already? 2. Is it really safe to do find_all_inheritors with NoLock? If so, a comment explaining why would be good. regards, tom lane
On Fri, Mar 10, 2023 at 03:36:10PM -0500, Tom Lane wrote: > Justin Pryzby <pryzby@telsasoft.com> writes: > > Update to address a compiler warning in the supplementary patches adding > > assertions. > > I took a look through this. It seems like basically a good solution, > but the count_leaf_partitions() function is bothering me, for two > reasons: > > 1. It seems like a pretty expensive thing to do. Don't we have the > info at hand somewhere already? I don't know where that would be. We need the list of both direct *and* indirect partitions. See: https://www.postgresql.org/message-id/5073D187-4200-4A2D-BAC0-91C657E3C22E%40gmail.com If it would help to avoid the concern, then I might consider proposing not to call get_rel_relkind() ... > 2. Is it really safe to do find_all_inheritors with NoLock? If so, > a comment explaining why would be good. In both cases (both for the parent and for case of a partitioned child with pre-existing indexes being ATTACHed), the table itself is already locked by DefineIndex(): lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock; rel = table_open(relationId, lockmode); and childrel = table_open(childRelid, lockmode); ... table_close(childrel, NoLock); And, find_all_inheritors() will also have been called by ProcessUtilitySlow(). Maybe it's sufficient to mention that ? -- Justin
Justin Pryzby <pryzby@telsasoft.com> writes: > On Fri, Mar 10, 2023 at 03:36:10PM -0500, Tom Lane wrote: >> I took a look through this. It seems like basically a good solution, >> but the count_leaf_partitions() function is bothering me, for two >> reasons: > ... find_all_inheritors() will also have been called by > ProcessUtilitySlow(). Maybe it's sufficient to mention that ? Hm. Could we get rid of count_leaf_partitions by doing the work in ProcessUtilitySlow? Or at least passing that OID list forward instead of recomputing it? regards, tom lane
On Sun, Mar 12, 2023 at 04:14:06PM -0400, Tom Lane wrote: > Justin Pryzby <pryzby@telsasoft.com> writes: > > On Fri, Mar 10, 2023 at 03:36:10PM -0500, Tom Lane wrote: > >> I took a look through this. It seems like basically a good solution, > >> but the count_leaf_partitions() function is bothering me, for two > >> reasons: > > > ... find_all_inheritors() will also have been called by > > ProcessUtilitySlow(). Maybe it's sufficient to mention that ? > > Hm. Could we get rid of count_leaf_partitions by doing the work in > ProcessUtilitySlow? Or at least passing that OID list forward instead > of recomputing it? count_leaf_partitions() is called in two places: Once to get PROGRESS_CREATEIDX_PARTITIONS_TOTAL. It'd be easy enough to pass an integer total via IndexStmt (but I think we wanted to avoid adding anything there, since it's not a part of the statement). count_leaf_partitions() is also called for sub-partitions, in the case that a matching "partitioned index" already exists, and the progress report needs to be incremented by the number of leaves for which indexes were ATTACHED. We'd need a mapping from OID => npartitions (or to compile some data structure of all the partitioned partitions). I guess CreateIndex() could call CreatePartitionDirectory(). But it looks like that would be *more* expensive. -- Justin
Justin Pryzby <pryzby@telsasoft.com> writes: > On Sun, Mar 12, 2023 at 04:14:06PM -0400, Tom Lane wrote: >> Hm. Could we get rid of count_leaf_partitions by doing the work in >> ProcessUtilitySlow? Or at least passing that OID list forward instead >> of recomputing it? > count_leaf_partitions() is called in two places: > Once to get PROGRESS_CREATEIDX_PARTITIONS_TOTAL. It'd be easy enough to > pass an integer total via IndexStmt (but I think we wanted to avoid > adding anything there, since it's not a part of the statement). I agree that adding such a field to IndexStmt would be a very bad idea. However, adding another parameter to DefineIndex doesn't seem like a problem. Or could we just call pgstat_progress_update_param directly from ProcessUtilitySlow, after counting the partitions in the existing loop? > count_leaf_partitions() is also called for sub-partitions, in the case > that a matching "partitioned index" already exists, and the progress > report needs to be incremented by the number of leaves for which indexes > were ATTACHED. Can't you increment progress by one at the point where the actual attach happens? I also wonder whether leaving non-leaf partitions out of the total is making things more complicated rather than simpler ... > We'd need a mapping from OID => npartitions (or to > compile some data structure of all the partitioned partitions). I guess > CreateIndex() could call CreatePartitionDirectory(). But it looks like > that would be *more* expensive. The reason I find this annoying is that the non-optional nature of the progress reporting mechanism was sold on the basis that it would add only negligible overhead. Adding extra pass(es) over pg_inherits breaks that promise. Maybe it's cheap enough to not matter in the big scheme of things, but we should not be having to make arguments like that one. regards, tom lane
I wrote: > Justin Pryzby <pryzby@telsasoft.com> writes: >> count_leaf_partitions() is also called for sub-partitions, in the case >> that a matching "partitioned index" already exists, and the progress >> report needs to be incremented by the number of leaves for which indexes >> were ATTACHED. > Can't you increment progress by one at the point where the actual attach > happens? Oh, never mind; now I realize that the point is that you didn't ever iterate over those leaf indexes. However, consider a thought experiment: assume for whatever reason that all the actual index builds happen first, then all the cases where you succeed in attaching a sub-partitioned index happen at the end of the command. In that case, the percentage-done indicator would go from some-number to 100% more or less instantly. What if we simply do nothing at sub-partitioned indexes? Or if that's slightly too radical, just increase the PARTITIONS_DONE counter by 1? That would look indistinguishable from the case where all the attaches happen at the end. regards, tom lane
On Sun, Mar 12, 2023 at 06:25:13PM -0400, Tom Lane wrote: > Justin Pryzby <pryzby@telsasoft.com> writes: > > On Sun, Mar 12, 2023 at 04:14:06PM -0400, Tom Lane wrote: > >> Hm. Could we get rid of count_leaf_partitions by doing the work in > >> ProcessUtilitySlow? Or at least passing that OID list forward instead > >> of recomputing it? > > > count_leaf_partitions() is called in two places: > > > Once to get PROGRESS_CREATEIDX_PARTITIONS_TOTAL. It'd be easy enough to > > pass an integer total via IndexStmt (but I think we wanted to avoid > > adding anything there, since it's not a part of the statement). > > I agree that adding such a field to IndexStmt would be a very bad idea. > However, adding another parameter to DefineIndex doesn't seem like a > problem. It's a problem since this is a bug and it's desirable to backpatch a fix, right ? > Or could we just call pgstat_progress_update_param directly from > ProcessUtilitySlow, after counting the partitions in the existing loop? That'd be fine if it was only needed for TOTAL, but it doesn't handle the 2nd call to count_leaf_partitions(). On Sun, Mar 12, 2023 at 08:15:28PM -0400, Tom Lane wrote: > I wrote: > > Justin Pryzby <pryzby@telsasoft.com> writes: > >> count_leaf_partitions() is also called for sub-partitions, in the case > >> that a matching "partitioned index" already exists, and the progress > >> report needs to be incremented by the number of leaves for which indexes > >> were ATTACHED. > > > Can't you increment progress by one at the point where the actual attach > > happens? > > Oh, never mind; now I realize that the point is that you didn't ever > iterate over those leaf indexes. > > However, consider a thought experiment: assume for whatever reason that > all the actual index builds happen first, then all the cases where you > succeed in attaching a sub-partitioned index happen at the end of the > command. In that case, the percentage-done indicator would go from > some-number to 100% more or less instantly. > > What if we simply do nothing at sub-partitioned indexes? Or if that's > slightly too radical, just increase the PARTITIONS_DONE counter by 1? > That would look indistinguishable from the case where all the attaches > happen at the end. Incrementing by 0 sounds terrible, since someone who has intermediate partitioned tables is likely to always see 0% done. (It's true that intermediate partitioned tables don't seem to have been considered by the original patch, and it's indisputable that progress reporting currently misbehaves in that case). And incrementing PARTITIONS_DONE by 1 could lead to bogus progress reporting with "N_done > N_Total" if an intermediate partitioned table had no leaf partitions at all. That's one of the problems this thread is trying to fix (the other being "total changing in the middle of the command"). Maybe your idea is usable though, since indirect partitioned indexes *can* be counted correctly during recursion. What's hard to fix is the case that an index is both *partitioned* and *attached*. Maybe it's okay to count that case as 0. The consequence is that the command would end before the progress report got to 100%. The other option seems to be to define the progress report to count only *direct* children. https://www.postgresql.org/message-id/20221213044331.GJ27893%40telsasoft.com > The reason I find this annoying is that the non-optional nature of the > progress reporting mechanism was sold on the basis that it would add > only negligible overhead. Adding extra pass(es) over pg_inherits > breaks that promise. Maybe it's cheap enough to not matter in the > big scheme of things, but we should not be having to make arguments > like that one. If someone is running a DDL command involving nested partitions, I'm not so concerned about the cost of additional scans of pg_inherits. They either have enough data to justify partitioning partitions, or they're doing something silly. -- Justin
Justin Pryzby <pryzby@telsasoft.com> writes: > On Sun, Mar 12, 2023 at 06:25:13PM -0400, Tom Lane wrote: >> I agree that adding such a field to IndexStmt would be a very bad idea. >> However, adding another parameter to DefineIndex doesn't seem like a >> problem. > It's a problem since this is a bug and it's desirable to backpatch a > fix, right ? I do not think this is important enough to justify a back-patch. > Incrementing by 0 sounds terrible, since someone who has intermediate > partitioned tables is likely to always see 0% done. How so? The counter will increase after there's some actual work done, ie building an index. If there's no indexes to build then it hardly matters, because the command will complete in very little time. > And incrementing PARTITIONS_DONE by 1 could lead to bogus progress > reporting with "N_done > N_Total" if an intermediate partitioned table > had no leaf partitions at all. Well, we could fix that if we made TOTAL be the total number of descendants rather than just the leaves ;-). But I think not incrementing is probably better. regards, tom lane
On Mon, Mar 13, 2023 at 10:42:59AM -0400, Tom Lane wrote: > Justin Pryzby <pryzby@telsasoft.com> writes: > > On Sun, Mar 12, 2023 at 06:25:13PM -0400, Tom Lane wrote: > >> I agree that adding such a field to IndexStmt would be a very bad idea. > >> However, adding another parameter to DefineIndex doesn't seem like a > >> problem. > > > It's a problem since this is a bug and it's desirable to backpatch a > > fix, right ? > > I do not think this is important enough to justify a back-patch. That's fine with me, but it comes as a surprise, and it might invalidate earlier discussions, which were working under the constraint of maintaining a compatible ABI. > > Incrementing by 0 sounds terrible, since someone who has intermediate > > partitioned tables is likely to always see 0% done. > > How so? The counter will increase after there's some actual work done, > ie building an index. If there's no indexes to build then it hardly > matters, because the command will complete in very little time. I misunderstood your idea as suggesting to skip progress reporting for *every* intermediate partitioned index, and its leaves. But I guess what you meant is to skip progress reporting when ATTACHing an intermediate partitioned index. That seems okay, since 1) intermediate partitioned tables are probably rare, and 2) ATTACH is fast, so the effect is indistinguisable from querying the progress report a few moments later. The idea would be for: 1) TOTAL to show the number of direct and indirect leaf partitions; 2) update progress while building direct or indirect indexes; 3) ATTACHing intermediate partitioned tables to increment by 0; 4) ATTACHing a direct child should continue to increment by 1, since that common case already works as expected and shouldn't be changed. The only change from the current patch is (3). (1) still calls count_leaf_partitions(), but only once. I'd prefer that to rearranging the progress reporting to set the TOTAL in ProcessUtilitySlow(). -- Justin
Justin Pryzby <pryzby@telsasoft.com> writes: > The idea would be for: > 1) TOTAL to show the number of direct and indirect leaf partitions; > 2) update progress while building direct or indirect indexes; > 3) ATTACHing intermediate partitioned tables to increment by 0; > 4) ATTACHing a direct child should continue to increment by 1, > since that common case already works as expected and shouldn't be > changed. OK. > The only change from the current patch is (3). (1) still calls > count_leaf_partitions(), but only once. I'd prefer that to rearranging > the progress reporting to set the TOTAL in ProcessUtilitySlow(). I don't agree with that. find_all_inheritors is fairly expensive and it seems completely silly to do it twice just to avoid adding a parameter to DefineIndex. regards, tom lane
> 14 марта 2023 г., в 18:34, Justin Pryzby <pryzby@telsasoft.com> написал(а): > > On Mon, Mar 13, 2023 at 10:42:59AM -0400, Tom Lane wrote: >> Justin Pryzby <pryzby@telsasoft.com> writes: >>> On Sun, Mar 12, 2023 at 06:25:13PM -0400, Tom Lane wrote: >>>> I agree that adding such a field to IndexStmt would be a very bad idea. >>>> However, adding another parameter to DefineIndex doesn't seem like a >>>> problem. >> >>> It's a problem since this is a bug and it's desirable to backpatch a >>> fix, right ? >> >> I do not think this is important enough to justify a back-patch. > > That's fine with me, but it comes as a surprise, and it might invalidate > earlier discussions, which were working under the constraint of > maintaining a compatible ABI. > >>> Incrementing by 0 sounds terrible, since someone who has intermediate >>> partitioned tables is likely to always see 0% done. >> >> How so? The counter will increase after there's some actual work done, >> ie building an index. If there's no indexes to build then it hardly >> matters, because the command will complete in very little time. > > I misunderstood your idea as suggesting to skip progress reporting for > *every* intermediate partitioned index, and its leaves. But I guess > what you meant is to skip progress reporting when ATTACHing an > intermediate partitioned index. That seems okay, since 1) intermediate > partitioned tables are probably rare, and 2) ATTACH is fast, so the > effect is indistinguisable from querying the progress report a few > moments later. +1 that counting attached partitioned indexes as 0 is fine. > The idea would be for: > 1) TOTAL to show the number of direct and indirect leaf partitions; > 2) update progress while building direct or indirect indexes; > 3) ATTACHing intermediate partitioned tables to increment by 0; > 4) ATTACHing a direct child should continue to increment by 1, > since that common case already works as expected and shouldn't be > changed. > > The only change from the current patch is (3). (1) still calls > count_leaf_partitions(), but only once. I'd prefer that to rearranging > the progress reporting to set the TOTAL in ProcessUtilitySlow(). > > -- > Justin As for reusing TOTAL calculated outside of DefineIndex, as I can see, ProcessUtilitySlow is not the only call site for DefineIndex(although, I don’t know whether all of them need progress tracking), for instance, there is ALTER TABLE that callsDefineIndex to create index for constraints. So I feel like rearranging progress reporting will result in unnecessarycode duplication in those call sites, so passing in an optional parameter seems to be easier here, if we are goingto optimize it, after all. Especially if back-patching is a non-issue.
On Tue, Mar 14, 2023 at 06:58:14PM +0400, Ilya Gladyshev wrote: > > The only change from the current patch is (3). (1) still calls > > count_leaf_partitions(), but only once. I'd prefer that to rearranging > > the progress reporting to set the TOTAL in ProcessUtilitySlow(). > > As for reusing TOTAL calculated outside of DefineIndex, as I can see, ProcessUtilitySlow is not the only call site forDefineIndex (although, I don’t know whether all of them need progress tracking), for instance, there is ALTER TABLE thatcalls DefineIndex to create index for constraints. So I feel like rearranging progress reporting will result in unnecessarycode duplication in those call sites, so passing in an optional parameter seems to be easier here, if we are goingto optimize it, after all. Especially if back-patching is a non-issue. Yeah. See attached. I don't like duplicating the loop. Is this really the right direction to go ? I haven't verified if the child tables are locked in all the paths which would call count_leaf_partitions(). But why is it important to lock them for this? If they weren't locked before, that'd be a pre-existing problem...
Attachment
> 16 марта 2023 г., в 04:07, Justin Pryzby <pryzby@telsasoft.com> написал(а): > > On Tue, Mar 14, 2023 at 06:58:14PM +0400, Ilya Gladyshev wrote: >>> The only change from the current patch is (3). (1) still calls >>> count_leaf_partitions(), but only once. I'd prefer that to rearranging >>> the progress reporting to set the TOTAL in ProcessUtilitySlow(). >> >> As for reusing TOTAL calculated outside of DefineIndex, as I can see, ProcessUtilitySlow is not the only call site forDefineIndex (although, I don’t know whether all of them need progress tracking), for instance, there is ALTER TABLE thatcalls DefineIndex to create index for constraints. So I feel like rearranging progress reporting will result in unnecessarycode duplication in those call sites, so passing in an optional parameter seems to be easier here, if we are goingto optimize it, after all. Especially if back-patching is a non-issue. > > Yeah. See attached. I don't like duplicating the loop. Is this really > the right direction to go ? > > I haven't verified if the child tables are locked in all the paths which > would call count_leaf_partitions(). But why is it important to lock > them for this? If they weren't locked before, that'd be a pre-existing > problem... > <0001-fix-CREATE-INDEX-progress-report-with-nested-partiti.patch> I’m not sure what the general policy on locking is, but I have checked ALTER TABLE ADD INDEX, and the all the partitionsseem to be locked on the first entry to DefineIndex there. All other call sites pass in the parentIndexId, whichmeans the progress tracking machinery will not be initialized, so I think, we don’t need to do locking in count_leaf_partitions(). The approach in the patch looks good to me. Some nitpicks on the patch: 1. There’s an unnecessary second call to get_rel_relkind in ProcessUtilitySlow, we can just use what’s in the variable relkind. 2. We can also combine else and if to have one less nested level like that: + else if (!RELKIND_HAS_PARTITIONS(child_relkind)) 3. There was a part of the comment saying "If the index was built by calling DefineIndex() recursively, the called functionis responsible for updating the progress report for built indexes.", I think it is still useful to have it there.
On Thu, Mar 16, 2023 at 07:04:16PM +0400, Ilya Gladyshev wrote: > > 16 марта 2023 г., в 04:07, Justin Pryzby <pryzby@telsasoft.com> написал(а): > > > > On Tue, Mar 14, 2023 at 06:58:14PM +0400, Ilya Gladyshev wrote: > >>> The only change from the current patch is (3). (1) still calls > >>> count_leaf_partitions(), but only once. I'd prefer that to rearranging > >>> the progress reporting to set the TOTAL in ProcessUtilitySlow(). > >> > >> As for reusing TOTAL calculated outside of DefineIndex, as I can see, ProcessUtilitySlow is not the only call site forDefineIndex (although, I don’t know whether all of them need progress tracking), for instance, there is ALTER TABLE thatcalls DefineIndex to create index for constraints. So I feel like rearranging progress reporting will result in unnecessarycode duplication in those call sites, so passing in an optional parameter seems to be easier here, if we are goingto optimize it, after all. Especially if back-patching is a non-issue. > > > > Yeah. See attached. I don't like duplicating the loop. Is this really > > the right direction to go ? > > > > I haven't verified if the child tables are locked in all the paths which > > would call count_leaf_partitions(). But why is it important to lock > > them for this? If they weren't locked before, that'd be a pre-existing > > problem... > > <0001-fix-CREATE-INDEX-progress-report-with-nested-partiti.patch> > > I’m not sure what the general policy on locking is, but I have checked ALTER TABLE ADD INDEX, and the all the partitionsseem to be locked on the first entry to DefineIndex there. All other call sites pass in the parentIndexId, whichmeans the progress tracking machinery will not be initialized, so I think, we don’t need to do locking in count_leaf_partitions(). > The approach in the patch looks good to me. Some nitpicks on the patch: > 1. There’s an unnecessary second call to get_rel_relkind in ProcessUtilitySlow, we can just use what’s in the variablerelkind. > 2. We can also combine else and if to have one less nested level like that: > + else if (!RELKIND_HAS_PARTITIONS(child_relkind)) > > 3. There was a part of the comment saying "If the index was built by calling DefineIndex() recursively, the called functionis responsible for updating the progress report for built indexes.", I think it is still useful to have it there. Thanks, I addressed (1) and (3). (2) is deliberate, to allow a place to put the comment which is not specific to the !HAS_PARTITIONS case. -- Justin
Attachment
So I'm still pretty desperately unhappy with count_leaf_partitions. I don't like expending significant cost purely for progress tracking purposes, I don't like the undocumented assumption that NoLock is safe, and what's more, if it is safe then we've already traversed the inheritance tree to lock everything so in principle we could have the count already. However, it does seem like getting that knowledge from point A to point B would be a mess in most places. One thing we could do to reduce the cost (and improve the safety) is to forget the idea of checking the relkinds and just set the PARTITIONS_TOTAL count to list_length() of the find_all_inheritors result. We've already agreed that it's okay if the PARTITIONS_DONE count never reaches PARTITIONS_TOTAL, so this would just be taking that idea further. (Or we could increment PARTITIONS_DONE for non-leaf partitions when we visit them, thus making that TOTAL more nearly correct.) Furthermore, as things stand it's not hard for PARTITIONS_TOTAL to be zero --- there's at least one such case in the regression tests --- and that seems just weird to me. By the by, this is awful code: + if (RELKIND_HAS_STORAGE(get_rel_relkind(partrelid))) Consult the definition of RELKIND_HAS_STORAGE to see why. But I want to get rid of that rather than fixing it. regards, tom lane
On Thu, Mar 23, 2023 at 04:35:46PM -0400, Tom Lane wrote: > So I'm still pretty desperately unhappy with count_leaf_partitions. > I don't like expending significant cost purely for progress tracking > purposes, I don't like the undocumented assumption that NoLock is > safe, and what's more, if it is safe then we've already traversed > the inheritance tree to lock everything so in principle we could > have the count already. However, it does seem like getting that > knowledge from point A to point B would be a mess in most places. > > One thing we could do to reduce the cost (and improve the safety) > is to forget the idea of checking the relkinds and just set the > PARTITIONS_TOTAL count to list_length() of the find_all_inheritors > result. Actually list_length() minus 1 ... > We've already agreed that it's okay if the PARTITIONS_DONE > count never reaches PARTITIONS_TOTAL, so this would just be taking > that idea further. (Or we could increment PARTITIONS_DONE for > non-leaf partitions when we visit them, thus making that TOTAL > more nearly correct.) Yes, I think that's actually more correct. If TOTAL is set without regard to relkind, then DONE ought to be set the same way. I updated the documentation to indicate that the counters include the intermediate partitioned rels, but I wonder if it's better to say nothing and leave that undefined. > Furthermore, as things stand it's not hard > for PARTITIONS_TOTAL to be zero --- there's at least one such case > in the regression tests --- and that seems just weird to me. I don't know why it'd seem weird. postgres doesn't create partitions automatically, so by default there are none. If we create a table but never load any data, it'll have no partitions. Also, the TOTAL=0 case won't go away just because we start counting intermediate partitions. > By the by, this is awful code: > > + if (RELKIND_HAS_STORAGE(get_rel_relkind(partrelid))) > > Consult the definition of RELKIND_HAS_STORAGE to see why. > But I want to get rid of that rather than fixing it. Good point, but I'd burden-shift the blame to RELKIND_HAS_STORAGE(). BTW, I promoted myself to a co-author of the patch. My interest here is to resolve this hoping to allow the CIC patch to progress. -- Justin
Attachment
Justin Pryzby <pryzby@telsasoft.com> writes: > On Thu, Mar 23, 2023 at 04:35:46PM -0400, Tom Lane wrote: >> Furthermore, as things stand it's not hard >> for PARTITIONS_TOTAL to be zero --- there's at least one such case >> in the regression tests --- and that seems just weird to me. > I don't know why it'd seem weird. postgres doesn't create partitions > automatically, so by default there are none. If we create a table but > never load any data, it'll have no partitions. My problem with it is that it's not clear how to tell "no partitioned index creation in progress" from "partitioned index creation in progress, but total = 0". Maybe there's some out-of-band way to tell that in the stats reporting system, but still it's a weird corner case. > Also, the TOTAL=0 case > won't go away just because we start counting intermediate partitions. That's why I wanted list_length() not list_length() - 1. We are doing *something* at the top partitioned table, it just doesn't involve a table scan, so I don't find this totally unreasonable. If you agree we are doing work at intermediate partitioned tables, how are we not doing work at the top one? regards, tom lane
On Sat, Mar 25, 2023 at 11:55:13AM -0400, Tom Lane wrote: > Justin Pryzby <pryzby@telsasoft.com> writes: > > On Thu, Mar 23, 2023 at 04:35:46PM -0400, Tom Lane wrote: > >> Furthermore, as things stand it's not hard > >> for PARTITIONS_TOTAL to be zero --- there's at least one such case > >> in the regression tests --- and that seems just weird to me. > > > I don't know why it'd seem weird. postgres doesn't create partitions > > automatically, so by default there are none. If we create a table but > > never load any data, it'll have no partitions. > > My problem with it is that it's not clear how to tell "no partitioned > index creation in progress" from "partitioned index creation in progress, > but total = 0". Maybe there's some out-of-band way to tell that in the > stats reporting system, but still it's a weird corner case. > > > Also, the TOTAL=0 case > > won't go away just because we start counting intermediate partitions. > > That's why I wanted list_length() not list_length() - 1. We are > doing *something* at the top partitioned table, it just doesn't > involve a table scan, so I don't find this totally unreasonable. > If you agree we are doing work at intermediate partitioned tables, > how are we not doing work at the top one? What you're proposing would redefine the meaning of PARTITIONS_DONE/TOTAL, even in the absence of intermediate partitioned tables. Which might be okay, but the scope of this thread/patch was to fix the behavior involving intermediate partitioned tables. It's somewhat weird to me that find_all_inheritors(rel) returns the rel itself. But it's an internal function, and evidently that's what's needed/desirable to do, so that's fine. However, "PARTITIONS_TOTAL" has a certain user-facing definition, and "Number of partitions" is easier to explain than "Number of partitions plus the rel itself", and IMO an easier definition is a better one. Your complaint seems similar to something I've said a few times before: it's weird to expose macroscopic progress reporting of partitioned tables in the same view and in the same *row* as microscopic progress of its partitions. But changing that is a job for another patch. I won't be opposed to it if someone were to propose a patch to remove partitions_{done,total}. See also: https://www.postgresql.org/message-id/flat/YCy5ZMt8xAyoOMmv%40paquier.xyz#b20d1be226a93dacd3fd40b402315105 -- Justin
Justin Pryzby <pryzby@telsasoft.com> writes: > On Sat, Mar 25, 2023 at 11:55:13AM -0400, Tom Lane wrote: >> That's why I wanted list_length() not list_length() - 1. We are >> doing *something* at the top partitioned table, it just doesn't >> involve a table scan, so I don't find this totally unreasonable. >> If you agree we are doing work at intermediate partitioned tables, >> how are we not doing work at the top one? > What you're proposing would redefine the meaning of > PARTITIONS_DONE/TOTAL, even in the absence of intermediate partitioned > tables. Which might be okay, but the scope of this thread/patch was to > fix the behavior involving intermediate partitioned tables. I'm a little skeptical of that argument, because this patch is already redefining the meaning of PARTITIONS_TOTAL. The fact that the existing documentation is vague enough to be read either way doesn't make it not a change. Still, in the interests of getting something done I'll drop the issue. regards, tom lane
I pushed 0001 with some cosmetic changes (for instance, trying to make the style of the doc entries for partitions_total/partitions_done match the rest of their table). I'm not touching 0002 or 0003, because I think they're fundamentally a bad idea. Progress reporting is inherently inexact, because it's so hard to predict the amount of work to be done in advance -- have you ever seen a system anywhere whose progress bars reliably advance at a uniform rate? I think adding assertions that the estimates are error-free is just going to cause headaches. As an example, I added a comment pointing out that the current fix won't crash and burn if the caller failed to lock all the child tables in advance: the find_all_inheritors call should be safe anyway, so the worst consequence would be an imprecise partitions_total estimate. But that argument falls down if we're going to add assertions that partitions_total isn't in error. regards, tom lane
On Sat, Mar 25, 2023 at 03:43:32PM -0400, Tom Lane wrote: > I pushed 0001 with some cosmetic changes (for instance, trying to > make the style of the doc entries for partitions_total/partitions_done > match the rest of their table). Thanks. > I'm not touching 0002 or 0003, because I think they're fundamentally > a bad idea. Progress reporting is inherently inexact, because it's Nobody could disagree that it's inexact. The assertions are for minimal sanity tests and consistency. Like if "total" is set multiple times (as in this patch), or if a progress value goes backwards. Anyway the assertions exposed two other issues that would need to be fixed before the assertions themselves could be proposed. -- Justin