Thread: hyrax vs. RelationBuildPartitionDesc
Hi, Amit Kapila pointed out to me that there are some buildfarm failures on hyrax which seem to have started happening around the time I committed 898e5e3290a72d288923260143930fb32036c00c. It failed like this once:

2019-03-07 19:57:40.231 EST [28073:11] DETAIL: Failed process was running: /* same as above */ explain (costs off) select * from rlp where a = 1::numeric;

And then the next three runs failed like this:

2019-03-09 04:56:04.939 EST [1727:4] LOG: server process (PID 2898) was terminated by signal 9: Killed
2019-03-09 04:56:04.939 EST [1727:5] DETAIL: Failed process was running: UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end ) WHERE a = 'b' and b > 10 and c >= 96;

I couldn't think of an explanation for this other than that the process involved had used a lot of memory and gotten killed by the OOM killer, and it turns out that RelationBuildPartitionDesc leaks memory into the surrounding context every time it's called. It's not a lot of memory, but in a CLOBBER_CACHE_ALWAYS build it adds up, because this function gets called a lot. All of this is true even before the commit in question, but for some reason that I don't understand that commit makes it worse. I tried having that function use a temporary context for its scratch space and that causes a massive drop in memory usage when running update.sql from the regression tests under CLOBBER_CACHE_ALWAYS. I ran MacOS X's vmmap tool to see the impact, measuring the size of the "DefaultMallocZone". Without this patch, that peaks at >450MB; with this patch it peaks at ~270MB. There is a significant reduction in typical memory usage, too. It's noticeably better with this patch than it was before 898e5e3290a72d288923260143930fb32036c00c.

I'm not sure whether this is the right way to address the problem. RelationBuildPartitionDesc() creates basically all of the data structures it needs and then copies them into rel->rd_pdcxt, which has always seemed a bit inefficient to me. Another way to redesign this would be to have the function create a temporary context, do all of its work there, and then reparent the context under CacheMemoryContext at the end. That means any leaks would go into a relatively long-lifespan context, but on the other hand you wouldn't leak into the same context a zillion times over, and you'd save the expense of copying everything. I think that the biggest thing that is being copied around here is the partition bounds, so maybe the leak wouldn't amount to much, and we could also do things like list_free(inhoids) to make it a little tighter.

Opinions? Suggestions?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
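(For illustration only: a minimal sketch of the scratch-context approach described in the message above, written against PostgreSQL's memory-context API. The context names and exact call sites are assumptions for readability, not the attached patch itself.)

    /* inside RelationBuildPartitionDesc(), sketch only */
    MemoryContext scratch;
    MemoryContext oldcxt;

    /* all the leak-prone intermediate work goes into a throwaway context */
    scratch = AllocSetContextCreate(CurrentMemoryContext,
                                    "partdesc scratch",
                                    ALLOCSET_DEFAULT_SIZES);
    oldcxt = MemoryContextSwitchTo(scratch);

    /* ... scan pg_inherits, collect bound specs, call partition_bounds_create(),
     *     etc.; anything leaked here lands in scratch ... */

    /* copy only the finished structures into the long-lived context */
    rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
                                          "partition descriptor",
                                          ALLOCSET_DEFAULT_SIZES);
    /* ... copy boundinfo and the partition OID array into rd_pdcxt,
     *     then point rel->rd_partdesc at the copies ... */

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(scratch);   /* everything leaked above goes away here */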
On 2019-Mar-13, Robert Haas wrote: > RelationBuildPartitionDesc() creates basically all of the data > structures it needs and then copies them into rel->rd_pdcxt, which has > always seemed a bit inefficient to me. Another way to redesign this > would be to have the function create a temporary context, do all of > its work there, and then reparent the context under CacheMemoryContext > at the end. That means any leaks would go into a relatively > long-lifespan context, but on the other hand you wouldn't leak into > the same context a zillion times over, and you'd save the expense of > copying everything. I think that the biggest thing that is being > copied around here is the partition bounds, so maybe the leak wouldn't > amount to much, and we could also do things like list_free(inhoids) to > make it a little tighter. I remember going over this code's memory allocation strategy a bit to avoid the copy while not incurring potential leaks into CacheMemoryContext; as I recall, my idea was to use two contexts, one of which is temporary and used for any potentially leaky callees, and destroyed at the end of the function, and the other contains the good stuff and is reparented to CacheMemoryContext at the end. So if you have any accidental leaks, they don't affect a long-lived context. You have to be mindful of not calling leaky code when you're using the permanent one. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
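(A rough sketch of the two-context arrangement just described, under the assumption that the "good stuff" is built directly in its final context and only that context is reparented; the variable and context names are illustrative.)

    MemoryContext goodcxt;      /* will become rd_pdcxt */
    MemoryContext scratch;      /* destroyed before returning */
    MemoryContext oldcxt;

    goodcxt = AllocSetContextCreate(CurrentMemoryContext,
                                    "partition descriptor",
                                    ALLOCSET_DEFAULT_SIZES);
    scratch = AllocSetContextCreate(CurrentMemoryContext,
                                    "partdesc scratch",
                                    ALLOCSET_DEFAULT_SIZES);

    oldcxt = MemoryContextSwitchTo(scratch);
    /* ... call the potentially leaky code here (catalog scans, etc.) ... */

    MemoryContextSwitchTo(goodcxt);
    /* ... build only the structures we intend to keep ... */

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(scratch);                         /* leaks go with it */
    MemoryContextSetParent(goodcxt, CacheMemoryContext);  /* no copy needed */
    rel->rd_pdcxt = goodcxt;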
On Wed, Mar 13, 2019 at 12:42 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I remember going over this code's memory allocation strategy a bit to > avoid the copy while not incurring potential leaks CacheMemoryContext; > as I recall, my idea was to use two contexts, one of which is temporary > and used for any potentially leaky callees, and destroyed at the end of > the function, and the other contains the good stuff and is reparented to > CacheMemoryContext at the end. So if you have any accidental leaks, > they don't affect a long-lived context. You have to be mindful of not > calling leaky code when you're using the permanent one. Well, that assumes that the functions which allocate the good stuff do not also leak, which seems a bit fragile. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Mar 13, 2019 at 12:42 PM Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> I remember going over this code's memory allocation strategy a bit to >> avoid the copy while not incurring potential leaks CacheMemoryContext; >> as I recall, my idea was to use two contexts, one of which is temporary >> and used for any potentially leaky callees, and destroyed at the end of >> the function, and the other contains the good stuff and is reparented to >> CacheMemoryContext at the end. So if you have any accidental leaks, >> they don't affect a long-lived context. You have to be mindful of not >> calling leaky code when you're using the permanent one. > Well, that assumes that the functions which allocate the good stuff do > not also leak, which seems a bit fragile. I'm a bit confused as to why there's an issue here at all. The usual plan for computed-on-demand relcache sub-structures is that we compute a working copy that we're going to return to the caller using the caller's context (which is presumably statement-duration at most) and then do the equivalent of copyObject to stash a long-lived copy into the relcache context. Is this case being done differently, and if so why? If it's being done the same, where are we leaking? I recall having noticed someplace where I thought the relcache partition support was simply failing to make provisions for cleaning up a cached structure at relcache entry drop, but I didn't have time to pursue it right then. Let me see if I can reconstruct what I was worried about. regards, tom lane
On Wed, Mar 13, 2019 at 1:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm a bit confused as to why there's an issue here at all. The usual > plan for computed-on-demand relcache sub-structures is that we compute > a working copy that we're going to return to the caller using the > caller's context (which is presumably statement-duration at most) > and then do the equivalent of copyObject to stash a long-lived copy > into the relcache context. Is this case being done differently, and if > so why? If it's being done the same, where are we leaking? It's being done in just that way. The caller's context is MessageContext, which is indeed statement duration. But if you plunk 10k into MessageContext a few thousand times per statement, then you chew through a couple hundred meg of memory, and apparently hyrax can't tolerate that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I wrote: > I recall having noticed someplace where I thought the relcache partition > support was simply failing to make provisions for cleaning up a cached > structure at relcache entry drop, but I didn't have time to pursue it > right then. Let me see if I can reconstruct what I was worried about. Ah, here we are: it was rel->rd_partcheck. I'm not sure exactly how complicated that structure can be, but I'm pretty sure that this is a laughably inadequate method of cleaning it up:

    if (relation->rd_partcheck)
        pfree(relation->rd_partcheck);

Having it be loose data in CacheMemoryContext, which is the current state of affairs, is just not going to be practical to clean up. I suggest that maybe it could be copied into rd_partkeycxt or rd_pdcxt, so that it'd go away as a byproduct of freeing those. If there's a reason it has to be independent of both, it'll have to have its own small context. Dunno if that's related to hyrax's issue, though. regards, tom lane
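(To make the "own small context" alternative concrete, a sketch using a hypothetical rd_partcheckcxt field; the field name, the partcheck_quals local, and the placement are assumptions, not existing code.)

    /* where rd_partcheck gets built: */
    MemoryContext oldcxt;

    relation->rd_partcheckcxt = AllocSetContextCreate(CacheMemoryContext,
                                                      "partition constraint",
                                                      ALLOCSET_SMALL_SIZES);
    oldcxt = MemoryContextSwitchTo(relation->rd_partcheckcxt);
    relation->rd_partcheck = copyObject(partcheck_quals);  /* hypothetical local */
    MemoryContextSwitchTo(oldcxt);

    /* in RelationDestroyRelation, instead of the bare pfree(): */
    if (relation->rd_partcheckcxt)
        MemoryContextDelete(relation->rd_partcheckcxt);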
On 2019-Mar-13, Robert Haas wrote: > On Wed, Mar 13, 2019 at 12:42 PM Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > I remember going over this code's memory allocation strategy a bit to > > avoid the copy while not incurring potential leaks CacheMemoryContext; > > as I recall, my idea was to use two contexts, one of which is temporary > > and used for any potentially leaky callees, and destroyed at the end of > > the function, and the other contains the good stuff and is reparented to > > CacheMemoryContext at the end. So if you have any accidental leaks, > > they don't affect a long-lived context. You have to be mindful of not > > calling leaky code when you're using the permanent one. > > Well, that assumes that the functions which allocate the good stuff do > not also leak, which seems a bit fragile. A bit, yes, but not overly so, and it's less fragile than not having such a protection. Anything that allocates in CacheMemoryContext needs to be very careful anyway. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 13, 2019 at 1:38 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > A bit, yes, but not overly so, and it's less fragile that not having > such a protection. Anything that allocates in CacheMemoryContext needs > to be very careful anyway. True, but I think it's more fragile than either of the options I proposed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 13, 2019 at 1:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Ah, here we are: it was rel->rd_partcheck. I'm not sure exactly how > complicated that structure can be, but I'm pretty sure that this is > a laughably inadequate method of cleaning it up: > > if (relation->rd_partcheck) > pfree(relation->rd_partcheck); Oh, for crying out loud. > Dunno if that's related to hyrax's issue, though. It's related in the sense that it's a leak, and any leak will tend to run the system out of memory more easily, but what I observed was a large leak into MessageContext, whereas that would be a leak into CacheMemoryContext, so I think it's probably a sideshow rather than the main event. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019-Mar-13, Robert Haas wrote: > On Wed, Mar 13, 2019 at 1:38 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > A bit, yes, but not overly so, and it's less fragile that not having > > such a protection. Anything that allocates in CacheMemoryContext needs > > to be very careful anyway. > > True, but I think it's more fragile than either of the options I proposed. You do? Unless I misunderstood, your options are: 1. (the patch you attached) create a temporary memory context that is used for everything, then at the end copy the good stuff to CacheMemCxt (or a sub-context thereof). This still needs to copy. 2. create a temp memory context, do everything there, do retail freeing of everything we don't want, reparenting the context to CacheMemCxt. Hope we didn't forget to pfree anything. How is any of those superior to what I propose? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas <robertmhaas@gmail.com> writes: >> Dunno if that's related to hyrax's issue, though. > It's related in the sense that it's a leak, and any leak will tend to > run the system out of memory more easily, but what I observed was a > large leak into MessageContext, and that would be a leak into > CacheMemoryContext, so I think it's probably a sideshow rather than > the main event. OK, in that case it's definitely all the temporary data that gets created that is the problem. I've not examined your patch in great detail but it looks plausible for fixing that. I think that RelationBuildPartitionDesc could use some additional cleanup or at least better commenting. In particular, it's neither documented nor obvious to the naked eye why rel->rd_partdesc mustn't get set till the very end. As the complainant, I'm willing to go fix that, but do you want to push your patch first so it doesn't get broken? Or I could include your patch in the cleanup. regards, tom lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > You do? Unless I misunderstood, your options are: > 1. (the patch you attached) create a temporary memory context that is > used for everything, then at the end copy the good stuff to CacheMemCxt > (or a sub-context thereof). This still needs to copy. > 2. create a temp memory context, do everything there, do retail freeing > of everything we don't want, reparenting the context to CacheMemCxt. > Hope we didn't forget to pfree anything. > How is any of those superior to what I propose? I doubt that what you're suggesting is terribly workable. It's not just RelationBuildPartitionDesc that's at issue. Part of what will be the long-lived data structure is made by partition_bounds_create, and that invokes quite a boatload of code that both makes the final data structure and leaks a lot of intermediate junk. Having to be very careful about permanent vs temporary data throughout all of that seems like a recipe for bugs. The existing code in RelationBuildPartitionDesc is already pretty good about avoiding copying of data other than the output of partition_bounds_create. In fact, I think it's already close to what you're suggesting other than that point. So I think --- particularly given that we need something we can back-patch into v11 --- that we shouldn't try to do anything much more complicated than what Robert is suggesting. regards, tom lane
On Wed, Mar 13, 2019 at 2:21 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2019-Mar-13, Robert Haas wrote: > > On Wed, Mar 13, 2019 at 1:38 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > A bit, yes, but not overly so, and it's less fragile that not having > > > such a protection. Anything that allocates in CacheMemoryContext needs > > > to be very careful anyway. > > > > True, but I think it's more fragile than either of the options I proposed. > > You do? Unless I misunderstood, your options are: > > 1. (the patch you attached) create a temporary memory context that is > used for everything, then at the end copy the good stuff to CacheMemCxt > (or a sub-context thereof). This still needs to copy. > > 2. create a temp memory context, do everything there, do retail freeing > of everything we don't want, reparenting the context to CacheMemCxt. > Hope we didn't forget to pfree anything. > > How is any of those superior to what I propose? Well, *I* thought of it, so obviously it must be superior. :-) More seriously, your idea does seem better in some ways. My #1 doesn't avoid the copy, but does kill the leaks. My #2 avoids the copy, but risks a different flavor of leakage. Your idea also avoids the copy and leaks in fewer cases than my #2. In that sense, it is the technically superior option. However, it also requires more memory context switches than either of my options, and I guess that seems fragile to me in the sense that I think future code changes are more likely to go wrong due to the complexity involved. I might be mistaken about that, though. One other note is that, although the extra copy looks irksome, it's probably not very significant from a performance point of view. In a non-CLOBBER_CACHE_ALWAYS build it doesn't happen frequently enough to matter, and in a CLOBBER_CACHE_ALWAYS build everything is already so slow that it still doesn't matter. That's not the only consideration, though. Code which copies data structures might be buggy, or might develop bugs in the future, and avoiding that copy would avoid exposure to such bugs. On the other hand, it's not really possible to remove the copying without increasing the risk of leaking into the long-lived context. In some ways, I think this is all a natural outgrowth of the fact that we rely on palloc() in so many places instead of forcing code to be explicit about which memory context it intends to target. Global variables are considered harmful, and it's not that difficult to see the connection between that general principle and present difficulties. However, I will not have time to write and debug a patch reversing that choice between now and feature freeze. :-) I'm kinda inclined to go for the brute-force approach of slamming the temporary context in there as in the patch I proposed upthread, which should solve the immediate problem, and we can implement one of these other ideas later if we want. What do you think about that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 13, 2019 at 2:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > OK, in that case it's definitely all the temporary data that gets created > that is the problem. I've not examined your patch in great detail but > it looks plausible for fixing that. Cool. > I think that RelationBuildPartitionDesc could use some additional cleanup > or at least better commenting. In particular, it's neither documented nor > obvious to the naked eye why rel->rd_partdesc mustn't get set till the > very end. As the complainant, I'm willing to go fix that, but do you want > to push your patch first so it doesn't get broken? Or I could include > your patch in the cleanup. Yeah, that probably makes sense, though it might be polite to wait another hour or two to see if anyone wants to argue with that approach further. It seems kinda obvious to me why rel->rd_partdesc can't get set until the end. Isn't it just that you'd better not set a permanent pointer to a data structure until you're past any code that might ERROR, which is pretty much everything? That principle applies to lots of PostgreSQL code, not just this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Mar 13, 2019 at 2:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think that RelationBuildPartitionDesc could use some additional cleanup >> or at least better commenting. In particular, it's neither documented nor >> obvious to the naked eye why rel->rd_partdesc mustn't get set till the >> very end. As the complainant, I'm willing to go fix that, but do you want >> to push your patch first so it doesn't get broken? Or I could include >> your patch in the cleanup. > It seems kinda obvious to me why rel->rd_partdesc can't get set until > the end. Isn't it just that you'd better not set a permanent pointer > to a data structure until you're past any code that might ERROR, which > is pretty much everything? That principle applies to lots of > PostgreSQL code, not just this. Yeah, but usually there's some comment pointing it out. I also wonder if there aren't corner-case bugs; it seems a bit bogus for example that rd_pdcxt is created without any thought as to whether it might be set already. It's not clear whether this has been written with the level of paranoia that's appropriate for messing with a relcache entry, and some comments would make it a lot clearer (a) if that is true and (b) what assumptions are implicitly being shared with relcache.c. Meanwhile, who's going to take point on cleaning up rd_partcheck? I don't really understand this code well enough to know whether that can share one of the existing partitioning-related sub-contexts. regards, tom lane
On Wed, Mar 13, 2019 at 3:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah, but usually there's some comment pointing it out. I also wonder > if there aren't corner-case bugs; it seems a bit bogus for example that > rd_pdcxt is created without any thought as to whether it might be set > already. It's not clear whether this has been written with the > level of paranoia that's appropriate for messing with a relcache entry, > and some comments would make it a lot clearer (a) if that is true and > (b) what assumptions are implicitly being shared with relcache.c. Yeah, this is all pretty new code, and it probably has some bugs we haven't found yet. > Meanwhile, who's going to take point on cleaning up rd_partcheck? > I don't really understand this code well enough to know whether that > can share one of the existing partitioning-related sub-contexts. Well, it would be nice if Amit Langote worked on it since he wrote the original version of most of this code, but I'm sure he'll defer to you if you feel the urge to work on it, or I can take a look at it (not today). To your question, I think it probably can't share a context. Briefly, rd_partkey can't change ever, except that when a partitioned relation is in the process of being created it is briefly NULL; once it obtains a value, that value cannot be changed. If you want to range-partition a list-partitioned table or something like that, you have to drop the table and create a new one. I think that's a perfectly acceptable permanent limitation and I have no urge whatever to change it. rd_partdesc changes when you attach or detach a child partition. rd_partcheck is the reverse: it changes when you attach or detach this partition to or from a parent. It's probably helpful to think of the case of a table with partitions each of which is itself partitioned -- the table at that middle level has to worry both about gaining or losing children and about being ripped away from its parent. As a parenthetical note, I observe that relcache.c seems to know almost nothing about rd_partcheck. rd_partkey and rd_partdesc both have handling in RelationClearRelation(), but rd_partcheck does not, and I suspect that's wrong. So the problems are probably not confined to the relcache-drop-time problem that you observed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Mar 13, 2019 at 3:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Meanwhile, who's going to take point on cleaning up rd_partcheck? >> I don't really understand this code well enough to know whether that >> can share one of the existing partitioning-related sub-contexts. > To your question, I think it probably can't share a context. Briefly, > rd_partkey can't change ever, except that when a partitioned relation > is in the process of being created it is briefly NULL; once it obtains > a value, that value cannot be changed. If you want to range-partition > a list-partitioned table or something like that, you have to drop the > table and create a new one. I think that's a perfectly acceptable > permanent limitation and I have no urge whatever to change it. > rd_partdesc changes when you attach or detach a child partition. > rd_partcheck is the reverse: it changes when you attach or detach this > partition to or from a parent. Got it. Yeah, it seems like the clearest and least bug-prone solution is for those to be in three separate sub-contexts. The only reason I was trying to avoid that was the angle of how to back-patch into 11. However, we can just add the additional context pointer field at the end of the Relation struct in v11, and that should be good enough to avoid ABI problems. Off topic for the moment, since this clearly wouldn't be back-patch material, but I'm starting to wonder if we should just have a context for each relcache entry and get rid of most or all of the retail cleanup logic in RelationDestroyRelation ... regards, tom lane
On Wed, Mar 13, 2019 at 4:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Off topic for the moment, since this clearly wouldn't be back-patch > material, but I'm starting to wonder if we should just have a context > for each relcache entry and get rid of most or all of the retail > cleanup logic in RelationDestroyRelation ... I think that idea might have a lot of merit, but I haven't studied it closely. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 13, 2019 at 4:38 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 13, 2019 at 4:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Off topic for the moment, since this clearly wouldn't be back-patch > > material, but I'm starting to wonder if we should just have a context > > for each relcache entry and get rid of most or all of the retail > > cleanup logic in RelationDestroyRelation ... > > I think that idea might have a lot of merit, but I haven't studied it closely. It just occurred to me that one advantage of this would be that you could see how much memory was being used by each relcache entry using MemoryContextStats(), which seems super-appealing. In fact, what about getting rid of all allocations in CacheMemoryContext itself in favor of some more specific context in each case? That would make it a lot clearer where to look for leaks -- or efficiencies. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-03-13 16:50:53 -0400, Robert Haas wrote: > On Wed, Mar 13, 2019 at 4:38 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Mar 13, 2019 at 4:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Off topic for the moment, since this clearly wouldn't be back-patch > > > material, but I'm starting to wonder if we should just have a context > > > for each relcache entry and get rid of most or all of the retail > > > cleanup logic in RelationDestroyRelation ... > > > > I think that idea might have a lot of merit, but I haven't studied it closely. > > It just occurred to me that one advantage of this would be that you > could see how much memory was being used by each relcache entry using > MemoryContextStats(), which seems super-appealing. In fact, what > about getting rid of all allocations in CacheMemoryContext itself in > favor of some more specific context in each case? That would make it > a lot clearer where to look for leaks -- or efficiencies. But it might also make it frustrating to look at memory context dumps - we'd suddenly have many, many more memory context lines displayed, right? Wouldn't that often make the dump extremely long? To be clear, I think the idea has merit. Just want to raise the above point. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-03-13 16:50:53 -0400, Robert Haas wrote: >> On Wed, Mar 13, 2019 at 4:38 PM Robert Haas <robertmhaas@gmail.com> wrote: >>> On Wed, Mar 13, 2019 at 4:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> Off topic for the moment, since this clearly wouldn't be back-patch >>>> material, but I'm starting to wonder if we should just have a context >>>> for each relcache entry and get rid of most or all of the retail >>>> cleanup logic in RelationDestroyRelation ... >> It just occurred to me that one advantage of this would be that you >> could see how much memory was being used by each relcache entry using >> MemoryContextStats(), which seems super-appealing. > But it might also make it frustrating to look at memory context dumps - > we'd suddenly have many many more memory context lines we'd displayed, > right? Wouldn't that often make the dump extremely long? There's already a mechanism in there to suppress child contexts after 100 or so, which would almost inevitably kick in on the relcache if we did this. So I don't believe we'd have a problem with the context dumps getting too long --- more likely, the complaints would be the reverse. My gut feeling is that right now relcache entries tend to be mas-o-menos the same size, except for stuff that is *already* in sub-contexts, like index and partition descriptors. So I'm not that excited about this adding useful info to context dumps. I was just looking at it as a way to make relcache entry cleanup simpler and less leak-prone. Having said that, I do agree that CacheMemoryContext is too much of an undifferentiated blob right now, and splitting it up seems like it'd be good for accountability. I'd definitely be +1 for a catcache vs. relcache vs. other caches split. You could imagine per-catcache contexts, too. The main limiting factor here is that the per-context overhead could get excessive. regards, tom lane
Hi, On 2019-03-13 17:10:55 -0400, Tom Lane wrote: > There's already a mechanism in there to suppress child contexts after > 100 or so, which would almost inevitably kick in on the relcache if we > did this. So I don't believe we'd have a problem with the context dumps > getting too long --- more likely, the complaints would be the reverse. Well, that's two sides of the same coin. > Having said that, I do agree that CacheMemoryContext is too much of an > undifferentiated blob right now, and splitting it up seems like it'd be > good for accountability. I'd definitely be +1 for a catcache vs. relcache > vs. other caches split. That'd make a lot of sense. > You could imagine per-catcache contexts, too. > The main limiting factor here is that the per-context overhead could get > excessive. Yea, per relcache entry contexts seem like they'd get really expensive fast. Even per-catcache seems like it might be noticeable additional overhead for a new backend. Greetings, Andres Freund
On 2019/03/14 5:18, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Mar 13, 2019 at 3:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Meanwhile, who's going to take point on cleaning up rd_partcheck? >>> I don't really understand this code well enough to know whether that >>> can share one of the existing partitioning-related sub-contexts. > >> To your question, I think it probably can't share a context. Briefly, >> rd_partkey can't change ever, except that when a partitioned relation >> is in the process of being created it is briefly NULL; once it obtains >> a value, that value cannot be changed. If you want to range-partition >> a list-partitioned table or something like that, you have to drop the >> table and create a new one. I think that's a perfectly acceptable >> permanent limitation and I have no urge whatever to change it. >> rd_partdesc changes when you attach or detach a child partition. >> rd_partcheck is the reverse: it changes when you attach or detach this >> partition to or from a parent. > > Got it. Yeah, it seems like the clearest and least bug-prone solution > is for those to be in three separate sub-contexts. The only reason > I was trying to avoid that was the angle of how to back-patch into 11. > However, we can just add the additional context pointer field at the > end of the Relation struct in v11, and that should be good enough to > avoid ABI problems. Agree that rd_partcheck needs its own context as you have complained in the past [1]. I think we'll need to back-patch this fix to PG 10 as well. I've attached patches for all 3 branches. Thanks, Amit [1] https://www.postgresql.org/message-id/22236.1523374067%40sss.pgh.pa.us
Attachment
On Wed, Mar 13, 2019 at 2:59 PM Robert Haas <robertmhaas@gmail.com> wrote: > Yeah, that probably makes sense, though it might be polite to wait > another hour or two to see if anyone wants to argue with that approach > further. Hearing nobody, done. If someone wants to argue more we can always change it later. I did not back-patch, because the code is in a different file in v11, none of the hunks of the patch apply on v11, and v11 is not failing on hyrax. But feel free to do it if you feel strongly about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I did not back-patch, because the code is in a different file in v11, > none of the hunks of the patch apply on v11, and v11 is not failing on > hyrax. Hmm, I wonder why not. I suppose the answer is that the leak is worse in HEAD than before, but how come? I followed your reference to 898e5e329, and I've got to say that the hunk it adds in relcache.c looks fishy as can be. The argument that the rd_pdcxt "will get cleaned up eventually" reads to me like "this leaks memory like a sieve", especially in the repeated-rebuild scenario which is exactly what CLOBBER_CACHE_ALWAYS would provoke. Probably the only thing that keeps it from being effectively a session-lifespan leak is that CCA will generally result in relcache entries being flushed entirely as soon as their refcount goes to 0. Still, because of that, I wouldn't think it'd move the needle very much on a CCA animal; so my guess is that there's something else. regards, tom lane
On Thu, Mar 14, 2019 at 12:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hmm, I wonder why not. I suppose the answer is that > the leak is worse in HEAD than before, but how come? I'd like to know that, too, but right now I don't. > I followed your reference to 898e5e329, and I've got to say that > the hunk it adds in relcache.c looks fishy as can be. The argument > that the rd_pdcxt "will get cleaned up eventually" reads to me like > "this leaks memory like a sieve", especially in the repeated-rebuild > scenario which is exactly what CLOBBER_CACHE_ALWAYS would provoke. > Probably the only thing that keeps it from being effectively a > session-lifespan leak is that CCA will generally result in relcache > entries being flushed entirely as soon as their refcount goes to 0. > Still, because of that, I wouldn't think it'd move the needle very > much on a CCA animal; so my guess is that there's something else. I'm a little uncertain of that logic, too, but keep in mind that if keep_partdesc is true, we're just going to throw away the newly-built data and keep the old stuff. So I think that in order for this to actually be a problem, you would have to have other sessions that repeatedly alter the partitiondesc via ATTACH PARTITION, and at the same time, the current session would have to keep on reading it. But you also have to never add a partition while the local session is between queries, because if you do, rebuild will be false and we'll blow away the old entry in its entirety and any old contexts that are hanging off of it will be destroyed as well. So it seems a little ticklish to me to figure out a realistic scenario in which we actually leak enough memory to matter here, but maybe you have an idea of which I have not thought. We could certainly do better - just refcount each PartitionDesc. Have the relcache entry hold a refcount on the PartitionDesc and have a PartitionDirectory hold a refcount on the PartitionDesc; then, any time the refcount goes to 0, you can immediately destroy the PartitionDesc. Or, simpler, arrange things so that when the relcache entry's refcount goes to 0, any old PartitionDescs that are still hanging off of the latest one get destroyed then, not later. It's just a question of whether it's really worth the code, or whether we're fixing imaginary bugs by adding code that might have real ones. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Mar 14, 2019 at 12:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm, I wonder why not. I suppose the answer is that >> the leak is worse in HEAD than before, but how come? > I'd like to know that, too, but right now I don't. I poked at this some more, by dint of adding some memory context usage tracking and running it under CLOBBER_CACHE_ALWAYS --- specifically, I looked at the update.sql test script, which is one of the two that is giving hyrax trouble. (Conveniently, that one runs successfully by itself, without need for the rest of the regression suite.) I found that even with 2455ab488, there still seemed to be an unreasonable amount of MessageContext bloat during some of the queries, on the order of 50MB in some of them. Investigating more closely, I found that 2455ab488 has a couple of simple oversights: it's still calling partition_bounds_create in the caller's context, allowing whatever that builds or leaks to be a MessageContext leak. And it's still calling get_rel_relkind() in the rd_pdcxt context, potentially leaking a *lot* of stuff into that normally-long-lived context, since that will result in fresh catcache (and thence potentially relcache) loads in some cases. But fixing that didn't seem to move the needle very much for update.sql. With some further investigation, I identified a few other main contributors to the bloat: RelationBuildTriggers RelationBuildRowSecurity RelationBuildPartitionKey Are you noticing a pattern yet? The real issue here is that we have this its-okay-to-leak-in-the-callers-context mindset throughout relcache.c, and when CLOBBER_CACHE_ALWAYS causes us to reload relcache entries a lot, that adds up fast. The partition stuff makes this worse, I think, primarily just because it increases the number of tables we touch in a typical query. What I'm thinking, therefore, is that 2455ab488 had the right idea but didn't take it far enough. We should remove the temp-context logic it added to RelationBuildPartitionDesc and instead put that one level up, in RelationBuildDesc, where the same temp context can serve all of these leak-prone sub-facilities. Possibly it'd make sense to conditionally compile this so that we only do it in a CLOBBER_CACHE_ALWAYS build. I'm not very sure about that, but arguably in a normal build the overhead of making and destroying a context would outweigh the cost of the leaked memory. The main argument I can think of for doing it all the time is that having memory allocation work noticeably differently in CCA builds than normal ones seems like a recipe for masking normal-mode bugs from the CCA animals. But that's only a guess not proven fact; it's also possible that having two different behaviors would expose bugs we'd otherwise have a hard time detecting, such as continuing to rely on the "temporary" data structures longer than we should. (This is all independent of the other questions raised on this and nearby threads.) regards, tom lane
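(A sketch of what hoisting the temp context up into RelationBuildDesc might look like, with the conditional-compilation idea folded in. The #ifdef placement, context name, and elided body are assumptions about a possible patch, not the eventual commit.)

    static Relation
    RelationBuildDesc(Oid targetRelId, bool insertIt)
    {
    #ifdef CLOBBER_CACHE_ALWAYS
        /* one scratch context catches leaks from all the sub-builders:
         * RelationBuildTriggers, RelationBuildRowSecurity,
         * RelationBuildPartitionKey, RelationBuildPartitionDesc, ... */
        MemoryContext tmpcxt = AllocSetContextCreate(CurrentMemoryContext,
                                                     "RelationBuildDesc workspace",
                                                     ALLOCSET_DEFAULT_SIZES);
        MemoryContext oldcxt = MemoryContextSwitchTo(tmpcxt);
    #endif

        /* ... existing body of RelationBuildDesc, unchanged ... */

    #ifdef CLOBBER_CACHE_ALWAYS
        MemoryContextSwitchTo(oldcxt);
        MemoryContextDelete(tmpcxt);    /* drop whatever the sub-builders leaked */
    #endif
        return relation;
    }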
On 15/03/2019 00:08, Tom Lane wrote: > What I'm thinking, therefore, is that 2455ab488 had the right idea but > didn't take it far enough. We should remove the temp-context logic it > added to RelationBuildPartitionDesc and instead put that one level up, > in RelationBuildDesc, where the same temp context can serve all of these > leak-prone sub-facilities. > > Possibly it'd make sense to conditionally compile this so that we only > do it in a CLOBBER_CACHE_ALWAYS build. I'm not very sure about that, > but arguably in a normal build the overhead of making and destroying > a context would outweigh the cost of the leaked memory. The main > argument I can think of for doing it all the time is that having memory > allocation work noticeably differently in CCA builds than normal ones > seems like a recipe for masking normal-mode bugs from the CCA animals. Having CLOBBER_CACHE_ALWAYS behave differently sounds horrible. We maintain a free list of AllocSetContexts nowadays, so creating a short-lived context should be pretty cheap. Or if it's still too expensive, we could create one short-lived context as a child of TopMemoryContext, and reuse that on every call, resetting it at the end of the function. - Heikki
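(The reuse-and-reset variant mentioned above, sketched under the assumption of a single backend-local context; as the next message points out, RelationBuildDesc can recurse, so a lone static context like this would need more care in practice.)

    static MemoryContext relbuild_scratch = NULL;   /* file-scope in relcache.c */

    /* in the build function: */
    MemoryContext oldcxt;

    if (relbuild_scratch == NULL)
        relbuild_scratch = AllocSetContextCreate(TopMemoryContext,
                                                 "relcache build scratch",
                                                 ALLOCSET_DEFAULT_SIZES);

    oldcxt = MemoryContextSwitchTo(relbuild_scratch);
    /* ... leak-prone build work ... */
    MemoryContextSwitchTo(oldcxt);
    MemoryContextReset(relbuild_scratch);   /* frees the allocations, keeps the context */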
On Fri, Mar 15, 2019 at 3:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > We maintain a free list of AllocSetContexts nowadays, so creating a > short-lived context should be pretty cheap. Or if it's still too > expensive, we could create one short-lived context as a child of > TopMemoryContext, and reuse that on every call, resetting it at the end > of the function. Relcache rebuild is reentrant, I think, so we have to be careful about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 14, 2019 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I found that even with 2455ab488, there still seemed to be an > unreasonable amount of MessageContext bloat during some of the queries, > on the order of 50MB in some of them. Investigating more closely, > I found that 2455ab488 has a couple of simple oversights: it's still > calling partition_bounds_create in the caller's context, allowing > whatever that builds or leaks to be a MessageContext leak. Oops. > And it's > still calling get_rel_relkind() in the rd_pdcxt context, potentially > leaking a *lot* of stuff into that normally-long-lived context, since > that will result in fresh catcache (and thence potentially relcache) > loads in some cases. That's really unfortunate. I know CLOBBER_CACHE_ALWAYS testing has a lot of value, but get_rel_relkind() is the sort of function that developers need to be able to call without worrying about creating a massive memory leak. Maybe the only time it is really going to get called enough to matter is from within the cache invalidation machinery itself, in which case your idea will fix it. But I'm not sure about that. > What I'm thinking, therefore, is that 2455ab488 had the right idea but > didn't take it far enough. We should remove the temp-context logic it > added to RelationBuildPartitionDesc and instead put that one level up, > in RelationBuildDesc, where the same temp context can serve all of these > leak-prone sub-facilities. Yeah, that sounds good. > Possibly it'd make sense to conditionally compile this so that we only > do it in a CLOBBER_CACHE_ALWAYS build. I'm not very sure about that, > but arguably in a normal build the overhead of making and destroying > a context would outweigh the cost of the leaked memory. The main > argument I can think of for doing it all the time is that having memory > allocation work noticeably differently in CCA builds than normal ones > seems like a recipe for masking normal-mode bugs from the CCA animals. > But that's only a guess not proven fact; it's also possible that having > two different behaviors would expose bugs we'd otherwise have a hard > time detecting, such as continuing to rely on the "temporary" data > structures longer than we should. I lean toward thinking it makes more sense to be consistent, but I'm not sure that's right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On 15/03/2019 00:08, Tom Lane wrote: >> Possibly it'd make sense to conditionally compile this so that we only >> do it in a CLOBBER_CACHE_ALWAYS build. I'm not very sure about that, >> but arguably in a normal build the overhead of making and destroying >> a context would outweigh the cost of the leaked memory. The main >> argument I can think of for doing it all the time is that having memory >> allocation work noticeably differently in CCA builds than normal ones >> seems like a recipe for masking normal-mode bugs from the CCA animals. > Having CLOBBER_CACHE_ALWAYS behave differently sounds horrible. I'm not entirely convinced of that. It's already pretty different... > We maintain a free list of AllocSetContexts nowadays, so creating a > short-lived context should be pretty cheap. Or if it's still too > expensive, we could create one short-lived context as a child of > TopMemoryContext, and reuse that on every call, resetting it at the end > of the function. RelationBuildDesc potentially recurses, so a single context won't do. Perhaps you're right that the context freelist would make this cheap enough to not matter, but I'm not sure of that either. What I'm inclined to do to get the buildfarm pressure off is to set things up so that by default we do this only in CCA builds, but there's a #define you can set from the compile command line to override that decision in either direction. Then, if somebody wants to investigate the performance effects in more detail, they could twiddle it easily. Depending on the results, we could alter the default policy. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Mar 14, 2019 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> And it's >> still calling get_rel_relkind() in the rd_pdcxt context, potentially >> leaking a *lot* of stuff into that normally-long-lived context, since >> that will result in fresh catcache (and thence potentially relcache) >> loads in some cases. > That's really unfortunate. I know CLOBBER_CACHE_ALWAYS testing has a > lot of value, but get_rel_relkind() is the sort of function that > developers need to be able to call without worrying with creating a > massive memory leak. I don't find that to be a concern. The bug here is that random code is being called while CurrentMemoryContext is pointing to a long-lived context, and that's just a localized violation of a well understood coding convention. I don't think that that means we need some large fire drill in response. >> What I'm thinking, therefore, is that 2455ab488 had the right idea but >> didn't take it far enough. We should remove the temp-context logic it >> added to RelationBuildPartitionDesc and instead put that one level up, >> in RelationBuildDesc, where the same temp context can serve all of these >> leak-prone sub-facilities. > Yeah, that sounds good. OK, I'll have a go at that today. >> Possibly it'd make sense to conditionally compile this so that we only >> do it in a CLOBBER_CACHE_ALWAYS build. I'm not very sure about that, > I lean toward thinking it makes more sense to be consistent, but I'm > not sure that's right. My feeling right now is that the its-okay-to-leak policy has been in place for decades and we haven't detected a problem with it before. Hence, doing that differently in normal builds should require some positive evidence that it'd be beneficial (or at least not a net loss). I don't have the time or interest to collect such evidence. But I'm happy to set up the patch to make it easy for someone else to do so, if anyone is sufficiently excited about this to do the work. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Mar 14, 2019 at 12:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm, I wonder why not. I suppose the answer is that >> the leak is worse in HEAD than before, but how come? > I'd like to know that, too, but right now I don't. BTW, after closer study of 898e5e329 I have a theory as to why it made things worse for CCA animals: it causes relcache entries to be held open (via incremented refcounts) throughout planning, which they were not before. This means that, during each of the many RelationCacheInvalidate events that will occur during a planning cycle, we have to rebuild those relcache entries; previously they were just dropped once and that was the end of it till execution. You noted correctly that the existing PartitionDesc structure would generally get preserved during each reload --- but that has exactly zip to do with the amount of transient space consumed by execution of RelationBuildDesc. We still build a transient PartitionDesc that we then elect not to install. And besides that there's all the not-partitioning-related stuff that RelationBuildDesc can leak. This is probably largely irrelevant for non-CCA situations, since we don't expect forced relcache invals to occur often otherwise. But it seems to fit the facts, particularly that hyrax is only dying on queries that involve moderately large partitioning structures and hence quite a few relations pinned by PartitionDirectory entries. I remain of the opinion that we ought not have PartitionDirectories pinning relcache entries ... but this particular effect isn't really significant to that argument, I think. regards, tom lane
On Fri, Mar 15, 2019 at 2:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > BTW, after closer study of 898e5e329 I have a theory as to why it > made things worse for CCA animals: it causes relcache entries to be > held open (via incremented refcounts) throughout planning, which > they were not before. This means that, during each of the many > RelationCacheInvalidate events that will occur during a planning > cycle, we have to rebuild those relcache entries; previously they > were just dropped once and that was the end of it till execution. That makes sense. Thanks for looking at it, and for the insight. > You noted correctly that the existing PartitionDesc structure would > generally get preserved during each reload --- but that has exactly > zip to do with the amount of transient space consumed by execution > of RelationBuildDesc. We still build a transient PartitionDesc > that we then elect not to install. And besides that there's all the > not-partitioning-related stuff that RelationBuildDesc can leak. Right. So it's just that we turned a bunch of rebuild = false situations into rebuild = true situations. I think. > This is probably largely irrelevant for non-CCA situations, since > we don't expect forced relcache invals to occur often otherwise. > But it seems to fit the facts, particularly that hyrax is only dying > on queries that involve moderately large partitioning structures > and hence quite a few relations pinned by PartitionDirectory entries. > > I remain of the opinion that we ought not have PartitionDirectories > pinning relcache entries ... but this particular effect isn't really > significant to that argument, I think. Cool. I would still like to hear more about why you think that. I agree that the pinning is not great, but copying moderately large partitioning structures that are only likely to get more complex in future releases also does not seem great, so I think it may be the least of evils. You, on the other hand, seem to have a rather visceral hatred for it. That may be based on facts with which I am not acquainted or it may be that we have a different view of what good design looks like. If it's the latter, we may just have to agree to disagree and see if other people want to opine, but if it's the former, I'd like to know those facts. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Mar 15, 2019 at 2:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> BTW, after closer study of 898e5e329 I have a theory as to why it >> made things worse for CCA animals: it causes relcache entries to be >> held open (via incremented refcounts) throughout planning, which >> they were not before. This means that, during each of the many >> RelationCacheInvalidate events that will occur during a planning >> cycle, we have to rebuild those relcache entries; previously they >> were just dropped once and that was the end of it till execution. > Right. So it's just that we turned a bunch of rebuild = false > situations into rebuild = true situations. I think. More to the point, we turned *one* rebuild = false situation into a bunch of rebuild = true situations. I haven't studied it closely, but I think a CCA animal would probably see O(N^2) rebuild = true invocations in a query with N partitions, since each time expand_partitioned_rtentry opens another child table, we'll incur an AcceptInvalidationMessages call which leads to forced rebuilds of the previously-pinned rels. In non-CCA situations, almost always nothing happens with the previously-examined relcache entries. >> I remain of the opinion that we ought not have PartitionDirectories >> pinning relcache entries ... but this particular effect isn't really >> significant to that argument, I think. > Cool. I would still like to hear more about why you think that. I > agree that the pinning is not great, but copying moderately large > partitioning structures that are only likely to get more complex in > future releases also does not seem great, so I think it may be the > least of evils. You, on the other hand, seem to have a rather > visceral hatred for it. I agree that copying data isn't great. What I don't agree is that this solution is better. In particular, copying data out of the relcache rather than expecting the relcache to hold still over long periods is the way we've done things everywhere else, cf RelationGetIndexList, RelationGetStatExtList, RelationGetIndexExpressions, RelationGetIndexPredicate, RelationGetIndexAttrBitmap, RelationGetExclusionInfo, GetRelationPublicationActions. I don't care for a patch randomly deciding to do things differently on the basis of an unsupported-by-evidence argument that it might cost too much to copy the data. If we're going to make a push to reduce the amount of copying of that sort that we do, it should be a separately (and carefully) designed thing that applies to all the relcache substructures that have the issue, not one-off hacks that haven't been reviewed thoroughly. I especially don't care for the much-less-than-half-baked kluge of chaining the old rd_pdcxt onto the new one and hoping that it will go away at a suitable time. Yeah, that will work all right as long as it's not stressed very hard, but we've found repeatedly that users will find a way to overstress places where we're sloppy about resource management. In this case, since the leak is potentially of session lifespan, I doubt it's even very hard to cause unbounded growth in relcache space usage --- you just have to do $whatever over and over. Let's see ... 
regression=# create table parent (a text, b int) partition by list (a);
CREATE TABLE
regression=# create table child (a text, b int);
CREATE TABLE
regression=# do $$
regression$# begin
regression$# for i in 1..10000000 loop
regression$# alter table parent attach partition child for values in ('x');
regression$# alter table parent detach partition child;
regression$# end loop;
regression$# end $$;

After letting that run for awhile, I've got enough rd_pdcxts to send MemoryContextStats into a nasty loop.

CacheMemoryContext: 4259904 total in 11 blocks; 1501120 free (19 chunks); 2758784 used
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
... yadda yadda ...

The leak isn't terribly fast, since this is an expensive thing to be doing, but it's definitely leaking.

> That may be based on facts with which I am > not acquainted or it may be that we have a different view of what good > design looks like.

I don't think solving the same problem in multiple ways constitutes good design, barring compelling reasons for the solutions being different, which you haven't presented. Solving the same problem in multiple ways some of which are buggy is *definitely* not good design.

regards, tom lane
On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > More to the point, we turned *one* rebuild = false situation into > a bunch of rebuild = true situations. I haven't studied it closely, > but I think a CCA animal would probably see O(N^2) rebuild = true > invocations in a query with N partitions, since each time > expand_partitioned_rtentry opens another child table, we'll incur > an AcceptInvalidationMessages call which leads to forced rebuilds > of the previously-pinned rels. In non-CCA situations, almost always > nothing happens with the previously-examined relcache entries. That's rather unfortunate. I start to think that clobbering the cache "always" is too hard a line. > I agree that copying data isn't great. What I don't agree is that this > solution is better. In particular, copying data out of the relcache > rather than expecting the relcache to hold still over long periods > is the way we've done things everywhere else, cf RelationGetIndexList, > RelationGetStatExtList, RelationGetIndexExpressions, > RelationGetIndexPredicate, RelationGetIndexAttrBitmap, > RelationGetExclusionInfo, GetRelationPublicationActions. I don't care > for a patch randomly deciding to do things differently on the basis of an > unsupported-by-evidence argument that it might cost too much to copy the > data. If we're going to make a push to reduce the amount of copying of > that sort that we do, it should be a separately (and carefully) designed > thing that applies to all the relcache substructures that have the issue, > not one-off hacks that haven't been reviewed thoroughly. That's not an unreasonable argument. On the other hand, if you never try new stuff, you lose opportunities to explore things that might turn out to be better and worth adopting more widely. I am not very convinced that it makes sense to lump something like RelationGetIndexAttrBitmap in with something like RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a single Bitmapset, whereas the data structure RelationGetPartitionDesc is vastly larger and more complex. You can't say "well, if it's best to copy 32 bytes of data out of the relcache every time we need it, it must also be right to copy 10k or 100k of data out of the relcache every time we need it." There is another difference as well: there's a good chance that somebody is going to want to mutate a Bitmapset, whereas they had BETTER NOT think that they can mutate the PartitionDesc. So returning an uncopied Bitmapset is kinda risky in a way that returning an uncopied PartitionDesc is not. If we want an at-least-somewhat unified solution here, I think we need to bite the bullet and make the planner hold a reference to the relcache throughout planning. (The executor already keeps it open, I believe.) Otherwise, how is the relcache supposed to know when it can throw stuff away and when it can't? The only alternative seems to be to have each subsystem hold its own reference count, as I did with the PartitionDirectory stuff, which is not something we'd want to scale out. > I especially don't care for the much-less-than-half-baked kluge of > chaining the old rd_pdcxt onto the new one and hoping that it will go away > at a suitable time. It will go away at a suitable time, but maybe not at the soonest suitable time. It wouldn't be hard to improve that, though. The easiest thing to do, I think, would be to have an rd_oldpdcxt and stuff the old context there. If there already is one there, parent the new one under it. 
When RelationDecrementReferenceCount reduces the reference count to zero, destroy anything found in rd_oldpdcxt. With that change, things get destroyed at the earliest time at which we know the old things aren't referenced, instead of the earliest time at which they are not referenced + an invalidation arrives. > regression=# create table parent (a text, b int) partition by list (a); > CREATE TABLE > regression=# create table child (a text, b int); > CREATE TABLE > regression=# do $$ > regression$# begin > regression$# for i in 1..10000000 loop > regression$# alter table parent attach partition child for values in ('x'); > regression$# alter table parent detach partition child; > regression$# end loop; > regression$# end $$; Interesting example. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
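To make that rd_oldpdcxt idea concrete, here is a minimal sketch of what the two touch points might look like; the field name and the exact placement are assumptions for illustration, not code from any posted patch:

    /* In the partdesc rebuild path, instead of chaining onto rd_pdcxt: */
    if (rel->rd_pdcxt != NULL)
    {
        if (rel->rd_oldpdcxt != NULL)
            MemoryContextSetParent(rel->rd_pdcxt, rel->rd_oldpdcxt);
        else
            rel->rd_oldpdcxt = rel->rd_pdcxt;   /* park the displaced context */
    }
    rel->rd_pdcxt = new_pdcxt;
    rel->rd_partdesc = new_partdesc;

    /* In RelationDecrementReferenceCount, once nobody holds a pin: */
    if (rel->rd_refcnt == 0 && rel->rd_oldpdcxt != NULL)
    {
        /* no caller can still be looking at the old partdescs */
        MemoryContextDelete(rel->rd_oldpdcxt);
        rel->rd_oldpdcxt = NULL;
    }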
On 2019/03/14 10:40, Amit Langote wrote: > On 2019/03/14 5:18, Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Wed, Mar 13, 2019 at 3:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> Meanwhile, who's going to take point on cleaning up rd_partcheck? >>>> I don't really understand this code well enough to know whether that >>>> can share one of the existing partitioning-related sub-contexts. >> >>> To your question, I think it probably can't share a context. Briefly, >>> rd_partkey can't change ever, except that when a partitioned relation >>> is in the process of being created it is briefly NULL; once it obtains >>> a value, that value cannot be changed. If you want to range-partition >>> a list-partitioned table or something like that, you have to drop the >>> table and create a new one. I think that's a perfectly acceptable >>> permanent limitation and I have no urge whatever to change it. >>> rd_partdesc changes when you attach or detach a child partition. >>> rd_partcheck is the reverse: it changes when you attach or detach this >>> partition to or from a parent. >> >> Got it. Yeah, it seems like the clearest and least bug-prone solution >> is for those to be in three separate sub-contexts. The only reason >> I was trying to avoid that was the angle of how to back-patch into 11. >> However, we can just add the additional context pointer field at the >> end of the Relation struct in v11, and that should be good enough to >> avoid ABI problems. > > Agree that rd_partcheck needs its own context as you have complained in > the past [1]. > > I think we'll need to back-patch this fix to PG 10 as well. I've attached > patches for all 3 branches. > > Thanks, > Amit > > [1] https://www.postgresql.org/message-id/22236.1523374067%40sss.pgh.pa.us Should I add this patch to Older Bugs [1]? Thanks, Amit [1] https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > Should I add this patch to Older Bugs [1]? Yeah, it's an open issue IMO. I think we've been focusing on getting as many feature patches done as we could during the CF, but now it's time to start mopping up problems like this one. regards, tom lane
On Mon, Apr 8, 2019 at 9:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > > Should I add this patch to Older Bugs [1]? > > Yeah, it's an open issue IMO. I think we've been focusing on getting > as many feature patches done as we could during the CF, but now it's > time to start mopping up problems like this one. Do you have any further thoughts based on my last response? Does anyone else wish to offer opinions? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019/04/08 22:59, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> Should I add this patch to Older Bugs [1]? > > Yeah, it's an open issue IMO. I think we've been focusing on getting > as many feature patches done as we could during the CF, but now it's > time to start mopping up problems like this one. Thanks, done. Regards, Amit
On 2019/03/16 6:41, Robert Haas wrote: > On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I agree that copying data isn't great. What I don't agree is that this >> solution is better. In particular, copying data out of the relcache >> rather than expecting the relcache to hold still over long periods >> is the way we've done things everywhere else, cf RelationGetIndexList, >> RelationGetStatExtList, RelationGetIndexExpressions, >> RelationGetIndexPredicate, RelationGetIndexAttrBitmap, >> RelationGetExclusionInfo, GetRelationPublicationActions. I don't care >> for a patch randomly deciding to do things differently on the basis of an >> unsupported-by-evidence argument that it might cost too much to copy the >> data. If we're going to make a push to reduce the amount of copying of >> that sort that we do, it should be a separately (and carefully) designed >> thing that applies to all the relcache substructures that have the issue, >> not one-off hacks that haven't been reviewed thoroughly. > > That's not an unreasonable argument. On the other hand, if you never > try new stuff, you lose opportunities to explore things that might > turn out to be better and worth adopting more widely. > > I am not very convinced that it makes sense to lump something like > RelationGetIndexAttrBitmap in with something like > RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a > single Bitmapset, whereas the data structure RelationGetPartitionDesc > is vastly larger and more complex. You can't say "well, if it's best > to copy 32 bytes of data out of the relcache every time we need it, it > must also be right to copy 10k or 100k of data out of the relcache > every time we need it." > > There is another difference as well: there's a good chance that > somebody is going to want to mutate a Bitmapset, whereas they had > BETTER NOT think that they can mutate the PartitionDesc. So returning > an uncopied Bitmapset is kinda risky in a way that returning an > uncopied PartitionDesc is not. > > If we want an at-least-somewhat unified solution here, I think we need > to bite the bullet and make the planner hold a reference to the > relcache throughout planning. (The executor already keeps it open, I > believe.) Otherwise, how is the relcache supposed to know when it can > throw stuff away and when it can't? The only alternative seems to be > to have each subsystem hold its own reference count, as I did with the > PartitionDirectory stuff, which is not something we'd want to scale > out. Fwiw, I'd like to vote for planner holding the relcache reference open throughout planning. The planner could then reference the various substructures directly (using a non-copying accessor), except those that something in the planner might want to modify, in which case use the copying accessor. Thanks, Amit
On Mon, Apr 08, 2019 at 10:40:41AM -0400, Robert Haas wrote: > On Mon, Apr 8, 2019 at 9:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> Yeah, it's an open issue IMO. I think we've been focusing on getting >> as many feature patches done as we could during the CF, but now it's >> time to start mopping up problems like this one. Please note that it is registered as an older bug and not an open item. > Do you have any further thoughts based on my last response? So your last response is that: https://www.postgresql.org/message-id/CA+Tgmoa5rT+ZR+Vv+q1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw@mail.gmail.com And what are you proposing as patch? Perhaps something among those lines? https://www.postgresql.org/message-id/036852f2-ba7f-7a1f-21c6-00bc3515eda3@lab.ntt.co.jp > Does anyone else wish to offer opinions? It seems to me that Tom's argument to push in the way relcache information is handled by copying its contents sounds sensible to me. That's not perfect, but it is consistent with what exists (note PublicationActions for a rather-still-not-much-complex example of structure which gets copied from the relcache). -- Michael
Hi,

On 2019/04/10 15:42, Michael Paquier wrote:
> On Mon, Apr 08, 2019 at 10:40:41AM -0400, Robert Haas wrote:
>> On Mon, Apr 8, 2019 at 9:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>>> Yeah, it's an open issue IMO.  I think we've been focusing on getting
>>> as many feature patches done as we could during the CF, but now it's
>>> time to start mopping up problems like this one.
>
> Please note that it is registered as an older bug and not an open
> item.

The problem lies in all branches that have partitioning, so it should be
listed under Older Bugs, right?  You may have noticed that I posted
patches for all branches down to 10.

>> Do you have any further thoughts based on my last response?
>
> So your last response is that:
> https://www.postgresql.org/message-id/CA+Tgmoa5rT+ZR+Vv+q1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw@mail.gmail.com
> And what are you proposing as patch?  Perhaps something among those
> lines?
> https://www.postgresql.org/message-id/036852f2-ba7f-7a1f-21c6-00bc3515eda3@lab.ntt.co.jp

AFAIK, the patch there isn't meant to solve the problems discussed at the
first link.  It's meant to fix poor cache memory management of partition
constraint expression trees, which seems to be a separate issue from the
PartitionDesc memory management issue (the latter is the main topic of
this thread).  The problem with partition constraint expression trees was
only mentioned in passing in this thread [1], although it had also come
up in the past, as I said when posting the patch.

>> Does anyone else wish to offer opinions?
>
> It seems to me that Tom's argument to push in the way relcache
> information is handled by copying its contents sounds sensible to me.
> That's not perfect, but it is consistent with what exists (note
> PublicationActions for a rather-still-not-much-complex example of
> structure which gets copied from the relcache).

I gave my vote for direct access of unchangeable relcache substructures
(TupleDesc, PartitionDesc, etc.), because they're accessed during the
planning of every user query and fairly expensive to copy compared to the
list of indexes or PublicationActions that you're citing.  To further my
point a bit, I wonder why PublicationActions needs to be copied out of
the relcache at all?  Looking at its usage in execReplication.c, it seems
we can do fine without copying, because we are holding both a lock and a
relcache reference on the replication target relation during the entirety
of apply_handle_insert(), which means that information can't change under
us, neither logically nor physically.

Also, to reiterate what I think was one of Robert's points upthread [2],
the reason for requiring some code to copy the relcache substructures out
of the relcache should be that the caller might want to change their
contents; for example, the planner or its hooks may want to add/remove an
index to/from the list of indexes copied from the relcache.  The reason
for copying should not be that relcache content may change under us
despite holding a lock and a relcache reference.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/7961.1552498252%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/CA%2BTgmoa5rT%2BZR%2BVv%2Bq1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw%40mail.gmail.com
On Wed, Apr 10, 2019 at 05:03:21PM +0900, Amit Langote wrote: > The problem lies in all branches that have partitioning, so it should be > listed under Older Bugs, right? You may have noticed that I posted > patches for all branches down to 10. I have noticed. The message from Tom upthread outlined that an open item should be added, but this is not one. That's what I wanted to emphasize. Thanks for making it clearer. -- Michael
Michael Paquier <michael@paquier.xyz> writes: > On Wed, Apr 10, 2019 at 05:03:21PM +0900, Amit Langote wrote: >> The problem lies in all branches that have partitioning, so it should be >> listed under Older Bugs, right? You may have noticed that I posted >> patches for all branches down to 10. > I have noticed. The message from Tom upthread outlined that an open > item should be added, but this is not one. That's what I wanted to > emphasize. Thanks for making it clearer. To clarify a bit: there's more than one problem here. Amit added an open item about pre-existing leaks related to rd_partcheck. (I'm going to review and hopefully push his fix for that today.) However, there's a completely separate leak associated with mismanagement of rd_pdcxt, as I showed in [1], and it doesn't seem like we have consensus about what to do about that one. AFAIK that's a new bug in 12 (caused by 898e5e329) and so it ought to be in the main open-items list. This thread has discussed a bunch of possible future changes like trying to replace copying of relcache-provided data structures with reference-counting, but I don't think any such thing could be v12 material at this point. We do need to fix the newly added leak though. regards, tom lane [1] https://www.postgresql.org/message-id/10797.1552679128%40sss.pgh.pa.us
Robert Haas <robertmhaas@gmail.com> writes: > As a parenthetical note, I observe that relcache.c seems to know > almost nothing about rd_partcheck. rd_partkey and rd_partdesc both > have handling in RelationClearRelation(), but rd_partcheck does not, > and I suspect that's wrong. So the problems are probably not confined > to the relcache-drop-time problem that you observed. I concluded that that's not parenthetical but pretty relevant... Attached is a revised version of Amit's patch at [1] that makes these data structures be treated more similarly. I also added some Asserts and comment improvements to address the complaints I made upthread about under-documentation of all this logic. I also cleaned up the problem the code had with failing to distinguish "partcheck list is NIL" from "partcheck list hasn't been computed yet". For a relation with no such constraints, generate_partition_qual would do the full pushups every time. I'm not sure if the case actually occurs commonly enough that that's a performance problem, but failure to account for it made my added assertions fall over :-( and I thought fixing it was better than weakening the assertions. I haven't made back-patch versions yet. I'd expect they could be substantially the same, except the two new fields have to go at the end of struct RelationData to avoid ABI breaks. Oh: we might also need some change in RelationCacheInitializePhase3, depending on the decision about [2]. regards, tom lane [1] https://www.postgresql.org/message-id/036852f2-ba7f-7a1f-21c6-00bc3515eda3@lab.ntt.co.jp [2] https://www.postgresql.org/message-id/5706.1555093031@sss.pgh.pa.us diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c index e436d1e..4d6595b 100644 --- a/src/backend/partitioning/partdesc.c +++ b/src/backend/partitioning/partdesc.c @@ -49,10 +49,13 @@ typedef struct PartitionDirectoryEntry /* * RelationBuildPartitionDesc - * Form rel's partition descriptor + * Form rel's partition descriptor, and store in relcache entry * - * Not flushed from the cache by RelationClearRelation() unless changed because - * of addition or removal of partition. + * Note: the descriptor won't be flushed from the cache by + * RelationClearRelation() unless it's changed because of + * addition or removal of a partition. Hence, code holding a lock + * that's sufficient to prevent that can assume that rd_partdesc + * won't change underneath it. */ void RelationBuildPartitionDesc(Relation rel) @@ -173,7 +176,19 @@ RelationBuildPartitionDesc(Relation rel) ++i; } - /* Now build the actual relcache partition descriptor */ + /* Assert we aren't about to leak any old data structure */ + Assert(rel->rd_pdcxt == NULL); + Assert(rel->rd_partdesc == NULL); + + /* + * Now build the actual relcache partition descriptor. Note that the + * order of operations here is fairly critical. If we fail partway + * through this code, we won't have leaked memory because the rd_pdcxt is + * attached to the relcache entry immediately, so it'll be freed whenever + * the entry is rebuilt or destroyed. However, we don't assign to + * rd_partdesc until the cached data structure is fully complete and + * valid, so that no other code might try to use it. 
+ */ rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext, "partition descriptor", ALLOCSET_SMALL_SIZES); diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c index 8f43d68..29b4a76 100644 --- a/src/backend/utils/cache/partcache.c +++ b/src/backend/utils/cache/partcache.c @@ -41,11 +41,11 @@ static List *generate_partition_qual(Relation rel); /* * RelationBuildPartitionKey - * Build and attach to relcache partition key data of relation + * Build partition key data of relation, and attach to relcache * * Partitioning key data is a complex structure; to avoid complicated logic to * free individual elements whenever the relcache entry is flushed, we give it - * its own memory context, child of CacheMemoryContext, which can easily be + * its own memory context, a child of CacheMemoryContext, which can easily be * deleted on its own. To avoid leaking memory in that context in case of an * error partway through this function, the context is initially created as a * child of CurTransactionContext and only re-parented to CacheMemoryContext @@ -144,6 +144,7 @@ RelationBuildPartitionKey(Relation relation) MemoryContextSwitchTo(oldcxt); } + /* Allocate assorted arrays in the partkeycxt, which we'll fill below */ oldcxt = MemoryContextSwitchTo(partkeycxt); key->partattrs = (AttrNumber *) palloc0(key->partnatts * sizeof(AttrNumber)); key->partopfamily = (Oid *) palloc0(key->partnatts * sizeof(Oid)); @@ -151,8 +152,6 @@ RelationBuildPartitionKey(Relation relation) key->partsupfunc = (FmgrInfo *) palloc0(key->partnatts * sizeof(FmgrInfo)); key->partcollation = (Oid *) palloc0(key->partnatts * sizeof(Oid)); - - /* Gather type and collation info as well */ key->parttypid = (Oid *) palloc0(key->partnatts * sizeof(Oid)); key->parttypmod = (int32 *) palloc0(key->partnatts * sizeof(int32)); key->parttyplen = (int16 *) palloc0(key->partnatts * sizeof(int16)); @@ -235,6 +234,10 @@ RelationBuildPartitionKey(Relation relation) ReleaseSysCache(tuple); + /* Assert that we're not leaking any old data during assignments below */ + Assert(relation->rd_partkeycxt == NULL); + Assert(relation->rd_partkey == NULL); + /* * Success --- reparent our context and make the relcache point to the * newly constructed key @@ -305,11 +308,9 @@ get_partition_qual_relid(Oid relid) * Generate partition predicate from rel's partition bound expression. The * function returns a NIL list if there is no predicate. * - * Result expression tree is stored CacheMemoryContext to ensure it survives - * as long as the relcache entry. But we should be running in a less long-lived - * working context. To avoid leaking cache memory if this routine fails partway - * through, we build in working memory and then copy the completed structure - * into cache memory. + * We cache a copy of the result in the relcache entry, after constructing + * it using the caller's context. This approach avoids leaking any data + * into long-lived cache contexts, especially if we fail partway through. 
*/ static List * generate_partition_qual(Relation rel) @@ -326,8 +327,8 @@ generate_partition_qual(Relation rel) /* Guard against stack overflow due to overly deep partition tree */ check_stack_depth(); - /* Quick copy */ - if (rel->rd_partcheck != NIL) + /* If we already cached the result, just return a copy */ + if (rel->rd_partcheckvalid) return copyObject(rel->rd_partcheck); /* Grab at least an AccessShareLock on the parent table */ @@ -373,13 +374,36 @@ generate_partition_qual(Relation rel) if (found_whole_row) elog(ERROR, "unexpected whole-row reference found in partition key"); - /* Save a copy in the relcache */ - oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - rel->rd_partcheck = copyObject(result); - MemoryContextSwitchTo(oldcxt); + /* Assert that we're not leaking any old data during assignments below */ + Assert(rel->rd_partcheckcxt == NULL); + Assert(rel->rd_partcheck == NIL); + + /* + * Save a copy in the relcache. The order of these operations is fairly + * critical to avoid memory leaks and ensure that we don't leave a corrupt + * relcache entry if we fail partway through copyObject. + * + * If, as is definitely possible, the partcheck list is NIL, then we do + * not need to make a context to hold it. + */ + if (result != NIL) + { + rel->rd_partcheckcxt = AllocSetContextCreate(CacheMemoryContext, + "partition constraint", + ALLOCSET_SMALL_SIZES); + MemoryContextCopyAndSetIdentifier(rel->rd_partcheckcxt, + RelationGetRelationName(rel)); + oldcxt = MemoryContextSwitchTo(rel->rd_partcheckcxt); + rel->rd_partcheck = copyObject(result); + MemoryContextSwitchTo(oldcxt); + } + else + rel->rd_partcheck = NIL; + rel->rd_partcheckvalid = true; /* Keep the parent locked until commit */ relation_close(parent, NoLock); + /* Return the working copy to the caller */ return result; } diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 64f3c2e..4b884d5 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -1175,11 +1175,15 @@ RelationBuildDesc(Oid targetRelId, bool insertIt) } else { - relation->rd_partkeycxt = NULL; relation->rd_partkey = NULL; + relation->rd_partkeycxt = NULL; relation->rd_partdesc = NULL; relation->rd_pdcxt = NULL; } + /* ... but partcheck is not loaded till asked for */ + relation->rd_partcheck = NIL; + relation->rd_partcheckvalid = false; + relation->rd_partcheckcxt = NULL; /* * initialize access method information @@ -2364,8 +2368,8 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc) MemoryContextDelete(relation->rd_partkeycxt); if (relation->rd_pdcxt) MemoryContextDelete(relation->rd_pdcxt); - if (relation->rd_partcheck) - pfree(relation->rd_partcheck); + if (relation->rd_partcheckcxt) + MemoryContextDelete(relation->rd_partcheckcxt); if (relation->rd_fdwroutine) pfree(relation->rd_fdwroutine); pfree(relation); @@ -5600,18 +5604,20 @@ load_relcache_init_file(bool shared) * format is complex and subject to change). They must be rebuilt if * needed by RelationCacheInitializePhase3. This is not expected to * be a big performance hit since few system catalogs have such. Ditto - * for RLS policy data, index expressions, predicates, exclusion info, - * and FDW info. + * for RLS policy data, partition info, index expressions, predicates, + * exclusion info, and FDW info. 
*/ rel->rd_rules = NULL; rel->rd_rulescxt = NULL; rel->trigdesc = NULL; rel->rd_rsdesc = NULL; - rel->rd_partkeycxt = NULL; rel->rd_partkey = NULL; - rel->rd_pdcxt = NULL; + rel->rd_partkeycxt = NULL; rel->rd_partdesc = NULL; + rel->rd_pdcxt = NULL; rel->rd_partcheck = NIL; + rel->rd_partcheckvalid = false; + rel->rd_partcheckcxt = NULL; rel->rd_indexprs = NIL; rel->rd_indpred = NIL; rel->rd_exclops = NULL; diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h index 764e6fa..51ad250 100644 --- a/src/include/utils/rel.h +++ b/src/include/utils/rel.h @@ -95,11 +95,13 @@ typedef struct RelationData List *rd_fkeylist; /* list of ForeignKeyCacheInfo (see below) */ bool rd_fkeyvalid; /* true if list has been computed */ - MemoryContext rd_partkeycxt; /* private memory cxt for the below */ struct PartitionKeyData *rd_partkey; /* partition key, or NULL */ - MemoryContext rd_pdcxt; /* private context for partdesc */ + MemoryContext rd_partkeycxt; /* private context for rd_partkey, if any */ struct PartitionDescData *rd_partdesc; /* partitions, or NULL */ + MemoryContext rd_pdcxt; /* private context for partdesc, if any */ List *rd_partcheck; /* partition CHECK quals */ + bool rd_partcheckvalid; /* true if list has been computed */ + MemoryContext rd_partcheckcxt; /* private cxt for rd_partcheck, if any */ /* data managed by RelationGetIndexList: */ List *rd_indexlist; /* list of OIDs of indexes on relation */
Thanks for the updated patch. On Sat, Apr 13, 2019 at 4:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > As a parenthetical note, I observe that relcache.c seems to know > > almost nothing about rd_partcheck. rd_partkey and rd_partdesc both > > have handling in RelationClearRelation(), but rd_partcheck does not, > > and I suspect that's wrong. So the problems are probably not confined > > to the relcache-drop-time problem that you observed. > > I concluded that that's not parenthetical but pretty relevant... Hmm, do you mean we should grow the same "keep_" logic for rd_partcheck as we have for rd_partkey and rd_partdesc? I don't see it in the updated patch though. > Attached is a revised version of Amit's patch at [1] that makes these > data structures be treated more similarly. I also added some Asserts > and comment improvements to address the complaints I made upthread about > under-documentation of all this logic. Thanks for all the improvements. > I also cleaned up the problem the code had with failing to distinguish > "partcheck list is NIL" from "partcheck list hasn't been computed yet". > For a relation with no such constraints, generate_partition_qual would do > the full pushups every time. Actually, callers must have checked that the table is a partition (relispartition). It wouldn't be a bad idea to add an Assert or elog in generate_partition_qual. > I'm not sure if the case actually occurs > commonly enough that that's a performance problem, but failure to account > for it made my added assertions fall over :-( and I thought fixing it > was better than weakening the assertions. Hmm, I wonder why the Asserts failed given what I said above. > I haven't made back-patch versions yet. I'd expect they could be > substantially the same, except the two new fields have to go at the > end of struct RelationData to avoid ABI breaks. To save you the pain of finding the right files to patch in back-branches, I made those (attached). Thanks, Amit
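As a concrete (hypothetical) version of that suggestion, the guard could be as simple as the following near the top of generate_partition_qual(); this is only a sketch, not part of the posted patches:

    static List *
    generate_partition_qual(Relation rel)
    {
        ...
        /* Callers are supposed to have verified this already */
        Assert(rel->rd_rel->relispartition);
        ...
    }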
Amit Langote <amitlangote09@gmail.com> writes: > On Sat, Apr 13, 2019 at 4:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I concluded that that's not parenthetical but pretty relevant... > Hmm, do you mean we should grow the same "keep_" logic for > rd_partcheck as we have for rd_partkey and rd_partdesc? I don't see > it in the updated patch though. No, the "keep_" stuff is only necessary when we're trying to preserve data structures in-place, which is only important if non-relcache code might be using pointers to it. Since rd_partcheck is never directly accessed by external code, only copied, there can't be any live pointers to it to worry about. Besides which, since it's load on demand rather than something that RelationBuildDesc forces to be valid immediately, any comparison would be guaranteed to fail: the field in the new reldesc will always be empty at this point. Perhaps there's an argument that it should be load-immediately not load-on-demand, but that would be an optimization not a bug fix, and I'm skeptical that it'd be an improvement anyway. Probably this is something to revisit whenever somebody gets around to addressing the whole copy-vs-dont-copy-vs-use-a-refcount business that we were handwaving about upthread. >> I also cleaned up the problem the code had with failing to distinguish >> "partcheck list is NIL" from "partcheck list hasn't been computed yet". >> For a relation with no such constraints, generate_partition_qual would do >> the full pushups every time. > Actually, callers must have checked that the table is a partition > (relispartition). That does not mean that it's guaranteed to have any partcheck constraints; there are counterexamples in the regression tests. It looks like the main case is a LIST-partitioned table that has only a default partition. regards, tom lane
Amit Langote <amitlangote09@gmail.com> writes:
> To save you the pain of finding the right files to patch in
> back-branches, I made those (attached).

BTW, as far as that goes: I'm of the opinion that the partitioning
logic's factorization in this area is pretty awful, and v12 has made it
worse, not better.  It's important IMO that there be a clear distinction
between code that belongs to/can manipulate the relcache, and outside
code that ought at most to examine it (and maybe not even that, depending
on where we come down on this copy-vs-refcount business).  Maybe allowing
utils/cache/partcache.c to be effectively halfway inside the relcache
module is tolerable, but I don't think it's great design.  Allowing files
over in partitioning/ to also have functions inside that boundary is
really not good, especially when you can't even say that all of
partdesc.c is part of relcache.  I'm seriously inclined to put
RelationBuildPartitionDesc back where it was in v11.

			regards, tom lane
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2019/04/10 15:42, Michael Paquier wrote: >> It seems to me that Tom's argument to push in the way relcache >> information is handled by copying its contents sounds sensible to me. >> That's not perfect, but it is consistent with what exists (note >> PublicationActions for a rather-still-not-much-complex example of >> structure which gets copied from the relcache). > I gave my vote for direct access of unchangeable relcache substructures > (TupleDesc, PartitionDesc, etc.), because they're accessed during the > planning of every user query and fairly expensive to copy compared to list > of indexes or PublicationActions that you're citing. To further my point > a bit, I wonder why PublicationActions needs to be copied out of relcache > at all? Looking at its usage in execReplication.c, it seems we can do > fine without copying, because we are holding both a lock and a relcache > reference on the replication target relation during the entirety of > apply_handle_insert(), which means that information can't change under us, > neither logically nor physically. So the point here is that that reasoning is faulty. You *cannot* assume, no matter how strong a lock or how many pins you hold, that a relcache entry will not get rebuilt underneath you. Cache flushes happen regardless. And unless relcache.c takes special measures to prevent it, a rebuild will result in moving subsidiary data structures and thereby breaking any pointers you may have pointing into those data structures. For certain subsidiary structures such as the relation tupdesc, we do take such special measures: that's what the "keep_xxx" dance in RelationClearRelation is. However, that's expensive, both in cycles and maintenance effort: it requires having code that can decide equality of the subsidiary data structures, which we might well have no other use for, and which we certainly don't have strong tests for correctness of. It's also very error-prone for callers, because there isn't any good way to cross-check that code using a long-lived pointer to a subsidiary structure is holding a lock that's strong enough to guarantee non-mutation of that structure, or even that relcache.c provides any such guarantee at all. (If our periodic attempts to reduce lock strength for assorted DDL operations don't scare the pants off you in this connection, you have not thought hard enough about it.) So I think that even though we've largely gotten away with this approach so far, it's also a half-baked kluge that we should be looking to get rid of, not extend to yet more cases. To my mind there are only two trustworthy solutions to the problem of wanting time-extended usage of a relcache subsidiary data structure: one is to copy it, and the other is to reference-count it. I think that going over to a reference-count-based approach for many of these structures might well be something we should do in future, maybe even the very near future. In the mean time, though, I'm not really satisfied with inserting half-baked kluges, especially not ones that are different from our other half-baked kluges for similar purposes. I think that's a path to creating hard-to-reproduce bugs. regards, tom lane
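To illustrate the reference-counting alternative described above, a bare-bones sketch might look like the following; the type and function names are invented for illustration and are not proposed code:

    typedef struct RefCountedPartitionDesc
    {
        int             refcount;   /* number of users holding a pointer */
        MemoryContext   context;    /* the struct and everything reachable
                                     * from partdesc are allocated here */
        PartitionDesc   partdesc;
    } RefCountedPartitionDesc;

    static void
    PartitionDescPin(RefCountedPartitionDesc *rpd)
    {
        rpd->refcount++;
    }

    static void
    PartitionDescRelease(RefCountedPartitionDesc *rpd)
    {
        Assert(rpd->refcount > 0);
        if (--rpd->refcount == 0)
            MemoryContextDelete(rpd->context);  /* frees everything at once */
    }

The relcache would hold one reference of its own and drop it on invalidation, so the data survives exactly as long as some caller is still using it, without any copying; similar in spirit to how the typcache refcounts its cached tuple descriptors.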
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I agree that copying data isn't great. What I don't agree is that this >> solution is better. > I am not very convinced that it makes sense to lump something like > RelationGetIndexAttrBitmap in with something like > RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a > single Bitmapset, whereas the data structure RelationGetPartitionDesc > is vastly larger and more complex. You can't say "well, if it's best > to copy 32 bytes of data out of the relcache every time we need it, it > must also be right to copy 10k or 100k of data out of the relcache > every time we need it." I did not say that. What I did say is that they're both correct solutions. Copying large data structures is clearly a potential performance problem, but that doesn't mean we can take correctness shortcuts. > If we want an at-least-somewhat unified solution here, I think we need > to bite the bullet and make the planner hold a reference to the > relcache throughout planning. (The executor already keeps it open, I > believe.) Otherwise, how is the relcache supposed to know when it can > throw stuff away and when it can't? The real problem with that is that we *still* won't know whether it's okay to throw stuff away or not. The issue with these subsidiary data structures is exactly that we're trying to reduce the lock strength required for changing one of them, and as soon as you make that lock strength less than AEL, you have the problem that that value may need to change in relcache entries despite them being pinned. The code I'm complaining about is trying to devise a shortcut solution to exactly that situation ... and it's not a good shortcut. > The only alternative seems to be to have each subsystem hold its own > reference count, as I did with the PartitionDirectory stuff, which is > not something we'd want to scale out. Well, we clearly don't want to devise a separate solution for every subsystem, but it doesn't seem out of the question to me that we could build a "reference counted cache sub-structure" mechanism and apply it to each of these relcache fields. Maybe it could be unified with the existing code in the typcache that does a similar thing. Sure, this'd be a fair amount of work, but we've done it before. Syscache entries didn't use to have reference counts, for example. BTW, the problem I have with the PartitionDirectory stuff is exactly that it *isn't* a reference-counted solution. If it were, we'd not have this problem of not knowing when we could free rd_pdcxt. >> I especially don't care for the much-less-than-half-baked kluge of >> chaining the old rd_pdcxt onto the new one and hoping that it will go away >> at a suitable time. > It will go away at a suitable time, but maybe not at the soonest > suitable time. It wouldn't be hard to improve that, though. The > easiest thing to do, I think, would be to have an rd_oldpdcxt and > stuff the old context there. If there already is one there, parent > the new one under it. When RelationDecrementReferenceCount reduces > the reference count to zero, destroy anything found in rd_oldpdcxt. Meh. While it seems likely that that would mask most practical problems, it still seems like covering up a wart with a dirty bandage. In particular, that fix doesn't fix anything unless relcache reference counts go to zero pretty quickly --- which I'll just note is directly contrary to your enthusiasm for holding relcache pins longer. 
I think that what we ought to do for v12 is have PartitionDirectory copy the data, and then in v13 work on creating real reference-count infrastructure that would allow eliminating the copy steps with full safety. The $64 question is whether that really would cause unacceptable performance problems. To look into that, I made the attached WIP patches. (These are functionally complete, but I didn't bother for instance with removing the hunk that 898e5e329 added to relcache.c, and the comments need work, etc.) The first one just changes the PartitionDirectory code to do that, and then the second one micro-optimizes partition_bounds_copy() to make it somewhat less expensive, mostly by collapsing lots of small palloc's into one big one. What I get for test cases like [1] is single-partition SELECT, hash partitioning: N tps, HEAD tps, patch 2 11426.243754 11448.615193 8 11254.833267 11374.278861 32 11288.329114 11371.942425 128 11222.329256 11185.845258 512 11001.177137 10572.917288 1024 10612.456470 9834.172965 4096 8819.110195 7021.864625 8192 7372.611355 5276.130161 single-partition SELECT, range partitioning: N tps, HEAD tps, patch 2 11037.855338 11153.595860 8 11085.218022 11019.132341 32 10994.348207 10935.719951 128 10884.417324 10532.685237 512 10635.583411 9578.108915 1024 10407.286414 8689.585136 4096 8361.463829 5139.084405 8192 7075.880701 3442.542768 Now certainly these numbers suggest that avoiding the copy could be worth our trouble, but these results are still several orders of magnitude better than where we were two weeks ago [2]. Plus, this is an extreme case that's not really representative of real-world usage, since the test tables have neither indexes nor any data. In practical situations the baseline would be lower and the dropoff less bad. So I don't feel bad about shipping v12 with these sorts of numbers and hoping for more improvement later. regards, tom lane [1] https://www.postgresql.org/message-id/3529.1554051867%40sss.pgh.pa.us [2] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512BAC60%40g01jpexmbkw24 diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c index 4d6595b..96129e7 100644 --- a/src/backend/partitioning/partdesc.c +++ b/src/backend/partitioning/partdesc.c @@ -43,7 +43,6 @@ typedef struct PartitionDirectoryData typedef struct PartitionDirectoryEntry { Oid reloid; - Relation rel; PartitionDesc pd; } PartitionDirectoryEntry; @@ -235,6 +234,34 @@ RelationBuildPartitionDesc(Relation rel) } /* + * copyPartitionDesc + * + * Should be somewhere else perhaps? + */ +static PartitionDesc +copyPartitionDesc(PartitionDesc src, PartitionKey key) +{ + PartitionDesc partdesc; + int nparts = src->nparts; + + partdesc = (PartitionDescData *) palloc0(sizeof(PartitionDescData)); + partdesc->nparts = nparts; + /* If there are no partitions, the rest of the partdesc can stay zero */ + if (nparts > 0) + { + partdesc->oids = (Oid *) palloc(nparts * sizeof(Oid)); + memcpy(partdesc->oids, src->oids, nparts * sizeof(Oid)); + + partdesc->is_leaf = (bool *) palloc(nparts * sizeof(bool)); + memcpy(partdesc->is_leaf, src->is_leaf, nparts * sizeof(bool)); + + partdesc->boundinfo = partition_bounds_copy(src->boundinfo, key); + } + + return partdesc; +} + +/* * CreatePartitionDirectory * Create a new partition directory object. */ @@ -265,7 +292,7 @@ CreatePartitionDirectory(MemoryContext mcxt) * * The purpose of this function is to ensure that we get the same * PartitionDesc for each relation every time we look it up. 
In the - * face of current DDL, different PartitionDescs may be constructed with + * face of concurrent DDL, different PartitionDescs may be constructed with * different views of the catalog state, but any single particular OID * will always get the same PartitionDesc for as long as the same * PartitionDirectory is used. @@ -281,13 +308,16 @@ PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel) if (!found) { /* - * We must keep a reference count on the relation so that the - * PartitionDesc to which we are pointing can't get destroyed. + * Currently, neither RelationGetPartitionDesc nor + * RelationGetPartitionKey can leak any memory; but if they start + * doing so, do those calls before switching into pdir_mcxt. */ - RelationIncrementReferenceCount(rel); - pde->rel = rel; - pde->pd = RelationGetPartitionDesc(rel); + MemoryContext oldcontext = MemoryContextSwitchTo(pdir->pdir_mcxt); + + pde->pd = copyPartitionDesc(RelationGetPartitionDesc(rel), + RelationGetPartitionKey(rel)); Assert(pde->pd != NULL); + MemoryContextSwitchTo(oldcontext); } return pde->pd; } @@ -296,17 +326,11 @@ PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel) * DestroyPartitionDirectory * Destroy a partition directory. * - * Release the reference counts we're holding. + * This is a no-op at present; caller is supposed to release the memory context. */ void DestroyPartitionDirectory(PartitionDirectory pdir) { - HASH_SEQ_STATUS status; - PartitionDirectoryEntry *pde; - - hash_seq_init(&status, pdir->pdir_hash); - while ((pde = hash_seq_search(&status)) != NULL) - RelationDecrementReferenceCount(pde->rel); } /* diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c index c8770cc..b7ddbe5 100644 --- a/src/backend/partitioning/partbounds.c +++ b/src/backend/partitioning/partbounds.c @@ -782,10 +782,13 @@ partition_bounds_copy(PartitionBoundInfo src, PartitionKey key) { PartitionBoundInfo dest; - int i; + bool hash_part = (key->strategy == PARTITION_STRATEGY_HASH); + Datum *alldatums; int ndatums; int partnatts; + int natts; int num_indexes; + int i; dest = (PartitionBoundInfo) palloc(sizeof(PartitionBoundInfoData)); @@ -793,65 +796,69 @@ partition_bounds_copy(PartitionBoundInfo src, ndatums = dest->ndatums = src->ndatums; partnatts = key->partnatts; - num_indexes = get_partition_bound_num_indexes(src); - /* List partitioned tables have only a single partition key. */ Assert(key->strategy != PARTITION_STRATEGY_LIST || partnatts == 1); - dest->datums = (Datum **) palloc(sizeof(Datum *) * ndatums); - if (src->kind != NULL) { + PartitionRangeDatumKind *allkinds; + dest->kind = (PartitionRangeDatumKind **) palloc(ndatums * sizeof(PartitionRangeDatumKind *)); + allkinds = (PartitionRangeDatumKind *) palloc(ndatums * partnatts * + sizeof(PartitionRangeDatumKind)); + for (i = 0; i < ndatums; i++) { - dest->kind[i] = (PartitionRangeDatumKind *) palloc(partnatts * - sizeof(PartitionRangeDatumKind)); + dest->kind[i] = allkinds; + allkinds += partnatts; memcpy(dest->kind[i], src->kind[i], - sizeof(PartitionRangeDatumKind) * key->partnatts); + sizeof(PartitionRangeDatumKind) * partnatts); } } else dest->kind = NULL; + dest->datums = (Datum **) palloc(sizeof(Datum *) * ndatums); + + /* + * For hash partitioning, datums arrays will have two elements - modulus + * and remainder. + */ + natts = hash_part ? 
2 : partnatts; + + alldatums = (Datum *) palloc(sizeof(Datum) * natts * ndatums); + for (i = 0; i < ndatums; i++) { - int j; + dest->datums[i] = alldatums; + alldatums += natts; /* - * For a corresponding to hash partition, datums array will have two - * elements - modulus and remainder. + * For hash partitioning, we know that both datums are present and + * pass-by-value (since they're int4s), so just memcpy the datum + * array. Otherwise we have to apply datumCopy. */ - bool hash_part = (key->strategy == PARTITION_STRATEGY_HASH); - int natts = hash_part ? 2 : partnatts; - - dest->datums[i] = (Datum *) palloc(sizeof(Datum) * natts); - - for (j = 0; j < natts; j++) + if (hash_part) { - bool byval; - int typlen; - - if (hash_part) - { - typlen = sizeof(int32); /* Always int4 */ - byval = true; /* int4 is pass-by-value */ - } - else + memcpy(dest->datums[i], src->datums[i], 2 * sizeof(Datum)); + } + else + { + for (int j = 0; j < natts; j++) { - byval = key->parttypbyval[j]; - typlen = key->parttyplen[j]; + if (dest->kind == NULL || + dest->kind[i][j] == PARTITION_RANGE_DATUM_VALUE) + dest->datums[i][j] = datumCopy(src->datums[i][j], + key->parttypbyval[j], + key->parttyplen[j]); } - - if (dest->kind == NULL || - dest->kind[i][j] == PARTITION_RANGE_DATUM_VALUE) - dest->datums[i][j] = datumCopy(src->datums[i][j], - byval, typlen); } } + num_indexes = get_partition_bound_num_indexes(src); + dest->indexes = (int *) palloc(sizeof(int) * num_indexes); memcpy(dest->indexes, src->indexes, sizeof(int) * num_indexes);
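Stripped to its essentials, the allocation change made by the second patch is the following (a paraphrase for illustration, using the same variable names, not the patch text itself):

    /* Before: one small palloc per datums row */
    for (i = 0; i < ndatums; i++)
        dest->datums[i] = (Datum *) palloc(natts * sizeof(Datum));

    /* After: a single big palloc, handed out slice by slice */
    alldatums = (Datum *) palloc(ndatums * natts * sizeof(Datum));
    for (i = 0; i < ndatums; i++)
    {
        dest->datums[i] = alldatums;
        alldatums += natts;
    }

The same trick is applied to the kind[] arrays, turning O(ndatums) allocator round trips into one.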
On 2019/04/14 0:53, Tom Lane wrote: > Amit Langote <amitlangote09@gmail.com> writes: >> On Sat, Apr 13, 2019 at 4:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I concluded that that's not parenthetical but pretty relevant... > >> Hmm, do you mean we should grow the same "keep_" logic for >> rd_partcheck as we have for rd_partkey and rd_partdesc? I don't see >> it in the updated patch though. > > No, the "keep_" stuff is only necessary when we're trying to preserve > data structures in-place, which is only important if non-relcache > code might be using pointers to it. Since rd_partcheck is never > directly accessed by external code, only copied, there can't be any > live pointers to it to worry about. Besides which, since it's load > on demand rather than something that RelationBuildDesc forces to be > valid immediately, any comparison would be guaranteed to fail: the > field in the new reldesc will always be empty at this point. Ah, that's right. It was just that you were replying to this: Robert wrote: > As a parenthetical note, I observe that relcache.c seems to know > almost nothing about rd_partcheck. rd_partkey and rd_partdesc both > have handling in RelationClearRelation(), but rd_partcheck does not, > and I suspect that's wrong. Maybe I just got confused. > Perhaps there's an argument that it should be load-immediately not > load-on-demand, but that would be an optimization not a bug fix, > and I'm skeptical that it'd be an improvement anyway. Makes sense. > Probably this is something to revisit whenever somebody gets around to > addressing the whole copy-vs-dont-copy-vs-use-a-refcount business that > we were handwaving about upthread. OK. >>> I also cleaned up the problem the code had with failing to distinguish >>> "partcheck list is NIL" from "partcheck list hasn't been computed yet". >>> For a relation with no such constraints, generate_partition_qual would do >>> the full pushups every time. > >> Actually, callers must have checked that the table is a partition >> (relispartition). > > That does not mean that it's guaranteed to have any partcheck constraints; > there are counterexamples in the regression tests. It looks like the main > case is a LIST-partitioned table that has only a default partition. Ah, yes. Actually, even a RANGE default partition that's the only partition of its parent has NIL partition constraint. Thanks, Amit
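For concreteness, the case in question can be produced like this (table names are made up):

    create table p (a int) partition by list (a);
    create table p_default partition of p default;
    -- p_default has relispartition set, but with no sibling partitions its
    -- partition constraint is empty, so its cached partcheck list is NIL.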
On 2019/04/15 2:38, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> On 2019/04/10 15:42, Michael Paquier wrote: >>> It seems to me that Tom's argument to push in the way relcache >>> information is handled by copying its contents sounds sensible to me. >>> That's not perfect, but it is consistent with what exists (note >>> PublicationActions for a rather-still-not-much-complex example of >>> structure which gets copied from the relcache). > >> I gave my vote for direct access of unchangeable relcache substructures >> (TupleDesc, PartitionDesc, etc.), because they're accessed during the >> planning of every user query and fairly expensive to copy compared to list >> of indexes or PublicationActions that you're citing. To further my point >> a bit, I wonder why PublicationActions needs to be copied out of relcache >> at all? Looking at its usage in execReplication.c, it seems we can do >> fine without copying, because we are holding both a lock and a relcache >> reference on the replication target relation during the entirety of >> apply_handle_insert(), which means that information can't change under us, >> neither logically nor physically. > > So the point here is that that reasoning is faulty. You *cannot* assume, > no matter how strong a lock or how many pins you hold, that a relcache > entry will not get rebuilt underneath you. Cache flushes happen > regardless. And unless relcache.c takes special measures to prevent it, > a rebuild will result in moving subsidiary data structures and thereby > breaking any pointers you may have pointing into those data structures. > > For certain subsidiary structures such as the relation tupdesc, > we do take such special measures: that's what the "keep_xxx" dance in > RelationClearRelation is. However, that's expensive, both in cycles > and maintenance effort: it requires having code that can decide equality > of the subsidiary data structures, which we might well have no other use > for, and which we certainly don't have strong tests for correctness of. > It's also very error-prone for callers, because there isn't any good way > to cross-check that code using a long-lived pointer to a subsidiary > structure is holding a lock that's strong enough to guarantee non-mutation > of that structure, or even that relcache.c provides any such guarantee > at all. (If our periodic attempts to reduce lock strength for assorted > DDL operations don't scare the pants off you in this connection, you have > not thought hard enough about it.) So I think that even though we've > largely gotten away with this approach so far, it's also a half-baked > kluge that we should be looking to get rid of, not extend to yet more > cases. Thanks for the explanation. I understand that simply having a lock and a nonzero refcount on a relation doesn't prevent someone else from changing it concurrently. I get that we want to get rid of the keep_* kludge in the long term, but is it wrong to think, for example, that having keep_partdesc today allows us today to keep the pointer to rd_partdesc as long as we're holding the relation open or refcnt on the whole relation such as with PartitionDirectory mechanism? > To my mind there are only two trustworthy solutions to the problem of > wanting time-extended usage of a relcache subsidiary data structure: one > is to copy it, and the other is to reference-count it. 
I think that going > over to a reference-count-based approach for many of these structures > might well be something we should do in future, maybe even the very near > future. In the mean time, though, I'm not really satisfied with inserting > half-baked kluges, especially not ones that are different from our other > half-baked kluges for similar purposes. I think that's a path to creating > hard-to-reproduce bugs. +1 to reference-count-based approach. I've occasionally wondered why there is both keep_tupdesc kludge and refcounting for table TupleDescs. I guess it's because *only* the TupleTableSlot mechanism in the executor uses TupleDesc pinning (that is, refcounting) and the rest of the sites depend on the shaky guarantee provided by keep_tupdesc. Thanks, Amit
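For readers not familiar with the mechanism being referred to, the existing TupleDesc pinning pattern looks roughly like this in use (the surrounding code is invented for illustration):

    TupleDesc   tupdesc = RelationGetDescr(rel);

    /* Pin the descriptor if we intend to keep the pointer across possible
     * relcache rebuilds; the refcount keeps it alive until released. */
    PinTupleDesc(tupdesc);

    /* ... use tupdesc over an extended period ... */

    ReleaseTupleDesc(tupdesc);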
On Mon, Apr 15, 2019 at 5:05 PM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2019/04/15 2:38, Tom Lane wrote: > > So the point here is that that reasoning is faulty. You *cannot* assume, > > no matter how strong a lock or how many pins you hold, that a relcache > > entry will not get rebuilt underneath you. Cache flushes happen > > regardless. And unless relcache.c takes special measures to prevent it, > > a rebuild will result in moving subsidiary data structures and thereby > > breaking any pointers you may have pointing into those data structures. > > > > For certain subsidiary structures such as the relation tupdesc, > > we do take such special measures: that's what the "keep_xxx" dance in > > RelationClearRelation is. However, that's expensive, both in cycles > > and maintenance effort: it requires having code that can decide equality > > of the subsidiary data structures, which we might well have no other use > > for, and which we certainly don't have strong tests for correctness of. > > It's also very error-prone for callers, because there isn't any good way > > to cross-check that code using a long-lived pointer to a subsidiary > > structure is holding a lock that's strong enough to guarantee non-mutation > > of that structure, or even that relcache.c provides any such guarantee > > at all. (If our periodic attempts to reduce lock strength for assorted > > DDL operations don't scare the pants off you in this connection, you have > > not thought hard enough about it.) So I think that even though we've > > largely gotten away with this approach so far, it's also a half-baked > > kluge that we should be looking to get rid of, not extend to yet more > > cases. > > Thanks for the explanation. > > I understand that simply having a lock and a nonzero refcount on a > relation doesn't prevent someone else from changing it concurrently. > > I get that we want to get rid of the keep_* kludge in the long term, but > is it wrong to think, for example, that having keep_partdesc today allows > us today to keep the pointer to rd_partdesc as long as we're holding the > relation open or refcnt on the whole relation such as with > PartitionDirectory mechanism? Ah, we're also trying to fix the memory leak caused by the current design of PartitionDirectory. AIUI, the design assumes that the leak would occur in fairly rare cases, but maybe not so? If partitions are frequently attached/detached concurrently (maybe won't be too uncommon if reduced lock levels encourages users) causing the PartitionDesc of a given relation changing all the time, then a planning session that's holding the PartitionDirectory containing that relation would leak as many PartitionDescs as there were concurrent changes, I guess. I see that you've proposed to change the PartitionDirectory design to copy PartitionDesc as way of keeping it around instead holding the relation open, but having to resort to that would be unfortunate. Thanks, Amit
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2019/04/15 2:38, Tom Lane wrote: >> To my mind there are only two trustworthy solutions to the problem of >> wanting time-extended usage of a relcache subsidiary data structure: one >> is to copy it, and the other is to reference-count it. I think that going >> over to a reference-count-based approach for many of these structures >> might well be something we should do in future, maybe even the very near >> future. In the mean time, though, I'm not really satisfied with inserting >> half-baked kluges, especially not ones that are different from our other >> half-baked kluges for similar purposes. I think that's a path to creating >> hard-to-reproduce bugs. > +1 to reference-count-based approach. > I've occasionally wondered why there is both keep_tupdesc kludge and > refcounting for table TupleDescs. I guess it's because *only* the > TupleTableSlot mechanism in the executor uses TupleDesc pinning (that is, > refcounting) and the rest of the sites depend on the shaky guarantee > provided by keep_tupdesc. The reason for that is simply that at the time we added TupleDesc refcounts, we didn't want to do the extra work of making all uses of relcache entries' tupdescs deal with refcounting; keep_tupdesc is certainly a kluge, but it works for an awful lot of callers. We'd have to go back and deal with that more honestly if we go down this path. I'm inclined to think we could still allow many call sites to not do an incr/decr-refcount dance as long as they're just fetching scalar values out of the relcache's tupdesc, and not keeping any pointer into it. However, it's hard to see how to enforce such a rule mechanically, so it might be impractically error-prone to allow that. regards, tom lane
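To spell out the distinction being drawn here, compare these two invented fragments:

    /* Likely safe without a refcount: read a scalar and keep no pointer */
    int         natts = RelationGetDescr(rel)->natts;

    /* Not safe without pinning or copying: retain a pointer into the tupdesc */
    Form_pg_attribute attr = TupleDescAttr(RelationGetDescr(rel), 0);
    /* if a cache flush rebuilds rel's entry and the old tupdesc is freed,
     * attr is now a dangling pointer */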
Amit Langote <amitlangote09@gmail.com> writes: >> I get that we want to get rid of the keep_* kludge in the long term, but >> is it wrong to think, for example, that having keep_partdesc today allows >> us today to keep the pointer to rd_partdesc as long as we're holding the >> relation open or refcnt on the whole relation such as with >> PartitionDirectory mechanism? Well, it's safe from the caller's standpoint as long as a suitable lock is being held, which is neither well-defined nor enforced in any way :-( > Ah, we're also trying to fix the memory leak caused by the current > design of PartitionDirectory. AIUI, the design assumes that the leak > would occur in fairly rare cases, but maybe not so? If partitions are > frequently attached/detached concurrently (maybe won't be too uncommon > if reduced lock levels encourages users) causing the PartitionDesc of > a given relation changing all the time, then a planning session that's > holding the PartitionDirectory containing that relation would leak as > many PartitionDescs as there were concurrent changes, I guess. We should get a relcache inval after a partdesc change, but the problem with the current code is that that will only result in freeing the old partdesc if the inval event is processed while the relcache entry has refcount zero. Otherwise the old rd_pdcxt is just shoved onto the context chain, where it could survive indefinitely. I'm not sure that this is really a huge problem in practice. The example I gave upthread shows that a partdesc-changing transaction's own internal invals do arrive during CommandCounterIncrement calls that occur while the relcache pin is held; but it seems a bit artificial to assume that one transaction would do a huge number of such changes. (Although, hm, maybe a single-transaction pg_restore run could have an issue.) Once out of the transaction, it's okay because we'll again invalidate the entry at the start of the next transaction, and then the refcount will be zero and we'll clean up. For other sessions it'd only happen if they saw the inval while already holding a pin on the partitioned table, which probably requires some unlucky timing; and that'd have to happen repeatedly to have a leak that amounts to anything. Still, though, I'm unhappy with the code as it stands. It's risky to assume that it has no unpleasant behaviors that we haven't spotted yet but will manifest after v12 is in the field. And I do not think that it represents a solid base to build on. (As an example, if we made any effort to get rid of the redundant extra inval events that occur post-transaction, we'd suddenly have a much worse problem here.) I'd rather go over to the copy-based solution for now, which *is* semantically sound, and accept that we still have more performance work to do. It's not like v12 isn't going to be light-years ahead of v11 in this area anyway. regards, tom lane
On Sun, Apr 14, 2019 at 3:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What I get for test cases like [1] is
>
> single-partition SELECT, hash partitioning:
>
>     N      tps, HEAD       tps, patch
>     2      11426.243754    11448.615193
>     8      11254.833267    11374.278861
>     32     11288.329114    11371.942425
>     128    11222.329256    11185.845258
>     512    11001.177137    10572.917288
>     1024   10612.456470     9834.172965
>     4096    8819.110195     7021.864625
>     8192    7372.611355     5276.130161
>
> single-partition SELECT, range partitioning:
>
>     N      tps, HEAD       tps, patch
>     2      11037.855338    11153.595860
>     8      11085.218022    11019.132341
>     32     10994.348207    10935.719951
>     128    10884.417324    10532.685237
>     512    10635.583411     9578.108915
>     1024   10407.286414     8689.585136
>     4096    8361.463829     5139.084405
>     8192    7075.880701     3442.542768

I have difficulty interpreting these results in any way other than as an endorsement of the approach that I took.  It seems like you're proposing to throw away what is really a pretty substantial amount of performance basically so that the code will look more like you think it should look.  But I dispute the idea that the current code is so bad that we need to do this.  I don't think that's the case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2019/04/15 4:29, Tom Lane wrote:
> I think that what we ought to do for v12 is have PartitionDirectory copy the data, and then in v13 work on creating real reference-count infrastructure that would allow eliminating the copy steps with full safety.  The $64 question is whether that really would cause unacceptable performance problems.  To look into that, I made the attached WIP patches.  (These are functionally complete, but I didn't bother for instance with removing the hunk that 898e5e329 added to relcache.c, and the comments need work, etc.)  The first one just changes the PartitionDirectory code to do that, and then the second one micro-optimizes partition_bounds_copy() to make it somewhat less expensive, mostly by collapsing lots of small palloc's into one big one.

Thanks for the patches.  The partition_bounds_copy() micro-optimization one looks good in any case.

> What I get for test cases like [1] is
>
> single-partition SELECT, hash partitioning:
>
>     N      tps, HEAD       tps, patch
>     2      11426.243754    11448.615193
>     8      11254.833267    11374.278861
>     32     11288.329114    11371.942425
>     128    11222.329256    11185.845258
>     512    11001.177137    10572.917288
>     1024   10612.456470     9834.172965
>     4096    8819.110195     7021.864625
>     8192    7372.611355     5276.130161
>
> single-partition SELECT, range partitioning:
>
>     N      tps, HEAD       tps, patch
>     2      11037.855338    11153.595860
>     8      11085.218022    11019.132341
>     32     10994.348207    10935.719951
>     128    10884.417324    10532.685237
>     512    10635.583411     9578.108915
>     1024   10407.286414     8689.585136
>     4096    8361.463829     5139.084405
>     8192    7075.880701     3442.542768
>
> Now certainly these numbers suggest that avoiding the copy could be worth our trouble, but these results are still several orders of magnitude better than where we were two weeks ago [2].  Plus, this is an extreme case that's not really representative of real-world usage, since the test tables have neither indexes nor any data.

I tested the copyPartitionDesc() patch and here are the results for single-partition SELECT using hash partitioning, where index on queries column, and N * 1000 rows inserted into the parent table before the test.  I've confirmed that the plan is always an Index Scan on the selected partition (in PG 11, it's under Append, but in HEAD there's no Append due to 8edd0e794).

    N      tps, HEAD      tps, patch     tps, PG 11
    2      3093.443043    3039.804101    2928.777570
    8      3024.545820    3064.333027    2372.738622
    32     3029.580531    3032.755266    1417.706212
    128    3019.359793    3032.726006     567.099745
    512    2948.639216    2986.987862      98.710664
    1024   2971.629939    2882.233026      41.720955
    4096   2680.703000    1937.988908       7.035816
    8192   2599.120308    2069.271274       3.635512

So, the TPS degrades by 14% when going from 128 partitions to 8192 partitions on HEAD, whereas it degrades by 31% with the patch.

Here are the numbers with no indexes defined on the tables, and no data inserted.

    N      tps, HEAD      tps, patch     tps, PG 11
    2      3498.247862    3463.695950    3110.314290
    8      3524.430780    3445.165206    2741.340770
    32     3476.427781    3427.400879    1645.602269
    128    3427.121901    3430.385433     651.586373
    512    3394.907072    3335.842183     182.201349
    1024   3454.050819    3274.266762      67.942075
    4096   3201.266380    2845.974556      12.320716
    8192   2955.850804    2413.723443       6.151703

Here, the TPS degrades by 13% when going from 128 partitions to 8192 partitions on HEAD, whereas it degrades by 29% with the patch.  So, the degradation caused by copying the bounds is almost the same in both cases.
Actually, even in the more realistic test with indexes and data, executing the plan is relatively faster than planning as the partition count grows, because the PartitionBoundInfo that the planner now copies grows bigger. Thanks, Amit
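As an aside on the second WIP patch quoted above: the "collapse lots of small palloc's into one big one" idea looks roughly like the following generic sketch (names here are illustrative; this is not the actual partition_bounds_copy() change, and by-reference datums would additionally need datumCopy()):

    #include "postgres.h"

    /*
     * Generic sketch of collapsing per-row allocations into one block:
     * instead of palloc'ing each row's Datum array separately, allocate
     * a single flat array and point each row into it.  Assumes by-value
     * datums for brevity.
     */
    static Datum **
    copy_bound_datums(Datum **src, int ndatums, int natts)
    {
        Datum     **dst = (Datum **) palloc(ndatums * sizeof(Datum *));
        Datum      *flat = (Datum *) palloc(ndatums * natts * sizeof(Datum));

        for (int i = 0; i < ndatums; i++)
        {
            dst[i] = flat + i * natts;
            memcpy(dst[i], src[i], natts * sizeof(Datum));
        }
        return dst;
    }

Fewer, larger allocations reduce both palloc overhead and per-chunk memory-context bookkeeping, which is where the copy cost shows up at high partition counts.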
On 2019/04/17 18:58, Amit Langote wrote: > where index on queries > column Oops, I meant "with an index on the queried column". Thanks, Amit
Hi,

The message I'm replying to is marked as an open item.  Robert, what do you think needs to be done here before release?  Could you summarize, so we can then see what others think?

Greetings,

Andres Freund
On Wed, May 1, 2019 at 11:59 AM Andres Freund <andres@anarazel.de> wrote:
> The message I'm replying to is marked as an open item.  Robert, what do you think needs to be done here before release?  Could you summarize, so we can then see what others think?

The original issue on this thread was that hyrax started running out of memory when it hadn't been doing so previously.  That happened because, for complicated reasons, commit 898e5e3290a72d288923260143930fb32036c00c resulted in the leak being hit lots of times in CLOBBER_CACHE_ALWAYS builds instead of just once.  Commits 2455ab48844c90419714e27eafd235a85de23232 and d3f48dfae42f9655425d1f58f396e495c7fb7812 fixed that problem.

In the email at issue, Tom complains about two things.  First, he complains about the fact that I try to arrange things so that relcache data remains valid for as long as required instead of just copying it.  Second, he complains about the fact that repeated ATTACH and DETACH PARTITION operations can produce a slow session-lifespan memory leak.

I think it's reasonable to fix the second issue for v12.  I am not sure how important it is, because (1) the leak is small, (2) it seems unlikely that anyone would execute enough ATTACH/DETACH PARTITION operations in one backend for the leak to amount to anything significant, and (3) if a relcache flush ever happens when you don't have the relation open, all of the "leaked" memory will be un-leaked.  However, I believe that we could fix it as follows.  First, invent rd_pdoldcxt and put the first old context there; if that pointer is already in use, then parent the new context under the old one.  Second, in RelationDecrementReferenceCount, if the refcount hits 0, nuke rd_pdoldcxt and set the pointer back to NULL.  With that change, you would only keep the old PartitionDesc around until the ref count hits 0, whereas at present, you keep the old PartitionDesc around until an invalidation happens while the ref count is 0.

I think the first issue is not v12 material.  Tom proposed to fix it by copying all the relevant data out of the relcache, but his own performance results show a pretty significant hit, and AFAICS he hasn't pointed to anything that's actually broken by the current approach.  What I think should be done is actually to generalize the approach I took in this patch, so that instead of the partition directory holding a reference count, the planner or executor holds one.  Then not only could people who want information about the PartitionDesc avoid copying stuff from the relcache, but maybe other things could as well.  I think that would be significantly better than continuing to double down on the current copy-everything approach, which really only works well in a world where a table can't have all that much metadata, which is clearly not true for PostgreSQL any more.  I'm not sure that Tom is actually opposed to this approach -- although I may have misunderstood his position -- but where we disagree, I think, is that I see what I did in this commit as a stepping-stone toward a better world, and he sees it as something that should be killed with fire until that better world has fully arrived.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
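A minimal sketch of the rd_pdoldcxt fix Robert outlines above, assuming the new RelationData field as proposed (this is an illustration of the idea, not the patch that was later posted):

    #include "postgres.h"
    #include "utils/memutils.h"
    #include "utils/rel.h"

    /*
     * Sketch of the proposal: when a rebuild produces a new PartitionDesc
     * while the entry is still referenced, park the old context under
     * rd_pdoldcxt (chaining further stale ones beneath it), and free the
     * whole chain once the reference count reaches zero.  rd_pdoldcxt is
     * the proposed field, not one that exists in RelationData today.
     */
    static void
    retire_old_partdesc_context(Relation relation, MemoryContext oldcxt)
    {
        if (relation->rd_pdoldcxt == NULL)
            relation->rd_pdoldcxt = oldcxt;         /* first stale context */
        else
            MemoryContextSetParent(oldcxt, relation->rd_pdoldcxt);
    }

    /* ...and in RelationDecrementReferenceCount(), roughly: */
    static void
    maybe_free_old_partdesc(Relation relation)
    {
        if (relation->rd_refcnt == 0 && relation->rd_pdoldcxt != NULL)
        {
            MemoryContextDelete(relation->rd_pdoldcxt);
            relation->rd_pdoldcxt = NULL;
        }
    }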
Hi, On 2019-05-01 13:09:07 -0400, Robert Haas wrote: > The original issue on this thread was that hyrax started running out > of memory when it hadn't been doing so previously. That happened > because, for complicated reasons, commit > 898e5e3290a72d288923260143930fb32036c00c resulted in the leak being > hit lots of times in CLOBBER_CACHE_ALWAYS builds instead of just once. > Commits 2455ab48844c90419714e27eafd235a85de23232 and > d3f48dfae42f9655425d1f58f396e495c7fb7812 fixed that problem. > > In the email at issue, Tom complains about two things. First, he > complains about the fact that I try to arrange things so that relcache > data remains valid for as long as required instead of just copying it. > Second, he complains about the fact repeated ATTACH and DETACH > PARTITION operations can produce a slow session-lifespan memory leak. > > I think it's reasonable to fix the second issue for v12. I am not > sure how important it is, because (1) the leak is small, (2) it seems > unlikely that anyone would execute enough ATTACH/DETACH PARTITION > operations in one backend for the leak to amount to anything > significant, and (3) if a relcache flush ever happens when you don't > have the relation open, all of the "leaked" memory will be un-leaked. > However, I believe that we could fix it as follows. First, invent > rd_pdoldcxt and put the first old context there; if that pointer is > already in use, then parent the new context under the old one. > Second, in RelationDecrementReferenceCount, if the refcount hits 0, > nuke rd_pdoldcxt and set the pointer back to NULL. With that change, > you would only keep the old PartitionDesc around until the ref count > hits 0, whereas at present, you keep the old PartitionDesc around > until an invalidation happens while the ref count is 0. That sounds roughly reasonably. Tom, any objections? I think it'd be good if we fixed this soon. > I think the first issue is not v12 material. Tom proposed to fix it > by copying all the relevant data out of the relcache, but his own > performance results show a pretty significant hit, Yea, the numbers didn't look very convincing. > and AFAICS he > hasn't pointed to anything that's actually broken by the current > approach. What I think should be done is actually generalize the > approach I took in this patch, so that instead of the partition > directory holding a reference count, the planner or executor holds > one. Then not only could people who want information about the > PartitionDesc avoid copying stuff from the relcache, but maybe other > things as well. I think that would be significantly better than > continuing to double-down on the current copy-everything approach, > which really only works well in a world where a table can't have all > that much metadata, which is clearly not true for PostgreSQL any more. > I'm not sure that Tom is actually opposed to this approach -- although > I may have misunderstood his position -- but where we disagree, I > think, is that I see what I did in this commit as a stepping-stone > toward a better world, and he sees it as something that should be > killed with fire until that better world has fully arrived. Tom, are you ok with deferring further work here for v13? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-05-01 13:09:07 -0400, Robert Haas wrote: >> In the email at issue, Tom complains about two things. First, he >> complains about the fact that I try to arrange things so that relcache >> data remains valid for as long as required instead of just copying it. >> Second, he complains about the fact repeated ATTACH and DETACH >> PARTITION operations can produce a slow session-lifespan memory leak. >> >> I think it's reasonable to fix the second issue for v12. I am not >> sure how important it is, because (1) the leak is small, (2) it seems >> unlikely that anyone would execute enough ATTACH/DETACH PARTITION >> operations in one backend for the leak to amount to anything >> significant, and (3) if a relcache flush ever happens when you don't >> have the relation open, all of the "leaked" memory will be un-leaked. Yeah, I did some additional testing that showed that it's pretty darn hard to get the leak to amount to anything. The test case that I originally posted did many DDLs in a single transaction, and it seems that that's actually essential to get a meaningful leak; as soon as you exit the transaction the leaked contexts will be recovered during sinval cleanup. >> However, I believe that we could fix it as follows. First, invent >> rd_pdoldcxt and put the first old context there; if that pointer is >> already in use, then parent the new context under the old one. >> Second, in RelationDecrementReferenceCount, if the refcount hits 0, >> nuke rd_pdoldcxt and set the pointer back to NULL. With that change, >> you would only keep the old PartitionDesc around until the ref count >> hits 0, whereas at present, you keep the old PartitionDesc around >> until an invalidation happens while the ref count is 0. > That sounds roughly reasonably. Tom, any objections? I think it'd be > good if we fixed this soon. My fundamental objection is that this kluge is ugly as sin. Adding more ugliness on top of it doesn't improve that; rather the opposite. > Tom, are you ok with deferring further work here for v13? Yeah, I think that that's really what we ought to do at this point. I'd like to see a new design here, but it's not happening for v12, and I don't want to add even more messiness that's predicated on a wrong design. regards, tom lane
On Fri, May 17, 2019 at 4:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah, I did some additional testing that showed that it's pretty darn > hard to get the leak to amount to anything. The test case that I > originally posted did many DDLs in a single transaction, and it > seems that that's actually essential to get a meaningful leak; as > soon as you exit the transaction the leaked contexts will be recovered > during sinval cleanup. My colleague Amul Sul rediscovered this same leak when he tried to attach lots of partitions to an existing partitioned table, all in the course of a single transaction. This seems a little less artificial than Tom's original reproducer, which involved attaching and detaching the same partition repeatedly. Here is a patch that tries to fix this, along the lines I previously suggested; Amul reports that it does work for him. I am OK to hold this for v13 if that's what people want, but I think it might be smarter to commit it to v12. Maybe it's not a big leak, but it seems easy enough to do better, so I think we should. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Tue, Jun 4, 2019 at 9:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 17, 2019 at 4:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Yeah, I did some additional testing that showed that it's pretty darn hard to get the leak to amount to anything.  The test case that I originally posted did many DDLs in a single transaction, and it seems that that's actually essential to get a meaningful leak; as soon as you exit the transaction the leaked contexts will be recovered during sinval cleanup.
>
> My colleague Amul Sul rediscovered this same leak when he tried to attach lots of partitions to an existing partitioned table, all in the course of a single transaction.  This seems a little less artificial than Tom's original reproducer, which involved attaching and detaching the same partition repeatedly.
>
> Here is a patch that tries to fix this, along the lines I previously suggested; Amul reports that it does work for him.

Thanks for the patch.

I noticed a crash with one of the scenarios that I think the patch is meant to address.  Let me describe the steps:

Attach gdb (or whatever) to session 1, in which we'll run a query that will scan at least two partitions, and set a breakpoint in expand_partitioned_rtentry().  Run the query, so the breakpoint will be hit.  Step through up to the start of the following loop in this function:

    i = -1;
    while ((i = bms_next_member(live_parts, i)) >= 0)
    {
        Oid         childOID = partdesc->oids[i];
        Relation    childrel;
        RangeTblEntry *childrte;
        Index       childRTindex;
        RelOptInfo *childrelinfo;

        /* Open rel, acquiring required locks */
        childrel = table_open(childOID, lockmode);

Note that 'partdesc' in the above code is from the partition directory.

Before stepping into the loop, start another session and attach a new partition.  On into the loop.  When the 1st partition is opened, its locking will result in RelationClearRelation() being called on the parent relation due to the partition being attached concurrently; this, I observed, is actually invoked a couple of times due to recursion.  The parent relation's rd_pdoldcxt will be set in this process, and it contains the aforementioned partition descriptor.

Before the loop moves on to process the 2nd partition, attach another partition in session 2.  When the 2nd partition is opened, RelationClearRelation() will run again and, in one of its recursive invocations, it destroys the rd_pdoldcxt that was set previously, taking the partition directory's partition descriptor with it.  Anything that tries to access it later crashes.

I couldn't figure out what to blame here, though -- the design of rd_pdoldcxt or the fact that RelationClearRelation() is invoked multiple times.  I will try to study it more closely tomorrow.

Thanks,
Amit
On Wed, Jun 5, 2019 at 8:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
> I noticed a crash with one of the scenarios that I think the patch is meant to address.  Let me describe the steps:
>
> Attach gdb (or whatever) to session 1, in which we'll run a query that will scan at least two partitions, and set a breakpoint in expand_partitioned_rtentry().  Run the query, so the breakpoint will be hit.  Step through up to the start of the following loop in this function:
>
>     i = -1;
>     while ((i = bms_next_member(live_parts, i)) >= 0)
>     {
>         Oid         childOID = partdesc->oids[i];
>         Relation    childrel;
>         RangeTblEntry *childrte;
>         Index       childRTindex;
>         RelOptInfo *childrelinfo;
>
>         /* Open rel, acquiring required locks */
>         childrel = table_open(childOID, lockmode);
>
> Note that 'partdesc' in the above code is from the partition directory.
>
> Before stepping into the loop, start another session and attach a new partition.  On into the loop.  When the 1st partition is opened, its locking will result in RelationClearRelation() being called on the parent relation due to the partition being attached concurrently; this, I observed, is actually invoked a couple of times due to recursion.  The parent relation's rd_pdoldcxt will be set in this process, and it contains the aforementioned partition descriptor.
>
> Before the loop moves on to process the 2nd partition, attach another partition in session 2.  When the 2nd partition is opened, RelationClearRelation() will run again and, in one of its recursive invocations, it destroys the rd_pdoldcxt that was set previously, taking the partition directory's partition descriptor with it.  Anything that tries to access it later crashes.
>
> I couldn't figure out what to blame here, though -- the design of rd_pdoldcxt or the fact that RelationClearRelation() is invoked multiple times.  I will try to study it more closely tomorrow.

On further study, I concluded that it's indeed the multiple invocations of RelationClearRelation() on the parent relation that cause rd_pdoldcxt to be destroyed prematurely.  While it's problematic that the session in which a new partition is attached to the parent relation broadcasts 2 SI messages to invalidate the parent relation [1], it's obviously better to fix how RelationClearRelation() manipulates rd_pdoldcxt so that duplicate SI messages are not harmful; fixing the latter is hardly an option.

Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch, which seems to fix this issue for me.  In the rare case where inval messages due to multiple concurrent attachments get processed in a session holding a reference on a partitioned table, there will be multiple "old" partition descriptors and corresponding "old" contexts.  All of the old contexts are chained together via context re-parenting, with the newest "old" context being accessible from the table's rd_pdoldcxt.

Thanks,
Amit

[1]: inval.c's AddRelcacheInvalidationMessage() does try to prevent duplicate messages from being emitted, but the logic to detect duplicates doesn't work across CommandCounterIncrement().  There are at least two relcache inval requests in the ATTACH PARTITION code path, emitted by SetRelationHasSubclass and StorePartitionBound, respectively, which are separated by at least one CCI, so the 2nd request isn't detected as a duplicate of the 1st.
Attachment
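Continuing the sketch shown after Robert's May 1 message (same assumptions, including the proposed rd_pdoldcxt field), the chaining Amit describes would put the newest stale context at rd_pdoldcxt and hang any previously retired contexts underneath it, so that a duplicate invalidation cannot free memory that is still reachable:

    /*
     * Illustration of the reworked chaining, not the attached patch:
     * the newest stale context is always the one reachable from
     * rd_pdoldcxt, and older retired contexts hang underneath it.
     */
    static void
    retire_old_partdesc_context(Relation relation, MemoryContext oldcxt)
    {
        if (relation->rd_pdoldcxt != NULL)
            MemoryContextSetParent(relation->rd_pdoldcxt, oldcxt);
        relation->rd_pdoldcxt = oldcxt;

        /* as before, the whole chain is deleted once rd_refcnt drops to 0 */
    }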
On Thu, Jun 6, 2019 at 3:47 PM Amit Langote <amitlangote09@gmail.com> wrote: > While it's > problematic that the session in which a new partition is attached to > the parent relation broadcasts 2 SI messages to invalidate the parent > relation [1], it's obviously better to fix how RelationClearRelation > manipulates rd_pdoldcxt so that duplicate SI messages are not harmful, > fixing the latter is hardly an option. Sorry, I had meant to say "fixing the former/the other is hardly an option". Thanks, Amit
On Thu, Jun 6, 2019 at 2:48 AM Amit Langote <amitlangote09@gmail.com> wrote: > Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch, > which seems to fix this issue for me. Yeah, that looks right. I think my patch was full of fuzzy thinking and inadequate testing; thanks for checking it over and coming up with the right solution. Anyone else want to look/comment? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 6, 2019 at 2:48 AM Amit Langote <amitlangote09@gmail.com> wrote: >> Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch, >> which seems to fix this issue for me. > Yeah, that looks right. I think my patch was full of fuzzy thinking > and inadequate testing; thanks for checking it over and coming up with > the right solution. > Anyone else want to look/comment? I think the existing code is horribly ugly and this is even worse. It adds cycles to RelationDecrementReferenceCount which is a hotspot that has no business dealing with this; the invariants are unclear; and there's no strong reason to think there aren't still cases where we accumulate lots of copies of old partition descriptors during a sequence of operations. Basically you're just doubling down on a wrong design. As I said upthread, my current inclination is to do nothing in this area for v12 and then try to replace the whole thing with proper reference counting in v13. I think the cases where we have a major leak are corner-case-ish enough that we can leave it as-is for one release. regards, tom lane
On Tue, Jun 11, 2019 at 1:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think the existing code is horribly ugly and this is even worse.  It adds cycles to RelationDecrementReferenceCount which is a hotspot that has no business dealing with this; the invariants are unclear; and there's no strong reason to think there aren't still cases where we accumulate lots of copies of old partition descriptors during a sequence of operations.  Basically you're just doubling down on a wrong design.

I don't understand why a function that decrements a reference count shouldn't be expected to free things if the reference count goes to zero.  Under what workload do you think adding this to RelationDecrementReferenceCount would cause a noticeable performance regression?

I think the change is responsive to your previous complaint that the timing of stuff getting freed is not very well pinned down.  With this change, it's much more tightly pinned down: it happens when the refcount goes to 0.  That is definitely not perfect, but I think that it is a lot easier to come up with scenarios where the leak accumulates because no cache flush happens while the refcount is 0 than it is to come up with scenarios where the refcount never reaches 0.  I agree that the latter type of scenario probably exists, but I don't think we've come up with one yet.

> As I said upthread, my current inclination is to do nothing in this area for v12 and then try to replace the whole thing with proper reference counting in v13.  I think the cases where we have a major leak are corner-case-ish enough that we can leave it as-is for one release.

Is this something you're planning to work on yourself?  Do you have a design in mind?  Is the idea to reference-count the PartitionDesc?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jun 11, 2019 at 1:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think the change is responsive to your previous complaint that the timing of stuff getting freed is not very well pinned down.  With this change, it's much more tightly pinned down: it happens when the refcount goes to 0.  That is definitely not perfect, but I think that it is a lot easier to come up with scenarios where the leak accumulates because no cache flush happens while the refcount is 0 than it is to come up with scenarios where the refcount never reaches 0.  I agree that the latter type of scenario probably exists, but I don't think we've come up with one yet.

I don't know why you think that's improbable, given that the changes around PartitionDirectory-s cause relcache entries to be held open much longer than before (something I've also objected to on this thread).

>> As I said upthread, my current inclination is to do nothing in this area for v12 and then try to replace the whole thing with proper reference counting in v13.  I think the cases where we have a major leak are corner-case-ish enough that we can leave it as-is for one release.

> Is this something you're planning to work on yourself?

Well, I'd rather farm it out to somebody else, but ...

> Do you have a design in mind?  Is the idea to reference-count the PartitionDesc?

What we discussed upthread was refcounting each of the various large sub-objects of relcache entries, not just the partdesc.  I think if we're going to go this way we should bite the bullet and fix them all.  I really want to burn down RememberToFreeTupleDescAtEOX() in particular ... it seems likely to me that that's also a source of unpleasant memory leaks.

regards, tom lane
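For a sense of what that longer-term direction might look like, here is a purely hypothetical sketch modeled on the existing TupleDesc refcounting; none of these function names exist today, and they are shown only to illustrate the shape of a pin/release API for PartitionDesc and the other large relcache sub-objects:

    #include "postgres.h"
    #include "partitioning/partdesc.h"
    #include "utils/rel.h"

    /*
     * Hypothetical API -- these functions do not exist.  The idea mirrors
     * the TupleDesc pin/release protocol: callers pin the sub-object, the
     * resource owner releases pins on error, and the memory is freed when
     * the last pin goes away, removing the need for both copying and the
     * "keep_*" kludges.
     */
    extern PartitionDesc PinPartitionDesc(Relation rel);
    extern void ReleasePartitionDesc(PartitionDesc partdesc);

    /* the usage pattern a planner/executor call site might follow */
    static void
    example_caller(Relation rel)
    {
        PartitionDesc partdesc = PinPartitionDesc(rel);

        /* ... use partdesc->nparts, partdesc->oids[], etc. ... */

        ReleasePartitionDesc(partdesc);
    }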