Thread: hyrax vs. RelationBuildPartitionDesc
Hi, Amit Kapila pointed out to me that there are some buildfarm failures on hyrax which seem to have started happening around the time I committed 898e5e3290a72d288923260143930fb32036c00c. It failed like this once:

2019-03-07 19:57:40.231 EST [28073:11] DETAIL: Failed process was running: /* same as above */ explain (costs off) select * from rlp where a = 1::numeric;

And then the next three runs failed like this:

2019-03-09 04:56:04.939 EST [1727:4] LOG: server process (PID 2898) was terminated by signal 9: Killed
2019-03-09 04:56:04.939 EST [1727:5] DETAIL: Failed process was running: UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end ) WHERE a = 'b' and b > 10 and c >= 96;

I couldn't think of an explanation for this other than that the process involved had used a lot of memory and gotten killed by the OOM killer, and it turns out that RelationBuildPartitionDesc leaks memory into the surrounding context every time it's called. It's not a lot of memory, but in a CLOBBER_CACHE_ALWAYS build it adds up, because this function gets called a lot. All of this is true even before the commit in question, but for some reason that I don't understand that commit makes it worse. I tried having that function use a temporary context for its scratch space and that causes a massive drop in memory usage when running update.sql from the regression tests under CLOBBER_CACHE_ALWAYS. I ran MacOS X's vmmap tool to see the impact, measuring the size of the "DefaultMallocZone". Without this patch, that peaks at >450MB; with this patch it peaks at ~270MB. There is a significant reduction in typical memory usage, too. It's noticeably better with this patch than it was before 898e5e3290a72d288923260143930fb32036c00c.

I'm not sure whether this is the right way to address the problem. RelationBuildPartitionDesc() creates basically all of the data structures it needs and then copies them into rel->rd_pdcxt, which has always seemed a bit inefficient to me. Another way to redesign this would be to have the function create a temporary context, do all of its work there, and then reparent the context under CacheMemoryContext at the end. That means any leaks would go into a relatively long-lifespan context, but on the other hand you wouldn't leak into the same context a zillion times over, and you'd save the expense of copying everything. I think that the biggest thing that is being copied around here is the partition bounds, so maybe the leak wouldn't amount to much, and we could also do things like list_free(inhoids) to make it a little tighter.

Opinions? Suggestions?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
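(For illustration only: a minimal sketch of the scratch-context approach described in the message above, written against PostgreSQL's memory-context API. The context names and exact call sites are assumptions for readability, not the attached patch itself.)

    /* inside RelationBuildPartitionDesc(), sketch only */
    MemoryContext scratch;
    MemoryContext oldcxt;

    /* all the leak-prone intermediate work goes into a throwaway context */
    scratch = AllocSetContextCreate(CurrentMemoryContext,
                                    "partdesc scratch",
                                    ALLOCSET_DEFAULT_SIZES);
    oldcxt = MemoryContextSwitchTo(scratch);

    /* ... scan pg_inherits, collect bound specs, call partition_bounds_create(),
     *     etc.; anything leaked here lands in scratch ... */

    /* copy only the finished structures into the long-lived context */
    rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
                                          "partition descriptor",
                                          ALLOCSET_DEFAULT_SIZES);
    /* ... copy boundinfo and the partition OID array into rd_pdcxt,
     *     then point rel->rd_partdesc at the copies ... */

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(scratch);   /* everything leaked above goes away here */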
On 2019-Mar-13, Robert Haas wrote: > RelationBuildPartitionDesc() creates basically all of the data > structures it needs and then copies them into rel->rd_pdcxt, which has > always seemed a bit inefficient to me. Another way to redesign this > would be to have the function create a temporary context, do all of > its work there, and then reparent the context under CacheMemoryContext > at the end. That means any leaks would go into a relatively > long-lifespan context, but on the other hand you wouldn't leak into > the same context a zillion times over, and you'd save the expense of > copying everything. I think that the biggest thing that is being > copied around here is the partition bounds, so maybe the leak wouldn't > amount to much, and we could also do things like list_free(inhoids) to > make it a little tighter. I remember going over this code's memory allocation strategy a bit to avoid the copy while not incurring potential leaks into CacheMemoryContext; as I recall, my idea was to use two contexts, one of which is temporary and used for any potentially leaky callees, and destroyed at the end of the function, and the other contains the good stuff and is reparented to CacheMemoryContext at the end. So if you have any accidental leaks, they don't affect a long-lived context. You have to be mindful of not calling leaky code when you're using the permanent one. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
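(A rough sketch of the two-context arrangement just described, under the assumption that the "good stuff" is built directly in its final context and only that context is reparented; the variable and context names are illustrative.)

    MemoryContext goodcxt;      /* will become rd_pdcxt */
    MemoryContext scratch;      /* destroyed before returning */
    MemoryContext oldcxt;

    goodcxt = AllocSetContextCreate(CurrentMemoryContext,
                                    "partition descriptor",
                                    ALLOCSET_DEFAULT_SIZES);
    scratch = AllocSetContextCreate(CurrentMemoryContext,
                                    "partdesc scratch",
                                    ALLOCSET_DEFAULT_SIZES);

    oldcxt = MemoryContextSwitchTo(scratch);
    /* ... call the potentially leaky code here (catalog scans, etc.) ... */

    MemoryContextSwitchTo(goodcxt);
    /* ... build only the structures we intend to keep ... */

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(scratch);                         /* leaks go with it */
    MemoryContextSetParent(goodcxt, CacheMemoryContext);  /* no copy needed */
    rel->rd_pdcxt = goodcxt;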
On Wed, Mar 13, 2019 at 12:42 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I remember going over this code's memory allocation strategy a bit to > avoid the copy while not incurring potential leaks CacheMemoryContext; > as I recall, my idea was to use two contexts, one of which is temporary > and used for any potentially leaky callees, and destroyed at the end of > the function, and the other contains the good stuff and is reparented to > CacheMemoryContext at the end. So if you have any accidental leaks, > they don't affect a long-lived context. You have to be mindful of not > calling leaky code when you're using the permanent one. Well, that assumes that the functions which allocate the good stuff do not also leak, which seems a bit fragile. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Mar 13, 2019 at 12:42 PM Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> I remember going over this code's memory allocation strategy a bit to >> avoid the copy while not incurring potential leaks CacheMemoryContext; >> as I recall, my idea was to use two contexts, one of which is temporary >> and used for any potentially leaky callees, and destroyed at the end of >> the function, and the other contains the good stuff and is reparented to >> CacheMemoryContext at the end. So if you have any accidental leaks, >> they don't affect a long-lived context. You have to be mindful of not >> calling leaky code when you're using the permanent one. > Well, that assumes that the functions which allocate the good stuff do > not also leak, which seems a bit fragile. I'm a bit confused as to why there's an issue here at all. The usual plan for computed-on-demand relcache sub-structures is that we compute a working copy that we're going to return to the caller using the caller's context (which is presumably statement-duration at most) and then do the equivalent of copyObject to stash a long-lived copy into the relcache context. Is this case being done differently, and if so why? If it's being done the same, where are we leaking? I recall having noticed someplace where I thought the relcache partition support was simply failing to make provisions for cleaning up a cached structure at relcache entry drop, but I didn't have time to pursue it right then. Let me see if I can reconstruct what I was worried about. regards, tom lane
On Wed, Mar 13, 2019 at 1:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm a bit confused as to why there's an issue here at all. The usual > plan for computed-on-demand relcache sub-structures is that we compute > a working copy that we're going to return to the caller using the > caller's context (which is presumably statement-duration at most) > and then do the equivalent of copyObject to stash a long-lived copy > into the relcache context. Is this case being done differently, and if > so why? If it's being done the same, where are we leaking? It's being done in just that way. The caller's context is MessageContext, which is indeed statement duration. But if you plunk 10k into MessageContext a few thousand times per statement, then you chew through a couple hundred meg of memory, and apparently hyrax can't tolerate that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I wrote: > I recall having noticed someplace where I thought the relcache partition > support was simply failing to make provisions for cleaning up a cached > structure at relcache entry drop, but I didn't have time to pursue it > right then. Let me see if I can reconstruct what I was worried about. Ah, here we are: it was rel->rd_partcheck. I'm not sure exactly how complicated that structure can be, but I'm pretty sure that this is a laughably inadequate method of cleaning it up:

    if (relation->rd_partcheck)
        pfree(relation->rd_partcheck);

Having it be loose data in CacheMemoryContext, which is the current state of affairs, is just not going to be practical to clean up. I suggest that maybe it could be copied into rd_partkeycxt or rd_pdcxt, so that it'd go away as a byproduct of freeing those. If there's a reason it has to be independent of both, it'll have to have its own small context. Dunno if that's related to hyrax's issue, though. regards, tom lane
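(To make the "own small context" alternative concrete, a sketch using a hypothetical rd_partcheckcxt field; the field name, the partcheck_quals local, and the placement are assumptions, not existing code.)

    /* where rd_partcheck gets built: */
    MemoryContext oldcxt;

    relation->rd_partcheckcxt = AllocSetContextCreate(CacheMemoryContext,
                                                      "partition constraint",
                                                      ALLOCSET_SMALL_SIZES);
    oldcxt = MemoryContextSwitchTo(relation->rd_partcheckcxt);
    relation->rd_partcheck = copyObject(partcheck_quals);  /* hypothetical local */
    MemoryContextSwitchTo(oldcxt);

    /* in RelationDestroyRelation, instead of the bare pfree(): */
    if (relation->rd_partcheckcxt)
        MemoryContextDelete(relation->rd_partcheckcxt);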
On 2019-Mar-13, Robert Haas wrote: > On Wed, Mar 13, 2019 at 12:42 PM Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > I remember going over this code's memory allocation strategy a bit to > > avoid the copy while not incurring potential leaks CacheMemoryContext; > > as I recall, my idea was to use two contexts, one of which is temporary > > and used for any potentially leaky callees, and destroyed at the end of > > the function, and the other contains the good stuff and is reparented to > > CacheMemoryContext at the end. So if you have any accidental leaks, > > they don't affect a long-lived context. You have to be mindful of not > > calling leaky code when you're using the permanent one. > > Well, that assumes that the functions which allocate the good stuff do > not also leak, which seems a bit fragile. A bit, yes, but not overly so, and it's less fragile than not having such a protection. Anything that allocates in CacheMemoryContext needs to be very careful anyway. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 13, 2019 at 1:38 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > A bit, yes, but not overly so, and it's less fragile that not having > such a protection. Anything that allocates in CacheMemoryContext needs > to be very careful anyway. True, but I think it's more fragile than either of the options I proposed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 13, 2019 at 1:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Ah, here we are: it was rel->rd_partcheck. I'm not sure exactly how > complicated that structure can be, but I'm pretty sure that this is > a laughably inadequate method of cleaning it up: > > if (relation->rd_partcheck) > pfree(relation->rd_partcheck); Oh, for crying out loud. > Dunno if that's related to hyrax's issue, though. It's related in the sense that it's a leak, and any leak will tend to run the system out of memory more easily, but what I observed was a large leak into MessageContext, whereas that would be a leak into CacheMemoryContext, so I think it's probably a sideshow rather than the main event. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019-Mar-13, Robert Haas wrote: > On Wed, Mar 13, 2019 at 1:38 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > A bit, yes, but not overly so, and it's less fragile that not having > > such a protection. Anything that allocates in CacheMemoryContext needs > > to be very careful anyway. > > True, but I think it's more fragile than either of the options I proposed. You do? Unless I misunderstood, your options are: 1. (the patch you attached) create a temporary memory context that is used for everything, then at the end copy the good stuff to CacheMemCxt (or a sub-context thereof). This still needs to copy. 2. create a temp memory context, do everything there, do retail freeing of everything we don't want, reparenting the context to CacheMemCxt. Hope we didn't forget to pfree anything. How is any of those superior to what I propose? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas <robertmhaas@gmail.com> writes: >> Dunno if that's related to hyrax's issue, though. > It's related in the sense that it's a leak, and any leak will tend to > run the system out of memory more easily, but what I observed was a > large leak into MessageContext, and that would be a leak into > CacheMemoryContext, so I think it's probably a sideshow rather than > the main event. OK, in that case it's definitely all the temporary data that gets created that is the problem. I've not examined your patch in great detail but it looks plausible for fixing that. I think that RelationBuildPartitionDesc could use some additional cleanup or at least better commenting. In particular, it's neither documented nor obvious to the naked eye why rel->rd_partdesc mustn't get set till the very end. As the complainant, I'm willing to go fix that, but do you want to push your patch first so it doesn't get broken? Or I could include your patch in the cleanup. regards, tom lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > You do? Unless I misunderstood, your options are: > 1. (the patch you attached) create a temporary memory context that is > used for everything, then at the end copy the good stuff to CacheMemCxt > (or a sub-context thereof). This still needs to copy. > 2. create a temp memory context, do everything there, do retail freeing > of everything we don't want, reparenting the context to CacheMemCxt. > Hope we didn't forget to pfree anything. > How is any of those superior to what I propose? I doubt that what you're suggesting is terribly workable. It's not just RelationBuildPartitionDesc that's at issue. Part of what will be the long-lived data structure is made by partition_bounds_create, and that invokes quite a boatload of code that both makes the final data structure and leaks a lot of intermediate junk. Having to be very careful about permanent vs temporary data throughout all of that seems like a recipe for bugs. The existing code in RelationBuildPartitionDesc is already pretty good about avoiding copying of data other than the output of partition_bounds_create. In fact, I think it's already close to what you're suggesting other than that point. So I think --- particularly given that we need something we can back-patch into v11 --- that we shouldn't try to do anything much more complicated than what Robert is suggesting. regards, tom lane
On Wed, Mar 13, 2019 at 2:21 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2019-Mar-13, Robert Haas wrote: > > On Wed, Mar 13, 2019 at 1:38 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > A bit, yes, but not overly so, and it's less fragile that not having > > > such a protection. Anything that allocates in CacheMemoryContext needs > > > to be very careful anyway. > > > > True, but I think it's more fragile than either of the options I proposed. > > You do? Unless I misunderstood, your options are: > > 1. (the patch you attached) create a temporary memory context that is > used for everything, then at the end copy the good stuff to CacheMemCxt > (or a sub-context thereof). This still needs to copy. > > 2. create a temp memory context, do everything there, do retail freeing > of everything we don't want, reparenting the context to CacheMemCxt. > Hope we didn't forget to pfree anything. > > How is any of those superior to what I propose? Well, *I* thought of it, so obviously it must be superior. :-) More seriously, your idea does seem better in some ways. My #1 doesn't avoid the copy, but does kill the leaks. My #2 avoids the copy, but risks a different flavor of leakage. Your idea also avoids the copy and leaks in fewer cases than my #2. In that sense, it is the technically superior option. However, it also requires more memory context switches than either of my options, and I guess that seems fragile to me in the sense that I think future code changes are more likely to go wrong due to the complexity involved. I might be mistaken about that, though. One other note is that, although the extra copy looks irksome, it's probably not very significant from a performance point of view. In a non-CLOBBER_CACHE_ALWAYS build it doesn't happen frequently enough to matter, and in a CLOBBER_CACHE_ALWAYS build everything is already so slow that it still doesn't matter. That's not the only consideration, though. Code which copies data structures might be buggy, or might develop bugs in the future, and avoiding that copy would avoid exposure to such bugs. On the other hand, it's not really possible to remove the copying without increasing the risk of leaking into the long-lived context. In some ways, I think this is all a natural outgrowth of the fact that we rely on palloc() in so many places instead of forcing code to be explicit about which memory context it intends to target. Global variables are considered harmful, and it's not that difficult to see the connection between that general principle and present difficulties. However, I will not have time to write and debug a patch reversing that choice between now and feature freeze. :-) I'm kinda inclined to go for the brute-force approach of slamming the temporary context in there as in the patch I proposed upthread, which should solve the immediate problem, and we can implement one of these other ideas later if we want. What do you think about that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 13, 2019 at 2:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > OK, in that case it's definitely all the temporary data that gets created > that is the problem. I've not examined your patch in great detail but > it looks plausible for fixing that. Cool. > I think that RelationBuildPartitionDesc could use some additional cleanup > or at least better commenting. In particular, it's neither documented nor > obvious to the naked eye why rel->rd_partdesc mustn't get set till the > very end. As the complainant, I'm willing to go fix that, but do you want > to push your patch first so it doesn't get broken? Or I could include > your patch in the cleanup. Yeah, that probably makes sense, though it might be polite to wait another hour or two to see if anyone wants to argue with that approach further. It seems kinda obvious to me why rel->rd_partdesc can't get set until the end. Isn't it just that you'd better not set a permanent pointer to a data structure until you're past any code that might ERROR, which is pretty much everything? That principle applies to lots of PostgreSQL code, not just this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Mar 13, 2019 at 2:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think that RelationBuildPartitionDesc could use some additional cleanup >> or at least better commenting. In particular, it's neither documented nor >> obvious to the naked eye why rel->rd_partdesc mustn't get set till the >> very end. As the complainant, I'm willing to go fix that, but do you want >> to push your patch first so it doesn't get broken? Or I could include >> your patch in the cleanup. > It seems kinda obvious to me why rel->rd_partdesc can't get set until > the end. Isn't it just that you'd better not set a permanent pointer > to a data structure until you're past any code that might ERROR, which > is pretty much everything? That principle applies to lots of > PostgreSQL code, not just this. Yeah, but usually there's some comment pointing it out. I also wonder if there aren't corner-case bugs; it seems a bit bogus for example that rd_pdcxt is created without any thought as to whether it might be set already. It's not clear whether this has been written with the level of paranoia that's appropriate for messing with a relcache entry, and some comments would make it a lot clearer (a) if that is true and (b) what assumptions are implicitly being shared with relcache.c. Meanwhile, who's going to take point on cleaning up rd_partcheck? I don't really understand this code well enough to know whether that can share one of the existing partitioning-related sub-contexts. regards, tom lane
On Wed, Mar 13, 2019 at 3:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah, but usually there's some comment pointing it out. I also wonder > if there aren't corner-case bugs; it seems a bit bogus for example that > rd_pdcxt is created without any thought as to whether it might be set > already. It's not clear whether this has been written with the > level of paranoia that's appropriate for messing with a relcache entry, > and some comments would make it a lot clearer (a) if that is true and > (b) what assumptions are implicitly being shared with relcache.c. Yeah, this is all pretty new code, and it probably has some bugs we haven't found yet. > Meanwhile, who's going to take point on cleaning up rd_partcheck? > I don't really understand this code well enough to know whether that > can share one of the existing partitioning-related sub-contexts. Well, it would be nice if Amit Langote worked on it since he wrote the original version of most of this code, but I'm sure he'll defer to you if you feel the urge to work on it, or I can take a look at it (not today). To your question, I think it probably can't share a context. Briefly, rd_partkey can't change ever, except that when a partitioned relation is in the process of being created it is briefly NULL; once it obtains a value, that value cannot be changed. If you want to range-partition a list-partitioned table or something like that, you have to drop the table and create a new one. I think that's a perfectly acceptable permanent limitation and I have no urge whatever to change it. rd_partdesc changes when you attach or detach a child partition. rd_partcheck is the reverse: it changes when you attach or detach this partition to or from a parent. It's probably helpful to think of the case of a table with partitions each of which is itself partitioned -- the table at that middle level has to worry both about gaining or losing children and about being ripped away from its parent. As a parenthetical note, I observe that relcache.c seems to know almost nothing about rd_partcheck. rd_partkey and rd_partdesc both have handling in RelationClearRelation(), but rd_partcheck does not, and I suspect that's wrong. So the problems are probably not confined to the relcache-drop-time problem that you observed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Mar 13, 2019 at 3:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Meanwhile, who's going to take point on cleaning up rd_partcheck? >> I don't really understand this code well enough to know whether that >> can share one of the existing partitioning-related sub-contexts. > To your question, I think it probably can't share a context. Briefly, > rd_partkey can't change ever, except that when a partitioned relation > is in the process of being created it is briefly NULL; once it obtains > a value, that value cannot be changed. If you want to range-partition > a list-partitioned table or something like that, you have to drop the > table and create a new one. I think that's a perfectly acceptable > permanent limitation and I have no urge whatever to change it. > rd_partdesc changes when you attach or detach a child partition. > rd_partcheck is the reverse: it changes when you attach or detach this > partition to or from a parent. Got it. Yeah, it seems like the clearest and least bug-prone solution is for those to be in three separate sub-contexts. The only reason I was trying to avoid that was the angle of how to back-patch into 11. However, we can just add the additional context pointer field at the end of the Relation struct in v11, and that should be good enough to avoid ABI problems. Off topic for the moment, since this clearly wouldn't be back-patch material, but I'm starting to wonder if we should just have a context for each relcache entry and get rid of most or all of the retail cleanup logic in RelationDestroyRelation ... regards, tom lane
On Wed, Mar 13, 2019 at 4:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Off topic for the moment, since this clearly wouldn't be back-patch > material, but I'm starting to wonder if we should just have a context > for each relcache entry and get rid of most or all of the retail > cleanup logic in RelationDestroyRelation ... I think that idea might have a lot of merit, but I haven't studied it closely. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 13, 2019 at 4:38 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 13, 2019 at 4:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Off topic for the moment, since this clearly wouldn't be back-patch > > material, but I'm starting to wonder if we should just have a context > > for each relcache entry and get rid of most or all of the retail > > cleanup logic in RelationDestroyRelation ... > > I think that idea might have a lot of merit, but I haven't studied it closely. It just occurred to me that one advantage of this would be that you could see how much memory was being used by each relcache entry using MemoryContextStats(), which seems super-appealing. In fact, what about getting rid of all allocations in CacheMemoryContext itself in favor of some more specific context in each case? That would make it a lot clearer where to look for leaks -- or efficiencies. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-03-13 16:50:53 -0400, Robert Haas wrote: > On Wed, Mar 13, 2019 at 4:38 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Mar 13, 2019 at 4:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Off topic for the moment, since this clearly wouldn't be back-patch > > > material, but I'm starting to wonder if we should just have a context > > > for each relcache entry and get rid of most or all of the retail > > > cleanup logic in RelationDestroyRelation ... > > > > I think that idea might have a lot of merit, but I haven't studied it closely. > > It just occurred to me that one advantage of this would be that you > could see how much memory was being used by each relcache entry using > MemoryContextStats(), which seems super-appealing. In fact, what > about getting rid of all allocations in CacheMemoryContext itself in > favor of some more specific context in each case? That would make it > a lot clearer where to look for leaks -- or efficiencies. But it might also make it frustrating to look at memory context dumps - we'd suddenly have many, many more memory context lines displayed, right? Wouldn't that often make the dump extremely long? To be clear, I think the idea has merit. Just want to raise the above point. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-03-13 16:50:53 -0400, Robert Haas wrote: >> On Wed, Mar 13, 2019 at 4:38 PM Robert Haas <robertmhaas@gmail.com> wrote: >>> On Wed, Mar 13, 2019 at 4:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> Off topic for the moment, since this clearly wouldn't be back-patch >>>> material, but I'm starting to wonder if we should just have a context >>>> for each relcache entry and get rid of most or all of the retail >>>> cleanup logic in RelationDestroyRelation ... >> It just occurred to me that one advantage of this would be that you >> could see how much memory was being used by each relcache entry using >> MemoryContextStats(), which seems super-appealing. > But it might also make it frustrating to look at memory context dumps - > we'd suddenly have many many more memory context lines we'd displayed, > right? Wouldn't that often make the dump extremely long? There's already a mechanism in there to suppress child contexts after 100 or so, which would almost inevitably kick in on the relcache if we did this. So I don't believe we'd have a problem with the context dumps getting too long --- more likely, the complaints would be the reverse. My gut feeling is that right now relcache entries tend to be mas-o-menos the same size, except for stuff that is *already* in sub-contexts, like index and partition descriptors. So I'm not that excited about this adding useful info to context dumps. I was just looking at it as a way to make relcache entry cleanup simpler and less leak-prone. Having said that, I do agree that CacheMemoryContext is too much of an undifferentiated blob right now, and splitting it up seems like it'd be good for accountability. I'd definitely be +1 for a catcache vs. relcache vs. other caches split. You could imagine per-catcache contexts, too. The main limiting factor here is that the per-context overhead could get excessive. regards, tom lane
Hi, On 2019-03-13 17:10:55 -0400, Tom Lane wrote: > There's already a mechanism in there to suppress child contexts after > 100 or so, which would almost inevitably kick in on the relcache if we > did this. So I don't believe we'd have a problem with the context dumps > getting too long --- more likely, the complaints would be the reverse. Well, that's two sides of the same coin. > Having said that, I do agree that CacheMemoryContext is too much of an > undifferentiated blob right now, and splitting it up seems like it'd be > good for accountability. I'd definitely be +1 for a catcache vs. relcache > vs. other caches split. That'd make a lot of sense. > You could imagine per-catcache contexts, too. > The main limiting factor here is that the per-context overhead could get > excessive. Yea, per relcache entry contexts seem like they'd get really expensive fast. Even per-catcache seems like it might be noticeable additional overhead for a new backend. Greetings, Andres Freund
On 2019/03/14 5:18, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Mar 13, 2019 at 3:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Meanwhile, who's going to take point on cleaning up rd_partcheck? >>> I don't really understand this code well enough to know whether that >>> can share one of the existing partitioning-related sub-contexts. > >> To your question, I think it probably can't share a context. Briefly, >> rd_partkey can't change ever, except that when a partitioned relation >> is in the process of being created it is briefly NULL; once it obtains >> a value, that value cannot be changed. If you want to range-partition >> a list-partitioned table or something like that, you have to drop the >> table and create a new one. I think that's a perfectly acceptable >> permanent limitation and I have no urge whatever to change it. >> rd_partdesc changes when you attach or detach a child partition. >> rd_partcheck is the reverse: it changes when you attach or detach this >> partition to or from a parent. > > Got it. Yeah, it seems like the clearest and least bug-prone solution > is for those to be in three separate sub-contexts. The only reason > I was trying to avoid that was the angle of how to back-patch into 11. > However, we can just add the additional context pointer field at the > end of the Relation struct in v11, and that should be good enough to > avoid ABI problems. Agree that rd_partcheck needs its own context as you have complained in the past [1]. I think we'll need to back-patch this fix to PG 10 as well. I've attached patches for all 3 branches. Thanks, Amit [1] https://www.postgresql.org/message-id/22236.1523374067%40sss.pgh.pa.us
Attachment
On Wed, Mar 13, 2019 at 2:59 PM Robert Haas <robertmhaas@gmail.com> wrote: > Yeah, that probably makes sense, though it might be polite to wait > another hour or two to see if anyone wants to argue with that approach > further. Hearing nobody, done. If someone wants to argue more we can always change it later. I did not back-patch, because the code is in a different file in v11, none of the hunks of the patch apply on v11, and v11 is not failing on hyrax. But feel free to do it if you feel strongly about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I did not back-patch, because the code is in a different file in v11, > none of the hunks of the patch apply on v11, and v11 is not failing on > hyrax. Hmm, I wonder why not. I suppose the answer is that the leak is worse in HEAD than before, but how come? I followed your reference to 898e5e329, and I've got to say that the hunk it adds in relcache.c looks fishy as can be. The argument that the rd_pdcxt "will get cleaned up eventually" reads to me like "this leaks memory like a sieve", especially in the repeated-rebuild scenario which is exactly what CLOBBER_CACHE_ALWAYS would provoke. Probably the only thing that keeps it from being effectively a session-lifespan leak is that CCA will generally result in relcache entries being flushed entirely as soon as their refcount goes to 0. Still, because of that, I wouldn't think it'd move the needle very much on a CCA animal; so my guess is that there's something else. regards, tom lane
On Thu, Mar 14, 2019 at 12:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hmm, I wonder why not. I suppose the answer is that > the leak is worse in HEAD than before, but how come? I'd like to know that, too, but right now I don't. > I followed your reference to 898e5e329, and I've got to say that > the hunk it adds in relcache.c looks fishy as can be. The argument > that the rd_pdcxt "will get cleaned up eventually" reads to me like > "this leaks memory like a sieve", especially in the repeated-rebuild > scenario which is exactly what CLOBBER_CACHE_ALWAYS would provoke. > Probably the only thing that keeps it from being effectively a > session-lifespan leak is that CCA will generally result in relcache > entries being flushed entirely as soon as their refcount goes to 0. > Still, because of that, I wouldn't think it'd move the needle very > much on a CCA animal; so my guess is that there's something else. I'm a little uncertain of that logic, too, but keep in mind that if keep_partdesc is true, we're just going to throw away the newly-built data and keep the old stuff. So I think that in order for this to actually be a problem, you would have to have other sessions that repeatedly alter the partitiondesc via ATTACH PARTITION, and at the same time, the current session would have to keep on reading it. But you also have to never add a partition while the local session is between queries, because if you do, rebuild will be false and we'll blow away the old entry in its entirety and any old contexts that are hanging off of it will be destroyed as well. So it seems a little ticklish to me to figure out a realistic scenario in which we actually leak enough memory to matter here, but maybe you have an idea of which I have not thought. We could certainly do better - just refcount each PartitionDesc. Have the relcache entry hold a refcount on the PartitionDesc and have a PartitionDirectory hold a refcount on the PartitionDesc; then, any time the refcount goes to 0, you can immediately destroy the PartitionDesc. Or, simpler, arrange things so that when the relcache entry's refcount goes to 0, any old PartitionDescs that are still hanging off of the latest one get destroyed then, not later. It's just a question of whether it's really worth the code, or whether we're fixing imaginary bugs by adding code that might have real ones. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Mar 14, 2019 at 12:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm, I wonder why not. I suppose the answer is that >> the leak is worse in HEAD than before, but how come? > I'd like to know that, too, but right now I don't. I poked at this some more, by dint of adding some memory context usage tracking and running it under CLOBBER_CACHE_ALWAYS --- specifically, I looked at the update.sql test script, which is one of the two that is giving hyrax trouble. (Conveniently, that one runs successfully by itself, without need for the rest of the regression suite.) I found that even with 2455ab488, there still seemed to be an unreasonable amount of MessageContext bloat during some of the queries, on the order of 50MB in some of them. Investigating more closely, I found that 2455ab488 has a couple of simple oversights: it's still calling partition_bounds_create in the caller's context, allowing whatever that builds or leaks to be a MessageContext leak. And it's still calling get_rel_relkind() in the rd_pdcxt context, potentially leaking a *lot* of stuff into that normally-long-lived context, since that will result in fresh catcache (and thence potentially relcache) loads in some cases. But fixing that didn't seem to move the needle very much for update.sql. With some further investigation, I identified a few other main contributors to the bloat: RelationBuildTriggers RelationBuildRowSecurity RelationBuildPartitionKey Are you noticing a pattern yet? The real issue here is that we have this its-okay-to-leak-in-the-callers-context mindset throughout relcache.c, and when CLOBBER_CACHE_ALWAYS causes us to reload relcache entries a lot, that adds up fast. The partition stuff makes this worse, I think, primarily just because it increases the number of tables we touch in a typical query. What I'm thinking, therefore, is that 2455ab488 had the right idea but didn't take it far enough. We should remove the temp-context logic it added to RelationBuildPartitionDesc and instead put that one level up, in RelationBuildDesc, where the same temp context can serve all of these leak-prone sub-facilities. Possibly it'd make sense to conditionally compile this so that we only do it in a CLOBBER_CACHE_ALWAYS build. I'm not very sure about that, but arguably in a normal build the overhead of making and destroying a context would outweigh the cost of the leaked memory. The main argument I can think of for doing it all the time is that having memory allocation work noticeably differently in CCA builds than normal ones seems like a recipe for masking normal-mode bugs from the CCA animals. But that's only a guess not proven fact; it's also possible that having two different behaviors would expose bugs we'd otherwise have a hard time detecting, such as continuing to rely on the "temporary" data structures longer than we should. (This is all independent of the other questions raised on this and nearby threads.) regards, tom lane
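(A sketch of what hoisting the temp context up into RelationBuildDesc might look like, with the conditional-compilation idea folded in. The #ifdef placement, context name, and elided body are assumptions about a possible patch, not the eventual commit.)

    static Relation
    RelationBuildDesc(Oid targetRelId, bool insertIt)
    {
    #ifdef CLOBBER_CACHE_ALWAYS
        /* one scratch context catches leaks from all the sub-builders:
         * RelationBuildTriggers, RelationBuildRowSecurity,
         * RelationBuildPartitionKey, RelationBuildPartitionDesc, ... */
        MemoryContext tmpcxt = AllocSetContextCreate(CurrentMemoryContext,
                                                     "RelationBuildDesc workspace",
                                                     ALLOCSET_DEFAULT_SIZES);
        MemoryContext oldcxt = MemoryContextSwitchTo(tmpcxt);
    #endif

        /* ... existing body of RelationBuildDesc, unchanged ... */

    #ifdef CLOBBER_CACHE_ALWAYS
        MemoryContextSwitchTo(oldcxt);
        MemoryContextDelete(tmpcxt);    /* drop whatever the sub-builders leaked */
    #endif
        return relation;
    }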
On 15/03/2019 00:08, Tom Lane wrote: > What I'm thinking, therefore, is that 2455ab488 had the right idea but > didn't take it far enough. We should remove the temp-context logic it > added to RelationBuildPartitionDesc and instead put that one level up, > in RelationBuildDesc, where the same temp context can serve all of these > leak-prone sub-facilities. > > Possibly it'd make sense to conditionally compile this so that we only > do it in a CLOBBER_CACHE_ALWAYS build. I'm not very sure about that, > but arguably in a normal build the overhead of making and destroying > a context would outweigh the cost of the leaked memory. The main > argument I can think of for doing it all the time is that having memory > allocation work noticeably differently in CCA builds than normal ones > seems like a recipe for masking normal-mode bugs from the CCA animals. Having CLOBBER_CACHE_ALWAYS behave differently sounds horrible. We maintain a free list of AllocSetContexts nowadays, so creating a short-lived context should be pretty cheap. Or if it's still too expensive, we could create one short-lived context as a child of TopMemoryContext, and reuse that on every call, resetting it at the end of the function. - Heikki
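(The reuse-and-reset variant mentioned above, sketched under the assumption of a single backend-local context; as the next message points out, RelationBuildDesc can recurse, so a lone static context like this would need more care in practice.)

    static MemoryContext relbuild_scratch = NULL;   /* file-scope in relcache.c */

    /* in the build function: */
    MemoryContext oldcxt;

    if (relbuild_scratch == NULL)
        relbuild_scratch = AllocSetContextCreate(TopMemoryContext,
                                                 "relcache build scratch",
                                                 ALLOCSET_DEFAULT_SIZES);

    oldcxt = MemoryContextSwitchTo(relbuild_scratch);
    /* ... leak-prone build work ... */
    MemoryContextSwitchTo(oldcxt);
    MemoryContextReset(relbuild_scratch);   /* frees the allocations, keeps the context */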
On Fri, Mar 15, 2019 at 3:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > We maintain a free list of AllocSetContexts nowadays, so creating a > short-lived context should be pretty cheap. Or if it's still too > expensive, we could create one short-lived context as a child of > TopMemoryContext, and reuse that on every call, resetting it at the end > of the function. Relcache rebuild is reentrant, I think, so we have to be careful about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 14, 2019 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I found that even with 2455ab488, there still seemed to be an > unreasonable amount of MessageContext bloat during some of the queries, > on the order of 50MB in some of them. Investigating more closely, > I found that 2455ab488 has a couple of simple oversights: it's still > calling partition_bounds_create in the caller's context, allowing > whatever that builds or leaks to be a MessageContext leak. Oops. > And it's > still calling get_rel_relkind() in the rd_pdcxt context, potentially > leaking a *lot* of stuff into that normally-long-lived context, since > that will result in fresh catcache (and thence potentially relcache) > loads in some cases. That's really unfortunate. I know CLOBBER_CACHE_ALWAYS testing has a lot of value, but get_rel_relkind() is the sort of function that developers need to be able to call without worrying about creating a massive memory leak. Maybe the only time it is really going to get called enough to matter is from within the cache invalidation machinery itself, in which case your idea will fix it. But I'm not sure about that. > What I'm thinking, therefore, is that 2455ab488 had the right idea but > didn't take it far enough. We should remove the temp-context logic it > added to RelationBuildPartitionDesc and instead put that one level up, > in RelationBuildDesc, where the same temp context can serve all of these > leak-prone sub-facilities. Yeah, that sounds good. > Possibly it'd make sense to conditionally compile this so that we only > do it in a CLOBBER_CACHE_ALWAYS build. I'm not very sure about that, > but arguably in a normal build the overhead of making and destroying > a context would outweigh the cost of the leaked memory. The main > argument I can think of for doing it all the time is that having memory > allocation work noticeably differently in CCA builds than normal ones > seems like a recipe for masking normal-mode bugs from the CCA animals. > But that's only a guess not proven fact; it's also possible that having > two different behaviors would expose bugs we'd otherwise have a hard > time detecting, such as continuing to rely on the "temporary" data > structures longer than we should. I lean toward thinking it makes more sense to be consistent, but I'm not sure that's right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On 15/03/2019 00:08, Tom Lane wrote: >> Possibly it'd make sense to conditionally compile this so that we only >> do it in a CLOBBER_CACHE_ALWAYS build. I'm not very sure about that, >> but arguably in a normal build the overhead of making and destroying >> a context would outweigh the cost of the leaked memory. The main >> argument I can think of for doing it all the time is that having memory >> allocation work noticeably differently in CCA builds than normal ones >> seems like a recipe for masking normal-mode bugs from the CCA animals. > Having CLOBBER_CACHE_ALWAYS behave differently sounds horrible. I'm not entirely convinced of that. It's already pretty different... > We maintain a free list of AllocSetContexts nowadays, so creating a > short-lived context should be pretty cheap. Or if it's still too > expensive, we could create one short-lived context as a child of > TopMemoryContext, and reuse that on every call, resetting it at the end > of the function. RelationBuildDesc potentially recurses, so a single context won't do. Perhaps you're right that the context freelist would make this cheap enough to not matter, but I'm not sure of that either. What I'm inclined to do to get the buildfarm pressure off is to set things up so that by default we do this only in CCA builds, but there's a #define you can set from the compile command line to override that decision in either direction. Then, if somebody wants to investigate the performance effects in more detail, they could twiddle it easily. Depending on the results, we could alter the default policy. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Mar 14, 2019 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> And it's >> still calling get_rel_relkind() in the rd_pdcxt context, potentially >> leaking a *lot* of stuff into that normally-long-lived context, since >> that will result in fresh catcache (and thence potentially relcache) >> loads in some cases. > That's really unfortunate. I know CLOBBER_CACHE_ALWAYS testing has a > lot of value, but get_rel_relkind() is the sort of function that > developers need to be able to call without worrying with creating a > massive memory leak. I don't find that to be a concern. The bug here is that random code is being called while CurrentMemoryContext is pointing to a long-lived context, and that's just a localized violation of a well understood coding convention. I don't think that that means we need some large fire drill in response. >> What I'm thinking, therefore, is that 2455ab488 had the right idea but >> didn't take it far enough. We should remove the temp-context logic it >> added to RelationBuildPartitionDesc and instead put that one level up, >> in RelationBuildDesc, where the same temp context can serve all of these >> leak-prone sub-facilities. > Yeah, that sounds good. OK, I'll have a go at that today. >> Possibly it'd make sense to conditionally compile this so that we only >> do it in a CLOBBER_CACHE_ALWAYS build. I'm not very sure about that, > I lean toward thinking it makes more sense to be consistent, but I'm > not sure that's right. My feeling right now is that the its-okay-to-leak policy has been in place for decades and we haven't detected a problem with it before. Hence, doing that differently in normal builds should require some positive evidence that it'd be beneficial (or at least not a net loss). I don't have the time or interest to collect such evidence. But I'm happy to set up the patch to make it easy for someone else to do so, if anyone is sufficiently excited about this to do the work. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Mar 14, 2019 at 12:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm, I wonder why not. I suppose the answer is that >> the leak is worse in HEAD than before, but how come? > I'd like to know that, too, but right now I don't. BTW, after closer study of 898e5e329 I have a theory as to why it made things worse for CCA animals: it causes relcache entries to be held open (via incremented refcounts) throughout planning, which they were not before. This means that, during each of the many RelationCacheInvalidate events that will occur during a planning cycle, we have to rebuild those relcache entries; previously they were just dropped once and that was the end of it till execution. You noted correctly that the existing PartitionDesc structure would generally get preserved during each reload --- but that has exactly zip to do with the amount of transient space consumed by execution of RelationBuildDesc. We still build a transient PartitionDesc that we then elect not to install. And besides that there's all the not-partitioning-related stuff that RelationBuildDesc can leak. This is probably largely irrelevant for non-CCA situations, since we don't expect forced relcache invals to occur often otherwise. But it seems to fit the facts, particularly that hyrax is only dying on queries that involve moderately large partitioning structures and hence quite a few relations pinned by PartitionDirectory entries. I remain of the opinion that we ought not have PartitionDirectories pinning relcache entries ... but this particular effect isn't really significant to that argument, I think. regards, tom lane
On Fri, Mar 15, 2019 at 2:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > BTW, after closer study of 898e5e329 I have a theory as to why it > made things worse for CCA animals: it causes relcache entries to be > held open (via incremented refcounts) throughout planning, which > they were not before. This means that, during each of the many > RelationCacheInvalidate events that will occur during a planning > cycle, we have to rebuild those relcache entries; previously they > were just dropped once and that was the end of it till execution. That makes sense. Thanks for looking at it, and for the insight. > You noted correctly that the existing PartitionDesc structure would > generally get preserved during each reload --- but that has exactly > zip to do with the amount of transient space consumed by execution > of RelationBuildDesc. We still build a transient PartitionDesc > that we then elect not to install. And besides that there's all the > not-partitioning-related stuff that RelationBuildDesc can leak. Right. So it's just that we turned a bunch of rebuild = false situations into rebuild = true situations. I think. > This is probably largely irrelevant for non-CCA situations, since > we don't expect forced relcache invals to occur often otherwise. > But it seems to fit the facts, particularly that hyrax is only dying > on queries that involve moderately large partitioning structures > and hence quite a few relations pinned by PartitionDirectory entries. > > I remain of the opinion that we ought not have PartitionDirectories > pinning relcache entries ... but this particular effect isn't really > significant to that argument, I think. Cool. I would still like to hear more about why you think that. I agree that the pinning is not great, but copying moderately large partitioning structures that are only likely to get more complex in future releases also does not seem great, so I think it may be the least of evils. You, on the other hand, seem to have a rather visceral hatred for it. That may be based on facts with which I am not acquainted or it may be that we have a different view of what good design looks like. If it's the latter, we may just have to agree to disagree and see if other people want to opine, but if it's the former, I'd like to know those facts. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Mar 15, 2019 at 2:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> BTW, after closer study of 898e5e329 I have a theory as to why it >> made things worse for CCA animals: it causes relcache entries to be >> held open (via incremented refcounts) throughout planning, which >> they were not before. This means that, during each of the many >> RelationCacheInvalidate events that will occur during a planning >> cycle, we have to rebuild those relcache entries; previously they >> were just dropped once and that was the end of it till execution. > Right. So it's just that we turned a bunch of rebuild = false > situations into rebuild = true situations. I think. More to the point, we turned *one* rebuild = false situation into a bunch of rebuild = true situations. I haven't studied it closely, but I think a CCA animal would probably see O(N^2) rebuild = true invocations in a query with N partitions, since each time expand_partitioned_rtentry opens another child table, we'll incur an AcceptInvalidationMessages call which leads to forced rebuilds of the previously-pinned rels. In non-CCA situations, almost always nothing happens with the previously-examined relcache entries. >> I remain of the opinion that we ought not have PartitionDirectories >> pinning relcache entries ... but this particular effect isn't really >> significant to that argument, I think. > Cool. I would still like to hear more about why you think that. I > agree that the pinning is not great, but copying moderately large > partitioning structures that are only likely to get more complex in > future releases also does not seem great, so I think it may be the > least of evils. You, on the other hand, seem to have a rather > visceral hatred for it. I agree that copying data isn't great. What I don't agree is that this solution is better. In particular, copying data out of the relcache rather than expecting the relcache to hold still over long periods is the way we've done things everywhere else, cf RelationGetIndexList, RelationGetStatExtList, RelationGetIndexExpressions, RelationGetIndexPredicate, RelationGetIndexAttrBitmap, RelationGetExclusionInfo, GetRelationPublicationActions. I don't care for a patch randomly deciding to do things differently on the basis of an unsupported-by-evidence argument that it might cost too much to copy the data. If we're going to make a push to reduce the amount of copying of that sort that we do, it should be a separately (and carefully) designed thing that applies to all the relcache substructures that have the issue, not one-off hacks that haven't been reviewed thoroughly. I especially don't care for the much-less-than-half-baked kluge of chaining the old rd_pdcxt onto the new one and hoping that it will go away at a suitable time. Yeah, that will work all right as long as it's not stressed very hard, but we've found repeatedly that users will find a way to overstress places where we're sloppy about resource management. In this case, since the leak is potentially of session lifespan, I doubt it's even very hard to cause unbounded growth in relcache space usage --- you just have to do $whatever over and over. Let's see ... 
regression=# create table parent (a text, b int) partition by list (a);
CREATE TABLE
regression=# create table child (a text, b int);
CREATE TABLE
regression=# do $$
regression$# begin
regression$# for i in 1..10000000 loop
regression$# alter table parent attach partition child for values in ('x');
regression$# alter table parent detach partition child;
regression$# end loop;
regression$# end $$;

After letting that run for awhile, I've got enough rd_pdcxts to send MemoryContextStats into a nasty loop.

CacheMemoryContext: 4259904 total in 11 blocks; 1501120 free (19 chunks); 2758784 used
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
... yadda yadda ...

The leak isn't terribly fast, since this is an expensive thing to be doing, but it's definitely leaking.

> That may be based on facts with which I am > not acquainted or it may be that we have a different view of what good > design looks like.

I don't think solving the same problem in multiple ways constitutes good design, barring compelling reasons for the solutions being different, which you haven't presented. Solving the same problem in multiple ways some of which are buggy is *definitely* not good design.

regards, tom lane
On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > More to the point, we turned *one* rebuild = false situation into > a bunch of rebuild = true situations. I haven't studied it closely, > but I think a CCA animal would probably see O(N^2) rebuild = true > invocations in a query with N partitions, since each time > expand_partitioned_rtentry opens another child table, we'll incur > an AcceptInvalidationMessages call which leads to forced rebuilds > of the previously-pinned rels. In non-CCA situations, almost always > nothing happens with the previously-examined relcache entries. That's rather unfortunate. I start to think that clobbering the cache "always" is too hard a line. > I agree that copying data isn't great. What I don't agree is that this > solution is better. In particular, copying data out of the relcache > rather than expecting the relcache to hold still over long periods > is the way we've done things everywhere else, cf RelationGetIndexList, > RelationGetStatExtList, RelationGetIndexExpressions, > RelationGetIndexPredicate, RelationGetIndexAttrBitmap, > RelationGetExclusionInfo, GetRelationPublicationActions. I don't care > for a patch randomly deciding to do things differently on the basis of an > unsupported-by-evidence argument that it might cost too much to copy the > data. If we're going to make a push to reduce the amount of copying of > that sort that we do, it should be a separately (and carefully) designed > thing that applies to all the relcache substructures that have the issue, > not one-off hacks that haven't been reviewed thoroughly. That's not an unreasonable argument. On the other hand, if you never try new stuff, you lose opportunities to explore things that might turn out to be better and worth adopting more widely. I am not very convinced that it makes sense to lump something like RelationGetIndexAttrBitmap in with something like RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a single Bitmapset, whereas the data structure RelationGetPartitionDesc is vastly larger and more complex. You can't say "well, if it's best to copy 32 bytes of data out of the relcache every time we need it, it must also be right to copy 10k or 100k of data out of the relcache every time we need it." There is another difference as well: there's a good chance that somebody is going to want to mutate a Bitmapset, whereas they had BETTER NOT think that they can mutate the PartitionDesc. So returning an uncopied Bitmapset is kinda risky in a way that returning an uncopied PartitionDesc is not. If we want an at-least-somewhat unified solution here, I think we need to bite the bullet and make the planner hold a reference to the relcache throughout planning. (The executor already keeps it open, I believe.) Otherwise, how is the relcache supposed to know when it can throw stuff away and when it can't? The only alternative seems to be to have each subsystem hold its own reference count, as I did with the PartitionDirectory stuff, which is not something we'd want to scale out. > I especially don't care for the much-less-than-half-baked kluge of > chaining the old rd_pdcxt onto the new one and hoping that it will go away > at a suitable time. It will go away at a suitable time, but maybe not at the soonest suitable time. It wouldn't be hard to improve that, though. The easiest thing to do, I think, would be to have an rd_oldpdcxt and stuff the old context there. If there already is one there, parent the new one under it. 
When RelationDecrementReferenceCount reduces the reference count to zero, destroy anything found in rd_oldpdcxt. With that change, things get destroyed at the earliest time at which we know the old things aren't referenced, instead of the earliest time at which they are not referenced + an invalidation arrives. > regression=# create table parent (a text, b int) partition by list (a); > CREATE TABLE > regression=# create table child (a text, b int); > CREATE TABLE > regression=# do $$ > regression$# begin > regression$# for i in 1..10000000 loop > regression$# alter table parent attach partition child for values in ('x'); > regression$# alter table parent detach partition child; > regression$# end loop; > regression$# end $$; Interesting example. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
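To make that rd_oldpdcxt idea concrete, here is a minimal sketch of what the two touch points might look like; the field name and the exact placement are assumptions for illustration, not code from any posted patch:

    /* In the partdesc rebuild path, instead of chaining onto rd_pdcxt: */
    if (rel->rd_pdcxt != NULL)
    {
        if (rel->rd_oldpdcxt != NULL)
            MemoryContextSetParent(rel->rd_pdcxt, rel->rd_oldpdcxt);
        else
            rel->rd_oldpdcxt = rel->rd_pdcxt;   /* park the displaced context */
    }
    rel->rd_pdcxt = new_pdcxt;
    rel->rd_partdesc = new_partdesc;

    /* In RelationDecrementReferenceCount, once nobody holds a pin: */
    if (rel->rd_refcnt == 0 && rel->rd_oldpdcxt != NULL)
    {
        /* no caller can still be looking at the old partdescs */
        MemoryContextDelete(rel->rd_oldpdcxt);
        rel->rd_oldpdcxt = NULL;
    }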
On 2019/03/14 10:40, Amit Langote wrote: > On 2019/03/14 5:18, Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Wed, Mar 13, 2019 at 3:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> Meanwhile, who's going to take point on cleaning up rd_partcheck? >>>> I don't really understand this code well enough to know whether that >>>> can share one of the existing partitioning-related sub-contexts. >> >>> To your question, I think it probably can't share a context. Briefly, >>> rd_partkey can't change ever, except that when a partitioned relation >>> is in the process of being created it is briefly NULL; once it obtains >>> a value, that value cannot be changed. If you want to range-partition >>> a list-partitioned table or something like that, you have to drop the >>> table and create a new one. I think that's a perfectly acceptable >>> permanent limitation and I have no urge whatever to change it. >>> rd_partdesc changes when you attach or detach a child partition. >>> rd_partcheck is the reverse: it changes when you attach or detach this >>> partition to or from a parent. >> >> Got it. Yeah, it seems like the clearest and least bug-prone solution >> is for those to be in three separate sub-contexts. The only reason >> I was trying to avoid that was the angle of how to back-patch into 11. >> However, we can just add the additional context pointer field at the >> end of the Relation struct in v11, and that should be good enough to >> avoid ABI problems. > > Agree that rd_partcheck needs its own context as you have complained in > the past [1]. > > I think we'll need to back-patch this fix to PG 10 as well. I've attached > patches for all 3 branches. > > Thanks, > Amit > > [1] https://www.postgresql.org/message-id/22236.1523374067%40sss.pgh.pa.us Should I add this patch to Older Bugs [1]? Thanks, Amit [1] https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > Should I add this patch to Older Bugs [1]? Yeah, it's an open issue IMO. I think we've been focusing on getting as many feature patches done as we could during the CF, but now it's time to start mopping up problems like this one. regards, tom lane
On Mon, Apr 8, 2019 at 9:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > > Should I add this patch to Older Bugs [1]? > > Yeah, it's an open issue IMO. I think we've been focusing on getting > as many feature patches done as we could during the CF, but now it's > time to start mopping up problems like this one. Do you have any further thoughts based on my last response? Does anyone else wish to offer opinions? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019/04/08 22:59, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> Should I add this patch to Older Bugs [1]? > > Yeah, it's an open issue IMO. I think we've been focusing on getting > as many feature patches done as we could during the CF, but now it's > time to start mopping up problems like this one. Thanks, done. Regards, Amit
On 2019/03/16 6:41, Robert Haas wrote: > On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I agree that copying data isn't great. What I don't agree is that this >> solution is better. In particular, copying data out of the relcache >> rather than expecting the relcache to hold still over long periods >> is the way we've done things everywhere else, cf RelationGetIndexList, >> RelationGetStatExtList, RelationGetIndexExpressions, >> RelationGetIndexPredicate, RelationGetIndexAttrBitmap, >> RelationGetExclusionInfo, GetRelationPublicationActions. I don't care >> for a patch randomly deciding to do things differently on the basis of an >> unsupported-by-evidence argument that it might cost too much to copy the >> data. If we're going to make a push to reduce the amount of copying of >> that sort that we do, it should be a separately (and carefully) designed >> thing that applies to all the relcache substructures that have the issue, >> not one-off hacks that haven't been reviewed thoroughly. > > That's not an unreasonable argument. On the other hand, if you never > try new stuff, you lose opportunities to explore things that might > turn out to be better and worth adopting more widely. > > I am not very convinced that it makes sense to lump something like > RelationGetIndexAttrBitmap in with something like > RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a > single Bitmapset, whereas the data structure RelationGetPartitionDesc > is vastly larger and more complex. You can't say "well, if it's best > to copy 32 bytes of data out of the relcache every time we need it, it > must also be right to copy 10k or 100k of data out of the relcache > every time we need it." > > There is another difference as well: there's a good chance that > somebody is going to want to mutate a Bitmapset, whereas they had > BETTER NOT think that they can mutate the PartitionDesc. So returning > an uncopied Bitmapset is kinda risky in a way that returning an > uncopied PartitionDesc is not. > > If we want an at-least-somewhat unified solution here, I think we need > to bite the bullet and make the planner hold a reference to the > relcache throughout planning. (The executor already keeps it open, I > believe.) Otherwise, how is the relcache supposed to know when it can > throw stuff away and when it can't? The only alternative seems to be > to have each subsystem hold its own reference count, as I did with the > PartitionDirectory stuff, which is not something we'd want to scale > out. Fwiw, I'd like to vote for planner holding the relcache reference open throughout planning. The planner could then reference the various substructures directly (using a non-copying accessor), except those that something in the planner might want to modify, in which case use the copying accessor. Thanks, Amit
On Mon, Apr 08, 2019 at 10:40:41AM -0400, Robert Haas wrote: > On Mon, Apr 8, 2019 at 9:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> Yeah, it's an open issue IMO. I think we've been focusing on getting >> as many feature patches done as we could during the CF, but now it's >> time to start mopping up problems like this one. Please note that it is registered as an older bug and not an open item. > Do you have any further thoughts based on my last response? So your last response is that: https://www.postgresql.org/message-id/CA+Tgmoa5rT+ZR+Vv+q1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw@mail.gmail.com And what are you proposing as patch? Perhaps something among those lines? https://www.postgresql.org/message-id/036852f2-ba7f-7a1f-21c6-00bc3515eda3@lab.ntt.co.jp > Does anyone else wish to offer opinions? It seems to me that Tom's argument to push in the way relcache information is handled by copying its contents sounds sensible to me. That's not perfect, but it is consistent with what exists (note PublicationActions for a rather-still-not-much-complex example of structure which gets copied from the relcache). -- Michael
Hi,

On 2019/04/10 15:42, Michael Paquier wrote:
> On Mon, Apr 08, 2019 at 10:40:41AM -0400, Robert Haas wrote:
>> On Mon, Apr 8, 2019 at 9:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>>> Yeah, it's an open issue IMO.  I think we've been focusing on getting
>>> as many feature patches done as we could during the CF, but now it's
>>> time to start mopping up problems like this one.
>
> Please note that it is registered as an older bug and not an open
> item.

The problem lies in all branches that have partitioning, so it should be
listed under Older Bugs, right?  You may have noticed that I posted
patches for all branches down to 10.

>> Do you have any further thoughts based on my last response?
>
> So your last response is that:
> https://www.postgresql.org/message-id/CA+Tgmoa5rT+ZR+Vv+q1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw@mail.gmail.com
> And what are you proposing as patch?  Perhaps something among those
> lines?
> https://www.postgresql.org/message-id/036852f2-ba7f-7a1f-21c6-00bc3515eda3@lab.ntt.co.jp

AFAIK, the patch there isn't meant to solve the problems discussed at the
first link.  It's meant to fix poor cache memory management of partition
constraint expression trees, which seems to be a separate issue from the
PartitionDesc memory management issue (the latter is the main topic of
this thread).  The problem with partition constraint expression trees was
only mentioned in passing in this thread [1], although it had also come
up in the past, as I said when posting the patch.

>> Does anyone else wish to offer opinions?
>
> It seems to me that Tom's argument to push in the way relcache
> information is handled by copying its contents sounds sensible to me.
> That's not perfect, but it is consistent with what exists (note
> PublicationActions for a rather-still-not-much-complex example of
> structure which gets copied from the relcache).

I gave my vote for direct access of unchangeable relcache substructures
(TupleDesc, PartitionDesc, etc.), because they're accessed during the
planning of every user query and fairly expensive to copy compared to the
list of indexes or PublicationActions that you're citing.  To further my
point a bit, I wonder why PublicationActions needs to be copied out of
the relcache at all?  Looking at its usage in execReplication.c, it seems
we can do fine without copying, because we are holding both a lock and a
relcache reference on the replication target relation during the entirety
of apply_handle_insert(), which means that information can't change under
us, neither logically nor physically.

Also, to reiterate what I think was one of Robert's points upthread [2],
the reason for requiring some code to copy the relcache substructures out
of the relcache should be that the caller might want to change their
contents; for example, the planner or its hooks may want to add/remove an
index to/from the list of indexes copied from the relcache.  The reason
for copying should not be that relcache content may change under us
despite holding a lock and a relcache reference.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/7961.1552498252%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/CA%2BTgmoa5rT%2BZR%2BVv%2Bq1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw%40mail.gmail.com
On Wed, Apr 10, 2019 at 05:03:21PM +0900, Amit Langote wrote: > The problem lies in all branches that have partitioning, so it should be > listed under Older Bugs, right? You may have noticed that I posted > patches for all branches down to 10. I have noticed. The message from Tom upthread outlined that an open item should be added, but this is not one. That's what I wanted to emphasize. Thanks for making it clearer. -- Michael
Michael Paquier <michael@paquier.xyz> writes: > On Wed, Apr 10, 2019 at 05:03:21PM +0900, Amit Langote wrote: >> The problem lies in all branches that have partitioning, so it should be >> listed under Older Bugs, right? You may have noticed that I posted >> patches for all branches down to 10. > I have noticed. The message from Tom upthread outlined that an open > item should be added, but this is not one. That's what I wanted to > emphasize. Thanks for making it clearer. To clarify a bit: there's more than one problem here. Amit added an open item about pre-existing leaks related to rd_partcheck. (I'm going to review and hopefully push his fix for that today.) However, there's a completely separate leak associated with mismanagement of rd_pdcxt, as I showed in [1], and it doesn't seem like we have consensus about what to do about that one. AFAIK that's a new bug in 12 (caused by 898e5e329) and so it ought to be in the main open-items list. This thread has discussed a bunch of possible future changes like trying to replace copying of relcache-provided data structures with reference-counting, but I don't think any such thing could be v12 material at this point. We do need to fix the newly added leak though. regards, tom lane [1] https://www.postgresql.org/message-id/10797.1552679128%40sss.pgh.pa.us
Robert Haas <robertmhaas@gmail.com> writes: > As a parenthetical note, I observe that relcache.c seems to know > almost nothing about rd_partcheck. rd_partkey and rd_partdesc both > have handling in RelationClearRelation(), but rd_partcheck does not, > and I suspect that's wrong. So the problems are probably not confined > to the relcache-drop-time problem that you observed. I concluded that that's not parenthetical but pretty relevant... Attached is a revised version of Amit's patch at [1] that makes these data structures be treated more similarly. I also added some Asserts and comment improvements to address the complaints I made upthread about under-documentation of all this logic. I also cleaned up the problem the code had with failing to distinguish "partcheck list is NIL" from "partcheck list hasn't been computed yet". For a relation with no such constraints, generate_partition_qual would do the full pushups every time. I'm not sure if the case actually occurs commonly enough that that's a performance problem, but failure to account for it made my added assertions fall over :-( and I thought fixing it was better than weakening the assertions. I haven't made back-patch versions yet. I'd expect they could be substantially the same, except the two new fields have to go at the end of struct RelationData to avoid ABI breaks. Oh: we might also need some change in RelationCacheInitializePhase3, depending on the decision about [2]. regards, tom lane [1] https://www.postgresql.org/message-id/036852f2-ba7f-7a1f-21c6-00bc3515eda3@lab.ntt.co.jp [2] https://www.postgresql.org/message-id/5706.1555093031@sss.pgh.pa.us diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c index e436d1e..4d6595b 100644 --- a/src/backend/partitioning/partdesc.c +++ b/src/backend/partitioning/partdesc.c @@ -49,10 +49,13 @@ typedef struct PartitionDirectoryEntry /* * RelationBuildPartitionDesc - * Form rel's partition descriptor + * Form rel's partition descriptor, and store in relcache entry * - * Not flushed from the cache by RelationClearRelation() unless changed because - * of addition or removal of partition. + * Note: the descriptor won't be flushed from the cache by + * RelationClearRelation() unless it's changed because of + * addition or removal of a partition. Hence, code holding a lock + * that's sufficient to prevent that can assume that rd_partdesc + * won't change underneath it. */ void RelationBuildPartitionDesc(Relation rel) @@ -173,7 +176,19 @@ RelationBuildPartitionDesc(Relation rel) ++i; } - /* Now build the actual relcache partition descriptor */ + /* Assert we aren't about to leak any old data structure */ + Assert(rel->rd_pdcxt == NULL); + Assert(rel->rd_partdesc == NULL); + + /* + * Now build the actual relcache partition descriptor. Note that the + * order of operations here is fairly critical. If we fail partway + * through this code, we won't have leaked memory because the rd_pdcxt is + * attached to the relcache entry immediately, so it'll be freed whenever + * the entry is rebuilt or destroyed. However, we don't assign to + * rd_partdesc until the cached data structure is fully complete and + * valid, so that no other code might try to use it. 
+ */ rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext, "partition descriptor", ALLOCSET_SMALL_SIZES); diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c index 8f43d68..29b4a76 100644 --- a/src/backend/utils/cache/partcache.c +++ b/src/backend/utils/cache/partcache.c @@ -41,11 +41,11 @@ static List *generate_partition_qual(Relation rel); /* * RelationBuildPartitionKey - * Build and attach to relcache partition key data of relation + * Build partition key data of relation, and attach to relcache * * Partitioning key data is a complex structure; to avoid complicated logic to * free individual elements whenever the relcache entry is flushed, we give it - * its own memory context, child of CacheMemoryContext, which can easily be + * its own memory context, a child of CacheMemoryContext, which can easily be * deleted on its own. To avoid leaking memory in that context in case of an * error partway through this function, the context is initially created as a * child of CurTransactionContext and only re-parented to CacheMemoryContext @@ -144,6 +144,7 @@ RelationBuildPartitionKey(Relation relation) MemoryContextSwitchTo(oldcxt); } + /* Allocate assorted arrays in the partkeycxt, which we'll fill below */ oldcxt = MemoryContextSwitchTo(partkeycxt); key->partattrs = (AttrNumber *) palloc0(key->partnatts * sizeof(AttrNumber)); key->partopfamily = (Oid *) palloc0(key->partnatts * sizeof(Oid)); @@ -151,8 +152,6 @@ RelationBuildPartitionKey(Relation relation) key->partsupfunc = (FmgrInfo *) palloc0(key->partnatts * sizeof(FmgrInfo)); key->partcollation = (Oid *) palloc0(key->partnatts * sizeof(Oid)); - - /* Gather type and collation info as well */ key->parttypid = (Oid *) palloc0(key->partnatts * sizeof(Oid)); key->parttypmod = (int32 *) palloc0(key->partnatts * sizeof(int32)); key->parttyplen = (int16 *) palloc0(key->partnatts * sizeof(int16)); @@ -235,6 +234,10 @@ RelationBuildPartitionKey(Relation relation) ReleaseSysCache(tuple); + /* Assert that we're not leaking any old data during assignments below */ + Assert(relation->rd_partkeycxt == NULL); + Assert(relation->rd_partkey == NULL); + /* * Success --- reparent our context and make the relcache point to the * newly constructed key @@ -305,11 +308,9 @@ get_partition_qual_relid(Oid relid) * Generate partition predicate from rel's partition bound expression. The * function returns a NIL list if there is no predicate. * - * Result expression tree is stored CacheMemoryContext to ensure it survives - * as long as the relcache entry. But we should be running in a less long-lived - * working context. To avoid leaking cache memory if this routine fails partway - * through, we build in working memory and then copy the completed structure - * into cache memory. + * We cache a copy of the result in the relcache entry, after constructing + * it using the caller's context. This approach avoids leaking any data + * into long-lived cache contexts, especially if we fail partway through. 
*/ static List * generate_partition_qual(Relation rel) @@ -326,8 +327,8 @@ generate_partition_qual(Relation rel) /* Guard against stack overflow due to overly deep partition tree */ check_stack_depth(); - /* Quick copy */ - if (rel->rd_partcheck != NIL) + /* If we already cached the result, just return a copy */ + if (rel->rd_partcheckvalid) return copyObject(rel->rd_partcheck); /* Grab at least an AccessShareLock on the parent table */ @@ -373,13 +374,36 @@ generate_partition_qual(Relation rel) if (found_whole_row) elog(ERROR, "unexpected whole-row reference found in partition key"); - /* Save a copy in the relcache */ - oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - rel->rd_partcheck = copyObject(result); - MemoryContextSwitchTo(oldcxt); + /* Assert that we're not leaking any old data during assignments below */ + Assert(rel->rd_partcheckcxt == NULL); + Assert(rel->rd_partcheck == NIL); + + /* + * Save a copy in the relcache. The order of these operations is fairly + * critical to avoid memory leaks and ensure that we don't leave a corrupt + * relcache entry if we fail partway through copyObject. + * + * If, as is definitely possible, the partcheck list is NIL, then we do + * not need to make a context to hold it. + */ + if (result != NIL) + { + rel->rd_partcheckcxt = AllocSetContextCreate(CacheMemoryContext, + "partition constraint", + ALLOCSET_SMALL_SIZES); + MemoryContextCopyAndSetIdentifier(rel->rd_partcheckcxt, + RelationGetRelationName(rel)); + oldcxt = MemoryContextSwitchTo(rel->rd_partcheckcxt); + rel->rd_partcheck = copyObject(result); + MemoryContextSwitchTo(oldcxt); + } + else + rel->rd_partcheck = NIL; + rel->rd_partcheckvalid = true; /* Keep the parent locked until commit */ relation_close(parent, NoLock); + /* Return the working copy to the caller */ return result; } diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 64f3c2e..4b884d5 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -1175,11 +1175,15 @@ RelationBuildDesc(Oid targetRelId, bool insertIt) } else { - relation->rd_partkeycxt = NULL; relation->rd_partkey = NULL; + relation->rd_partkeycxt = NULL; relation->rd_partdesc = NULL; relation->rd_pdcxt = NULL; } + /* ... but partcheck is not loaded till asked for */ + relation->rd_partcheck = NIL; + relation->rd_partcheckvalid = false; + relation->rd_partcheckcxt = NULL; /* * initialize access method information @@ -2364,8 +2368,8 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc) MemoryContextDelete(relation->rd_partkeycxt); if (relation->rd_pdcxt) MemoryContextDelete(relation->rd_pdcxt); - if (relation->rd_partcheck) - pfree(relation->rd_partcheck); + if (relation->rd_partcheckcxt) + MemoryContextDelete(relation->rd_partcheckcxt); if (relation->rd_fdwroutine) pfree(relation->rd_fdwroutine); pfree(relation); @@ -5600,18 +5604,20 @@ load_relcache_init_file(bool shared) * format is complex and subject to change). They must be rebuilt if * needed by RelationCacheInitializePhase3. This is not expected to * be a big performance hit since few system catalogs have such. Ditto - * for RLS policy data, index expressions, predicates, exclusion info, - * and FDW info. + * for RLS policy data, partition info, index expressions, predicates, + * exclusion info, and FDW info. 
*/ rel->rd_rules = NULL; rel->rd_rulescxt = NULL; rel->trigdesc = NULL; rel->rd_rsdesc = NULL; - rel->rd_partkeycxt = NULL; rel->rd_partkey = NULL; - rel->rd_pdcxt = NULL; + rel->rd_partkeycxt = NULL; rel->rd_partdesc = NULL; + rel->rd_pdcxt = NULL; rel->rd_partcheck = NIL; + rel->rd_partcheckvalid = false; + rel->rd_partcheckcxt = NULL; rel->rd_indexprs = NIL; rel->rd_indpred = NIL; rel->rd_exclops = NULL; diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h index 764e6fa..51ad250 100644 --- a/src/include/utils/rel.h +++ b/src/include/utils/rel.h @@ -95,11 +95,13 @@ typedef struct RelationData List *rd_fkeylist; /* list of ForeignKeyCacheInfo (see below) */ bool rd_fkeyvalid; /* true if list has been computed */ - MemoryContext rd_partkeycxt; /* private memory cxt for the below */ struct PartitionKeyData *rd_partkey; /* partition key, or NULL */ - MemoryContext rd_pdcxt; /* private context for partdesc */ + MemoryContext rd_partkeycxt; /* private context for rd_partkey, if any */ struct PartitionDescData *rd_partdesc; /* partitions, or NULL */ + MemoryContext rd_pdcxt; /* private context for partdesc, if any */ List *rd_partcheck; /* partition CHECK quals */ + bool rd_partcheckvalid; /* true if list has been computed */ + MemoryContext rd_partcheckcxt; /* private cxt for rd_partcheck, if any */ /* data managed by RelationGetIndexList: */ List *rd_indexlist; /* list of OIDs of indexes on relation */
Thanks for the updated patch. On Sat, Apr 13, 2019 at 4:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > As a parenthetical note, I observe that relcache.c seems to know > > almost nothing about rd_partcheck. rd_partkey and rd_partdesc both > > have handling in RelationClearRelation(), but rd_partcheck does not, > > and I suspect that's wrong. So the problems are probably not confined > > to the relcache-drop-time problem that you observed. > > I concluded that that's not parenthetical but pretty relevant... Hmm, do you mean we should grow the same "keep_" logic for rd_partcheck as we have for rd_partkey and rd_partdesc? I don't see it in the updated patch though. > Attached is a revised version of Amit's patch at [1] that makes these > data structures be treated more similarly. I also added some Asserts > and comment improvements to address the complaints I made upthread about > under-documentation of all this logic. Thanks for all the improvements. > I also cleaned up the problem the code had with failing to distinguish > "partcheck list is NIL" from "partcheck list hasn't been computed yet". > For a relation with no such constraints, generate_partition_qual would do > the full pushups every time. Actually, callers must have checked that the table is a partition (relispartition). It wouldn't be a bad idea to add an Assert or elog in generate_partition_qual. > I'm not sure if the case actually occurs > commonly enough that that's a performance problem, but failure to account > for it made my added assertions fall over :-( and I thought fixing it > was better than weakening the assertions. Hmm, I wonder why the Asserts failed given what I said above. > I haven't made back-patch versions yet. I'd expect they could be > substantially the same, except the two new fields have to go at the > end of struct RelationData to avoid ABI breaks. To save you the pain of finding the right files to patch in back-branches, I made those (attached). Thanks, Amit
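As a concrete (hypothetical) version of that suggestion, the guard could be as simple as the following near the top of generate_partition_qual(); this is only a sketch, not part of the posted patches:

    static List *
    generate_partition_qual(Relation rel)
    {
        ...
        /* Callers are supposed to have verified this already */
        Assert(rel->rd_rel->relispartition);
        ...
    }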
Amit Langote <amitlangote09@gmail.com> writes: > On Sat, Apr 13, 2019 at 4:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I concluded that that's not parenthetical but pretty relevant... > Hmm, do you mean we should grow the same "keep_" logic for > rd_partcheck as we have for rd_partkey and rd_partdesc? I don't see > it in the updated patch though. No, the "keep_" stuff is only necessary when we're trying to preserve data structures in-place, which is only important if non-relcache code might be using pointers to it. Since rd_partcheck is never directly accessed by external code, only copied, there can't be any live pointers to it to worry about. Besides which, since it's load on demand rather than something that RelationBuildDesc forces to be valid immediately, any comparison would be guaranteed to fail: the field in the new reldesc will always be empty at this point. Perhaps there's an argument that it should be load-immediately not load-on-demand, but that would be an optimization not a bug fix, and I'm skeptical that it'd be an improvement anyway. Probably this is something to revisit whenever somebody gets around to addressing the whole copy-vs-dont-copy-vs-use-a-refcount business that we were handwaving about upthread. >> I also cleaned up the problem the code had with failing to distinguish >> "partcheck list is NIL" from "partcheck list hasn't been computed yet". >> For a relation with no such constraints, generate_partition_qual would do >> the full pushups every time. > Actually, callers must have checked that the table is a partition > (relispartition). That does not mean that it's guaranteed to have any partcheck constraints; there are counterexamples in the regression tests. It looks like the main case is a LIST-partitioned table that has only a default partition. regards, tom lane
Amit Langote <amitlangote09@gmail.com> writes:
> To save you the pain of finding the right files to patch in
> back-branches, I made those (attached).

BTW, as far as that goes: I'm of the opinion that the partitioning
logic's factorization in this area is pretty awful, and v12 has made it
worse, not better.  It's important IMO that there be a clear distinction
between code that belongs to/can manipulate the relcache, and outside
code that ought at most to examine it (and maybe not even that, depending
on where we come down on this copy-vs-refcount business).  Maybe allowing
utils/cache/partcache.c to be effectively halfway inside the relcache
module is tolerable, but I don't think it's great design.  Allowing files
over in partitioning/ to also have functions inside that boundary is
really not good, especially when you can't even say that all of
partdesc.c is part of relcache.  I'm seriously inclined to put
RelationBuildPartitionDesc back where it was in v11.

			regards, tom lane
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2019/04/10 15:42, Michael Paquier wrote: >> It seems to me that Tom's argument to push in the way relcache >> information is handled by copying its contents sounds sensible to me. >> That's not perfect, but it is consistent with what exists (note >> PublicationActions for a rather-still-not-much-complex example of >> structure which gets copied from the relcache). > I gave my vote for direct access of unchangeable relcache substructures > (TupleDesc, PartitionDesc, etc.), because they're accessed during the > planning of every user query and fairly expensive to copy compared to list > of indexes or PublicationActions that you're citing. To further my point > a bit, I wonder why PublicationActions needs to be copied out of relcache > at all? Looking at its usage in execReplication.c, it seems we can do > fine without copying, because we are holding both a lock and a relcache > reference on the replication target relation during the entirety of > apply_handle_insert(), which means that information can't change under us, > neither logically nor physically. So the point here is that that reasoning is faulty. You *cannot* assume, no matter how strong a lock or how many pins you hold, that a relcache entry will not get rebuilt underneath you. Cache flushes happen regardless. And unless relcache.c takes special measures to prevent it, a rebuild will result in moving subsidiary data structures and thereby breaking any pointers you may have pointing into those data structures. For certain subsidiary structures such as the relation tupdesc, we do take such special measures: that's what the "keep_xxx" dance in RelationClearRelation is. However, that's expensive, both in cycles and maintenance effort: it requires having code that can decide equality of the subsidiary data structures, which we might well have no other use for, and which we certainly don't have strong tests for correctness of. It's also very error-prone for callers, because there isn't any good way to cross-check that code using a long-lived pointer to a subsidiary structure is holding a lock that's strong enough to guarantee non-mutation of that structure, or even that relcache.c provides any such guarantee at all. (If our periodic attempts to reduce lock strength for assorted DDL operations don't scare the pants off you in this connection, you have not thought hard enough about it.) So I think that even though we've largely gotten away with this approach so far, it's also a half-baked kluge that we should be looking to get rid of, not extend to yet more cases. To my mind there are only two trustworthy solutions to the problem of wanting time-extended usage of a relcache subsidiary data structure: one is to copy it, and the other is to reference-count it. I think that going over to a reference-count-based approach for many of these structures might well be something we should do in future, maybe even the very near future. In the mean time, though, I'm not really satisfied with inserting half-baked kluges, especially not ones that are different from our other half-baked kluges for similar purposes. I think that's a path to creating hard-to-reproduce bugs. regards, tom lane
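To illustrate the reference-counting alternative described above, a bare-bones sketch might look like the following; the type and function names are invented for illustration and are not proposed code:

    typedef struct RefCountedPartitionDesc
    {
        int             refcount;   /* number of users holding a pointer */
        MemoryContext   context;    /* the struct and everything reachable
                                     * from partdesc are allocated here */
        PartitionDesc   partdesc;
    } RefCountedPartitionDesc;

    static void
    PartitionDescPin(RefCountedPartitionDesc *rpd)
    {
        rpd->refcount++;
    }

    static void
    PartitionDescRelease(RefCountedPartitionDesc *rpd)
    {
        Assert(rpd->refcount > 0);
        if (--rpd->refcount == 0)
            MemoryContextDelete(rpd->context);  /* frees everything at once */
    }

The relcache would hold one reference of its own and drop it on invalidation, so the data survives exactly as long as some caller is still using it, without any copying; similar in spirit to how the typcache refcounts its cached tuple descriptors.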
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I agree that copying data isn't great. What I don't agree is that this >> solution is better. > I am not very convinced that it makes sense to lump something like > RelationGetIndexAttrBitmap in with something like > RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a > single Bitmapset, whereas the data structure RelationGetPartitionDesc > is vastly larger and more complex. You can't say "well, if it's best > to copy 32 bytes of data out of the relcache every time we need it, it > must also be right to copy 10k or 100k of data out of the relcache > every time we need it." I did not say that. What I did say is that they're both correct solutions. Copying large data structures is clearly a potential performance problem, but that doesn't mean we can take correctness shortcuts. > If we want an at-least-somewhat unified solution here, I think we need > to bite the bullet and make the planner hold a reference to the > relcache throughout planning. (The executor already keeps it open, I > believe.) Otherwise, how is the relcache supposed to know when it can > throw stuff away and when it can't? The real problem with that is that we *still* won't know whether it's okay to throw stuff away or not. The issue with these subsidiary data structures is exactly that we're trying to reduce the lock strength required for changing one of them, and as soon as you make that lock strength less than AEL, you have the problem that that value may need to change in relcache entries despite them being pinned. The code I'm complaining about is trying to devise a shortcut solution to exactly that situation ... and it's not a good shortcut. > The only alternative seems to be to have each subsystem hold its own > reference count, as I did with the PartitionDirectory stuff, which is > not something we'd want to scale out. Well, we clearly don't want to devise a separate solution for every subsystem, but it doesn't seem out of the question to me that we could build a "reference counted cache sub-structure" mechanism and apply it to each of these relcache fields. Maybe it could be unified with the existing code in the typcache that does a similar thing. Sure, this'd be a fair amount of work, but we've done it before. Syscache entries didn't use to have reference counts, for example. BTW, the problem I have with the PartitionDirectory stuff is exactly that it *isn't* a reference-counted solution. If it were, we'd not have this problem of not knowing when we could free rd_pdcxt. >> I especially don't care for the much-less-than-half-baked kluge of >> chaining the old rd_pdcxt onto the new one and hoping that it will go away >> at a suitable time. > It will go away at a suitable time, but maybe not at the soonest > suitable time. It wouldn't be hard to improve that, though. The > easiest thing to do, I think, would be to have an rd_oldpdcxt and > stuff the old context there. If there already is one there, parent > the new one under it. When RelationDecrementReferenceCount reduces > the reference count to zero, destroy anything found in rd_oldpdcxt. Meh. While it seems likely that that would mask most practical problems, it still seems like covering up a wart with a dirty bandage. In particular, that fix doesn't fix anything unless relcache reference counts go to zero pretty quickly --- which I'll just note is directly contrary to your enthusiasm for holding relcache pins longer. 
I think that what we ought to do for v12 is have PartitionDirectory copy the data, and then in v13 work on creating real reference-count infrastructure that would allow eliminating the copy steps with full safety. The $64 question is whether that really would cause unacceptable performance problems. To look into that, I made the attached WIP patches. (These are functionally complete, but I didn't bother for instance with removing the hunk that 898e5e329 added to relcache.c, and the comments need work, etc.) The first one just changes the PartitionDirectory code to do that, and then the second one micro-optimizes partition_bounds_copy() to make it somewhat less expensive, mostly by collapsing lots of small palloc's into one big one. What I get for test cases like [1] is single-partition SELECT, hash partitioning: N tps, HEAD tps, patch 2 11426.243754 11448.615193 8 11254.833267 11374.278861 32 11288.329114 11371.942425 128 11222.329256 11185.845258 512 11001.177137 10572.917288 1024 10612.456470 9834.172965 4096 8819.110195 7021.864625 8192 7372.611355 5276.130161 single-partition SELECT, range partitioning: N tps, HEAD tps, patch 2 11037.855338 11153.595860 8 11085.218022 11019.132341 32 10994.348207 10935.719951 128 10884.417324 10532.685237 512 10635.583411 9578.108915 1024 10407.286414 8689.585136 4096 8361.463829 5139.084405 8192 7075.880701 3442.542768 Now certainly these numbers suggest that avoiding the copy could be worth our trouble, but these results are still several orders of magnitude better than where we were two weeks ago [2]. Plus, this is an extreme case that's not really representative of real-world usage, since the test tables have neither indexes nor any data. In practical situations the baseline would be lower and the dropoff less bad. So I don't feel bad about shipping v12 with these sorts of numbers and hoping for more improvement later. regards, tom lane [1] https://www.postgresql.org/message-id/3529.1554051867%40sss.pgh.pa.us [2] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512BAC60%40g01jpexmbkw24 diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c index 4d6595b..96129e7 100644 --- a/src/backend/partitioning/partdesc.c +++ b/src/backend/partitioning/partdesc.c @@ -43,7 +43,6 @@ typedef struct PartitionDirectoryData typedef struct PartitionDirectoryEntry { Oid reloid; - Relation rel; PartitionDesc pd; } PartitionDirectoryEntry; @@ -235,6 +234,34 @@ RelationBuildPartitionDesc(Relation rel) } /* + * copyPartitionDesc + * + * Should be somewhere else perhaps? + */ +static PartitionDesc +copyPartitionDesc(PartitionDesc src, PartitionKey key) +{ + PartitionDesc partdesc; + int nparts = src->nparts; + + partdesc = (PartitionDescData *) palloc0(sizeof(PartitionDescData)); + partdesc->nparts = nparts; + /* If there are no partitions, the rest of the partdesc can stay zero */ + if (nparts > 0) + { + partdesc->oids = (Oid *) palloc(nparts * sizeof(Oid)); + memcpy(partdesc->oids, src->oids, nparts * sizeof(Oid)); + + partdesc->is_leaf = (bool *) palloc(nparts * sizeof(bool)); + memcpy(partdesc->is_leaf, src->is_leaf, nparts * sizeof(bool)); + + partdesc->boundinfo = partition_bounds_copy(src->boundinfo, key); + } + + return partdesc; +} + +/* * CreatePartitionDirectory * Create a new partition directory object. */ @@ -265,7 +292,7 @@ CreatePartitionDirectory(MemoryContext mcxt) * * The purpose of this function is to ensure that we get the same * PartitionDesc for each relation every time we look it up. 
In the - * face of current DDL, different PartitionDescs may be constructed with + * face of concurrent DDL, different PartitionDescs may be constructed with * different views of the catalog state, but any single particular OID * will always get the same PartitionDesc for as long as the same * PartitionDirectory is used. @@ -281,13 +308,16 @@ PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel) if (!found) { /* - * We must keep a reference count on the relation so that the - * PartitionDesc to which we are pointing can't get destroyed. + * Currently, neither RelationGetPartitionDesc nor + * RelationGetPartitionKey can leak any memory; but if they start + * doing so, do those calls before switching into pdir_mcxt. */ - RelationIncrementReferenceCount(rel); - pde->rel = rel; - pde->pd = RelationGetPartitionDesc(rel); + MemoryContext oldcontext = MemoryContextSwitchTo(pdir->pdir_mcxt); + + pde->pd = copyPartitionDesc(RelationGetPartitionDesc(rel), + RelationGetPartitionKey(rel)); Assert(pde->pd != NULL); + MemoryContextSwitchTo(oldcontext); } return pde->pd; } @@ -296,17 +326,11 @@ PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel) * DestroyPartitionDirectory * Destroy a partition directory. * - * Release the reference counts we're holding. + * This is a no-op at present; caller is supposed to release the memory context. */ void DestroyPartitionDirectory(PartitionDirectory pdir) { - HASH_SEQ_STATUS status; - PartitionDirectoryEntry *pde; - - hash_seq_init(&status, pdir->pdir_hash); - while ((pde = hash_seq_search(&status)) != NULL) - RelationDecrementReferenceCount(pde->rel); } /* diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c index c8770cc..b7ddbe5 100644 --- a/src/backend/partitioning/partbounds.c +++ b/src/backend/partitioning/partbounds.c @@ -782,10 +782,13 @@ partition_bounds_copy(PartitionBoundInfo src, PartitionKey key) { PartitionBoundInfo dest; - int i; + bool hash_part = (key->strategy == PARTITION_STRATEGY_HASH); + Datum *alldatums; int ndatums; int partnatts; + int natts; int num_indexes; + int i; dest = (PartitionBoundInfo) palloc(sizeof(PartitionBoundInfoData)); @@ -793,65 +796,69 @@ partition_bounds_copy(PartitionBoundInfo src, ndatums = dest->ndatums = src->ndatums; partnatts = key->partnatts; - num_indexes = get_partition_bound_num_indexes(src); - /* List partitioned tables have only a single partition key. */ Assert(key->strategy != PARTITION_STRATEGY_LIST || partnatts == 1); - dest->datums = (Datum **) palloc(sizeof(Datum *) * ndatums); - if (src->kind != NULL) { + PartitionRangeDatumKind *allkinds; + dest->kind = (PartitionRangeDatumKind **) palloc(ndatums * sizeof(PartitionRangeDatumKind *)); + allkinds = (PartitionRangeDatumKind *) palloc(ndatums * partnatts * + sizeof(PartitionRangeDatumKind)); + for (i = 0; i < ndatums; i++) { - dest->kind[i] = (PartitionRangeDatumKind *) palloc(partnatts * - sizeof(PartitionRangeDatumKind)); + dest->kind[i] = allkinds; + allkinds += partnatts; memcpy(dest->kind[i], src->kind[i], - sizeof(PartitionRangeDatumKind) * key->partnatts); + sizeof(PartitionRangeDatumKind) * partnatts); } } else dest->kind = NULL; + dest->datums = (Datum **) palloc(sizeof(Datum *) * ndatums); + + /* + * For hash partitioning, datums arrays will have two elements - modulus + * and remainder. + */ + natts = hash_part ? 
2 : partnatts; + + alldatums = (Datum *) palloc(sizeof(Datum) * natts * ndatums); + for (i = 0; i < ndatums; i++) { - int j; + dest->datums[i] = alldatums; + alldatums += natts; /* - * For a corresponding to hash partition, datums array will have two - * elements - modulus and remainder. + * For hash partitioning, we know that both datums are present and + * pass-by-value (since they're int4s), so just memcpy the datum + * array. Otherwise we have to apply datumCopy. */ - bool hash_part = (key->strategy == PARTITION_STRATEGY_HASH); - int natts = hash_part ? 2 : partnatts; - - dest->datums[i] = (Datum *) palloc(sizeof(Datum) * natts); - - for (j = 0; j < natts; j++) + if (hash_part) { - bool byval; - int typlen; - - if (hash_part) - { - typlen = sizeof(int32); /* Always int4 */ - byval = true; /* int4 is pass-by-value */ - } - else + memcpy(dest->datums[i], src->datums[i], 2 * sizeof(Datum)); + } + else + { + for (int j = 0; j < natts; j++) { - byval = key->parttypbyval[j]; - typlen = key->parttyplen[j]; + if (dest->kind == NULL || + dest->kind[i][j] == PARTITION_RANGE_DATUM_VALUE) + dest->datums[i][j] = datumCopy(src->datums[i][j], + key->parttypbyval[j], + key->parttyplen[j]); } - - if (dest->kind == NULL || - dest->kind[i][j] == PARTITION_RANGE_DATUM_VALUE) - dest->datums[i][j] = datumCopy(src->datums[i][j], - byval, typlen); } } + num_indexes = get_partition_bound_num_indexes(src); + dest->indexes = (int *) palloc(sizeof(int) * num_indexes); memcpy(dest->indexes, src->indexes, sizeof(int) * num_indexes);
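Stripped to its essentials, the allocation change made by the second patch is the following (a paraphrase for illustration, using the same variable names, not the patch text itself):

    /* Before: one small palloc per datums row */
    for (i = 0; i < ndatums; i++)
        dest->datums[i] = (Datum *) palloc(natts * sizeof(Datum));

    /* After: a single big palloc, handed out slice by slice */
    alldatums = (Datum *) palloc(ndatums * natts * sizeof(Datum));
    for (i = 0; i < ndatums; i++)
    {
        dest->datums[i] = alldatums;
        alldatums += natts;
    }

The same trick is applied to the kind[] arrays, turning O(ndatums) allocator round trips into one.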
On 2019/04/14 0:53, Tom Lane wrote: > Amit Langote <amitlangote09@gmail.com> writes: >> On Sat, Apr 13, 2019 at 4:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I concluded that that's not parenthetical but pretty relevant... > >> Hmm, do you mean we should grow the same "keep_" logic for >> rd_partcheck as we have for rd_partkey and rd_partdesc? I don't see >> it in the updated patch though. > > No, the "keep_" stuff is only necessary when we're trying to preserve > data structures in-place, which is only important if non-relcache > code might be using pointers to it. Since rd_partcheck is never > directly accessed by external code, only copied, there can't be any > live pointers to it to worry about. Besides which, since it's load > on demand rather than something that RelationBuildDesc forces to be > valid immediately, any comparison would be guaranteed to fail: the > field in the new reldesc will always be empty at this point. Ah, that's right. It was just that you were replying to this: Robert wrote: > As a parenthetical note, I observe that relcache.c seems to know > almost nothing about rd_partcheck. rd_partkey and rd_partdesc both > have handling in RelationClearRelation(), but rd_partcheck does not, > and I suspect that's wrong. Maybe I just got confused. > Perhaps there's an argument that it should be load-immediately not > load-on-demand, but that would be an optimization not a bug fix, > and I'm skeptical that it'd be an improvement anyway. Makes sense. > Probably this is something to revisit whenever somebody gets around to > addressing the whole copy-vs-dont-copy-vs-use-a-refcount business that > we were handwaving about upthread. OK. >>> I also cleaned up the problem the code had with failing to distinguish >>> "partcheck list is NIL" from "partcheck list hasn't been computed yet". >>> For a relation with no such constraints, generate_partition_qual would do >>> the full pushups every time. > >> Actually, callers must have checked that the table is a partition >> (relispartition). > > That does not mean that it's guaranteed to have any partcheck constraints; > there are counterexamples in the regression tests. It looks like the main > case is a LIST-partitioned table that has only a default partition. Ah, yes. Actually, even a RANGE default partition that's the only partition of its parent has NIL partition constraint. Thanks, Amit
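For concreteness, the case in question can be produced like this (table names are made up):

    create table p (a int) partition by list (a);
    create table p_default partition of p default;
    -- p_default has relispartition set, but with no sibling partitions its
    -- partition constraint is empty, so its cached partcheck list is NIL.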
On 2019/04/15 2:38, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> On 2019/04/10 15:42, Michael Paquier wrote: >>> It seems to me that Tom's argument to push in the way relcache >>> information is handled by copying its contents sounds sensible to me. >>> That's not perfect, but it is consistent with what exists (note >>> PublicationActions for a rather-still-not-much-complex example of >>> structure which gets copied from the relcache). > >> I gave my vote for direct access of unchangeable relcache substructures >> (TupleDesc, PartitionDesc, etc.), because they're accessed during the >> planning of every user query and fairly expensive to copy compared to list >> of indexes or PublicationActions that you're citing. To further my point >> a bit, I wonder why PublicationActions needs to be copied out of relcache >> at all? Looking at its usage in execReplication.c, it seems we can do >> fine without copying, because we are holding both a lock and a relcache >> reference on the replication target relation during the entirety of >> apply_handle_insert(), which means that information can't change under us, >> neither logically nor physically. > > So the point here is that that reasoning is faulty. You *cannot* assume, > no matter how strong a lock or how many pins you hold, that a relcache > entry will not get rebuilt underneath you. Cache flushes happen > regardless. And unless relcache.c takes special measures to prevent it, > a rebuild will result in moving subsidiary data structures and thereby > breaking any pointers you may have pointing into those data structures. > > For certain subsidiary structures such as the relation tupdesc, > we do take such special measures: that's what the "keep_xxx" dance in > RelationClearRelation is. However, that's expensive, both in cycles > and maintenance effort: it requires having code that can decide equality > of the subsidiary data structures, which we might well have no other use > for, and which we certainly don't have strong tests for correctness of. > It's also very error-prone for callers, because there isn't any good way > to cross-check that code using a long-lived pointer to a subsidiary > structure is holding a lock that's strong enough to guarantee non-mutation > of that structure, or even that relcache.c provides any such guarantee > at all. (If our periodic attempts to reduce lock strength for assorted > DDL operations don't scare the pants off you in this connection, you have > not thought hard enough about it.) So I think that even though we've > largely gotten away with this approach so far, it's also a half-baked > kluge that we should be looking to get rid of, not extend to yet more > cases. Thanks for the explanation. I understand that simply having a lock and a nonzero refcount on a relation doesn't prevent someone else from changing it concurrently. I get that we want to get rid of the keep_* kludge in the long term, but is it wrong to think, for example, that having keep_partdesc today allows us today to keep the pointer to rd_partdesc as long as we're holding the relation open or refcnt on the whole relation such as with PartitionDirectory mechanism? > To my mind there are only two trustworthy solutions to the problem of > wanting time-extended usage of a relcache subsidiary data structure: one > is to copy it, and the other is to reference-count it. 
I think that going > over to a reference-count-based approach for many of these structures > might well be something we should do in future, maybe even the very near > future. In the mean time, though, I'm not really satisfied with inserting > half-baked kluges, especially not ones that are different from our other > half-baked kluges for similar purposes. I think that's a path to creating > hard-to-reproduce bugs. +1 to reference-count-based approach. I've occasionally wondered why there is both keep_tupdesc kludge and refcounting for table TupleDescs. I guess it's because *only* the TupleTableSlot mechanism in the executor uses TupleDesc pinning (that is, refcounting) and the rest of the sites depend on the shaky guarantee provided by keep_tupdesc. Thanks, Amit
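For readers not familiar with the mechanism being referred to, the existing TupleDesc pinning pattern looks roughly like this in use (the surrounding code is invented for illustration):

    TupleDesc   tupdesc = RelationGetDescr(rel);

    /* Pin the descriptor if we intend to keep the pointer across possible
     * relcache rebuilds; the refcount keeps it alive until released. */
    PinTupleDesc(tupdesc);

    /* ... use tupdesc over an extended period ... */

    ReleaseTupleDesc(tupdesc);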
On Mon, Apr 15, 2019 at 5:05 PM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2019/04/15 2:38, Tom Lane wrote: > > So the point here is that that reasoning is faulty. You *cannot* assume, > > no matter how strong a lock or how many pins you hold, that a relcache > > entry will not get rebuilt underneath you. Cache flushes happen > > regardless. And unless relcache.c takes special measures to prevent it, > > a rebuild will result in moving subsidiary data structures and thereby > > breaking any pointers you may have pointing into those data structures. > > > > For certain subsidiary structures such as the relation tupdesc, > > we do take such special measures: that's what the "keep_xxx" dance in > > RelationClearRelation is. However, that's expensive, both in cycles > > and maintenance effort: it requires having code that can decide equality > > of the subsidiary data structures, which we might well have no other use > > for, and which we certainly don't have strong tests for correctness of. > > It's also very error-prone for callers, because there isn't any good way > > to cross-check that code using a long-lived pointer to a subsidiary > > structure is holding a lock that's strong enough to guarantee non-mutation > > of that structure, or even that relcache.c provides any such guarantee > > at all. (If our periodic attempts to reduce lock strength for assorted > > DDL operations don't scare the pants off you in this connection, you have > > not thought hard enough about it.) So I think that even though we've > > largely gotten away with this approach so far, it's also a half-baked > > kluge that we should be looking to get rid of, not extend to yet more > > cases. > > Thanks for the explanation. > > I understand that simply having a lock and a nonzero refcount on a > relation doesn't prevent someone else from changing it concurrently. > > I get that we want to get rid of the keep_* kludge in the long term, but > is it wrong to think, for example, that having keep_partdesc today allows > us today to keep the pointer to rd_partdesc as long as we're holding the > relation open or refcnt on the whole relation such as with > PartitionDirectory mechanism? Ah, we're also trying to fix the memory leak caused by the current design of PartitionDirectory. AIUI, the design assumes that the leak would occur in fairly rare cases, but maybe not so? If partitions are frequently attached/detached concurrently (maybe won't be too uncommon if reduced lock levels encourages users) causing the PartitionDesc of a given relation changing all the time, then a planning session that's holding the PartitionDirectory containing that relation would leak as many PartitionDescs as there were concurrent changes, I guess. I see that you've proposed to change the PartitionDirectory design to copy PartitionDesc as way of keeping it around instead holding the relation open, but having to resort to that would be unfortunate. Thanks, Amit
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2019/04/15 2:38, Tom Lane wrote: >> To my mind there are only two trustworthy solutions to the problem of >> wanting time-extended usage of a relcache subsidiary data structure: one >> is to copy it, and the other is to reference-count it. I think that going >> over to a reference-count-based approach for many of these structures >> might well be something we should do in future, maybe even the very near >> future. In the mean time, though, I'm not really satisfied with inserting >> half-baked kluges, especially not ones that are different from our other >> half-baked kluges for similar purposes. I think that's a path to creating >> hard-to-reproduce bugs. > +1 to reference-count-based approach. > I've occasionally wondered why there is both keep_tupdesc kludge and > refcounting for table TupleDescs. I guess it's because *only* the > TupleTableSlot mechanism in the executor uses TupleDesc pinning (that is, > refcounting) and the rest of the sites depend on the shaky guarantee > provided by keep_tupdesc. The reason for that is simply that at the time we added TupleDesc refcounts, we didn't want to do the extra work of making all uses of relcache entries' tupdescs deal with refcounting; keep_tupdesc is certainly a kluge, but it works for an awful lot of callers. We'd have to go back and deal with that more honestly if we go down this path. I'm inclined to think we could still allow many call sites to not do an incr/decr-refcount dance as long as they're just fetching scalar values out of the relcache's tupdesc, and not keeping any pointer into it. However, it's hard to see how to enforce such a rule mechanically, so it might be impractically error-prone to allow that. regards, tom lane
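To spell out the distinction being drawn here, compare these two invented fragments:

    /* Likely safe without a refcount: read a scalar and keep no pointer */
    int         natts = RelationGetDescr(rel)->natts;

    /* Not safe without pinning or copying: retain a pointer into the tupdesc */
    Form_pg_attribute attr = TupleDescAttr(RelationGetDescr(rel), 0);
    /* if a cache flush rebuilds rel's entry and the old tupdesc is freed,
     * attr is now a dangling pointer */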
Amit Langote <amitlangote09@gmail.com> writes: >> I get that we want to get rid of the keep_* kludge in the long term, but >> is it wrong to think, for example, that having keep_partdesc today allows >> us today to keep the pointer to rd_partdesc as long as we're holding the >> relation open or refcnt on the whole relation such as with >> PartitionDirectory mechanism? Well, it's safe from the caller's standpoint as long as a suitable lock is being held, which is neither well-defined nor enforced in any way :-( > Ah, we're also trying to fix the memory leak caused by the current > design of PartitionDirectory. AIUI, the design assumes that the leak > would occur in fairly rare cases, but maybe not so? If partitions are > frequently attached/detached concurrently (maybe won't be too uncommon > if reduced lock levels encourages users) causing the PartitionDesc of > a given relation changing all the time, then a planning session that's > holding the PartitionDirectory containing that relation would leak as > many PartitionDescs as there were concurrent changes, I guess. We should get a relcache inval after a partdesc change, but the problem with the current code is that that will only result in freeing the old partdesc if the inval event is processed while the relcache entry has refcount zero. Otherwise the old rd_pdcxt is just shoved onto the context chain, where it could survive indefinitely. I'm not sure that this is really a huge problem in practice. The example I gave upthread shows that a partdesc-changing transaction's own internal invals do arrive during CommandCounterIncrement calls that occur while the relcache pin is held; but it seems a bit artificial to assume that one transaction would do a huge number of such changes. (Although, hm, maybe a single-transaction pg_restore run could have an issue.) Once out of the transaction, it's okay because we'll again invalidate the entry at the start of the next transaction, and then the refcount will be zero and we'll clean up. For other sessions it'd only happen if they saw the inval while already holding a pin on the partitioned table, which probably requires some unlucky timing; and that'd have to happen repeatedly to have a leak that amounts to anything. Still, though, I'm unhappy with the code as it stands. It's risky to assume that it has no unpleasant behaviors that we haven't spotted yet but will manifest after v12 is in the field. And I do not think that it represents a solid base to build on. (As an example, if we made any effort to get rid of the redundant extra inval events that occur post-transaction, we'd suddenly have a much worse problem here.) I'd rather go over to the copy-based solution for now, which *is* semantically sound, and accept that we still have more performance work to do. It's not like v12 isn't going to be light-years ahead of v11 in this area anyway. regards, tom lane
On Sun, Apr 14, 2019 at 3:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What I get for test cases like [1] is
>
> single-partition SELECT, hash partitioning:
>
>     N      tps, HEAD       tps, patch
>     2      11426.243754    11448.615193
>     8      11254.833267    11374.278861
>     32     11288.329114    11371.942425
>     128    11222.329256    11185.845258
>     512    11001.177137    10572.917288
>     1024   10612.456470     9834.172965
>     4096    8819.110195     7021.864625
>     8192    7372.611355     5276.130161
>
> single-partition SELECT, range partitioning:
>
>     N      tps, HEAD       tps, patch
>     2      11037.855338    11153.595860
>     8      11085.218022    11019.132341
>     32     10994.348207    10935.719951
>     128    10884.417324    10532.685237
>     512    10635.583411     9578.108915
>     1024   10407.286414     8689.585136
>     4096    8361.463829     5139.084405
>     8192    7075.880701     3442.542768

I have difficulty interpreting these results in any way other than as an endorsement of the approach that I took.  It seems like you're proposing to throw away what is really a pretty substantial amount of performance basically so that the code will look more like you think it should look.  But I dispute the idea that the current code is so bad that we need to do this.  I don't think that's the case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2019/04/15 4:29, Tom Lane wrote:
> I think that what we ought to do for v12 is have PartitionDirectory copy the data, and then in v13 work on creating real reference-count infrastructure that would allow eliminating the copy steps with full safety.  The $64 question is whether that really would cause unacceptable performance problems.  To look into that, I made the attached WIP patches.  (These are functionally complete, but I didn't bother for instance with removing the hunk that 898e5e329 added to relcache.c, and the comments need work, etc.)  The first one just changes the PartitionDirectory code to do that, and then the second one micro-optimizes partition_bounds_copy() to make it somewhat less expensive, mostly by collapsing lots of small palloc's into one big one.

Thanks for the patches.  The partition_bounds_copy() micro-optimization one looks good in any case.

> What I get for test cases like [1] is
>
> single-partition SELECT, hash partitioning:
>
>     N      tps, HEAD       tps, patch
>     2      11426.243754    11448.615193
>     8      11254.833267    11374.278861
>     32     11288.329114    11371.942425
>     128    11222.329256    11185.845258
>     512    11001.177137    10572.917288
>     1024   10612.456470     9834.172965
>     4096    8819.110195     7021.864625
>     8192    7372.611355     5276.130161
>
> single-partition SELECT, range partitioning:
>
>     N      tps, HEAD       tps, patch
>     2      11037.855338    11153.595860
>     8      11085.218022    11019.132341
>     32     10994.348207    10935.719951
>     128    10884.417324    10532.685237
>     512    10635.583411     9578.108915
>     1024   10407.286414     8689.585136
>     4096    8361.463829     5139.084405
>     8192    7075.880701     3442.542768
>
> Now certainly these numbers suggest that avoiding the copy could be worth our trouble, but these results are still several orders of magnitude better than where we were two weeks ago [2].  Plus, this is an extreme case that's not really representative of real-world usage, since the test tables have neither indexes nor any data.

I tested the copyPartitionDesc() patch and here are the results for single-partition SELECT using hash partitioning, where index on queries column, and N * 1000 rows inserted into the parent table before the test.  I've confirmed that the plan is always an Index Scan on the selected partition (in PG 11, it's under Append, but in HEAD there's no Append due to 8edd0e794).

    N      tps, HEAD      tps, patch     tps, PG 11
    2      3093.443043    3039.804101    2928.777570
    8      3024.545820    3064.333027    2372.738622
    32     3029.580531    3032.755266    1417.706212
    128    3019.359793    3032.726006     567.099745
    512    2948.639216    2986.987862      98.710664
    1024   2971.629939    2882.233026      41.720955
    4096   2680.703000    1937.988908       7.035816
    8192   2599.120308    2069.271274       3.635512

So, the TPS degrades by 14% when going from 128 partitions to 8192 partitions on HEAD, whereas it degrades by 31% with the patch.

Here are the numbers with no indexes defined on the tables, and no data inserted.

    N      tps, HEAD      tps, patch     tps, PG 11
    2      3498.247862    3463.695950    3110.314290
    8      3524.430780    3445.165206    2741.340770
    32     3476.427781    3427.400879    1645.602269
    128    3427.121901    3430.385433     651.586373
    512    3394.907072    3335.842183     182.201349
    1024   3454.050819    3274.266762      67.942075
    4096   3201.266380    2845.974556      12.320716
    8192   2955.850804    2413.723443       6.151703

Here, the TPS degrades by 13% when going from 128 partitions to 8192 partitions on HEAD, whereas it degrades by 29% with the patch.  So, the degradation caused by copying the bounds is almost the same in both cases.
Actually, even in the more realistic test with indexes and data, executing the plan is relatively faster than planning as the partition count grows, because the PartitionBoundInfo that the planner now copies grows bigger. Thanks, Amit
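As an aside on the second WIP patch quoted above: the "collapse lots of small palloc's into one big one" idea looks roughly like the following generic sketch (names here are illustrative; this is not the actual partition_bounds_copy() change, and by-reference datums would additionally need datumCopy()):

    #include "postgres.h"

    /*
     * Generic sketch of collapsing per-row allocations into one block:
     * instead of palloc'ing each row's Datum array separately, allocate
     * a single flat array and point each row into it.  Assumes by-value
     * datums for brevity.
     */
    static Datum **
    copy_bound_datums(Datum **src, int ndatums, int natts)
    {
        Datum     **dst = (Datum **) palloc(ndatums * sizeof(Datum *));
        Datum      *flat = (Datum *) palloc(ndatums * natts * sizeof(Datum));

        for (int i = 0; i < ndatums; i++)
        {
            dst[i] = flat + i * natts;
            memcpy(dst[i], src[i], natts * sizeof(Datum));
        }
        return dst;
    }

Fewer, larger allocations reduce both palloc overhead and per-chunk memory-context bookkeeping, which is where the copy cost shows up at high partition counts.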
On 2019/04/17 18:58, Amit Langote wrote: > where index on queries > column Oops, I meant "with an index on the queried column". Thanks, Amit
Hi,

The message I'm replying to is marked as an open item.  Robert, what do you think needs to be done here before release?  Could you summarize, so we can then see what others think?

Greetings,

Andres Freund
On Wed, May 1, 2019 at 11:59 AM Andres Freund <andres@anarazel.de> wrote:
> The message I'm replying to is marked as an open item.  Robert, what do you think needs to be done here before release?  Could you summarize, so we can then see what others think?

The original issue on this thread was that hyrax started running out of memory when it hadn't been doing so previously.  That happened because, for complicated reasons, commit 898e5e3290a72d288923260143930fb32036c00c resulted in the leak being hit lots of times in CLOBBER_CACHE_ALWAYS builds instead of just once.  Commits 2455ab48844c90419714e27eafd235a85de23232 and d3f48dfae42f9655425d1f58f396e495c7fb7812 fixed that problem.

In the email at issue, Tom complains about two things.  First, he complains about the fact that I try to arrange things so that relcache data remains valid for as long as required instead of just copying it.  Second, he complains about the fact that repeated ATTACH and DETACH PARTITION operations can produce a slow session-lifespan memory leak.

I think it's reasonable to fix the second issue for v12.  I am not sure how important it is, because (1) the leak is small, (2) it seems unlikely that anyone would execute enough ATTACH/DETACH PARTITION operations in one backend for the leak to amount to anything significant, and (3) if a relcache flush ever happens when you don't have the relation open, all of the "leaked" memory will be un-leaked.  However, I believe that we could fix it as follows.  First, invent rd_pdoldcxt and put the first old context there; if that pointer is already in use, then parent the new context under the old one.  Second, in RelationDecrementReferenceCount, if the refcount hits 0, nuke rd_pdoldcxt and set the pointer back to NULL.  With that change, you would only keep the old PartitionDesc around until the ref count hits 0, whereas at present, you keep the old PartitionDesc around until an invalidation happens while the ref count is 0.

I think the first issue is not v12 material.  Tom proposed to fix it by copying all the relevant data out of the relcache, but his own performance results show a pretty significant hit, and AFAICS he hasn't pointed to anything that's actually broken by the current approach.  What I think should be done is actually to generalize the approach I took in this patch, so that instead of the partition directory holding a reference count, the planner or executor holds one.  Then not only could people who want information about the PartitionDesc avoid copying stuff from the relcache, but maybe other things could as well.  I think that would be significantly better than continuing to double down on the current copy-everything approach, which really only works well in a world where a table can't have all that much metadata, which is clearly not true for PostgreSQL any more.  I'm not sure that Tom is actually opposed to this approach -- although I may have misunderstood his position -- but where we disagree, I think, is that I see what I did in this commit as a stepping-stone toward a better world, and he sees it as something that should be killed with fire until that better world has fully arrived.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
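A minimal sketch of the rd_pdoldcxt fix Robert outlines above, assuming the new RelationData field as proposed (this is an illustration of the idea, not the patch that was later posted):

    #include "postgres.h"
    #include "utils/memutils.h"
    #include "utils/rel.h"

    /*
     * Sketch of the proposal: when a rebuild produces a new PartitionDesc
     * while the entry is still referenced, park the old context under
     * rd_pdoldcxt (chaining further stale ones beneath it), and free the
     * whole chain once the reference count reaches zero.  rd_pdoldcxt is
     * the proposed field, not one that exists in RelationData today.
     */
    static void
    retire_old_partdesc_context(Relation relation, MemoryContext oldcxt)
    {
        if (relation->rd_pdoldcxt == NULL)
            relation->rd_pdoldcxt = oldcxt;         /* first stale context */
        else
            MemoryContextSetParent(oldcxt, relation->rd_pdoldcxt);
    }

    /* ...and in RelationDecrementReferenceCount(), roughly: */
    static void
    maybe_free_old_partdesc(Relation relation)
    {
        if (relation->rd_refcnt == 0 && relation->rd_pdoldcxt != NULL)
        {
            MemoryContextDelete(relation->rd_pdoldcxt);
            relation->rd_pdoldcxt = NULL;
        }
    }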
Hi, On 2019-05-01 13:09:07 -0400, Robert Haas wrote: > The original issue on this thread was that hyrax started running out > of memory when it hadn't been doing so previously. That happened > because, for complicated reasons, commit > 898e5e3290a72d288923260143930fb32036c00c resulted in the leak being > hit lots of times in CLOBBER_CACHE_ALWAYS builds instead of just once. > Commits 2455ab48844c90419714e27eafd235a85de23232 and > d3f48dfae42f9655425d1f58f396e495c7fb7812 fixed that problem. > > In the email at issue, Tom complains about two things. First, he > complains about the fact that I try to arrange things so that relcache > data remains valid for as long as required instead of just copying it. > Second, he complains about the fact repeated ATTACH and DETACH > PARTITION operations can produce a slow session-lifespan memory leak. > > I think it's reasonable to fix the second issue for v12. I am not > sure how important it is, because (1) the leak is small, (2) it seems > unlikely that anyone would execute enough ATTACH/DETACH PARTITION > operations in one backend for the leak to amount to anything > significant, and (3) if a relcache flush ever happens when you don't > have the relation open, all of the "leaked" memory will be un-leaked. > However, I believe that we could fix it as follows. First, invent > rd_pdoldcxt and put the first old context there; if that pointer is > already in use, then parent the new context under the old one. > Second, in RelationDecrementReferenceCount, if the refcount hits 0, > nuke rd_pdoldcxt and set the pointer back to NULL. With that change, > you would only keep the old PartitionDesc around until the ref count > hits 0, whereas at present, you keep the old PartitionDesc around > until an invalidation happens while the ref count is 0. That sounds roughly reasonably. Tom, any objections? I think it'd be good if we fixed this soon. > I think the first issue is not v12 material. Tom proposed to fix it > by copying all the relevant data out of the relcache, but his own > performance results show a pretty significant hit, Yea, the numbers didn't look very convincing. > and AFAICS he > hasn't pointed to anything that's actually broken by the current > approach. What I think should be done is actually generalize the > approach I took in this patch, so that instead of the partition > directory holding a reference count, the planner or executor holds > one. Then not only could people who want information about the > PartitionDesc avoid copying stuff from the relcache, but maybe other > things as well. I think that would be significantly better than > continuing to double-down on the current copy-everything approach, > which really only works well in a world where a table can't have all > that much metadata, which is clearly not true for PostgreSQL any more. > I'm not sure that Tom is actually opposed to this approach -- although > I may have misunderstood his position -- but where we disagree, I > think, is that I see what I did in this commit as a stepping-stone > toward a better world, and he sees it as something that should be > killed with fire until that better world has fully arrived. Tom, are you ok with deferring further work here for v13? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-05-01 13:09:07 -0400, Robert Haas wrote: >> In the email at issue, Tom complains about two things. First, he >> complains about the fact that I try to arrange things so that relcache >> data remains valid for as long as required instead of just copying it. >> Second, he complains about the fact repeated ATTACH and DETACH >> PARTITION operations can produce a slow session-lifespan memory leak. >> >> I think it's reasonable to fix the second issue for v12. I am not >> sure how important it is, because (1) the leak is small, (2) it seems >> unlikely that anyone would execute enough ATTACH/DETACH PARTITION >> operations in one backend for the leak to amount to anything >> significant, and (3) if a relcache flush ever happens when you don't >> have the relation open, all of the "leaked" memory will be un-leaked. Yeah, I did some additional testing that showed that it's pretty darn hard to get the leak to amount to anything. The test case that I originally posted did many DDLs in a single transaction, and it seems that that's actually essential to get a meaningful leak; as soon as you exit the transaction the leaked contexts will be recovered during sinval cleanup. >> However, I believe that we could fix it as follows. First, invent >> rd_pdoldcxt and put the first old context there; if that pointer is >> already in use, then parent the new context under the old one. >> Second, in RelationDecrementReferenceCount, if the refcount hits 0, >> nuke rd_pdoldcxt and set the pointer back to NULL. With that change, >> you would only keep the old PartitionDesc around until the ref count >> hits 0, whereas at present, you keep the old PartitionDesc around >> until an invalidation happens while the ref count is 0. > That sounds roughly reasonably. Tom, any objections? I think it'd be > good if we fixed this soon. My fundamental objection is that this kluge is ugly as sin. Adding more ugliness on top of it doesn't improve that; rather the opposite. > Tom, are you ok with deferring further work here for v13? Yeah, I think that that's really what we ought to do at this point. I'd like to see a new design here, but it's not happening for v12, and I don't want to add even more messiness that's predicated on a wrong design. regards, tom lane
On Fri, May 17, 2019 at 4:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah, I did some additional testing that showed that it's pretty darn > hard to get the leak to amount to anything. The test case that I > originally posted did many DDLs in a single transaction, and it > seems that that's actually essential to get a meaningful leak; as > soon as you exit the transaction the leaked contexts will be recovered > during sinval cleanup. My colleague Amul Sul rediscovered this same leak when he tried to attach lots of partitions to an existing partitioned table, all in the course of a single transaction. This seems a little less artificial than Tom's original reproducer, which involved attaching and detaching the same partition repeatedly. Here is a patch that tries to fix this, along the lines I previously suggested; Amul reports that it does work for him. I am OK to hold this for v13 if that's what people want, but I think it might be smarter to commit it to v12. Maybe it's not a big leak, but it seems easy enough to do better, so I think we should. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Tue, Jun 4, 2019 at 9:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 17, 2019 at 4:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Yeah, I did some additional testing that showed that it's pretty darn hard to get the leak to amount to anything.  The test case that I originally posted did many DDLs in a single transaction, and it seems that that's actually essential to get a meaningful leak; as soon as you exit the transaction the leaked contexts will be recovered during sinval cleanup.
>
> My colleague Amul Sul rediscovered this same leak when he tried to attach lots of partitions to an existing partitioned table, all in the course of a single transaction.  This seems a little less artificial than Tom's original reproducer, which involved attaching and detaching the same partition repeatedly.
>
> Here is a patch that tries to fix this, along the lines I previously suggested; Amul reports that it does work for him.

Thanks for the patch.

I noticed a crash with one of the scenarios that I think the patch is meant to address.  Let me describe the steps:

Attach gdb (or whatever) to session 1, in which we'll run a query that will scan at least two partitions, and set a breakpoint in expand_partitioned_rtentry().  Run the query, so the breakpoint will be hit.  Step through up to the start of the following loop in this function:

    i = -1;
    while ((i = bms_next_member(live_parts, i)) >= 0)
    {
        Oid         childOID = partdesc->oids[i];
        Relation    childrel;
        RangeTblEntry *childrte;
        Index       childRTindex;
        RelOptInfo *childrelinfo;

        /* Open rel, acquiring required locks */
        childrel = table_open(childOID, lockmode);

Note that 'partdesc' in the above code is from the partition directory.

Before stepping into the loop, start another session and attach a new partition.  On into the loop.  When the 1st partition is opened, its locking will result in RelationClearRelation() being called on the parent relation due to the partition being attached concurrently; this, I observed, is actually invoked a couple of times due to recursion.  The parent relation's rd_pdoldcxt will be set in this process, and it contains the aforementioned partition descriptor.

Before the loop moves on to process the 2nd partition, attach another partition in session 2.  When the 2nd partition is opened, RelationClearRelation() will run again and, in one of its recursive invocations, it destroys the rd_pdoldcxt that was set previously, taking the partition directory's partition descriptor with it.  Anything that tries to access it later crashes.

I couldn't figure out what to blame here, though -- the design of rd_pdoldcxt or the fact that RelationClearRelation() is invoked multiple times.  I will try to study it more closely tomorrow.

Thanks,
Amit
On Wed, Jun 5, 2019 at 8:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
> I noticed a crash with one of the scenarios that I think the patch is meant to address.  Let me describe the steps:
>
> Attach gdb (or whatever) to session 1, in which we'll run a query that will scan at least two partitions, and set a breakpoint in expand_partitioned_rtentry().  Run the query, so the breakpoint will be hit.  Step through up to the start of the following loop in this function:
>
>     i = -1;
>     while ((i = bms_next_member(live_parts, i)) >= 0)
>     {
>         Oid         childOID = partdesc->oids[i];
>         Relation    childrel;
>         RangeTblEntry *childrte;
>         Index       childRTindex;
>         RelOptInfo *childrelinfo;
>
>         /* Open rel, acquiring required locks */
>         childrel = table_open(childOID, lockmode);
>
> Note that 'partdesc' in the above code is from the partition directory.
>
> Before stepping into the loop, start another session and attach a new partition.  On into the loop.  When the 1st partition is opened, its locking will result in RelationClearRelation() being called on the parent relation due to the partition being attached concurrently; this, I observed, is actually invoked a couple of times due to recursion.  The parent relation's rd_pdoldcxt will be set in this process, and it contains the aforementioned partition descriptor.
>
> Before the loop moves on to process the 2nd partition, attach another partition in session 2.  When the 2nd partition is opened, RelationClearRelation() will run again and, in one of its recursive invocations, it destroys the rd_pdoldcxt that was set previously, taking the partition directory's partition descriptor with it.  Anything that tries to access it later crashes.
>
> I couldn't figure out what to blame here, though -- the design of rd_pdoldcxt or the fact that RelationClearRelation() is invoked multiple times.  I will try to study it more closely tomorrow.

On further study, I concluded that it's indeed the multiple invocations of RelationClearRelation() on the parent relation that cause rd_pdoldcxt to be destroyed prematurely.  While it's problematic that the session in which a new partition is attached to the parent relation broadcasts 2 SI messages to invalidate the parent relation [1], it's obviously better to fix how RelationClearRelation() manipulates rd_pdoldcxt so that duplicate SI messages are not harmful; fixing the latter is hardly an option.

Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch, which seems to fix this issue for me.  In the rare case where inval messages due to multiple concurrent attachments get processed in a session holding a reference on a partitioned table, there will be multiple "old" partition descriptors and corresponding "old" contexts.  All of the old contexts are chained together via context re-parenting, with the newest "old" context being accessible from the table's rd_pdoldcxt.

Thanks,
Amit

[1]: inval.c's AddRelcacheInvalidationMessage() does try to prevent duplicate messages from being emitted, but the logic to detect duplicates doesn't work across CommandCounterIncrement().  There are at least two relcache inval requests in the ATTACH PARTITION code path, emitted by SetRelationHasSubclass and StorePartitionBound, respectively, which are separated by at least one CCI, so the 2nd request isn't detected as a duplicate of the 1st.
Attachment
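Continuing the sketch shown after Robert's May 1 message (same assumptions, including the proposed rd_pdoldcxt field), the chaining Amit describes would put the newest stale context at rd_pdoldcxt and hang any previously retired contexts underneath it, so that a duplicate invalidation cannot free memory that is still reachable:

    /*
     * Illustration of the reworked chaining, not the attached patch:
     * the newest stale context is always the one reachable from
     * rd_pdoldcxt, and older retired contexts hang underneath it.
     */
    static void
    retire_old_partdesc_context(Relation relation, MemoryContext oldcxt)
    {
        if (relation->rd_pdoldcxt != NULL)
            MemoryContextSetParent(relation->rd_pdoldcxt, oldcxt);
        relation->rd_pdoldcxt = oldcxt;

        /* as before, the whole chain is deleted once rd_refcnt drops to 0 */
    }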
On Thu, Jun 6, 2019 at 3:47 PM Amit Langote <amitlangote09@gmail.com> wrote: > While it's > problematic that the session in which a new partition is attached to > the parent relation broadcasts 2 SI messages to invalidate the parent > relation [1], it's obviously better to fix how RelationClearRelation > manipulates rd_pdoldcxt so that duplicate SI messages are not harmful, > fixing the latter is hardly an option. Sorry, I had meant to say "fixing the former/the other is hardly an option". Thanks, Amit
On Thu, Jun 6, 2019 at 2:48 AM Amit Langote <amitlangote09@gmail.com> wrote: > Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch, > which seems to fix this issue for me. Yeah, that looks right. I think my patch was full of fuzzy thinking and inadequate testing; thanks for checking it over and coming up with the right solution. Anyone else want to look/comment? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 6, 2019 at 2:48 AM Amit Langote <amitlangote09@gmail.com> wrote: >> Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch, >> which seems to fix this issue for me. > Yeah, that looks right. I think my patch was full of fuzzy thinking > and inadequate testing; thanks for checking it over and coming up with > the right solution. > Anyone else want to look/comment? I think the existing code is horribly ugly and this is even worse. It adds cycles to RelationDecrementReferenceCount which is a hotspot that has no business dealing with this; the invariants are unclear; and there's no strong reason to think there aren't still cases where we accumulate lots of copies of old partition descriptors during a sequence of operations. Basically you're just doubling down on a wrong design. As I said upthread, my current inclination is to do nothing in this area for v12 and then try to replace the whole thing with proper reference counting in v13. I think the cases where we have a major leak are corner-case-ish enough that we can leave it as-is for one release. regards, tom lane
On Tue, Jun 11, 2019 at 1:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think the existing code is horribly ugly and this is even worse.  It adds cycles to RelationDecrementReferenceCount which is a hotspot that has no business dealing with this; the invariants are unclear; and there's no strong reason to think there aren't still cases where we accumulate lots of copies of old partition descriptors during a sequence of operations.  Basically you're just doubling down on a wrong design.

I don't understand why a function that decrements a reference count shouldn't be expected to free things if the reference count goes to zero.  Under what workload do you think adding this to RelationDecrementReferenceCount would cause a noticeable performance regression?

I think the change is responsive to your previous complaint that the timing of stuff getting freed is not very well pinned down.  With this change, it's much more tightly pinned down: it happens when the refcount goes to 0.  That is definitely not perfect, but I think that it is a lot easier to come up with scenarios where the leak accumulates because no cache flush happens while the refcount is 0 than it is to come up with scenarios where the refcount never reaches 0.  I agree that the latter type of scenario probably exists, but I don't think we've come up with one yet.

> As I said upthread, my current inclination is to do nothing in this area for v12 and then try to replace the whole thing with proper reference counting in v13.  I think the cases where we have a major leak are corner-case-ish enough that we can leave it as-is for one release.

Is this something you're planning to work on yourself?  Do you have a design in mind?  Is the idea to reference-count the PartitionDesc?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jun 11, 2019 at 1:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think the change is responsive to your previous complaint that the timing of stuff getting freed is not very well pinned down.  With this change, it's much more tightly pinned down: it happens when the refcount goes to 0.  That is definitely not perfect, but I think that it is a lot easier to come up with scenarios where the leak accumulates because no cache flush happens while the refcount is 0 than it is to come up with scenarios where the refcount never reaches 0.  I agree that the latter type of scenario probably exists, but I don't think we've come up with one yet.

I don't know why you think that's improbable, given that the changes around PartitionDirectory-s cause relcache entries to be held open much longer than before (something I've also objected to on this thread).

>> As I said upthread, my current inclination is to do nothing in this area for v12 and then try to replace the whole thing with proper reference counting in v13.  I think the cases where we have a major leak are corner-case-ish enough that we can leave it as-is for one release.

> Is this something you're planning to work on yourself?

Well, I'd rather farm it out to somebody else, but ...

> Do you have a design in mind?  Is the idea to reference-count the PartitionDesc?

What we discussed upthread was refcounting each of the various large sub-objects of relcache entries, not just the partdesc.  I think if we're going to go this way we should bite the bullet and fix them all.  I really want to burn down RememberToFreeTupleDescAtEOX() in particular ... it seems likely to me that that's also a source of unpleasant memory leaks.

regards, tom lane
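For a sense of what that longer-term direction might look like, here is a purely hypothetical sketch modeled on the existing TupleDesc refcounting; none of these function names exist today, and they are shown only to illustrate the shape of a pin/release API for PartitionDesc and the other large relcache sub-objects:

    #include "postgres.h"
    #include "partitioning/partdesc.h"
    #include "utils/rel.h"

    /*
     * Hypothetical API -- these functions do not exist.  The idea mirrors
     * the TupleDesc pin/release protocol: callers pin the sub-object, the
     * resource owner releases pins on error, and the memory is freed when
     * the last pin goes away, removing the need for both copying and the
     * "keep_*" kludges.
     */
    extern PartitionDesc PinPartitionDesc(Relation rel);
    extern void ReleasePartitionDesc(PartitionDesc partdesc);

    /* the usage pattern a planner/executor call site might follow */
    static void
    example_caller(Relation rel)
    {
        PartitionDesc partdesc = PinPartitionDesc(rel);

        /* ... use partdesc->nparts, partdesc->oids[], etc. ... */

        ReleasePartitionDesc(partdesc);
    }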