Thread: Relation extension scalability
Hello,

Currently bigger shared_buffers settings don't combine well with relations
being extended frequently. Especially if many/most pages have a high
usagecount and/or are dirty and the system is IO constrained.

As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is being called, to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized

The problems come from 4) and 5) potentially each taking a fair while. If the
working set mostly fits into shared_buffers, 4) can require iterating over
all shared buffers several times to find a victim buffer. If the IO subsystem
is busy and/or we've hit the kernel's dirty limits, 5) can take a couple of
seconds.

I've prototyped solving this for heap relations by moving the smgrnblocks() +
smgrextend() calls to RelationGetBufferForTuple(). With some care (including
a retry loop) it's possible to only do those two under the extension lock.
That indeed fixes problems in some of my tests.

I'm not sure whether the above is the best solution however. For one I think
it's not necessarily a good idea to opencode this in hio.c - I've not
observed it, but this probably can happen for btrees and such as well. For
another, this is still an exclusive lock while we're doing IO: smgrextend()
wants a page to write out, so we have to be careful not to overwrite things.

There's two things that seem to make sense to me:

First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW and
introduce a bufmgr function specifically for extension.

Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're worried that
somebody else can come along, read the page, and initialize it before us.
ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend(). If we instead only do the write once we've read in &
locked the page exclusively there's no need for the extension lock. We
probably still should write out the new page to the OS immediately once
we've initialized it, to avoid creating sparse files.

The other reason we need the extension lock is that code like
lazy_scan_heap() and btvacuumscan() tries to avoid initializing pages that
are about to be initialized by the extending backend. I think we should just
remove that code and deal with the problem by retrying in the extending
backend; that's why I think moving extension to a different file might be
helpful.

I've attached my POC for heap extension, but it's really just a POC at this
point.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
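[For orientation, a rough sketch of how those seven steps look from a
caller's point of view today. This is simplified and not the actual source:
FSM lookups, WAL logging and error handling are omitted, and the framing
function is illustrative only.]

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "storage/lmgr.h"
    #include "utils/rel.h"

    static Buffer
    extend_relation_sketch(Relation relation)
    {
        Buffer      buffer;
        Page        page;

        /* 1) serialize all extenders of this relation */
        LockRelationForExtension(relation, ExclusiveLock);

        /*
         * 2) P_NEW makes ReadBuffer() extend the relation: internally it
         * calls smgrnblocks() to pick the new block number (3), finds and
         * possibly cleans a victim buffer via BufferAlloc() (4, 5), and
         * writes out a zeroed page with smgrextend() (6) - all while the
         * extension lock is held.
         */
        buffer = ReadBuffer(relation, P_NEW);
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

        /* once the buffer is locked the extension lock can be released */
        UnlockRelationForExtension(relation, ExclusiveLock);

        /* 7) initialize the new page */
        page = BufferGetPage(buffer);
        PageInit(page, BufferGetPageSize(buffer), 0);

        return buffer;
    }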
Andres Freund <andres@2ndquadrant.com> writes:
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is being called, to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
>    block into
> 5) If dirty the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized

> The problems come from 4) and 5) potentially each taking a fair
> while.

Right, so basically we want to get those steps out of the exclusive lock
scope.

> There's two things that seem to make sense to me:

> First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
> and introduce a bufmgr function specifically for extension.

I think that removing P_NEW is likely to require a fair amount of
refactoring of calling code, so I'm not thrilled with doing that. On the
other hand, perhaps all that code would have to be touched anyway to modify
the scope over which the extension lock is held.

> Secondly I think we could maybe remove the requirement of needing an
> extension lock altogether. It's primarily required because we're
> worried that somebody else can come along, read the page, and
> initialize it before us. ISTM that could be resolved by *not* writing
> any data via smgrextend()/mdextend().

I'm afraid this would break stuff rather thoroughly, in particular handling
of out-of-disk-space cases. And I really don't see how you get consistent
behavior at all for multiple concurrent callers if there's no locking.

One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added". On the other hand,
that would then require reading in the block after allocating a buffer to
hold it (I don't think you can safely assume otherwise) so the added read
step might eat any savings.

			regards, tom lane
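[To make that suggested API change concrete, a hypothetical sketch follows.
smgrextend_zeroed() does not exist in PostgreSQL; it only illustrates the
"add a prezeroed block and return its number" idea, with the caller then
having to read the block back in as Tom notes.]

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/smgr.h"
    #include "utils/rel.h"

    /* hypothetical: extend by one zeroed block, return its block number */
    extern BlockNumber smgrextend_zeroed(SMgrRelation reln, ForkNumber forknum);

    /* a caller would extend first, and only then allocate a buffer for it */
    static Buffer
    get_extended_block(Relation rel)
    {
        BlockNumber blockno = smgrextend_zeroed(rel->rd_smgr, MAIN_FORKNUM);

        /* the extra read step mentioned above: the block must be read back */
        return ReadBuffer(rel, blockno);
    }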
On 2015-03-29 15:21:44 -0400, Tom Lane wrote:
> > There's two things that seem to make sense to me:
> >
> > First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
> > and introduce a bufmgr function specifically for extension.
>
> I think that removing P_NEW is likely to require a fair amount of
> refactoring of calling code, so I'm not thrilled with doing that.
> On the other hand, perhaps all that code would have to be touched
> anyway to modify the scope over which the extension lock is held.

It's not *that* many locations that need to extend relations. In my playing
around it seemed to me they all would need to be modified anyway, if we want
to remove/reduce the scope of extension locks, to deal with the fact that
somebody else could have started to use the buffer.

> > Secondly I think we could maybe remove the requirement of needing an
> > extension lock altogether. It's primarily required because we're
> > worried that somebody else can come along, read the page, and
> > initialize it before us. ISTM that could be resolved by *not* writing
> > any data via smgrextend()/mdextend().
>
> I'm afraid this would break stuff rather thoroughly, in particular
> handling of out-of-disk-space cases.

Hm. Not a bad point.

> And I really don't see how you get consistent behavior at all for
> multiple concurrent callers if there's no locking.

What I was thinking is something like this:

    while (true)
    {
        targetBuffer = AcquireFromFSMEquivalent();

        if (targetBuffer == InvalidBuffer)
            targetBuffer = ExtendRelation();

        LockBuffer(targetBuffer, BUFFER_LOCK_EXCLUSIVE);

        if (BufferHasEnoughSpace(targetBuffer))
            break;

        LockBuffer(targetBuffer, BUFFER_LOCK_UNLOCK);
    }

where ExtendRelation() would basically work like

    while (true)
    {
        targetBlock = (lseek(fd, BLCKSZ, SEEK_END) - BLCKSZ) / BLCKSZ;
        buffer = ReadBuffer(rel, targetBlock);
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        page = BufferGetPage(buffer);

        if (PageIsNew(page))
        {
            PageInit(page);
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
            FlushBuffer(buffer);
            break;
        }
        else
        {
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
            continue;
        }
    }

Obviously it's a tad more complex than that pseudocode, but I think that
basically should work. Except that it, as you say, leads to some oddities
with out of space handling. I think it should actually be ok, it might just
be confusing to the user.

I think we might be able to address those issues by not using the
lseek(SEEK_END) approach but instead

    fcntl(fd, F_SETFL, O_APPEND, 1);
    write(fd, pre_init_block, BLCKSZ);
    fcntl(fd, F_SETFL, O_APPEND, 0);
    newblock = (lseek(fd, 0, SEEK_CUR) - BLCKSZ) / BLCKSZ;

By using O_APPEND and a pre-initialized block we can be sure to write a
valid block at the end, and we shouldn't run afoul of any out of space
issues that we don't already have. Unfortunately I'm not sure whether fcntl
for O_APPEND is portable :(

> One idea that might help is to change smgrextend's API so that it doesn't
> need a buffer to write from, but just has an API of "add a prezeroed block
> on-disk and tell me the number of the block you added". On the other
> hand, that would then require reading in the block after allocating a
> buffer to hold it (I don't think you can safely assume otherwise) so the
> added read step might eat any savings.

Yea, I was thinking that as well. We simply could skip the reading step by
setting up the contents in the buffer manager without a read in this case...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2015-03-29 15:21:44 -0400, Tom Lane wrote:
>> One idea that might help is to change smgrextend's API so that it doesn't
>> need a buffer to write from, but just has an API of "add a prezeroed block
>> on-disk and tell me the number of the block you added". On the other
>> hand, that would then require reading in the block after allocating a
>> buffer to hold it (I don't think you can safely assume otherwise) so the
>> added read step might eat any savings.

> Yea, I was thinking that as well. We simply could skip the reading step
> by setting up the contents in the buffer manager without a read in this
> case...

No, you can't, at least not if the point is to not be holding any exclusive
lock by the time you go to talk to the buffer manager. There will be nothing
stopping some other backend from writing into that page of the file before
you can get hold of it. If the buffer they used to do the write has itself
gotten recycled, there is nothing left at all to tell you your page image is
out of date.

			regards, tom lane
On 2015-03-29 16:07:49 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > Yea, I was thinking that as well. We simply could skip the reading step
> > by setting up the contents in the buffer manager without a read in this
> > case...
>
> No, you can't, at least not if the point is to not be holding any
> exclusive lock by the time you go to talk to the buffer manager. There
> will be nothing stopping some other backend from writing into that page
> of the file before you can get hold of it. If the buffer they used to do
> the write has itself gotten recycled, there is nothing left at all to
> tell you your page image is out of date.

That's why I'd proposed restructuring things so that the actual
extension/write to the file only happens once we have the buffer manager's
exclusive lock on the individual buffer. While not trivial to implement, it
doesn't look prohibitively complex.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2015-03-29 20:02:06 -0400, Robert Haas wrote:
> On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The problems come from 4) and 5) potentially each taking a fair
> > while. If the working set mostly fits into shared_buffers, 4) can
> > require iterating over all shared buffers several times to find a
> > victim buffer. If the IO subsystem is busy and/or we've hit the
> > kernel's dirty limits, 5) can take a couple of seconds.
>
> Interesting. I had always assumed the bottleneck was waiting for the
> filesystem to extend the relation.

That might be the case sometimes, but it's not what I've actually observed
so far. I think most modern filesystems do preallocation, which resolves
this to some degree.

> > The other reason we need the extension lock is that code like
> > lazy_scan_heap() and btvacuumscan() tries to avoid initializing pages
> > that are about to be initialized by the extending backend. I think we
> > should just remove that code and deal with the problem by retrying in
> > the extending backend; that's why I think moving extension to a
> > different file might be helpful.
>
> I thought the primary reason we did this is because we wanted to
> write-and-fsync the block so that, if we're out of disk space, any
> attendant failure will happen before we put data into the block.

Well, we only write and register an fsync. Afaics we don't actually perform
the fsync at that point. I don't think having to do the fsync() necessarily
precludes removing the extension lock.

> Once we've initialized the block, a subsequent failure to write or
> fsync it will be hard to recover from;

At the very least the buffer shouldn't become dirty before we successfully
wrote it once, right. It seems quite doable to achieve that without the lock
though. We'll have to do the write without going through the buffer manager,
but that seems doable.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is being called, to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
>    block into
> 5) If dirty the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized
>
> The problems come from 4) and 5) potentially each taking a fair
> while. If the working set mostly fits into shared_buffers, 4) can
> require iterating over all shared buffers several times to find a
> victim buffer. If the IO subsystem is busy and/or we've hit the
> kernel's dirty limits, 5) can take a couple of seconds.

Interesting. I had always assumed the bottleneck was waiting for the
filesystem to extend the relation.

> Secondly I think we could maybe remove the requirement of needing an
> extension lock altogether. It's primarily required because we're worried
> that somebody else can come along, read the page, and initialize it
> before us. ISTM that could be resolved by *not* writing any data via
> smgrextend()/mdextend(). If we instead only do the write once we've read
> in & locked the page exclusively there's no need for the extension lock.
> We probably still should write out the new page to the OS immediately
> once we've initialized it, to avoid creating sparse files.
>
> The other reason we need the extension lock is that code like
> lazy_scan_heap() and btvacuumscan() tries to avoid initializing pages
> that are about to be initialized by the extending backend. I think we
> should just remove that code and deal with the problem by retrying in
> the extending backend; that's why I think moving extension to a
> different file might be helpful.

I thought the primary reason we did this is because we wanted to
write-and-fsync the block so that, if we're out of disk space, any attendant
failure will happen before we put data into the block. Once we've
initialized the block, a subsequent failure to write or fsync it will be
hard to recover from; basically, we won't be able to checkpoint any more. If
we discover the problem while the block is still all-zeroes, the transaction
that uncovers the problem errors out, but the system as a whole is still OK.

Or at least, I think. Maybe I'm misunderstanding.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I thought the primary reason we did this is because we wanted to
> write-and-fsync the block so that, if we're out of disk space, any
> attendant failure will happen before we put data into the block. Once
> we've initialized the block, a subsequent failure to write or fsync it
> will be hard to recover from; basically, we won't be able to
> checkpoint any more. If we discover the problem while the block is
> still all-zeroes, the transaction that uncovers the problem errors
> out, but the system as a whole is still OK.

Yeah. As Andres says, the fsync is not an important part of that, but we do
expect that ENOSPC will happen during the initial write() if it's going to
happen.

To some extent that's an obsolete assumption, I'm afraid --- I believe that
some modern filesystems don't necessarily overwrite the previous version of
a block, which would mean that they are capable of failing with ENOSPC even
during a re-write of a previously-written block. However, the possibility of
filesystem misfeasance of that sort doesn't excuse us from having a clear
recovery strategy for failures during relation extension.

			regards, tom lane
On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:
> In the past, I have observed in one of the write-oriented tests that
> backends have to flush the pages by themselves many a time, so in the
> above situation that can lead to a more severe bottleneck.

Yes.

> > I've prototyped solving this for heap relations by moving the
> > smgrnblocks() + smgrextend() calls to RelationGetBufferForTuple(). With
> > some care (including a retry loop) it's possible to only do those two
> > under the extension lock. That indeed fixes problems in some of my tests.
>
> So does this mean that the problem is because of contention on the
> extension lock?

Yes, at least commonly. Obviously the extension lock would be less of a
problem if we were better at having a clean victim buffer ready.

> > I'm not sure whether the above is the best solution however.
>
> Another thing to note here is that during extension we are extending just
> one block, won't it make sense to increment it by some bigger number (we
> can even take input from the user for the same, where the user can specify
> how much to autoextend a relation when the relation doesn't have any empty
> space). During mdextend(), we might increase just one block, however we
> can register the request for a background process to increase the size,
> similar to what is done for fsync.

I think that's pretty much a separate patch. Made easier by moving things
out from under the lock maybe. Other than that I'd prefer not to mix things.
There's a whole bunch of unrelated complexity that I don't want to attach to
the topic at the same time (autovacuum truncating again and so on).

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3/30/15 6:45 AM, Andres Freund wrote:
> On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:
>> Another thing to note here is that during extension we are extending just
>> one block, won't it make sense to increment it by some bigger number (we
>> can even take input from the user for the same, where the user can
>> specify how much to autoextend a relation when the relation doesn't have
>> any empty space). During mdextend(), we might increase just one block,
>> however we can register the request for a background process to increase
>> the size, similar to what is done for fsync.
>
> I think that's pretty much a separate patch. Made easier by moving things
> out from under the lock maybe. Other than that I'd prefer not to mix
> things. There's a whole bunch of unrelated complexity that I don't want to
> attach to the topic at the same time (autovacuum truncating again and so
> on).

Agreed that it makes more sense for this to be in a separate patch, but I
definitely like the idea.

A user configurable setting would be fine, but better would be to learn from
the current growth rate of the table and extend based on that.

For instance, if a table is very large but is only growing by a few rows a
day, there's probably no need for a large extent. Conversely, an initially
small table growing by 1GB per minute would definitely benefit from large
extents, and it would be good to be able to track growth and compute extent
sizes accordingly.

Of course, a manual setting to start with would cover most use cases. Large
tables in a database are generally in the minority and known in advance.

--
- David Steele
david@pgmasters.net
* David Steele (david@pgmasters.net) wrote:
> On 3/30/15 6:45 AM, Andres Freund wrote:
> > On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:
> > [...]
> > I think that's pretty much a separate patch. Made easier by moving
> > things out from under the lock maybe. Other than that I'd prefer not to
> > mix things.
>
> Agreed that it makes more sense for this to be in a separate patch, but I
> definitely like the idea.
>
> A user configurable setting would be fine, but better would be to learn
> from the current growth rate of the table and extend based on that.
>
> For instance, if a table is very large but is only growing by a few rows a
> day, there's probably no need for a large extent. Conversely, an initially
> small table growing by 1GB per minute would definitely benefit from large
> extents, and it would be good to be able to track growth and compute
> extent sizes accordingly.
>
> Of course, a manual setting to start with would cover most use cases.
> Large tables in a database are generally in the minority and known in
> advance.

If we're able to extend based on page-level locks rather than the global
relation locking that we're doing now, then I'm not sure we really need to
adjust how big the extents are any more. The reason for making bigger
extents is because of the locking problem we have now when lots of backends
want to extend a relation, but, if I'm following correctly, that'd go away
with Andres' approach.

We don't have full patches for either of these and so I don't mind saying
that, basically, I'd prefer to see if we still have a big bottleneck here
with lots of backends trying to extend the same relation before we work on
adding this particular feature in, as it might end up being unnecessary.

Now, if someone shows up tomorrow with a patch to do this and Andres'
approach ends up not progressing, then we should certainly consider it (in
due time and with consideration to the activities going on for 9.5, of
course).

Thanks!

Stephen
On 3/30/15 10:48 PM, Amit Kapila wrote:
>> If we're able to extend based on page-level locks rather than the global
>> relation locking that we're doing now, then I'm not sure we really need
>> to adjust how big the extents are any more. The reason for making bigger
>> extents is because of the locking problem we have now when lots of
>> backends want to extend a relation, but, if I'm following correctly,
>> that'd go away with Andres' approach.
>
> The benefit of extending in bigger chunks in background is that backend
> would need to perform such an operation at relatively lesser frequency
> which in itself could be a win.

The other potential advantage (and I have to think this could be a BIG
advantage) is that extending by a large amount makes it more likely you'll
get contiguous blocks on the storage. That's going to make a big difference
for SeqScan speed.

It'd be interesting if someone with access to some real systems could test
that. In particular, seqscan of a possibly fragmented table vs one of the
same size but created at once. For extra credit, compare to dd bs=8192 of a
file of the same size as the overall table.

What I've seen in the real world is very, very poor SeqScan performance of
tables that were relatively large. So bad that I had to SeqScan 8-16 tables
in parallel to max out the IO system the same way I could with a single dd
bs=8k of a large file (in this case, something like 480MB/s). A single
SeqScan would only do something like 30MB/s.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 02-04-2015 AM 09:24, Jim Nasby wrote:
> The other potential advantage (and I have to think this could be a BIG
> advantage) is that extending by a large amount makes it more likely you'll
> get contiguous blocks on the storage. That's going to make a big
> difference for SeqScan speed. It'd be interesting if someone with access
> to some real systems could test that. In particular, seqscan of a possibly
> fragmented table vs one of the same size but created at once. For extra
> credit, compare to dd bs=8192 of a file of the same size as the overall
> table.

Orthogonal to the topic of the thread, but this comment made me recall a
proposal from a couple of years ago[0] to add (posix_)fallocate to
mdextend(). Wonder if it helps the case?

Amit

[0] http://www.postgresql.org/message-id/CADupcHW1POmSuNoNMdVaWLTq-a3X_A3ZQMuSjHs4rCexiPgxAQ@mail.gmail.com
* Amit Langote (Langote_Amit_f8@lab.ntt.co.jp) wrote:
> On 02-04-2015 AM 09:24, Jim Nasby wrote:
> > The other potential advantage (and I have to think this could be a BIG
> > advantage) is that extending by a large amount makes it more likely
> > you'll get contiguous blocks on the storage. [...]
>
> Orthogonal to the topic of the thread, but this comment made me recall a
> proposal from a couple of years ago[0] to add (posix_)fallocate to
> mdextend(). Wonder if it helps the case?

As I recall, it didn't, and further, modern filesystems are pretty good
about avoiding fragmentation anyway.

I'm not saying Jim's completely off-base with this idea, I'm just not sure
that it'll really buy us much.

Thanks,

Stephen
On Sun, Mar 29, 2015 at 11:56 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I'm not sure whether the above is the best solution however. For one I
> think it's not necessarily a good idea to opencode this in hio.c - I've
> not observed it, but this probably can happen for btrees and such as
> well. For another, this is still an exclusive lock while we're doing IO:
> smgrextend() wants a page to write out, so we have to be careful not to
> overwrite things.

I think relaxing a global lock will fix the contention mostly. However,
several people suggested that extending by many pages has other benefits.

This hints at a more fundamental change in our storage model. Currently we
map one file per relation. While it is simple and robust, considering
partitioned tables, and maybe later columnar storage integrated into the
core, this model needs some further thought. Think about a 1000-partition
table with 100 columns: that is 100K files, not to speak of other forks -
surely we can continue challenging the file system's limits or playing
around with vfds, but we have a chance now to think ahead.

Most commercial databases employ a DMS storage model, where the database
manages object mapping and free space itself. So different objects share
storage within several files. Surely it has historic reasons, but it has
several advantages over the current model:
- remove fd pressure
- remove double buffering (by introducing ADIO)
- controlled layout and access pattern (sequential and read-ahead)
- better quota management
- performance potentially better

Considering the platforms supported and the stabilization period needed, we
would have to support both the current storage model and the DMS model. I
will stop here to see if this deserves further discussion.

Regards,
Qingqing
On Fri, Apr 17, 2015 at 11:19 AM, Qingqing Zhou <zhouqq.postgres@gmail.com> wrote:
> Most commercial databases employ a DMS storage model, where the database
> manages object mapping and free space itself. So different objects share
> storage within several files. Surely it has historic reasons, but it has
> several advantages over the current model:
> - remove fd pressure
> - remove double buffering (by introducing ADIO)
> - controlled layout and access pattern (sequential and read-ahead)
> - better quota management
> - performance potentially better
>
> Considering the platforms supported and the stabilization period needed,
> we would have to support both the current storage model and the DMS
> model. I will stop here to see if this deserves further discussion.

Sorry, it might be considered double-posting here, but I am wondering: have
we ever discussed this before? If we already have some conclusions on this,
could anyone share a link?

Thanks,
Qingqing
Hi,

I have, every now and then, spent a bit of time making this more efficient
over the last few weeks. I had a bit of a problem reproducing the issues I'd
seen in production on physical hardware (found EC2 to be too variable to
benchmark this), but luckily 2ndQuadrant today allowed me access to their
four socket machine[1] of the AXLE project. Thanks Simon and Tomas!

First, some mostly juicy numbers:

My benchmark was a parallel COPY into a single WAL-logged target table:
CREATE TABLE data(data text);

The source data has been generated with

narrow:
COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY;

wide:
COPY (select repeat(random()::text, 10) FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY;

Between every test I ran a TRUNCATE data; CHECKPOINT;

For each number of clients I ran pgbench for 70 seconds. I'd previously
determined using -P 1 that the numbers are fairly stable. Longer runs would
have been nice, but then I'd not have finished in time.

shared_buffers = 48GB, narrow table contents:
client   tps after:    tps before:
1        180.255577    210.125143
2        338.231058    391.875088
4        638.814300    405.243901
8        1126.852233   370.922271
16       1242.363623   498.487008
32       1229.648854   484.477042
48       1223.288397   468.127943
64       1198.007422   438.238119
96       1201.501278   370.556354
128      1198.554929   288.213032
196      1189.603398   193.841993
256      1144.082291   191.293781
512      643.323675    200.782105

shared_buffers = 1GB, narrow table contents:
client   tps after:    tps before:
1        191.137410    210.787214
2        351.293017    384.086634
4        649.800991    420.703149
8        1103.770749   355.947915
16       1287.192256   489.050768
32       1226.329585   464.936427
48       1187.266489   443.386440
64       1182.698974   402.251258
96       1208.315983   331.290851
128      1183.469635   269.250601
196      1202.847382   202.788617
256      1177.924515   190.876852
512      572.457773    192.413191

shared_buffers = 48GB, wide table contents:
client   tps after:    tps before:
1        59.685215     68.445331
2        102.034688    103.210277
4        179.434065    78.982315
8        222.613727    76.195353
16       232.162484    77.520265
32       231.979136    71.654421
48       231.981216    64.730114
64       230.955979    57.444215
96       228.016910    56.324725
128      227.693947    45.701038
196      227.410386    37.138537
256      224.626948    35.265530
512      105.356439    34.397636

shared_buffers = 1GB, wide table contents:
(ran out of patience)

Note that the peak performance with the patch is significantly better, but
there's currently a noticeable regression in single threaded performance.
That undoubtedly needs to be addressed.

So, to get to the actual meat: My goal was to essentially get rid of an
exclusive lock over relation extension altogether. I think I found a way to
do that that addresses the concerns made in this thread.

The new algorithm basically is:
1) Acquire victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1,
   goto 3)
5) Try to read the page. In most cases it'll not yet exist. But the page
   might concurrently have been written by another backend and removed from
   shared buffers already. If already existing, goto 1)
6) Zero out the page on disk.

I think this does handle the concurrency issues.

This patch very clearly is in the POC stage. But I do think the approach is
generally sound. I'd like to see some comments before deciding whether to
carry on.

Greetings,

Andres Freund

PS: Yes, I know that precision in the benchmark isn't warranted, but I'm too
lazy to truncate them.
[1] [10:28:11 PM] Tomas Vondra:
4x Intel Xeon E5-4620 Eight Core 2.2GHz processors
   (generation: Sandy Bridge EP; each core handles 2 threads, so 16 threads total)
256GB (16x16GB) ECC REG System Validated Memory (1333 MHz)
2x 250GB SATA 2.5" Enterprise Level HDs (RAID 1, ~250GB)
17x 600GB SATA 2.5" Solid State HDs (RAID 0, ~10TB)
LSI MegaRAID 9271-8iCC controller and Cache Vault Kit (1GB cache)
2x Nvidia Tesla K20 Active GPU Cards (GK110GL)
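[To make the control flow of the six steps listed in the message above more
concrete, here is a schematic sketch. It is not the actual POC patch: the
helper functions marked as illustrative do not exist, and buffer
pinning/release and the buffer manager's I/O locking are glossed over.]

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/smgr.h"
    #include "utils/rel.h"

    /* illustrative helpers - these are NOT existing PostgreSQL functions */
    extern Buffer acquire_clean_victim_buffer(void);
    /* returns false if an entry for blockno already existed */
    extern bool   buffer_table_insert(Relation rel, BlockNumber blockno, Buffer buf);
    extern bool   block_exists_on_disk(Relation rel, BlockNumber blockno);

    static char zeroed_page[BLCKSZ];    /* all-zeroes page image */

    static Buffer
    extend_without_extension_lock(Relation rel)
    {
        for (;;)
        {
            /* 1) acquire a clean, pinned victim buffer */
            Buffer      victim = acquire_clean_victim_buffer();

            /* 2) start probing at the current relation size */
            BlockNumber blockno = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);

            /* 3) + 4) claim the first block not yet in the buffer table */
            while (!buffer_table_insert(rel, blockno, victim))
                blockno++;

            /*
             * 5) try to read the block: normally it does not exist yet, but
             * it may have been written and already evicted by a concurrent
             * extender, in which case start over (cleanup of the victim
             * buffer and the buffer table entry is omitted in this sketch).
             */
            if (block_exists_on_disk(rel, blockno))
                continue;

            /* 6) zero out the page on disk, thereby actually allocating it */
            smgrextend(rel->rd_smgr, MAIN_FORKNUM, blockno, zeroed_page, false);

            return victim;
        }
    }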
Hi,

Eeek, the attached patch included a trivial last minute screwup
(dereferencing bistate unconditionally...). Fixed version attached.

Andres
Andres Freund <andres@anarazel.de> writes:
> So, to get to the actual meat: My goal was to essentially get rid of an
> exclusive lock over relation extension altogether. I think I found a way
> to do that that addresses the concerns made in this thread.

> The new algorithm basically is:
> 1) Acquire victim buffer, clean it, and mark it as pinned
> 2) Get the current size of the relation, save it into blockno
> 3) Try to insert an entry into the buffer table for blockno
> 4) If the page is already in the buffer table, increment blockno by 1,
>    goto 3)
> 5) Try to read the page. In most cases it'll not yet exist. But the page
>    might concurrently have been written by another backend and removed
>    from shared buffers already. If already existing, goto 1)
> 6) Zero out the page on disk.

> I think this does handle the concurrency issues.

The need for (5) kind of destroys my faith in this really being safe: it
says there are non-obvious race conditions here. For instance, what about
this scenario:

* Session 1 tries to extend the file, allocates a buffer for page 42 (so
it's now between steps 4 and 5).
* Session 2 tries to extend the file, sees the buffer for 42 exists,
allocates a buffer for page 43 (so it's also now between 4 and 5).
* Session 2 tries to read page 43, sees it's not there, and writes out page
43 with zeroes (now it's done).
* Session 1 tries to read page 42, sees it's there and zero-filled (not
because anybody wrote it, but because holes in files read as 0).

At this point session 1 will go and create page 44, won't it, and you just
wasted a page. Now we do have mechanisms for reclaiming such pages but they
may not kick in until VACUUM, so you could end up with a whole lot of table
bloat.

Also, the file is likely to end up badly physically fragmented when the
skipped pages are finally filled in. One of the good things about the
relation extension lock is that the kernel sees the file as being extended
strictly sequentially, which it should handle fairly well as far as
filesystem layout goes. This way might end up creating a mess on-disk.

Perhaps even more to the point, you've added a read() kernel call that was
previously not there at all, without having removed either the lseek() or
the write(). Perhaps that scales better when what you're measuring is
saturation conditions on a many-core machine, but I have a very hard time
believing that it's not a significant net loss under less-contended
conditions.

I'm inclined to think that a better solution in the long run is to keep the
relation extension lock but find a way to extend files more than one page
per lock acquisition.

			regards, tom lane
On 2015-07-19 11:28:25 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > So, to get to the actual meat: My goal was to essentially get rid of an
> > exclusive lock over relation extension altogether. I think I found a
> > way to do that that addresses the concerns made in this thread.
> >
> > I think this does handle the concurrency issues.
>
> The need for (5) kind of destroys my faith in this really being safe: it
> says there are non-obvious race conditions here.

It's not simple, I agree. I'm doubtful that a significantly simpler approach
exists.

> For instance, what about this scenario:
> * Session 1 tries to extend the file, allocates a buffer for page 42 (so
> it's now between steps 4 and 5).
> * Session 2 tries to extend the file, sees the buffer for 42 exists,
> allocates a buffer for page 43 (so it's also now between 4 and 5).
> * Session 2 tries to read page 43, sees it's not there, and writes out
> page 43 with zeroes (now it's done).
> * Session 1 tries to read page 42, sees it's there and zero-filled (not
> because anybody wrote it, but because holes in files read as 0).
>
> At this point session 1 will go and create page 44, won't it, and you
> just wasted a page.

My local code now recognizes that case and uses the page. We just need to do
a PageIsNew() check.

> Also, the file is likely to end up badly physically fragmented when the
> skipped pages are finally filled in. One of the good things about the
> relation extension lock is that the kernel sees the file as being extended
> strictly sequentially, which it should handle fairly well as far as
> filesystem layout goes. This way might end up creating a mess on-disk.

I don't think that'll actually happen with any recent filesystems. Pretty
much all of them do delayed allocation. But it definitely is a concern with
older filesystems.

I've just measured, and with ext4 the number of extents per segment in a
300GB relation doesn't show a significant difference when compared between
the existing and new code.

We could try to address this by optionally using posix_fallocate() to do the
actual extension - then there'll not be sparse regions, but actually
allocated disk blocks.

> Perhaps even more to the point, you've added a read() kernel call that
> was previously not there at all, without having removed either the
> lseek() or the write(). Perhaps that scales better when what you're
> measuring is saturation conditions on a many-core machine, but I have a
> very hard time believing that it's not a significant net loss under
> less-contended conditions.

Yes, this has me worried too.

> I'm inclined to think that a better solution in the long run is to keep
> the relation extension lock but find a way to extend files more than one
> page per lock acquisition.

I doubt that'll help as much. As long as you have to search for and write
out victim buffers under an exclusive lock, that'll be painful.
You might be able to make that an infrequent occurrence by extending in
larger amounts and entering the returned pages into the FSM, but you'll have
rather noticeable latency increases every time that happens. And not just in
the extending backend - all the other backends will wait for the one doing
the extending. We could move that into some background process, but at that
point things have gotten seriously complex.

The more radical solution would be to have some place in memory that'd store
the current number of blocks. Then all the extension-specific locking we'd
need would be around incrementing that. But how and where to store that
isn't easy.

Greetings,

Andres Freund
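[As a sketch of that last idea - nothing like this exists in the patch; the
struct and its placement in shared memory are hypothetical, only the
pg_atomic primitives are real: if the relation's current length lived in
shared memory, claiming a new block would be a single atomic increment.]

    #include "postgres.h"
    #include "port/atomics.h"
    #include "storage/block.h"

    /* hypothetical shared-memory entry holding a relation's length in blocks */
    typedef struct RelExtensionCounter
    {
        pg_atomic_uint32 nblocks;
    } RelExtensionCounter;

    /*
     * Claim the next block number: fetch-and-add returns the old value,
     * which is exactly the block this backend now owns and must initialize.
     */
    static BlockNumber
    claim_new_block(RelExtensionCounter *counter)
    {
        return (BlockNumber) pg_atomic_fetch_add_u32(&counter->nblocks, 1);
    }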
Andres Freund <andres@anarazel.de> writes:
> On 2015-07-19 11:28:25 -0400, Tom Lane wrote:
>> At this point session 1 will go and create page 44, won't it, and you
>> just wasted a page.

> My local code now recognizes that case and uses the page. We just need
> to do a PageIsNew() check.

Er, what? How can you tell whether an all-zero page was or was not just
written by another session?

			regards, tom lane
On 2015-07-19 11:56:47 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > My local code now recognizes that case and uses the page. We just need
> > to do a PageIsNew() check.
>
> Er, what? How can you tell whether an all-zero page was or was not
> just written by another session?

The check is only done while holding the io lock on the relevant page (we
have to hold that anyway), after reading it in ourselves, just before
setting BM_VALID. As we can only get to that point when there wasn't any
other entry for the page in the buffer table, that guarantees that no other
backend is currently expanding into that page. Others might wait to read it,
but those'll wait behind the IO lock.

The situation the read() protects us against is that two backends try to
extend to the same block, but after one of them succeeded the buffer is
written out and reused for an independent page. So there is no in-memory
state telling the slower backend that that page has already been used.

Andres
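[Schematically, the check described above amounts to something like the
following. This is illustrative only; in the POC it happens inside the
buffer manager while the buffer's I/O lock is held, just before BM_VALID is
set.]

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"

    /*
     * After reading the block back in (still holding its I/O lock): if it
     * is still all zeroes, nobody initialized it - either it was never
     * written, or it is a filesystem hole - and the extending backend can
     * claim it.
     */
    static bool
    claim_if_still_new(Buffer buffer)
    {
        Page    page = BufferGetPage(buffer);

        if (PageIsNew(page))
        {
            PageInit(page, BufferGetPageSize(buffer), 0);
            return true;
        }

        /* a concurrent extender already used this block; caller retries */
        return false;
    }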
On Sun, Jul 19 2015 9:37 PM, Andres wrote:
> The situation the read() protects us against is that two backends try to
> extend to the same block, but after one of them succeeded the buffer is
> written out and reused for an independent page. So there is no in-memory
> state telling the slower backend that that page has already been used.

I was looking into this patch, and have done some performance testing.
Currently I have done testing on my local machine; later I will perform it
on a big machine once I get access to that. Just wanted to share the current
results I get on my local machine.

Machine conf: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz, 8 cores and 16GB of
RAM.

Test Script:
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY";
./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres
On Fri, Dec 18, 2015 at 10:51 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Sun, Jul 19 2015 9:37 PM, Andres wrote:
> > The situation the read() protects us against is that two backends try
> > to extend to the same block, but after one of them succeeded the buffer
> > is written out and reused for an independent page. So there is no
> > in-memory state telling the slower backend that that page has already
> > been used.
>
> I was looking into this patch, and have done some performance testing.
> Currently I have done testing on my local machine; later I will perform
> it on a big machine once I get access to that.
> [test script and local results]

This time I have done some testing on a big machine with 64 physical cores
@ 2.13GHz and 50GB of RAM.

There is a performance comparison of base, the extend-without-
RelationExtensionLock patch given by Andres, and the multi-extend patch
(this will extend multiple blocks at a time, based on a configuration
parameter).

Problem Analysis:
------------------------
1. With the base code, when I tried to observe the problem using perf and
   other methods (gdb), I found that RelationExtensionLock is the main
   bottleneck.
2. Then, after using the RelationExtensionLock-free patch, I observed that
   the contention is now FileWrite (all backends are trying to extend the
   file).

Performance Summary and Analysis:
------------------------------------------------
1. In my performance results, multi-extend showed the best performance and
   scalability.
2. I think by extending in multiple blocks we solve both of the problems
   (extension lock and parallel file write).
3. After extending one block we immediately add it to the FSM, so in most
   cases another backend can directly find it without taking the extension
   lock.

Currently the patch is in an initial stage; I have only tested performance
and passed the regression test suite.

Open problems:
-----------------------------
1. After extending the page we are adding it directly to the FSM, so if
   vacuum finds this page as new it will give a WARNING.
2. In RelationGetBufferForTuple, when PageIsNew, we are doing PageInit; the
   same needs to be considered for the index cases.

Test Script:
-------------------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY";
./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Performance Data:
--------------------------
There are three code bases in the comparison:
1. Base Code
2. Lock Free Patch: patch given in the below thread
   http://www.postgresql.org/message-id/20150719140746.GH25610@awork2.anarazel.de
3. Multi-extend patch attached in the mail.

#extend_num_pages: this is a new config parameter to tell how many extra
pages to extend in case of a normal extend. Maybe it will give more control
to the user if we make it a relation property.

I will work on the patch for this CF, so adding it to the CF.

Shared Buffer 48 GB
Clients   Base (TPS)   Lock Free Patch   Multi-extend extend_num_pages=5
1         142          138               148
2         251          253               280
4         237          416               464
8         168          491               575
16        141          448               404
32        122          337               332

Shared Buffer 64 MB
Clients   Base (TPS)   Multi-extend extend_num_pages=5
1         140          148
2         252          266
4         229          437
8         153          475
16        132          364

--
Dilip Kumar
On Thu, Dec 31, 2015 at 6:22 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Performance Data:
> --------------------------
> There are three code bases in the comparison:
> 1. Base Code
> 2. Lock Free Patch: patch given in the below thread
>    http://www.postgresql.org/message-id/20150719140746.GH25610@awork2.anarazel.de
> 3. Multi-extend patch attached in the mail.
>
> #extend_num_pages: this is a new config parameter to tell how many extra
> pages to extend in case of a normal extend. Maybe it will give more
> control to the user if we make it a relation property.
>
> I will work on the patch for this CF, so adding it to the CF.
>
> [...]

I'm not really sure what this email is trying to say.
On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:
> What I could understand from above e-mail is that Dilip has tried to
> extend relation multiple-pages-at-a-time and observed that it gives
> comparable or better performance as compare to Andres's idea of
> lock-free extension and it doesn't regress the low-thread count case.

I think it's a worthwhile approach to pursue. But until it actually fixes
the problem of leaving around uninitialized pages I don't think it's very
meaningful to do performance comparisons.

> Now, I think here point to discuss is that there could be multiple-ways
> for extending a relation multiple-pages-at-a-time like:
>
> a. Extend the relation page by page and add it to FSM without initializing
> it. I think this is what the current patch of Dilip seems to be doing. If
> we want to go via this route, then we need to ensure that whenever we get
> the page from FSM, if it is empty and not initialised, then initialise it.

I think that's pretty much unacceptable, for the non-error path at least.

One very simple, Linux-specific, approach would be to simply do
fallocate(FALLOC_FL_KEEP_SIZE) to extend the file; that way space is
pre-allocated, but not yet marked as allocated.

Greetings,

Andres Freund
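[A minimal sketch of that Linux-specific idea, assuming we already have the
open file descriptor of the relation segment; this is not what mdextend()
does today.]

    /* Linux-only; fallocate() and FALLOC_FL_KEEP_SIZE need _GNU_SOURCE */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <fcntl.h>

    #ifndef BLCKSZ
    #define BLCKSZ 8192         /* normally from pg_config.h */
    #endif

    /*
     * Reserve disk space for nblocks additional blocks past the current end
     * of the segment without moving EOF: the blocks are then backed by real
     * storage (no surprise ENOSPC later), but the file size reported by
     * lseek()/stat() stays unchanged until the blocks are actually written.
     */
    static int
    preallocate_blocks(int fd, off_t current_size, unsigned int nblocks)
    {
        return fallocate(fd, FALLOC_FL_KEEP_SIZE, current_size,
                         (off_t) nblocks * BLCKSZ);
    }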
On Thu, Jan 7, 2016 at 4:53 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:
> I think it's a worthwhile approach to pursue. But until it actually
> fixes the problem of leaving around uninitialized pages I don't think
> it's very meaningful to do performance comparisons.
>
> > a. Extend the relation page by page and add it to FSM without
> > initializing it. I think this is what the current patch of Dilip seems
> > to be doing. If we want to go via this route, then we need to ensure
> > that whenever we get the page from FSM, if it is empty and not
> > initialised, then initialise it.
>
> I think that's pretty much unacceptable, for the non-error path at least.

The attached patch solves this issue: I am allocating the buffer for each
page and initializing the page, and only after that adding it to the FSM.

Test Case:
------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo "COPY data from '/tmp/copybinary' WITH BINARY;" > copy_script
./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Test Summary:
--------------------
1. I have measured the performance of base and patch.
2. With the patch there are multiple results, with different values of
   "extend_num_pages" (the parameter which says how many extra blocks to
   extend).

Test with Data on Magnetic Disk and WAL on SSD
--------------------------------------------------------------------
Shared Buffer : 48GB
max_wal_size  : 10GB
Storage       : Magnetic Disk
WAL           : SSD

tps with different values of extend_num_pages
------------------------------------------------------------
Client    Base    10-Page    20-Page    50-Page
1         105     103        157        129
2         217     219        255        288
4         210     421        494        486
8         166     605        702        701
16        145     484        563        686
32        124     477        480        745

Test with Data and WAL on SSD
-----------------------------------------------
Shared Buffer : 48GB
Max Wal Size  : 10GB
Storage       : SSD

tps with different values of extend_num_pages
------------------------------------------------------------
Client    Base    10-Page    20-Page    50-Page    100-Page
1         152     153        155        147        157
2         281     281        292        275        287
4         236     505        502        508        514
8         171     662        687        767        764
16        145     527        639        826        907

Note: Test with both data and WAL on Magnetic Disk: no significant
improvement visible -- I think WAL write is becoming the bottleneck in this
case.

Apart from this approach, I also tried extending the file by multiple blocks
in one extend call, but this approach (extending one by one) is performing
better.

Currently I have kept extend_num_pages as a session-level parameter, but I
think later we can make it a table property. Any suggestion on this?
On Tue, Jan 12, 2016 at 2:41 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Jan 7, 2016 at 4:53 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:
> > I think it's a worthwhile approach to pursue. But until it actually
> > fixes the problem of leaving around uninitialized pages I don't think
> > it's very meaningful to do performance comparisons.
>
> Attached patch solves this issue, I am allocating the buffer for each page
> and initializing the page, only after that adding to FSM.

Few comments about patch:
1. Patch is not getting compiled:

   1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared identifier

I will fix it in the next version of the patch.
2.
! page = BufferGetPage(buffer);
! PageInit(page, BufferGetPageSize(buf), 0);
!
! freespace = PageGetHeapFreeSpace(page);
! MarkBufferDirty(buffer);
! UnlockReleaseBuffer(buffer);
! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace);

What is the need to mark the page dirty here, won't it automatically be marked dirty once the page is used? I think it is required if you wish to WAL-log this action.

These pages are not going to be used immediately, and we have done PageInit, so I think the page should be marked dirty before adding it to the FSM, so that if the buffer gets replaced out it still flushes the init data.
3. I think you don't need to multi-extend a relation if HEAP_INSERT_SKIP_FSM is used, as for that case it anyways tries to get a new page by extending the relation.

Yes, if HEAP_INSERT_SKIP_FSM is used and we use multi-extend, then at least in the current transaction it will not take pages from the FSM and will do a multi-extend every time. The pages will still get used if there are parallel backends, but it is still not a good idea for the current backend to extend by multiple chunks every time.
4. Again, why do you need this multi-extend optimization for local relations (those only accessible to the current backend)?

5. Do we need this for nbtree as well? One way to check that is by copying large data into a table having an index.
> Note: Test with both data and WAL on magnetic disk: no significant improvement visible
> -- I think WAL write is becoming the bottleneck in this case.

In that case, can you try the same test with unlogged tables?

Also, it is good to check the performance of the patch with a read-write workload, to ensure that extending the relation in multiple chunks does not regress such cases.

Ok.

> Currently I have kept extend_num_pages as a session-level parameter but I think later we can make this a table property.
> Any suggestion on this?

I think we should have a new storage parameter at the table level, extend_by_blocks or something like that, instead of a GUC. The default value of this parameter should be 1, which means retain current behaviour by default.
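As a rough illustration of the shape such a table-level option could take (names here are hypothetical; the real patch would add an entry to the built-in reloption tables in src/backend/access/common/reloptions.c rather than register it from an extension):

    /* Hypothetical registration of an integer reloption; default 1 keeps
     * today's one-block-at-a-time behaviour. */
    add_int_reloption(RELOPT_KIND_HEAP,
                      "extend_by_blocks",
                      "Number of blocks to extend a relation by at a time",
                      1,      /* default: retain current behaviour */
                      1,      /* minimum */
                      100);   /* arbitrary maximum */

The value could then be read back at extension time with an accessor macro along the lines of the existing RelationGetFillFactor() in rel.h.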
--
I found one more problem with the patch.

! UnlockReleaseBuffer(buffer);
! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace);

You can't call BufferGetBlockNumber(buffer) after releasing the pin on the buffer, which is released by UnlockReleaseBuffer(). Get the block number before unlocking the buffer.
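In other words, the per-page loop needs to look something along these lines (just a sketch of the ordering, not the actual patch; the caller is assumed to already hold the relation extension lock):

    /* Sketch only: add one new, initialized, empty page to the FSM. */
    Buffer      buffer;
    Page        page;
    Size        freespace;
    BlockNumber blockNum;

    buffer = ReadBuffer(relation, P_NEW);
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    page = BufferGetPage(buffer);
    PageInit(page, BufferGetPageSize(buffer), 0);
    freespace = PageGetHeapFreeSpace(page);
    MarkBufferDirty(buffer);

    blockNum = BufferGetBlockNumber(buffer);    /* while we still hold the pin */
    UnlockReleaseBuffer(buffer);

    RecordPageWithFreeSpace(relation, blockNum, freespace);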
> 1. Patch is not getting compiled.
> 1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared identifier

Oh, my mistake, my preprocessor is ignoring this error and replacing it with BLKSIZE.

FIXED

> 3. I think you don't need to multi-extend a relation if HEAP_INSERT_SKIP_FSM is used, as for that case it anyways tries to get a new page by extending the relation.

So I will change this.

> 4. Again, why do you need this multi-extend optimization for local relations (those only accessible to the current backend)?

I think we can change this while adding the table-level "extend_by_blocks": for local tables we will not allow this property, so no need to change at this place. What do you think?

Now I have added a table-level parameter for specifying the number of blocks. So do you think that we still need to block it, given that the user can control it? Moreover, I think if a user wants to use it for a local table there is no harm in it; at least by extending in one shot they avoid multiple acquisitions of the extension lock, though there will be no contention.

> 5. Do we need this for nbtree as well? One way to check that is by copying large data into a table having an index.

Ok, I will try this test and update.

> > Note: Test with both data and WAL on magnetic disk: no significant improvement visible
> > -- I think WAL write is becoming the bottleneck in this case.
>
> In that case, can you try the same test with unlogged tables?
setup:
------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script
./psql -d postgres -c "create unlogged table data (data text)" --> base
./psql -d postgres -c "create unlogged table data (data text) with(extend_by_blocks=50)" --patch
test_script:
------------
./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres
Shared Buffer 48GB
Table: Unlogged Table
ench -c$ -j$ -f -M Prepared postgres
Clients Base patch
1 178 180
2 337 338
4 265 601
8 167 805
> Also, it is good to check the performance of the patch with a read-write workload to ensure that extending the relation in multiple chunks does not regress such cases.

Ok.

> Currently I have kept extend_num_pages as a session-level parameter but I think later we can make this a table property.
> Any suggestion on this?
> I think we should have a new storage parameter at table level, extend_by_blocks or something like that, instead of a GUC. The default value of this parameter should be 1, which means retain current behaviour by default.

+1
I did not find any regression in the normal case. Note: I tested it with the previous patch, extend_num_pages=10 (GUC parameter), so that we can see any impact on the overall system.
On Thu, Jan 28, 2016 at 6:23 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> [ new patch ]

This patch contains a useless hunk and also code not in PostgreSQL style. Get pgindent set up and it will do it correctly for you, or look at the style of the surrounding code.

What I'm a bit murky about is *why* this should be any better than the status quo. I mean, the obvious explanation that pops to mind is that extending the relation by two pages at a time relieves pressure on the relation extension lock somehow. One other possible explanation is that calling RecordPageWithFreeSpace() allows multiple backends to get access to that page at the same time, while otherwise each backend would try to conduct a separate extension. But in the first case, what we ought to do is try to make the locking more efficient; and in the second case, we might want to think about recording the first page in the free space map too. I don't think the problem here is that w

Here's a sketch of another approach to this problem. Get rid of the relation extension lock. Instead, have an array of, say, 256 lwlocks. Each one protects the extension of relations where hash(relfilenode) % 256 maps to that lock. To extend a relation, grab the corresponding lwlock, do the work, then release the lwlock. You might occasionally have a situation where two relations are both being extended very quickly and happen to map to the same bucket, but that shouldn't be much of a problem in practice, and the way we're doing it presently is worse, not better, since two relation extension locks may very easily map to the same lock manager partition.

The only problem with this is that acquiring an LWLock holds off interrupts, and we don't want interrupts to be locked out across a potentially lengthy I/O. We could partially fix that if we call RESUME_INTERRUPTS() after acquiring the lock and HOLD_INTERRUPTS() just before releasing it, but there's still the problem that you might block non-interruptibly while somebody else has the lock. I don't see an easy solution to that problem right off-hand, but if something like this performs well we can probably conjure up some solution to that problem.

I'm not saying that we need to do that exact thing - but I am saying that I don't think we can proceed with an approach like this without first understanding why it works and whether there's some other way that might be better to address the underlying problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
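For illustration, the shape I have in mind is roughly this (names and the shared-memory allocation are hand-waved; this is not an existing API):

    /* A fixed array of LWLocks, allocated in shared memory at startup. */
    #define N_RELEXTENSION_LOCKS 256

    static LWLock *RelExtensionLockArray;

    static inline LWLock *
    RelExtensionLockFor(RelFileNode *rnode)
    {
        /* a proper hash over the whole RelFileNode would be better */
        return &RelExtensionLockArray[rnode->relNode % N_RELEXTENSION_LOCKS];
    }

    /* Extending a relation would then look like: */
    LWLockAcquire(RelExtensionLockFor(&rel->rd_node), LW_EXCLUSIVE);
    /* ... find a victim buffer, smgrextend(), initialize the page ... */
    LWLockRelease(RelExtensionLockFor(&rel->rd_node));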
On 2016-01-28 16:53:08 +0530, Dilip Kumar wrote:
> test_script:
> ------------
> ./psql -d postgres -c "truncate table data"
> ./psql -d postgres -c "checkpoint"
> ./pgbench -f copy_script -T 120 -c$ -j$ postgres
>
> Shared Buffer 48GB
> Table: Unlogged Table
> ./pgbench -c$ -j$ -f copy_script -M prepared postgres
>
> Clients   Base   Patch
> 1          178     180
> 2          337     338
> 4          265     601
> 8          167     805

Could you also measure how this behaves for an INSERT instead of a COPY workload? Both throughput and latency. It's quite possible that this causes latency hikes, because suddenly backends will have to wait for one another to extend by 50 pages. You'll probably have to use -P 1 or full statement logging to judge that. I think just having a number of connections inserting relatively wide rows into one table should do the trick.

I'm doubtful that anything that does the victim buffer search while holding the extension lock will actually scale in a wide range of scenarios. The copy scenario here probably isn't too bad because the copy ring buffers are in use, and because there are no reads increasing the usagecount of recent buffers; thus victim buffers are easily found.

Thanks,

Andres
On 2016-02-02 10:12:38 -0500, Robert Haas wrote:
> Here's a sketch of another approach to this problem. Get rid of the
> relation extension lock. Instead, have an array of, say, 256 lwlocks.
> Each one protects the extension of relations where hash(relfilenode) %
> 256 maps to that lock. To extend a relation, grab the corresponding
> lwlock, do the work, then release the lwlock. You might occasionally
> have a situation where two relations are both being extended very
> quickly and happen to map to the same bucket, but that shouldn't be
> much of a problem in practice, and the way we're doing it presently is
> worse, not better, since two relation extension locks may very easily
> map to the same lock manager partition.

I guess you suspect that the performance problems come from the heavyweight lock overhead? That's not what I *think* I've seen in profiles, but it's hard to conclusively judge that.

I kinda doubt that really solves the problem, profiles aside, though. The above wouldn't really get rid of the extension locks, it just changes the implementation a bit. We'd still do the victim buffer search, and filesystem operations, while holding an exclusive lock. Batching can solve some of that, but I think primarily we need more granular locking, or to get rid of locks entirely.

> The only problem with this is that acquiring an LWLock holds off
> interrupts, and we don't want interrupts to be locked out across a
> potentially lengthy I/O. We could partially fix that if we call
> RESUME_INTERRUPTS() after acquiring the lock and HOLD_INTERRUPTS()
> just before releasing it, but there's still the problem that you might
> block non-interruptibly while somebody else has the lock. I don't see
> an easy solution to that problem right off-hand, but if something like
> this performs well we can probably conjure up some solution to that
> problem.

Hm. I think to get rid of the HOLD_INTERRUPTS() we'd have to record what lock we were waiting on, and in which mode, before going into PGSemaphoreLock(). Then LWLockReleaseAll() could "hand off" the wakeup to the next waiter in the queue. Without that we'd sometimes end up absorbing a wakeup without then releasing the lock, causing everyone to block on a released lock. There are probably two major questions around that: will it have a performance impact, and will there be any impact on existing callers?

Regards,

Andres
On Tue, Feb 2, 2016 at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
> I'm doubtful that anything that does the victim buffer search while
> holding the extension lock will actually scale in a wide range of
> scenarios. The copy scenario here probably isn't too bad because the
> copy ring buffers are in use, and because there are no reads increasing
> the usagecount of recent buffers; thus victim buffers are easily found.

I agree that's an avenue we should try to explore. I haven't had any time to think much about how it should be done, but it seems like it ought to be possible somehow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Andres Freund wrote:
> Could you also measure how this behaves for [...]

While we're proposing benchmark cases -- I remember this being an issue with toast tables getting very large xml values, which causes multiple toast pages to be extended for each new value inserted. If there are multiple processes inserting these all the time, things get clogged.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
> On 2016-01-28 16:53:08 +0530, Dilip Kumar wrote:
> > test_script:
> > ------------
> > ./psql -d postgres -c "truncate table data"
> > ./psql -d postgres -c "checkpoint"
> > ./pgbench -f copy_script -T 120 -c$ -j$ postgres
> >
> > Shared Buffer 48GB
> > Table: Unlogged Table
> > ench -c$ -j$ -f -M Prepared postgres
> >
> > Clients Base patch
> > 1 178 180
> > 2 337 338
> > 4 265 601
> > 8 167 805
>
> Could you also measure how this behaves for an INSERT instead of a COPY
> workload?
> I'm doubtful that anything that does the victim buffer search while
> holding the extension lock will actually scale in a wide range of
> scenarios.
> Could you also measure how this behaves for an INSERT instead of a COPY
> workload?
I think such a test will be useful.
Observation:
------------------
Apart from this test, I have also used a tool which can take many stack traces with some delay between them.

1. I have observed that with the base code (when data doesn't fit in shared buffers) almost all the stack traces are waiting on "LockRelationForExtension", and
many on "FlushBuffer" as well (flushing the dirty buffer).
Total stacks captured: 204, FlushBuffer: 13, LockRelationForExtension: 187
(This test was run with 8 threads (512MB shared buffers), and a stack was captured after every 5 seconds.)

2. If I change shared buffers to 48GB then obviously FlushBuffer disappears, but LockRelationForExtension still shows up in very high numbers.

3. Performance of the base code in both cases, whether the data fits in shared buffers or not, remains very low and does not scale (we can see that in the results below).
Test--1 (big record insert and Data fits in shared Buffer)
------------------------------------------------------------
setup
--------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int) with(extend_by_blocks=$variable)" {create table data (a int) for base code}
echo "insert into data select * from test_data;" >> copy_script
test:
-----
shared_buffers=48GB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f copy_script -T 120 postgres
Client   Base   extend_by_block=50   extend_by_block=1000
1         113          115                  118
4          50          220                  216
8          43          202                  302
Test--2 (big record insert and Data doesn't fits in shared Buffer)
------------------------------------------------------------------
setup:
-------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int) with(extend_by_blocks=1000)"
echo "insert into data select * from test_data;" >> copy_script
test:
------
shared_buffers=512MB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f copy_script -T 120 postgres
Client   Base   extend_by_block=1000
1         125          125
4          49          236
8          41          294
16         39          279
Test--3 (small record insert and Data fits in shared Buffer)
------------------------------------------------------------------
setup:
--------
./psql -d postgres -c "create table test_data(a int)"
./psql -d postgres -c "insert into test_data values(generate_series(1,10000))"
./psql -d postgres -c "create table data (a int) with(extend_by_blocks=20)"
echo "insert into data select * from test_data;" >> copy_script
test:
-----
shared_buffers=48GB -c max_wal_size=20GB -c checkpoint_timeout=10min
./pgbench -c $ -j $ -f copy_script -T 120 postgres
Client   Base   Patch (extend_by_blocks=20)
1         137          143
2         269          250
4         377          443
8         170          690
16        145          745
*All tests done with data on magnetic disk and WAL on SSD.
Note: The last patch has a max limit of extend_by_blocks=100, so for taking performance numbers with extend_by_blocks=1000 I locally changed it.
I will send the modified patch once we finalize which approach to proceed with.
> I'm doubtful that anything that does the victim buffer search while
> holding the extension lock will actually scale in a wide range of
> scenarios.

I think the problem for the victim buffer could be visible if the blocks are dirty and it needs to write the dirty buffer, especially as the patch is written, where after acquiring the extension lock it again tries to extend the relation without checking whether it can get a page with space from the FSM. It seems to me that we should re-check the availability of a page, because while one backend is waiting on the extension lock, other backends might have added pages. To re-check the availability we might want to use something similar to LWLockAcquireOrWait() semantics as used during WAL writing.
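For what it's worth, the re-check could look roughly like this inside RelationGetBufferForTuple() (a sketch only; variable names follow hio.c conventions, but this is not the posted patch):

    /* We could not get a usable page, so wait for the extension lock. */
    LockRelationForExtension(relation, ExclusiveLock);

    /*
     * While we slept, another backend may already have extended the
     * relation, so look in the FSM once more before extending ourselves.
     */
    targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
    if (targetBlock != InvalidBlockNumber)
    {
        UnlockRelationForExtension(relation, ExclusiveLock);
        goto loop;              /* retry the normal FSM path */
    }

    /* Still nothing usable: extend the relation ourselves. */
    buffer = ReadBufferBI(relation, P_NEW, bistate);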
On 2016-02-10 10:32:44 +0530, Dilip Kumar wrote:
> On Fri, Feb 5, 2016 at 4:50 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Could you also measure how this behaves for an INSERT instead of a COPY
> > > workload?
> >
> > I think such a test will be useful.
>
> I have measured the performance with insert to see the behavior when it
> don't use strategy. I have tested multiple option, small tuple, big tuple,
> data fits in shared buffer and doesn't fit in shared buffer.

Could you please also have a look at the influence this has on latency? I think you unfortunately have to look at the per-transaction logs, and then see whether the worst case latencies get better or worse.

Greetings,

Andres Freund
> Could you please also have a look at the influence this has on latency?
> I think you unfortunately have to look at the per-transaction logs, and
> then see whether the worst case latencies get better or worse.

(I selected the case below to find the worst-case latency because in this case
we are extending by 1000 blocks and the data doesn't fit in shared buffers.)
Test--2 (big record insert and Data doesn't fits in shared Buffer)
------------------------------------------------------------------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int) with(extend_by_blocks=1000)"
echo "insert into data select * from test_data;" >> copy_script
shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=10min
./pgbench -c 8 -j 8 -f copy_script -T 120 -l postgres
           Base     Patch (extend 1000)
Best       23245       3857
Worst     236329     382859
Average   190303      35143
I have also attached the pgbench log files
patch_1000.tar --> log files with patch extend by 1000 blocks
base.tar --> log files with base code
From the attached files we can see that with the patch only a few transactions have high latency (> 300,000), which is expected given that we are
extending 1000 blocks at a time, whereas with the base code almost every transaction's latency is high (> 200,000). We can see that the best-case and average latencies are roughly 1/5 with extend-by-1000.
I have tested the relation extension patch from various aspects; performance results and other statistical data are explained in this mail.

Test 1: Identify whether the heavyweight lock is the problem or the actual context switches

1. I converted the RelationExtensionLock to a simple LWLock and tested with a single relation. Results are as below.
This is the simple script of copy 10000 record in one transaction of size 4 Bytes
Client   Base   LWLock   Multi-extend by 50 blocks
1         155      156          160
2         282      276          284
4         248      319          428
8         161      267          675
16        143      241          889
LWLock performance is better than base; the obvious reason may be that we save some instructions by converting to an LWLock, but it doesn't scale any better compared to the base code.
Test 2: Identify whether the improvement in the multi-extend case comes from avoiding context switches or from some other factor, like reusing blocks between backends by putting them in the FSM.

1. Test by just extending multiple blocks and reusing them in the extending backend itself (don't put them in the FSM).

Insert 1K records; data doesn't fit in shared buffers (512MB shared buffers).
Client   Base   Extend 800 blocks, self use   Extend 1000 blocks
1         117             131                        118
2         111             203                        140
3          51             242                        178
4          51             231                        190
5          52             259                        224
6          51             263                        243
7          43             253                        254
8          43             240                        254
16         40             190                        243
We can see the same improvement when the backend uses the blocks itself, which shows that sharing the blocks between backends was not the win; avoiding the context switches was the major win.
2. Tested the number of ProcSleep calls during the run.

This is the simple script of COPYing 10000 records of size 4 bytes in one transaction.

Client   Base TPS   Base ProcSleep count   Extend-by-10-blocks TPS   Extend-by-10-blocks ProcSleep count
2           280          457,506                    311                       62,641
3           235        1,098,701                    358                      141,624
4           216        1,155,735                    368                      188,173

The ProcSleep test for the INSERT case, where data doesn't fit in shared buffers and we insert big records of 1024 bytes, is currently running;
once I get the data I will post it.
Posting the re-based version and moving to next CF.
Open points:
1. After getting the lock, recheck the FSM in case some other backend has already added extra blocks, and reuse them.
2. Is it a good idea to have a user-level parameter for extend_by_blocks, or can we try some approach to internally identify how many blocks are needed and add only that many? That would make it more flexible.
--
On Wed, Feb 10, 2016 at 7:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Test 2: Identify whether the improvement in the multi-extend case comes from
> avoiding context switches or from some other factor, like reusing blocks
> between backends by putting them in the FSM.
>
> 1. Test by just extending multiple blocks and reusing them in the extending
> backend itself (don't put them in the FSM).
>
> Insert 1K records; data doesn't fit in shared buffers (512MB shared buffers).
>
> Client   Base   Extend 800 blocks, self use   Extend 1000 blocks
> 1         117             131                        118
> 2         111             203                        140
> 3          51             242                        178
> 4          51             231                        190
> 5          52             259                        224
> 6          51             263                        243
> 7          43             253                        254
> 8          43             240                        254
> 16         40             190                        243
>
> We can see the same improvement when the backend uses the blocks itself,
> which shows that sharing the blocks between backends was not the win;
> avoiding the context switches was the major win.
>
> 2. Tested the number of ProcSleep calls during the run.
>
> This is the simple script of COPYing 10000 records of size 4 bytes in one
> transaction.
>
> Client   Base TPS   Base ProcSleep count   Extend-by-10-blocks TPS   Extend-by-10-blocks ProcSleep count
> 2           280          457,506                    311                       62,641
> 3           235        1,098,701                    358                      141,624
> 4           216        1,155,735                    368                      188,173
>
> What we can see in the above test is that in the base code performance
> degrades after 2 threads, while the ProcSleep count increases by a huge
> amount. Compared to that, with the patch extending 10 blocks at a time,
> ProcSleep reduces to ~1/8 and we can see it scaling constantly.
>
> The ProcSleep test for the INSERT case, where data doesn't fit in shared
> buffers and we insert big records of 1024 bytes, is currently running;
> once I get the data I will post it.
One thing that is slightly unclear is whether there is any overhead due to buffer eviction, especially when the buffer to be evicted is already dirty and needs XLogFlush(). One reason why it might not hurt is that by the time we try to evict the buffer, the corresponding WAL is already flushed by the WAL writer; another possibility could be that even if it is getting done during buffer eviction, its impact is much smaller. Can we try to measure the number of flush calls which happen during buffer eviction?

> The ProcSleep test for the INSERT case, where data doesn't fit in shared buffers and we insert big records of 1024 bytes, is currently running; once I get the data I will post it.

Okay. However, if the performance data for the case when data doesn't fit into shared buffers also shows a similar trend, then it might be worth trying to extend w.r.t. the load on the system. I mean to say we can batch the extension requests (as we have done in ProcArrayGroupClearXid) and extend accordingly; if that works out, then the benefit could be that we don't need any configurable knob for the same.
2. Other option can be that we analyse the current Load on the extend and then speed up or slow down the extending speed.
Simple algorithm can look like this
If (GotBlockFromFSM())
Success++ // We got the block from FSM
Else
Failure++ // Did not get Block from FSM and need to extend by my self
Now while extending
-------------------------
Speed up
-------------
If (failure - success > Threshold ) // Threshold can be one number assume 10.
ExtendByBlock += failure- success; --> If many failure means load is high then Increase the ExtendByBlock
Failure = success= 0 --> reset after this so that we can measure the latest trend.
Slow down..
--------------
//Now its possible that demand of block is reduced but ExtendByBlock is still high.. So analyse statistic and slow down the extending pace..
// Can not reduce it by big number because, may be more request are satisfying because this is correct amount, so gradually decrease the pace and re-analyse the statistics next time.
ExtendByBlock --;
Failure = success= 0
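In C that could be as simple as something like this (purely illustrative; all names here are hypothetical, not from any posted patch):

    /* Hypothetical sketch of the speed-up / slow-down logic above. */
    static int  extend_by_blocks = 1;   /* how many blocks to add next time */
    static int  fsm_success = 0;        /* times we got a block from the FSM */
    static int  fsm_failure = 0;        /* times we had to extend ourselves */

    #define EXTEND_ADJUST_THRESHOLD 10

    static void
    adjust_extend_by_blocks(void)
    {
        if (fsm_failure - fsm_success > EXTEND_ADJUST_THRESHOLD)
        {
            /* Demand is high: extend by more blocks next time. */
            extend_by_blocks += fsm_failure - fsm_success;
        }
        else if (extend_by_blocks > 1)
        {
            /* Demand may have dropped: back off gradually. */
            extend_by_blocks--;
        }

        /* Reset, so the next adjustment reflects the latest trend. */
        fsm_success = 0;
        fsm_failure = 0;
    }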
> 1. One option can be as you suggested, like ProcArrayGroupClearXid, with some
> modification, because if we wait for the requests and extend w.r.t. that, we
> may again face the context switch problem. So maybe we can extend in some
> multiple of the requests. (But we need to take care whether to give a block
> directly to the requester or add it into the FSM, or do both, i.e. give at
> least one block to the requester and put some multiple in the FSM.)
>
> 2. Other option can be that we analyse the current load on the extend and
> then speed up or slow down the extending speed.
> Simple algorithm can look like this
I have tried the approach of group extend:

1. We convert the extension lock to a try-lock, and if we get the lock then we extend by one block.
2. If we don't get the lock, then we use the group leader concept, where only one process extends for all. A slight change from ProcArrayGroupClearXid is that here, other than satisfying the requesting backends, we add some extra blocks to the FSM, say GroupSize*10.
3. So actually we cannot get the exact load, but we still have a factor, the group size, that tells us the contention, and we extend in a multiple of that.
Performance Analysis:
---------------------
Performance is scaling with this approach. It is slightly lower than the previous patch, where we directly give the extend_by_blocks parameter and extend by multiple blocks; I think that's expected, because in the group-extend case we only add extra blocks after contention actually happens on the lock, whereas the former case extends extra blocks optimistically.
Test1(COPY)
-----
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script
./pgbench -f copy_script -T 300 -c$ -j$ postgres
Shared Buffer:8GB max_wal_size:10GB Storage:Magnetic Disk WAL:SSD
-----------------------------------------------------------------------------------------------
Client   Base   multi_extend by 20 pages   group_extend patch (groupsize*10)
1         105           157                         149
2         217           255                         282
4         210           494                         452
8         166           702                         629
16        145           563                         561
32        124           480                         566
Test2(INSERT)
--------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int, b text)
echo "insert into data select * from test_data;" >> insert_script
shared_buffers=512MB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f insert_script -T 300 postgres
Client   Base   Multi-extend by 1000   group extend (group*10)   group extend (group*100)
1         117           118                     125                       122
2         111           140                     161                       159
4          51           190                     141                       187
8          43           254                     148                       172
16         40           243                     150                       173
On Fri, Mar 4, 2016 at 12:06 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have tried the approach of group extend,
>
> 1. We convert the extension lock to TryLock and if we get the lock then
> extend by one block.
> 2. If don't get the Lock then use the Group leader concept where only one
> process will extend for all, Slight change from ProcArrayGroupClear is that
> here other than satisfying the requested backend we Add some extra blocks in
> FSM, say GroupSize*10.
> 3. So Actually we can not get exact load but still we have some factor like
> group size tell us exactly the contention size and we extend in multiple of
> that.

This approach seems good to me, and the performance results look very positive. The nice thing about this is that there is not a user-configurable knob; the system automatically determines when larger extensions are needed, which will mean that real-world users are much more likely to benefit from this. I don't think it matters that this is a little faster or slower than an approach with a manual knob; what matters is that it is a huge improvement over unpatched master, and that it does not need a knob. The arbitrary constant of 10 is a little unsettling but I think we can live with it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> This approach seems good to me, and the performance results look very
> positive. The nice thing about this is that there is not a
> user-configurable knob; the system automatically determines when
> larger extensions are needed, which will mean that real-world users
> are much more likely to benefit from this. I don't think it matters
> that this is a little faster or slower than an approach with a manual
> knob; what matter is that it is a huge improvement over unpatched
> master, and that it does not need a knob. The arbitrary constant of
> 10 is a little unsettling but I think we can live with it.

+1. "No knob" is a huge win.

regards, tom lane
>
> On Fri, Mar 4, 2016 at 12:06 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have tried the approach of group extend,
> >
> > 1. We convert the extension lock to TryLock and if we get the lock then
> > extend by one block.2.
> > 2. If don't get the Lock then use the Group leader concep where only one
> > process will extend for all, Slight change from ProcArrayGroupClear is that
> > here other than satisfying the requested backend we Add some extra blocks in
> > FSM, say GroupSize*10.
> > 3. So Actually we can not get exact load but still we have some factor like
> > group size tell us exactly the contention size and we extend in multiple of
> > that.
>
> This approach seems good to me, and the performance results look very
> positive. The nice thing about this is that there is not a
> user-configurable knob; the system automatically determines when
> larger extensions are needed, which will mean that real-world users
> are much more likely to benefit from this.
+{
+ volatile PROC_HDR *procglobal = ProcGlobal;
+ uint32 nextidx;
+ uint32 wakeidx;
+ int extraWaits = -1;
+ BlockNumber targetBlock;
+ int count = 0;
+
+ /* Add ourselves to the list of processes needing a group extend. */
+ proc->groupExtendMember = true;
..
..
+ /* We are the leader. Acquire the lock on behalf of everyone. */
+ LockRelationForExtension(relation, ExclusiveLock);
To provide it for multiple relations, I think you need to advertise the reloid for the relation in each proc and then get the relation descriptor for the relation extension lock.
On Fri, Mar 4, 2016 at 11:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think one thing which needs more thoughts about this approach is that we
> need to maintain some number of slots so that group extend for different
> relations can happen in parallel. Do we want to provide simultaneous
> extension for 1, 2, 3, 4, 5 or more number of relations? I think providing
> it for three or four relations should be okay as higher the number we want
> to provide, bigger the size of PGPROC structure will be.

Hmm. Can we drive this off of the heavyweight lock manager's idea of how big the relation extension lock wait queue is, instead of adding more stuff to PGPROC?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Fri, Mar 4, 2016 at 11:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think one thing which needs more thoughts about this approach is that we
> > need to maintain some number of slots so that group extend for different
> > relations can happen in parallel. Do we want to provide simultaneous
> > extension for 1, 2, 3, 4, 5 or more number of relations? I think providing
> > it for three or four relations should be okay as higher the number we want
> > to provide, bigger the size of PGPROC structure will be.
>
> Hmm. Can we drive this off of the heavyweight lock manager's idea of
> how big the relation extension lock wait queue is, instead of adding
> more stuff to PGPROC?
>
On Tue, Mar 8, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Hmm. Can we drive this off of the heavyweight lock manager's idea of
>> how big the relation extension lock wait queue is, instead of adding
>> more stuff to PGPROC?
>
> One idea to make it work without adding additional stuff in PGPROC is that
> after acquiring relation extension lock, check if there is any available
> block in fsm, if it founds any block, then release the lock and proceed,
> else extend the relation by one block and then check lock's wait queue size
> or number of lock requests (nRequested) and extend the relation further in
> proportion to wait queue size and then release the lock and proceed. Here,
> I think we can check for wait queue size even before extending the relation
> by one block.
>
> The benefit of doing it with PGPROC is that there will be relatively less
> number LockAcquire calls as compare to heavyweight lock approach, which I
> think should not matter much because we are planing to extend the relation
> in proportion to wait queue size (probably wait queue size * 10).

I don't think switching relation extension from heavyweight locks to lightweight locks is going to work. It would mean, for example, that we lose the ability to service interrupts while extending a relation; not to mention that we lose scalability if many relations are being extended at once.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Tue, Mar 8, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> Hmm. Can we drive this off of the heavyweight lock manager's idea of
> >> how big the relation extension lock wait queue is, instead of adding
> >> more stuff to PGPROC?
> >
> > One idea to make it work without adding additional stuff in PGPROC is that
> > after acquiring relation extension lock, check if there is any available
> > block in fsm, if it founds any block, then release the lock and proceed,
> > else extend the relation by one block and then check lock's wait queue size
> > or number of lock requests (nRequested) and extend the relation further in
> > proportion to wait queue size and then release the lock and proceed. Here,
> > I think we can check for wait queue size even before extending the relation
> > by one block.
> >
> > The benefit of doing it with PGPROC is that there will be relatively less
> > number LockAcquire calls as compare to heavyweight lock approach, which I
> > think should not matter much because we are planing to extend the relation
> > in proportion to wait queue size (probably wait queue size * 10).
>
> I don't think switching relation extension from heavyweight locks to
> lightweight locks is going to work.
On Tue, Mar 8, 2016 at 11:20 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have come up with this patch..
>
> If this approach looks fine then I will prepare final patch (more comments,
> indentation, and improve some code) and do some long run testing (current
> results are 5 mins run).
>
> Idea is same what Robert and Amit suggested up thread.

So this seems promising, but I think the code needs some work.
LockWaiterCount() bravely accesses a shared memory data structure that
is mutable with no locking at all. That might actually be safe for
our purposes, but I think it would be better to take the partition
lock in shared mode if it doesn't cost too much performance. If
that's too expensive, then it should at least have a comment
explaining (1) that it is doing this without the lock and (2) why
that's safe (sketch: the LOCK can't go away because we hold it, and
nRequested could change but we don't mind a stale value, and a 32-bit
read won't be torn).
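For instance, taking the partition lock in shared mode might look roughly like this (a sketch along lock.c lines, not necessarily what the final function should be):

    int
    LockWaiterCount(LOCKTAG *locktag)
    {
        uint32   hashcode = LockTagHashCode(locktag);
        LWLock  *partitionLock = LockHashPartitionLock(hashcode);
        LOCK    *lock;
        bool     found;
        int      waiters = 0;

        LWLockAcquire(partitionLock, LW_SHARED);
        lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
                                                    (void *) locktag,
                                                    hashcode,
                                                    HASH_FIND,
                                                    &found);
        if (found)
            waiters = lock->nRequested;     /* a slightly stale value is fine */
        LWLockRelease(partitionLock);

        return waiters;
    }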
A few of the other functions in here also lack comments, and perhaps
should have them.
The changes to RelationGetBufferForTuple need a visit from the
refactoring police. Look at the way you are calling
RelationAddOneBlock. The logic looks about like this:
    if (needLock)
    {
        if (trylock relation for extension)
            RelationAddOneBlock();
        else
        {
            lock relation for extension;
            if (use fsm)
            {
                complicated;
            }
            else
                RelationAddOneBlock();
        }
    }
    else
        RelationAddOneBlock();
So basically you want to do the RelationAddOneBlock() thing if
!needLock || !use_fsm || can't trylock. See if you can rearrange the
code so that there's only one fallthrough call to
RelationAddOneBlock() instead of three separate ones.
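That is, something shaped more like this (pseudocode only; "got_page_already" is just a placeholder for whatever the FSM path produces):

    /* Decide up front whether the one-block fallthrough applies. */
    extend_one_block = true;

    if (needLock && !trylock relation for extension)
    {
        lock relation for extension;
        if (use fsm)
        {
            /* complicated: recheck FSM, possibly bulk-extend for waiters;
             * this may already leave us with a usable page */
            extend_one_block = !got_page_already;
        }
    }

    if (extend_one_block)
        RelationAddOneBlock();      /* the single fallthrough call */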
Also, consider moving the code that adds multiple blocks at a time to
its own function instead of including it all in line.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/03/16 02:53, Dilip Kumar wrote:
> Attaching a latest patch.

Hmm, why did you remove the comment above the call to UnlockRelationForExtension? It still seems relevant, maybe with some minor modification?

Also there is a bit of whitespace mess inside the conditional lock block.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Thanks for the comments.

> Hmm, why did you remove the comment above the call to UnlockRelationForExtension?
> It still seems relevant, maybe with some minor modification?

While refactoring I lost this comment. Fixed it.

> Also there is a bit of whitespace mess inside the conditional lock block.

Fixed.
Results for a 10-minute run of COPY of 10000 records of size 4 bytes; script and configuration are the same as used upthread.
--------------------------------------------------------------------------------------------
Client   Base   Patch
1         105     111
2         217     246
4         210     428
8         166     653
16        145     808
32        124     988
Results for a 10-minute run of INSERT of 1000 records of size 1024 bytes (data doesn't fit in shared buffers).
--------------------------------------------------------------------------------------------------
Client   Base   Patch
1         117     120
2         111     126
4          51     130
8          43     147
16         40     209
32        ---     254
64        ---     205
--
On 10/03/16 09:57, Dilip Kumar wrote:
> On Thu, Mar 10, 2016 at 7:55 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>
> Thanks for the comments..
>
>> Hmm, why did you remove the comment above the call to
>> UnlockRelationForExtension?
>
> While re factoring I lose this comment.. Fixed it
>
>> It still seems relevant, maybe with some minor modification?
>>
>> Also there is a bit of whitespace mess inside the conditional lock block.
>
> Fixed
>
> I got the result of 10 mins run so posting it..
> Note: Base code results are copied from up thread...
>
> Results For 10 Mins run of COPY 10000 records of size 4 bytes; script and
> configuration are same as used in up thread
> --------------------------------------------------------------------------------------------
>
> Client   Base   Patch
> 1         105     111
> 2         217     246
> 4         210     428
> 8         166     653
> 16        145     808
> 32        124     988
> 64        ---     974
>
> Results For 10 Mins run of INSERT 1000 records of size 1024 bytes (data
> doesn't fit in shared buffer)
> --------------------------------------------------------------------------------------------------
>
> Client   Base   Patch
> 1         117     120
> 2         111     126
> 4          51     130
> 8          43     147
> 16         40     209
> 32        ---     254
> 64        ---     205

Those look good. The patch looks good in general now. I am bit scared by the lockWaiters * 20 as it can result in relatively big changes in rare corner cases when for example a lot of backends were waiting for lock on relation and suddenly all try to extend it. I wonder if we should clamp it to something sane (although what's sane today might be small in couple of years).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Fri, Mar 11, 2016 at 12:04 AM, Petr Jelinek <spandir="ltr"><<a href="mailto:petr@2ndquadrant.com" target="_blank">petr@2ndquadrant.com</a>></span> wrote:</div><divclass="gmail_quote"><br /></div><div class="gmail_quote">Thanks for looking..</div><div class="gmail_quote"><br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thoselook good. The patch looks good in general now. I am bit scared by the lockWaiters * 20 as itcan result in relatively big changes in rare corner cases when for example a lot of backends were waiting for lock on relationand suddenly all try to extend it. I wonder if we should clamp it to something sane (although what's sane today mightbe small in couple of years).</blockquote></div><div class="gmail_extra"><br /></div>But in such condition when allare waiting on lock, then at a time only one waiter will get the lock and that will easily count the lock waiter and extendin multiple of that. And we also have the check that if any waiter get the lock it will first check in FSM that anybodyelse have added one block or not..</div><div class="gmail_extra"><br /></div><div class="gmail_extra">And other waiterwill not get lock unless first waiter extend all blocks and release the locks.</div><div class="gmail_extra"><br /></div><divclass="gmail_extra">One possible case is as soon as we extend the blocks new requester directly find in FSM anddon't come for lock, and old waiter after getting lock don't find in FSM, But IMHO in such cases, also its good that otherwaiter also extend more blocks (because this can happen when request flow is very high).</div><div class="gmail_extra"><br/></div><div class="gmail_extra"><br /></div><div class="gmail_extra">-- <br /><div class="gmail_signature"><divdir="ltr"><span style="color:rgb(80,0,80);font-size:12.8px">Regards,</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">Dilip Kumar</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">EnterpriseDB: </span><a href="http://www.enterprisedb.com/"style="color:rgb(17,85,204);font-size:12.8px" target="_blank">http://www.enterprisedb.com</a><br/></div></div></div></div>
On 11/03/16 02:44, Dilip Kumar wrote:
> On Fri, Mar 11, 2016 at 12:04 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>
> Thanks for looking..
>
>> Those look good. The patch looks good in general now. I am bit scared
>> by the lockWaiters * 20 as it can result in relatively big changes in
>> rare corner cases when for example a lot of backends were waiting for
>> lock on relation and suddenly all try to extend it. I wonder if we
>> should clamp it to something sane (although what's sane today might be
>> small in couple of years).
>
> But in such condition when all are waiting on lock, then at a time only
> one waiter will get the lock and that will easily count the lock waiter
> and extend in multiple of that. And we also have the check that if any
> waiter get the lock it will first check in FSM that anybody else have
> added one block or not..

I am not talking about extension locks, the lock queue can be long because there is concurrent DDL for example and then once DDL finishes suddenly 100 connections that tried to insert into table will try to get extension lock and this will add 2000 new pages when much fewer was actually needed. I guess that's fine as it's corner case and it's only 16MB even in such extreme case.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 10, 2016 at 8:54 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
> I am not talking about extension locks, the lock queue can be long because
> there is concurrent DDL for example and then once DDL finishes suddenly 100
> connections that tried to insert into table will try to get extension lock
> and this will add 2000 new pages when much fewer was actually needed. I
> guess that's fine as it's corner case and it's only 16MB even in such
> extreme case.

I don't really understand this part about concurrent DDL. If there were concurrent DDL going on, presumably other backends would be blocked on the relation lock, not the relation extension lock - and it doesn't seem likely that you'd often have a huge pile-up of inserters waiting on concurrent DDL. But I guess it could happen.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/03/16 22:29, Robert Haas wrote:
> On Thu, Mar 10, 2016 at 8:54 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>> I am not talking about extension locks, the lock queue can be long because
>> there is concurrent DDL for example and then once DDL finishes suddenly 100
>> connections that tried to insert into table will try to get extension lock
>> and this will add 2000 new pages when much fewer was actually needed. I
>> guess that's fine as it's corner case and it's only 16MB even in such
>> extreme case.
>
> I don't really understand this part about concurrent DDL. If there
> were concurrent DDL going on, presumably other backends would be
> blocked on the relation lock, not the relation extension lock - and it
> doesn't seem likely that you'd often have a huge pile-up of inserters
> waiting on concurrent DDL. But I guess it could happen.

Yeah I was thinking about the latter part and as I said it's very rare case, but I did see something similar couple of times in the wild. It's not objection against committing this patch though, in fact I think it can be committed as is.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3/11/16 5:14 PM, Petr Jelinek wrote:
>> I don't really understand this part about concurrent DDL. If there
>> were concurrent DDL going on, presumably other backends would be
>> blocked on the relation lock, not the relation extension lock - and it
>> doesn't seem likely that you'd often have a huge pile-up of inserters
>> waiting on concurrent DDL. But I guess it could happen.
>
> Yeah I was thinking about the latter part and as I said it's very rare
> case, but I did see something similar couple of times in the wild. It's
> not objection against committing this patch though, in fact I think it
> can be committed as is.

FWIW, this is definitely a real possibility in any shop that has very high downtime costs and high transaction rates.

I also think some kind of clamp is a good idea. It's not that uncommon to run max_connections significantly higher than 100, so the extension could be way larger than 16MB. In those cases this patch could actually make things far worse as everyone backs up waiting on the OS to extend many MB when all you actually needed were a couple dozen more pages.

BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent demand for extension that means you're running lots of small DML operations, not really big ones. I'd think that would make *1 more appropriate.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
> FWIW, this is definitely a real possibility in any shop that has very high downtime costs and high transaction rates.
>
> I also think some kind of clamp is a good idea. It's not that uncommon to run max_connections significantly higher than 100, so the extension could be way larger than 16MB. In those cases this patch could actually make things far worse as everyone backs up waiting on the OS to extend many MB when all you actually needed were a couple dozen more pages.

I agree, we can have some max limit on the number of extra pages. What do others think?

> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent demand for extension that means you're running lots of small DML operations, not really big ones. I'd think that would make *1 more appropriate.

*1 will not solve this problem. Here the main problem was many backends sleeping/waking up on the extension lock, and that was causing the bottleneck. With *1 we only satisfy the current requesters which have already waited on the lock, but our goal is to stop backends from requesting this lock in the first place.

The idea of finding the number of requesters is to get a statistic on the load on this lock, and extend in a multiple of that load, so that this situation is avoided for a long time; when it happens again, we again extend in a multiple of the load.

How did 20 come about? I tested with multiple client loads (1..64), with multiple load sizes (4-byte records to 1KB records), COPY and INSERT, and found 20 works best.
>
>
> On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>
>> FWIW, this is definitely a real possibility in any shop that has very high downtime costs and high transaction rates.
>>
>> I also think some kind of clamp is a good idea. It's not that uncommon to run max_connections significantly higher than 100, so the extension could be way larger than 16MB. In those cases this patch could actually make things far worse as everyone backs up waiting on the OS to extend many MB when all you actually needed were a couple dozen more pages.
>
>
> I agree, We can have some max limit on number of extra pages, What other thinks ?
>
>
>>
>> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent demand for extension that means you're running lots of small DML operations, not really big ones. I'd think that would make *1 more appropriate.
>
>
> *1 will not solve this problem, Here the main problem was many people are sleep/wakeup on the extension lock and that was causing the bottleneck. So if we do *1 this will satisfy only current requesters which has already waited on the lock. But our goal is to avoid backends from requesting this lock.
>
> Idea of Finding the requester to get the statistics on this locks (load on the lock) and extend in multiple of load so that in future this situation will be avoided for long time and again when happen next time extend in multiple of load.
>
> How 20 comes ?
> I tested with Multiple clients loads 1..64, with multiple load size 4 byte records to 1KB Records, COPY/ INSERT and found 20 works best.
>
Can you post the numbers for 1, 5, 10, 15, 25 or whatever other multiplier you have tried, so that it is clear that 20 is best?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 12/03/16 01:01, Jim Nasby wrote:
> On 3/11/16 5:14 PM, Petr Jelinek wrote:
>>> I don't really understand this part about concurrent DDL. If there
>>> were concurrent DDL going on, presumably other backends would be
>>> blocked on the relation lock, not the relation extension lock - and it
>>> doesn't seem likely that you'd often have a huge pile-up of inserters
>>> waiting on concurrent DDL. But I guess it could happen.
>>
>> Yeah I was thinking about the latter part and as I said it's very rare
>> case, but I did see something similar couple of times in the wild. It's
>> not objection against committing this patch though, in fact I think it
>> can be committed as is.
>
> FWIW, this is definitely a real possibility in any shop that has very
> high downtime costs and high transaction rates.
>
> I also think some kind of clamp is a good idea. It's not that uncommon
> to run max_connections significantly higher than 100, so the extension
> could be way larger than 16MB. In those cases this patch could actually
> make things far worse as everyone backs up waiting on the OS to extend
> many MB when all you actually needed were a couple dozen more pages.

Well yeah I've seen 10k, but not everything will write to same table, wanted to stay in realms of something that has realistic chance of happening.

> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent
> demand for extension that means you're running lots of small DML
> operations, not really big ones. I'd think that would make *1 more
> appropriate.

The benchmarks I've seen showed you want at least *10 and *20 was better.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12/03/16 03:46, Dilip Kumar wrote:
> On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>
>> FWIW, this is definitely a real possibility in any shop that has
>> very high downtime costs and high transaction rates.
>>
>> I also think some kind of clamp is a good idea. It's not that
>> uncommon to run max_connections significantly higher than 100, so
>> the extension could be way larger than 16MB. In those cases this
>> patch could actually make things far worse as everyone backs up
>> waiting on the OS to extend many MB when all you actually needed
>> were a couple dozen more pages.
>
> I agree, We can have some max limit on number of extra pages, What other
> thinks ?

Well, that's what I meant with clamping originally. I don't know what is a good value though.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Sat, Mar 12, 2016 at 8:37 AM, Amit Kapila <spandir="ltr"><<a href="mailto:amit.kapila16@gmail.com" target="_blank">amit.kapila16@gmail.com</a>></span> wrote:<br/><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Canyou post the numbers for 1, 5, 10, 15, 25 or whatever other multiplier you have tried,so that it is clear that 20 is best?</blockquote></div><br /></div><div class="gmail_extra">I had Tried with 1, 10,20 and 50.<br /><br /></div><div class="gmail_extra">1. With base code it was almost the same as base code.<br /></div><divclass="gmail_extra"><br />2. With 10 thread data it matching with my previous group extend patch data postedupthread<br /><a href="http://www.postgresql.org/message-id/CAFiTN-tyEu+Wf0-jBc3TGfCoHdEAjNTx=WVuxpoA1vDDyST6KQ@mail.gmail.com">http://www.postgresql.org/message-id/CAFiTN-tyEu+Wf0-jBc3TGfCoHdEAjNTx=WVuxpoA1vDDyST6KQ@mail.gmail.com</a><br /><br/></div><div class="gmail_extra">3. Beyond 20 with 50 I did not see any extra benefit in performance number (comparedto 20 when tested 50 with 4 byte COPY I did not tested other data size with 50.).<br /><br />-- <br /><div class="gmail_signature"><divdir="ltr"><span style="color:rgb(80,0,80);font-size:12.8px">Regards,</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">Dilip Kumar</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">EnterpriseDB: </span><a href="http://www.enterprisedb.com/"style="color:rgb(17,85,204);font-size:12.8px" target="_blank">http://www.enterprisedb.com</a><br/></div></div></div></div>
On 3/11/16 9:57 PM, Petr Jelinek wrote: >> I also think some kind of clamp is a good idea. It's not that >> uncommon to run max_connections significantly higher than 100, so >> the extension could be way larger than 16MB. In those cases this >> patch could actually make things far worse as everyone backs up >> waiting on the OS to extend many MB when all you actually needed >> were a couple dozen more pages. >> >> >> I agree, We can have some max limit on number of extra pages, What other >> thinks ? >> > > Well, that's what I meant with clamping originally. I don't know what is > a good value though. Well, 16MB is 2K pages, which is what you'd get if 100 connections were all blocked and we're doing 20 pages per waiter. That seems like a really extreme scenario, so maybe 4MB is a good compromise. That's unlikely to be hit in most cases, unlikely to put a ton of stress on IO, even with magnetic media (assuming the whole 4MB is queued to write in one shot...). 4MB would still reduce the number of locks by 500x. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
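To make the arithmetic above concrete, here is a small standalone sketch (assuming the default 8KB block size and the 20-blocks-per-waiter multiplier under discussion; illustrative only, not code from the patch):

#include <stdio.h>

#define BLOCKS_PER_WAITER 20
#define CLAMP_BLOCKS      512     /* the proposed 4MB cap: 512 * 8KB */

int
main(void)
{
    int waiters[] = {25, 100};

    for (int i = 0; i < 2; i++)
    {
        int blocks = waiters[i] * BLOCKS_PER_WAITER;
        int clamped = blocks > CLAMP_BLOCKS ? CLAMP_BLOCKS : blocks;

        /* ~100 blocked backends * 20 blocks * 8KB is roughly the 16MB figure above */
        printf("%3d waiters -> %4d blocks -> %.1f MB (with 512-block clamp: %.1f MB)\n",
               waiters[i], blocks, blocks * 8.0 / 1024.0, clamped * 8.0 / 1024.0);
    }
    return 0;
}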
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Mon, Mar 14, 2016 at 5:02 AM, Jim Nasby <span dir="ltr"><<ahref="mailto:Jim.Nasby@bluetreble.com" target="_blank">Jim.Nasby@bluetreble.com</a>></span> wrote:<br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Well, 16MBis 2K pages, which is what you'd get if 100 connections were all blocked and we're doing 20 pages per waiter. That seemslike a really extreme scenario, so maybe 4MB is a good compromise. That's unlikely to be hit in most cases, unlikelyto put a ton of stress on IO, even with magnetic media (assuming the whole 4MB is queued to write in one shot...).4MB would still reduce the number of locks by 500x.</blockquote></div><br /></div><div class="gmail_extra">In myperformance results given up thread, we are getting max performance at 32 clients, means at a time we are extending 32*20~= max (600) pages at a time. So now with 4MB limit (max 512 pages) Results will looks similar. So we need to take adecision whether 4MB is good limit, should I change it ?<br /><br /><br />-- <br /><div class="gmail_signature"><div dir="ltr"><spanstyle="color:rgb(80,0,80);font-size:12.8px">Regards,</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">Dilip Kumar</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">EnterpriseDB: </span><a href="http://www.enterprisedb.com/"style="color:rgb(17,85,204);font-size:12.8px" target="_blank">http://www.enterprisedb.com</a><br/></div></div></div></div>
On 14/03/16 03:29, Dilip Kumar wrote: > > On Mon, Mar 14, 2016 at 5:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com > <mailto:Jim.Nasby@bluetreble.com>> wrote: > > Well, 16MB is 2K pages, which is what you'd get if 100 connections > were all blocked and we're doing 20 pages per waiter. That seems > like a really extreme scenario, so maybe 4MB is a good compromise. > That's unlikely to be hit in most cases, unlikely to put a ton of > stress on IO, even with magnetic media (assuming the whole 4MB is > queued to write in one shot...). 4MB would still reduce the number > of locks by 500x. > > > In my performance results given up thread, we are getting max > performance at 32 clients, means at a time we are extending 32*20 ~= max > (600) pages at a time. So now with 4MB limit (max 512 pages) Results > will looks similar. So we need to take a decision whether 4MB is good > limit, should I change it ? > > Well any value we choose will be very arbitrary. If we look at it from the point of maximum absolute disk space we allocate for relation at once, the 4MB limit would represent 2.5 orders of magnitude change. That sounds like enough for one release cycle, I think we can further tune it if the need arises in next one. (with my love for round numbers I would have suggested 8MB as that's 3 orders of magnitude, but I am fine with 4MB as well) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Well any value we choose will be very arbitrary. If we look at it from the point of maximum absolute disk space we allocate for relation at once, the 4MB limit would represent 2.5 orders of magnitude change. That sounds like enough for one release cycle, I think we can further tune it if the need arises in next one. (with my love for round numbers I would have suggested 8MB as that's 3 orders of magnitude, but I am fine with 4MB as well)
Attachment
On 17/03/16 04:42, Dilip Kumar wrote: > > On Mon, Mar 14, 2016 at 8:26 AM, Petr Jelinek <petr@2ndquadrant.com > <mailto:petr@2ndquadrant.com>> wrote: > > Well any value we choose will be very arbitrary. If we look at it > from the point of maximum absolute disk space we allocate for > relation at once, the 4MB limit would represent 2.5 orders of > magnitude change. That sounds like enough for one release cycle, I > think we can further tune it if the need arises in next one. (with > my love for round numbers I would have suggested 8MB as that's 3 > orders of magnitude, but I am fine with 4MB as well) > > > I have modified the patch, this contains the max limit on extra pages, > 512(4MB) pages is the max limit. > > I have measured the performance also and that looks equally good. > Great. Just small notational thing, maybe this would be simpler?: extraBlocks = Min(512, lockWaiters * 20); -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Great.
Just small notational thing, maybe this would be simpler?:
extraBlocks = Min(512, lockWaiters * 20);
Attachment
>
>
> On Thu, Mar 17, 2016 at 1:31 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>>
>> Great.
>>
>> Just small notational thing, maybe this would be simpler?:
>> extraBlocks = Min(512, lockWaiters * 20);
>
>
> Done, new patch attached.
>
Review comments:
1.
+/*
+ * RelationAddOneBlock
+ *
+ * Extend relation by one block and lock the buffer
+ */
+static Buffer
+RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate)

Shall we mention in comments that this API returns a locked buffer and it's the responsibility of the caller to unlock it?

2.
+ /* To avoid the cases when there are huge number of lock waiters, and
+  * extend file size by big amount at a time, put some limit on the

The first line in multi-line comments should not contain anything.

3.
+ extraBlocks = Min(512, lockWaiters * 20);

I think we should explain in the comments the reason for choosing 20 in the above calculation.

4. Sometime back [1], you agreed on doing some analysis of the overhead that XLogFlush can cause during buffer eviction, but I don't see the results of that; are you planning to complete it?

5.
+ if (LOCKACQUIRE_OK
+     != RelationExtensionLockConditional(relation, ExclusiveLock))

I think the coding style is to keep the constant on the right side of the condition; did you see any other place in the code which uses the check in a similar way?

6.
- /*
-  * Now acquire lock on the new page.
+ /* For all case we need to add at least one block to satisfy our
+  * own request.
+  */

Same problem as in point 2.

7.
@@ -472,6 +581,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 if (needLock)
 UnlockRelationForExtension(relation, ExclusiveLock);
+
+

Spurious newline addition.

8.
+int
+RelationExtensionLockWaiter(Relation relation)

How about naming this function RelationExtensionLockWaiterCount?

9.
+ /* Other waiter has extended the block for us*/

Provide an extra space before ending the comment.

10.
+ if (use_fsm)
+ {
+     if (lastValidBlock != InvalidBlockNumber)
+     {
+         targetBlock = RecordAndGetPageWithFreeSpace(relation,
+                                                     lastValidBlock,
+                                                     pageFreeSpace,
+                                                     len + saveFreeSpace);
+     }

Are you using RecordAndGetPageWithFreeSpace() instead of GetPageWithFreeSpace() to get a page close to the previous target page? If yes, then do you see enough benefit from it to compensate for the additional write operation which the Record* API might cause due to additional dirtying of the buffer?

Here we are calling RecordAndGetPageWithFreeSpace instead of GetPageWithFreeSpace because another backend which got the lock might have added extra blocks in the FSM, and it's possible that the FSM tree has not been updated so far, so we would not find the page by searching from the top using GetPageWithFreeSpace; hence we need to search the leaf page directly using our last valid target block.

Explained the same in the comments...

11.
+ {
+     /*
+      * First try to get the lock in no-wait mode, if succeed extend one
+      * block, else get the lock in normal mode and after we get the lock
-     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+     buffer = RelationAddOneBlock(relation, otherBuffer, bistate);

Won't this cause one extra block addition after the backend extends the relation for multiple blocks? What is the need for that?

12. I think it would be good to run pgbench read-write tests once to ensure that this doesn't introduce any new regression.
Attachment
On 22/03/16 10:15, Dilip Kumar wrote: > > On Mon, Mar 21, 2016 at 8:10 PM, Amit Kapila <amit.kapila16@gmail.com > <mailto:amit.kapila16@gmail.com>> wrote: > 11. > +{ > +/* > +* First try to get the lock in no-wait mode, if succeed extend one > + > * block, else get the lock in normal mode and after we get the lock > -LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); > +buffer = > RelationAddOneBlock(relation, otherBuffer, bistate); > > Won't this cause one extra block addition after backend extends the > relation for multiple blocks, what is the need of same? > > > This is the block for this backend, Extra extend for future request and > already added to FSM. I could have added this count along with extra > block in RelationAddExtraBlocks, But In that case I need to put some > extra If for saving one buffer for this bakend and then returning that > the specific buffer to caller, and In caller also need to distinguish > between who wants to add one block or who have got one block added in > along with extra block. > > I think this way code is simple.. That everybody comes down will add one > block for self use. and all other functionality and logic is above, i.e. > wether to take lock or not, whether to add extra blocks or not.. > > I also think the code simplicity makes this worth it. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 22, 2016 at 1:12 PM, Petr Jelinek <petr@2ndquadrant.com> wrote: > I also think the code simplicity makes this worth it. Agreed. I went over this patch and did a cleanup pass today. I discovered that the LockWaiterCount() function was broken if you try to tried to use it on a lock that you didn't hold or a lock that you held in any mode other than exclusive, so I tried to fix that. I rewrote a lot of the comments and tightened some other things up. The result is attached. I'm baffled by the same code Amit asked about upthread, even though there's now a comment: + /* + * Here we are calling RecordAndGetPageWithFreeSpace + * instead of GetPageWithFreeSpace, because other backend + * who have got the lock might have added extra blocks in + * the FSM and its possible that FSM tree might not have + * been updated so far and we will not get the page by + * searching from top using GetPageWithFreeSpace, so we + * need to search the leaf page directly using our last + * valid target block. + * + * XXX. I don't understand what is happening here. -RMH + */ I've read this over several times and looked at RecordAndGetPageWithFreeSpace() and I'm still confused. First of all, if the lock was acquired by some other backend which did RelationAddExtraBlocks(), it *will* have updated the FSM - that's the whole point. Second, if the other backend extended the relation in some other manner and did not extend the FSM, how does calling RecordAndGetPageWithFreeSpace help? As far as I can see, GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just searching the FSM, so if one is stymied the other will be too. What am I missing? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 23/03/16 19:39, Robert Haas wrote: > On Tue, Mar 22, 2016 at 1:12 PM, Petr Jelinek <petr@2ndquadrant.com> wrote: >> I also think the code simplicity makes this worth it. > > Agreed. I went over this patch and did a cleanup pass today. I > discovered that the LockWaiterCount() function was broken if you try > to tried to use it on a lock that you didn't hold or a lock that you > held in any mode other than exclusive, so I tried to fix that. I > rewrote a lot of the comments and tightened some other things up. The > result is attached. > > I'm baffled by the same code Amit asked about upthread, even though > there's now a comment: > > + /* > + * Here we are calling > RecordAndGetPageWithFreeSpace > + * instead of GetPageWithFreeSpace, > because other backend > + * who have got the lock might have > added extra blocks in > + * the FSM and its possible that FSM > tree might not have > + * been updated so far and we will not > get the page by > + * searching from top using > GetPageWithFreeSpace, so we > + * need to search the leaf page > directly using our last > + * valid target block. > + * > + * XXX. I don't understand what is > happening here. -RMH > + */ > > I've read this over several times and looked at > RecordAndGetPageWithFreeSpace() and I'm still confused. First of all, > if the lock was acquired by some other backend which did > RelationAddExtraBlocks(), it *will* have updated the FSM - that's the > whole point. That's good point, maybe this coding is bit too defensive. > Second, if the other backend extended the relation in > some other manner and did not extend the FSM, how does calling > RecordAndGetPageWithFreeSpace help? As far as I can see, > GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just > searching the FSM, so if one is stymied the other will be too. What > am I missing? > The RecordAndGetPageWithFreeSpace will extend FSM as it calls fsm_set_and_search which in turn calls fsm_readbuf with extend = true. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Mar 23, 2016 at 2:52 PM, Petr Jelinek <petr@2ndquadrant.com> wrote: >> Second, if the other backend extended the relation in >> some other manner and did not extend the FSM, how does calling >> RecordAndGetPageWithFreeSpace help? As far as I can see, >> GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just >> searching the FSM, so if one is stymied the other will be too. What >> am I missing? >> > > The RecordAndGetPageWithFreeSpace will extend FSM as it calls > fsm_set_and_search which in turn calls fsm_readbuf with extend = true. So how does that help? If I'm reading this right, the new block will be all zeroes which means no space available on any of those pages. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 23/03/16 20:01, Robert Haas wrote: > On Wed, Mar 23, 2016 at 2:52 PM, Petr Jelinek <petr@2ndquadrant.com> wrote: >>> Second, if the other backend extended the relation in >>> some other manner and did not extend the FSM, how does calling >>> RecordAndGetPageWithFreeSpace help? As far as I can see, >>> GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just >>> searching the FSM, so if one is stymied the other will be too. What >>> am I missing? >>> >> >> The RecordAndGetPageWithFreeSpace will extend FSM as it calls >> fsm_set_and_search which in turn calls fsm_readbuf with extend = true. > > So how does that help? If I'm reading this right, the new block will > be all zeroes which means no space available on any of those pages. > I am bit confused as to what exactly you are saying, but what will happen is we get back to the while cycle and try again so eventually we should find either block with enough free space or add new one (not sure if this would actually ever happen in practice in heavily concurrent workload where the FSM would not be correctly extended during relation extension though, we might just loop here forever). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 23/03/16 20:19, Petr Jelinek wrote: > On 23/03/16 20:01, Robert Haas wrote: >> On Wed, Mar 23, 2016 at 2:52 PM, Petr Jelinek <petr@2ndquadrant.com> >> wrote: >>>> Second, if the other backend extended the relation in >>>> some other manner and did not extend the FSM, how does calling >>>> RecordAndGetPageWithFreeSpace help? As far as I can see, >>>> GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just >>>> searching the FSM, so if one is stymied the other will be too. What >>>> am I missing? >>>> >>> >>> The RecordAndGetPageWithFreeSpace will extend FSM as it calls >>> fsm_set_and_search which in turn calls fsm_readbuf with extend = true. >> >> So how does that help? If I'm reading this right, the new block will >> be all zeroes which means no space available on any of those pages. >> > > I am bit confused as to what exactly you are saying, but what will > happen is we get back to the while cycle and try again so eventually we > should find either block with enough free space or add new one (not sure > if this would actually ever happen in practice in heavily concurrent > workload where the FSM would not be correctly extended during relation > extension though, we might just loop here forever). > Btw thinking about it some more, ISTM that not finding the block and just doing the extension if the FSM wasn't extended correctly previously is probably cleaner behavior than what we do now. The reasoning for that opinion is that if the FSM wasn't extended, we'll fix it by doing relation extension since we know we do both in this code path and also if we could not find page before we'll most likely not find one even on retry and if there was page added at the end by extension that we might reuse partially here then there is no harm in adding new one anyway as the whole point of this patch is that it does bigger extension that strictly necessary so insisting on page reuse for something that seems like only theoretical possibility that does not even exist in current code does not seem right. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Mar 23, 2016 at 3:33 PM, Petr Jelinek <petr@2ndquadrant.com> wrote: > Btw thinking about it some more, ISTM that not finding the block and just > doing the extension if the FSM wasn't extended correctly previously is > probably cleaner behavior than what we do now. The reasoning for that > opinion is that if the FSM wasn't extended, we'll fix it by doing relation > extension since we know we do both in this code path and also if we could > not find page before we'll most likely not find one even on retry and if > there was page added at the end by extension that we might reuse partially > here then there is no harm in adding new one anyway as the whole point of > this patch is that it does bigger extension that strictly necessary so > insisting on page reuse for something that seems like only theoretical > possibility that does not even exist in current code does not seem right. I'm not sure I completely follow this. The fact that the last sentence is 9 lines long may be related. :-) I think it's pretty clearly important to re-check the FSM after acquiring the extension lock. Otherwise, imagine that 25 backends arrive at the exact same time. The first one gets the lock and extends the relation 500 pages; the next one, 480, and so on. In total, they extend the relation by 6500 pages, which is a bit rich. Rechecking the FSM after acquiring the lock prevents that from happening, and that's a very good thing. We'll do one 500-page extension and that's it. However, I don't think using RecordAndGetPageWithFreeSpace rather than GetPageWithFreeSpace is appropriate. We've already recorded free space on that page, and recording it again is a bad idea. It's quite possible that by the time we get the lock our old value is totally inaccurate. If there's some advantage to searching in the more targeted way that RecordAndGetPageWithFreeSpace does over GetPageWithFreeSpace then we need a new API into the freespace stuff that does the more targeted search without updating anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
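To make the pile-up arithmetic above concrete, a small standalone sketch (illustrative only; the 25 backends and 20-blocks-per-waiter figures are taken from the discussion):

#include <stdio.h>

int
main(void)
{
    int total = 0;

    /* Without the FSM re-check, each backend in turn still sees the queue
     * ahead of it and extends by 20 blocks per remaining waiter:
     * 500, 480, ..., 20. */
    for (int waiters = 25; waiters >= 1; waiters--)
        total += waiters * 20;

    printf("total blocks extended: %d\n", total);   /* prints 6500 */
    return 0;
}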
On 23/03/16 20:43, Robert Haas wrote: > On Wed, Mar 23, 2016 at 3:33 PM, Petr Jelinek <petr@2ndquadrant.com> wrote: >> Btw thinking about it some more, ISTM that not finding the block and just >> doing the extension if the FSM wasn't extended correctly previously is >> probably cleaner behavior than what we do now. The reasoning for that >> opinion is that if the FSM wasn't extended, we'll fix it by doing relation >> extension since we know we do both in this code path and also if we could >> not find page before we'll most likely not find one even on retry and if >> there was page added at the end by extension that we might reuse partially >> here then there is no harm in adding new one anyway as the whole point of >> this patch is that it does bigger extension that strictly necessary so >> insisting on page reuse for something that seems like only theoretical >> possibility that does not even exist in current code does not seem right. > > I'm not sure I completely follow this. The fact that the last > sentence is 9 lines long may be related. :-) > I tend to do that sometimes :) > I think it's pretty clearly important to re-check the FSM after > acquiring the extension lock. Otherwise, imagine that 25 backends > arrive at the exact same time. The first one gets the lock and > extends the relation 500 pages; the next one, 480, and so on. In > total, they extend the relation by 6500 pages, which is a bit rich. > Rechecking the FSM after acquiring the lock prevents that from > happening, and that's a very good thing. We'll do one 500-page > extension and that's it. > Right, but that would only happen if all the backends did it using different code which does not do the FSM extension because the current code does FSM extension and the point of using RecordAndGetPageWithFreeSpace seems to be "just in case" somebody is doing extension differently (at least I don't see other reason). So basically I am not saying we shouldn't do the search but that I agree GetPageWithFreeSpace should be enough as the worst that can happen is that we overextend the relation in case some theoretical code from somewhere else also did extension of relation without extending FSM (afaics). But maybe Dilip had some other reason for using the RecordAndGetPageWithFreeSpace that is not documented. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to > it, rather than from top, so even if RecordPageWithFreeSpace() doesn't > update till root, it will be able to search the newly added page. I agree > with whatever you have said in another mail that we should introduce a new > API to do a more targeted search for such cases. OK, let's do that, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
> it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
> update till root, it will be able to search the newly added page. I agree
> with whatever you have said in another mail that we should introduce a new
> API to do a more targeted search for such cases.
OK, let's do that, then.
Ok, I have added new API which just find the free block from and start search from last given page.
Attachment
On 24/03/16 07:04, Dilip Kumar wrote: > > On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas <robertmhaas@gmail.com > <mailto:robertmhaas@gmail.com>> wrote: > > On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila > <amit.kapila16@gmail.com <mailto:amit.kapila16@gmail.com>> wrote: > > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to > > it, rather than from top, so even if RecordPageWithFreeSpace() doesn't > > update till root, it will be able to search the newly added page. I agree > > with whatever you have said in another mail that we should introduce a new > > API to do a more targeted search for such cases. > > OK, let's do that, then. > > > Ok, I have added new API which just find the free block from and start > search from last given page. > > 1. I have named the new function as GetPageWithFreeSpaceUsingOldPage, I > don't like this name, but I could not come up with any better, Please > suggest one. > GetNearestPageWithFreeSpace? (although not sure that's accurate description, maybe Nearby would be better) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
>
> On 24/03/16 07:04, Dilip Kumar wrote:
>>
>>
>> On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas <robertmhaas@gmail.com
>> <mailto:robertmhaas@gmail.com>> wrote:
>>
>> On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila
>> <amit.kapila16@gmail.com <mailto:amit.kapila16@gmail.com>> wrote:
>> > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
>> > it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
>> > update till root, it will be able to search the newly added page. I agree
>> > with whatever you have said in another mail that we should introduce a new
>> > API to do a more targeted search for such cases.
>>
>> OK, let's do that, then.
>>
>>
>> Ok, I have added new API which just find the free block from and start
>> search from last given page.
>>
>> 1. I have named the new function as GetPageWithFreeSpaceUsingOldPage, I
>> don't like this name, but I could not come up with any better, Please
>> suggest one.
>>
>
>
> GetNearestPageWithFreeSpace? (although not sure that's accurate description, maybe Nearby would be better)
>
Better than what is used in patch.
1.
+GetPageWithFreeSpaceUsingOldPage(Relation rel, BlockNumber oldPage,
+                                 Size spaceNeeded)
{
..
+ /*
+  * If fsm_set_and_search found a suitable new block, return that.
+  * Otherwise, search as usual.
+  */
..
}

In the above comment, you are referring to the wrong function.

2.
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+    Buffer  buf;
+    int     newslot = -1;
+
+    buf = fsm_readbuf(rel, addr, true);
+    LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+    if (minValue != 0)
+    {
+        /* Search while we still hold the lock */
+        newslot = fsm_search_avail(buf, minValue,
+                                   addr.level == FSM_BOTTOM_LEVEL,
+                                   false);

In this new API, I don't understand why we need the minValue != 0 check; basically, if the user of the API doesn't want to search for space > 0, then what is the need of calling this API? I think this API should use an Assert for minValue != 0 unless you see a reason for not doing so.
>
> GetNearestPageWithFreeSpace? (although not sure that's accurate description, maybe Nearby would be better)
>
Better than what is used in patch. Yet another possibility could be to call it as GetPageWithFreeSpaceExtended and call it from GetPageWithFreeSpace with the value of oldPage as InvalidBlockNumber.
Yes I like this.. Changed the same.
Attachment
On Thu, Mar 24, 2016 at 7:17 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: >> Yet another possibility could be to call it as >> GetPageWithFreeSpaceExtended and call it from GetPageWithFreeSpace with >> value of oldPage as InvalidBlockNumber. > > Yes I like this.. Changed the same. After thinking about this some more, I don't think this is the right approach. I finally understand what's going on here: RecordPageWithFreeSpace updates the FSM lazily, only adjusting the leaves and not the upper levels. It relies on VACUUM to update the upper levels. This seems like it might be a bad policy in general, because VACUUM on a very large relation may be quite infrequent, and you could lose track of a lot of space for a long time, leading to a lot of extra bloat. However, it's a particularly bad policy for bulk relation extension, because you're stuffing a large number of totally free pages in there in a way that doesn't make them particularly easy for anybody else to discover. There are two ways we can fail here: 1. Callers who use GetPageWithFreeSpace() rather than GetPageFreeSpaceExtended() will fail to find the new pages if the upper map levels haven't been updated by VACUUM. 2. Even callers who use GetPageFreeSpaceExtended() may fail to find the new pages. This can happen in two separate ways, namely (a) the lastValidBlock saved by RelationGetBufferForTuple() can be in the middle of the relation someplace rather than near the end, or (b) the bulk-extension performed by some other backend can have overflowed onto some new FSM page that won't be searched even though a relatively plausible lastValidBlock was passed. It seems to me that since we're adding a whole bunch of empty pages at once, it's worth the effort to update the upper levels of the FSM. This isn't a case of discovering a single page with an extra few bytes of storage available due to a HOT prune or something - this is a case of putting at least 20 and plausibly hundreds of extra pages into the FSM. The extra effort to update the upper FSM pages is trivial by comparison with the cost of extending the relation by many blocks. So, I suggest adding a new function FreeSpaceMapBulkExtend(BlockNumber first_block, BlockNumber last_block) which sets all the FSM entries for pages between first_block and last_block to 255 and then bubbles that up to the higher levels of the tree and all the way to the root. Have the bulk extend code use that instead of repeatedly calling RecordPageWithFreeSpace. That should actually be much more efficient, because it can call fsm_readbuf(), LockBuffer(), and UnlockReleaseBuffer() just once per FSM page instead of once per FSM page *per byte modified*. Maybe that makes no difference in practice, but it can't hurt. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
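As a rough illustration of the two approaches weighed above, here is a hedged, caller-level sketch. RecordPageWithFreeSpace() is the existing per-page FSM API; FreeSpaceMapBulkExtend() is only the helper proposed above and does not exist at this point, and register_new_blocks() is a made-up name for illustration:

#include "postgres.h"
#include "storage/freespace.h"

static void
register_new_blocks(Relation relation, BlockNumber firstBlock,
                    BlockNumber lastBlock, Size freespace)
{
    BlockNumber blkno;

    /* Per-page updates: only the FSM leaf pages change, the upper levels
     * stay stale until VACUUM fixes them. */
    for (blkno = firstBlock; blkno <= lastBlock; blkno++)
        RecordPageWithFreeSpace(relation, blkno, freespace);

    /*
     * Proposed alternative: a single call that marks the whole range as
     * free and bubbles the change up to the FSM root, so plain
     * GetPageWithFreeSpace() callers can find the new pages too:
     *
     *     FreeSpaceMapBulkExtend(firstBlock, lastBlock);
     */
}

The trade-off being argued is exactly the one in the comments: per-page updates are cheap per call but leave the upper FSM levels stale, while a bulk update touches each FSM page once and makes the new space visible to every searcher.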
1. Callers who use GetPageWithFreeSpace() rather than
GetPageFreeSpaceExtended() will fail to find the new pages if the
upper map levels haven't been updated by VACUUM.
2. Even callers who use GetPageFreeSpaceExtended() may fail to find
the new pages. This can happen in two separate ways, namely (a) the
Yeah, that's the issue: if the extended pages spill to the next FSM page, then other waiters will not find those pages, and one by one all waiters will end up adding extra pages.

For example, if there are ~30 waiters then total blocks extended = (25(25+1)/2) * 20 =~ 6500 pages.

This is not the case every time, but whenever the heap blocks go to a new FSM page this will happen.

- An FSM page can hold info for 4096 heap blocks, so after every 8th extend (assuming 512 blocks are extended at a time), it will extend ~6500 pages
(there are still chances that some blocks can be completely left unused, until vacuum comes).
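The roll-over arithmetic can be sketched in a few lines of standalone code (illustrative only; ~4096 slots per FSM leaf page is the figure used above, the exact count depends on the FSM page layout):

#include <stdio.h>

#define SLOTS_PER_FSM_PAGE 4096
#define BLOCKS_PER_EXTEND   512

int
main(void)
{
    /* Report which bulk extensions start on a fresh FSM leaf page:
     * with 512-block extends that is every 8th one (the 9th, 17th, ...). */
    for (int extend = 1; extend <= 17; extend++)
    {
        int first_block = (extend - 1) * BLOCKS_PER_EXTEND;

        if (extend > 1 && first_block % SLOTS_PER_FSM_PAGE == 0)
            printf("extend #%d (starting at block %d) begins a new FSM leaf page\n",
                   extend, first_block);
    }
    return 0;
}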
I have changed the patch as per the suggestion (just a POC, because the performance numbers are not that great).

The performance of patch v14 is significantly low compared to v13, mainly I guess for the below reasons:

1. As per the above calculation, v13 extends ~6500 blocks (after every 8th extend), and that's why it's performing well.
Copy 10000 tuples, of 4 bytes each..
---------------------------------------------
Client base patch v13 patch v14
1 118 147 126
2 217 276 269
4 210 421 347
8 166 630 375
16 145 813 415
32 124 985 451
64 974 455
Client base patch v13 patch v14
1 117 124 119
2 111 126 119
4 51 128 124
8 43 149 131
16 40 217 120
32 263 115
64 248 109
Attachment
On Fri, Mar 25, 2016 at 1:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Fri, Mar 25, 2016 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> 1. Callers who use GetPageWithFreeSpace() rather than >> GetPageFreeSpaceExtended() will fail to find the new pages if the >> upper map levels haven't been updated by VACUUM. >> >> 2. Even callers who use GetPageFreeSpaceExtended() may fail to find >> the new pages. This can happen in two separate ways, namely (a) the > > Yeah, that's the issue, if extended pages spills to next FSM page, then > other waiters will not find those page, and one by one all waiters will end > up adding extra pages. > for example, if there are ~30 waiters then > total blocks extended = (25(25+1)/2) *20 =~ 6500 pages. > > This is not the case every time but whenever heap block go to new FSM page > this will happen. I think we need to start testing these patches not only in terms of how *fast* they are but how *large* the relation ends up being when we're done. A patch that inserts the rows slower but the final relation is smaller may be better overall. Can you retest v13, v14, and master, and post not only the timings but the relation size afterwards? And maybe post the exact script you are using? > performance of patch v14 is significantly low compared to v13, mainly I > guess below reasons > 1. As per above calculation v13 extend ~6500 block (after every 8th extend), > and that's why it's performing well. That should be completely unnecessary, though. I mean, if the problem is that it's expensive to repeatedly acquire and release the relation extension lock, then bulk-extending even 100 blocks at a time should be enough to fix that, because you've reduced the number of times that has to be done by 99%. There's no way we should need to extend by thousands of blocks to get good performance. Maybe something like this would help: if (needLock) { if (!use_fsm) LockRelationForExtension(relation, ExclusiveLock); else if (!ConditionLockRelationForExtension(relation,ExclusiveLock)) { BlockNumber last_blkno = RelationGetNumberOfBlocks(relation); targetBlock = GetPageWithFreeSpaceExtended(relation, last_blkno, len + saveFreeSpace); if (targetBlock != InvalidBlockNumber) goto loop; LockRelationForExtension(relation, ExclusiveLock); targetBlock = GetPageWithFreeSpace(relation, len+ saveFreeSpace); if (targetBlock != InvalidBlockNumber) { UnlockRelationForExtension(relation,ExclusiveLock); goto loop; } RelationAddExtraBlocks(relation,bistate); } } I think this is better than what you had before with lastValidBlock, because we're actually interested in searching the free space map at the *very end* of the relation, not wherever the last target block happens to have been. We could go further still and have GetPageWithFreeSpace() always search the last, say, two pages of the FSM in all cases. But that might be expensive. The extra call to RelationGetNumberOfBlocks seems cheap enough here because the alternative is to wait for a contended heavyweight lock. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I think we need to start testing these patches not only in terms of
how *fast* they are but how *large* the relation ends up being when
we're done. A patch that inserts the rows slower but the final
relation is smaller may be better overall. Can you retest v13, v14,
and master, and post not only the timings but the relation size
afterwards? And maybe post the exact script you are using?
Maybe something like this would help:
if (needLock)
{
if (!use_fsm)
LockRelationForExtension(relation, ExclusiveLock);
else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
{
BlockNumber last_blkno = RelationGetNumberOfBlocks(relation);
targetBlock = GetPageWithFreeSpaceExtended(relation,
last_blkno, len + saveFreeSpace);
if (targetBlock != InvalidBlockNumber)
goto loop;
LockRelationForExtension(relation, ExclusiveLock);
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
if (targetBlock != InvalidBlockNumber)
{
UnlockRelationForExtension(relation, ExclusiveLock);
goto loop;
}
RelationAddExtraBlocks(relation, bistate);
}
}
I think this is better than what you had before with lastValidBlock,
because we're actually interested in searching the free space map at
the *very end* of the relation, not wherever the last target block
happens to have been.
We could go further still and have GetPageWithFreeSpace() always
search the last, say, two pages of the FSM in all cases. But that
might be expensive. The extra call to RelationGetNumberOfBlocks seems
cheap enough here because the alternative is to wait for a contended
heavyweight lock.
Attachment
search the last, say, two pages of the FSM in all cases. But that
might be expensive. The extra call to RelationGetNumberOfBlocks seems
cheap enough here because the alternative is to wait for a contended
heavyweight lock.

I will try the test with this also and post the results.
I have changed v14 as per this suggestion and the results are the same as v14.
Attachment
We could go further still and have GetPageWithFreeSpace() always
search the last, say, two pages of the FSM in all cases. But that
might be expensive. The extra call to RelationGetNumberOfBlocks seems
cheap enough here because the alternative is to wait for a contended
heavyweight lock.

I will try the test with this also and post the results.
Attachment
On Sun, Mar 27, 2016 at 8:00 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > Relation Size > ----------------- > INSERT : 16000 transaction from 32 Client > > Base v13 v14_1 > --------- --------- -------- > TPS 37 255 112 > Rel Size 17GB 17GB 18GB > > COPY: 32000 transaction from 32 client > Base v13 v14_1 > --------- --------- --------- > TPS 121 823 427 > Rel Size 11GB 11GB 11GB > > > Script are attached in the mail > ----------------------------------------= > test_size_ins.sh --> run insert test and calculate relation size. > test_size_copy --> run copy test and relation size > copy_script -> copy pg_bench script used by test_size_copy.sh > insert_script --> insert pg_bench script used by test_size_ins.sh > multi_extend_v14_poc_v1.patch --> modified patch of v14. > > I also tried modifying v14 from different different angle. > > One is like below--> > ------------------------- > In AddExtraBlock > { > I add page to FSM one by one like v13 does. > then update the full FSM tree up till root > } Not following this. Did you attach this version? > Results: > ---------- > 1. With this performance is little less than v14 but the problem of extra > relation size is solved. > 2. With this we can conclude that extra size of relation in v14 is because > some while extending the pages, its not immediately available and at end > some of the pages are left unused. I agree with that conclusion. I'm not quite sure where that leaves us, though. We can go back to v13, but why isn't that producing extra pages? It seems like it should: whenever a bulk extend rolls over to a new FSM page, concurrent backends will search either the old or the new one but not both. Maybe we could do this - not sure if it's what you were suggesting above: 1. Add the pages one at a time, and do RecordPageWithFreeSpace after each one. 2. After inserting them all, go back and update the upper levels of the FSM tree up the root. Another idea is: If ConditionalLockRelationForExtension fails to get the lock immediately, search the last *two* pages of the FSM for a free page. Just brainstorming here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Not following this. Did you attach this version?

> One is like below-->
> -------------------------
> In AddExtraBlock
> {
> I add page to FSM one by one like v13 does.
> then update the full FSM tree up till root
> }
> Results:
> ----------
> 1. With this performance is little less than v14 but the problem of extra
> relation size is solved.
> 2. With this we can conclude that extra size of relation in v14 is because
> some while extending the pages, its not immediately available and at end
> some of the pages are left unused.
I agree with that conclusion. I'm not quite sure where that leaves
us, though. We can go back to v13, but why isn't that producing extra
pages? It seems like it should: whenever a bulk extend rolls over to
a new FSM page, concurrent backends will search either the old or the
new one but not both.
Maybe we could do this - not sure if it's what you were suggesting above:
1. Add the pages one at a time, and do RecordPageWithFreeSpace after each one.
2. After inserting them all, go back and update the upper levels of
the FSM tree up the root.
Another idea is:
If ConditionalLockRelationForExtension fails to get the lock
immediately, search the last *two* pages of the FSM for a free page.
Just brainstorming here.
>
> On Sun, Mar 27, 2016 at 8:00 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > Results:
> > ----------
> > 1. With this performance is little less than v14 but the problem of extra
> > relation size is solved.
> > 2. With this we can conclude that extra size of relation in v14 is because
> > some while extending the pages, its not immediately available and at end
> > some of the pages are left unused.
>
> I agree with that conclusion. I'm not quite sure where that leaves
> us, though. We can go back to v13, but why isn't that producing extra
> pages? It seems like it should: whenever a bulk extend rolls over to
> a new FSM page, concurrent backends will search either the old or the
> new one but not both.
>
I have not debugged the flow, but by looking at v13 code, it looks like it will search both old and new. In function GetPageWithFreeSpaceExtended() -> fsm_search_from_addr() -> fsm_search_avail(), the basic idea of search is: Start the search from the target slot. At every step, move one node to the right, then climb up to the parent. Stop when we reach a node with enough free space (as we must, since the root has enough space). So shouldn't it be able to find the new FSM page where the bulk extend rolls over?
This is actually a multi-level tree, so each FSM page contains one slot tree.
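As a toy, standalone model of the search behaviour quoted above (one slot tree per FSM page, inner nodes holding the max of their children); this is not the actual fsm_search_avail() code:

#include <stdio.h>

#define LEAVES     8
#define NODES      (2 * LEAVES)         /* 1-based array, nodes 1..15 used */
#define LEAF_FIRST LEAVES               /* leaves occupy slots 8..15 */

static int tree[NODES];

static void
rebuild_uppers(void)
{
    for (int i = LEAF_FIRST - 1; i >= 1; i--)
        tree[i] = tree[2 * i] > tree[2 * i + 1] ? tree[2 * i] : tree[2 * i + 1];
}

/* Return a leaf index (0-based page number) with value >= min, or -1. */
static int
search_avail(int start_page, int min)
{
    int node = LEAF_FIRST + start_page;

    while (tree[node] < min)
    {
        /* Rightmost node of its level: nothing further to the right. */
        if ((node & (node + 1)) == 0)
            return -1;
        node = (node + 1) / 2;          /* step one node right, then climb to the parent */
    }
    while (node < LEAF_FIRST)           /* descend to a satisfying leaf */
        node = tree[2 * node] >= min ? 2 * node : 2 * node + 1;
    return node - LEAF_FIRST;
}

int
main(void)
{
    /* Pages 6 and 7 were just bulk-extended and marked nearly empty. */
    int freespace[LEAVES] = {0, 0, 0, 0, 0, 0, 255, 255};

    for (int i = 0; i < LEAVES; i++)
        tree[LEAF_FIRST + i] = freespace[i];
    rebuild_uppers();

    printf("search from page 2 finds page %d\n", search_avail(2, 1));
    return 0;
}

Running it prints that a search starting at page 2 lands on page 6, the first of the newly added pages, which is the behaviour being described here.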
I agree with that conclusion. I'm not quite sure where that leaves
us, though. We can go back to v13, but why isn't that producing extra
pages? It seems like it should: whenever a bulk extend rolls over to
a new FSM page, concurrent backends will search either the old or the
new one but not both.
Maybe we could do this - not sure if it's what you were suggesting above:
1. Add the pages one at a time, and do RecordPageWithFreeSpace after each one.
2. After inserting them all, go back and update the upper levels of
the FSM tree up the root.

I think this is the better option, since we will search the last two pages of the FSM tree, so there is no need to update the upper level of the FSM tree. Right? I will test and post the result with this option.
Attachment
1. Relation Size : No change in size, its same as base and v13

2. INSERT 1028 Byte 1000 tuple performance
-----------------------------------------------------------
Client   base   v13   v15
1        117    124   122
2        111    126   123
4        51     128   125
8        43     149   135
16       40     217   147
32       35     263   141

3. COPY 10000 Tuple performance.
----------------------------------------------
Client   base   v13   v15
1        118    147   155
2        217    276   273
4        210    421   457
8        166    630   643
16       145    813   595
32       124    985   598

Conclusion:
---------------
1. I think v15 is solving the problem exist with v13 and performance is significantly high compared to base, and relation size is also stable, So IMHO V15 is winner over other solution, what other thinks ?

2. And no point in thinking that V13 is better than V15 because, V13 has bug of sometime extending more than expected pages and that is uncontrolled and same can be the reason also of v13 performing better.
Found one problem with V15, so sending the new version.
Attachment
On 28/03/16 14:46, Dilip Kumar wrote: > > Conclusion: > --------------- > 1. I think v15 is solving the problem exist with v13 and performance > is significantly high compared to base, and relation size is also > stable, So IMHO V15 is winner over other solution, what other thinks ? > It seems so, do you have ability to reasonably test with 64 clients? I am mostly wondering if we see the performance going further down or just plateau. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Mar 27, 2016 at 9:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > I think this is better option, Since we will search last two pages of FSM > tree, then no need to update the upper level of the FSM tree. Right ? Well, it's less important in that case, but I think it's still worth doing. Some people are going to do just plain GetPageWithFreeSpace(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Mar 28, 2016 at 8:46 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > Found one problem with V15, so sending the new version. > In V15 I am taking prev_blkno as targetBlock instead it should be the last > block of the relation at that time. Attaching new patch. BlockNumber targetBlock, - otherBlock; + otherBlock, + prev_blkno = RelationGetNumberOfBlocks(relation); Absolutely not. There is no way it's acceptable to introduce an unconditional call to RelationGetNumberOfBlocks() into every call to RelationGetBufferForTuple(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 28/03/16 14:46, Dilip Kumar wrote:
Conclusion:
---------------
1. I think v15 is solving the problem exist with v13 and performance
is significantly high compared to base, and relation size is also
stable, So IMHO V15 is winner over other solution, what other thinks ?
It seems so, do you have ability to reasonably test with 64 clients? I am mostly wondering if we see the performance going further down or just plateau.
Well, it's less important in that case, but I think it's still worth
doing. Some people are going to do just plain GetPageWithFreeSpace().
Attachment
Attachment
On Tue, Mar 29, 2016 at 1:29 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > Attaching new version v18 > > - Some cleanup work on v17. > - Improved UpdateFreeSpaceMap function. > - Performance and space utilization are same as V17 Looks better. Here's a v19 that I hacked on a bit. Unfortunately, one compiler I tried this with had a pretty legitimate complaint: hio.c: In function ‘RelationGetBufferForTuple’: hio.c:231:20: error: ‘freespace’ may be used uninitialized in this function [-Werror=uninitialized] hio.c:185:7: note: ‘freespace’ was declared here hio.c:231:20: error: ‘blockNum’ may be used uninitialized in this function [-Werror=uninitialized] hio.c:181:14: note: ‘blockNum’ was declared here There's nothing whatsoever to prevent RelationExtensionLockWaiterCount from returning 0. It's also rather ugly that the call to UpdateFreeSpaceMap() assumes that the last value returned by PageGetHeapFreeSpace() is as good as any other, but maybe we can just install a comment explaining that point; there's not an obviously better approach that I can see. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
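One way the zero-waiters concern above could be handled, sketched under assumptions (this is not necessarily the committed fix; the loop body is elided, and Min/Max are the usual backend macros):

    BlockNumber blockNum = InvalidBlockNumber;
    Size        freespace = 0;
    int         lockWaiters = RelationExtensionLockWaiterCount(relation);
    /* Even with zero waiters, extend by at least one block so the caller
     * always gets initialized blockNum/freespace values back. */
    int         extraBlocks = Max(Min(lockWaiters * 20, 512), 1);

    do
    {
        /* ... extend by one block, remember blockNum and freespace ... */
    } while (--extraBlocks > 0);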
hio.c: In function ‘RelationGetBufferForTuple’:
hio.c:231:20: error: ‘freespace’ may be used uninitialized in this
function [-Werror=uninitialized]
hio.c:185:7: note: ‘freespace’ was declared here
hio.c:231:20: error: ‘blockNum’ may be used uninitialized in this
function [-Werror=uninitialized]
hio.c:181:14: note: ‘blockNum’ was declared here
There's nothing whatsoever to prevent RelationExtensionLockWaiterCount
from returning 0.
It's also rather ugly that the call to UpdateFreeSpaceMap() assumes
that the last value returned by PageGetHeapFreeSpace() is as good as
any other, but maybe we can just install a comment explaining that
point; there's not an obviously better approach that I can see.
Attachment
+	if (lockWaiters)
+		/*
+		 * Here we are using same freespace for all the Blocks, but that
+		 * is Ok, because all are newly added blocks and have same freespace
+		 * And even some block which we just added to FreespaceMap above, is
+		 * used by some backend and now freespace is not same, will not harm
+		 * anything, because actual freespace will be calculated by user
+		 * after getting the page.
+		 */
+		UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);

Does this look good ?
Done in better way..
Attachment
On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yes, that makes sense. One more point is that if the reason for v13 giving
> better performance is extra blocks (which we believe in certain cases can
> leak till the time Vacuum updates the FSM tree), do you think it makes
> sense to once test by increasing lockWaiters * 20 limit to may be
> lockWaiters * 25 or lockWaiters * 30.
I tested COPY 10000 record, by increasing the number of blocks just to find out why we are not as good as V13, with extraBlocks = Min(lockWaiters * 40, 2048), and got below results..

COPY 10000
--------------------
Client   Patch (extraBlocks = Min(lockWaiters * 40, 2048))
------   ---------
16       752
32       708

This proves that the main reason for v13 being better is that it adds extra blocks without control. Though v13 is better than these results, I think we can get there also by changing the multiplier and max limit.

But I think we are ok with the max size as 4MB (512 blocks), right?
On Thu, Mar 31, 2016 at 12:59 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila <amit.kapila16@gmail.com> > wrote: >> Yes, that makes sense. One more point is that if the reason for v13 >> giving better performance is extra blocks (which we believe in certain cases >> can leak till the time Vacuum updates the FSM tree), do you think it makes >> sense to once test by increasing lockWaiters * 20 limit to may be >> lockWaiters * 25 or lockWaiters * 30. > > > I tested COPY 10000 record, by increasing the number of blocks just to find > out why we are not as good as V13 > with extraBlocks = Min( lockWaiters * 40, 2048) and got below results.. > > COPY 10000 > -------------------- > Client Patch(extraBlocks = Min( lockWaiters * 40, 2048)) > -------- --------- > 16 752 > 32 708 > > This proves that main reason of v13 being better is its adding extra blocks > without control. > though v13 is better than these results, I think we can get that also by > changing multiplier and max limit . > > But I think we are ok with the max size as 4MB (512 blocks) right? Yeah, kind of. But obviously if we could make the limit smaller without hurting performance, that would be better. Per my note yesterday about performance degradation with parallel COPY, I wasn't able to demonstrate that this patch gives a consistent performance benefit on hydra - the best result I got was speeding up a 9.5 minute load to 8 minutes where linear scalability would have been 2 minutes. And I found cases where it was actually slower with the patch. Now maybe hydra is just a crap machine, but I'm feeling nervous. What machines did you use to test this? Have you tested really large data loads, like many MB or even GB of data? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 31, 2016 at 9:03 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > If you need some more information please let me know ? I repeated the testing described in http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com on a MacBook Pro (OS X 10.8.5, 2.4 GHz Intel Core i7, 8GB, early 2013) and got the following results. Note that I did not adjust *_flush_delay in this test because that's always 0, apparently, on MacOS. master, unlogged tables, 1 copy: 0m18.928s, 0m20.276s, 0m18.040s patched, unlogged tables, 1 copy: 0m20.499s, 0m20.879s, 0m18.912s master, unlogged tables, 4 parallel copies: 0m57.301s, 0m58.045s, 0m57.556s patched, unlogged tables, 4 parallel copies: 0m47.994s, 0m45.586s, 0m44.440s master, logged tables, 1 copy: 0m29.353s, 0m29.693s, 0m31.840s patched, logged tables, 1 copy: 0m30.837s, 0m31.567s, 0m36.843s master, logged tables, 4 parallel copies: 1m45.691s, 1m53.085s, 1m35.674s patched, logged tables, 4 parallel copies: 1m21.137s, 1m20.678s, 1m22.419s So the first thing here is that the patch seems to be a clear win in this test. For a single copy, it seems to be pretty much a wash. When running 4 copies in parallel, it is about 20-25% faster with both logged and unlogged tables. The second thing that is interesting is that we are getting super-linear scalability even without the patch: if 1 copy takes 20 seconds, you might expect 4 to take 80 seconds, but it really takes 60 unpatched or 45 patched. If 1 copy takes 30 seconds, you might expect 4 to take 120 seconds, but in really takes 105 unpatched or 80 patched. So we're not actually I/O constrained on this test, I think, perhaps because this machine has an SSD. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 5, 2016 at 10:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: > So the first thing here is that the patch seems to be a clear win in > this test. For a single copy, it seems to be pretty much a wash. > When running 4 copies in parallel, it is about 20-25% faster with both > logged and unlogged tables. The second thing that is interesting is > that we are getting super-linear scalability even without the patch: > if 1 copy takes 20 seconds, you might expect 4 to take 80 seconds, but > it really takes 60 unpatched or 45 patched. If 1 copy takes 30 > seconds, you might expect 4 to take 120 seconds, but in really takes > 105 unpatched or 80 patched. So we're not actually I/O constrained on > this test, I think, perhaps because this machine has an SSD. It's not unusual for COPY to not be I/O constrained, I believe. -- Peter Geoghegan
On Tue, Apr 5, 2016 at 1:05 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Apr 5, 2016 at 10:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> So the first thing here is that the patch seems to be a clear win in >> this test. For a single copy, it seems to be pretty much a wash. >> When running 4 copies in parallel, it is about 20-25% faster with both >> logged and unlogged tables. The second thing that is interesting is >> that we are getting super-linear scalability even without the patch: >> if 1 copy takes 20 seconds, you might expect 4 to take 80 seconds, but >> it really takes 60 unpatched or 45 patched. If 1 copy takes 30 >> seconds, you might expect 4 to take 120 seconds, but in really takes >> 105 unpatched or 80 patched. So we're not actually I/O constrained on >> this test, I think, perhaps because this machine has an SSD. > > It's not unusual for COPY to not be I/O constrained, I believe. Yeah. I've committed the patch now, with some cosmetic cleanup. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Yeah. I've committed the patch now, with some cosmetic cleanup.
Thanks Robert !!!