Thread: Relation extension scalability
Hello,

Currently bigger shared_buffers settings don't combine well with relations
being extended frequently. Especially if many/most pages have a high
usagecount and/or are dirty and the system is IO constrained.

As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is being called, to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized

The problems come from 4) and 5) potentially each taking a fair while. If the
working set mostly fits into shared_buffers, 4) can require iterating over
all shared buffers several times to find a victim buffer. If the IO subsystem
is busy and/or we've hit the kernel's dirty limits, 5) can take a couple of
seconds.

I've prototyped solving this for heap relations by moving the smgrnblocks() +
smgrextend() calls to RelationGetBufferForTuple(). With some care (including
a retry loop) it's possible to only do those two under the extension lock.
That indeed fixes problems in some of my tests.

I'm not sure whether the above is the best solution however. For one I think
it's not necessarily a good idea to opencode this in hio.c - I've not
observed it, but this probably can happen for btrees and such as well. For
another, this is still an exclusive lock while we're doing IO: smgrextend()
wants a page to write out, so we have to be careful not to overwrite things.

There's two things that seem to make sense to me:

First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW and
introduce a bufmgr function specifically for extension.

Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're worried that
somebody else can come along, read the page, and initialize it before us.
ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend(). If we instead only do the write once we've read in &
locked the page exclusively there's no need for the extension lock. We
probably still should write out the new page to the OS immediately once
we've initialized it, to avoid creating sparse files.

The other reason we need the extension lock is that code like
lazy_scan_heap() and btvacuumscan() tries to avoid initializing pages that
are about to be initialized by the extending backend. I think we should just
remove that code and deal with the problem by retrying in the extending
backend; that's why I think moving extension to a different file might be
helpful.

I've attached my POC for heap extension, but it's really just a POC at this
point.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
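[For orientation, a rough sketch of how those seven steps look from a
caller's point of view today. This is simplified and not the actual source:
FSM lookups, WAL logging and error handling are omitted, and the framing
function is illustrative only.]

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "storage/lmgr.h"
    #include "utils/rel.h"

    static Buffer
    extend_relation_sketch(Relation relation)
    {
        Buffer      buffer;
        Page        page;

        /* 1) serialize all extenders of this relation */
        LockRelationForExtension(relation, ExclusiveLock);

        /*
         * 2) P_NEW makes ReadBuffer() extend the relation: internally it
         * calls smgrnblocks() to pick the new block number (3), finds and
         * possibly cleans a victim buffer via BufferAlloc() (4, 5), and
         * writes out a zeroed page with smgrextend() (6) - all while the
         * extension lock is held.
         */
        buffer = ReadBuffer(relation, P_NEW);
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

        /* once the buffer is locked the extension lock can be released */
        UnlockRelationForExtension(relation, ExclusiveLock);

        /* 7) initialize the new page */
        page = BufferGetPage(buffer);
        PageInit(page, BufferGetPageSize(buffer), 0);

        return buffer;
    }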
Andres Freund <andres@2ndquadrant.com> writes:
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is being called, to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
>    block into
> 5) If dirty the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized

> The problems come from 4) and 5) potentially each taking a fair
> while.

Right, so basically we want to get those steps out of the exclusive lock
scope.

> There's two things that seem to make sense to me:

> First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
> and introduce a bufmgr function specifically for extension.

I think that removing P_NEW is likely to require a fair amount of
refactoring of calling code, so I'm not thrilled with doing that. On the
other hand, perhaps all that code would have to be touched anyway to modify
the scope over which the extension lock is held.

> Secondly I think we could maybe remove the requirement of needing an
> extension lock altogether. It's primarily required because we're
> worried that somebody else can come along, read the page, and
> initialize it before us. ISTM that could be resolved by *not* writing
> any data via smgrextend()/mdextend().

I'm afraid this would break stuff rather thoroughly, in particular handling
of out-of-disk-space cases. And I really don't see how you get consistent
behavior at all for multiple concurrent callers if there's no locking.

One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added". On the other hand,
that would then require reading in the block after allocating a buffer to
hold it (I don't think you can safely assume otherwise) so the added read
step might eat any savings.

			regards, tom lane
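[To make that suggested API change concrete, a hypothetical sketch follows.
smgrextend_zeroed() does not exist in PostgreSQL; it only illustrates the
"add a prezeroed block and return its number" idea, with the caller then
having to read the block back in as Tom notes.]

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/smgr.h"
    #include "utils/rel.h"

    /* hypothetical: extend by one zeroed block, return its block number */
    extern BlockNumber smgrextend_zeroed(SMgrRelation reln, ForkNumber forknum);

    /* a caller would extend first, and only then allocate a buffer for it */
    static Buffer
    get_extended_block(Relation rel)
    {
        BlockNumber blockno = smgrextend_zeroed(rel->rd_smgr, MAIN_FORKNUM);

        /* the extra read step mentioned above: the block must be read back */
        return ReadBuffer(rel, blockno);
    }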
On 2015-03-29 15:21:44 -0400, Tom Lane wrote:
> > There's two things that seem to make sense to me:
> >
> > First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
> > and introduce a bufmgr function specifically for extension.
>
> I think that removing P_NEW is likely to require a fair amount of
> refactoring of calling code, so I'm not thrilled with doing that.
> On the other hand, perhaps all that code would have to be touched
> anyway to modify the scope over which the extension lock is held.

It's not *that* many locations that need to extend relations. In my playing
around it seemed to me they all would need to be modified anyway, if we want
to remove/reduce the scope of extension locks, to deal with the fact that
somebody else could have started to use the buffer.

> > Secondly I think we could maybe remove the requirement of needing an
> > extension lock altogether. It's primarily required because we're
> > worried that somebody else can come along, read the page, and
> > initialize it before us. ISTM that could be resolved by *not* writing
> > any data via smgrextend()/mdextend().
>
> I'm afraid this would break stuff rather thoroughly, in particular
> handling of out-of-disk-space cases.

Hm. Not a bad point.

> And I really don't see how you get consistent behavior at all for
> multiple concurrent callers if there's no locking.

What I was thinking is something like this:

    while (true)
    {
        targetBuffer = AcquireFromFSMEquivalent();

        if (targetBuffer == InvalidBuffer)
            targetBuffer = ExtendRelation();

        LockBuffer(targetBuffer, BUFFER_LOCK_EXCLUSIVE);

        if (BufferHasEnoughSpace(targetBuffer))
            break;

        LockBuffer(targetBuffer, BUFFER_LOCK_UNLOCK);
    }

where ExtendRelation() would basically work like

    while (true)
    {
        targetBlock = (lseek(fd, BLCKSZ, SEEK_END) - BLCKSZ) / BLCKSZ;
        buffer = ReadBuffer(rel, targetBlock);
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        page = BufferGetPage(buffer);

        if (PageIsNew(page))
        {
            PageInit(page);
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
            FlushBuffer(buffer);
            break;
        }
        else
        {
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
            continue;
        }
    }

Obviously it's a tad more complex than that pseudocode, but I think that
basically should work. Except that it, as you say, leads to some oddities
with out of space handling. I think it should actually be ok, it might just
be confusing to the user.

I think we might be able to address those issues by not using the
lseek(SEEK_END) approach but instead

    fcntl(fd, F_SETFL, O_APPEND, 1);
    write(fd, pre_init_block, BLCKSZ);
    fcntl(fd, F_SETFL, O_APPEND, 0);
    newblock = (lseek(fd, 0, SEEK_CUR) - BLCKSZ) / BLCKSZ;

By using O_APPEND and a pre-initialized block we can be sure to write a
valid block at the end, and we shouldn't run afoul of any out of space
issues that we don't already have. Unfortunately I'm not sure whether fcntl
for O_APPEND is portable :(

> One idea that might help is to change smgrextend's API so that it doesn't
> need a buffer to write from, but just has an API of "add a prezeroed block
> on-disk and tell me the number of the block you added". On the other
> hand, that would then require reading in the block after allocating a
> buffer to hold it (I don't think you can safely assume otherwise) so the
> added read step might eat any savings.

Yea, I was thinking that as well. We simply could skip the reading step by
setting up the contents in the buffer manager without a read in this case...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2015-03-29 15:21:44 -0400, Tom Lane wrote:
>> One idea that might help is to change smgrextend's API so that it doesn't
>> need a buffer to write from, but just has an API of "add a prezeroed block
>> on-disk and tell me the number of the block you added". On the other
>> hand, that would then require reading in the block after allocating a
>> buffer to hold it (I don't think you can safely assume otherwise) so the
>> added read step might eat any savings.

> Yea, I was thinking that as well. We simply could skip the reading step
> by setting up the contents in the buffer manager without a read in this
> case...

No, you can't, at least not if the point is to not be holding any exclusive
lock by the time you go to talk to the buffer manager. There will be nothing
stopping some other backend from writing into that page of the file before
you can get hold of it. If the buffer they used to do the write has itself
gotten recycled, there is nothing left at all to tell you your page image is
out of date.

			regards, tom lane
On 2015-03-29 16:07:49 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > Yea, I was thinking that as well. We simply could skip the reading step
> > by setting up the contents in the buffer manager without a read in this
> > case...
>
> No, you can't, at least not if the point is to not be holding any
> exclusive lock by the time you go to talk to the buffer manager. There
> will be nothing stopping some other backend from writing into that page
> of the file before you can get hold of it. If the buffer they used to do
> the write has itself gotten recycled, there is nothing left at all to
> tell you your page image is out of date.

That's why I'd proposed restructuring things so that the actual
extension/write to the file only happens once we have the buffer manager's
exclusive lock on the individual buffer. While not trivial to implement, it
doesn't look prohibitively complex.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2015-03-29 20:02:06 -0400, Robert Haas wrote:
> On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The problems come from 4) and 5) potentially each taking a fair
> > while. If the working set mostly fits into shared_buffers, 4) can
> > require iterating over all shared buffers several times to find a
> > victim buffer. If the IO subsystem is busy and/or we've hit the
> > kernel's dirty limits, 5) can take a couple of seconds.
>
> Interesting. I had always assumed the bottleneck was waiting for the
> filesystem to extend the relation.

That might be the case sometimes, but it's not what I've actually observed
so far. I think most modern filesystems do preallocation, which resolves
this to some degree.

> > The other reason we need the extension lock is that code like
> > lazy_scan_heap() and btvacuumscan() tries to avoid initializing pages
> > that are about to be initialized by the extending backend. I think we
> > should just remove that code and deal with the problem by retrying in
> > the extending backend; that's why I think moving extension to a
> > different file might be helpful.
>
> I thought the primary reason we did this is because we wanted to
> write-and-fsync the block so that, if we're out of disk space, any
> attendant failure will happen before we put data into the block.

Well, we only write and register an fsync. Afaics we don't actually perform
the fsync at that point. I don't think having to do the fsync() necessarily
precludes removing the extension lock.

> Once we've initialized the block, a subsequent failure to write or
> fsync it will be hard to recover from;

At the very least the buffer shouldn't become dirty before we successfully
wrote it once, right. It seems quite doable to achieve that without the lock
though. We'll have to do the write without going through the buffer manager,
but that seems doable.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is being called, to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
>    block into
> 5) If dirty the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized
>
> The problems come from 4) and 5) potentially each taking a fair
> while. If the working set mostly fits into shared_buffers, 4) can
> require iterating over all shared buffers several times to find a
> victim buffer. If the IO subsystem is busy and/or we've hit the
> kernel's dirty limits, 5) can take a couple of seconds.

Interesting. I had always assumed the bottleneck was waiting for the
filesystem to extend the relation.

> Secondly I think we could maybe remove the requirement of needing an
> extension lock altogether. It's primarily required because we're worried
> that somebody else can come along, read the page, and initialize it
> before us. ISTM that could be resolved by *not* writing any data via
> smgrextend()/mdextend(). If we instead only do the write once we've read
> in & locked the page exclusively there's no need for the extension lock.
> We probably still should write out the new page to the OS immediately
> once we've initialized it, to avoid creating sparse files.
>
> The other reason we need the extension lock is that code like
> lazy_scan_heap() and btvacuumscan() tries to avoid initializing pages
> that are about to be initialized by the extending backend. I think we
> should just remove that code and deal with the problem by retrying in
> the extending backend; that's why I think moving extension to a
> different file might be helpful.

I thought the primary reason we did this is because we wanted to
write-and-fsync the block so that, if we're out of disk space, any attendant
failure will happen before we put data into the block. Once we've
initialized the block, a subsequent failure to write or fsync it will be
hard to recover from; basically, we won't be able to checkpoint any more. If
we discover the problem while the block is still all-zeroes, the transaction
that uncovers the problem errors out, but the system as a whole is still OK.

Or at least, I think. Maybe I'm misunderstanding.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I thought the primary reason we did this is because we wanted to
> write-and-fsync the block so that, if we're out of disk space, any
> attendant failure will happen before we put data into the block. Once
> we've initialized the block, a subsequent failure to write or fsync it
> will be hard to recover from; basically, we won't be able to
> checkpoint any more. If we discover the problem while the block is
> still all-zeroes, the transaction that uncovers the problem errors
> out, but the system as a whole is still OK.

Yeah. As Andres says, the fsync is not an important part of that, but we do
expect that ENOSPC will happen during the initial write() if it's going to
happen.

To some extent that's an obsolete assumption, I'm afraid --- I believe that
some modern filesystems don't necessarily overwrite the previous version of
a block, which would mean that they are capable of failing with ENOSPC even
during a re-write of a previously-written block. However, the possibility of
filesystem misfeasance of that sort doesn't excuse us from having a clear
recovery strategy for failures during relation extension.

			regards, tom lane
On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:
> In the past, I have observed in one of the write-oriented tests that
> backends have to flush the pages by themselves many a time, so in the
> above situation that can lead to a more severe bottleneck.

Yes.

> > I've prototyped solving this for heap relations by moving the
> > smgrnblocks() + smgrextend() calls to RelationGetBufferForTuple(). With
> > some care (including a retry loop) it's possible to only do those two
> > under the extension lock. That indeed fixes problems in some of my tests.
>
> So does this mean that the problem is because of contention on the
> extension lock?

Yes, at least commonly. Obviously the extension lock would be less of a
problem if we were better at having a clean victim buffer ready.

> > I'm not sure whether the above is the best solution however.
>
> Another thing to note here is that during extension we are extending just
> one block, won't it make sense to increment it by some bigger number (we
> can even take input from the user for the same, where the user can specify
> how much to autoextend a relation when the relation doesn't have any empty
> space). During mdextend(), we might increase just one block, however we
> can register the request for a background process to increase the size,
> similar to what is done for fsync.

I think that's pretty much a separate patch. Made easier by moving things
out from under the lock maybe. Other than that I'd prefer not to mix things.
There's a whole bunch of unrelated complexity that I don't want to attach to
the topic at the same time (autovacuum truncating again and so on).

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3/30/15 6:45 AM, Andres Freund wrote:
> On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:
>> Another thing to note here is that during extension we are extending just
>> one block, won't it make sense to increment it by some bigger number (we
>> can even take input from the user for the same, where the user can
>> specify how much to autoextend a relation when the relation doesn't have
>> any empty space). During mdextend(), we might increase just one block,
>> however we can register the request for a background process to increase
>> the size, similar to what is done for fsync.
>
> I think that's pretty much a separate patch. Made easier by moving things
> out from under the lock maybe. Other than that I'd prefer not to mix
> things. There's a whole bunch of unrelated complexity that I don't want to
> attach to the topic at the same time (autovacuum truncating again and so
> on).

Agreed that it makes more sense for this to be in a separate patch, but I
definitely like the idea.

A user configurable setting would be fine, but better would be to learn from
the current growth rate of the table and extend based on that.

For instance, if a table is very large but is only growing by a few rows a
day, there's probably no need for a large extent. Conversely, an initially
small table growing by 1GB per minute would definitely benefit from large
extents, and it would be good to be able to track growth and compute extent
sizes accordingly.

Of course, a manual setting to start with would cover most use cases. Large
tables in a database are generally in the minority and known in advance.

--
- David Steele
david@pgmasters.net
* David Steele (david@pgmasters.net) wrote:
> On 3/30/15 6:45 AM, Andres Freund wrote:
> > On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:
> > [...]
> > I think that's pretty much a separate patch. Made easier by moving
> > things out from under the lock maybe. Other than that I'd prefer not to
> > mix things.
>
> Agreed that it makes more sense for this to be in a separate patch, but I
> definitely like the idea.
>
> A user configurable setting would be fine, but better would be to learn
> from the current growth rate of the table and extend based on that.
>
> For instance, if a table is very large but is only growing by a few rows a
> day, there's probably no need for a large extent. Conversely, an initially
> small table growing by 1GB per minute would definitely benefit from large
> extents, and it would be good to be able to track growth and compute
> extent sizes accordingly.
>
> Of course, a manual setting to start with would cover most use cases.
> Large tables in a database are generally in the minority and known in
> advance.

If we're able to extend based on page-level locks rather than the global
relation locking that we're doing now, then I'm not sure we really need to
adjust how big the extents are any more. The reason for making bigger
extents is because of the locking problem we have now when lots of backends
want to extend a relation, but, if I'm following correctly, that'd go away
with Andres' approach.

We don't have full patches for either of these and so I don't mind saying
that, basically, I'd prefer to see if we still have a big bottleneck here
with lots of backends trying to extend the same relation before we work on
adding this particular feature in, as it might end up being unnecessary.

Now, if someone shows up tomorrow with a patch to do this and Andres'
approach ends up not progressing, then we should certainly consider it (in
due time and with consideration to the activities going on for 9.5, of
course).

Thanks!

Stephen
On 3/30/15 10:48 PM, Amit Kapila wrote:
>> If we're able to extend based on page-level locks rather than the global
>> relation locking that we're doing now, then I'm not sure we really need
>> to adjust how big the extents are any more. The reason for making bigger
>> extents is because of the locking problem we have now when lots of
>> backends want to extend a relation, but, if I'm following correctly,
>> that'd go away with Andres' approach.
>
> The benefit of extending in bigger chunks in background is that backend
> would need to perform such an operation at relatively lesser frequency
> which in itself could be a win.

The other potential advantage (and I have to think this could be a BIG
advantage) is that extending by a large amount makes it more likely you'll
get contiguous blocks on the storage. That's going to make a big difference
for SeqScan speed.

It'd be interesting if someone with access to some real systems could test
that. In particular, seqscan of a possibly fragmented table vs one of the
same size but created at once. For extra credit, compare to dd bs=8192 of a
file of the same size as the overall table.

What I've seen in the real world is very, very poor SeqScan performance of
tables that were relatively large. So bad that I had to SeqScan 8-16 tables
in parallel to max out the IO system the same way I could with a single dd
bs=8k of a large file (in this case, something like 480MB/s). A single
SeqScan would only do something like 30MB/s.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 02-04-2015 AM 09:24, Jim Nasby wrote:
> The other potential advantage (and I have to think this could be a BIG
> advantage) is that extending by a large amount makes it more likely you'll
> get contiguous blocks on the storage. That's going to make a big
> difference for SeqScan speed. It'd be interesting if someone with access
> to some real systems could test that. In particular, seqscan of a possibly
> fragmented table vs one of the same size but created at once. For extra
> credit, compare to dd bs=8192 of a file of the same size as the overall
> table.

Orthogonal to the topic of the thread, but this comment made me recall a
proposal from a couple of years ago[0] to add (posix_)fallocate to
mdextend(). Wonder if it helps the case?

Amit

[0] http://www.postgresql.org/message-id/CADupcHW1POmSuNoNMdVaWLTq-a3X_A3ZQMuSjHs4rCexiPgxAQ@mail.gmail.com
* Amit Langote (Langote_Amit_f8@lab.ntt.co.jp) wrote:
> On 02-04-2015 AM 09:24, Jim Nasby wrote:
> > The other potential advantage (and I have to think this could be a BIG
> > advantage) is that extending by a large amount makes it more likely
> > you'll get contiguous blocks on the storage. [...]
>
> Orthogonal to the topic of the thread, but this comment made me recall a
> proposal from a couple of years ago[0] to add (posix_)fallocate to
> mdextend(). Wonder if it helps the case?

As I recall, it didn't, and further, modern filesystems are pretty good
about avoiding fragmentation anyway.

I'm not saying Jim's completely off-base with this idea, I'm just not sure
that it'll really buy us much.

Thanks,

Stephen
On Sun, Mar 29, 2015 at 11:56 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I'm not sure whether the above is the best solution however. For one I
> think it's not necessarily a good idea to opencode this in hio.c - I've
> not observed it, but this probably can happen for btrees and such as
> well. For another, this is still an exclusive lock while we're doing IO:
> smgrextend() wants a page to write out, so we have to be careful not to
> overwrite things.

I think relaxing a global lock will fix the contention mostly. However,
several people suggested that extending by many pages has other benefits.

This hints at a more fundamental change in our storage model. Currently we
map one file per relation. While it is simple and robust, considering
partitioned tables, and maybe later columnar storage integrated into the
core, this model needs some further thought. Think about a 1000-partition
table with 100 columns: that is 100K files, not to speak of other forks -
surely we can continue challenging the file system's limits or playing
around with vfds, but we have a chance now to think ahead.

Most commercial databases employ a DMS storage model, where the database
manages object mapping and free space itself. So different objects share
storage within several files. Surely it has historic reasons, but it has
several advantages over the current model:
- remove fd pressure
- remove double buffering (by introducing ADIO)
- controlled layout and access pattern (sequential and read-ahead)
- better quota management
- performance potentially better

Considering the platforms supported and the stabilization period needed, we
would have to support both the current storage model and the DMS model. I
will stop here to see if this deserves further discussion.

Regards,
Qingqing
On Fri, Apr 17, 2015 at 11:19 AM, Qingqing Zhou <zhouqq.postgres@gmail.com> wrote:
> Most commercial databases employ a DMS storage model, where the database
> manages object mapping and free space itself. So different objects share
> storage within several files. Surely it has historic reasons, but it has
> several advantages over the current model:
> - remove fd pressure
> - remove double buffering (by introducing ADIO)
> - controlled layout and access pattern (sequential and read-ahead)
> - better quota management
> - performance potentially better
>
> Considering the platforms supported and the stabilization period needed,
> we would have to support both the current storage model and the DMS
> model. I will stop here to see if this deserves further discussion.

Sorry, it might be considered double-posting here, but I am wondering: have
we ever discussed this before? If we already have some conclusions on this,
could anyone share a link?

Thanks,
Qingqing
Hi,

I have, every now and then, spent a bit of time making this more efficient
over the last few weeks. I had a bit of a problem reproducing the issues I'd
seen in production on physical hardware (found EC2 to be too variable to
benchmark this), but luckily 2ndQuadrant today allowed me access to their
four socket machine[1] of the AXLE project. Thanks Simon and Tomas!

First, some mostly juicy numbers:

My benchmark was a parallel COPY into a single WAL-logged target table:
CREATE TABLE data(data text);

The source data has been generated with

narrow:
COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY;

wide:
COPY (select repeat(random()::text, 10) FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY;

Between every test I ran a TRUNCATE data; CHECKPOINT;

For each number of clients I ran pgbench for 70 seconds. I'd previously
determined using -P 1 that the numbers are fairly stable. Longer runs would
have been nice, but then I'd not have finished in time.

shared_buffers = 48GB, narrow table contents:
client   tps after:    tps before:
1        180.255577    210.125143
2        338.231058    391.875088
4        638.814300    405.243901
8        1126.852233   370.922271
16       1242.363623   498.487008
32       1229.648854   484.477042
48       1223.288397   468.127943
64       1198.007422   438.238119
96       1201.501278   370.556354
128      1198.554929   288.213032
196      1189.603398   193.841993
256      1144.082291   191.293781
512      643.323675    200.782105

shared_buffers = 1GB, narrow table contents:
client   tps after:    tps before:
1        191.137410    210.787214
2        351.293017    384.086634
4        649.800991    420.703149
8        1103.770749   355.947915
16       1287.192256   489.050768
32       1226.329585   464.936427
48       1187.266489   443.386440
64       1182.698974   402.251258
96       1208.315983   331.290851
128      1183.469635   269.250601
196      1202.847382   202.788617
256      1177.924515   190.876852
512      572.457773    192.413191

shared_buffers = 48GB, wide table contents:
client   tps after:    tps before:
1        59.685215     68.445331
2        102.034688    103.210277
4        179.434065    78.982315
8        222.613727    76.195353
16       232.162484    77.520265
32       231.979136    71.654421
48       231.981216    64.730114
64       230.955979    57.444215
96       228.016910    56.324725
128      227.693947    45.701038
196      227.410386    37.138537
256      224.626948    35.265530
512      105.356439    34.397636

shared_buffers = 1GB, wide table contents:
(ran out of patience)

Note that the peak performance with the patch is significantly better, but
there's currently a noticeable regression in single threaded performance.
That undoubtedly needs to be addressed.

So, to get to the actual meat: My goal was to essentially get rid of an
exclusive lock over relation extension altogether. I think I found a way to
do that that addresses the concerns made in this thread.

The new algorithm basically is:
1) Acquire victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1,
   goto 3)
5) Try to read the page. In most cases it'll not yet exist. But the page
   might concurrently have been written by another backend and removed from
   shared buffers already. If already existing, goto 1)
6) Zero out the page on disk.

I think this does handle the concurrency issues.

This patch very clearly is in the POC stage. But I do think the approach is
generally sound. I'd like to see some comments before deciding whether to
carry on.

Greetings,

Andres Freund

PS: Yes, I know that precision in the benchmark isn't warranted, but I'm too
lazy to truncate them.
[1] [10:28:11 PM] Tomas Vondra:
4x Intel Xeon E5-4620 Eight Core 2.2GHz processors
   (generation: Sandy Bridge EP; each core handles 2 threads, so 16 threads total)
256GB (16x16GB) ECC REG System Validated Memory (1333 MHz)
2x 250GB SATA 2.5" Enterprise Level HDs (RAID 1, ~250GB)
17x 600GB SATA 2.5" Solid State HDs (RAID 0, ~10TB)
LSI MegaRAID 9271-8iCC controller and Cache Vault Kit (1GB cache)
2x Nvidia Tesla K20 Active GPU Cards (GK110GL)
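[To make the control flow of the six steps listed in the message above more
concrete, here is a schematic sketch. It is not the actual POC patch: the
helper functions marked as illustrative do not exist, and buffer
pinning/release and the buffer manager's I/O locking are glossed over.]

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/smgr.h"
    #include "utils/rel.h"

    /* illustrative helpers - these are NOT existing PostgreSQL functions */
    extern Buffer acquire_clean_victim_buffer(void);
    /* returns false if an entry for blockno already existed */
    extern bool   buffer_table_insert(Relation rel, BlockNumber blockno, Buffer buf);
    extern bool   block_exists_on_disk(Relation rel, BlockNumber blockno);

    static char zeroed_page[BLCKSZ];    /* all-zeroes page image */

    static Buffer
    extend_without_extension_lock(Relation rel)
    {
        for (;;)
        {
            /* 1) acquire a clean, pinned victim buffer */
            Buffer      victim = acquire_clean_victim_buffer();

            /* 2) start probing at the current relation size */
            BlockNumber blockno = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);

            /* 3) + 4) claim the first block not yet in the buffer table */
            while (!buffer_table_insert(rel, blockno, victim))
                blockno++;

            /*
             * 5) try to read the block: normally it does not exist yet, but
             * it may have been written and already evicted by a concurrent
             * extender, in which case start over (cleanup of the victim
             * buffer and the buffer table entry is omitted in this sketch).
             */
            if (block_exists_on_disk(rel, blockno))
                continue;

            /* 6) zero out the page on disk, thereby actually allocating it */
            smgrextend(rel->rd_smgr, MAIN_FORKNUM, blockno, zeroed_page, false);

            return victim;
        }
    }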
Hi,

Eeek, the attached patch included a trivial last minute screwup
(dereferencing bistate unconditionally...). Fixed version attached.

Andres
Andres Freund <andres@anarazel.de> writes:
> So, to get to the actual meat: My goal was to essentially get rid of an
> exclusive lock over relation extension altogether. I think I found a way
> to do that that addresses the concerns made in this thread.

> The new algorithm basically is:
> 1) Acquire victim buffer, clean it, and mark it as pinned
> 2) Get the current size of the relation, save it into blockno
> 3) Try to insert an entry into the buffer table for blockno
> 4) If the page is already in the buffer table, increment blockno by 1,
>    goto 3)
> 5) Try to read the page. In most cases it'll not yet exist. But the page
>    might concurrently have been written by another backend and removed
>    from shared buffers already. If already existing, goto 1)
> 6) Zero out the page on disk.

> I think this does handle the concurrency issues.

The need for (5) kind of destroys my faith in this really being safe: it
says there are non-obvious race conditions here. For instance, what about
this scenario:

* Session 1 tries to extend the file, allocates a buffer for page 42 (so
it's now between steps 4 and 5).
* Session 2 tries to extend the file, sees the buffer for 42 exists,
allocates a buffer for page 43 (so it's also now between 4 and 5).
* Session 2 tries to read page 43, sees it's not there, and writes out page
43 with zeroes (now it's done).
* Session 1 tries to read page 42, sees it's there and zero-filled (not
because anybody wrote it, but because holes in files read as 0).

At this point session 1 will go and create page 44, won't it, and you just
wasted a page. Now we do have mechanisms for reclaiming such pages but they
may not kick in until VACUUM, so you could end up with a whole lot of table
bloat.

Also, the file is likely to end up badly physically fragmented when the
skipped pages are finally filled in. One of the good things about the
relation extension lock is that the kernel sees the file as being extended
strictly sequentially, which it should handle fairly well as far as
filesystem layout goes. This way might end up creating a mess on-disk.

Perhaps even more to the point, you've added a read() kernel call that was
previously not there at all, without having removed either the lseek() or
the write(). Perhaps that scales better when what you're measuring is
saturation conditions on a many-core machine, but I have a very hard time
believing that it's not a significant net loss under less-contended
conditions.

I'm inclined to think that a better solution in the long run is to keep the
relation extension lock but find a way to extend files more than one page
per lock acquisition.

			regards, tom lane
On 2015-07-19 11:28:25 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > So, to get to the actual meat: My goal was to essentially get rid of an
> > exclusive lock over relation extension altogether. I think I found a
> > way to do that that addresses the concerns made in this thread.
> >
> > I think this does handle the concurrency issues.
>
> The need for (5) kind of destroys my faith in this really being safe: it
> says there are non-obvious race conditions here.

It's not simple, I agree. I'm doubtful that a significantly simpler approach
exists.

> For instance, what about this scenario:
> * Session 1 tries to extend the file, allocates a buffer for page 42 (so
> it's now between steps 4 and 5).
> * Session 2 tries to extend the file, sees the buffer for 42 exists,
> allocates a buffer for page 43 (so it's also now between 4 and 5).
> * Session 2 tries to read page 43, sees it's not there, and writes out
> page 43 with zeroes (now it's done).
> * Session 1 tries to read page 42, sees it's there and zero-filled (not
> because anybody wrote it, but because holes in files read as 0).
>
> At this point session 1 will go and create page 44, won't it, and you
> just wasted a page.

My local code now recognizes that case and uses the page. We just need to do
a PageIsNew() check.

> Also, the file is likely to end up badly physically fragmented when the
> skipped pages are finally filled in. One of the good things about the
> relation extension lock is that the kernel sees the file as being extended
> strictly sequentially, which it should handle fairly well as far as
> filesystem layout goes. This way might end up creating a mess on-disk.

I don't think that'll actually happen with any recent filesystems. Pretty
much all of them do delayed allocation. But it definitely is a concern with
older filesystems.

I've just measured, and with ext4 the number of extents per segment in a
300GB relation doesn't show a significant difference when compared between
the existing and new code.

We could try to address this by optionally using posix_fallocate() to do the
actual extension - then there'll not be sparse regions, but actually
allocated disk blocks.

> Perhaps even more to the point, you've added a read() kernel call that
> was previously not there at all, without having removed either the
> lseek() or the write(). Perhaps that scales better when what you're
> measuring is saturation conditions on a many-core machine, but I have a
> very hard time believing that it's not a significant net loss under
> less-contended conditions.

Yes, this has me worried too.

> I'm inclined to think that a better solution in the long run is to keep
> the relation extension lock but find a way to extend files more than one
> page per lock acquisition.

I doubt that'll help as much. As long as you have to search for and write
out victim buffers under an exclusive lock, that'll be painful.
You might be able to make that an infrequent occurrence by extending in
larger amounts and entering the returned pages into the FSM, but you'll have
rather noticeable latency increases every time that happens. And not just in
the extending backend - all the other backends will wait for the one doing
the extending. We could move that into some background process, but at that
point things have gotten seriously complex.

The more radical solution would be to have some place in memory that'd store
the current number of blocks. Then all the extension-specific locking we'd
need would be around incrementing that. But how and where to store that
isn't easy.

Greetings,

Andres Freund
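[As a sketch of that last idea - nothing like this exists in the patch; the
struct and its placement in shared memory are hypothetical, only the
pg_atomic primitives are real: if the relation's current length lived in
shared memory, claiming a new block would be a single atomic increment.]

    #include "postgres.h"
    #include "port/atomics.h"
    #include "storage/block.h"

    /* hypothetical shared-memory entry holding a relation's length in blocks */
    typedef struct RelExtensionCounter
    {
        pg_atomic_uint32 nblocks;
    } RelExtensionCounter;

    /*
     * Claim the next block number: fetch-and-add returns the old value,
     * which is exactly the block this backend now owns and must initialize.
     */
    static BlockNumber
    claim_new_block(RelExtensionCounter *counter)
    {
        return (BlockNumber) pg_atomic_fetch_add_u32(&counter->nblocks, 1);
    }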
Andres Freund <andres@anarazel.de> writes:
> On 2015-07-19 11:28:25 -0400, Tom Lane wrote:
>> At this point session 1 will go and create page 44, won't it, and you
>> just wasted a page.

> My local code now recognizes that case and uses the page. We just need
> to do a PageIsNew() check.

Er, what? How can you tell whether an all-zero page was or was not just
written by another session?

			regards, tom lane
On 2015-07-19 11:56:47 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > My local code now recognizes that case and uses the page. We just need
> > to do a PageIsNew() check.
>
> Er, what? How can you tell whether an all-zero page was or was not
> just written by another session?

The check is only done while holding the io lock on the relevant page (we
have to hold that anyway), after reading it in ourselves, just before
setting BM_VALID. As we can only get to that point when there wasn't any
other entry for the page in the buffer table, that guarantees that no other
backend is currently expanding into that page. Others might wait to read it,
but those'll wait behind the IO lock.

The situation the read() protects us against is that two backends try to
extend to the same block, but after one of them succeeded the buffer is
written out and reused for an independent page. So there is no in-memory
state telling the slower backend that that page has already been used.

Andres
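[Schematically, the check described above amounts to something like the
following. This is illustrative only; in the POC it happens inside the
buffer manager while the buffer's I/O lock is held, just before BM_VALID is
set.]

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"

    /*
     * After reading the block back in (still holding its I/O lock): if it
     * is still all zeroes, nobody initialized it - either it was never
     * written, or it is a filesystem hole - and the extending backend can
     * claim it.
     */
    static bool
    claim_if_still_new(Buffer buffer)
    {
        Page    page = BufferGetPage(buffer);

        if (PageIsNew(page))
        {
            PageInit(page, BufferGetPageSize(buffer), 0);
            return true;
        }

        /* a concurrent extender already used this block; caller retries */
        return false;
    }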
On Sun, Jul 19 2015 9:37 PM, Andres wrote:
> The situation the read() protects us against is that two backends try to
> extend to the same block, but after one of them succeeded the buffer is
> written out and reused for an independent page. So there is no in-memory
> state telling the slower backend that that page has already been used.

I was looking into this patch, and have done some performance testing.
Currently I have done testing on my local machine; later I will perform it
on a big machine once I get access to that. Just wanted to share the current
results I get on my local machine.

Machine conf: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz, 8 cores and 16GB of
RAM.

Test Script:
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY";
./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres
On Fri, Dec 18, 2015 at 10:51 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Sun, Jul 19 2015 9:37 PM, Andres wrote:
> > The situation the read() protects us against is that two backends try
> > to extend to the same block, but after one of them succeeded the buffer
> > is written out and reused for an independent page. So there is no
> > in-memory state telling the slower backend that that page has already
> > been used.
>
> I was looking into this patch, and have done some performance testing.
> Currently I have done testing on my local machine; later I will perform
> it on a big machine once I get access to that.
> [test script and local results]

This time I have done some testing on a big machine with 64 physical cores
@ 2.13GHz and 50GB of RAM.

There is a performance comparison of base, the extend-without-
RelationExtensionLock patch given by Andres, and the multi-extend patch
(this will extend multiple blocks at a time, based on a configuration
parameter).

Problem Analysis:
------------------------
1. With the base code, when I tried to observe the problem using perf and
   other methods (gdb), I found that RelationExtensionLock is the main
   bottleneck.
2. Then, after using the RelationExtensionLock-free patch, I observed that
   the contention is now FileWrite (all backends are trying to extend the
   file).

Performance Summary and Analysis:
------------------------------------------------
1. In my performance results, multi-extend showed the best performance and
   scalability.
2. I think by extending in multiple blocks we solve both of the problems
   (extension lock and parallel file write).
3. After extending one block we immediately add it to the FSM, so in most
   cases another backend can directly find it without taking the extension
   lock.

Currently the patch is in an initial stage; I have only tested performance
and passed the regression test suite.

Open problems:
-----------------------------
1. After extending the page we are adding it directly to the FSM, so if
   vacuum finds this page as new it will give a WARNING.
2. In RelationGetBufferForTuple, when PageIsNew, we are doing PageInit; the
   same needs to be considered for the index cases.

Test Script:
-------------------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY";
./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Performance Data:
--------------------------
There are three code bases in the comparison:
1. Base Code
2. Lock Free Patch: patch given in the below thread
   http://www.postgresql.org/message-id/20150719140746.GH25610@awork2.anarazel.de
3. Multi-extend patch attached in the mail.

#extend_num_pages: this is a new config parameter to tell how many extra
pages to extend in case of a normal extend. Maybe it will give more control
to the user if we make it a relation property.

I will work on the patch for this CF, so adding it to the CF.

Shared Buffer 48 GB
Clients   Base (TPS)   Lock Free Patch   Multi-extend extend_num_pages=5
1         142          138               148
2         251          253               280
4         237          416               464
8         168          491               575
16        141          448               404
32        122          337               332

Shared Buffer 64 MB
Clients   Base (TPS)   Multi-extend extend_num_pages=5
1         140          148
2         252          266
4         229          437
8         153          475
16        132          364

--
Dilip Kumar
On Thu, Dec 31, 2015 at 6:22 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Performance Data:
> --------------------------
> There are three code bases in the comparison:
> 1. Base Code
> 2. Lock Free Patch: patch given in the below thread
>    http://www.postgresql.org/message-id/20150719140746.GH25610@awork2.anarazel.de
> 3. Multi-extend patch attached in the mail.
>
> #extend_num_pages: this is a new config parameter to tell how many extra
> pages to extend in case of a normal extend. Maybe it will give more
> control to the user if we make it a relation property.
>
> I will work on the patch for this CF, so adding it to the CF.
>
> [...]

I'm not really sure what this email is trying to say.
On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:
> What I could understand from above e-mail is that Dilip has tried to
> extend relation multiple-pages-at-a-time and observed that it gives
> comparable or better performance as compare to Andres's idea of
> lock-free extension and it doesn't regress the low-thread count case.

I think it's a worthwhile approach to pursue. But until it actually fixes
the problem of leaving around uninitialized pages I don't think it's very
meaningful to do performance comparisons.

> Now, I think here point to discuss is that there could be multiple-ways
> for extending a relation multiple-pages-at-a-time like:
>
> a. Extend the relation page by page and add it to FSM without initializing
> it. I think this is what the current patch of Dilip seems to be doing. If
> we want to go via this route, then we need to ensure that whenever we get
> the page from FSM, if it is empty and not initialised, then initialise it.

I think that's pretty much unacceptable, for the non-error path at least.

One very simple, Linux-specific, approach would be to simply do
fallocate(FALLOC_FL_KEEP_SIZE) to extend the file; that way space is
pre-allocated, but not yet marked as allocated.

Greetings,

Andres Freund
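[A minimal sketch of that Linux-specific idea, assuming we already have the
open file descriptor of the relation segment; this is not what mdextend()
does today.]

    /* Linux-only; fallocate() and FALLOC_FL_KEEP_SIZE need _GNU_SOURCE */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <fcntl.h>

    #ifndef BLCKSZ
    #define BLCKSZ 8192         /* normally from pg_config.h */
    #endif

    /*
     * Reserve disk space for nblocks additional blocks past the current end
     * of the segment without moving EOF: the blocks are then backed by real
     * storage (no surprise ENOSPC later), but the file size reported by
     * lseek()/stat() stays unchanged until the blocks are actually written.
     */
    static int
    preallocate_blocks(int fd, off_t current_size, unsigned int nblocks)
    {
        return fallocate(fd, FALLOC_FL_KEEP_SIZE, current_size,
                         (off_t) nblocks * BLCKSZ);
    }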
On Thu, Jan 7, 2016 at 4:53 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:
> I think it's a worthwhile approach to pursue. But until it actually
> fixes the problem of leaving around uninitialized pages I don't think
> it's very meaningful to do performance comparisons.
>
> > a. Extend the relation page by page and add it to FSM without
> > initializing it. I think this is what the current patch of Dilip seems
> > to be doing. If we want to go via this route, then we need to ensure
> > that whenever we get the page from FSM, if it is empty and not
> > initialised, then initialise it.
>
> I think that's pretty much unacceptable, for the non-error path at least.

The attached patch solves this issue: I am allocating the buffer for each
page and initializing the page, and only after that adding it to the FSM.

Test Case:
------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo "COPY data from '/tmp/copybinary' WITH BINARY;" > copy_script
./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Test Summary:
--------------------
1. I have measured the performance of base and patch.
2. With the patch there are multiple results, with different values of
   "extend_num_pages" (the parameter which says how many extra blocks to
   extend).

Test with Data on Magnetic Disk and WAL on SSD
--------------------------------------------------------------------
Shared Buffer : 48GB
max_wal_size  : 10GB
Storage       : Magnetic Disk
WAL           : SSD

tps with different values of extend_num_pages
------------------------------------------------------------
Client    Base    10-Page    20-Page    50-Page
1         105     103        157        129
2         217     219        255        288
4         210     421        494        486
8         166     605        702        701
16        145     484        563        686
32        124     477        480        745

Test with Data and WAL on SSD
-----------------------------------------------
Shared Buffer : 48GB
Max Wal Size  : 10GB
Storage       : SSD

tps with different values of extend_num_pages
------------------------------------------------------------
Client    Base    10-Page    20-Page    50-Page    100-Page
1         152     153        155        147        157
2         281     281        292        275        287
4         236     505        502        508        514
8         171     662        687        767        764
16        145     527        639        826        907

Note: Test with both data and WAL on Magnetic Disk: no significant
improvement visible -- I think WAL write is becoming the bottleneck in this
case.

Apart from this approach, I also tried extending the file by multiple blocks
in one extend call, but this approach (extending one by one) is performing
better.

Currently I have kept extend_num_pages as a session-level parameter, but I
think later we can make it a table property. Any suggestion on this?
On Tue, Jan 12, 2016 at 2:41 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Jan 7, 2016 at 4:53 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:
> > I think it's a worthwhile approach to pursue. But until it actually
> > fixes the problem of leaving around uninitialized pages I don't think
> > it's very meaningful to do performance comparisons.
>
> Attached patch solves this issue, I am allocating the buffer for each page
> and initializing the page, only after that adding to FSM.

Few comments about patch:
1. Patch is not getting compiled:

   1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared identifier

I will fix it in the next version of the patch.
2.
! page = BufferGetPage(buffer);
! PageInit(page, BufferGetPageSize(buf), 0);
!
! freespace = PageGetHeapFreeSpace(page);
! MarkBufferDirty(buffer);
! UnlockReleaseBuffer(buffer);
! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace);

What is the need to mark the page dirty here, won't it automatically be marked dirty once the page is used? I think it is required if you wish to WAL-log this action.

These pages are not going to be used immediately, and we have done PageInit, so I think the page should be marked dirty before adding it to the FSM, so that if the buffer gets replaced out it still flushes the init data.
3. I think you don't need to multi-extend a relation if HEAP_INSERT_SKIP_FSM is used, as for that case it anyways tries to get a new page by extending the relation.

Yes, if HEAP_INSERT_SKIP_FSM is used and we use multi-extend, then at least in the current transaction it will not take pages from the FSM and will do a multi-extend every time. The pages will still get used if there are parallel backends, but it is still not a good idea for the current backend to extend by multiple chunks every time.
4. Again, why do you need this multi-extend optimization for local relations (those only accessible to the current backend)?

5. Do we need this for nbtree as well? One way to check that is by copying large data into a table having an index.
> Note: Test with both data and WAL on magnetic disk: no significant improvement visible
> -- I think WAL write is becoming the bottleneck in this case.

In that case, can you try the same test with unlogged tables?

Also, it is good to check the performance of the patch with a read-write workload, to ensure that extending the relation in multiple chunks does not regress such cases.

Ok.

> Currently I have kept extend_num_pages as a session-level parameter but I think later we can make this a table property.
> Any suggestion on this?

I think we should have a new storage parameter at the table level, extend_by_blocks or something like that, instead of a GUC. The default value of this parameter should be 1, which means retain current behaviour by default.
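As a rough illustration of the shape such a table-level option could take (names here are hypothetical; the real patch would add an entry to the built-in reloption tables in src/backend/access/common/reloptions.c rather than register it from an extension):

    /* Hypothetical registration of an integer reloption; default 1 keeps
     * today's one-block-at-a-time behaviour. */
    add_int_reloption(RELOPT_KIND_HEAP,
                      "extend_by_blocks",
                      "Number of blocks to extend a relation by at a time",
                      1,      /* default: retain current behaviour */
                      1,      /* minimum */
                      100);   /* arbitrary maximum */

The value could then be read back at extension time with an accessor macro along the lines of the existing RelationGetFillFactor() in rel.h.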
--
I found one more problem with the patch.

! UnlockReleaseBuffer(buffer);
! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace);

You can't call BufferGetBlockNumber(buffer) after releasing the pin on the buffer, which is released by UnlockReleaseBuffer(). Get the block number before unlocking the buffer.
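In other words, the per-page loop needs to look something along these lines (just a sketch of the ordering, not the actual patch; the caller is assumed to already hold the relation extension lock):

    /* Sketch only: add one new, initialized, empty page to the FSM. */
    Buffer      buffer;
    Page        page;
    Size        freespace;
    BlockNumber blockNum;

    buffer = ReadBuffer(relation, P_NEW);
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    page = BufferGetPage(buffer);
    PageInit(page, BufferGetPageSize(buffer), 0);
    freespace = PageGetHeapFreeSpace(page);
    MarkBufferDirty(buffer);

    blockNum = BufferGetBlockNumber(buffer);    /* while we still hold the pin */
    UnlockReleaseBuffer(buffer);

    RecordPageWithFreeSpace(relation, blockNum, freespace);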
> 1. Patch is not getting compiled.
> 1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared identifier

Oh, my mistake, my preprocessor is ignoring this error and replacing it with BLKSIZE.

FIXED

> 3. I think you don't need to multi-extend a relation if HEAP_INSERT_SKIP_FSM is used, as for that case it anyways tries to get a new page by extending the relation.

So I will change this.

> 4. Again, why do you need this multi-extend optimization for local relations (those only accessible to the current backend)?

I think we can change this while adding the table-level "extend_by_blocks": for local tables we will not allow this property, so no need to change at this place. What do you think?

Now I have added a table-level parameter for specifying the number of blocks. So do you think that we still need to block it, given that the user can control it? Moreover, I think if a user wants to use it for a local table there is no harm in it; at least by extending in one shot they avoid multiple acquisitions of the extension lock, though there will be no contention.

> 5. Do we need this for nbtree as well? One way to check that is by copying large data into a table having an index.

Ok, I will try this test and update.

> > Note: Test with both data and WAL on magnetic disk: no significant improvement visible
> > -- I think WAL write is becoming the bottleneck in this case.
>
> In that case, can you try the same test with unlogged tables?
setup:
------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script
./psql -d postgres -c "create unlogged table data (data text)" --> base
./psql -d postgres -c "create unlogged table data (data text) with(extend_by_blocks=50)" --patch
test_script:
------------
./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres
Shared Buffer 48GB
Table: Unlogged Table
ench -c$ -j$ -f -M Prepared postgres
Clients Base patch
1 178 180
2 337 338
4 265 601
8 167 805
> Also, it is good to check the performance of the patch with a read-write workload to ensure that extending the relation in multiple chunks does not regress such cases.

Ok.

> Currently I have kept extend_num_pages as a session-level parameter but I think later we can make this a table property.
> Any suggestion on this?
> I think we should have a new storage parameter at table level, extend_by_blocks or something like that, instead of a GUC. The default value of this parameter should be 1, which means retain current behaviour by default.

+1
I did not find any regression in the normal case. Note: I tested it with the previous patch, extend_num_pages=10 (GUC parameter), so that we can see any impact on the overall system.
On Thu, Jan 28, 2016 at 6:23 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> [ new patch ]

This patch contains a useless hunk and also code not in PostgreSQL style. Get pgindent set up and it will do it correctly for you, or look at the style of the surrounding code.

What I'm a bit murky about is *why* this should be any better than the status quo. I mean, the obvious explanation that pops to mind is that extending the relation by two pages at a time relieves pressure on the relation extension lock somehow. One other possible explanation is that calling RecordPageWithFreeSpace() allows multiple backends to get access to that page at the same time, while otherwise each backend would try to conduct a separate extension. But in the first case, what we ought to do is try to make the locking more efficient; and in the second case, we might want to think about recording the first page in the free space map too. I don't think the problem here is that w

Here's a sketch of another approach to this problem. Get rid of the relation extension lock. Instead, have an array of, say, 256 lwlocks. Each one protects the extension of relations where hash(relfilenode) % 256 maps to that lock. To extend a relation, grab the corresponding lwlock, do the work, then release the lwlock. You might occasionally have a situation where two relations are both being extended very quickly and happen to map to the same bucket, but that shouldn't be much of a problem in practice, and the way we're doing it presently is worse, not better, since two relation extension locks may very easily map to the same lock manager partition.

The only problem with this is that acquiring an LWLock holds off interrupts, and we don't want interrupts to be locked out across a potentially lengthy I/O. We could partially fix that if we call RESUME_INTERRUPTS() after acquiring the lock and HOLD_INTERRUPTS() just before releasing it, but there's still the problem that you might block non-interruptibly while somebody else has the lock. I don't see an easy solution to that problem right off-hand, but if something like this performs well we can probably conjure up some solution to that problem.

I'm not saying that we need to do that exact thing - but I am saying that I don't think we can proceed with an approach like this without first understanding why it works and whether there's some other way that might be better to address the underlying problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
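For illustration, the shape I have in mind is roughly this (names and the shared-memory allocation are hand-waved; this is not an existing API):

    /* A fixed array of LWLocks, allocated in shared memory at startup. */
    #define N_RELEXTENSION_LOCKS 256

    static LWLock *RelExtensionLockArray;

    static inline LWLock *
    RelExtensionLockFor(RelFileNode *rnode)
    {
        /* a proper hash over the whole RelFileNode would be better */
        return &RelExtensionLockArray[rnode->relNode % N_RELEXTENSION_LOCKS];
    }

    /* Extending a relation would then look like: */
    LWLockAcquire(RelExtensionLockFor(&rel->rd_node), LW_EXCLUSIVE);
    /* ... find a victim buffer, smgrextend(), initialize the page ... */
    LWLockRelease(RelExtensionLockFor(&rel->rd_node));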
On 2016-01-28 16:53:08 +0530, Dilip Kumar wrote:
> test_script:
> ------------
> ./psql -d postgres -c "truncate table data"
> ./psql -d postgres -c "checkpoint"
> ./pgbench -f copy_script -T 120 -c$ -j$ postgres
>
> Shared Buffer 48GB
> Table: Unlogged Table
> ./pgbench -c$ -j$ -f copy_script -M prepared postgres
>
> Clients   Base   Patch
> 1          178     180
> 2          337     338
> 4          265     601
> 8          167     805

Could you also measure how this behaves for an INSERT instead of a COPY workload? Both throughput and latency. It's quite possible that this causes latency hikes, because suddenly backends will have to wait for one another to extend by 50 pages. You'll probably have to use -P 1 or full statement logging to judge that. I think just having a number of connections inserting relatively wide rows into one table should do the trick.

I'm doubtful that anything that does the victim buffer search while holding the extension lock will actually scale in a wide range of scenarios. The copy scenario here probably isn't too bad because the copy ring buffers are in use, and because there are no reads increasing the usagecount of recent buffers; thus victim buffers are easily found.

Thanks,

Andres
On 2016-02-02 10:12:38 -0500, Robert Haas wrote:
> Here's a sketch of another approach to this problem. Get rid of the
> relation extension lock. Instead, have an array of, say, 256 lwlocks.
> Each one protects the extension of relations where hash(relfilenode) %
> 256 maps to that lock. To extend a relation, grab the corresponding
> lwlock, do the work, then release the lwlock. You might occasionally
> have a situation where two relations are both being extended very
> quickly and happen to map to the same bucket, but that shouldn't be
> much of a problem in practice, and the way we're doing it presently is
> worse, not better, since two relation extension locks may very easily
> map to the same lock manager partition.

I guess you suspect that the performance problems come from the heavyweight lock overhead? That's not what I *think* I've seen in profiles, but it's hard to conclusively judge that.

I kinda doubt that really solves the problem, profiles aside, though. The above wouldn't really get rid of the extension locks, it just changes the implementation a bit. We'd still do the victim buffer search, and filesystem operations, while holding an exclusive lock. Batching can solve some of that, but I think primarily we need more granular locking, or to get rid of locks entirely.

> The only problem with this is that acquiring an LWLock holds off
> interrupts, and we don't want interrupts to be locked out across a
> potentially lengthy I/O. We could partially fix that if we call
> RESUME_INTERRUPTS() after acquiring the lock and HOLD_INTERRUPTS()
> just before releasing it, but there's still the problem that you might
> block non-interruptibly while somebody else has the lock. I don't see
> an easy solution to that problem right off-hand, but if something like
> this performs well we can probably conjure up some solution to that
> problem.

Hm. I think to get rid of the HOLD_INTERRUPTS() we'd have to record what lock we were waiting on, and in which mode, before going into PGSemaphoreLock(). Then LWLockReleaseAll() could "hand off" the wakeup to the next waiter in the queue. Without that we'd sometimes end up absorbing a wakeup without then releasing the lock, causing everyone to block on a released lock. There are probably two major questions around that: will it have a performance impact, and will there be any impact on existing callers?

Regards,

Andres
On Tue, Feb 2, 2016 at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
> I'm doubtful that anything that does the victim buffer search while
> holding the extension lock will actually scale in a wide range of
> scenarios. The copy scenario here probably isn't too bad because the
> copy ring buffers are in use, and because there are no reads increasing
> the usagecount of recent buffers; thus victim buffers are easily found.

I agree that's an avenue we should try to explore. I haven't had any time to think much about how it should be done, but it seems like it ought to be possible somehow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Andres Freund wrote:
> Could you also measure how this behaves for [...]

While we're proposing benchmark cases -- I remember this being an issue with toast tables getting very large xml values, which causes multiple toast pages to be extended for each new value inserted. If there are multiple processes inserting these all the time, things get clogged.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
> On 2016-01-28 16:53:08 +0530, Dilip Kumar wrote:
> > test_script:
> > ------------
> > ./psql -d postgres -c "truncate table data"
> > ./psql -d postgres -c "checkpoint"
> > ./pgbench -f copy_script -T 120 -c$ -j$ postgres
> >
> > Shared Buffer 48GB
> > Table: Unlogged Table
> > ench -c$ -j$ -f -M Prepared postgres
> >
> > Clients Base patch
> > 1 178 180
> > 2 337 338
> > 4 265 601
> > 8 167 805
>
> Could you also measure how this behaves for an INSERT instead of a COPY
> workload?
> I'm doubtful that anything that does the victim buffer search while
> holding the extension lock will actually scale in a wide range of
> scenarios.
> Could you also measure how this behaves for an INSERT instead of a COPY
> workload?
I think such a test will be useful.
Observation:
------------------
Apart from this test, I have also used a tool which can take many stack traces with some delay between them.

1. I have observed that with the base code (when data doesn't fit in shared buffers) almost all the stack traces are waiting on "LockRelationForExtension", and
many on "FlushBuffer" as well (flushing the dirty buffer).
Total stacks captured: 204, FlushBuffer: 13, LockRelationForExtension: 187
(This test was run with 8 threads (512MB shared buffers), and a stack was captured after every 5 seconds.)

2. If I change shared buffers to 48GB then obviously FlushBuffer disappears, but LockRelationForExtension still shows up in very high numbers.

3. Performance of the base code in both cases, whether the data fits in shared buffers or not, remains very low and does not scale (we can see that in the results below).
Test--1 (big record insert and Data fits in shared Buffer)
------------------------------------------------------------
setup
--------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int) with(extend_by_blocks=$variable)" {create table data (a int) for base code}
echo "insert into data select * from test_data;" >> copy_script
test:
-----
shared_buffers=48GB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f copy_script -T 120 postgres
Client   Base   extend_by_block=50   extend_by_block=1000
1         113          115                  118
4          50          220                  216
8          43          202                  302
Test--2 (big record insert and Data doesn't fits in shared Buffer)
------------------------------------------------------------------
setup:
-------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int) with(extend_by_blocks=1000)"
echo "insert into data select * from test_data;" >> copy_script
test:
------
shared_buffers=512MB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f copy_script -T 120 postgres
Client   Base   extend_by_block=1000
1         125          125
4          49          236
8          41          294
16         39          279
Test--3 (small record insert and Data fits in shared Buffer)
------------------------------------------------------------------
setup:
--------
./psql -d postgres -c "create table test_data(a int)"
./psql -d postgres -c "insert into test_data values(generate_series(1,10000))"
./psql -d postgres -c "create table data (a int) with(extend_by_blocks=20)"
echo "insert into data select * from test_data;" >> copy_script
test:
-----
shared_buffers=48GB -c max_wal_size=20GB -c checkpoint_timeout=10min
./pgbench -c $ -j $ -f copy_script -T 120 postgres
Client   Base   Patch (extend_by_blocks=20)
1         137          143
2         269          250
4         377          443
8         170          690
16        145          745
*All tests done with data on magnetic disk and WAL on SSD.
Note: The last patch has a max limit of extend_by_blocks=100, so for taking performance numbers with extend_by_blocks=1000 I locally changed it.
I will send the modified patch once we finalize which approach to proceed with.
> I'm doubtful that anything that does the victim buffer search while
> holding the extension lock will actually scale in a wide range of
> scenarios.

I think the problem for the victim buffer could be visible if the blocks are dirty and it needs to write the dirty buffer, especially as the patch is written, where after acquiring the extension lock it again tries to extend the relation without checking whether it can get a page with space from the FSM. It seems to me that we should re-check the availability of a page, because while one backend is waiting on the extension lock, other backends might have added pages. To re-check the availability we might want to use something similar to LWLockAcquireOrWait() semantics as used during WAL writing.
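For what it's worth, the re-check could look roughly like this inside RelationGetBufferForTuple() (a sketch only; variable names follow hio.c conventions, but this is not the posted patch):

    /* We could not get a usable page, so wait for the extension lock. */
    LockRelationForExtension(relation, ExclusiveLock);

    /*
     * While we slept, another backend may already have extended the
     * relation, so look in the FSM once more before extending ourselves.
     */
    targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
    if (targetBlock != InvalidBlockNumber)
    {
        UnlockRelationForExtension(relation, ExclusiveLock);
        goto loop;              /* retry the normal FSM path */
    }

    /* Still nothing usable: extend the relation ourselves. */
    buffer = ReadBufferBI(relation, P_NEW, bistate);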
On 2016-02-10 10:32:44 +0530, Dilip Kumar wrote:
> On Fri, Feb 5, 2016 at 4:50 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Could you also measure how this behaves for an INSERT instead of a COPY
> > > workload?
> >
> > I think such a test will be useful.
>
> I have measured the performance with insert to see the behavior when it
> don't use strategy. I have tested multiple option, small tuple, big tuple,
> data fits in shared buffer and doesn't fit in shared buffer.

Could you please also have a look at the influence this has on latency? I think you unfortunately have to look at the per-transaction logs, and then see whether the worst case latencies get better or worse.

Greetings,

Andres Freund
> Could you please also have a look at the influence this has on latency?
> I think you unfortunately have to look at the per-transaction logs, and
> then see whether the worst case latencies get better or worse.

(I selected the case below to find the worst-case latency because in this case
we are extending by 1000 blocks and the data doesn't fit in shared buffers.)
Test--2 (big record insert and Data doesn't fits in shared Buffer)
------------------------------------------------------------------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int) with(extend_by_blocks=1000)"
echo "insert into data select * from test_data;" >> copy_script
shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=10min
./pgbench -c 8 -j 8 -f copy_script -T 120 -l postgres
           Base     Patch (extend 1000)
Best       23245       3857
Worst     236329     382859
Average   190303      35143
I have also attached the pgbench log files
patch_1000.tar --> log files with patch extend by 1000 blocks
base.tar --> log files with base code
From the attached files we can see that with the patch only a few transactions have high latency (> 300,000), which is expected given that we are
extending 1000 blocks at a time, whereas with the base code almost every transaction's latency is high (> 200,000). We can see that the best-case and average latencies are roughly 1/5 with extend-by-1000.
I have tested the relation extension patch from various aspects; performance results and other statistical data are explained in this mail.

Test 1: Identify whether the heavyweight lock is the problem or the actual context switches

1. I converted the RelationExtensionLock to a simple LWLock and tested with a single relation. Results are as below.
This is the simple script of copy 10000 record in one transaction of size 4 Bytes
Client   Base   LWLock   Multi-extend by 50 blocks
1         155      156          160
2         282      276          284
4         248      319          428
8         161      267          675
16        143      241          889
LWLock performance is better than base; the obvious reason may be that we save some instructions by converting to an LWLock, but it doesn't scale any better compared to the base code.
Test 2: Identify whether the improvement in the multi-extend case comes from avoiding context switches or from some other factor, like reusing blocks between backends by putting them in the FSM.

1. Test by just extending multiple blocks and reusing them in the extending backend itself (don't put them in the FSM).

Insert 1K records; data doesn't fit in shared buffers (512MB shared buffers).
Client   Base   Extend 800 blocks, self use   Extend 1000 blocks
1         117             131                        118
2         111             203                        140
3          51             242                        178
4          51             231                        190
5          52             259                        224
6          51             263                        243
7          43             253                        254
8          43             240                        254
16         40             190                        243
We can see the same improvement when the backend uses the blocks itself, which shows that sharing the blocks between backends was not the win; avoiding the context switches was the major win.
2. Tested the number of ProcSleep calls during the run.

This is the simple script of COPYing 10000 records of size 4 bytes in one transaction.

Client   Base TPS   Base ProcSleep count   Extend-by-10-blocks TPS   Extend-by-10-blocks ProcSleep count
2           280          457,506                    311                       62,641
3           235        1,098,701                    358                      141,624
4           216        1,155,735                    368                      188,173

The ProcSleep test for the INSERT case, where data doesn't fit in shared buffers and we insert big records of 1024 bytes, is currently running;
once I get the data I will post it.
Posting the re-based version and moving to next CF.
Open points:
1. After getting the lock, recheck the FSM in case some other backend has already added extra blocks, and reuse them.
2. Is it a good idea to have a user-level parameter for extend_by_blocks, or can we try some approach to internally identify how many blocks are needed and add only that many? That would make it more flexible.
--
On Wed, Feb 10, 2016 at 7:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Test 2: Identify whether the improvement in the multi-extend case comes from
> avoiding context switches or from some other factor, like reusing blocks
> between backends by putting them in the FSM.
>
> 1. Test by just extending multiple blocks and reusing them in the extending
> backend itself (don't put them in the FSM).
>
> Insert 1K records; data doesn't fit in shared buffers (512MB shared buffers).
>
> Client   Base   Extend 800 blocks, self use   Extend 1000 blocks
> 1         117             131                        118
> 2         111             203                        140
> 3          51             242                        178
> 4          51             231                        190
> 5          52             259                        224
> 6          51             263                        243
> 7          43             253                        254
> 8          43             240                        254
> 16         40             190                        243
>
> We can see the same improvement when the backend uses the blocks itself,
> which shows that sharing the blocks between backends was not the win;
> avoiding the context switches was the major win.
>
> 2. Tested the number of ProcSleep calls during the run.
>
> This is the simple script of COPYing 10000 records of size 4 bytes in one
> transaction.
>
> Client   Base TPS   Base ProcSleep count   Extend-by-10-blocks TPS   Extend-by-10-blocks ProcSleep count
> 2           280          457,506                    311                       62,641
> 3           235        1,098,701                    358                      141,624
> 4           216        1,155,735                    368                      188,173
>
> What we can see in the above test is that in the base code performance
> degrades after 2 threads, while the ProcSleep count increases by a huge
> amount. Compared to that, with the patch extending 10 blocks at a time,
> ProcSleep reduces to ~1/8 and we can see it scaling constantly.
>
> The ProcSleep test for the INSERT case, where data doesn't fit in shared
> buffers and we insert big records of 1024 bytes, is currently running;
> once I get the data I will post it.
One thing that is slightly unclear is whether there is any overhead due to buffer eviction, especially when the buffer to be evicted is already dirty and needs XLogFlush(). One reason why it might not hurt is that by the time we try to evict the buffer, the corresponding WAL is already flushed by the WAL writer; another possibility could be that even if it is getting done during buffer eviction, its impact is much smaller. Can we try to measure the number of flush calls which happen during buffer eviction?

> The ProcSleep test for the INSERT case, where data doesn't fit in shared buffers and we insert big records of 1024 bytes, is currently running; once I get the data I will post it.

Okay. However, if the performance data for the case when data doesn't fit into shared buffers also shows a similar trend, then it might be worth trying to extend w.r.t. the load on the system. I mean to say we can batch the extension requests (as we have done in ProcArrayGroupClearXid) and extend accordingly; if that works out, then the benefit could be that we don't need any configurable knob for the same.
2. Other option can be that we analyse the current Load on the extend and then speed up or slow down the extending speed.
Simple algorithm can look like this
If (GotBlockFromFSM())
Success++ // We got the block from FSM
Else
Failure++ // Did not get Block from FSM and need to extend by my self
Now while extending
-------------------------
Speed up
-------------
If (failure - success > Threshold ) // Threshold can be one number assume 10.
ExtendByBlock += failure- success; --> If many failure means load is high then Increase the ExtendByBlock
Failure = success= 0 --> reset after this so that we can measure the latest trend.
Slow down..
--------------
//Now its possible that demand of block is reduced but ExtendByBlock is still high.. So analyse statistic and slow down the extending pace..
// Can not reduce it by big number because, may be more request are satisfying because this is correct amount, so gradually decrease the pace and re-analyse the statistics next time.
ExtendByBlock --;
Failure = success= 0
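In C that could be as simple as something like this (purely illustrative; all names here are hypothetical, not from any posted patch):

    /* Hypothetical sketch of the speed-up / slow-down logic above. */
    static int  extend_by_blocks = 1;   /* how many blocks to add next time */
    static int  fsm_success = 0;        /* times we got a block from the FSM */
    static int  fsm_failure = 0;        /* times we had to extend ourselves */

    #define EXTEND_ADJUST_THRESHOLD 10

    static void
    adjust_extend_by_blocks(void)
    {
        if (fsm_failure - fsm_success > EXTEND_ADJUST_THRESHOLD)
        {
            /* Demand is high: extend by more blocks next time. */
            extend_by_blocks += fsm_failure - fsm_success;
        }
        else if (extend_by_blocks > 1)
        {
            /* Demand may have dropped: back off gradually. */
            extend_by_blocks--;
        }

        /* Reset, so the next adjustment reflects the latest trend. */
        fsm_success = 0;
        fsm_failure = 0;
    }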
> 1. One option can be as you suggested, like ProcArrayGroupClearXid, with some
> modification, because if we wait for the requests and extend w.r.t. that, we
> may again face the context switch problem. So maybe we can extend in some
> multiple of the requests. (But we need to take care whether to give a block
> directly to the requester or add it into the FSM, or do both, i.e. give at
> least one block to the requester and put some multiple in the FSM.)
>
> 2. Other option can be that we analyse the current load on the extend and
> then speed up or slow down the extending speed.
> Simple algorithm can look like this
I have tried the approach of group extend:

1. We convert the extension lock to a try-lock, and if we get the lock then we extend by one block.
2. If we don't get the lock, then we use the group leader concept, where only one process extends for all. A slight change from ProcArrayGroupClearXid is that here, other than satisfying the requesting backends, we add some extra blocks to the FSM, say GroupSize*10.
3. So actually we cannot get the exact load, but we still have a factor, the group size, that tells us the contention, and we extend in a multiple of that.
Performance Analysis:
---------------------
Performance is scaling with this approach. It is slightly lower than the previous patch, where we directly give the extend_by_blocks parameter and extend by multiple blocks; I think that's expected, because in the group-extend case we only add extra blocks after contention actually happens on the lock, whereas the former case extends extra blocks optimistically.
Test1(COPY)
-----
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script
./pgbench -f copy_script -T 300 -c$ -j$ postgres
Shared Buffer:8GB max_wal_size:10GB Storage:Magnetic Disk WAL:SSD
-----------------------------------------------------------------------------------------------
Client   Base   multi_extend by 20 pages   group_extend patch (groupsize*10)
1         105           157                         149
2         217           255                         282
4         210           494                         452
8         166           702                         629
16        145           563                         561
32        124           480                         566
Test2(INSERT)
--------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int, b text)
echo "insert into data select * from test_data;" >> insert_script
shared_buffers=512MB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f insert_script -T 300 postgres
Client   Base   Multi-extend by 1000   group extend (group*10)   group extend (group*100)
1         117           118                     125                       122
2         111           140                     161                       159
4          51           190                     141                       187
8          43           254                     148                       172
16         40           243                     150                       173
On Fri, Mar 4, 2016 at 12:06 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have tried the approach of group extend,
>
> 1. We convert the extension lock to TryLock and if we get the lock then
> extend by one block.
> 2. If don't get the Lock then use the Group leader concept where only one
> process will extend for all, Slight change from ProcArrayGroupClear is that
> here other than satisfying the requested backend we Add some extra blocks in
> FSM, say GroupSize*10.
> 3. So Actually we can not get exact load but still we have some factor like
> group size tell us exactly the contention size and we extend in multiple of
> that.

This approach seems good to me, and the performance results look very positive. The nice thing about this is that there is not a user-configurable knob; the system automatically determines when larger extensions are needed, which will mean that real-world users are much more likely to benefit from this. I don't think it matters that this is a little faster or slower than an approach with a manual knob; what matters is that it is a huge improvement over unpatched master, and that it does not need a knob. The arbitrary constant of 10 is a little unsettling but I think we can live with it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> This approach seems good to me, and the performance results look very
> positive. The nice thing about this is that there is not a
> user-configurable knob; the system automatically determines when
> larger extensions are needed, which will mean that real-world users
> are much more likely to benefit from this. I don't think it matters
> that this is a little faster or slower than an approach with a manual
> knob; what matter is that it is a huge improvement over unpatched
> master, and that it does not need a knob. The arbitrary constant of
> 10 is a little unsettling but I think we can live with it.

+1. "No knob" is a huge win.

regards, tom lane
>
> On Fri, Mar 4, 2016 at 12:06 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have tried the approach of group extend,
> >
> > 1. We convert the extension lock to TryLock and if we get the lock then
> > extend by one block.2.
> > 2. If don't get the Lock then use the Group leader concep where only one
> > process will extend for all, Slight change from ProcArrayGroupClear is that
> > here other than satisfying the requested backend we Add some extra blocks in
> > FSM, say GroupSize*10.
> > 3. So Actually we can not get exact load but still we have some factor like
> > group size tell us exactly the contention size and we extend in multiple of
> > that.
>
> This approach seems good to me, and the performance results look very
> positive. The nice thing about this is that there is not a
> user-configurable knob; the system automatically determines when
> larger extensions are needed, which will mean that real-world users
> are much more likely to benefit from this.
+{
+ volatile PROC_HDR *procglobal = ProcGlobal;
+ uint32 nextidx;
+ uint32 wakeidx;
+ int extraWaits = -1;
+ BlockNumber targetBlock;
+ int count = 0;
+
+ /* Add ourselves to the list of processes needing a group extend. */
+ proc->groupExtendMember = true;
..
..
+ /* We are the leader. Acquire the lock on behalf of everyone. */
+ LockRelationForExtension(relation, ExclusiveLock);
To provide it for multiple relations, I think you need to advertise the reloid for the relation in each proc and then get the relation descriptor for the relation extension lock.
On Fri, Mar 4, 2016 at 11:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think one thing which needs more thoughts about this approach is that we
> need to maintain some number of slots so that group extend for different
> relations can happen in parallel. Do we want to provide simultaneous
> extension for 1, 2, 3, 4, 5 or more number of relations? I think providing
> it for three or four relations should be okay as higher the number we want
> to provide, bigger the size of PGPROC structure will be.

Hmm. Can we drive this off of the heavyweight lock manager's idea of how big the relation extension lock wait queue is, instead of adding more stuff to PGPROC?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Fri, Mar 4, 2016 at 11:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think one thing which needs more thoughts about this approach is that we
> > need to maintain some number of slots so that group extend for different
> > relations can happen in parallel. Do we want to provide simultaneous
> > extension for 1, 2, 3, 4, 5 or more number of relations? I think providing
> > it for three or four relations should be okay as higher the number we want
> > to provide, bigger the size of PGPROC structure will be.
>
> Hmm. Can we drive this off of the heavyweight lock manager's idea of
> how big the relation extension lock wait queue is, instead of adding
> more stuff to PGPROC?
>
On Tue, Mar 8, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Hmm. Can we drive this off of the heavyweight lock manager's idea of
>> how big the relation extension lock wait queue is, instead of adding
>> more stuff to PGPROC?
>
> One idea to make it work without adding additional stuff in PGPROC is that
> after acquiring relation extension lock, check if there is any available
> block in fsm, if it founds any block, then release the lock and proceed,
> else extend the relation by one block and then check lock's wait queue size
> or number of lock requests (nRequested) and extend the relation further in
> proportion to wait queue size and then release the lock and proceed. Here,
> I think we can check for wait queue size even before extending the relation
> by one block.
>
> The benefit of doing it with PGPROC is that there will be relatively less
> number LockAcquire calls as compare to heavyweight lock approach, which I
> think should not matter much because we are planing to extend the relation
> in proportion to wait queue size (probably wait queue size * 10).

I don't think switching relation extension from heavyweight locks to lightweight locks is going to work. It would mean, for example, that we lose the ability to service interrupts while extending a relation; not to mention that we lose scalability if many relations are being extended at once.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Tue, Mar 8, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> Hmm. Can we drive this off of the heavyweight lock manager's idea of
> >> how big the relation extension lock wait queue is, instead of adding
> >> more stuff to PGPROC?
> >
> > One idea to make it work without adding additional stuff in PGPROC is that
> > after acquiring relation extension lock, check if there is any available
> > block in fsm, if it founds any block, then release the lock and proceed,
> > else extend the relation by one block and then check lock's wait queue size
> > or number of lock requests (nRequested) and extend the relation further in
> > proportion to wait queue size and then release the lock and proceed. Here,
> > I think we can check for wait queue size even before extending the relation
> > by one block.
> >
> > The benefit of doing it with PGPROC is that there will be relatively less
> > number LockAcquire calls as compare to heavyweight lock approach, which I
> > think should not matter much because we are planing to extend the relation
> > in proportion to wait queue size (probably wait queue size * 10).
>
> I don't think switching relation extension from heavyweight locks to
> lightweight locks is going to work.
On Tue, Mar 8, 2016 at 11:20 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have come up with this patch..
>
> If this approach looks fine then I will prepare final patch (more comments,
> indentation, and improve some code) and do some long run testing (current
> results are 5 mins run).
>
> Idea is same what Robert and Amit suggested up thread.

So this seems promising, but I think the code needs some work.
LockWaiterCount() bravely accesses a shared memory data structure that
is mutable with no locking at all. That might actually be safe for
our purposes, but I think it would be better to take the partition
lock in shared mode if it doesn't cost too much performance. If
that's too expensive, then it should at least have a comment
explaining (1) that it is doing this without the lock and (2) why
that's safe (sketch: the LOCK can't go away because we hold it, and
nRequested could change but we don't mind a stale value, and a 32-bit
read won't be torn).
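For instance, taking the partition lock in shared mode might look roughly like this (a sketch along lock.c lines, not necessarily what the final function should be):

    int
    LockWaiterCount(LOCKTAG *locktag)
    {
        uint32   hashcode = LockTagHashCode(locktag);
        LWLock  *partitionLock = LockHashPartitionLock(hashcode);
        LOCK    *lock;
        bool     found;
        int      waiters = 0;

        LWLockAcquire(partitionLock, LW_SHARED);
        lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
                                                    (void *) locktag,
                                                    hashcode,
                                                    HASH_FIND,
                                                    &found);
        if (found)
            waiters = lock->nRequested;     /* a slightly stale value is fine */
        LWLockRelease(partitionLock);

        return waiters;
    }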
A few of the other functions in here also lack comments, and perhaps
should have them.
The changes to RelationGetBufferForTuple need a visit from the
refactoring police. Look at the way you are calling
RelationAddOneBlock. The logic looks about like this:
    if (needLock)
    {
        if (trylock relation for extension)
            RelationAddOneBlock();
        else
        {
            lock relation for extension;
            if (use fsm)
            {
                complicated;
            }
            else
                RelationAddOneBlock();
        }
    }
    else
        RelationAddOneBlock();
So basically you want to do the RelationAddOneBlock() thing if
!needLock || !use_fsm || can't trylock. See if you can rearrange the
code so that there's only one fallthrough call to
RelationAddOneBlock() instead of three separate ones.
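That is, something shaped more like this (pseudocode only; "got_page_already" is just a placeholder for whatever the FSM path produces):

    /* Decide up front whether the one-block fallthrough applies. */
    extend_one_block = true;

    if (needLock && !trylock relation for extension)
    {
        lock relation for extension;
        if (use fsm)
        {
            /* complicated: recheck FSM, possibly bulk-extend for waiters;
             * this may already leave us with a usable page */
            extend_one_block = !got_page_already;
        }
    }

    if (extend_one_block)
        RelationAddOneBlock();      /* the single fallthrough call */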
Also, consider moving the code that adds multiple blocks at a time to
its own function instead of including it all in line.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/03/16 02:53, Dilip Kumar wrote:
> Attaching a latest patch.

Hmm, why did you remove the comment above the call to UnlockRelationForExtension? It still seems relevant, maybe with some minor modification?

Also there is a bit of whitespace mess inside the conditional lock block.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Thanks for the comments.

> Hmm, why did you remove the comment above the call to UnlockRelationForExtension?
> It still seems relevant, maybe with some minor modification?

While refactoring I lost this comment. Fixed it.

> Also there is a bit of whitespace mess inside the conditional lock block.

Fixed.
Results for a 10-minute run of COPY of 10000 records of size 4 bytes; script and configuration are the same as used upthread.
--------------------------------------------------------------------------------------------
Client   Base   Patch
1         105     111
2         217     246
4         210     428
8         166     653
16        145     808
32        124     988
Results for a 10-minute run of INSERT of 1000 records of size 1024 bytes (data doesn't fit in shared buffers).
--------------------------------------------------------------------------------------------------
Client   Base   Patch
1         117     120
2         111     126
4          51     130
8          43     147
16         40     209
32        ---     254
64        ---     205
--
On 10/03/16 09:57, Dilip Kumar wrote:
> On Thu, Mar 10, 2016 at 7:55 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>
> Thanks for the comments..
>
>> Hmm, why did you remove the comment above the call to
>> UnlockRelationForExtension?
>
> While re factoring I lose this comment.. Fixed it
>
>> It still seems relevant, maybe with some minor modification?
>>
>> Also there is a bit of whitespace mess inside the conditional lock block.
>
> Fixed
>
> I got the result of 10 mins run so posting it..
> Note: Base code results are copied from up thread...
>
> Results For 10 Mins run of COPY 10000 records of size 4 bytes; script and
> configuration are same as used in up thread
> --------------------------------------------------------------------------------------------
>
> Client   Base   Patch
> 1         105     111
> 2         217     246
> 4         210     428
> 8         166     653
> 16        145     808
> 32        124     988
> 64        ---     974
>
> Results For 10 Mins run of INSERT 1000 records of size 1024 bytes (data
> doesn't fit in shared buffer)
> --------------------------------------------------------------------------------------------------
>
> Client   Base   Patch
> 1         117     120
> 2         111     126
> 4          51     130
> 8          43     147
> 16         40     209
> 32        ---     254
> 64        ---     205

Those look good. The patch looks good in general now. I am bit scared by the lockWaiters * 20 as it can result in relatively big changes in rare corner cases when for example a lot of backends were waiting for lock on relation and suddenly all try to extend it. I wonder if we should clamp it to something sane (although what's sane today might be small in couple of years).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Fri, Mar 11, 2016 at 12:04 AM, Petr Jelinek <spandir="ltr"><<a href="mailto:petr@2ndquadrant.com" target="_blank">petr@2ndquadrant.com</a>></span> wrote:</div><divclass="gmail_quote"><br /></div><div class="gmail_quote">Thanks for looking..</div><div class="gmail_quote"><br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thoselook good. The patch looks good in general now. I am bit scared by the lockWaiters * 20 as itcan result in relatively big changes in rare corner cases when for example a lot of backends were waiting for lock on relationand suddenly all try to extend it. I wonder if we should clamp it to something sane (although what's sane today mightbe small in couple of years).</blockquote></div><div class="gmail_extra"><br /></div>But in such condition when allare waiting on lock, then at a time only one waiter will get the lock and that will easily count the lock waiter and extendin multiple of that. And we also have the check that if any waiter get the lock it will first check in FSM that anybodyelse have added one block or not..</div><div class="gmail_extra"><br /></div><div class="gmail_extra">And other waiterwill not get lock unless first waiter extend all blocks and release the locks.</div><div class="gmail_extra"><br /></div><divclass="gmail_extra">One possible case is as soon as we extend the blocks new requester directly find in FSM anddon't come for lock, and old waiter after getting lock don't find in FSM, But IMHO in such cases, also its good that otherwaiter also extend more blocks (because this can happen when request flow is very high).</div><div class="gmail_extra"><br/></div><div class="gmail_extra"><br /></div><div class="gmail_extra">-- <br /><div class="gmail_signature"><divdir="ltr"><span style="color:rgb(80,0,80);font-size:12.8px">Regards,</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">Dilip Kumar</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">EnterpriseDB: </span><a href="http://www.enterprisedb.com/"style="color:rgb(17,85,204);font-size:12.8px" target="_blank">http://www.enterprisedb.com</a><br/></div></div></div></div>
On 11/03/16 02:44, Dilip Kumar wrote:
> On Fri, Mar 11, 2016 at 12:04 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>
> Thanks for looking..
>
>> Those look good. The patch looks good in general now. I am bit scared
>> by the lockWaiters * 20 as it can result in relatively big changes in
>> rare corner cases when for example a lot of backends were waiting for
>> lock on relation and suddenly all try to extend it. I wonder if we
>> should clamp it to something sane (although what's sane today might be
>> small in couple of years).
>
> But in such condition when all are waiting on lock, then at a time only
> one waiter will get the lock and that will easily count the lock waiter
> and extend in multiple of that. And we also have the check that if any
> waiter get the lock it will first check in FSM that anybody else have
> added one block or not..

I am not talking about extension locks, the lock queue can be long because there is concurrent DDL for example and then once DDL finishes suddenly 100 connections that tried to insert into table will try to get extension lock and this will add 2000 new pages when much fewer was actually needed. I guess that's fine as it's corner case and it's only 16MB even in such extreme case.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 10, 2016 at 8:54 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
> I am not talking about extension locks, the lock queue can be long because
> there is concurrent DDL for example and then once DDL finishes suddenly 100
> connections that tried to insert into table will try to get extension lock
> and this will add 2000 new pages when much fewer was actually needed. I
> guess that's fine as it's corner case and it's only 16MB even in such
> extreme case.

I don't really understand this part about concurrent DDL. If there were concurrent DDL going on, presumably other backends would be blocked on the relation lock, not the relation extension lock - and it doesn't seem likely that you'd often have a huge pile-up of inserters waiting on concurrent DDL. But I guess it could happen.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/03/16 22:29, Robert Haas wrote:
> On Thu, Mar 10, 2016 at 8:54 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>> I am not talking about extension locks, the lock queue can be long because
>> there is concurrent DDL for example and then once DDL finishes suddenly 100
>> connections that tried to insert into table will try to get extension lock
>> and this will add 2000 new pages when much fewer was actually needed. I
>> guess that's fine as it's corner case and it's only 16MB even in such
>> extreme case.
>
> I don't really understand this part about concurrent DDL. If there
> were concurrent DDL going on, presumably other backends would be
> blocked on the relation lock, not the relation extension lock - and it
> doesn't seem likely that you'd often have a huge pile-up of inserters
> waiting on concurrent DDL. But I guess it could happen.

Yeah I was thinking about the latter part and as I said it's very rare case, but I did see something similar couple of times in the wild. It's not objection against committing this patch though, in fact I think it can be committed as is.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3/11/16 5:14 PM, Petr Jelinek wrote:
>> I don't really understand this part about concurrent DDL. If there
>> were concurrent DDL going on, presumably other backends would be
>> blocked on the relation lock, not the relation extension lock - and it
>> doesn't seem likely that you'd often have a huge pile-up of inserters
>> waiting on concurrent DDL. But I guess it could happen.
>
> Yeah I was thinking about the latter part and as I said it's very rare
> case, but I did see something similar couple of times in the wild. It's
> not objection against committing this patch though, in fact I think it
> can be committed as is.

FWIW, this is definitely a real possibility in any shop that has very high downtime costs and high transaction rates.

I also think some kind of clamp is a good idea. It's not that uncommon to run max_connections significantly higher than 100, so the extension could be way larger than 16MB. In those cases this patch could actually make things far worse as everyone backs up waiting on the OS to extend many MB when all you actually needed were a couple dozen more pages.

BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent demand for extension that means you're running lots of small DML operations, not really big ones. I'd think that would make *1 more appropriate.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
> FWIW, this is definitely a real possibility in any shop that has very high downtime costs and high transaction rates.
>
> I also think some kind of clamp is a good idea. It's not that uncommon to run max_connections significantly higher than 100, so the extension could be way larger than 16MB. In those cases this patch could actually make things far worse as everyone backs up waiting on the OS to extend many MB when all you actually needed were a couple dozen more pages.

I agree, we can have some max limit on the number of extra pages. What do others think?

> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent demand for extension that means you're running lots of small DML operations, not really big ones. I'd think that would make *1 more appropriate.

*1 will not solve this problem. Here the main problem was many backends sleeping/waking up on the extension lock, and that was causing the bottleneck. With *1 we only satisfy the current requesters which have already waited on the lock, but our goal is to stop backends from requesting this lock in the first place.

The idea of finding the number of requesters is to get a statistic on the load on this lock, and extend in a multiple of that load, so that this situation is avoided for a long time; when it happens again, we again extend in a multiple of the load.

How did 20 come about? I tested with multiple client loads (1..64), with multiple load sizes (4-byte records to 1KB records), COPY and INSERT, and found 20 works best.
>
>
> On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>
>> FWIW, this is definitely a real possibility in any shop that has very high downtime costs and high transaction rates.
>>
>> I also think some kind of clamp is a good idea. It's not that uncommon to run max_connections significantly higher than 100, so the extension could be way larger than 16MB. In those cases this patch could actually make things far worse as everyone backs up waiting on the OS to extend many MB when all you actually needed were a couple dozen more pages.
>
>
> I agree, We can have some max limit on number of extra pages, What other thinks ?
>
>
>>
>> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent demand for extension that means you're running lots of small DML operations, not really big ones. I'd think that would make *1 more appropriate.
>
>
> *1 will not solve this problem, Here the main problem was many people are sleep/wakeup on the extension lock and that was causing the bottleneck. So if we do *1 this will satisfy only current requesters which has already waited on the lock. But our goal is to avoid backends from requesting this lock.
>
> Idea of Finding the requester to get the statistics on this locks (load on the lock) and extend in multiple of load so that in future this situation will be avoided for long time and again when happen next time extend in multiple of load.
>
> How 20 comes ?
> I tested with Multiple clients loads 1..64, with multiple load size 4 byte records to 1KB Records, COPY/ INSERT and found 20 works best.
>
Can you post the numbers for 1, 5, 10, 15, 25 or whatever other multiplier you have tried, so that it is clear that 20 is best?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 12/03/16 01:01, Jim Nasby wrote:
> On 3/11/16 5:14 PM, Petr Jelinek wrote:
>>> I don't really understand this part about concurrent DDL. If there
>>> were concurrent DDL going on, presumably other backends would be
>>> blocked on the relation lock, not the relation extension lock - and it
>>> doesn't seem likely that you'd often have a huge pile-up of inserters
>>> waiting on concurrent DDL. But I guess it could happen.
>>
>> Yeah I was thinking about the latter part and as I said it's very rare
>> case, but I did see something similar couple of times in the wild. It's
>> not objection against committing this patch though, in fact I think it
>> can be committed as is.
>
> FWIW, this is definitely a real possibility in any shop that has very
> high downtime costs and high transaction rates.
>
> I also think some kind of clamp is a good idea. It's not that uncommon
> to run max_connections significantly higher than 100, so the extension
> could be way larger than 16MB. In those cases this patch could actually
> make things far worse as everyone backs up waiting on the OS to extend
> many MB when all you actually needed were a couple dozen more pages.

Well yeah I've seen 10k, but not everything will write to same table, wanted to stay in realms of something that has realistic chance of happening.

> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent
> demand for extension that means you're running lots of small DML
> operations, not really big ones. I'd think that would make *1 more
> appropriate.

The benchmarks I've seen showed you want at least *10 and *20 was better.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12/03/16 03:46, Dilip Kumar wrote:
> On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>
>> FWIW, this is definitely a real possibility in any shop that has
>> very high downtime costs and high transaction rates.
>>
>> I also think some kind of clamp is a good idea. It's not that
>> uncommon to run max_connections significantly higher than 100, so
>> the extension could be way larger than 16MB. In those cases this
>> patch could actually make things far worse as everyone backs up
>> waiting on the OS to extend many MB when all you actually needed
>> were a couple dozen more pages.
>
> I agree, We can have some max limit on number of extra pages, What other
> thinks ?

Well, that's what I meant with clamping originally. I don't know what is a good value though.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Sat, Mar 12, 2016 at 8:37 AM, Amit Kapila <spandir="ltr"><<a href="mailto:amit.kapila16@gmail.com" target="_blank">amit.kapila16@gmail.com</a>></span> wrote:<br/><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Canyou post the numbers for 1, 5, 10, 15, 25 or whatever other multiplier you have tried,so that it is clear that 20 is best?</blockquote></div><br /></div><div class="gmail_extra">I had Tried with 1, 10,20 and 50.<br /><br /></div><div class="gmail_extra">1. With base code it was almost the same as base code.<br /></div><divclass="gmail_extra"><br />2. With 10 thread data it matching with my previous group extend patch data postedupthread<br /><a href="http://www.postgresql.org/message-id/CAFiTN-tyEu+Wf0-jBc3TGfCoHdEAjNTx=WVuxpoA1vDDyST6KQ@mail.gmail.com">http://www.postgresql.org/message-id/CAFiTN-tyEu+Wf0-jBc3TGfCoHdEAjNTx=WVuxpoA1vDDyST6KQ@mail.gmail.com</a><br /><br/></div><div class="gmail_extra">3. Beyond 20 with 50 I did not see any extra benefit in performance number (comparedto 20 when tested 50 with 4 byte COPY I did not tested other data size with 50.).<br /><br />-- <br /><div class="gmail_signature"><divdir="ltr"><span style="color:rgb(80,0,80);font-size:12.8px">Regards,</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">Dilip Kumar</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">EnterpriseDB: </span><a href="http://www.enterprisedb.com/"style="color:rgb(17,85,204);font-size:12.8px" target="_blank">http://www.enterprisedb.com</a><br/></div></div></div></div>
On 3/11/16 9:57 PM, Petr Jelinek wrote: >> I also think some kind of clamp is a good idea. It's not that >> uncommon to run max_connections significantly higher than 100, so >> the extension could be way larger than 16MB. In those cases this >> patch could actually make things far worse as everyone backs up >> waiting on the OS to extend many MB when all you actually needed >> were a couple dozen more pages. >> >> >> I agree, We can have some max limit on number of extra pages, What other >> thinks ? >> > > Well, that's what I meant with clamping originally. I don't know what is > a good value though. Well, 16MB is 2K pages, which is what you'd get if 100 connections were all blocked and we're doing 20 pages per waiter. That seems like a really extreme scenario, so maybe 4MB is a good compromise. That's unlikely to be hit in most cases, unlikely to put a ton of stress on IO, even with magnetic media (assuming the whole 4MB is queued to write in one shot...). 4MB would still reduce the number of locks by 500x. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
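To make the arithmetic above concrete, here is a small standalone sketch (assuming the default 8KB block size and the 20-blocks-per-waiter multiplier under discussion; illustrative only, not code from the patch):

#include <stdio.h>

#define BLOCKS_PER_WAITER 20
#define CLAMP_BLOCKS      512     /* the proposed 4MB cap: 512 * 8KB */

int
main(void)
{
    int waiters[] = {25, 100};

    for (int i = 0; i < 2; i++)
    {
        int blocks = waiters[i] * BLOCKS_PER_WAITER;
        int clamped = blocks > CLAMP_BLOCKS ? CLAMP_BLOCKS : blocks;

        /* ~100 blocked backends * 20 blocks * 8KB is roughly the 16MB figure above */
        printf("%3d waiters -> %4d blocks -> %.1f MB (with 512-block clamp: %.1f MB)\n",
               waiters[i], blocks, blocks * 8.0 / 1024.0, clamped * 8.0 / 1024.0);
    }
    return 0;
}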
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Mon, Mar 14, 2016 at 5:02 AM, Jim Nasby <span dir="ltr"><<ahref="mailto:Jim.Nasby@bluetreble.com" target="_blank">Jim.Nasby@bluetreble.com</a>></span> wrote:<br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Well, 16MBis 2K pages, which is what you'd get if 100 connections were all blocked and we're doing 20 pages per waiter. That seemslike a really extreme scenario, so maybe 4MB is a good compromise. That's unlikely to be hit in most cases, unlikelyto put a ton of stress on IO, even with magnetic media (assuming the whole 4MB is queued to write in one shot...).4MB would still reduce the number of locks by 500x.</blockquote></div><br /></div><div class="gmail_extra">In myperformance results given up thread, we are getting max performance at 32 clients, means at a time we are extending 32*20~= max (600) pages at a time. So now with 4MB limit (max 512 pages) Results will looks similar. So we need to take adecision whether 4MB is good limit, should I change it ?<br /><br /><br />-- <br /><div class="gmail_signature"><div dir="ltr"><spanstyle="color:rgb(80,0,80);font-size:12.8px">Regards,</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">Dilip Kumar</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">EnterpriseDB: </span><a href="http://www.enterprisedb.com/"style="color:rgb(17,85,204);font-size:12.8px" target="_blank">http://www.enterprisedb.com</a><br/></div></div></div></div>
On 14/03/16 03:29, Dilip Kumar wrote: > > On Mon, Mar 14, 2016 at 5:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com > <mailto:Jim.Nasby@bluetreble.com>> wrote: > > Well, 16MB is 2K pages, which is what you'd get if 100 connections > were all blocked and we're doing 20 pages per waiter. That seems > like a really extreme scenario, so maybe 4MB is a good compromise. > That's unlikely to be hit in most cases, unlikely to put a ton of > stress on IO, even with magnetic media (assuming the whole 4MB is > queued to write in one shot...). 4MB would still reduce the number > of locks by 500x. > > > In my performance results given up thread, we are getting max > performance at 32 clients, means at a time we are extending 32*20 ~= max > (600) pages at a time. So now with 4MB limit (max 512 pages) Results > will looks similar. So we need to take a decision whether 4MB is good > limit, should I change it ? > > Well any value we choose will be very arbitrary. If we look at it from the point of maximum absolute disk space we allocate for relation at once, the 4MB limit would represent 2.5 orders of magnitude change. That sounds like enough for one release cycle, I think we can further tune it if the need arises in next one. (with my love for round numbers I would have suggested 8MB as that's 3 orders of magnitude, but I am fine with 4MB as well) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Well any value we choose will be very arbitrary. If we look at it from the point of maximum absolute disk space we allocate for relation at once, the 4MB limit would represent 2.5 orders of magnitude change. That sounds like enough for one release cycle, I think we can further tune it if the need arises in next one. (with my love for round numbers I would have suggested 8MB as that's 3 orders of magnitude, but I am fine with 4MB as well)
Attachment
On 17/03/16 04:42, Dilip Kumar wrote: > > On Mon, Mar 14, 2016 at 8:26 AM, Petr Jelinek <petr@2ndquadrant.com > <mailto:petr@2ndquadrant.com>> wrote: > > Well any value we choose will be very arbitrary. If we look at it > from the point of maximum absolute disk space we allocate for > relation at once, the 4MB limit would represent 2.5 orders of > magnitude change. That sounds like enough for one release cycle, I > think we can further tune it if the need arises in next one. (with > my love for round numbers I would have suggested 8MB as that's 3 > orders of magnitude, but I am fine with 4MB as well) > > > I have modified the patch, this contains the max limit on extra pages, > 512(4MB) pages is the max limit. > > I have measured the performance also and that looks equally good. > Great. Just small notational thing, maybe this would be simpler?: extraBlocks = Min(512, lockWaiters * 20); -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Great.
Just small notational thing, maybe this would be simpler?:
extraBlocks = Min(512, lockWaiters * 20);
Attachment
>
>
> On Thu, Mar 17, 2016 at 1:31 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
>>
>> Great.
>>
>> Just small notational thing, maybe this would be simpler?:
>> extraBlocks = Min(512, lockWaiters * 20);
>
>
> Done, new patch attached.
>
Review comments:
1.
+/*
+ * RelationAddOneBlock
+ *
+ * Extend relation by one block and lock the buffer
+ */
+static Buffer
+RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate)

Shall we mention in comments that this API returns a locked buffer and it's the responsibility of the caller to unlock it?

2.
+ /* To avoid the cases when there are huge number of lock waiters, and
+  * extend file size by big amount at a time, put some limit on the

The first line in multi-line comments should not contain anything.

3.
+ extraBlocks = Min(512, lockWaiters * 20);

I think we should explain in the comments the reason for choosing 20 in the above calculation.

4. Sometime back [1], you agreed on doing some analysis of the overhead that XLogFlush can cause during buffer eviction, but I don't see the results of that; are you planning to complete it?

5.
+ if (LOCKACQUIRE_OK
+     != RelationExtensionLockConditional(relation, ExclusiveLock))

I think the coding style is to keep the constant on the right side of the condition; did you see any other place in the code which uses the check in a similar way?

6.
- /*
-  * Now acquire lock on the new page.
+ /* For all case we need to add at least one block to satisfy our
+  * own request.
+  */

Same problem as in point 2.

7.
@@ -472,6 +581,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 if (needLock)
 UnlockRelationForExtension(relation, ExclusiveLock);
+
+

Spurious newline addition.

8.
+int
+RelationExtensionLockWaiter(Relation relation)

How about naming this function RelationExtensionLockWaiterCount?

9.
+ /* Other waiter has extended the block for us*/

Provide an extra space before ending the comment.

10.
+ if (use_fsm)
+ {
+     if (lastValidBlock != InvalidBlockNumber)
+     {
+         targetBlock = RecordAndGetPageWithFreeSpace(relation,
+                                                     lastValidBlock,
+                                                     pageFreeSpace,
+                                                     len + saveFreeSpace);
+     }

Are you using RecordAndGetPageWithFreeSpace() instead of GetPageWithFreeSpace() to get a page close to the previous target page? If yes, then do you see enough benefit from it to compensate for the additional write operation which the Record* API might cause due to additional dirtying of the buffer?

Here we are calling RecordAndGetPageWithFreeSpace instead of GetPageWithFreeSpace because another backend which got the lock might have added extra blocks in the FSM, and it's possible that the FSM tree has not been updated so far, so we would not find the page by searching from the top using GetPageWithFreeSpace; hence we need to search the leaf page directly using our last valid target block.

Explained the same in the comments...

11.
+ {
+     /*
+      * First try to get the lock in no-wait mode, if succeed extend one
+      * block, else get the lock in normal mode and after we get the lock
-     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+     buffer = RelationAddOneBlock(relation, otherBuffer, bistate);

Won't this cause one extra block addition after the backend extends the relation for multiple blocks? What is the need for that?

12. I think it would be good to run pgbench read-write tests once to ensure that this doesn't introduce any new regression.
Attachment
On 22/03/16 10:15, Dilip Kumar wrote: > > On Mon, Mar 21, 2016 at 8:10 PM, Amit Kapila <amit.kapila16@gmail.com > <mailto:amit.kapila16@gmail.com>> wrote: > 11. > +{ > +/* > +* First try to get the lock in no-wait mode, if succeed extend one > + > * block, else get the lock in normal mode and after we get the lock > -LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); > +buffer = > RelationAddOneBlock(relation, otherBuffer, bistate); > > Won't this cause one extra block addition after backend extends the > relation for multiple blocks, what is the need of same? > > > This is the block for this backend, Extra extend for future request and > already added to FSM. I could have added this count along with extra > block in RelationAddExtraBlocks, But In that case I need to put some > extra If for saving one buffer for this bakend and then returning that > the specific buffer to caller, and In caller also need to distinguish > between who wants to add one block or who have got one block added in > along with extra block. > > I think this way code is simple.. That everybody comes down will add one > block for self use. and all other functionality and logic is above, i.e. > wether to take lock or not, whether to add extra blocks or not.. > > I also think the code simplicity makes this worth it. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 22, 2016 at 1:12 PM, Petr Jelinek <petr@2ndquadrant.com> wrote: > I also think the code simplicity makes this worth it. Agreed. I went over this patch and did a cleanup pass today. I discovered that the LockWaiterCount() function was broken if you try to tried to use it on a lock that you didn't hold or a lock that you held in any mode other than exclusive, so I tried to fix that. I rewrote a lot of the comments and tightened some other things up. The result is attached. I'm baffled by the same code Amit asked about upthread, even though there's now a comment: + /* + * Here we are calling RecordAndGetPageWithFreeSpace + * instead of GetPageWithFreeSpace, because other backend + * who have got the lock might have added extra blocks in + * the FSM and its possible that FSM tree might not have + * been updated so far and we will not get the page by + * searching from top using GetPageWithFreeSpace, so we + * need to search the leaf page directly using our last + * valid target block. + * + * XXX. I don't understand what is happening here. -RMH + */ I've read this over several times and looked at RecordAndGetPageWithFreeSpace() and I'm still confused. First of all, if the lock was acquired by some other backend which did RelationAddExtraBlocks(), it *will* have updated the FSM - that's the whole point. Second, if the other backend extended the relation in some other manner and did not extend the FSM, how does calling RecordAndGetPageWithFreeSpace help? As far as I can see, GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just searching the FSM, so if one is stymied the other will be too. What am I missing? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 23/03/16 19:39, Robert Haas wrote: > On Tue, Mar 22, 2016 at 1:12 PM, Petr Jelinek <petr@2ndquadrant.com> wrote: >> I also think the code simplicity makes this worth it. > > Agreed. I went over this patch and did a cleanup pass today. I > discovered that the LockWaiterCount() function was broken if you try > to tried to use it on a lock that you didn't hold or a lock that you > held in any mode other than exclusive, so I tried to fix that. I > rewrote a lot of the comments and tightened some other things up. The > result is attached. > > I'm baffled by the same code Amit asked about upthread, even though > there's now a comment: > > + /* > + * Here we are calling > RecordAndGetPageWithFreeSpace > + * instead of GetPageWithFreeSpace, > because other backend > + * who have got the lock might have > added extra blocks in > + * the FSM and its possible that FSM > tree might not have > + * been updated so far and we will not > get the page by > + * searching from top using > GetPageWithFreeSpace, so we > + * need to search the leaf page > directly using our last > + * valid target block. > + * > + * XXX. I don't understand what is > happening here. -RMH > + */ > > I've read this over several times and looked at > RecordAndGetPageWithFreeSpace() and I'm still confused. First of all, > if the lock was acquired by some other backend which did > RelationAddExtraBlocks(), it *will* have updated the FSM - that's the > whole point. That's good point, maybe this coding is bit too defensive. > Second, if the other backend extended the relation in > some other manner and did not extend the FSM, how does calling > RecordAndGetPageWithFreeSpace help? As far as I can see, > GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just > searching the FSM, so if one is stymied the other will be too. What > am I missing? > The RecordAndGetPageWithFreeSpace will extend FSM as it calls fsm_set_and_search which in turn calls fsm_readbuf with extend = true. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Mar 23, 2016 at 2:52 PM, Petr Jelinek <petr@2ndquadrant.com> wrote: >> Second, if the other backend extended the relation in >> some other manner and did not extend the FSM, how does calling >> RecordAndGetPageWithFreeSpace help? As far as I can see, >> GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just >> searching the FSM, so if one is stymied the other will be too. What >> am I missing? >> > > The RecordAndGetPageWithFreeSpace will extend FSM as it calls > fsm_set_and_search which in turn calls fsm_readbuf with extend = true. So how does that help? If I'm reading this right, the new block will be all zeroes which means no space available on any of those pages. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 23/03/16 20:01, Robert Haas wrote: > On Wed, Mar 23, 2016 at 2:52 PM, Petr Jelinek <petr@2ndquadrant.com> wrote: >>> Second, if the other backend extended the relation in >>> some other manner and did not extend the FSM, how does calling >>> RecordAndGetPageWithFreeSpace help? As far as I can see, >>> GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just >>> searching the FSM, so if one is stymied the other will be too. What >>> am I missing? >>> >> >> The RecordAndGetPageWithFreeSpace will extend FSM as it calls >> fsm_set_and_search which in turn calls fsm_readbuf with extend = true. > > So how does that help? If I'm reading this right, the new block will > be all zeroes which means no space available on any of those pages. > I am bit confused as to what exactly you are saying, but what will happen is we get back to the while cycle and try again so eventually we should find either block with enough free space or add new one (not sure if this would actually ever happen in practice in heavily concurrent workload where the FSM would not be correctly extended during relation extension though, we might just loop here forever). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 23/03/16 20:19, Petr Jelinek wrote: > On 23/03/16 20:01, Robert Haas wrote: >> On Wed, Mar 23, 2016 at 2:52 PM, Petr Jelinek <petr@2ndquadrant.com> >> wrote: >>>> Second, if the other backend extended the relation in >>>> some other manner and did not extend the FSM, how does calling >>>> RecordAndGetPageWithFreeSpace help? As far as I can see, >>>> GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just >>>> searching the FSM, so if one is stymied the other will be too. What >>>> am I missing? >>>> >>> >>> The RecordAndGetPageWithFreeSpace will extend FSM as it calls >>> fsm_set_and_search which in turn calls fsm_readbuf with extend = true. >> >> So how does that help? If I'm reading this right, the new block will >> be all zeroes which means no space available on any of those pages. >> > > I am bit confused as to what exactly you are saying, but what will > happen is we get back to the while cycle and try again so eventually we > should find either block with enough free space or add new one (not sure > if this would actually ever happen in practice in heavily concurrent > workload where the FSM would not be correctly extended during relation > extension though, we might just loop here forever). > Btw thinking about it some more, ISTM that not finding the block and just doing the extension if the FSM wasn't extended correctly previously is probably cleaner behavior than what we do now. The reasoning for that opinion is that if the FSM wasn't extended, we'll fix it by doing relation extension since we know we do both in this code path and also if we could not find page before we'll most likely not find one even on retry and if there was page added at the end by extension that we might reuse partially here then there is no harm in adding new one anyway as the whole point of this patch is that it does bigger extension that strictly necessary so insisting on page reuse for something that seems like only theoretical possibility that does not even exist in current code does not seem right. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Mar 23, 2016 at 3:33 PM, Petr Jelinek <petr@2ndquadrant.com> wrote: > Btw thinking about it some more, ISTM that not finding the block and just > doing the extension if the FSM wasn't extended correctly previously is > probably cleaner behavior than what we do now. The reasoning for that > opinion is that if the FSM wasn't extended, we'll fix it by doing relation > extension since we know we do both in this code path and also if we could > not find page before we'll most likely not find one even on retry and if > there was page added at the end by extension that we might reuse partially > here then there is no harm in adding new one anyway as the whole point of > this patch is that it does bigger extension that strictly necessary so > insisting on page reuse for something that seems like only theoretical > possibility that does not even exist in current code does not seem right. I'm not sure I completely follow this. The fact that the last sentence is 9 lines long may be related. :-) I think it's pretty clearly important to re-check the FSM after acquiring the extension lock. Otherwise, imagine that 25 backends arrive at the exact same time. The first one gets the lock and extends the relation 500 pages; the next one, 480, and so on. In total, they extend the relation by 6500 pages, which is a bit rich. Rechecking the FSM after acquiring the lock prevents that from happening, and that's a very good thing. We'll do one 500-page extension and that's it. However, I don't think using RecordAndGetPageWithFreeSpace rather than GetPageWithFreeSpace is appropriate. We've already recorded free space on that page, and recording it again is a bad idea. It's quite possible that by the time we get the lock our old value is totally inaccurate. If there's some advantage to searching in the more targeted way that RecordAndGetPageWithFreeSpace does over GetPageWithFreeSpace then we need a new API into the freespace stuff that does the more targeted search without updating anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
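To make the pile-up arithmetic above concrete, a small standalone sketch (illustrative only; the 25 backends and 20-blocks-per-waiter figures are taken from the discussion):

#include <stdio.h>

int
main(void)
{
    int total = 0;

    /* Without the FSM re-check, each backend in turn still sees the queue
     * ahead of it and extends by 20 blocks per remaining waiter:
     * 500, 480, ..., 20. */
    for (int waiters = 25; waiters >= 1; waiters--)
        total += waiters * 20;

    printf("total blocks extended: %d\n", total);   /* prints 6500 */
    return 0;
}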
On 23/03/16 20:43, Robert Haas wrote: > On Wed, Mar 23, 2016 at 3:33 PM, Petr Jelinek <petr@2ndquadrant.com> wrote: >> Btw thinking about it some more, ISTM that not finding the block and just >> doing the extension if the FSM wasn't extended correctly previously is >> probably cleaner behavior than what we do now. The reasoning for that >> opinion is that if the FSM wasn't extended, we'll fix it by doing relation >> extension since we know we do both in this code path and also if we could >> not find page before we'll most likely not find one even on retry and if >> there was page added at the end by extension that we might reuse partially >> here then there is no harm in adding new one anyway as the whole point of >> this patch is that it does bigger extension that strictly necessary so >> insisting on page reuse for something that seems like only theoretical >> possibility that does not even exist in current code does not seem right. > > I'm not sure I completely follow this. The fact that the last > sentence is 9 lines long may be related. :-) > I tend to do that sometimes :) > I think it's pretty clearly important to re-check the FSM after > acquiring the extension lock. Otherwise, imagine that 25 backends > arrive at the exact same time. The first one gets the lock and > extends the relation 500 pages; the next one, 480, and so on. In > total, they extend the relation by 6500 pages, which is a bit rich. > Rechecking the FSM after acquiring the lock prevents that from > happening, and that's a very good thing. We'll do one 500-page > extension and that's it. > Right, but that would only happen if all the backends did it using different code which does not do the FSM extension because the current code does FSM extension and the point of using RecordAndGetPageWithFreeSpace seems to be "just in case" somebody is doing extension differently (at least I don't see other reason). So basically I am not saying we shouldn't do the search but that I agree GetPageWithFreeSpace should be enough as the worst that can happen is that we overextend the relation in case some theoretical code from somewhere else also did extension of relation without extending FSM (afaics). But maybe Dilip had some other reason for using the RecordAndGetPageWithFreeSpace that is not documented. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to > it, rather than from top, so even if RecordPageWithFreeSpace() doesn't > update till root, it will be able to search the newly added page. I agree > with whatever you have said in another mail that we should introduce a new > API to do a more targeted search for such cases. OK, let's do that, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
> it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
> update till root, it will be able to search the newly added page. I agree
> with whatever you have said in another mail that we should introduce a new
> API to do a more targeted search for such cases.
OK, let's do that, then.
Ok, I have added new API which just find the free block from and start search from last given page.
Attachment
On 24/03/16 07:04, Dilip Kumar wrote: > > On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas <robertmhaas@gmail.com > <mailto:robertmhaas@gmail.com>> wrote: > > On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila > <amit.kapila16@gmail.com <mailto:amit.kapila16@gmail.com>> wrote: > > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to > > it, rather than from top, so even if RecordPageWithFreeSpace() doesn't > > update till root, it will be able to search the newly added page. I agree > > with whatever you have said in another mail that we should introduce a new > > API to do a more targeted search for such cases. > > OK, let's do that, then. > > > Ok, I have added new API which just find the free block from and start > search from last given page. > > 1. I have named the new function as GetPageWithFreeSpaceUsingOldPage, I > don't like this name, but I could not come up with any better, Please > suggest one. > GetNearestPageWithFreeSpace? (although not sure that's accurate description, maybe Nearby would be better) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
>
> On 24/03/16 07:04, Dilip Kumar wrote:
>>
>>
>> On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas <robertmhaas@gmail.com
>> <mailto:robertmhaas@gmail.com>> wrote:
>>
>> On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila
>> <amit.kapila16@gmail.com <mailto:amit.kapila16@gmail.com>> wrote:
>> > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
>> > it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
>> > update till root, it will be able to search the newly added page. I agree
>> > with whatever you have said in another mail that we should introduce a new
>> > API to do a more targeted search for such cases.
>>
>> OK, let's do that, then.
>>
>>
>> Ok, I have added new API which just find the free block from and start
>> search from last given page.
>>
>> 1. I have named the new function as GetPageWithFreeSpaceUsingOldPage, I
>> don't like this name, but I could not come up with any better, Please
>> suggest one.
>>
>
>
> GetNearestPageWithFreeSpace? (although not sure that's accurate description, maybe Nearby would be better)
>
Better than what is used in patch.
1.
+GetPageWithFreeSpaceUsingOldPage(Relation rel, BlockNumber oldPage,
+                                 Size spaceNeeded)
{
..
+ /*
+  * If fsm_set_and_search found a suitable new block, return that.
+  * Otherwise, search as usual.
+  */
..
}

In the above comment, you are referring to the wrong function.

2.
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+    Buffer  buf;
+    int     newslot = -1;
+
+    buf = fsm_readbuf(rel, addr, true);
+    LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+    if (minValue != 0)
+    {
+        /* Search while we still hold the lock */
+        newslot = fsm_search_avail(buf, minValue,
+                                   addr.level == FSM_BOTTOM_LEVEL,
+                                   false);

In this new API, I don't understand why we need the minValue != 0 check; basically, if the user of the API doesn't want to search for space > 0, then what is the need of calling this API? I think this API should use an Assert for minValue != 0 unless you see a reason for not doing so.
>
> GetNearestPageWithFreeSpace? (although not sure that's accurate description, maybe Nearby would be better)
>
Better than what is used in patch. Yet another possibility could be to call it as GetPageWithFreeSpaceExtended and call it from GetPageWithFreeSpace with the value of oldPage as InvalidBlockNumber.
Yes I like this.. Changed the same.
Attachment
On Thu, Mar 24, 2016 at 7:17 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: >> Yet another possibility could be to call it as >> GetPageWithFreeSpaceExtended and call it from GetPageWithFreeSpace with >> value of oldPage as InvalidBlockNumber. > > Yes I like this.. Changed the same. After thinking about this some more, I don't think this is the right approach. I finally understand what's going on here: RecordPageWithFreeSpace updates the FSM lazily, only adjusting the leaves and not the upper levels. It relies on VACUUM to update the upper levels. This seems like it might be a bad policy in general, because VACUUM on a very large relation may be quite infrequent, and you could lose track of a lot of space for a long time, leading to a lot of extra bloat. However, it's a particularly bad policy for bulk relation extension, because you're stuffing a large number of totally free pages in there in a way that doesn't make them particularly easy for anybody else to discover. There are two ways we can fail here: 1. Callers who use GetPageWithFreeSpace() rather than GetPageFreeSpaceExtended() will fail to find the new pages if the upper map levels haven't been updated by VACUUM. 2. Even callers who use GetPageFreeSpaceExtended() may fail to find the new pages. This can happen in two separate ways, namely (a) the lastValidBlock saved by RelationGetBufferForTuple() can be in the middle of the relation someplace rather than near the end, or (b) the bulk-extension performed by some other backend can have overflowed onto some new FSM page that won't be searched even though a relatively plausible lastValidBlock was passed. It seems to me that since we're adding a whole bunch of empty pages at once, it's worth the effort to update the upper levels of the FSM. This isn't a case of discovering a single page with an extra few bytes of storage available due to a HOT prune or something - this is a case of putting at least 20 and plausibly hundreds of extra pages into the FSM. The extra effort to update the upper FSM pages is trivial by comparison with the cost of extending the relation by many blocks. So, I suggest adding a new function FreeSpaceMapBulkExtend(BlockNumber first_block, BlockNumber last_block) which sets all the FSM entries for pages between first_block and last_block to 255 and then bubbles that up to the higher levels of the tree and all the way to the root. Have the bulk extend code use that instead of repeatedly calling RecordPageWithFreeSpace. That should actually be much more efficient, because it can call fsm_readbuf(), LockBuffer(), and UnlockReleaseBuffer() just once per FSM page instead of once per FSM page *per byte modified*. Maybe that makes no difference in practice, but it can't hurt. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
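As a rough illustration of the two approaches weighed above, here is a hedged, caller-level sketch. RecordPageWithFreeSpace() is the existing per-page FSM API; FreeSpaceMapBulkExtend() is only the helper proposed above and does not exist at this point, and register_new_blocks() is a made-up name for illustration:

#include "postgres.h"
#include "storage/freespace.h"

static void
register_new_blocks(Relation relation, BlockNumber firstBlock,
                    BlockNumber lastBlock, Size freespace)
{
    BlockNumber blkno;

    /* Per-page updates: only the FSM leaf pages change, the upper levels
     * stay stale until VACUUM fixes them. */
    for (blkno = firstBlock; blkno <= lastBlock; blkno++)
        RecordPageWithFreeSpace(relation, blkno, freespace);

    /*
     * Proposed alternative: a single call that marks the whole range as
     * free and bubbles the change up to the FSM root, so plain
     * GetPageWithFreeSpace() callers can find the new pages too:
     *
     *     FreeSpaceMapBulkExtend(firstBlock, lastBlock);
     */
}

The trade-off being argued is exactly the one in the comments: per-page updates are cheap per call but leave the upper FSM levels stale, while a bulk update touches each FSM page once and makes the new space visible to every searcher.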
1. Callers who use GetPageWithFreeSpace() rather than
GetPageFreeSpaceExtended() will fail to find the new pages if the
upper map levels haven't been updated by VACUUM.
2. Even callers who use GetPageFreeSpaceExtended() may fail to find
the new pages. This can happen in two separate ways, namely (a) the
Yeah, that's the issue: if the extended pages spill to the next FSM page, then other waiters will not find those pages, and one by one all waiters will end up adding extra pages.

For example, if there are ~30 waiters then total blocks extended = (25(25+1)/2) * 20 =~ 6500 pages.

This is not the case every time, but whenever the heap blocks go to a new FSM page this will happen.

- An FSM page can hold info for 4096 heap blocks, so after every 8th extend (assuming 512 blocks are extended at a time), it will extend ~6500 pages
(there are still chances that some blocks can be completely left unused, until vacuum comes).
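The roll-over arithmetic can be sketched in a few lines of standalone code (illustrative only; ~4096 slots per FSM leaf page is the figure used above, the exact count depends on the FSM page layout):

#include <stdio.h>

#define SLOTS_PER_FSM_PAGE 4096
#define BLOCKS_PER_EXTEND   512

int
main(void)
{
    /* Report which bulk extensions start on a fresh FSM leaf page:
     * with 512-block extends that is every 8th one (the 9th, 17th, ...). */
    for (int extend = 1; extend <= 17; extend++)
    {
        int first_block = (extend - 1) * BLOCKS_PER_EXTEND;

        if (extend > 1 && first_block % SLOTS_PER_FSM_PAGE == 0)
            printf("extend #%d (starting at block %d) begins a new FSM leaf page\n",
                   extend, first_block);
    }
    return 0;
}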
I have changed the patch as per the suggestion (just a POC, because the performance numbers are not that great).

The performance of patch v14 is significantly low compared to v13, mainly I guess for the below reasons:

1. As per the above calculation, v13 extends ~6500 blocks (after every 8th extend), and that's why it's performing well.
Copy 10000 tuples, of 4 bytes each..
---------------------------------------------
Client base patch v13 patch v14
1 118 147 126
2 217 276 269
4 210 421 347
8 166 630 375
16 145 813 415
32 124 985 451
64 974 455
Client base patch v13 patch v14
1 117 124 119
2 111 126 119
4 51 128 124
8 43 149 131
16 40 217 120
32 263 115
64 248 109
Attachment
On Fri, Mar 25, 2016 at 1:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Fri, Mar 25, 2016 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> 1. Callers who use GetPageWithFreeSpace() rather than >> GetPageFreeSpaceExtended() will fail to find the new pages if the >> upper map levels haven't been updated by VACUUM. >> >> 2. Even callers who use GetPageFreeSpaceExtended() may fail to find >> the new pages. This can happen in two separate ways, namely (a) the > > Yeah, that's the issue, if extended pages spills to next FSM page, then > other waiters will not find those page, and one by one all waiters will end > up adding extra pages. > for example, if there are ~30 waiters then > total blocks extended = (25(25+1)/2) *20 =~ 6500 pages. > > This is not the case every time but whenever heap block go to new FSM page > this will happen. I think we need to start testing these patches not only in terms of how *fast* they are but how *large* the relation ends up being when we're done. A patch that inserts the rows slower but the final relation is smaller may be better overall. Can you retest v13, v14, and master, and post not only the timings but the relation size afterwards? And maybe post the exact script you are using? > performance of patch v14 is significantly low compared to v13, mainly I > guess below reasons > 1. As per above calculation v13 extend ~6500 block (after every 8th extend), > and that's why it's performing well. That should be completely unnecessary, though. I mean, if the problem is that it's expensive to repeatedly acquire and release the relation extension lock, then bulk-extending even 100 blocks at a time should be enough to fix that, because you've reduced the number of times that has to be done by 99%. There's no way we should need to extend by thousands of blocks to get good performance. Maybe something like this would help: if (needLock) { if (!use_fsm) LockRelationForExtension(relation, ExclusiveLock); else if (!ConditionLockRelationForExtension(relation,ExclusiveLock)) { BlockNumber last_blkno = RelationGetNumberOfBlocks(relation); targetBlock = GetPageWithFreeSpaceExtended(relation, last_blkno, len + saveFreeSpace); if (targetBlock != InvalidBlockNumber) goto loop; LockRelationForExtension(relation, ExclusiveLock); targetBlock = GetPageWithFreeSpace(relation, len+ saveFreeSpace); if (targetBlock != InvalidBlockNumber) { UnlockRelationForExtension(relation,ExclusiveLock); goto loop; } RelationAddExtraBlocks(relation,bistate); } } I think this is better than what you had before with lastValidBlock, because we're actually interested in searching the free space map at the *very end* of the relation, not wherever the last target block happens to have been. We could go further still and have GetPageWithFreeSpace() always search the last, say, two pages of the FSM in all cases. But that might be expensive. The extra call to RelationGetNumberOfBlocks seems cheap enough here because the alternative is to wait for a contended heavyweight lock. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I think we need to start testing these patches not only in terms of
how *fast* they are but how *large* the relation ends up being when
we're done. A patch that inserts the rows slower but the final
relation is smaller may be better overall. Can you retest v13, v14,
and master, and post not only the timings but the relation size
afterwards? And maybe post the exact script you are using?
Maybe something like this would help:
if (needLock)
{
if (!use_fsm)
LockRelationForExtension(relation, ExclusiveLock);
else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
{
BlockNumber last_blkno = RelationGetNumberOfBlocks(relation);
targetBlock = GetPageWithFreeSpaceExtended(relation,
last_blkno, len + saveFreeSpace);
if (targetBlock != InvalidBlockNumber)
goto loop;
LockRelationForExtension(relation, ExclusiveLock);
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
if (targetBlock != InvalidBlockNumber)
{
UnlockRelationForExtension(relation, ExclusiveLock);
goto loop;
}
RelationAddExtraBlocks(relation, bistate);
}
}
I think this is better than what you had before with lastValidBlock,
because we're actually interested in searching the free space map at
the *very end* of the relation, not wherever the last target block
happens to have been.
We could go further still and have GetPageWithFreeSpace() always
search the last, say, two pages of the FSM in all cases. But that
might be expensive. The extra call to RelationGetNumberOfBlocks seems
cheap enough here because the alternative is to wait for a contended
heavyweight lock.
Attachment
search the last, say, two pages of the FSM in all cases. But that
might be expensive. The extra call to RelationGetNumberOfBlocks seems
cheap enough here because the alternative is to wait for a contended
heavyweight lock.

I will try the test with this also and post the results.
I have changed v14 as per this suggestion and the results are the same as v14.
Attachment
We could go further still and have GetPageWithFreeSpace() always
search the last, say, two pages of the FSM in all cases. But that
might be expensive. The extra call to RelationGetNumberOfBlocks seems
cheap enough here because the alternative is to wait for a contended
heavyweight lock.

I will try the test with this also and post the results.
Attachment
On Sun, Mar 27, 2016 at 8:00 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > Relation Size > ----------------- > INSERT : 16000 transaction from 32 Client > > Base v13 v14_1 > --------- --------- -------- > TPS 37 255 112 > Rel Size 17GB 17GB 18GB > > COPY: 32000 transaction from 32 client > Base v13 v14_1 > --------- --------- --------- > TPS 121 823 427 > Rel Size 11GB 11GB 11GB > > > Script are attached in the mail > ----------------------------------------= > test_size_ins.sh --> run insert test and calculate relation size. > test_size_copy --> run copy test and relation size > copy_script -> copy pg_bench script used by test_size_copy.sh > insert_script --> insert pg_bench script used by test_size_ins.sh > multi_extend_v14_poc_v1.patch --> modified patch of v14. > > I also tried modifying v14 from different different angle. > > One is like below--> > ------------------------- > In AddExtraBlock > { > I add page to FSM one by one like v13 does. > then update the full FSM tree up till root > } Not following this. Did you attach this version? > Results: > ---------- > 1. With this performance is little less than v14 but the problem of extra > relation size is solved. > 2. With this we can conclude that extra size of relation in v14 is because > some while extending the pages, its not immediately available and at end > some of the pages are left unused. I agree with that conclusion. I'm not quite sure where that leaves us, though. We can go back to v13, but why isn't that producing extra pages? It seems like it should: whenever a bulk extend rolls over to a new FSM page, concurrent backends will search either the old or the new one but not both. Maybe we could do this - not sure if it's what you were suggesting above: 1. Add the pages one at a time, and do RecordPageWithFreeSpace after each one. 2. After inserting them all, go back and update the upper levels of the FSM tree up the root. Another idea is: If ConditionalLockRelationForExtension fails to get the lock immediately, search the last *two* pages of the FSM for a free page. Just brainstorming here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Not following this. Did you attach this version?

> One is like below-->
> -------------------------
> In AddExtraBlock
> {
> I add page to FSM one by one like v13 does.
> then update the full FSM tree up till root
> }
> Results:
> ----------
> 1. With this performance is little less than v14 but the problem of extra
> relation size is solved.
> 2. With this we can conclude that extra size of relation in v14 is because
> some while extending the pages, its not immediately available and at end
> some of the pages are left unused.
I agree with that conclusion. I'm not quite sure where that leaves
us, though. We can go back to v13, but why isn't that producing extra
pages? It seems like it should: whenever a bulk extend rolls over to
a new FSM page, concurrent backends will search either the old or the
new one but not both.
Maybe we could do this - not sure if it's what you were suggesting above:
1. Add the pages one at a time, and do RecordPageWithFreeSpace after each one.
2. After inserting them all, go back and update the upper levels of
the FSM tree up the root.
Another idea is:
If ConditionalLockRelationForExtension fails to get the lock
immediately, search the last *two* pages of the FSM for a free page.
Just brainstorming here.
>
> On Sun, Mar 27, 2016 at 8:00 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > Results:
> > ----------
> > 1. With this performance is little less than v14 but the problem of extra
> > relation size is solved.
> > 2. With this we can conclude that extra size of relation in v14 is because
> > some while extending the pages, its not immediately available and at end
> > some of the pages are left unused.
>
> I agree with that conclusion. I'm not quite sure where that leaves
> us, though. We can go back to v13, but why isn't that producing extra
> pages? It seems like it should: whenever a bulk extend rolls over to
> a new FSM page, concurrent backends will search either the old or the
> new one but not both.
>
I have not debugged the flow, but by looking at v13 code, it looks like it will search both old and new. In function GetPageWithFreeSpaceExtended() -> fsm_search_from_addr() -> fsm_search_avail(), the basic idea of search is: Start the search from the target slot. At every step, move one node to the right, then climb up to the parent. Stop when we reach a node with enough free space (as we must, since the root has enough space). So shouldn't it be able to find the new FSM page where the bulk extend rolls over?
This is actually a multi-level tree, so each FSM page contains one slot tree.
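As a toy, standalone model of the search behaviour quoted above (one slot tree per FSM page, inner nodes holding the max of their children); this is not the actual fsm_search_avail() code:

#include <stdio.h>

#define LEAVES     8
#define NODES      (2 * LEAVES)         /* 1-based array, nodes 1..15 used */
#define LEAF_FIRST LEAVES               /* leaves occupy slots 8..15 */

static int tree[NODES];

static void
rebuild_uppers(void)
{
    for (int i = LEAF_FIRST - 1; i >= 1; i--)
        tree[i] = tree[2 * i] > tree[2 * i + 1] ? tree[2 * i] : tree[2 * i + 1];
}

/* Return a leaf index (0-based page number) with value >= min, or -1. */
static int
search_avail(int start_page, int min)
{
    int node = LEAF_FIRST + start_page;

    while (tree[node] < min)
    {
        /* Rightmost node of its level: nothing further to the right. */
        if ((node & (node + 1)) == 0)
            return -1;
        node = (node + 1) / 2;          /* step one node right, then climb to the parent */
    }
    while (node < LEAF_FIRST)           /* descend to a satisfying leaf */
        node = tree[2 * node] >= min ? 2 * node : 2 * node + 1;
    return node - LEAF_FIRST;
}

int
main(void)
{
    /* Pages 6 and 7 were just bulk-extended and marked nearly empty. */
    int freespace[LEAVES] = {0, 0, 0, 0, 0, 0, 255, 255};

    for (int i = 0; i < LEAVES; i++)
        tree[LEAF_FIRST + i] = freespace[i];
    rebuild_uppers();

    printf("search from page 2 finds page %d\n", search_avail(2, 1));
    return 0;
}

Running it prints that a search starting at page 2 lands on page 6, the first of the newly added pages, which is the behaviour being described here.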
I agree with that conclusion. I'm not quite sure where that leaves
us, though. We can go back to v13, but why isn't that producing extra
pages? It seems like it should: whenever a bulk extend rolls over to
a new FSM page, concurrent backends will search either the old or the
new one but not both.
Maybe we could do this - not sure if it's what you were suggesting above:
1. Add the pages one at a time, and do RecordPageWithFreeSpace after each one.
2. After inserting them all, go back and update the upper levels of
the FSM tree up the root.

I think this is the better option, since we will search the last two pages of the FSM tree, so there is no need to update the upper level of the FSM tree. Right? I will test and post the result with this option.
Attachment
1. Relation Size : No change in size, its same as base and v13

2. INSERT 1028 Byte 1000 tuple performance
-----------------------------------------------------------
Client   base   v13   v15
1        117    124   122
2        111    126   123
4        51     128   125
8        43     149   135
16       40     217   147
32       35     263   141

3. COPY 10000 Tuple performance.
----------------------------------------------
Client   base   v13   v15
1        118    147   155
2        217    276   273
4        210    421   457
8        166    630   643
16       145    813   595
32       124    985   598

Conclusion:
---------------
1. I think v15 is solving the problem exist with v13 and performance is significantly high compared to base, and relation size is also stable, So IMHO V15 is winner over other solution, what other thinks ?

2. And no point in thinking that V13 is better than V15 because, V13 has bug of sometime extending more than expected pages and that is uncontrolled and same can be the reason also of v13 performing better.
Found one problem with V15, so sending the new version.
Attachment
On 28/03/16 14:46, Dilip Kumar wrote: > > Conclusion: > --------------- > 1. I think v15 is solving the problem exist with v13 and performance > is significantly high compared to base, and relation size is also > stable, So IMHO V15 is winner over other solution, what other thinks ? > It seems so, do you have ability to reasonably test with 64 clients? I am mostly wondering if we see the performance going further down or just plateau. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Mar 27, 2016 at 9:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > I think this is better option, Since we will search last two pages of FSM > tree, then no need to update the upper level of the FSM tree. Right ? Well, it's less important in that case, but I think it's still worth doing. Some people are going to do just plain GetPageWithFreeSpace(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Mar 28, 2016 at 8:46 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > Found one problem with V15, so sending the new version. > In V15 I am taking prev_blkno as targetBlock instead it should be the last > block of the relation at that time. Attaching new patch. BlockNumber targetBlock, - otherBlock; + otherBlock, + prev_blkno = RelationGetNumberOfBlocks(relation); Absolutely not. There is no way it's acceptable to introduce an unconditional call to RelationGetNumberOfBlocks() into every call to RelationGetBufferForTuple(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 28/03/16 14:46, Dilip Kumar wrote:
Conclusion:
---------------
1. I think v15 is solving the problem exist with v13 and performance
is significantly high compared to base, and relation size is also
stable, So IMHO V15 is winner over other solution, what other thinks ?
It seems so, do you have ability to reasonably test with 64 clients? I am mostly wondering if we see the performance going further down or just plateau.
Well, it's less important in that case, but I think it's still worth
doing. Some people are going to do just plain GetPageWithFreeSpace().
Attachment
Attachment
On Tue, Mar 29, 2016 at 1:29 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > Attaching new version v18 > > - Some cleanup work on v17. > - Improved UpdateFreeSpaceMap function. > - Performance and space utilization are same as V17 Looks better. Here's a v19 that I hacked on a bit. Unfortunately, one compiler I tried this with had a pretty legitimate complaint: hio.c: In function ‘RelationGetBufferForTuple’: hio.c:231:20: error: ‘freespace’ may be used uninitialized in this function [-Werror=uninitialized] hio.c:185:7: note: ‘freespace’ was declared here hio.c:231:20: error: ‘blockNum’ may be used uninitialized in this function [-Werror=uninitialized] hio.c:181:14: note: ‘blockNum’ was declared here There's nothing whatsoever to prevent RelationExtensionLockWaiterCount from returning 0. It's also rather ugly that the call to UpdateFreeSpaceMap() assumes that the last value returned by PageGetHeapFreeSpace() is as good as any other, but maybe we can just install a comment explaining that point; there's not an obviously better approach that I can see. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
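One way the zero-waiters concern above could be handled, sketched under assumptions (this is not necessarily the committed fix; the loop body is elided, and Min/Max are the usual backend macros):

    BlockNumber blockNum = InvalidBlockNumber;
    Size        freespace = 0;
    int         lockWaiters = RelationExtensionLockWaiterCount(relation);
    /* Even with zero waiters, extend by at least one block so the caller
     * always gets initialized blockNum/freespace values back. */
    int         extraBlocks = Max(Min(lockWaiters * 20, 512), 1);

    do
    {
        /* ... extend by one block, remember blockNum and freespace ... */
    } while (--extraBlocks > 0);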
hio.c: In function ‘RelationGetBufferForTuple’:
hio.c:231:20: error: ‘freespace’ may be used uninitialized in this
function [-Werror=uninitialized]
hio.c:185:7: note: ‘freespace’ was declared here
hio.c:231:20: error: ‘blockNum’ may be used uninitialized in this
function [-Werror=uninitialized]
hio.c:181:14: note: ‘blockNum’ was declared here
There's nothing whatsoever to prevent RelationExtensionLockWaiterCount
from returning 0.
It's also rather ugly that the call to UpdateFreeSpaceMap() assumes
that the last value returned by PageGetHeapFreeSpace() is as good as
any other, but maybe we can just install a comment explaining that
point; there's not an obviously better approach that I can see.
Attachment
+	if (lockWaiters)
+		/*
+		 * Here we are using same freespace for all the Blocks, but that
+		 * is Ok, because all are newly added blocks and have same freespace
+		 * And even some block which we just added to FreespaceMap above, is
+		 * used by some backend and now freespace is not same, will not harm
+		 * anything, because actual freespace will be calculated by user
+		 * after getting the page.
+		 */
+		UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);

Does this look good ?
Done in better way..
Attachment
On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yes, that makes sense. One more point is that if the reason for v13 giving
> better performance is extra blocks (which we believe in certain cases can
> leak till the time Vacuum updates the FSM tree), do you think it makes
> sense to once test by increasing lockWaiters * 20 limit to may be
> lockWaiters * 25 or lockWaiters * 30.
I tested COPY 10000 record, by increasing the number of blocks just to find out why we are not as good as V13, with extraBlocks = Min(lockWaiters * 40, 2048), and got below results..

COPY 10000
--------------------
Client   Patch (extraBlocks = Min(lockWaiters * 40, 2048))
------   ---------
16       752
32       708

This proves that the main reason for v13 being better is that it adds extra blocks without control. Though v13 is better than these results, I think we can get there also by changing the multiplier and max limit.

But I think we are ok with the max size as 4MB (512 blocks), right?
On Thu, Mar 31, 2016 at 12:59 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila <amit.kapila16@gmail.com> > wrote: >> Yes, that makes sense. One more point is that if the reason for v13 >> giving better performance is extra blocks (which we believe in certain cases >> can leak till the time Vacuum updates the FSM tree), do you think it makes >> sense to once test by increasing lockWaiters * 20 limit to may be >> lockWaiters * 25 or lockWaiters * 30. > > > I tested COPY 10000 record, by increasing the number of blocks just to find > out why we are not as good as V13 > with extraBlocks = Min( lockWaiters * 40, 2048) and got below results.. > > COPY 10000 > -------------------- > Client Patch(extraBlocks = Min( lockWaiters * 40, 2048)) > -------- --------- > 16 752 > 32 708 > > This proves that main reason of v13 being better is its adding extra blocks > without control. > though v13 is better than these results, I think we can get that also by > changing multiplier and max limit . > > But I think we are ok with the max size as 4MB (512 blocks) right? Yeah, kind of. But obviously if we could make the limit smaller without hurting performance, that would be better. Per my note yesterday about performance degradation with parallel COPY, I wasn't able to demonstrate that this patch gives a consistent performance benefit on hydra - the best result I got was speeding up a 9.5 minute load to 8 minutes where linear scalability would have been 2 minutes. And I found cases where it was actually slower with the patch. Now maybe hydra is just a crap machine, but I'm feeling nervous. What machines did you use to test this? Have you tested really large data loads, like many MB or even GB of data? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 31, 2016 at 9:03 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > If you need some more information please let me know ? I repeated the testing described in http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com on a MacBook Pro (OS X 10.8.5, 2.4 GHz Intel Core i7, 8GB, early 2013) and got the following results. Note that I did not adjust *_flush_delay in this test because that's always 0, apparently, on MacOS. master, unlogged tables, 1 copy: 0m18.928s, 0m20.276s, 0m18.040s patched, unlogged tables, 1 copy: 0m20.499s, 0m20.879s, 0m18.912s master, unlogged tables, 4 parallel copies: 0m57.301s, 0m58.045s, 0m57.556s patched, unlogged tables, 4 parallel copies: 0m47.994s, 0m45.586s, 0m44.440s master, logged tables, 1 copy: 0m29.353s, 0m29.693s, 0m31.840s patched, logged tables, 1 copy: 0m30.837s, 0m31.567s, 0m36.843s master, logged tables, 4 parallel copies: 1m45.691s, 1m53.085s, 1m35.674s patched, logged tables, 4 parallel copies: 1m21.137s, 1m20.678s, 1m22.419s So the first thing here is that the patch seems to be a clear win in this test. For a single copy, it seems to be pretty much a wash. When running 4 copies in parallel, it is about 20-25% faster with both logged and unlogged tables. The second thing that is interesting is that we are getting super-linear scalability even without the patch: if 1 copy takes 20 seconds, you might expect 4 to take 80 seconds, but it really takes 60 unpatched or 45 patched. If 1 copy takes 30 seconds, you might expect 4 to take 120 seconds, but in really takes 105 unpatched or 80 patched. So we're not actually I/O constrained on this test, I think, perhaps because this machine has an SSD. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 5, 2016 at 10:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: > So the first thing here is that the patch seems to be a clear win in > this test. For a single copy, it seems to be pretty much a wash. > When running 4 copies in parallel, it is about 20-25% faster with both > logged and unlogged tables. The second thing that is interesting is > that we are getting super-linear scalability even without the patch: > if 1 copy takes 20 seconds, you might expect 4 to take 80 seconds, but > it really takes 60 unpatched or 45 patched. If 1 copy takes 30 > seconds, you might expect 4 to take 120 seconds, but in really takes > 105 unpatched or 80 patched. So we're not actually I/O constrained on > this test, I think, perhaps because this machine has an SSD. It's not unusual for COPY to not be I/O constrained, I believe. -- Peter Geoghegan
On Tue, Apr 5, 2016 at 1:05 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Apr 5, 2016 at 10:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> So the first thing here is that the patch seems to be a clear win in >> this test. For a single copy, it seems to be pretty much a wash. >> When running 4 copies in parallel, it is about 20-25% faster with both >> logged and unlogged tables. The second thing that is interesting is >> that we are getting super-linear scalability even without the patch: >> if 1 copy takes 20 seconds, you might expect 4 to take 80 seconds, but >> it really takes 60 unpatched or 45 patched. If 1 copy takes 30 >> seconds, you might expect 4 to take 120 seconds, but in really takes >> 105 unpatched or 80 patched. So we're not actually I/O constrained on >> this test, I think, perhaps because this machine has an SSD. > > It's not unusual for COPY to not be I/O constrained, I believe. Yeah. I've committed the patch now, with some cosmetic cleanup. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Yeah. I've committed the patch now, with some cosmetic cleanup.
Thanks Robert !!!