Thread: Move unused buffers to freelist
<div class="WordSection1"><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">As discussedand concluded in mail thread (<a href="http://www.postgresql.org/message-id/006f01ce34f0$d6fa8220$84ef8660$@kapila@huawei.com">http://www.postgresql.org/message-id/006f01ce34f0$d6fa8220$84ef8660$@kapila@huawei.com</a>), formoving unused buffer’s to freelist end, </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Ihaving implemented the idea and taken some performance data.</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""> </span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Arial","sans-serif""> </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Inthe attached patch, bgwriter/checkpointer moves unused (usage_count=0 && refcount = 0) buffer’s to end of freelist. I have implemented a new API StrategyMoveBufferToFreeListEnd()to </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif";color:#1F497D">m</span><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">ovebuffer’s to end of freelist.</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Arial","sans-serif""> </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">PerformanceData :</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">ConfigurationDetails</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">O/S– Suse-11</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">RAM– 24GB</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Numberof Cores – 8</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">ServerConf – checkpoint_segments = 256; checkpoint_timeout = 25min, synchronous_commit = 0FF, shared_buffers = 5GB</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Pgbench– Select-only</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Scalefactor– 1200</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Time– Each run is of 20 mins</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Belowdata is for average 3 runs of 20 minutes</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Arial","sans-serif""> </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""> 8C-8T 16C-16T 32C-32T 64C-64T</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">HEAD 11997 8455 4989 2757</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">AfterPatch 19807 13296 8388 2821</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Detailedeach run data is attached with mail.</span><p class="MsoNormal"><spanstyle="color:#1F497D"> </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Thisis just the 
initial data, I will collect more data based ondifferent configuration of shared buffers and other configurations.</span><span style="color:#1F497D"></span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Arial","sans-serif""> </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Feedback/Suggesions?</span><pclass="MsoNormal"><span style="color:#1F497D"> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">With Regards,</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Amit Kapila.</span></div>
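(For readers following the thread: the patch itself is in the mail attachment and is not reproduced here. As a rough illustration of the idea only, a tail-insertion function modeled on the existing head-insertion logic in src/backend/storage/buffer/freelist.c could look like the sketch below; this is an assumption about the approach, not the patch's actual code.)

    /*
     * Sketch only: append a buffer to the tail of the freelist, modeled on
     * the head-insertion logic of StrategyFreeBuffer().  The real patch may
     * differ in detail.
     */
    void
    StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
    {
        LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

        /* Only add the buffer if it is not already on the freelist. */
        if (buf->freeNext == FREENEXT_NOT_IN_LIST)
        {
            buf->freeNext = FREENEXT_END_OF_LIST;
            if (StrategyControl->firstFreeBuffer < 0)
                StrategyControl->firstFreeBuffer = buf->buf_id;
            else
                BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext =
                    buf->buf_id;
            StrategyControl->lastFreeBuffer = buf->buf_id;
        }

        LWLockRelease(BufFreelistLock);
    }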
On 5/14/13 9:42 AM, Amit Kapila wrote:
> In the attached patch, bgwriter/checkpointer moves unused (usage_count
> = 0 && refcount = 0) buffers to end of freelist. I have implemented a
> new API StrategyMoveBufferToFreeListEnd() to

There's a comment in the new function:

It is possible that we are told to put something in the freelist that is already in it; don't screw up the list if so.

I don't see where the code does anything to handle that though. What was your intention here?

This area has always been the tricky part of the change. If you do something complicated when adding new entries, like scanning the freelist for duplicates, you run the risk of holding BufFreelistLock for too long. To try and see that in benchmarks, I would use a small database scale (I typically use 100 for this type of test) and a large number of clients. "-M prepared" would help get a higher transaction rate out of the hardware too. It might take a server with a large core count to notice any issues with holding the lock for too long though.

Instead you might just invalidate buffers before they go onto the list. Doing that will then throw away usefully cached data though. To try and optimize both insertion speed and retaining cached data, I was thinking about using a hash table for the free buffers, instead of the simple linked list approach used in the code now.

Also: check the formatting on the additions in bufmgr.c; I noticed a spaces vs. tabs difference on lines 35/36 of your patch.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
On Wednesday, May 15, 2013 12:44 AM Greg Smith wrote:
> On 5/14/13 9:42 AM, Amit Kapila wrote:
> > In the attached patch, bgwriter/checkpointer moves unused (usage_count
> > = 0 && refcount = 0) buffers to end of freelist. I have implemented a
> > new API StrategyMoveBufferToFreeListEnd() to
>
> There's a comment in the new function:
>
> It is possible that we are told to put something in the freelist that
> is already in it; don't screw up the list if so.
>
> I don't see where the code does anything to handle that though. What
> was your intention here?

The intention is to put the entry in the freelist only if it is not already in it, which is accomplished by the check if (buf->freeNext == FREENEXT_NOT_IN_LIST). Whenever an entry is removed from the freelist, buf->freeNext is marked as FREENEXT_NOT_IN_LIST.

Code Reference (last line):
StrategyGetBuffer()
{
..
..
    while (StrategyControl->firstFreeBuffer >= 0)
    {
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

        /* Unconditionally remove buffer from freelist */
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;
...
}

Also the same check exists in StrategyFreeBuffer().

> This area has always been the tricky part of the change. If you do
> something complicated when adding new entries, like scanning the
> freelist for duplicates, you run the risk of holding BufFreelistLock
> for too long.

Yes, this is true, and I have tried to hold this lock for a minimal time. In this patch, it holds BufFreelistLock only to put the unused buffer at the end of the freelist.

> To try and see that in benchmarks, I would use a small
> database scale (I typically use 100 for this type of test) and a large
> number of clients. "-M prepared" would help get a higher transaction
> rate out of the hardware too. It might take a server with a large core
> count to notice any issues with holding the lock for too long though.

This is a good idea; I shall take another set of readings with "-M prepared".

> Instead you might just invalidate buffers before they go onto the list.
> Doing that will then throw away usefully cached data though.

Yes, if we invalidate buffers, it might throw away usefully cached data, especially when the working set is just a tiny bit smaller than shared_buffers. This is pointed out by Robert in his mail
http://www.postgresql.org/message-id/CA+TgmoYhWsz__KtSxm6BuBirE7VR6Qqc_COkbEZTQpk8oom3CA@mail.gmail.com

> To try and optimize both insertion speed and retaining cached data,

I think the method proposed by the patch takes care of both, because it directly puts the free buffer at the end of the freelist, and because it doesn't invalidate the buffers it can retain cached data for a longer period. Do you see any flaw with the current approach?

> I was thinking about using a hash table for the free buffers, instead of
> the simple linked list approach used in the code now.

Okay, we can try different methods for maintaining free buffers if we find the current approach doesn't turn out to be good.

> Also: check the formatting on the additions in bufmgr.c, I noticed a
> spaces vs. tabs difference on lines 35/36 of your patch.

Thanks for pointing it out; I shall send an updated patch along with the next set of performance data.

With Regards,
Amit Kapila.
<div class="WordSection1"><p class="MsoPlainText">On Wednesday, May 15, 2013 8:38 AM Amit Kapila wrote:<p class="MsoPlainText">>On Wednesday, May 15, 2013 12:44 AM Greg Smith wrote:<p class="MsoPlainText">> > On 5/14/139:42 AM, Amit Kapila wrote:<p class="MsoPlainText">> > > In the attached patch, bgwriter/checkpointer movesunused<p class="MsoPlainText">> > (usage_count<p class="MsoPlainText">> > > =0 && refcount =0) buffer's to end of freelist. I have implemented<p class="MsoPlainText">> a<p class="MsoPlainText">> > > newAPI StrategyMoveBufferToFreeListEnd() to<p class="MsoPlainText">> ><p class="MsoPlainText">> > There's a commentin the new function:<p class="MsoPlainText">> ><p class="MsoPlainText">> > It is possible that we aretold to put something in the freelist that<p class="MsoPlainText">> > is already in it; don't screw up the listif so.<p class="MsoPlainText">> ><p class="MsoPlainText">> > I don't see where the code does anything tohandle that though. What<p class="MsoPlainText">> > was your intention here?<p class="MsoPlainText">> <p class="MsoPlainText">>The intention is that put the entry in freelist only if it is not in<p class="MsoPlainText">>freelist which is accomplished by check<p class="MsoPlainText">> If (buf->freeNext == FREENEXT_NOT_IN_LIST).Every entry when removed<p class="MsoPlainText">> from<p class="MsoPlainText">> freelist, buf->freeNextis marked as FREENEXT_NOT_IN_LIST.<p class="MsoPlainText">> Code Reference (last line):<p class="MsoPlainText">>StrategyGetBuffer()<p class="MsoPlainText">> {<p class="MsoPlainText">> ..<p class="MsoPlainText">>..<p class="MsoPlainText">> while (StrategyControl->firstFreeBuffer >= 0)<p class="MsoPlainText">> {<p class="MsoPlainText">> buf = &BufferDescriptors[StrategyControl-<pclass="MsoPlainText">> >firstFreeBuffer];<p class="MsoPlainText">> Assert(buf->freeNext!= FREENEXT_NOT_IN_LIST);<p class="MsoPlainText">> <p class="MsoPlainText">> /* Unconditionally remove buffer from freelist */<p class="MsoPlainText">> StrategyControl->firstFreeBuffer= buf->freeNext;<p class="MsoPlainText">> buf->freeNext= FREENEXT_NOT_IN_LIST;<p class="MsoPlainText">> <p class="MsoPlainText">> ...<p class="MsoPlainText">>}<p class="MsoPlainText">> <p class="MsoPlainText">> Also the same check exists in StrategyFreeBuffer().<pclass="MsoPlainText">> <p class="MsoPlainText">> > This area has always been the tricky partof the change. 
If you do<p class="MsoPlainText">> > something complicated when adding new entries, like scanningthe<p class="MsoPlainText">> > freelist for duplicates, you run the risk of holding BufFreelistLock<p class="MsoPlainText">>> for<p class="MsoPlainText">> > too long.<p class="MsoPlainText">> <p class="MsoPlainText">>Yes, this is true and I had tried to hold this lock for minimal time.<p class="MsoPlainText">>In this patch, it holds BufFreelistLock only to put the unused buffer<p class="MsoPlainText">>at end<p class="MsoPlainText">> of freelist.<p class="MsoPlainText">> <p class="MsoPlainText">>> To try and see that in benchmarks, I would use a small<p class="MsoPlainText">> > databasescale (I typically use 100 for this type of test) and a<p class="MsoPlainText">> large<p class="MsoPlainText">>> number of clients.<p class="MsoPlainText"><span style="color:black"> </span><p class="MsoPlainText"><spanstyle="color:black">I shall try this test, do you have any suggestions for shred buffers and numberof clients for 100 scale factor?</span><p class="MsoPlainText"> <p class="MsoPlainText">> >"-M prepared" wouldhelp get a higher transaction<p class="MsoPlainText">> > rate out of the hardware too. It might take a serverwith a large<p class="MsoPlainText">> core<p class="MsoPlainText">> > count to notice any issues with holdingthe lock for too long though.<p class="MsoPlainText">> <p class="MsoPlainText">> This is good idea, I shalltake another set of readings with "-M<p class="MsoPlainText">> prepared"<p class="MsoPlainText">> <p class="MsoPlainText">>> Instead you might just invalidate buffers before they go onto the<p class="MsoPlainText">>list.<p class="MsoPlainText">> > Doing that will then throw away usefully cached data though.<pclass="MsoPlainText">> <p class="MsoPlainText">> Yes, if we invalidate buffers, it might throw away usefullycached data<p class="MsoPlainText">> especially when working set just a tiny bit smaller than<p class="MsoPlainText">>shared_buffers.<p class="MsoPlainText">> This is pointed by Robert in his mail<p class="MsoPlainText">>http://www.postgresql.org/message-<p class="MsoPlainText">> id/CA+TgmoYhWsz__KtSxm6BuBirE7VR6Qqc_COkbE<pclass="MsoPlainText">> ZTQpk8oom3CA@mail.gmail.com<p class="MsoPlainText">><p class="MsoPlainText">> <p class="MsoPlainText">> > To try and optimize both insertionspeed and retaining cached data,<p class="MsoPlainText">> <p class="MsoPlainText">> I think by the methodproposed by patch it takes care of both, because<p class="MsoPlainText">> it<p class="MsoPlainText">> directlyputs free buffer at end of freelist and<p class="MsoPlainText">> because it doesn't invalidate the buffers itcan retain cached data for<p class="MsoPlainText">> longer period.<p class="MsoPlainText">> Do you see any flaw withcurrent approach?<p class="MsoPlainText">> <p class="MsoPlainText">> > I<p class="MsoPlainText">> > wasthinking about using a hash table for the free buffers, instead<p class="MsoPlainText">> of<p class="MsoPlainText">>> the simple linked list approach used in the code now.<p class="MsoPlainText">> <p class="MsoPlainText">>Okay, we can try different methods for maintaining free buffers if we<p class="MsoPlainText">>find<p class="MsoPlainText">> current approach doesn't turn out to be good.<p class="MsoPlainText">><p class="MsoPlainText">> > Also: check the formatting on the additions to in bufmgr.c, I<pclass="MsoPlainText">> noticed<p class="MsoPlainText">> > a<p class="MsoPlainText">> > spaces vs. 
tabsdifference on lines 35/36 of your patch.<p class="MsoPlainText">> <p class="MsoPlainText">> Thanks for pointingit, I shall send an updated patch along with next<p class="MsoPlainText">> set of<p class="MsoPlainText">>performance data.<p class="MsoPlainText"><span style="color:black"> </span><p class="MsoPlainText"><spanstyle="color:black"> </span><p class="MsoPlainText"><span style="color:black">Further PerformanceData:</span><p class="MsoPlainText"><span style="color:black"> </span><p class="MsoPlainText"><span style="color:black">Belowdata is for average 3 runs of 20 minutes</span><p class="MsoPlainText"><span style="color:black">ScaleFactor - 1200</span><p class="MsoPlainText"><span style="color:black">Shared Buffers - 7G</span><pclass="MsoPlainText"><span style="color:black"> </span><p class="MsoPlainText"><span style="color:black"> </span><pclass="MsoPlainText"><span style="color:black"> 8C-8T 16C-16T 32C-32T 64C-64T</span><p class="MsoPlainText"><span style="color:black">HEAD 1739 1461 1578 1609</span><p class="MsoPlainText"><spanstyle="color:black">After Patch 4029 1924 1743 1706</span><p class="MsoPlainText"><span style="color:black"> </span><p class="MsoPlainText"><spanstyle="color:black"> </span><p class="MsoPlainText"><span style="color:black">Scale Factor -1200</span><p class="MsoPlainText"><span style="color:black">Shared Buffers – 10G</span><p class="MsoPlainText"><span style="color:black"> </span><pclass="MsoPlainText"><span style="color:black"> 8C-8T 16C-16T 32C-32T 64C-64T</span><p class="MsoPlainText"><span style="color:black">HEAD 2004 2270 2195 2173</span><p class="MsoPlainText"><spanstyle="color:black">After Patch 2298 2172 2111 2044</span><p class="MsoPlainText"><span style="color:black"> </span><p class="MsoPlainText"><spanstyle="color:black"> </span><p class="MsoPlainText"><span style="color:black">Detailed data of3 runs is attached with mail.</span><p class="MsoPlainText"><span style="color:black"> </span><p class="MsoPlainText"><spanstyle="color:black">Observations :</span><p class="MsoPlainText"><span style="color:black"> </span><pclass="MsoPlainText" style="margin-left:.5in;text-indent:-.25in;mso-list:l1 level1 lfo1"><spanstyle="color:black"><span style="mso-list:Ignore">1.<span style="font:7.0pt "Times New Roman""> </span></span></span><spanstyle="color:black">For scale factor 1200, With 5G and 7G Shared buffers, </span><p class="MsoPlainText"style="margin-left:66.0pt;text-indent:-.25in;mso-list:l0 level1 lfo2"><span style="color:black"><spanstyle="mso-list:Ignore">a.<span style="font:7.0pt "Times New Roman""> </span></span></span><spanstyle="color:black">there is reasonably good performance after patch (>50%).</span><p class="MsoPlainText"style="margin-left:66.0pt;text-indent:-.25in;mso-list:l0 level1 lfo2"><span style="color:black"><spanstyle="mso-list:Ignore">b.<span style="font:7.0pt "Times New Roman""> </span></span></span><spanstyle="color:black">However the performance increase is not so good when number of clients-threadsincrease. 
</span><p class="MsoPlainText" style="margin-left:66.0pt"><span style="color:black">The reason forit can be that at higher number of clients/threads, there are other blocking factors(other LWLocks, I/O) that limit thebenefit of moving buffers to freelist</span><p class="MsoPlainText" style="margin-left:.5in;text-indent:-.25in;mso-list:l1level1 lfo1"><span style="color:black"><span style="mso-list:Ignore">2.<spanstyle="font:7.0pt "Times New Roman""> </span></span></span><span style="color:black">Forscale factor 1200, With 10G Shared buffers, </span><p class="MsoPlainText" style="margin-left:66.0pt;text-indent:-.25in;mso-list:l3level1 lfo3"><span style="color:black"><span style="mso-list:Ignore">a.<spanstyle="font:7.0pt "Times New Roman""> </span></span></span><span style="color:black">Performanceincrease is observed for 8 clients/8 threads reading</span><p class="MsoPlainText" style="margin-left:66.0pt;text-indent:-.25in;mso-list:l3level1 lfo3"><span style="color:black"><span style="mso-list:Ignore">b.<spanstyle="font:7.0pt "Times New Roman""> </span></span></span><span style="color:black">Thereis performance dip (3~6%) from 16C onwards. The reasons could be</span><p class="MsoPlainText" style="margin-left:84.0pt;text-indent:-.25in;mso-list:l2level1 lfo4"><span style="color:black"><span style="mso-list:Ignore">a.<spanstyle="font:7.0pt "Times New Roman""> </span></span></span><span style="color:black">thatwith such a long buffer list, actually taking BufFreeListLock by BGwriter frequently (bgwrite_delay= 200ms) can add to Concurrency overhead which is overcoming the need for getting </span><p class="MsoPlainText"style="margin-left:84.0pt"><span style="color:black">buffer from freelist. </span><p class="MsoPlainText"style="margin-left:84.0pt;text-indent:-.25in;mso-list:l2 level1 lfo4"><span style="color:black"><spanstyle="mso-list:Ignore">b.<span style="font:7.0pt "Times New Roman""> </span></span></span><spanstyle="color:black">The other reason is sometimes it comes to free the buffer which is alreadyin freelist. It can also add to small overhead as currently to check weather buffer is in freelist, we need to takeBufFreeListLock</span><p class="MsoPlainText"><span style="color:black"> </span><p class="MsoPlainText"><span style="color:black">Iwill try to find more reasons for 2b and work to resolve performance dip of 2b.</span><p class="MsoPlainText"><spanstyle="color:black"> </span><p class="MsoPlainText"><span style="color:black">Any suggestions willbe really helpful to proceed and crack this problem.</span><p class="MsoPlainText"><span style="color:black"> </span><pclass="MsoPlainText"><span style="color:black">With Regards,</span><p class="MsoPlainText"><spanstyle="color:black">Amit Kapila.</span><p class="MsoPlainText"><span style="color:black"> </span><pclass="MsoPlainText"><span style="color:black"> </span><p class="MsoPlainText"><span style="color:black"> </span><pclass="MsoPlainText"><span style="color:black"> </span></div>
On Thu, May 16, 2013 at 10:18 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
> Further Performance Data:
>
> Below data is the average of 3 runs of 20 minutes
>
> Scale Factor - 1200
> Shared Buffers - 7G

These results are good but I don't get similar results in my own testing. I ran pgbench tests at a variety of client counts and scale factors, using 30-minute test runs and the following non-default configuration parameters.

shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p] '

Here are the results. The first field in each line is the number of clients. The second number is the scale factor. The numbers after "master" and "patched" are the median of three runs.

01 100 master 1433.297699 patched 1420.306088
01 300 master 1371.286876 patched 1368.910732
01 1000 master 1056.891901 patched 1067.341658
01 3000 master 637.312651 patched 685.205011
08 100 master 10575.017704 patched 11456.043638
08 300 master 9262.601107 patched 9120.925071
08 1000 master 1721.807658 patched 1800.733257
08 3000 master 819.694049 patched 854.333830
32 100 master 26981.677368 patched 27024.507600
32 300 master 14554.870871 patched 14778.285400
32 1000 master 1941.733251 patched 1990.248137
32 3000 master 846.654654 patched 892.554222

And here's the same results for 5-minute, read-only tests:

01 100 master 9361.073952 patched 9049.553997
01 300 master 8640.235680 patched 8646.590739
01 1000 master 8339.364026 patched 8342.799468
01 3000 master 7968.428287 patched 7882.121547
08 100 master 71311.491773 patched 71812.899492
08 300 master 69238.839225 patched 70063.632081
08 1000 master 34794.778567 patched 65998.468775
08 3000 master 60834.509571 patched 61165.998080
32 100 master 203168.264456 patched 205258.283852
32 300 master 199137.276025 patched 200391.633074
32 1000 master 177996.853496 patched 176365.732087
32 3000 master 149891.147442 patched 148683.269107

Something appears to have screwed up my results for 8 clients @ scale factor 300 on master, but overall, on both the read-only and read-write tests, I'm not seeing anything that resembles the big gains you reported.

Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the PostgreSQL community courtesy of IBM. Fedora 16, Linux kernel 3.2.6.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Monday, May 20, 2013 6:54 PM Robert Haas wrote:
> On Thu, May 16, 2013 at 10:18 AM, Amit Kapila <amit.kapila@huawei.com>
> wrote:
> > Further Performance Data:
> >
> > Below data is the average of 3 runs of 20 minutes
> >
> > Scale Factor - 1200
> > Shared Buffers - 7G
>
> These results are good but I don't get similar results in my own
> testing.

Thanks for running detailed tests.

> I ran pgbench tests at a variety of client counts and scale
> factors, using 30-minute test runs and the following non-default
> configuration parameters.
>
> shared_buffers = 8GB
> maintenance_work_mem = 1GB
> synchronous_commit = off
> checkpoint_segments = 300
> checkpoint_timeout = 15min
> checkpoint_completion_target = 0.9
> log_line_prefix = '%t [%p] '
>
> Here are the results. The first field in each line is the number of
> clients. The second number is the scale factor. The numbers after
> "master" and "patched" are the median of three runs.
>
> 01 100 master 1433.297699 patched 1420.306088
> 01 300 master 1371.286876 patched 1368.910732
> 01 1000 master 1056.891901 patched 1067.341658
> 01 3000 master 637.312651 patched 685.205011
> 08 100 master 10575.017704 patched 11456.043638
> 08 300 master 9262.601107 patched 9120.925071
> 08 1000 master 1721.807658 patched 1800.733257
> 08 3000 master 819.694049 patched 854.333830
> 32 100 master 26981.677368 patched 27024.507600
> 32 300 master 14554.870871 patched 14778.285400
> 32 1000 master 1941.733251 patched 1990.248137
> 32 3000 master 846.654654 patched 892.554222

Is the above test for tpc-b?
In the above tests, there is a performance increase of 1~8% and a decrease of 0.2~1.5%.

> And here's the same results for 5-minute, read-only tests:
>
> 01 100 master 9361.073952 patched 9049.553997
> 01 300 master 8640.235680 patched 8646.590739
> 01 1000 master 8339.364026 patched 8342.799468
> 01 3000 master 7968.428287 patched 7882.121547
> 08 100 master 71311.491773 patched 71812.899492
> 08 300 master 69238.839225 patched 70063.632081
> 08 1000 master 34794.778567 patched 65998.468775
> 08 3000 master 60834.509571 patched 61165.998080
> 32 100 master 203168.264456 patched 205258.283852
> 32 300 master 199137.276025 patched 200391.633074
> 32 1000 master 177996.853496 patched 176365.732087
> 32 3000 master 149891.147442 patched 148683.269107
>
> Something appears to have screwed up my results for 8 clients @ scale
> factor 300 on master,

Do you want to say the reading of 1000 scale factor?

> but overall, on both the read-only and
> read-write tests, I'm not seeing anything that resembles the big gains
> you reported.

I have not generated numbers for read-write tests, I will check that once. For read-only tests, the performance increase is minor and different from what I saw.
A few points which I could think of for the difference in data:

1. In my tests I always observed the best data when the number of clients/threads is equal to the number of cores, which in your case should be at 16.
2. I think for scale factor 100 and 300, there should not be much performance increase, as for them they should mostly get a buffer from the freelist irrespective of whether bgwriter adds to the freelist or not.
3. In my tests the variance is in shared buffers; the database size is always less than RAM (scale factor 1200, approx db size 16~17GB, RAM 24GB), but due to variance in shared buffers, it can lead to I/O.
4. Each run is of 20 minutes; not sure if this makes any difference.

> Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
> PostgreSQL community courtesy of IBM. Fedora 16, Linux kernel 3.2.6.

To think about the difference in your and my runs, could you please tell me about the below points:
1. What is the RAM in the machine?
2. Is the number of threads equal to the number of clients?
3. Before starting tests I have always done pre-warming of buffers (used pg_prewarm written by you last year); is it the same for the above read-only tests?
4. Can you please once again run only the test where you saw variation (8 clients @ scale factor 1000 on master), because I have also seen that the performance difference is very good for certain configurations (scale factor, RAM, shared buffers).

Apart from the above, I had one more observation during my investigation to find why in some cases there is a small dip:
1. Many times, it finds the buffer in the free list is not usable, meaning its refcount or usage count is not zero, due to which it had to spend more time under BufFreelistLock.
I have not done any further experiments related to this finding, e.g., whether it really adds any overhead.

Currently I am trying to find reasons for the small dip in performance and see if I can do something to avoid it. Also I will run tests with various configurations.

Any other suggestions?

With Regards,
Amit Kapila.
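(For reference, the retry loop described in the observation above is the one in StrategyGetBuffer(); the fragment below is abridged from src/backend/storage/buffer/freelist.c of this era, with comments added, and shows why a stale freelist entry costs extra time under BufFreelistLock.)

    /* Abridged from StrategyGetBuffer(). */
    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

    while (StrategyControl->firstFreeBuffer >= 0)
    {
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

        /* Unconditionally remove buffer from freelist */
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;

        /*
         * If the buffer is pinned or has a nonzero usage_count, we cannot
         * use it; discard it and retry.  Each rejected entry is examined,
         * header-locked, and unlocked while BufFreelistLock is still held,
         * which is the overhead described above when stale entries
         * accumulate on the list.
         */
        LockBufHdr(buf);
        if (buf->refcount == 0 && buf->usage_count == 0)
            return buf;         /* caller releases the buffer header lock */
        UnlockBufHdr(buf);
    }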
On Tuesday, May 21, 2013 12:36 PM Amit Kapila wrote:
> On Monday, May 20, 2013 6:54 PM Robert Haas wrote:
> > On Thu, May 16, 2013 at 10:18 AM, Amit Kapila <amit.kapila@huawei.com>
> > wrote:
> > > Further Performance Data:
> > >
> > > Below data is the average of 3 runs of 20 minutes
> > >
> > > Scale Factor - 1200
> > > Shared Buffers - 7G
> >
> > These results are good but I don't get similar results in my own
> > testing.
>
> Thanks for running detailed tests.
>
> > I ran pgbench tests at a variety of client counts and scale
> > factors, using 30-minute test runs and the following non-default
> > configuration parameters.
> >
> > shared_buffers = 8GB
> > maintenance_work_mem = 1GB
> > synchronous_commit = off
> > checkpoint_segments = 300
> > checkpoint_timeout = 15min
> > checkpoint_completion_target = 0.9
> > log_line_prefix = '%t [%p] '
> >
> > Here are the results. The first field in each line is the number of
> > clients. The second number is the scale factor. The numbers after
> > "master" and "patched" are the median of three runs.
> >
> > 01 100 master 1433.297699 patched 1420.306088
> > 01 300 master 1371.286876 patched 1368.910732
> > 01 1000 master 1056.891901 patched 1067.341658
> > 01 3000 master 637.312651 patched 685.205011
> > 08 100 master 10575.017704 patched 11456.043638
> > 08 300 master 9262.601107 patched 9120.925071
> > 08 1000 master 1721.807658 patched 1800.733257
> > 08 3000 master 819.694049 patched 854.333830
> > 32 100 master 26981.677368 patched 27024.507600
> > 32 300 master 14554.870871 patched 14778.285400
> > 32 1000 master 1941.733251 patched 1990.248137
> > 32 3000 master 846.654654 patched 892.554222
>
> Is the above test for tpc-b?
> In the above tests, there is a performance increase of 1~8% and a
> decrease of 0.2~1.5%.
>
> > And here's the same results for 5-minute, read-only tests:
> >
> > 01 100 master 9361.073952 patched 9049.553997
> > 01 300 master 8640.235680 patched 8646.590739
> > 01 1000 master 8339.364026 patched 8342.799468
> > 01 3000 master 7968.428287 patched 7882.121547
> > 08 100 master 71311.491773 patched 71812.899492
> > 08 300 master 69238.839225 patched 70063.632081
> > 08 1000 master 34794.778567 patched 65998.468775
> > 08 3000 master 60834.509571 patched 61165.998080
> > 32 100 master 203168.264456 patched 205258.283852
> > 32 300 master 199137.276025 patched 200391.633074
> > 32 1000 master 177996.853496 patched 176365.732087
> > 32 3000 master 149891.147442 patched 148683.269107
> >
> > Something appears to have screwed up my results for 8 clients @ scale
> > factor 300 on master,
>
> Do you want to say the reading of 1000 scale factor?
>
> > but overall, on both the read-only and
> > read-write tests, I'm not seeing anything that resembles the big gains
> > you reported.
>
> I have not generated numbers for read-write tests, I will check that
> once. For read-only tests, the performance increase is minor and
> different from what I saw.
> A few points which I could think of for the difference in data:
>
> 1. In my tests I always observed the best data when the number of
> clients/threads is equal to the number of cores, which in your case
> should be at 16.
> 2. I think for scale factor 100 and 300, there should not be much
> performance increase, as for them they should mostly get a buffer from
> the freelist irrespective of whether bgwriter adds to the freelist or
> not.
> 3. In my tests the variance is in shared buffers; the database size is
> always less than RAM (scale factor 1200, approx db size 16~17GB, RAM
> 24GB), but due to variance in shared buffers, it can lead to I/O.
> 4. Each run is of 20 minutes; not sure if this makes any difference.
>
> > Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
> > PostgreSQL community courtesy of IBM. Fedora 16, Linux kernel 3.2.6.
>
> To think about the difference in your and my runs, could you please
> tell me about the below points:
> 1. What is the RAM in the machine?
> 2. Is the number of threads equal to the number of clients?
> 3. Before starting tests I have always done pre-warming of buffers
> (used pg_prewarm written by you last year); is it the same for the
> above read-only tests?
> 4. Can you please once again run only the test where you saw variation
> (8 clients @ scale factor 1000 on master), because I have also seen
> that the performance difference is very good for certain configurations
> (scale factor, RAM, shared buffers).

On looking more closely at the data posted by you, I believe there is some problem with the data for 8 clients @ scale factor 1000 on master, as in all other cases the data for scale factor 1000 is better than for 3000, except this one. So I think there is no need to run it again.

> Apart from the above, I had one more observation during my investigation
> to find why in some cases there is a small dip:
> 1. Many times, it finds the buffer in the free list is not usable,
> meaning its refcount or usage count is not zero, due to which it had to
> spend more time under BufFreelistLock.
> I have not done any further experiments related to this finding, e.g.,
> whether it really adds any overhead.
>
> Currently I am trying to find reasons for the small dip in performance
> and see if I can do something to avoid it. Also I will run tests with
> various configurations.
>
> Any other suggestions?
On Tue, May 21, 2013 at 3:06 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> Here are the results. The first field in each line is the number of
>> clients. The second number is the scale factor. The numbers after
>> "master" and "patched" are the median of three runs.
>>
>> 01 100 master 1433.297699 patched 1420.306088
>> 01 300 master 1371.286876 patched 1368.910732
>> 01 1000 master 1056.891901 patched 1067.341658
>> 01 3000 master 637.312651 patched 685.205011
>> 08 100 master 10575.017704 patched 11456.043638
>> 08 300 master 9262.601107 patched 9120.925071
>> 08 1000 master 1721.807658 patched 1800.733257
>> 08 3000 master 819.694049 patched 854.333830
>> 32 100 master 26981.677368 patched 27024.507600
>> 32 300 master 14554.870871 patched 14778.285400
>> 32 1000 master 1941.733251 patched 1990.248137
>> 32 3000 master 846.654654 patched 892.554222
>
> Is the above test for tpc-b?
> In the above tests, there is a performance increase of 1~8% and a
> decrease of 0.2~1.5%.

It's just the default pgbench workload.

>> And here's the same results for 5-minute, read-only tests:
>>
>> 01 100 master 9361.073952 patched 9049.553997
>> 01 300 master 8640.235680 patched 8646.590739
>> 01 1000 master 8339.364026 patched 8342.799468
>> 01 3000 master 7968.428287 patched 7882.121547
>> 08 100 master 71311.491773 patched 71812.899492
>> 08 300 master 69238.839225 patched 70063.632081
>> 08 1000 master 34794.778567 patched 65998.468775
>> 08 3000 master 60834.509571 patched 61165.998080
>> 32 100 master 203168.264456 patched 205258.283852
>> 32 300 master 199137.276025 patched 200391.633074
>> 32 1000 master 177996.853496 patched 176365.732087
>> 32 3000 master 149891.147442 patched 148683.269107
>>
>> Something appears to have screwed up my results for 8 clients @ scale
>> factor 300 on master,
>
> Do you want to say the reading of 1000 scale factor?

Yes.

>> but overall, on both the read-only and
>> read-write tests, I'm not seeing anything that resembles the big gains
>> you reported.
>
> I have not generated numbers for read-write tests, I will check that
> once. For read-only tests, the performance increase is minor and
> different from what I saw.
> A few points which I could think of for the difference in data:
>
> 1. In my tests I always observed the best data when the number of
> clients/threads is equal to the number of cores, which in your case
> should be at 16.

Sure, but you also showed substantial performance increases across a variety of connection counts, whereas I'm seeing basically no change at any connection count.

> 2. I think for scale factor 100 and 300, there should not be much
> performance increase, as for them they should mostly get a buffer from
> the freelist irrespective of whether bgwriter adds to the freelist or
> not.

I agree.

> 3. In my tests the variance is in shared buffers; the database size is
> always less than RAM (scale factor 1200, approx db size 16~17GB, RAM
> 24GB), but due to variance in shared buffers, it can lead to I/O.

Not sure I understand this.

> 4. Each run is of 20 minutes; not sure if this makes any difference.

I've found that 5-minute tests are normally adequate to identify performance changes on the pgbench SELECT-only workload.

>> Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
>> PostgreSQL community courtesy of IBM. Fedora 16, Linux kernel 3.2.6.
>
> To think about the difference in your and my runs, could you please
> tell me about the below points:
> 1. What is the RAM in the machine?

64GB

> 2. Is the number of threads equal to the number of clients?

Yes.

> 3. Before starting tests I have always done pre-warming of buffers
> (used pg_prewarm written by you last year); is it the same for the
> above read-only tests?

No, I did not use pg_prewarm. But I don't think that should matter very much. First, the data was all in the OS cache. Second, on the small scale factors, everything should end up in cache pretty quickly anyway. And on the large scale factors, well, you're going to be churning shared_buffers anyway, so pg_prewarm is only going to affect the very beginning of the test.

> 4. Can you please once again run only the test where you saw variation
> (8 clients @ scale factor 1000 on master), because I have also seen
> that the performance difference is very good for certain configurations
> (scale factor, RAM, shared buffers).

I can do this if I get a chance, but I don't really see where that's going to get us. It seems pretty clear to me that there's no benefit on these tests from this patch. So either one of us is doing the benchmarking incorrectly, or there's some difference in our test environments that is significant, but none of the proposals you've made so far seem to me to explain the difference.

> Apart from the above, I had one more observation during my investigation
> to find why in some cases there is a small dip:
> 1. Many times, it finds the buffer in the free list is not usable,
> meaning its refcount or usage count is not zero, due to which it had to
> spend more time under BufFreelistLock.
> I have not done any further experiments related to this finding, e.g.,
> whether it really adds any overhead.
>
> Currently I am trying to find reasons for the small dip in performance
> and see if I can do something to avoid it. Also I will run tests with
> various configurations.
>
> Any other suggestions?

Well, I think that the code in SyncOneBuffer is not really optimal. In some cases you actually lock and unlock the buffer header an extra time, which seems like a whole lotta extra overhead. In fact, I don't think you should be modifying SyncOneBuffer() at all, because that affects not only the background writer but also checkpoints. Presumably it is not right to put every unused buffer on the free list when we checkpoint.

Instead, I suggest modifying BgBufferSync, specifically this part right here:

        else if (buffer_state & BUF_REUSABLE)
            reusable_buffers++;

What I would suggest is that if the BUF_REUSABLE flag is set here, use that as the trigger to do StrategyMoveBufferToFreeListEnd(). That's much simpler than the logic that you have now, and I think it's also more efficient and more correct.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
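(To make the suggestion above concrete: a minimal sketch of how that part of BgBufferSync's cleaning loop could look with the proposed call added. The surrounding loop structure is from bufmgr.c of this era; the StrategyMoveBufferToFreeListEnd() call and the saved buf_id variable are the assumed additions, not tested code.)

    /* Inside BgBufferSync's cleaning loop (sketch). */
    int     buf_id = next_to_clean;        /* buffer examined this iteration */
    int     buffer_state = SyncOneBuffer(buf_id, true);

    if (++next_to_clean >= NBuffers)
    {
        next_to_clean = 0;
        next_passes++;
    }

    if (buffer_state & BUF_WRITTEN)
    {
        reusable_buffers++;
        if (++num_written >= bgwriter_lru_maxpages)
            break;                          /* hit the per-round write limit */
    }
    else if (buffer_state & BUF_REUSABLE)
    {
        /*
         * SyncOneBuffer() just saw refcount == 0 and usage_count == 0 for
         * this buffer, so hand it to the freelist here rather than doing
         * the extra work inside SyncOneBuffer() itself.
         */
        StrategyMoveBufferToFreeListEnd(&BufferDescriptors[buf_id]);
        reusable_buffers++;
    }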
On 5/14/13 2:13 PM, Greg Smith wrote:
> It is possible that we are told to put something in the freelist that
> is already in it; don't screw up the list if so.
>
> I don't see where the code does anything to handle that though. What
> was your intention here?

IIRC, the code that pulls from the freelist already deals with the possibility that a block was on the freelist but has since been put to use. If that's the case then there shouldn't be much penalty to adding a block multiple times (at least within reason...)

-- 
Jim C. Nasby, Data Architect   jim@nasby.net
512.569.9461 (cell)            http://jim.nasby.net
On Thursday, May 23, 2013 8:45 PM Robert Haas wrote:
> On Tue, May 21, 2013 at 3:06 AM, Amit Kapila <amit.kapila@huawei.com>
> wrote:
> >> Here are the results. The first field in each line is the number of
> >> clients. The second number is the scale factor. The numbers after
> >> "master" and "patched" are the median of three runs.
>
> >> but overall, on both the read-only and
> >> read-write tests, I'm not seeing anything that resembles the big gains
> >> you reported.
> >
> > I have not generated numbers for read-write tests, I will check that
> > once. For read-only tests, the performance increase is minor and
> > different from what I saw.
> > A few points which I could think of for the difference in data:
> >
> > 1. In my tests I always observed the best data when the number of
> > clients/threads is equal to the number of cores, which in your case
> > should be at 16.
>
> Sure, but you also showed substantial performance increases across a
> variety of connection counts, whereas I'm seeing basically no change
> at any connection count.
>
> > 2. I think for scale factor 100 and 300, there should not be much
> > performance increase, as for them they should mostly get a buffer from
> > the freelist irrespective of whether bgwriter adds to the freelist or
> > not.
>
> I agree.
>
> > 3. In my tests the variance is in shared buffers; the database size is
> > always less than RAM (scale factor 1200, approx db size 16~17GB, RAM
> > 24GB), but due to variance in shared buffers, it can lead to I/O.
>
> Not sure I understand this.

What I wanted to say is that all your tests were on the same shared buffer configuration (8GB), whereas in my tests I was trying to vary shared buffers as well. However, this is not an important point, as the patch should show a performance gain on the configuration you ran if there is any real benefit in it.

> > 4. Each run is of 20 minutes; not sure if this makes any difference.
>
> I've found that 5-minute tests are normally adequate to identify
> performance changes on the pgbench SELECT-only workload.
>
> >> Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
> >> PostgreSQL community courtesy of IBM. Fedora 16, Linux kernel 3.2.6.
> >
> > To think about the difference in your and my runs, could you please
> > tell me about the below points:
> > 1. What is the RAM in the machine?
>
> 64GB
>
> > 2. Is the number of threads equal to the number of clients?
>
> Yes.
>
> > 3. Before starting tests I have always done pre-warming of buffers
> > (used pg_prewarm written by you last year); is it the same for the
> > above read-only tests?
>
> No, I did not use pg_prewarm. But I don't think that should matter
> very much. First, the data was all in the OS cache. Second, on the
> small scale factors, everything should end up in cache pretty quickly
> anyway. And on the large scale factors, well, you're going to be
> churning shared_buffers anyway, so pg_prewarm is only going to affect
> the very beginning of the test.
>
> > 4. Can you please once again run only the test where you saw variation
> > (8 clients @ scale factor 1000 on master), because I have also seen
> > that the performance difference is very good for certain
> > configurations (scale factor, RAM, shared buffers).
>
> I can do this if I get a chance, but I don't really see where that's
> going to get us. It seems pretty clear to me that there's no benefit
> on these tests from this patch. So either one of us is doing the
> benchmarking incorrectly, or there's some difference in our test
> environments that is significant, but none of the proposals you've
> made so far seem to me to explain the difference.

Sorry for requesting you to run again without any concrete point. I realized after reading the data you posted more carefully that the odd reading was just some machine problem or something else, and that actually there is no gain. After your post, I tried various configurations on different machines, but till now I am not able to see the performance gain that was shown in my initial mail. In fact I tried on the same machine as well; it sometimes gives good data. I will update you if I get any concrete reason and results.

> > Apart from the above, I had one more observation during my
> > investigation to find why in some cases there is a small dip:
> > 1. Many times, it finds the buffer in the free list is not usable,
> > meaning its refcount or usage count is not zero, due to which it had
> > to spend more time under BufFreelistLock.
> > I have not done any further experiments related to this finding, e.g.,
> > whether it really adds any overhead.
> >
> > Currently I am trying to find reasons for the small dip in performance
> > and see if I can do something to avoid it. Also I will run tests with
> > various configurations.
> >
> > Any other suggestions?
>
> Well, I think that the code in SyncOneBuffer is not really optimal.
> In some cases you actually lock and unlock the buffer header an extra
> time, which seems like a whole lotta extra overhead. In fact, I don't
> think you should be modifying SyncOneBuffer() at all, because that
> affects not only the background writer but also checkpoints.
> Presumably it is not right to put every unused buffer on the free list
> when we checkpoint.
>
> Instead, I suggest modifying BgBufferSync, specifically this part right
> here:
>
>         else if (buffer_state & BUF_REUSABLE)
>             reusable_buffers++;
>
> What I would suggest is that if the BUF_REUSABLE flag is set here, use
> that as the trigger to do StrategyMoveBufferToFreeListEnd().

I think at this point also we need to lock the buffer header to check refcount and usage_count before moving the buffer to the freelist, or do you think it is not required?

> That's much simpler than the logic that you have now, and I think it's
> also more efficient and more correct.

Sure, I will try the logic suggested by you.

With Regards,
Amit Kapila.
On Friday, May 24, 2013 2:47 AM Jim Nasby wrote:
> On 5/14/13 2:13 PM, Greg Smith wrote:
> > It is possible that we are told to put something in the freelist that
> > is already in it; don't screw up the list if so.
> >
> > I don't see where the code does anything to handle that though. What
> > was your intention here?
>
> IIRC, the code that pulls from the freelist already deals with the
> possibility that a block was on the freelist but has since been put to
> use.

You are right, the check exists in StrategyGetBuffer().

> If that's the case then there shouldn't be much penalty to adding
> a block multiple times (at least within reason...)

There is a check in StrategyFreeBuffer() which will not allow putting a buffer on the freelist multiple times; I have just used the same check in the new function.

With Regards,
Amit Kapila.
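(The check being referred to is this one, from StrategyFreeBuffer() in src/backend/storage/buffer/freelist.c, shown here slightly abridged for reference:)

    void
    StrategyFreeBuffer(volatile BufferDesc *buf)
    {
        LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

        /*
         * It is possible that we are told to put something in the freelist
         * that is already in it; don't screw up the list if so.
         */
        if (buf->freeNext == FREENEXT_NOT_IN_LIST)
        {
            /* Push the buffer onto the head of the freelist. */
            buf->freeNext = StrategyControl->firstFreeBuffer;
            if (buf->freeNext < 0)
                StrategyControl->lastFreeBuffer = buf->buf_id;
            StrategyControl->firstFreeBuffer = buf->buf_id;
        }

        LWLockRelease(BufFreelistLock);
    }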
On 5/14/13 8:42 AM, Amit Kapila wrote:
> In the attached patch, bgwriter/checkpointer moves unused (usage_count
> = 0 && refcount = 0) buffers to end of freelist. I have implemented a
> new API StrategyMoveBufferToFreeListEnd() to
>
> move buffers to end of freelist.

Instead of a separate function, would it be better to add an argument to StrategyFreeBuffer? ISTM this is similar to the other strategy stuff in the buffer manager, so perhaps it should mirror that...

-- 
Jim C. Nasby, Data Architect   jim@nasby.net
512.569.9461 (cell)            http://jim.nasby.net
On Friday, May 24, 2013 8:22 PM Jim Nasby wrote:
> On 5/14/13 8:42 AM, Amit Kapila wrote:
>> In the attached patch, bgwriter/checkpointer moves unused (usage_count
>> = 0 && refcount = 0) buffers to end of freelist. I have implemented a
>> new API StrategyMoveBufferToFreeListEnd() to
>>
>> move buffers to end of freelist.
>
> Instead of a separate function, would it be better to add an argument to
> StrategyFreeBuffer?

Yes, it could be done with a parameter which decides whether to put the buffer at the head or the tail of the freelist. However, currently the main focus is to check in which cases this optimization can give a benefit. Robert has run tests for quite a number of cases where it doesn't show any significant gain. I am also trying various configurations to see if it gives any benefit. Robert has given some suggestions to change the way the new function is currently called; I will try that and update with the results of the same.

I am not very sure that default pgbench is a good test scenario for this optimization. If you have any suggestions for tests where it can show a benefit, that would be a great input.

> ISTM this is similar to the other strategy stuff in the buffer manager,
> so perhaps it should mirror that...

With Regards,
Amit Kapila.
>> Instead, I suggest modifying BgBufferSync, specifically this part right
>> here:
>>
>>         else if (buffer_state & BUF_REUSABLE)
>>             reusable_buffers++;
>>
>> What I would suggest is that if the BUF_REUSABLE flag is set here, use
>> that as the trigger to do StrategyMoveBufferToFreeListEnd().
>
> I think at this point also we need to lock the buffer header to check
> refcount and usage_count before moving the buffer to the freelist, or do
> you think it is not required?

If BUF_REUSABLE is set, that means we just did exactly what you're saying. Why do it twice?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
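(The point here is that SyncOneBuffer() only reports BUF_REUSABLE after performing that exact refcount/usage_count test under the buffer header lock; the fragment below is abridged from src/backend/storage/buffer/bufmgr.c of this era, with comments added.)

    static int
    SyncOneBuffer(int buf_id, bool skip_recently_used)
    {
        volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
        int         result = 0;

        /*
         * This is the same refcount/usage_count test the caller would
         * otherwise have to repeat.
         */
        LockBufHdr(bufHdr);

        if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
            result |= BUF_REUSABLE;
        else if (skip_recently_used)
        {
            /* Caller told us not to write recently-used buffers */
            UnlockBufHdr(bufHdr);
            return result;
        }

        if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY))
        {
            /* It's clean, so nothing to do */
            UnlockBufHdr(bufHdr);
            return result;
        }

        /* ... pin the buffer, write it out, and set BUF_WRITTEN ... */
    }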
On Tuesday, May 28, 2013 6:54 PM Robert Haas wrote:
> >> Instead, I suggest modifying BgBufferSync, specifically this part right
> >> here:
> >>
> >>         else if (buffer_state & BUF_REUSABLE)
> >>             reusable_buffers++;
> >>
> >> What I would suggest is that if the BUF_REUSABLE flag is set here, use
> >> that as the trigger to do StrategyMoveBufferToFreeListEnd().
> >
> > I think at this point also we need to lock the buffer header to check
> > refcount and usage_count before moving the buffer to the freelist, or
> > do you think it is not required?
>
> If BUF_REUSABLE is set, that means we just did exactly what you're
> saying. Why do it twice?

Even though we just did it, we have released the buffer header lock, so theoretically there is a chance that a backend can increase the count in between; however, that is still protected by the check in StrategyGetBuffer(). As there is only a very rare chance of it, doing this without the buffer header lock might not cause any harm. A modified patch to address the same is attached with this mail.

Performance Data
----------------
As far as I have noticed, performance data for this patch depends on 3 factors:
1. Pre-loading of data into buffers, so that buffers holding pages have some usage count before running pgbench. The reason is that it might make a difference to the performance of the clock sweep.
2. Clearing of pages in the OS cache before running pgbench with a different patch. It can make a difference because when we run pgbench with or without the patch, it can access pages already cached due to previous runs, which causes variation in performance.
3. Scale factor and shared buffer configuration.

To avoid the above 3 factors in test readings, I used the below steps:
1. Initialize the database with a scale factor such that database size + shared_buffers = RAM (shared_buffers = 1/4 of RAM).
   For example:
   Example 1: if RAM = 128GB, then initialize db with scale factor = 6700 and shared_buffers = 32GB. Database size (98GB) + shared_buffers (32GB) = 130, which is approximately equal to total RAM.
   Example 2 (this is based on your test m/c): if RAM = 64GB, then initialize db with scale factor = 3400 and shared_buffers = 16GB.
2. Reboot m/c.
3. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm. I loaded 3 times, so that the usage count of buffers will be approximately 3. Used file load_all_buffers.sql attached with this mail.
4. Run pgbench select-only 3 times for 10 or 15 minutes without the patch.
5. Reboot m/c.
6. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm, 3 times as above.
7. Run pgbench select-only 3 times for 10 or 15 minutes with the patch.

Using the above steps, I took performance data on 2 different m/c's.

Configuration Details
O/S - Suse-11
RAM - 128GB
Number of Cores - 16
Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min, synchronous_commit = OFF, shared_buffers = 32GB, AutoVacuum = off
Pgbench - Select-only
Scale factor - 1200
Time - Each run is of 15 mins

Below data is the average of 3 runs:

              16C-16T  32C-32T  64C-64T
HEAD          4391     3971     3464
After Patch   6147     5093     3944

Detailed data of each run is attached with this mail in file move_unused_buffers_to_freelist_v2.htm.

Below data is for 1 run of half an hour on the same configuration:

              16C-16T  32C-32T  64C-64T
HEAD          4377     3861     3295
After Patch   6542     4770     3504

Configuration Details
O/S - Suse-11
RAM - 24GB
Number of Cores - 8
Server Conf - checkpoint_segments = 256; checkpoint_timeout = 25 min, synchronous_commit = OFF, shared_buffers = 5GB
Pgbench - Select-only
Scale factor - 1200
Time - Each run is of 10 mins

Below data is the average of 3 runs of 10 minutes:

              8C-8T   16C-16T  32C-32T  64C-64T  128C-128T  256C-256T
HEAD          58837   56740    19390    5681     3191       2160
After Patch   59482   56936    25070    7655     4166       2704

Detailed data of each run is attached with this mail in file move_unused_buffers_to_freelist_v2.htm.

Below data is for 1 run of half an hour on the same configuration:

              32C-32T
HEAD          17703
After Patch   20586

I have run these tests multiple times to ensure correctness. I think the reason it didn't show a performance improvement in your runs last time is that the way we both run pgbench is different. This time, I have detailed the steps I used to collect the performance data.

With Regards,
Amit Kapila.
On Thu, Jun 6, 2013 at 3:01 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
> To avoid the above 3 factors in test readings, I used the below steps:
> 1. Initialize the database with a scale factor such that database size +
> shared_buffers = RAM (shared_buffers = 1/4 of RAM).
>    For example:
>    Example 1: if RAM = 128GB, then initialize db with scale factor = 6700
>    and shared_buffers = 32GB. Database size (98GB) + shared_buffers
>    (32GB) = 130, which is approximately equal to total RAM.
>    Example 2 (this is based on your test m/c): if RAM = 64GB, then
>    initialize db with scale factor = 3400 and shared_buffers = 16GB.
> 2. Reboot m/c.
> 3. Load all buffers with data (tables/indexes of pgbench) using
> pg_prewarm. I loaded 3 times, so that the usage count of buffers will be
> approximately 3.

Hmm. I don't think the usage count will actually end up being 3, though, because the amount of data you're loading is sized to 3/4 of RAM, and shared_buffers is just 1/4 of RAM, so I think that each run of pg_prewarm will end up turning over the entire cache and you'll never get any usage counts more than 1 this way. Am I confused?

I wonder if it would be beneficial to test the case where the database size is just a little more than shared_buffers. I think that would lead to a situation where the usage counts are high most of the time, which - now that you mention it - seems like the sweet spot for this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Monday, June 24, 2013 11:00 PM Robert Haas wrote:
> On Thu, Jun 6, 2013 at 3:01 AM, Amit Kapila <amit.kapila@huawei.com>
> wrote:
> > To avoid the above 3 factors in test readings, I used the below steps:
> > 1. Initialize the database with a scale factor such that database size
> > + shared_buffers = RAM (shared_buffers = 1/4 of RAM).
> >    For example:
> >    Example 1: if RAM = 128GB, then initialize db with scale factor =
> >    6700 and shared_buffers = 32GB. Database size (98GB) +
> >    shared_buffers (32GB) = 130, which is approximately equal to total
> >    RAM.
> >    Example 2 (this is based on your test m/c): if RAM = 64GB, then
> >    initialize db with scale factor = 3400 and shared_buffers = 16GB.
> > 2. Reboot m/c.
> > 3. Load all buffers with data (tables/indexes of pgbench) using
> > pg_prewarm. I loaded 3 times, so that the usage count of buffers will
> > be approximately 3.
>
> Hmm. I don't think the usage count will actually end up being 3,
> though, because the amount of data you're loading is sized to 3/4 of
> RAM, and shared_buffers is just 1/4 of RAM, so I think that each run
> of pg_prewarm will end up turning over the entire cache and you'll
> never get any usage counts more than 1 this way. Am I confused?

The way I am pre-warming is to load the data of the relations (tables/indexes) continuously 3 times, so mostly the buffers will contain the data of the relations loaded last, which are the indexes, and these also get accessed more during scans. So their usage count should be 3. Can you please have a look at load_all_buffers.sql once; maybe my understanding has some gap.

Now about the question of why then load all the relations: apart from PostgreSQL shared buffers, loading the data this way can also make sure the OS buffers will have the data with a higher usage count, which can lead to better OS scheduling.

> I wonder if it would be beneficial to test the case where the database
> size is just a little more than shared_buffers. I think that would
> lead to a situation where the usage counts are high most of the time,
> which - now that you mention it - seems like the sweet spot for this
> patch.

I will check this case and take the readings for the same. Thanks for your suggestions.

With Regards,
Amit Kapila.
On Tuesday, June 25, 2013 10:25 AM Amit Kapila wrote:
> On Monday, June 24, 2013 11:00 PM Robert Haas wrote:
> > On Thu, Jun 6, 2013 at 3:01 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
> > > To avoid above 3 factors in test readings, I used below steps:
> > > 1. Initialize the database with scale factor such that database size +
> > > shared_buffers = RAM (shared_buffers = 1/4 of RAM).
> > > For example:
> > > Example -1
> > > if RAM = 128G, then initialize db with scale factor = 6700
> > > and shared_buffers = 32GB.
> > > Database size (98 GB) + shared_buffers (32GB) = 130 (which
> > > is approximately equal to total RAM)
> > > Example -2 (this is based on your test m/c)
> > > If RAM = 64GB, then initialize db with scale factor = 3400
> > > and shared_buffers = 16GB.
> > > 2. reboot m/c
> > > 3. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm.
> > > I had loaded 3 times, so that usage count of buffers will be approximately
> > > 3.
> >
> > Hmm. I don't think the usage count will actually end up being 3,
> > though, because the amount of data you're loading is sized to 3/4 of
> > RAM, and shared_buffers is just 1/4 of RAM, so I think that each run
> > of pg_prewarm will end up turning over the entire cache and you'll
> > never get any usage counts more than 1 this way. Am I confused?
>
> The way I am pre-warming is to load the data of each relation (table/index)
> continuously 3 times, so most buffers will contain the data of the relations
> loaded last, which are the indexes, and those also get accessed more during
> scans. So the usage count should be 3.
>
> Now about the question of why then load all the relations: apart from
> PostgreSQL shared buffers, loading data this way can also make sure the OS
> buffers will have the data with a higher usage count, which can lead to
> better OS scheduling.
>
> > I wonder if it would be beneficial to test the case where the database
> > size is just a little more than shared_buffers. I think that would
> > lead to a situation where the usage counts are high most of the time,
> > which - now that you mention it - seems like the sweet spot for this
> > patch.
>
> I will check this case and take readings for the same. Thanks for your
> suggestions.

Configuration Details
O/S - Suse-11
RAM - 128GB
Number of Cores - 16
Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min, synchronous_commit = OFF, shared_buffers = 14GB, AutoVacuum = off
Pgbench - Select-only
Scalefactor - 1200
Time - 30 mins

         8C-8T    16C-16T    32C-32T    64C-64T
Head     62403    101810     99516      94707
Patch    62827    101404     99109      94744

On 128GB RAM, if we use scalefactor = 1200 (database = approx 17GB) and 14GB shared buffers, there is no major difference. One of the reasons could be that there is not much swapping in shared buffers, as most data already fits in shared buffers.

I think more readings are needed for combinations related to the below setting:
scale factor such that database size + shared_buffers = RAM (shared_buffers = 1/4 of RAM).
I can try varying the shared_buffers size. Kindly let me know your suggestions.

With Regards,
Amit Kapila.
On Wed, Jun 26, 2013 at 8:09 AM, Amit Kapila <amit.kapila@huawei.com> wrote: > Configuration Details > O/S - Suse-11 > RAM - 128GB > Number of Cores - 16 > Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min, > synchronous_commit = 0FF, shared_buffers = 14GB, AutoVacuum=off Pgbench - > Select-only Scalefactor - 1200 Time - 30 mins > > 8C-8T 16C-16T 32C-32T 64C-64T > Head 62403 101810 99516 94707 > Patch 62827 101404 99109 94744 > > On 128GB RAM, if use scalefactor=1200 (database=approx 17GB) and 14GB shared > buffers, this is no major difference. > One of the reasons could be that there is no much swapping in shared buffers > as most data already fits in shared buffers. I'd like to just back up a minute here and talk about the broader picture here. What are we trying to accomplish with this patch? Last year, I did some benchmarking on a big IBM POWER7 machine (16 cores, 64 hardware threads). Here are the results: http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html Now, if you look at these results, you see something interesting. When there aren't too many concurrent connections, the higher scale factors are only modestly slower than the lower scale factors. But as the number of connections increases, the performance continues to rise at the lower scale factors, and at the higher scale factors, this performance stops rising and in fact drops off. So in other words, there's no huge *performance* problem for a working set larger than shared_buffers, but there is a huge *scalability* problem. Now why is that? As far as I can tell, the answer is that we've got a scalability problem around BufFreelistLock. Contention on the buffer mapping locks may also be a problem, but all of my previous benchmarking (with LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant in the room. My interest in having the background writer add buffers to the free list is basically around solving that problem. It's a pretty dramatic problem, as the graph above shows, and this patch doesn't solve it. There may be corner cases where this patch improves things (or, equally, makes them worse) but as a general point, the difficulty I've had reproducing your test results and the specificity of your instructions for reproducing them suggests to me that what we have here is not a clear improvement on general workloads. Yet such an improvement should exist, because there are other products in the world that have scalable buffer managers; we currently don't. Instead of spending a lot of time trying to figure out whether there's a small win in narrow cases here (and there may well be), I think we should back up and ask why this isn't a great big win, and what we'd need to do to *get* a great big win. I don't see much point in tinkering around the edges here if things are broken in the middle; things that seem like small wins or losses now may turn out otherwise in the face of a more comprehensive solution. One thing that occurred to me while writing this note is that the background writer doesn't have any compelling reason to run on a read-only workload. It will still run at a certain minimum rate, so that it cycles the buffer pool every 2 minutes, if I remember correctly. But it won't run anywhere near fast enough to keep up with the buffer allocation demands of 8, or 32, or 64 sessions all reading data not all of which is in shared_buffers at top speed. In fact, we've had reports that the background writer isn't too effective even on read-write workloads. 
The point is - if the background writer isn't waking up and running frequently enough, what it does when it does wake up isn't going to matter very much. I think we need to spend some energy poking at that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-06-27 08:23:31 -0400, Robert Haas wrote:
> I'd like to just back up a minute here and talk about the broader
> picture here.

Sounds like a very good plan.

> So in other words,
> there's no huge *performance* problem for a working set larger than
> shared_buffers, but there is a huge *scalability* problem. Now why is
> that?
> As far as I can tell, the answer is that we've got a scalability
> problem around BufFreelistLock.

Part of the problem is its name ;)

> Contention on the buffer mapping
> locks may also be a problem, but all of my previous benchmarking (with
> LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant
> in the room.

Contention-wise I agree. What I have seen is that we have a huge amount of cacheline bouncing around the buffer header spinlocks.

> My interest in having the background writer add buffers
> to the free list is basically around solving that problem. It's a
> pretty dramatic problem, as the graph above shows, and this patch
> doesn't solve it.

> One thing that occurred to me while writing this note is that the
> background writer doesn't have any compelling reason to run on a
> read-only workload. It will still run at a certain minimum rate, so
> that it cycles the buffer pool every 2 minutes, if I remember
> correctly.

I have previously added some ad-hoc instrumentation that printed the number of buffers that were required (by other backends) during a bgwriter cycle and the number of buffers that the buffer manager could actually write out. I don't think I actually found any workload where the bgwriter wrote out a relevant percentage of the necessary pages. Which would explain why the patch doesn't have a big benefit: the freelist is empty most of the time, so we don't benefit from the reduced work done under the lock.

I think the whole algorithm that guides how much the background writer actually does, including its pacing/sleeping logic, needs to be rewritten from scratch before we are actually able to measure the benefit from this patch. I personally don't think there's much to salvage from the current code.

Problems with the current code:

* doesn't manipulate the usage_count and never does anything to used pages. Which means it will just about never find a victim buffer in a busy database.
* by far not aggressive enough, touches only a few buffers ahead of the clock sweep.
* does not advance the clock sweep, so the individual backends will touch the same buffers again and transfer all the buffer spinlock cachelines around.
* the adaptation logic it has makes it so slow to adapt that it takes several minutes to adapt.
* ...

There's another thing we could do to noticeably improve scalability of buffer acquisition. Currently we do a huge amount of work under the freelist lock. In StrategyGetBuffer:

    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
    ...
    // check freelist, will usually be empty
    ...
    for (;;)
    {
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

        ++StrategyControl->nextVictimBuffer;

        LockBufHdr(buf);
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
            {
                buf->usage_count--;
            }
            else
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
        }
        UnlockBufHdr(buf);
    }

So, we perform the entire clock sweep until we have found a single buffer we can use, inside a *global* lock. At times we need to iterate over the whole of shared buffers BM_MAX_USAGE_COUNT (5) times till we have pushed down all the usage counts enough (if the database is busy it can take even longer...).
In a busy database, where usually all the usage counts are high, the next backend will touch a lot of those buffers again, which causes massive cache eviction & bouncing.

It seems far more sensible to only protect the clock sweep's nextVictimBuffer with a spinlock. With some care all the rest can happen without any global interlock.

I think even after fixing this - which we definitely should do - having a sensible/more aggressive bgwriter moving pages onto the freelist makes sense, because backends then don't need to deal with dirty pages.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
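To make that proposal concrete, here is a minimal, self-contained C model of the idea: global serialization is limited to advancing the sweep hand, and everything else happens under the per-buffer header lock. The pthread mutexes stand in for PostgreSQL's spinlock and buffer-header lock, and all names and sizes here (victim_lock, hdr_lock, NBUFFERS) are illustrative assumptions, not the actual buffer-manager code:

    #include <pthread.h>

    #define NBUFFERS 1024

    typedef struct
    {
        pthread_mutex_t hdr_lock;   /* stands in for LockBufHdr(); assumed
                                     * initialized elsewhere */
        int refcount;
        int usage_count;
    } BufDesc;

    static BufDesc buffers[NBUFFERS];
    static pthread_mutex_t victim_lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_victim = 0;

    static BufDesc *
    clock_sweep_get_victim(void)
    {
        for (;;)
        {
            int victim;
            BufDesc *buf;

            /* Global serialization is limited to advancing the hand. */
            pthread_mutex_lock(&victim_lock);
            victim = next_victim;
            next_victim = (next_victim + 1) % NBUFFERS;
            pthread_mutex_unlock(&victim_lock);

            buf = &buffers[victim];

            /* All per-buffer work happens under the per-buffer lock. */
            pthread_mutex_lock(&buf->hdr_lock);
            if (buf->refcount == 0)
            {
                if (buf->usage_count > 0)
                    buf->usage_count--;
                else
                    return buf;     /* caller unlocks hdr_lock when done */
            }
            pthread_mutex_unlock(&buf->hdr_lock);
        }
    }

A production version would also cap the number of full passes instead of looping indefinitely when every buffer is pinned, and could advance the hand atomically rather than under a lock, as comes up later in the thread.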
On Thu, Jun 27, 2013 at 9:01 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Contention wise I aggree. What I have seen is that we have a huge > amount of cacheline bouncing around the buffer header spinlocks. How did you measure that? > I have previously added some adhoc instrumentation that printed the > amount of buffers that were required (by other backends) during a > bgwriter cycle and the amount of buffers that the buffer manager could > actually write out. I think you can see how many are needed from buffers_alloc. No? > I don't think I actually found any workload where > the bgwriter actually wroute out a relevant percentage of the necessary > pages. Check. > Problems with the current code: > > * doesn't manipulate the usage_count and never does anything to used > pages. Which means it will just about never find a victim buffer in a > busy database. Right. I was thinking that was part of this patch, but it isn't. I think we should definitely add that. In other words, the background writer's job should be to run the clock sweep and add buffers to the free list. I think we should also split the lock: a spinlock for the freelist, and an lwlock for the clock sweep. > * by far not aggressive enough, touches only a few buffers ahead of the > clock sweep. Check. Fixing this might be a separate patch, but then again maybe not. The changes we're talking about here provide a natural feedback mechanism: if we observe that the freelist is empty (or less than some length, like 32 buffers?) set the background writer's latch, because we know it's not keeping up. > * does not advance the clock sweep, so the individual backends will > touch the same buffers again and transfer all the buffer spinlock > cacheline around Yes, I think that should be fixed as part of this patch too. It's obviously connected to the point about usage counts. > * The adaption logic it has makes it so slow to adapt that it takes > several minutes to adapt. Yeah. I don't know if fixing that will fall naturally out of these other changes or not, but I think it's a second-order concern in any event. > There's another thing we could do to noticeably improve scalability of > buffer acquiration. Currently we do a huge amount of work under the > freelist lock. > In StrategyGetBuffer: > LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); > ... > // check freelist, will usually be empty > ... > for (;;) > { > buf = &BufferDescriptors[StrategyControl->nextVictimBuffer]; > > ++StrategyControl->nextVictimBuffer; > > LockBufHdr(buf); > if (buf->refcount == 0) > { > if (buf->usage_count > 0) > { > buf->usage_count--; > } > else > { > /* Found a usable buffer */ > if (strategy != NULL) > AddBufferToRing(strategy, buf); > return buf; > } > } > UnlockBufHdr(buf); > } > > So, we perform the entire clock sweep until we found a single buffer we > can use inside a *global* lock. At times we need to iterate over the > whole shared buffers BM_MAX_USAGE_COUNT (5) times till we pushed down all > the usage counts enough (if the database is busy it can take even > longer...). > In a busy database where usually all the usagecounts are high the next > backend will touch a lot of those buffers again which causes massive > cache eviction & bouncing. > > It seems far more sensible to only protect the clock sweep's > nextVictimBuffer with a spinlock. With some care all the rest can happen > without any global interlock. That's a lot more spinlock acquire/release cycles, but it might work out to a win anyway. 
Or it might lead to the system suffering a horrible spinlock-induced death spiral on eviction-heavy workloads. > I think even after fixing this - which we definitely should do - having > a sensible/more aggressive bgwriter moving pages onto the freelist makes > sense because then backends then don't need to deal with dirty pages. Or scanning to find evictable pages. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
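A hedged sketch of the feedback mechanism being discussed here, again as self-contained C rather than actual PostgreSQL code: the freelist keeps a length counter beside its lock, and the consumer signals the refill process when the list runs low. The condition variable stands in for the bgwriter's latch, the 32-buffer threshold is the value floated above, and the remaining names (num_free, FreeBuf, the pop helper) are assumptions:

    #include <pthread.h>
    #include <stddef.h>

    #define LOW_WATERMARK 32            /* threshold suggested in the mail */

    typedef struct FreeBuf
    {
        struct FreeBuf *free_next;
    } FreeBuf;

    static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t bgwriter_wakeup = PTHREAD_COND_INITIALIZER;
    static FreeBuf *freelist_head = NULL;
    static int num_free = 0;

    static FreeBuf *
    pop_free_buffer(void)
    {
        FreeBuf *buf;

        pthread_mutex_lock(&freelist_lock);
        buf = freelist_head;
        if (buf != NULL)
        {
            freelist_head = buf->free_next;
            num_free--;
        }

        /* Wake the refill process before the list is fully drained. */
        if (num_free < LOW_WATERMARK)
            pthread_cond_signal(&bgwriter_wakeup);
        pthread_mutex_unlock(&freelist_lock);

        return buf;             /* NULL means: fall back to the clock sweep */
    }

Note that the pop itself is O(1); the clock sweep only has to run when the list is unexpectedly empty, which is exactly the property the watermark scheme is meant to preserve.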
On 2013-06-27 09:50:32 -0400, Robert Haas wrote: > On Thu, Jun 27, 2013 at 9:01 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > Contention wise I aggree. What I have seen is that we have a huge > > amount of cacheline bouncing around the buffer header spinlocks. > > How did you measure that? perf record -e cache-misses. If you want it more detailed looking at {L1,LLC}-{load,store}{s,misses} can sometimes be helpful too. Also, running perf stat -vvv postgres -D ... for a whole benchmark can be useful to compare how much a change influences cache misses and such. For very detailed analysis running something under valgrind/cachegrind can be helpful too, but I usually find perf to be sufficient. > > I have previously added some adhoc instrumentation that printed the > > amount of buffers that were required (by other backends) during a > > bgwriter cycle and the amount of buffers that the buffer manager could > > actually write out. > > I think you can see how many are needed from buffers_alloc. No? Not easily correlated with bgwriter activity. If we cannot keep up because it's 100% busy writing out buffers I don't have many problems with that. But I don't think we often are. > > Problems with the current code: > > > > * doesn't manipulate the usage_count and never does anything to used > > pages. Which means it will just about never find a victim buffer in a > > busy database. > > Right. I was thinking that was part of this patch, but it isn't. I > think we should definitely add that. In other words, the background > writer's job should be to run the clock sweep and add buffers to the > free list. We might need to split it into two for that. One process to writeout dirty pages, one to populate the freelist. Otherwise we will probably regularly hit the current scalability issues because we're currently io contended. Say during a busy or even immediate checkpoint. > I think we should also split the lock: a spinlock for the > freelist, and an lwlock for the clock sweep. Yea, thought about that when writing the thing about the exclusive lock during the clocksweep. > > * by far not aggressive enough, touches only a few buffers ahead of the > > clock sweep. > > Check. Fixing this might be a separate patch, but then again maybe > not. The changes we're talking about here provide a natural feedback > mechanism: if we observe that the freelist is empty (or less than some > length, like 32 buffers?) set the background writer's latch, because > we know it's not keeping up. Yes, that makes sense. Also provides adaptability to bursty workloads which means we don't have too complex logic in the bgwriter for that. > > There's another thing we could do to noticeably improve scalability of > > buffer acquiration. Currently we do a huge amount of work under the > > freelist lock. > > ... > > So, we perform the entire clock sweep until we found a single buffer we > > can use inside a *global* lock. At times we need to iterate over the > > whole shared buffers BM_MAX_USAGE_COUNT (5) times till we pushed down all > > the usage counts enough (if the database is busy it can take even > > longer...). > > In a busy database where usually all the usagecounts are high the next > > backend will touch a lot of those buffers again which causes massive > > cache eviction & bouncing. > > > > It seems far more sensible to only protect the clock sweep's > > nextVictimBuffer with a spinlock. With some care all the rest can happen > > without any global interlock. 
> That's a lot more spinlock acquire/release cycles, but it might work
> out to a win anyway. Or it might lead to the system suffering a
> horrible spinlock-induced death spiral on eviction-heavy workloads.

I can't imagine it being worse than what we have today. Also, nobody requires us to only advance the clock sweep by one page; we can easily do it, say, 29 pages at a time or so if we detect the lock is contended. Alternatively, it shouldn't be too hard to make it into an atomic increment, although that requires some trickery to handle the wraparound sanely.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
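The atomic-increment variant could look roughly like the following self-contained C11 sketch; PostgreSQL of this era has no atomics abstraction, so this is purely an illustration of the technique, with NBUFFERS and the function name as assumptions. Letting a 64-bit ticket counter run freely and reducing it modulo the pool size on use sidesteps the wraparound trickery that a counter kept in range would need:

    #include <stdatomic.h>
    #include <stdint.h>

    #define NBUFFERS 16384              /* illustrative pool size */

    /* Monotonically increasing ticket; zero-initialized as a static. */
    static atomic_uint_fast64_t next_victim_ticket;

    static inline int
    clock_sweep_tick(void)
    {
        /* fetch_add returns the previous value; no CAS loop needed. */
        uint64_t ticket = atomic_fetch_add(&next_victim_ticket, 1);

        /* The buffer to inspect is the ticket modulo the pool size. */
        return (int) (ticket % NBUFFERS);
    }

At a billion increments per second a 64-bit counter takes centuries to wrap, so the discontinuity at wraparound can be ignored in practice.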
Andres Freund <andres@2ndquadrant.com> wrote:

> I don't think I actually found any workload where the bgwriter
> actually wrote out a relevant percentage of the necessary pages.

I had one at Wisconsin Courts. The database we targeted with logical replication from the 72 circuit court databases (plus a few others), on a six-connection database pool with about 20 to (at peaks) hundreds of transactions per second modifying the database (the average transaction involving about 20 modifying statements with potentially hundreds of affected rows), and maybe 2000 to 3000 queries per second on a 30-connection pool, wrote about one-third of the dirty buffers each via checkpoints, the background writer, and backends needing to read a page.

I shared my numbers with Greg, who I believe used them as one of his examples for how to tune memory, checkpoints, and the background writer, so you might want to check with him if you want more detail. Of course, we set bgwriter_lru_maxpages = 1000 and bgwriter_lru_multiplier = 4, and kept shared_buffers to 2GB to hit that. Without the reduced shared_buffers and a more aggressive bgwriter, we hit the problem of writes overwhelming the RAID controller's cache and causing everything in the database to "freeze" until it cleared some cache space.

I'm not saying this invalidates your general argument; just that such cases do exist. Hopefully this data point is useful.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thursday, June 27, 2013 5:54 PM Robert Haas wrote: > On Wed, Jun 26, 2013 at 8:09 AM, Amit Kapila <amit.kapila@huawei.com> > wrote: > > Configuration Details > > O/S - Suse-11 > > RAM - 128GB > > Number of Cores - 16 > > Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min, > > synchronous_commit = 0FF, shared_buffers = 14GB, AutoVacuum=off > Pgbench - > > Select-only Scalefactor - 1200 Time - 30 mins > > > > 8C-8T 16C-16T 32C-32T 64C- > 64T > > Head 62403 101810 99516 94707 > > Patch 62827 101404 99109 94744 > > > > On 128GB RAM, if use scalefactor=1200 (database=approx 17GB) and 14GB > shared > > buffers, this is no major difference. > > One of the reasons could be that there is no much swapping in shared > buffers > > as most data already fits in shared buffers. > > I'd like to just back up a minute here and talk about the broader > picture here. What are we trying to accomplish with this patch? Last > year, I did some benchmarking on a big IBM POWER7 machine (16 cores, > 64 hardware threads). Here are the results: > > http://rhaas.blogspot.com/2012/03/performance-and-scalability-on- > ibm.html > > Now, if you look at these results, you see something interesting. > When there aren't too many concurrent connections, the higher scale > factors are only modestly slower than the lower scale factors. But as > the number of connections increases, the performance continues to rise > at the lower scale factors, and at the higher scale factors, this > performance stops rising and in fact drops off. So in other words, > there's no huge *performance* problem for a working set larger than > shared_buffers, but there is a huge *scalability* problem. Now why is > that? > > As far as I can tell, the answer is that we've got a scalability > problem around BufFreelistLock. Contention on the buffer mapping > locks may also be a problem, but all of my previous benchmarking (with > LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant > in the room. My interest in having the background writer add buffers > to the free list is basically around solving that problem. It's a > pretty dramatic problem, as the graph above shows, and this patch > doesn't solve it. There may be corner cases where this patch improves > things (or, equally, makes them worse) but as a general point, the > difficulty I've had reproducing your test results and the specificity > of your instructions for reproducing them suggests to me that what we > have here is not a clear improvement on general workloads. Yet such > an improvement should exist, because there are other products in the > world that have scalable buffer managers; we currently don't. Instead > of spending a lot of time trying to figure out whether there's a small > win in narrow cases here (and there may well be), I think we should > back up and ask why this isn't a great big win, and what we'd need to > do to *get* a great big win. I don't see much point in tinkering > around the edges here if things are broken in the middle; things that > seem like small wins or losses now may turn out otherwise in the face > of a more comprehensive solution. > > One thing that occurred to me while writing this note is that the > background writer doesn't have any compelling reason to run on a > read-only workload. It will still run at a certain minimum rate, so > that it cycles the buffer pool every 2 minutes, if I remember > correctly. 
> But it won't run anywhere near fast enough to keep up with
> the buffer allocation demands of 8, or 32, or 64 sessions all reading
> data not all of which is in shared_buffers at top speed. In fact,
> we've had reports that the background writer isn't too effective even
> on read-write workloads. The point is - if the background writer
> isn't waking up and running frequently enough, what it does when it
> does wake up isn't going to matter very much. I think we need to
> spend some energy poking at that.

Currently it wakes up based on the bgwriter_delay config parameter, which is by default 200ms; so you mean we should think of waking up the bgwriter based on allocations and the number of elements left in the freelist?

As per my understanding, the summarization of points raised by you and Andres which this patch should address to have a bigger win:

1. Bgwriter needs to be improved so that it can help in reducing usage counts and finding the next victim buffer (run the clock sweep and add buffers to the free list).
2. SetLatch for bgwriter (wake up bgwriter) when elements in the freelist are few.
3. Split the global BufFreelistLock taken in StrategyGetBuffer (a spinlock for the freelist, and an lwlock for the clock sweep).
4. Separate processes for writing dirty buffers and moving buffers to the freelist.
5. Bgwriter needs to be more aggressive; the logic by which it calculates how many buffers it needs to process needs to be improved.
6. There can be contention around the buffer mapping locks, but we can focus on that later.
7. Cacheline bouncing around the buffer header spinlocks - is there anything we can do to reduce this?

Kindly let me know if I have missed any point.

With Regards,
Amit Kapila.
On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila <amit.kapila@huawei.com> wrote: > Currently it wakes up based on bgwriterdelay config parameter which is by > default 200ms, so you means we should > think of waking up bgwriter based on allocations and number of elements left > in freelist? I think that's what Andres and I are proposing, yes. > As per my understanding Summarization of points raised by you and Andres > which this patch should address to have a bigger win: > > 1. Bgwriter needs to be improved so that it can help in reducing usage count > and finding next victim buffer > (run the clock sweep and add buffers to the free list). Check. > 2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist are > less. Check. The way to do this is to keep a variable in shared memory in the same cache line as the spinlock protecting the freelist, and update it when you update the free list. > 3. Split the workdone globallock (Buffreelist) in StrategyGetBuffer > (a spinlock for the freelist, and an lwlock for the clock sweep). Check. > 4. Separate processes for writing dirty buffers and moving buffers to > freelist I think this part might be best pushed to a separate patch, although I agree we probably need it. > 5. Bgwriter needs to be more aggressive, logic based on which it calculates > how many buffers it needs to process needs to be improved. This is basically overlapping with points already made. I suspect we could just get rid of bgwriter_delay, bgwriter_lru_maxpages, and bgwriter_lru_multiplier altogether. The background writer would just have a high and a low watermark. When the number of buffers on the freelist drops below the low watermark, the allocating backend sets the latch and bgwriter wakes up and begins adding buffers to the freelist. When the number of buffers on the free list reaches the high watermark, the background writer goes back to sleep. Some experimentation might be needed to figure out what values are appropriate for those watermarks. In theory this could be a configuration knob, but I suspect it's better to just make the system tune it right automatically. > 6. There can be contention around buffer mapping locks, but we can focus on > it later > 7. cacheline bouncing around the buffer header spinlocks, is there anything > we can do to reduce this? I think these are points that we should leave for the future. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
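A self-contained C sketch of the watermark loop just described: sleep until signalled, refill to the high mark, go back to sleep. The watermark value, the condition variable standing in for the bgwriter's latch, and fill_freelist_once() are all assumptions for illustration, not proposed code:

    #include <pthread.h>
    #include <stdbool.h>

    #define HIGH_WATERMARK 2048     /* illustrative; needs experimentation */

    extern pthread_mutex_t freelist_lock;
    extern pthread_cond_t bgwriter_wakeup;  /* stands in for the latch */
    extern int num_free;                    /* freelist length counter */

    /* Pushes one clean, zero-usage-count buffer onto the freelist; takes
     * freelist_lock itself and returns false once the high watermark is
     * reached or no candidate buffer was found. */
    extern bool fill_freelist_once(void);

    static void *
    bgwriter_freelist_loop(void *arg)
    {
        (void) arg;

        for (;;)
        {
            /* Sleep until an allocating backend signals the low mark. */
            pthread_mutex_lock(&freelist_lock);
            while (num_free >= HIGH_WATERMARK)
                pthread_cond_wait(&bgwriter_wakeup, &freelist_lock);
            pthread_mutex_unlock(&freelist_lock);

            /* Refill up to the high mark, then go back to sleep. */
            while (fill_freelist_once())
                ;
        }

        return NULL;
    }

Because the wait has no timeout, this sketch embodies the "get rid of bgwriter_delay" idea; a timed wait could preserve a periodic background scan if that behavior turns out to be worth keeping.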
On Fri, Jun 28, 2013 at 8:50 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila <amit.kapila@huawei.com> wrote: >> Currently it wakes up based on bgwriterdelay config parameter which is by >> default 200ms, so you means we should >> think of waking up bgwriter based on allocations and number of elements left >> in freelist? > > I think that's what Andres and I are proposing, yes. Incidentally, I'm going to mark this patch Returned with Feedback in the CF application. I think this line of inquiry has potential, but clearly there's a lot more work to do here before we commit anything, and I don't think that's going to happen in the next few weeks. But let's keep discussing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6/28/13 8:50 AM, Robert Haas wrote:
> On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> 4. Separate processes for writing dirty buffers and moving buffers to
>> the freelist.
>
> I think this part might be best pushed to a separate patch, although I
> agree we probably need it.

This might be necessary eventually, but it's going to make things more complicated. And I don't think it's a blocker for creating something useful.

The two most common workloads are:

1) Lots of low usage count data, typically data that is updated sparsely across a larger database. These are helped by a process that writes dirty buffers in the background, and they benefit from the current background writer. Kevin's system, which he was just mentioning again, is the best example of this type that there's public data on.

2) Lots of high usage count data, because there are large hotspots in things like index blocks. Most writes happen at checkpoint time, because the background writer won't touch them. Because there are only a small number of re-usable pages, the clock sweep goes around very fast looking for them. This is the type of workload that should benefit from putting buffers into the free list. pgbench provides a simple example of this type, which is why Amit's tests using it have been useful.

If you had a process that tried to handle both background writes and freelist management, I suspect one path would be hot and the other almost idle in each type of system. Since I don't expect that splitting those into two separate processes would buy a lot of value, that can easily be pushed to a later patch.

> The background writer would just
> have a high and a low watermark. When the number of buffers on the
> freelist drops below the low watermark, the allocating backend sets
> the latch and bgwriter wakes up and begins adding buffers to the
> freelist. When the number of buffers on the free list reaches the
> high watermark, the background writer goes back to sleep.

This will work fine for all of the common workloads. The main challenge is keeping the buffer allocation counting from turning into a hotspot. Busy systems now can easily hit 100K buffer allocations/second. I'm not too worried about it, because those allocations are making the free list lock a hotspot right now.

One of the consistently controversial parts of the current background writer is how it tries to loop over the buffer cache every 2 minutes, regardless of activity level. The idea there was that on bursty workloads, buffers would be cleaned during idle periods with that mechanism. Part of why that's in there is to deal with the relatively long pause between background writer runs.

This refactoring idea will make that hard to keep around. I think this is OK though. Switching to a latch based design should eliminate bgwriter_delay, which means you won't have this worst case of a 200ms stall while heavy activity is incoming.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Fri, Jun 28, 2013 at 12:10 PM, Greg Smith <greg@2ndquadrant.com> wrote: > This refactoring idea will make that hard to keep around. I think this is > OK though. Switching to a latch based design should eliminate the > bgwriter_delay, which means you won't have this worst case of a 200ms stall > while heavy activity is incoming. I'm a strong proponent of that 2 minute cycle, so I'd vote for finding a way to keep it around. But I don't think that (or 200 ms wakeups) should be the primary thing driving the background writer, either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Friday, June 28, 2013 6:20 PM Robert Haas wrote:
> On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> Currently it wakes up based on the bgwriter_delay config parameter, which is by
>> default 200ms; so you mean we should think of waking up the bgwriter based on
>> allocations and the number of elements left in the freelist?
>
> I think that's what Andres and I are proposing, yes.
>
>> As per my understanding, the summarization of points raised by you and Andres
>> which this patch should address to have a bigger win:
>>
>> 1. Bgwriter needs to be improved so that it can help in reducing usage counts
>> and finding the next victim buffer
>> (run the clock sweep and add buffers to the free list).
>
> Check.
>
>> 2. SetLatch for bgwriter (wake up bgwriter) when elements in the freelist are
>> few.
>
> Check. The way to do this is to keep a variable in shared memory in
> the same cache line as the spinlock protecting the freelist, and
> update it when you update the free list.
>
>> 3. Split the global BufFreelistLock taken in StrategyGetBuffer
>> (a spinlock for the freelist, and an lwlock for the clock sweep).
>
> Check.
>
>> 4. Separate processes for writing dirty buffers and moving buffers to
>> the freelist.
>
> I think this part might be best pushed to a separate patch, although I
> agree we probably need it.
>
>> 5. Bgwriter needs to be more aggressive; the logic by which it calculates
>> how many buffers it needs to process needs to be improved.
>
> This is basically overlapping with points already made. I suspect we
> could just get rid of bgwriter_delay, bgwriter_lru_maxpages, and
> bgwriter_lru_multiplier altogether. The background writer would just
> have a high and a low watermark. When the number of buffers on the
> freelist drops below the low watermark, the allocating backend sets
> the latch and bgwriter wakes up and begins adding buffers to the
> freelist. When the number of buffers on the free list reaches the
> high watermark, the background writer goes back to sleep. Some
> experimentation might be needed to figure out what values are
> appropriate for those watermarks. In theory this could be a
> configuration knob, but I suspect it's better to just make the system
> tune it right automatically.

Do you think it will be sufficient to just wake the bgwriter when the buffers in the freelist drop below the low watermark? How about its current job of flushing dirty buffers? I mean to ask: in a scenario where there are sufficient buffers in the freelist, but most other buffers are dirty, is delaying the flush until the number of buffers falls below the low watermark okay?

>> 6. There can be contention around the buffer mapping locks, but we can focus on
>> that later.
>> 7. Cacheline bouncing around the buffer header spinlocks - is there anything
>> we can do to reduce this?
>
> I think these are points that we should leave for the future.

With Regards,
Amit Kapila.
On Friday, June 28, 2013 6:38 PM Robert Haas wrote: On Fri, Jun 28, 2013 at 8:50 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila <amit.kapila@huawei.com> wrote: >>> Currently it wakes up based on bgwriterdelay config parameter which is by >>> default 200ms, so you means we should >>> think of waking up bgwriter based on allocations and number of elements left >>> in freelist? > >> I think that's what Andres and I are proposing, yes. > Incidentally, I'm going to mark this patch Returned with Feedback in >the CF application. Many thanks to you and Andres for providing valuable suggestions. >I think this line of inquiry has potential, but >clearly there's a lot more work to do here before we commit anything, >and I don't think that's going to happen in the next few weeks. But >let's keep discussing. Sure. With Regards, Amit Kapila.
On Sun, Jun 30, 2013 at 3:24 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
> Do you think it will be sufficient to just wake the bgwriter when the buffers in the freelist drop
> below the low watermark? How about its current job of flushing dirty buffers?

Well, the only point of flushing dirty buffers in the background writer is to make sure that backends can allocate buffers quickly. If there are clean buffers already in the freelist, that's not a concern. So...

> I mean to ask: in a scenario where there are sufficient buffers in the freelist, but most
> other buffers are dirty, is delaying the flush until the number of buffers falls below the low watermark okay?

...I think this is OK, or at least we should assume it's OK until we have evidence that it isn't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tuesday, July 02, 2013 12:00 AM Robert Haas wrote:
> On Sun, Jun 30, 2013 at 3:24 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
> > Do you think it will be sufficient to just wake the bgwriter when the buffers in the freelist drop
> > below the low watermark? How about its current job of flushing dirty buffers?
>
> Well, the only point of flushing dirty buffers in the background
> writer is to make sure that backends can allocate buffers quickly. If
> there are clean buffers already in the freelist, that's not a concern.
> So...
>
> > I mean to ask: in a scenario where there are sufficient buffers in the freelist, but most
> > other buffers are dirty, is delaying the flush until the number of buffers falls below the low watermark okay?
>
> ...I think this is OK, or at least we should assume it's OK until we
> have evidence that it isn't.

Sure. After completing my other review work for the CommitFest, I will devise a solution for the suggestions summarized in the previous mail and then start a discussion about it.

With Regards,
Amit Kapila.
On 28 June 2013 05:52, Amit Kapila <amit.kapila@huawei.com> wrote:
> As per my understanding, the summarization of points raised by you and Andres
> which this patch should address to have a bigger win:
> 1. Bgwriter needs to be improved so that it can help in reducing usage counts
> and finding the next victim buffer
> (run the clock sweep and add buffers to the free list).
> 2. SetLatch for bgwriter (wake up bgwriter) when elements in the freelist are
> few.
> 3. Split the global BufFreelistLock taken in StrategyGetBuffer
> (a spinlock for the freelist, and an lwlock for the clock sweep).
> 4. Separate processes for writing dirty buffers and moving buffers to
> the freelist.
> 5. Bgwriter needs to be more aggressive; the logic by which it calculates
> how many buffers it needs to process needs to be improved.
> 6. There can be contention around the buffer mapping locks, but we can focus on
> that later.
> 7. Cacheline bouncing around the buffer header spinlocks - is there anything
> we can do to reduce this?
My perspectives here would be
* BufFreelistLock is a huge issue. Finding a next victim block needs to be an O(1) operation, yet it is currently much worse than that. Measuring contention on that lock hides that problem, since having shared buffers lock up for 100ms or more but only occasionally is a huge problem, even if it doesn't occur frequently enough for the averaged contention to show as an issue.
* I'm more interested in reducing response time spikes than in increasing throughput. It's easy to overload a benchmark so we get better throughput numbers, but that's not helpful if we make the system more bursty.
* bgwriter's effectiveness is not guaranteed. We have many clear cases where it is useless. So we should continually answer the question: do we need a bgwriter, and if so, what should it do? The fact that we have one already doesn't mean it should be given things to do. It is a possible option that things may be better if it did nothing. (Not saying that is true, just that we must consider that option each time.)
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, July 03, 2013 12:27 PM Simon Riggs wrote: On 28 June 2013 05:52, Amit Kapila <amit.kapila@huawei.com> wrote: >> As per my understanding Summarization of points raised by you and Andres >> which this patch should address to have a bigger win: >> 1. Bgwriter needs to be improved so that it can help in reducing usage count >> and finding next victim buffer >> (run the clock sweep and add buffers to the free list). >>2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist are >>less. >>3. Split the workdone globallock (Buffreelist) in StrategyGetBuffer >> (a spinlock for the freelist, and an lwlock for the clock sweep). >>4. Separate processes for writing dirty buffers and moving buffers to >>freelist >>5. Bgwriter needs to be more aggressive, logic based on which it calculates >>how many buffers it needs to process needs to be improved. >>6. There can be contention around buffer mapping locks, but we can focus on >>it later >>7. cacheline bouncing around the buffer header spinlocks, is there anything >>we can do to reduce this? >My perspectives here would be > * BufFreelistLock is a huge issue. Finding a next victim block needs to be an O(1) operation, yet it is currently much worse than that. Measuring > contention on that lock hides that problem, since having shared buffers lock up for 100ms or more but only occasionally is a huge problem, even if it > doesn't occur frequently enough for the averaged contention to show as an issue. To optimize finding next victim buffer, I am planning to run the clock sweep in background. Apart from that do you have any idea to make it closer to O(1)? With Regards, Amit Kapila.
On 3 July 2013 12:56, Amit Kapila <amit.kapila@huawei.com> wrote:
>> My perspectives here would be
>> * BufFreelistLock is a huge issue. Finding a next victim block needs to be
>> an O(1) operation, yet it is currently much worse than that. Measuring
>> contention on that lock hides that problem, since having shared buffers
>> lock up for 100ms or more but only occasionally is a huge problem, even if it
>> doesn't occur frequently enough for the averaged contention to show as an
>> issue.
>
> To optimize finding next victim buffer, I am planning to run the clock
> sweep in background. Apart from that do you have any idea to make it closer
> to O(1)?
Yes, I already posted patches to attenuate the search time. Please check back through the last few CFs of 9.3.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, July 03, 2013 6:10 PM Simon Riggs wrote:
On 3 July 2013 12:56, Amit Kapila <amit.kapila@huawei.com> wrote:
>>> My perspectives here would be
>>> * BufFreelistLock is a huge issue. Finding a next victim block needs to be an O(1) operation, yet it is currently much worse than that. Measuring contention on that lock hides that problem, since having shared buffers lock up for 100ms or more but only occasionally is a huge problem, even if it doesn't occur frequently enough for the averaged contention to show as an issue.
>
>> To optimize finding next victim buffer, I am planning to run the clock
>> sweep in background. Apart from that do you have any idea to make it closer
>> to O(1)?
>
> Yes, I already posted patches to attenuate the search time. Please check back through the last few CFs of 9.3.

Okay, I got it. I think you mean 9.2.

Patch: Reduce locking on StrategySyncStart()
https://commitfest.postgresql.org/action/patch_view?id=743
Patch: Reduce freelist locking during DROP TABLE/DROP DATABASE
https://commitfest.postgresql.org/action/patch_view?id=744

I shall pay attention to these patches and the discussion around them during my work on the enhancement of this patch.

With Regards,
Amit Kapila.
On Friday, June 28, 2013 6:20 PM Robert Haas wrote:
> On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
> > Currently it wakes up based on the bgwriter_delay config parameter, which is by
> > default 200ms; so you mean we should think of waking up the bgwriter based on
> > allocations and the number of elements left in the freelist?
>
> I think that's what Andres and I are proposing, yes.
>
> > As per my understanding, the summarization of points raised by you and Andres
> > which this patch should address to have a bigger win:
> >
> > 1. Bgwriter needs to be improved so that it can help in reducing usage counts
> > and finding the next victim buffer
> > (run the clock sweep and add buffers to the free list).
>
> Check.

I think one way to handle this is that while moving buffers to the freelist, if we find that there are not enough buffers (>= high watermark) which have zero usage count, we then move through the buffer list and reduce usage counts. The important question here is how many times we should circulate the buffer list to reduce usage counts. Currently I have kept it proportional to the number of times we failed to move enough buffers to the freelist.

> > 2. SetLatch for bgwriter (wake up bgwriter) when elements in the freelist are
> > few.
>
> Check. The way to do this is to keep a variable in shared memory in
> the same cache line as the spinlock protecting the freelist, and
> update it when you update the free list.

Added a new variable freelistLatch in BufferStrategyControl.

> > 3. Split the global BufFreelistLock taken in StrategyGetBuffer
> > (a spinlock for the freelist, and an lwlock for the clock sweep).
>
> Check.

Added a new variable freelist_lck in BufferStrategyControl which will be used to protect the freelist. BufFreelistLock will still be used to protect the clock sweep part of StrategyGetBuffer.

> > 4. Separate processes for writing dirty buffers and moving buffers to
> > the freelist.
>
> I think this part might be best pushed to a separate patch, although I
> agree we probably need it.
>
> > 5. Bgwriter needs to be more aggressive; the logic by which it calculates
> > how many buffers it needs to process needs to be improved.
>
> This is basically overlapping with points already made. I suspect we
> could just get rid of bgwriter_delay, bgwriter_lru_maxpages, and
> bgwriter_lru_multiplier altogether. The background writer would just
> have a high and a low watermark. When the number of buffers on the
> freelist drops below the low watermark, the allocating backend sets
> the latch and bgwriter wakes up and begins adding buffers to the
> freelist. When the number of buffers on the free list reaches the
> high watermark, the background writer goes back to sleep. Some
> experimentation might be needed to figure out what values are
> appropriate for those watermarks. In theory this could be a
> configuration knob, but I suspect it's better to just make the system
> tune it right automatically.

Currently in the patch I have used a low watermark of 1/6 and a high watermark of 1/3 of NBuffers. The values are hardcoded for now, but I will change them to GUCs or #defines.

As far as I can tell there is no way to find the number of buffers on the freelist, so I have added one more variable to maintain it. Initially I thought I could use the existing variables firstFreeBuffer and lastFreeBuffer to calculate it, but that may not be accurate, as once buffers are moved to the freelist these don't give an exact count.
The main doubt here is: what if, after traversing all buffers, it doesn't find enough buffers to meet the high watermark? Currently I just move out of the loop and try to reduce usage counts instead, as explained in point 1.

> > 6. There can be contention around the buffer mapping locks, but we can
> > focus on that later.
> > 7. Cacheline bouncing around the buffer header spinlocks - is there
> > anything we can do to reduce this?
>
> I think these are points that we should leave for the future.

This is just a WIP patch. I have kept the older code in comments. I need to refine it further and collect performance data. I have prepared a script (perf_buff_mgmt.sh) to collect performance data for different shared_buffers/scalefactor/number_of_clients combinations.

Top-level points which still need to be taken care of:
1. Choose an optimistically-used buffer in StrategyGetBuffer(). Refer to Simon's patch:
https://commitfest.postgresql.org/action/patch_view?id=743
2. Don't bump the usage count every time a buffer is pinned. I got this idea when reading the archives about improvements in this area.

With Regards,
Amit Kapila.
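A hedged, self-contained C sketch of the refill pass described in this WIP: push clean refcount == 0 / usage_count == 0 buffers onto the freelist, and if a circuit of the pool yields too few, make further passes that also age the usage counts. The names, the pass bound, and the omission of buffer-header locking are simplifications for illustration, not the patch's actual code:

    #include <stdbool.h>

    #define NBUFFERS 1024
    #define HIGH_WATERMARK (NBUFFERS / 3)   /* 1/3 of pool, as in the mail */
    #define MAX_PASSES 5                    /* bounded by max usage count */

    typedef struct
    {
        int refcount;
        int usage_count;
        bool on_freelist;
    } BufDesc;

    extern BufDesc buffers[NBUFFERS];
    extern int num_free;
    extern void push_free_buffer(BufDesc *buf);  /* assumed freelist push;
                                                  * increments num_free */

    static void
    refill_freelist(void)
    {
        /* Buffer-header locking is omitted for brevity; a real version
         * must lock each header before touching refcount/usage_count. */
        for (int pass = 0; pass < MAX_PASSES && num_free < HIGH_WATERMARK; pass++)
        {
            for (int i = 0; i < NBUFFERS && num_free < HIGH_WATERMARK; i++)
            {
                BufDesc *buf = &buffers[i];

                if (buf->refcount != 0 || buf->on_freelist)
                    continue;

                if (buf->usage_count == 0)
                    push_free_buffer(buf);      /* immediately reusable */
                else if (pass > 0)
                    buf->usage_count--;         /* later passes age buffers */
            }
        }
    }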
Bump. I’m interested in many of the issues that were discussed in this thread. Was this patch ever wrapped up (I can’t find it in any CF), or did this thread die off?

—Jason
On Sat, Feb 8, 2014 at 7:16 AM, Jason Petersen <jason@citusdata.com> wrote:
> Bump.
>
> I'm interested in many of the issues that were discussed in this thread. Was this patch ever wrapped up (I can't find it in any CF), or did this thread die off?

This patch and variants of it have been discussed multiple times; some of the CF entries are below:

Recent
https://commitfest.postgresql.org/action/patch_view?id=1113
Previous
https://commitfest.postgresql.org/action/patch_view?id=932

The main thing about this idea is to arrive at tests/scenarios where we can show the benefit of this patch. I didn't get time during 9.4 to work on this again, but I might work on it in the next version; if you could help with some scenarios/tests where this patch can show a benefit, that would be really good.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com