Re: Speed up Clog Access by increasing CLOG buffers - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date:
Msg-id: acf46406-cf6d-41fe-4118-4e1c960b4790@2ndquadrant.com
In response to: Re: Speed up Clog Access by increasing CLOG buffers (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: Speed up Clog Access by increasing CLOG buffers
List: pgsql-hackers
On 10/20/2016 09:36 AM, Dilip Kumar wrote:
> On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I agree with these conclusions.  I had a chance to talk with Andres
>> this morning at Postgres Vision and based on that conversation I'd
>> like to suggest a couple of additional tests:
>>
>> 1. Repeat this test on x86.  In particular, I think you should test on
>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>
> I have done my test on cthulhu, basic difference is that In POWER we
> saw ClogControlLock on top at 96 and more client with 300 scale
> factor. But, on cthulhu at 300 scale factor transactionid lock is
> always on top. So I repeated my test with 1000 scale factor as well on
> cthulhu.
>
> All configuration is same as my last test.
>
> Test with 1000 scale factor
> -------------------------------------
>
> Test1: number of clients: 192
>
> Head:
> tps = 21206.108856 (including connections establishing)
> tps = 21206.245441 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_192_ul.txt
>  310489 LWLockNamed   | CLogControlLock
>  296152               |
>   35537 Lock          | transactionid
>   15821 LWLockTranche | buffer_mapping
>   10342 LWLockTranche | buffer_content
>    8427 LWLockTranche | clog
>    3961
>    3165 Lock          | extend
>    2861 Lock          | tuple
>    2781 LWLockNamed   | ProcArrayLock
>    1104 LWLockNamed   | XidGenLock
>     745 LWLockTranche | lock_manager
>     371 LWLockNamed   | CheckpointerCommLock
>      70 LWLockTranche | wal_insert
>       5 BufferPin     | BufferPin
>       3 LWLockTranche | proc
>
> Patch:
> tps = 28725.038933 (including connections establishing)
> tps = 28725.367102 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_192_ul.txt
>  540061               |
>   57810 LWLockNamed   | CLogControlLock
>   36264 LWLockTranche | buffer_mapping
>   29976 Lock          | transactionid
>    4770 Lock          | extend
>    4735 LWLockTranche | clog
>    4479 LWLockNamed   | ProcArrayLock
>    4006
>    3955 LWLockTranche | buffer_content
>    2505 LWLockTranche | lock_manager
>    2179 Lock          | tuple
>    1977 LWLockNamed   | XidGenLock
>     905 LWLockNamed   | CheckpointerCommLock
>     222 LWLockTranche | wal_insert
>       8 LWLockTranche | proc
>
> Test2: number of clients: 96
>
> Head:
> tps = 25447.861572 (including connections establishing)
> tps = 25448.012739 (excluding connections establishing)
>  261611               |
>   69604 LWLockNamed   | CLogControlLock
>    6119 Lock          | transactionid
>    4008
>    2874 LWLockTranche | buffer_mapping
>    2578 LWLockTranche | buffer_content
>    2355 LWLockNamed   | ProcArrayLock
>    1245 Lock          | extend
>    1168 LWLockTranche | clog
>     232 Lock          | tuple
>     217 LWLockNamed   | CheckpointerCommLock
>     160 LWLockNamed   | XidGenLock
>     158 LWLockTranche | lock_manager
>      78 LWLockTranche | wal_insert
>       5 BufferPin     | BufferPin
>
> Patch:
> tps = 32708.368938 (including connections establishing)
> tps = 32708.765989 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_96_ul.txt
>  326601               |
>    7471 LWLockNamed   | CLogControlLock
>    5387 Lock          | transactionid
>    4018
>    3331 LWLockTranche | buffer_mapping
>    3144 LWLockNamed   | ProcArrayLock
>    1372 Lock          | extend
>     722 LWLockTranche | buffer_content
>     393 LWLockNamed   | XidGenLock
>     237 LWLockTranche | lock_manager
>     234 Lock          | tuple
>     194 LWLockTranche | clog
>      96 Lock          | relation
>      88 LWLockTranche | wal_insert
>      34 LWLockNamed   | CheckpointerCommLock
>
> Test3: number of clients: 64
>
> Head:
> tps = 28264.194438 (including connections establishing)
> tps = 28264.336270 (excluding connections establishing)
>  218264               |
>   10314 LWLockNamed   | CLogControlLock
>    4019
>    2067 Lock          | transactionid
>    1950 LWLockTranche | buffer_mapping
>    1879 LWLockNamed   | ProcArrayLock
>     592 Lock          | extend
>     565 LWLockTranche | buffer_content
>     222 LWLockNamed   | XidGenLock
>     143 LWLockTranche | clog
>     131 LWLockNamed   | CheckpointerCommLock
>      63 LWLockTranche | lock_manager
>      52 Lock          | tuple
>      35 LWLockTranche | wal_insert
>
> Patch:
> tps = 27906.376194 (including connections establishing)
> tps = 27906.531392 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_64_ul.txt
>  228108               |
>    4039
>    2294 Lock          | transactionid
>    2116 LWLockTranche | buffer_mapping
>    1757 LWLockNamed   | ProcArrayLock
>    1553 LWLockNamed   | CLogControlLock
>     800 Lock          | extend
>     403 LWLockTranche | buffer_content
>      92 LWLockNamed   | XidGenLock
>      74 LWLockTranche | lock_manager
>      42 Lock          | tuple
>      35 LWLockTranche | wal_insert
>      34 LWLockTranche | clog
>      14 LWLockNamed   | CheckpointerCommLock
>
> Test4: number of clients: 32
>
> Head:
> tps = 27587.999912 (including connections establishing)
> tps = 27588.119611 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
>  117762               |
>    4031
>     614 LWLockNamed   | ProcArrayLock
>     379 LWLockNamed   | CLogControlLock
>     344 Lock          | transactionid
>     183 Lock          | extend
>     102 LWLockTranche | buffer_mapping
>      71 LWLockTranche | buffer_content
>      39 LWLockNamed   | XidGenLock
>      25 LWLockTranche | lock_manager
>       3 LWLockTranche | wal_insert
>       3 LWLockTranche | clog
>       2 LWLockNamed   | CheckpointerCommLock
>       2 Lock          | tuple
>
> Patch:
> tps = 28291.428848 (including connections establishing)
> tps = 28291.586435 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
>  116596               |
>    4041
>     757 LWLockNamed   | ProcArrayLock
>     407 LWLockNamed   | CLogControlLock
>     358 Lock          | transactionid
>     183 Lock          | extend
>     142 LWLockTranche | buffer_mapping
>      77 LWLockTranche | buffer_content
>      68 LWLockNamed   | XidGenLock
>      35 LWLockTranche | lock_manager
>      15 LWLockTranche | wal_insert
>       7 LWLockTranche | clog
>       7 Lock          | tuple
>       4 LWLockNamed   | CheckpointerCommLock
>       1 Tuples only is on.
>
> Summary:
> - At 96 and more clients count we can see ClogControlLock at the top.
> - With patch contention on ClogControlLock is reduced significantly.
> I think these behaviours are same as we saw on power.
>
> With 300 scale factor:
> - Contention on ClogControlLock is significant only at 192 client
> (still transaction id lock is on top), Which is completely removed
> with group lock patch.
>
> For 300 scale factor, I am posting data only at 192 client count (If
> anyone interested in other data I can post).

In the results you've posted on 10/12, you've mentioned a regression with 32
clients, where you got 52k tps on master but only 48k tps with the patch (so
~10% difference). I have no idea what scale was used for those tests, and I
see no such regression in the current results (but you only report results
for some of the client counts).

Also, which of the proposed patches have you been testing?

Can you collect and share a more complete set of data, perhaps based on the
scripts I use to do tests on the large machine with 36/72 cores, available at

    https://bitbucket.org/tvondra/hp05-results

?
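For reference, wait-event breakdowns like the ones quoted above are typically
collected by sampling pg_stat_activity during the benchmark and then
aggregating the samples. A minimal sketch (an assumption about the approach,
not necessarily the exact script used in this thread; it relies on the
wait_event_type / wait_event columns added in 9.6, and connection options are
omitted):

    # sample the wait columns once a second for the duration of the run
    for i in $(seq 1 300); do
        psql -t -c "SELECT wait_event_type, wait_event FROM pg_stat_activity
                    WHERE pid <> pg_backend_pid()" >> wait_events.txt
        sleep 1
    done
    # aggregate the samples into per-event counts like the listings above
    sort wait_events.txt | uniq -c | sort -rn | head -n 20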
I've taken some time to build simple web-based reports from the results
collected so far (also included in the git repository), and pushed them here:

    http://tvondra.bitbucket.org

For each of the completed runs, there's a report comparing tps for different
client counts with master and the three patches (average tps, median and
stddev), and it's possible to download a more thorough text report with wait
event stats, comparison of individual runs etc.

If you want to cooperate on this, I'm available - i.e. I can help you get the
tooling running, customize it etc.

Regarding the results collected on the "big machine" so far, I do have a few
observations:

pgbench / scale 300 (fits into 16GB shared buffers)
---------------------------------------------------

* in general, those results seem fine

* the results generally fall into 3 categories (I'll show results for
  "pgbench -N" but regular pgbench behaves similarly):

  (a) logged, sync_commit=on - no impact

      http://tvondra.bitbucket.org/#pgbench-300-logged-sync-skip

  (b) logged, sync_commit=off - improvement

      http://tvondra.bitbucket.org/#pgbench-300-logged-async-skip

      The throughput gets improved by ~20% with 72 clients, and then it
      levels off (but does not drop, unlike on master). With high client
      counts the difference is up to 300%, but people who care about
      throughput won't run with such client counts anyway.

      And not only does this improve throughput, it also significantly
      reduces variability of the performance (i.e. measure throughput each
      second and compute the STDDEV of that). You can imagine this as a much
      "smoother" chart of tps over time.

  (c) unlogged, sync_commit=* - improvement

      http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

      This is actually quite similar to (b).

dilip / scale 300 (fits into 16GB shared buffers)
-------------------------------------------------

* those results seem less OK

* I haven't found any significant regressions (in the sense of a significant
  performance drop compared to master), but the behavior in some cases seems
  fairly strange (and it's repeatable)

* consider for example these results:

  http://tvondra.bitbucket.org/#dilip-300-unlogged-async
  http://tvondra.bitbucket.org/#dilip-300-logged-async

* the saw-like pattern is rather suspicious, and I don't think I've seen
  anything like that before - I guess there's some feedback loop and we'd
  better find it before committing any of the patches, because this is
  something I don't want to see on any production machine (and I bet neither
  do you)

* after looking into the wait event details in the full text report at

  http://tvondra.bitbucket.org/by-test/dilip-300-unlogged-async.txt

  (section "wait events for dilip-300-unlogged-async (runs combined)") I see
  that for pg-9.6-group-update, the statistics for 72, 108 and 144 clients
  (low - high - low) look like this:

   clients | wait_event_type |   wait_event    | wait_count | wait_pct
  ---------+-----------------+-----------------+------------+----------
        72 |                 |                 |     374845 |    62.87
        72 | Client          | ClientRead      |     136320 |    22.86
        72 | LWLockNamed     | CLogControlLock |      52804 |     8.86
        72 | LWLockTranche   | buffer_content  |      15337 |     2.57
        72 | LWLockNamed     | XidGenLock      |       7352 |     1.23
        72 | LWLockNamed     | ProcArrayLock   |       6630 |     1.11
       108 |                 |                 |     407179 |    46.01
       108 | LWLockNamed     | CLogControlLock |     300452 |    33.95
       108 | LWLockTranche   | buffer_content  |      87597 |     9.90
       108 | Client          | ClientRead      |      80901 |     9.14
       108 | LWLockNamed     | ProcArrayLock   |       3290 |     0.37
       144 |                 |                 |     623057 |    53.44
       144 | LWLockNamed     | CLogControlLock |     175072 |    15.02
       144 | Client          | ClientRead      |     163451 |    14.02
       144 | LWLockTranche   | buffer_content  |     147963 |    12.69
       144 | LWLockNamed     | XidGenLock      |      38361 |     3.29
       144 | Lock            | transactionid   |       8821 |     0.76

  That is, there's a sudden jump in CLogControlLock from 22% to 33% and then
  back to 15% (and for 180 clients it jumps back to ~35%). That's pretty
  strange, and all the patches behave exactly the same.

scale 3000 (45GB), shared_buffers=16GB
---------------------------------------

For the small scale, the whole data set fits into 16GB shared buffers, so
there were pretty much no writes except for WAL and CLOG. For scale 3000
that's no longer true - the backends will compete for buffers and will
constantly write dirty buffers to the page cache.

I hadn't realized this initially, and the kernel was using the default
vm.dirty_* limits, i.e. 10% and 20%. As the machine has 3TB of RAM, this
resulted in rather excessive thresholds (or "insane" if you want), so the
kernel regularly accumulated up to ~15GB of dirty data and then wrote it out
in a very short period of time. Even though the machine has fairly powerful
storage (4GB write cache on the controller, 10 x 12Gbps SAS SSDs), this led
to pretty bad latency spikes / drops in throughput.

I've only done two runs with this configuration before realizing what was
happening; the results are illustrated here:

* http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-high-dirty-bytes
* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-high-dirty-bytes

I'm not sure how important those results are (if throughput and smooth
behavior matter, tuning the kernel thresholds is a must), but what I find
interesting is that while the patches manage to improve throughput by 10-20%,
they also (quite significantly) increase the variability of the results
(jitter in the tps over time). It's particularly visible in the pgbench
results. I'm not sure that's a good tradeoff.

After fixing the kernel page cache thresholds (by setting background_bytes to
256MB to perform smooth write-out), the effect differs depending on the
workload:

(a) dilip

    http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync

    - eliminates any impact of all the patches

(b) pgbench (-N)

    http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip

    - By far the most severe regression observed during the testing. With 36
      clients the throughput drops by ~40%, which I think is pretty bad. Also
      the results are much more variable with the patches (compared to
      master).

scale 3000 (45GB), shared_buffers=64GB
---------------------------------------

I've also done some tests with increased shared buffers, so that even the
large data set fits into them. Again, the results slightly depend on the
workload:

(a) dilip

    * http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-64
    * http://tvondra.bitbucket.org/#dilip-3000-unlogged-async-64

    Pretty much no impact on throughput or variability. Unlike on the small
    data set, the patches don't even eliminate the performance drop above 72
    clients - the performance closely matches master.

(b) pgbench

    * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-64
    * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-noskip-64

    There's a small benefit (~20% at the same client count), and the
    performance drop only happens after 72 clients. The patches also
    significantly increase the variability of the results, particularly for
    large client counts.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
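[Editor's note: the "smooth write-out" setup described above maps to the
Linux vm.dirty_* sysctls. A minimal sketch using the 256MB value from the
message; the hard-limit value below is an illustrative assumption, not
something stated in the thread:]

    # Replace the ratio-based thresholds (10%/20%, far too large with 3TB of
    # RAM) with byte-based ones, so background write-out starts at ~256MB of
    # dirty data.
    sysctl -w vm.dirty_background_bytes=268435456   # 256MB, as mentioned above

    # Optionally also cap the hard limit in bytes; 1GB here is illustrative.
    sysctl -w vm.dirty_bytes=1073741824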