Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers
From | Yura Sokolov
---|---
Subject | Re: BufferAlloc: don't take two simultaneous locks
Date |
Msg-id | 4b4c81ecddd56fa974124c4e6da2f54aed3ddbf2.camel@postgrespro.ru
In response to | Re: BufferAlloc: don't take two simultaneous locks (Yura Sokolov <y.sokolov@postgrespro.ru>)
Responses | Re: BufferAlloc: don't take two simultaneous locks; Re: BufferAlloc: don't take two simultaneous locks
List | pgsql-hackers
Good day, Kyotaro-san. Good day, hackers.

On Sun, 20/03/2022 at 12:38 +0300, Yura Sokolov wrote:
> On Thu, 17/03/2022 at 12:02 +0900, Kyotaro Horiguchi wrote:
> > At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> > > On Wed, 16/03/2022 at 12:07 +0900, Kyotaro Horiguchi wrote:
> > > > At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> > > > In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> > > > the freelist_idx of the new key. v8 uses that of the old key (at the
> > > > time of HASH_REUSE). So in the "REUSE->ENTER (elem exists and
> > > > returns the stashed)" case the stashed element is returned to its
> > > > original partition. But it is not what I mentioned.
> > > >
> > > > On the other hand, once the stashed element is reused by HASH_ENTER,
> > > > it gives the same resulting state as the HASH_REMOVE->HASH_ENTER
> > > > (borrow from old partition) case. I suspect that the frequent freelist
> > > > starvation comes from the latter case.
> > >
> > > Doubtfully. By probability theory, a single partition will hardly
> > > be overflowed too much. Hence the freelists.
> >
> > Yeah. I think so generally.
> >
> > > But! With 128kB shared buffers there are just 32 buffers. 32 entries for
> > > 32 freelist partitions - some freelist partition will certainly
> > > have 0 entries even if all entries are in freelists.
> >
> > Anyway, it's an extreme condition and the starvation happens only at a
> > negligible ratio.
> >
> > > > RETURNED: 2
> > > > ALLOCED: 0
> > > > BORROWED: 435
> > > > REUSED: 495444
> > > > ASSIGNED: 495467 (-23)
> > > >
> > > > Now "BORROWED" happens 0.8% of REUSED
> > >
> > > 0.08% actually :)
> >
> > Mmm. Doesn't matter :p
> >
> > > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > ...
> > > > > > > Strange thing: both master and patched version have higher
> > > > > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > > > > than in the first October version [1]. But lower tps at higher
> > > > > > > connection numbers (>= 191 clients).
> > > > > > > I'll try to bisect this unfortunate change on master.
> > ...
> > > I've checked. Looks like something has changed on the server, since
> > > the old master commit now behaves the same as the new one (and
> > > differently from how it behaved in October).
> > > I remember maintenance downtime of the server in November/December.
> > > Probably the kernel was upgraded or some system settings were changed.
> >
> > One thing I have a little concern about is that the numbers show 1-2%
> > degradation steadily for connection numbers < 17.
> >
> > I think there are two possible causes of the degradation.
> >
> > 1. Additional branch by consolidating HASH_ASSIGN into HASH_ENTER.
> > This might cause degradation for memory-contended use.
> >
> > 2. The nallocs operation might cause degradation on non-shared dynahashes?
> > I believe it doesn't, but I'm not sure.
> >
> > On a simple benchmark with pgbench on a laptop, dynahash
> > allocation (including shared and non-shared) happened about 50
> > times per second with 10 processes and 200 with 100 processes.
> >
> > > > I don't think nalloced needs to be the same width as long. For the
> > > > platforms with 32-bit long, the possible degradation, if any, from a
> > > > 64-bit atomic there doesn't matter. So don't we always define the
> > > > atomic as 64-bit and use the pg_atomic_* functions directly?
> > >
> > > Some 32-bit platforms have no native 64-bit atomics. Then they are
> > > emulated with locks.
> > >
> > > Well, and for a 32-bit platform long is just enough. Why spend another
> > > 4 bytes per dynahash?
> >
> > I don't think the additional bytes matter, but emulated atomic
> > operations can matter. However, I'm not sure which platforms use the
> > fallback implementations. (x86 seems to have __sync_fetch_and_add()
> > since P4.)
> >
> > My opinion in the previous mail was that if that level of degradation
> > caused by emulated atomic operations matters, we shouldn't use atomics
> > there at all, since atomic operations on modern platforms are not
> > free either.
> >
> > In relation to 2 above, if we observe that the degradation disappears
> > by (tentatively) using non-atomic operations for nalloced, we should go
> > back to the previous per-freelist nalloced.
>
> Here is a version with nalloced being a union of the appropriate atomic
> and long.
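To be explicit about what that union means, here is a minimal sketch (illustrative only; the type and function names are mine, not the exact code in the attached patch): a shared, partitioned dynahash counts allocations through a 64-bit atomic, while a non-shared one keeps a plain long and pays no atomic overhead.

    /* Minimal sketch, not the attached patch. */
    #include "postgres.h"
    #include "port/atomics.h"

    typedef union NAllocedCount
    {
        long             count;     /* non-partitioned (backend-local) hash tables */
        pg_atomic_uint64 atomic;    /* shared, partitioned hash tables */
    } NAllocedCount;

    static inline void
    nalloced_add(NAllocedCount *nalloced, bool partitioned, int64 delta)
    {
        if (partitioned)
            pg_atomic_fetch_add_u64(&nalloced->atomic, delta);
        else
            nalloced->count += (long) delta;
    }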
Ok, I got access to a stronger server, did the benchmark, found weird
things, and so here is a new version :-)

First, I found that if the table size is strictly limited to NBuffers
and fixed, then under high concurrency get_hash_entry may not find a
free entry even though one must be there. It seems that while one
process scans the free lists, other concurrent processes "move entries
around": one concurrent process fetches an entry from one free list,
another process puts a new entry into a different freelist, and the
unfortunate process misses it since it tests each freelist only once.

Second, I confirm there is a problem with freelist spreading. If I keep
the old entry's freelist_idx, then one freelist becomes crowded. If I
use the new entry's freelist_idx, then one freelist is emptied
constantly.

Third, I found that increased concurrency can harm. When a popular
block is evicted for some reason, a thundering herd effect occurs: many
backends want to read the same block, they evict many other buffers,
but only one is inserted. The others go to the freelist. The evicted
buffers by themselves reduce the cache hit ratio and provoke more work.
The old version resists this effect by not removing the old buffer
before the new entry is successfully inserted.

To fix these issues I made the following changes:

# Concurrency

First, I limit concurrency by introducing another lwlock tranche -
BufferEvict. It is 8 times larger than the BufferMapping tranche
(1024 vs 128 locks).

If a backend doesn't find a buffer in the buffer table and wants to
introduce it, it first calls

    LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)

If the lock was acquired, it proceeds to the eviction-and-replacement
process. Otherwise, it waits for the lock to be released and repeats
the search (see the sketch at the end of this section).

This greatly improves performance for > 400 clients in pgbench.

I tried another variant as well:
- first insert an entry with a dummy buffer index into the buffer table;
- if such an entry is already there, wait for it to be filled;
- otherwise find a victim buffer and replace the dummy index with the
  new one.
Waiting was done with a shared lock on the EvictPartitionLock as well.
This variant performed about the same.

Logically I like that variant more, but there is one gotcha:
FlushBuffer could fail with elog(ERROR). Therefore the entry with the
dummy index has to be removed reliably. And after all, I still need to
hold the EvictPartitionLock to notify waiters.

I've tried to use a ConditionVariable instead, but its performance was
much worse.
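Here is the sketch mentioned above of the lookup/evict loop built on
LWLockAcquireOrWait (a sketch only: PinExistingBuffer,
BufEvictPartitionLock and StartEvictionAndReplace are hypothetical
stand-ins, and the usual buffer-mapping partition locking is omitted
for brevity):

    #include "postgres.h"
    #include "storage/buf_internals.h"  /* BufferTag, BufTableLookup() */
    #include "storage/lwlock.h"

    /* Hypothetical helpers standing in for code not shown here. */
    extern Buffer PinExistingBuffer(int buf_id);
    extern LWLock *BufEvictPartitionLock(uint32 hashcode);
    extern Buffer StartEvictionAndReplace(BufferTag *tag, uint32 hashcode);

    static Buffer
    LookupOrEvict(BufferTag *newTag, uint32 newHash)
    {
        for (;;)
        {
            int     buf_id = BufTableLookup(newTag, newHash);

            if (buf_id >= 0)
                return PinExistingBuffer(buf_id);   /* normal cache-hit path */

            /* Miss: only one backend per eviction partition evicts at a time. */
            if (LWLockAcquireOrWait(BufEvictPartitionLock(newHash), LW_EXCLUSIVE))
                return StartEvictionAndReplace(newTag, newHash);

            /* The winner released the lock; the buffer may be there now - retry. */
        }
    }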
# Dynahash capacity and freelists

I returned the buffer table initialization back:
- removed the FIXED_SIZE restriction introduced in the previous version;
- returned to `NBuffers + NUM_BUFFER_PARTITIONS` entries.

I really think there should be more spare items, since entry_alloc is
almost always called at least once (on 128MB shared_buffers). But let's
keep it as is for now.

`get_hash_entry` was changed to probe NUM_FREELISTS/4 (== 8) freelists
before falling back to `entry_alloc`, and the probing was changed from
linear to quadratic. This greatly reduces the number of calls to
`entry_alloc`, so more shared memory is left intact. And I didn't
notice a large performance hit from it. Probably there is some, but I
think it is an adequate trade-off.

`free_reused_entry` now returns the entry to a random freelist. This
flattens the spread of free entries, although it is not enough without
the other changes (thundering herd mitigation and probing more lists in
get_hash_entry).
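Roughly, the new probing looks like this (a sketch, not the patch
itself: pop_freelist and alloc_new_entries are made-up stand-ins for
dynahash internals, and the exact quadratic step is illustrative):

    #include "postgres.h"
    #include "utils/hsearch.h"

    #define NUM_FREELISTS 32            /* as in dynahash.c */

    /* Hypothetical stand-ins for dynahash internals. */
    extern HASHELEMENT *pop_freelist(HTAB *hashp, int freelist_idx);
    extern HASHELEMENT *alloc_new_entries(HTAB *hashp, int freelist_idx); /* wraps entry_alloc */

    static HASHELEMENT *
    get_hash_entry_sketch(HTAB *hashp, int freelist_idx)
    {
        int         attempt;

        /* Probe NUM_FREELISTS/4 freelists, stepping quadratically from home. */
        for (attempt = 0; attempt < NUM_FREELISTS / 4; attempt++)
        {
            /* offsets 0, 1, 3, 6, 10, ... from the home freelist */
            int         idx = (freelist_idx + attempt * (attempt + 1) / 2) % NUM_FREELISTS;
            HASHELEMENT *entry = pop_freelist(hashp, idx);

            if (entry != NULL)
                return entry;
        }

        /* Nothing in the probed freelists: allocate a fresh batch of entries. */
        return alloc_new_entries(hashp, freelist_idx);
    }

free_reused_entry does the reverse in the same spirit: it pushes the
returned element onto a pseudo-randomly chosen freelist instead of a
fixed one.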
# Benchmarks

Benchmarked on a two-socket Xeon(R) Gold 5220 CPU @ 2.20GHz:
18 cores per socket + hyper-threading - up to 72 virtual cores total,
turbo-boost disabled, Linux 5.10.103-1 Debian.

pgbench scale 100, simple_select + simple select with 3 keys (sql file
attached), shared_buffers 128MB & 1GB, huge_pages=on.

1 socket

  conns |     master |  patch-v11 |  master 1G | patch-v11 1G
--------+------------+------------+------------+------------
      1 |      27882 |      27738 |      32735 |      32439
      2 |      54082 |      54336 |      64387 |      63846
      3 |      80724 |      81079 |      96387 |      94439
      5 |     134404 |     133429 |     160085 |     157399
      7 |     185977 |     184502 |     219916 |     217142
     17 |     335345 |     338214 |     393112 |     388796
     27 |     393686 |     394948 |     447945 |     444915
     53 |     572234 |     577092 |     678884 |     676493
     83 |     558875 |     561689 |     669212 |     655697
    107 |     553054 |     551896 |     654550 |     646010
    139 |     541263 |     538354 |     641937 |     633840
    163 |     532932 |     531829 |     635127 |     627600
    191 |     524647 |     524442 |     626228 |     617347
    211 |     521624 |     522197 |     629740 |     613143
    239 |     509448 |     554894 |     652353 |     652972
    271 |     468190 |     557467 |     647403 |     661348
    307 |     454139 |     558694 |     642229 |     657649
    353 |     446853 |     554301 |     635991 |     654571
    397 |     441909 |     549822 |     625194 |     647973

1 socket 3 keys

  conns |     master |  patch-v11 |  master 1G | patch-v11 1G
--------+------------+------------+------------+------------
      1 |      16677 |      16477 |      22219 |      22030
      2 |      32056 |      31874 |      43298 |      43153
      3 |      48091 |      47766 |      64877 |      64600
      5 |      78999 |      78609 |     105433 |     106101
      7 |     108122 |     107529 |     148713 |     145343
     17 |     205656 |     209010 |     272676 |     271449
     27 |     252015 |     254000 |     323983 |     323499
     53 |     317928 |     334493 |     446740 |     449641
     83 |     299234 |     327738 |     437035 |     443113
    107 |     290089 |     322025 |     430535 |     431530
    139 |     277294 |     314384 |     422076 |     423606
    163 |     269029 |     310114 |     416229 |     417412
    191 |     257315 |     306530 |     408487 |     416170
    211 |     249743 |     304278 |     404766 |     416393
    239 |     243333 |     310974 |     397139 |     428167
    271 |     236356 |     309215 |     389972 |     427498
    307 |     229094 |     307519 |     382444 |     425891
    353 |     224385 |     305366 |     375020 |     423284
    397 |     218549 |     302577 |     364373 |     420846

2 sockets

  conns |     master |  patch-v11 |  master 1G | patch-v11 1G
--------+------------+------------+------------+------------
      1 |      27287 |      27631 |      32943 |      32493
      2 |      52397 |      54011 |      64572 |      63596
      3 |      76157 |      80473 |      93363 |      93528
      5 |     127075 |     134310 |     153176 |     149984
      7 |     177100 |     176939 |     216356 |     211599
     17 |     379047 |     383179 |     464249 |     470351
     27 |     545219 |     546706 |     664779 |     662488
     53 |     728142 |     728123 |     857454 |     869407
     83 |     918276 |     957722 |    1215252 |    1203443
    107 |     884112 |     971797 |    1206930 |    1234606
    139 |     822564 |     970920 |    1167518 |    1233230
    163 |     788287 |     968248 |    1130021 |    1229250
    191 |     772406 |     959344 |    1097842 |    1218541
    211 |     756085 |     955563 |    1077747 |    1209489
    239 |     732926 |     948855 |    1050096 |    1200878
    271 |     692999 |     941722 |    1017489 |    1194012
    307 |     668241 |     920478 |     994420 |    1179507
    353 |     642478 |     908645 |     968648 |    1174265
    397 |     617673 |     893568 |     950736 |    1173411

2 sockets 3 keys

  conns |     master |  patch-v11 |  master 1G | patch-v11 1G
--------+------------+------------+------------+------------
      1 |      16722 |      16393 |      20340 |      21813
      2 |      32057 |      32009 |      39993 |      42959
      3 |      46202 |      47678 |      59216 |      64374
      5 |      78882 |      72002 |      98054 |     103731
      7 |     103398 |      99538 |     135098 |     135828
     17 |     205863 |     217781 |     293958 |     299690
     27 |     283526 |     290539 |     414968 |     411219
     53 |     336717 |     356130 |     460596 |     474563
     83 |     307310 |     342125 |     419941 |     469989
    107 |     294059 |     333494 |     405706 |     469593
    139 |     278453 |     328031 |     390984 |     470553
    163 |     270833 |     326457 |     384747 |     470977
    191 |     259591 |     322590 |     376582 |     470335
    211 |     263584 |     321263 |     375969 |     469443
    239 |     257135 |     316959 |     370108 |     470904
    271 |     251107 |     315393 |     365794 |     469517
    307 |     246605 |     311585 |     360742 |     467566
    353 |     236899 |     308581 |     353464 |     466936
    397 |     249036 |     305042 |     344673 |     466842

I skipped v10 since I used it internally for the variant "insert entry
with dummy index, then search for victim".

------

regards

Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
Attachment