Thread: Lockless queue of waiters in LWLock
Hi, hackers!
When we take an LWLock, we already use an atomic CAS operation to modify the lock state even in the presence of concurrent lock-takers. But if we cannot take the lock immediately, we need to put the waiter on a wait list, and currently this operation is neither atomic nor simple:
- Parts of it are done under LWLockWaitListLock, which can require a spin delay to acquire.
- We also need a two-step procedure to immediately dequeue if, after adding the current process to the wait queue, it turns out that we don't need to (and cannot!) sleep because the lock is already free.
If the lock state contains references to the queue head and tail, we can implement a lockless queue of waiters for the LWLock. Adding new items at the queue head or tail can be done with a single CAS operation (adding at the tail also requires subsequently setting the link from the previous tail). Given that there can be only one lock releaser, which wakes up the waiters in the queue, we can handle all the concurrency issues with reasonable complexity.
Removing the queue spinlock bit and the corresponding contention should yield a performance gain in highly concurrent LWLock usage scenarios.
Currently, the maximum size of the procarray is 2^18 (as stated in buf_internals.h), so in a 64-bit LWLock state variable we can store the procarray indexes for both the head and the tail of the queue, all the existing machinery, and even some spare bits for future flags.
Thus we almost entirely avoid spinlocks, replacing them with repeated attempts to change the lock state whenever a CAS operation fails due to concurrent modification.
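To make the scheme concrete, here is a minimal self-contained sketch (C11 atomics; an illustration only, not the patch's actual code) of the single-CAS tail push with the head and tail indexes packed into one 64-bit word. The 18-bit field layout, the "all-ones means empty" marker, and the waiter_by_index() helper are assumptions made for the sketch:
---------------------------------------
#include <stdatomic.h>
#include <stdint.h>

#define Q_BITS  18
#define Q_MASK  ((uint64_t) ((1u << Q_BITS) - 1))       /* all-ones: "no waiter" */
#define GET_HEAD(s)    ((s) & Q_MASK)
#define GET_TAIL(s)    (((s) >> Q_BITS) & Q_MASK)
#define SET_HEAD(s, v) (((s) & ~Q_MASK) | (uint64_t) (v))
#define SET_TAIL(s, v) (((s) & ~(Q_MASK << Q_BITS)) | ((uint64_t) (v) << Q_BITS))

typedef struct Waiter { _Atomic(struct Waiter *) next; } Waiter;

extern Waiter *waiter_by_index(uint64_t idx);   /* assumed: procarray index -> proc */

void
queue_push_tail(_Atomic uint64_t *state, uint64_t myidx, Waiter *me)
{
    uint64_t old = atomic_load(state);   /* caller has set me->next = NULL */

    for (;;)
    {
        uint64_t oldtail = GET_TAIL(old);
        uint64_t newval  = SET_TAIL(old, myidx);

        if (oldtail == Q_MASK)                  /* queue was empty: we become head too */
            newval = SET_HEAD(newval, myidx);

        /* a single CAS publishes us as the new tail */
        if (atomic_compare_exchange_weak(state, &old, newval))
        {
            if (oldtail != Q_MASK)              /* then link the previous tail to us */
                atomic_store(&waiter_by_index(oldtail)->next, me);
            return;
        }
        /* CAS failed: 'old' now holds the fresh state; retry */
    }
}
---------------------------------------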
The attached patch implements the described approach. It is based on Alexander Korotkov's work in the OrioleDB engine; I adapted those ideas further with his guidance.
We did some preliminary performance checks and saw that the patch significantly increases performance in concurrent insert-only and txid_current pgbench tests at ~1000 connections. The test scripts and results are attached. I ran the tests on a 32-vCPU x86-64 AWS machine, using tmpfs as storage for the database to avoid random delays related to disk IO.
Recently, though, Andres proposed a different patch, which merely avoids the O(N) removal from the queue of waiters, yet improves performance even more [1]. The results of comparing the master branch, the lockless queue (current) patch, and Andres' patch are below. Please take into account that the horizontal axis uses a log scale.
---------------------------------------
cat insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 100000 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
pgbench -d postgres -i -s 1 --unlogged-tables
echo -e "max_connections=2500\nmax_wal_senders=0\nwal_level=minimal\nmax_wal_size = 10GB\nshared_buffers = 20000MB\nautovacuum = off\nfsync = off\nfull_page_writes = off\nmax_worker_processes = 1024\nmax_parallel_workers = 1024\n" > ./pgdata$v/postgresql.auto.conf
psql -dpostgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;"
for conns in 1 2 3 4 5 6 7 8 9 10 11 12 13 15 16 18 20 22 24 27 30 33 36 39 43 47 51 56 62 68 75 82 91 100 110 120 130 150 160 180 200 220 240 270 300 330 360 390 430 470 510 560 620 680 750 820 910 1000 1100 1200 1300 1500 1600 1800 2000
do
psql -dpostgres -c "truncate pgbench_accounts;"
psql -dpostgres -c "vacuum full pgbench_accounts;"
pgbench postgres -f insert.sql -s 1 -P20 -M prepared -T 10 -j 5 -c $conns --no-vacuum | grep tps
done
--------------------------------------------------
cat txid.sql
select txid_current();
for conns in 1 2 3 4 5 6 7 8 9 10 11 12 13 15 16 18 20 22 24 27 30 33 36 39 43 47 51 56 62 68 75 82 91 100 110 120 130 150 160 180 200 220 240 270 300 330 360 390 430 470 510 560 620 680 750 820 910 1000 1100 1200 1300 1500 1600 1800 2000
do
pgbench postgres -f txid.sql -s 1 -P20 -M prepared -T 10 -j 5 -c $conns --no-vacuum | grep tps
done
I cannot understand why the performance of the lockless queue patch has a minor regression in the region of around 20 connections, even when compared to the current master branch.
Are there some scenarios where the lockless queue approach is superior? I'd expect there should be, at least in theory. Probably there is a way to improve the attached patch further to achieve that superiority.
Best regards,
Pavel Borisov,
Supabase.
[1] https://www.postgresql.org/message-id/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Attachment
It seems the test result pictures failed to attach to the original email; I'm adding them here.

Pavel Borisov
Attachment
Hi,

Thanks for working on this - I think it's something we need to improve.

On 2022-10-31 14:38:23 +0400, Pavel Borisov wrote:
> If the lock state contains references to the queue head and tail, we can
> implement a lockless queue of waiters for the LWLock. Adding new items at
> the queue head or tail can be done with a single CAS operation (adding at
> the tail also requires subsequently setting the link from the previous
> tail). Given that there can be only one lock releaser, which wakes up the
> waiters in the queue, we can handle all the concurrency issues with
> reasonable complexity.

Right now wakeups happen *after* the lock is released. I suspect that is at least part of the reason for the regression you're seeing. It also looks like you're going to need a substantially higher number of atomic operations - right now the queue processing needs O(1) atomic ops, but your approach ends up with O(waiters) atomic ops.

I suspect it might be worth going halfway, i.e. put the list head/tail in the atomic variable, but process the list with a lock held, after the lock is released.

> Removing the queue spinlock bit and the corresponding contention should
> yield a performance gain in highly concurrent LWLock usage scenarios.
> Currently, the maximum size of the procarray is 2^18 (as stated in
> buf_internals.h), so in a 64-bit LWLock state variable we can store the
> procarray indexes for both the head and the tail of the queue, all the
> existing machinery, and even some spare bits for future flags.

Another advantage: It shrinks the LWLock struct, which makes it cheaper to make locks more granular...

> I cannot understand why the performance of the lockless queue patch has a
> minor regression in the region of around 20 connections, even when
> compared to the current master branch.

I suspect that's the point where the additional atomic operations hurt the most, while not yet providing a substantial gain.

Greetings,

Andres Freund
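(One possible reading of that halfway approach, continuing the hypothetical layout from the sketch upthread; again only a sketch, since the bare exchange assumes no other state bits need preserving, which a real LWLock release certainly would:)
---------------------------------------
#include <stdatomic.h>
#include <stdint.h>

/* Q_MASK, GET_HEAD, SET_HEAD, SET_TAIL, Waiter, waiter_by_index as upthread */

extern void wake_up(Waiter *w);    /* assumed semaphore post */

void
release_and_wake(_Atomic uint64_t *state)
{
    /* O(1) atomics on the shared word: release the lock and detach the
     * whole wait list in one exchange to an "unlocked, empty queue" value */
    uint64_t old  = atomic_exchange(state, SET_HEAD(SET_TAIL(0, Q_MASK), Q_MASK));
    uint64_t head = GET_HEAD(old);
    Waiter  *w    = (head == Q_MASK) ? NULL : waiter_by_index(head);

    while (w != NULL)              /* walk the detached list privately;  */
    {                              /* no shared-state atomics per waiter */
        Waiter *next = atomic_load(&w->next);

        wake_up(w);
        w = next;
    }
    /* NB: a real version must also handle a pusher that has CASed itself
     * into the old state but not yet set the previous tail's next link
     * (e.g. wait for w->next to become non-NULL while w != GET_TAIL(old)) */
}
---------------------------------------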
Hi Andres,

Thank you for your feedback.

On Tue, Nov 1, 2022 at 2:15 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-10-31 14:38:23 +0400, Pavel Borisov wrote:
> > If the lock state contains references to the queue head and tail, we can
> > implement a lockless queue of waiters for the LWLock. [...]
>
> Right now wakeups happen *after* the lock is released.

That makes sense. The patch makes the locker hold the lock while it's processing the queue. So, the lock is held for a longer time.

> I suspect that is at least part of the reason for the regression you're
> seeing. It also looks like you're going to need a substantially higher
> number of atomic operations - right now the queue processing needs O(1)
> atomic ops, but your approach ends up with O(waiters) atomic ops.

Hmm... In the patch, queue processing calls CAS once after processing the queue. There could be retries to process the queue parts that were added concurrently. But I doubt it ends up with O(waiters) atomic ops. Pavel, I think we could gather some statistics to check how many retries we have on average.

> I suspect it might be worth going halfway, i.e. put the list head/tail in
> the atomic variable, but process the list with a lock held, after the lock
> is released.

Good idea. We, anyway, only allow one locker at a time to process the queue.

------
Regards,
Alexander Korotkov
Hi, Andres,

Thank you very much for the ideas on this topic!

> > Right now wakeups happen *after* the lock is released.
>
> That makes sense. The patch makes the locker hold the lock while it's
> processing the queue. So, the lock is held for a longer time.
>
> Hmm... In the patch, queue processing calls CAS once after processing
> the queue. There could be retries to process the queue parts that were
> added concurrently. But I doubt it ends up with O(waiters) atomic ops.
> Pavel, I think we could gather some statistics to check how many
> retries we have on average.

I've made some measurements of the number of repeated CAS operations on lock acquire and release. (For this I applied Print-lwlock-stats-on-CAS-repeats.patch onto the previous patch v1 in this thread.) The results when running the same insert test that produced the results in the original post, on 20 connections, are as follows:

lwlock ProcArray:
                | locks acquired | CAS repeats to acquire | CAS repeats to release
shared          | 493187         | 57310                  | 12049
exclusive       | 46816          | 42329                  | 8816
wait-until-free | -              | 0                      | 76124
blk 79473

> > I suspect it might be worth going halfway, i.e. put the list head/tail in
> > the atomic variable, but process the list with a lock held, after the lock
> > is released.
>
> Good idea. We, anyway, only allow one locker at a time to process the queue.

Alexander added these changes in v2 of the patch. The results of the same benchmarking (master vs Andres' patch vs lockless queue patch v1 and v2) are attached. They were done in the same way as in the original post. The small difference is that I've gone further, up to 5000 connections, and also produced both log- and linear-scale connection-axis plots for a clearer demonstration. Around 20 connections, TPS increased, though not yet to the same value that the master and Andres' patches have. And in the range of 300-3000 connections, the v2 patch demonstrated a clear gain.

I'm planning to gather more detailed statistics on different LWLockAcquire calls soon to understand prospects for further optimizations.

Best regards,
Pavel Borisov
Attachment
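(For readers following along: counting CAS repeats as in the table above amounts to instrumenting the retry loop, roughly as in this sketch. All names here are hypothetical; the actual Print-lwlock-stats-on-CAS-repeats.patch is not shown in the thread.)
---------------------------------------
#include <stdatomic.h>
#include <stdint.h>

extern uint64_t desired_state(uint64_t old);          /* assumed state transition */
extern void stats_add_acquire_repeats(uint64_t n);    /* assumed stats sink */

void
acquire_counting_repeats(_Atomic uint64_t *state)
{
    uint64_t repeats = 0;
    uint64_t old = atomic_load(state);

    /* each failed CAS refreshes 'old' and counts as one repeat */
    while (!atomic_compare_exchange_weak(state, &old, desired_state(old)))
        repeats++;

    stats_add_acquire_repeats(repeats);
}
---------------------------------------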
Hi, hackers!

> I'm planning to gather more detailed statistics on different
> LWLockAcquire calls soon to understand prospects for further
> optimizations.

So, I've made more measurements.

1. Applied measuring patch 0001 to a patch with lockless queue optimization (v2) from [0] in this thread and ran the same concurrent insert test described in [1] on 20 pgbench connections. The new results for the ProcArray lwlock are as follows:

exacq 45132                     // Overall number of exclusive locks taken
ex_attempt[0] 20755             // Exclusive locks taken immediately
ex_attempt[1] 18800             // Exclusive locks taken after one wait on semaphore
ex_attempt[2] 5577              // Exclusive locks taken after two or more waits on semaphore
shacq 494871                    // ... same stats for shared locks
sh_attempt[0] 463211            // ...
sh_attempt[1] 29767             // ...
sh_attempt[2] 1893              // ...
sh_wake_calls 31070             // Number of calls to wake up shared waiters
sh_wakes 36190                  // Number of shared waiters woken up
GroupClearXid 55300             // Number of calls of ProcArrayGroupClearXid
EndTransactionInternal: 236193  // Number of calls of ProcArrayEndTransactionInternal

2. Applied measuring patch 0002 to Andres Freund's patch v3 from [2] and ran the same concurrent insert test described in [1] on 20 pgbench connections. The results for the ProcArray lwlock are as follows:

exacq 49300                     // Overall number of exclusive locks taken
ex_attempt1[0] 18353            // Exclusive locks taken immediately by the first call of LWLockAttemptLock in the LWLockAcquire loop
ex_attempt2[0] 18144            // Exclusive locks taken immediately by the second call of LWLockAttemptLock in the LWLockAcquire loop
ex_attempt1[1] 9985             // Exclusive locks taken after one wait on semaphore by the first call of LWLockAttemptLock
ex_attempt2[1] 1838             // Exclusive locks taken after one wait on semaphore by the second call of LWLockAttemptLock
ex_attempt1[2] 823              // Exclusive locks taken after two or more waits on semaphore by the first call of LWLockAttemptLock
ex_attempt2[2] 157              // Exclusive locks taken after two or more waits on semaphore by the second call of LWLockAttemptLock
shacq 508131                    // ... same stats for shared locks
sh_attempt1[0] 469410           // ...
sh_attempt2[0] 27858            // ...
sh_attempt1[1] 10309            // ...
sh_attempt2[1] 460              // ...
sh_attempt1[2] 90               // ...
sh_attempt2[2] 4                // ...
dequeue self 48461              // Number of dequeue_self calls
sh_wake_calls 27560             // Number of calls to wake up shared waiters
sh_wakes 19408                  // Number of shared waiters woken up
GroupClearXid 65021             // Number of calls of ProcArrayGroupClearXid
EndTransactionInternal: 249003  // Number of calls of ProcArrayEndTransactionInternal

It seems that the two calls in each loop in Andres's (and master's) code help avoid semaphore-waiting loops, which may be relatively expensive. The probable reason is that the small delay between these two calls is sometimes enough for concurrent takers to free the spinlock guarding queue modification. Could we get even more performance if we made three or more tries to take the lock in the queue? Would this degrade performance in some other cases? Or maybe there is another explanation for the now-small performance difference around 20 connections described in [0]?

Thoughts?

Regards,
Pavel Borisov

[0] https://www.postgresql.org/message-id/CALT9ZEF7q%2BSarz1MjrX-fM7OsoU7CK16%3DONoGCOkY3Efj%2BGrnw%40mail.gmail.com
[1] https://www.postgresql.org/message-id/CALT9ZEEz%2B%3DNepc5eti6x531q64Z6%2BDxtP3h-h_8O5HDdtkJcPw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20221031235114.ftjkife57zil7ryw%40awork3.anarazel.de
Attachment
Hi, Pavel!

On Thu, Nov 3, 2022 at 1:51 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> So, I've made more measurements.
> [... statistics quoted from the previous message ...]
> It seems that the two calls in each loop in Andres's (and master's) code
> help avoid semaphore-waiting loops, which may be relatively expensive.
> Could we get even more performance if we made three or more tries to take
> the lock in the queue? Would this degrade performance in some other cases?

Thank you for gathering the statistics. Let me do some relative analysis of them.

*Lockless queue patch*
1. Shared lockers
  1.1. 93.60% of them acquire the lock without sleeping on the semaphore
  1.2. 6.02% of them acquire the lock after sleeping on the semaphore 1 time
  1.3. 0.38% of them acquire the lock after sleeping on the semaphore 2 or more times
2. Exclusive lockers
  2.1. 45.99% of them acquire the lock without sleeping on the semaphore
  2.2. 41.66% of them acquire the lock after sleeping on the semaphore 1 time
  2.3. 12.36% of them acquire the lock after sleeping on the semaphore 2 or more times
In general, about 10% of lockers sleep on the semaphore.

*Andres's patch*
1. Shared lockers
  1.1. 97.86% of them acquire the lock without sleeping on the semaphore (92.38% do this immediately and 5.48% after queuing)
  1.2. 2.12% of them acquire the lock after sleeping on the semaphore 1 time (2.03% do this immediately and 0.09% after queuing)
  1.3. 0.02% of them acquire the lock after sleeping on the semaphore 2 or more times (0.02% do this immediately and 0.00% after queuing)
2. Exclusive lockers
  2.1. 74.03% of them acquire the lock without sleeping on the semaphore (37.23% do this immediately and 36.80% after queuing)
  2.2. 23.98% of them acquire the lock after sleeping on the semaphore 1 time (20.25% do this immediately and 3.73% after queuing)
  2.3. 1.99% of them acquire the lock after sleeping on the semaphore 2 or more times (1.67% do this immediately and 0.32% after queuing)
In general, about 4% of lockers sleep on the semaphore.

I agree with Pavel that the reason for the regression of the lockless queue patch seems to be sleeping on semaphores. The lockless queue patch seems to give the lockers only one chance to acquire the lock without such sleeping. But the current LWLock code gives two such chances: before queuing and after queuing. LWLockWaitListLock() includes perform_spin_delay(), which may call pg_usleep(). So, the second attempt to acquire the lock may have significant chances (we see almost the same percentage for exclusive lockers!). Pavel, could you also measure the average time LWLockWaitListLock() spends with pg_usleep()?

It's a bit discouraging that sleeping on semaphores is so slow that even manual fixed-time sleeping is faster. I'm not sure if this is an issue of semaphores or of the multi-process model. If we don't change this, then let's try multiple lock tries as Pavel proposed.

------
Regards,
Alexander Korotkov
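(As a sanity check, these figures follow directly from Pavel's counters above; the arithmetic is not stated in the thread: for shared lockers under the lockless queue patch, 463211 / 494871 ≈ 93.60%, and (29767 + 1893) / 494871 ≈ 6.40% slept at least once; overall, (18800 + 5577 + 29767 + 1893) / (45132 + 494871) ≈ 10.4%, matching the "about 10%" above.)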
Hi,

On 2022-11-03 14:50:11 +0400, Pavel Borisov wrote:
> Or maybe there is another explanation for the now-small performance
> difference around 20 connections described in [0]?
> Thoughts?

Using xadd is quite a bit cheaper than cmpxchg, and now every lock release uses a compare-exchange, I think. In the past I had a more complicated version of LWLockAcquire which tried to use an xadd to acquire locks. IIRC (and this is long enough ago that I might not) that proved to be a benefit, but I was worried about the complexity. And just getting in the version that didn't always use a spinlock was the higher priority.

The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons why lwlocks are slower than a spinlock (but obviously are better under contention nonetheless).

I have a benchmark program that starts a thread for each physical core and just increments a counter on an atomic value. On my dual Xeon Gold 5215 workstation:

cmpxchg:
  32: throughput per thread: 0.55M/s, total: 11.02M/s
  64: throughput per thread: 0.63M/s, total: 12.68M/s
lock add:
  32: throughput per thread: 2.10M/s, total: 41.98M/s
  64: throughput per thread: 2.12M/s, total: 42.40M/s
xadd:
  32: throughput per thread: 2.10M/s, total: 41.91M/s
  64: throughput per thread: 2.04M/s, total: 40.71M/s

and even when there's no contention, every thread just updating its own cacheline:

cmpxchg:
  32: throughput per thread: 88.83M/s, total: 1776.51M/s
  64: throughput per thread: 96.46M/s, total: 1929.11M/s
lock add:
  32: throughput per thread: 166.07M/s, total: 3321.31M/s
  64: throughput per thread: 165.86M/s, total: 3317.22M/s
add (no lock):
  32: throughput per thread: 530.78M/s, total: 10615.62M/s
  64: throughput per thread: 531.22M/s, total: 10624.35M/s
xadd:
  32: throughput per thread: 165.88M/s, total: 3317.51M/s
  64: throughput per thread: 165.93M/s, total: 3318.53M/s

Greetings,

Andres Freund
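(A minimal sketch of such a counter benchmark, C11; Andres's actual program is not shown in the thread. Each benchmark thread would call one of these in a tight loop and report increments per second. The comments describe what gcc/clang typically emit on x86-64.)
---------------------------------------
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t counter;

/* compiles to a single lock xadd (or lock add when the result is unused) */
void
bump_add(void)
{
    atomic_fetch_add(&counter, 1);
}

/* compiles to a lock cmpxchg retry loop, like a CAS-based LWLock state update */
void
bump_cmpxchg(void)
{
    uint64_t old = atomic_load(&counter);

    while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
        ;   /* 'old' is refreshed on failure; retry */
}
---------------------------------------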
Hi, Andres!

On Fri, Nov 4, 2022 at 10:07 PM Andres Freund <andres@anarazel.de> wrote:
> The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons why
> lwlocks are slower than a spinlock (but obviously are better under contention
> nonetheless).
>
> I have a benchmark program that starts a thread for each physical core and
> just increments a counter on an atomic value.

Thank you for this insight! I didn't know xadd is much cheaper than cmpxchg unless there are retries. I also wonder how cmpxchg becomes faster with higher concurrency.

------
Regards,
Alexander Korotkov
Hi,

On 2022-11-05 12:05:43 +0300, Alexander Korotkov wrote:
> Thank you for this insight! I didn't know xadd is much cheaper than
> cmpxchg unless there are retries.

The magnitude of the effect is somewhat surprising, I agree. Some difference makes sense to me, but...

> I also wonder how cmpxchg becomes faster with higher concurrency.

If you're referring to the leading 32/64, that's not concurrency - that's 32/64-bit values. Sorry for not being clearer on that.

Greetings,

Andres Freund
Hi, Andres and Alexander!

I've done some more measurements to check the hypotheses regarding the performance of the previous patch v2 and to explain the test results in [1].

The results below are the same (tps vs connections) plots as in [1], and the test is identical to the insert test in this thread [2]. Additionally, in each case there is a plot with results relative to Andres Freund's patch [3]. Log plots are good for seeing details in the range of 20-30 connections, but they somewhat hide the fact that the effect in the range of 500+ connections is much more significant overall, so I'd recommend looking at the linear plots as well.

The particular tests:

1. Whether the cost of a CAS operation, compared to an atomic sub, affects performance. The patch (see attached atomic-sub-instead-of-CAS-for-shared-lockers-without.patch) uses an atomic sub to change the LWLock state variable when releasing a shared lock that doesn't have a wait queue. The results are attached (see atomic-sub-inverted-queue-*.jpg).

2. Whether sending wake signals in inverted order affects performance. In patch v2 the most recently queued shared lockers are woken up first (but they all get wake signals in the same wake pass anyway). The patch (see attached Invert-wakeup-queue-order.patch) inverts the wake queue so that the earliest-queued lockers are woken first. The results are attached to the same plot as the previous test (see atomic-sub-inverted-queue-*.jpg).

It seems that 1 and 2 don't have a visible effect on performance.

3. In the original master code, lock taking is tried twice, separated by a cheap spin delay, before waiting on the more expensive semaphore. This might explain why the proposed lockless queue patch [1] has a small performance degradation around 20-30 connections. Although we don't need two-step lock-taking anymore, I've tried to add it to see whether it can improve performance. I've tried different initial spin delays, from 1 microsecond to 100 milliseconds, and 1 or 2 retries in case the lock is still busy. (See the attached Extra-spin-waits-for-the-lock-to-be-freed-by-concurr.patch.)

The results attached (see add-spin-delays-*.jpg) are interesting. Indeed, a second attempt to take the lock after a spin delay increases performance in any combination of delays and retries. Also, the delay and the number of retries act in opposite directions in the regions of 20-30 connections and 500+ connections. So we can choose to have a more even performance gain at any number of connections (e.g. 2 retries x 500 microseconds) or better performance at 500+ connections with the same performance as Andres's patch around 20-30 connections (e.g. 1 retry x 100 milliseconds).

4. I've also collected some statistics for the overall (summed over all backends) number and duration of spin delays in the same test on Andres Freund's patch, gathered using a slightly modified LWLOCK_STATS mechanism:

conns | overall spin delays, s | overall number of spin delays | average spin delay, ms
   20 |                      0 |                             0 |                      0
  100 |                    1.9 |                          1177 |                    1.6
  200 |                   21.9 |                         14833 |                    1.5
 1000 |                   12.9 |                         11909 |                    1.1
 2000 |                    3.2 |                          2297 |                    1.4

I've attached v3 of the lockless queue patch with a 2 retries x 500 microseconds spin delay added, which makes the test results superior to the existing patch, leaves no performance degradation, and gives a steady performance gain across the whole range of connections. But it's surely worth discussing which parameters we want in the production patch. A sketch of the retry scheme follows the attachment list below.

I'm also planning to do the same tests on an ARM server when a free one becomes available to me.

Thoughts?

Regards,
Pavel Borisov

[1] https://www.postgresql.org/message-id/CALT9ZEF7q%2BSarz1MjrX-fM7OsoU7CK16%3DONoGCOkY3Efj%2BGrnw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CALT9ZEEz%2B%3DNepc5eti6x531q64Z6%2BDxtP3h-h_8O5HDdtkJcPw%40mail.gmail.com
[3] https://www.postgresql.org/message-id/20221031235114.ftjkife57zil7ryw%40awork3.anarazel.de
Attachment
- atomic-sub-inverted-queue-log.jpg
- add-spin-delays-log.jpg
- atomic-sub-inverted-queue-linear.jpg
- add-spin-delays-linear.jpg
- v3-0001-Lockless-queue-of-LWLock-waiters.patch
- Extra-spin-waits-for-the-lock-to-be-freed-by-concurr.patch
- Invert-wakeup-queue-order.patch
- atomic-sub-instead-of-CAS-for-shared-lockers-without.patch
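(The two-step retry idea in point 3 above amounts to something like the following sketch, C11; the retry count and delay are exactly the knobs benchmarked above, while try_acquire() and the slow path are assumed helpers:)
---------------------------------------
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

extern bool try_acquire(_Atomic uint64_t *state);                     /* assumed CAS fast path */
extern void enqueue_and_sleep_on_semaphore(_Atomic uint64_t *state);  /* assumed slow path */

void
acquire_with_retries(_Atomic uint64_t *state, int retries, useconds_t delay_us)
{
    /* initial attempt plus 'retries' more, separated by cheap delays */
    for (int attempt = 0; attempt <= retries; attempt++)
    {
        if (try_acquire(state))
            return;                 /* got the lock without a semaphore sleep */
        if (attempt < retries)
            usleep(delay_us);       /* cheap spin/sleep delay between attempts */
    }
    enqueue_and_sleep_on_semaphore(state);   /* expensive semaphore wait */
}
---------------------------------------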
CFbot isn't happy because of the additional patches in the previous message, so here I attach the v3 patch alone.

Regards,
Pavel
Attachment
Hi, Pavel!

On Fri, Nov 11, 2022 at 2:40 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> I've done some more measurements to check the hypotheses regarding the
> performance of the previous patch v2 and to explain the test results
> in [1].
>
> The results below are the same (tps vs connections) plots as in [1],
> and the test is identical to the insert test in this thread [2].
> [...]

Thank you for doing all the experiments! BTW, sometimes it's hard to distinguish so many lines in a jpg picture. Could I ask you to post the same graphs in png and also post the raw data in csv format?

> I'm also planning to do the same tests on an ARM server when a free
> one becomes available to me.
> Thoughts?

ARM tests should be great. We definitely need to check this on more than just one architecture. Please check with and without LSE instructions: they can lead to a dramatic speedup [1], although most precompiled binaries are distributed without them. So both cases seem important to me so far.

From what we have so far, I think we could try to combine multiple strategies to achieve the best result. 2x1ms is one of the leaders before ~200 connections, and 1x1ms is one of the leaders after. We could implement a simple heuristic to switch between 1 and 2 retries, similar to what we have for spin delays. But let's have ARM results first.

Links
1. https://akorotkov.github.io/blog/2021/04/30/arm/

------
Regards,
Alexander Korotkov
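(Such a 1-vs-2-retries heuristic could look roughly like the spins_per_delay adaptation in perform_spin_delay(); a sketch with hypothetical names, where a production version would damp the adjustment the way update_spins_per_delay() does:)
---------------------------------------
#include <stdbool.h>

static int retry_budget = 2;   /* retries to attempt before a semaphore sleep */

/* called after each acquisition with the observed outcome */
void
update_retry_budget(bool slept_on_semaphore)
{
    if (slept_on_semaphore && retry_budget > 1)
        retry_budget--;        /* retries didn't help; stop paying for them */
    else if (!slept_on_semaphore && retry_budget < 2)
        retry_budget++;        /* retries are paying off; allow one more */
}
---------------------------------------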
Hi, Alexander!

> BTW, sometimes it's hard to distinguish so many lines in a jpg
> picture. Could I ask you to post the same graphs in png and also post
> the raw data in csv format?

Here are the same pictures in png format, with the raw data attached.

Regards,
Pavel
Attachment
Hi, hackers!

I've noticed that the alignment requirements for using pg_atomic_init_u64() on the LWLock state (there's an assertion that it's aligned at 8 bytes) are not satisfied by the code in SimpleLruInit() on 32-bit architectures where MAXIMUM_ALIGNOF = 4. Fixed this in the v4 patch (PFA).

Regards,
Pavel
Attachment
Hi, hackers!

Andres Freund recently committed his nice LWLock optimization a4adc31f6902f6f. So I've rebased the patch on top of the current master (PFA v5).

Regards,
Pavel Borisov,
Supabase.
Attachment
Hi, hackers!

In the measurements above in the thread, I've been using a LIFO wake queue in the primary lockless patch (attached as the previous versions of the patch) and an "inverted wake queue" (in fact FIFO) as the alternative benchmarking option. I think using the latter is fairer and more natural, and since they show no difference in speed, I've made the main patch use it (attached as v6). No other changes from v5, though.

Regards,
Pavel
Attachment
On Sat, 26 Nov 2022 at 00:24, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> Hi, hackers!
> In the measurements above in the thread, I've been using a LIFO wake
> queue in the primary lockless patch and an "inverted wake queue"
> (in fact FIFO) as the alternative benchmarking option. [...]
> No other changes from v5, though.

There has not been much interest in this, as the thread has been idle for more than a year now. I'm not sure if we should take it forward or not. I would prefer to close this in the current commitfest unless someone wants to take it further.

Regards,
Vignesh
On Sat, 20 Jan 2024 at 07:28, vignesh C <vignesh21@gmail.com> wrote:
> There has not been much interest in this, as the thread has been idle
> for more than a year now. I'm not sure if we should take it forward or
> not. I would prefer to close this in the current commitfest unless
> someone wants to take it further.

I have returned this patch in the commitfest as nobody has shown any interest in pursuing it. Feel free to add a new entry when someone wants to work on this more actively.

Regards,
Vignesh