
Re: Purpose of wal_init_zero - Mailing list pgsql-hackers

From	Ritu Bhandari
Subject	Re: Purpose of wal_init_zero
Date	January 16 12:20:57
Msg-id	CAPNLunXuOc_Oyrr-pRVRcjCwV-G28vC2g6P-23BjVxRoNf9vRg@mail.gmail.com
In response to	Re: Purpose of wal_init_zero (Andy Fan <zhihuifan1213@163.com>)
Responses	Re: Purpose of wal_init_zero Re: Purpose of wal_init_zero
List	pgsql-hackers


Hi,

Adding to Andy Fan's point above:

If we increase the WAL segment size from 16MB to 64MB, initializing a 64MB WAL segment inline can freeze all write transactions for several seconds each time it happens. On smaller disks, where throughput scales with disk size, writing out a newly zero-filled 64MB WAL segment takes several seconds:

Disk size (GB)	Throughput per GiB (MiBps)	Throughput (MiBps)	Time to write 64MB (seconds)
10	0.48	5	13.33
32	0.48	15	4.17
64	0.48	31	2.08
128	0.48	61	1.04
256	0.48	123	0.52
500	0.48	240	0.27
834	0.48	400	0.16
1,000	0.48	480	0.13
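
The last column of the table can be reproduced with simple arithmetic. The sketch below is only an illustration: it assumes throughput scales linearly with disk size at the 0.48 MiBps per GiB shown above and ignores any caps, bursting, or caching. The function and constant names are made up for this example.

```python
# Assumption: throughput = disk size * 0.48 MiBps per GiB, as in the table.
THROUGHPUT_PER_GIB_MIBPS = 0.48
SEGMENT_MIB = 64


def zero_fill_seconds(disk_gib, segment_mib=SEGMENT_MIB,
                      per_gib=THROUGHPUT_PER_GIB_MIBPS):
    """Estimated seconds to write one zero-filled WAL segment."""
    throughput_mibps = disk_gib * per_gib
    return segment_mib / throughput_mibps


if __name__ == "__main__":
    for disk_gib in (10, 32, 64, 128, 256, 500, 834, 1000):
        throughput = disk_gib * THROUGHPUT_PER_GIB_MIBPS
        seconds = zero_fill_seconds(disk_gib)
        print(f"{disk_gib:>5} GB  {throughput:7.1f} MiBps  {seconds:6.2f} s")
```

With the default 16MB segment size the same formula gives a quarter of these times, which is why the stall becomes much more visible at 64MB.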

Writing 64MB of zeroes at every WAL file switch will not just degrade overall performance. More concerningly, it also makes the workload more "jittery", because it stalls all WAL writes, and therefore all write workloads, at every WAL switch for as long as the zero-fill takes.

As for WAL recycling: during our performance benchmarking, we noticed that a high volume of updates or inserts tends to generate WAL faster than the standard checkpoint process can keep up with. This results in more WAL files being created (rather than recycled) and zero-filled, which significantly degrades performance.

I see that PG once used fallocate [1], which was reverted by [2] because of a performance regression concern. The original OSS discussion was in [3].
The perf regression was reported in [4]. It looks like this was due to how ext4 handled extents and uninitialized data [5], and that seems to be fixed in [6]. I'll check with Theodore Ts'o to confirm [6].

Could we consider adding back fallocate?
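
To make the difference concrete, here is a minimal sketch of the two ways to create a segment-sized file: writing zeroes explicitly versus asking the filesystem to reserve the space. This is not PostgreSQL's code. The function names, the 8192-byte block size, and the file handling are assumptions for illustration, and os.posix_fallocate is only available on Unix-like platforms.

```python
import os

SEGMENT_BYTES = 64 * 1024 * 1024  # assumed 64MB segment, as discussed above


def create_zero_filled(path, size=SEGMENT_BYTES, block=8192):
    """Create a file by writing `size` bytes of zeroes, then fsync it.

    Every byte has to travel to the disk, so the cost grows with `size`.
    """
    zeros = bytes(block)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        remaining = size
        while remaining > 0:
            written = os.write(fd, zeros[:min(block, remaining)])
            remaining -= written
        os.fsync(fd)
    finally:
        os.close(fd)


def create_fallocated(path, size=SEGMENT_BYTES):
    """Create a file and reserve `size` bytes with posix_fallocate.

    On filesystems with native support this allocates extents without
    writing the data blocks; the file still reads back as zeroes.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)
        os.fsync(fd)
    finally:
        os.close(fd)
```

Both functions leave a file of the same length that reads as zeroes. Whether later writes into fallocated (unwritten) extents are as cheap as writes into pre-zeroed blocks is exactly the ext4 question raised in [4] and [5].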

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=269e780
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b571bb
[3] https://www.postgresql.org/message-id/flat/CAKuK5J0raLwOiKfSh5d8SxtCY2snJAMsfo6RGTBMfcQYB%2B-faQ%40mail.gmail.com
[4] https://www.postgresql.org/message-id/flat/CAA-aLv7tYHDzMGg4HtDZh0RQZjJc2v2weJ-Obm4yvkw6ePe9Qw%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAKuK5J3R-oBh%2B9f23Ko-0-gt5Zi1REgg7ng-awQuUsgiY2B7GQ%40mail.gmail.com
[6] https://github.com/torvalds/linux/commit/b71fc079b5d8f42b2a52743c8d2f1d35d655b1c5

Thanks,

-Ritu

On Thu, 16 Jan 2025 at 12:01, Andy Fan <zhihuifan1213@163.com> wrote:

Hi,

>
> c=1 && \
> psql -c checkpoint -c 'select pg_switch_wal()' && \
> pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 10000
>
> wal_init_zero = 1: 885 TPS
> wal_init_zero = 0: 286 TPS.

Your theory looks clear and the result is promising. I can reproduce a
similar result in my setup.

on: tps = 1588.538378 (without initial connection time)
off: tps = 857.755343 (without initial connection time)

> Of course I chose this case to be intentionally extreme - each transaction
> fills a bit more than one page of WAL and immediately flushes it. That
> guarantees that each commit needs a separate filesystem metadata flush and a
> flush of the data for the fdatasync() at commit.

However, if I increase the clients from 1 to 64 (this may break the
extreme case because of group commit), then we can see that wal_init_zero
causes a noticeable regression.

c=64 && \
psql -c checkpoint -c 'select pg_switch_wal()' && \
pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 10000

off:
tps = 12135.110730 (without initial connection time)
tps = 11964.016277 (without initial connection time)
tps = 12078.458724 (without initial connection time)

on:
tps = 9392.374563 (without initial connection time)
tps = 9391.916410 (without initial connection time)
tps = 9390.503777 (without initial connection time)
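
For reference, the size of the regression in the six runs above can be worked out as follows. This is just arithmetic on the reported numbers (about a 22% drop with wal_init_zero on); the helper name is made up for this example.

```python
from statistics import mean

# TPS values copied from the c=64 runs above.
off_tps = [12135.110730, 11964.016277, 12078.458724]
on_tps = [9392.374563, 9391.916410, 9390.503777]


def relative_drop(baseline, other):
    """Fractional drop of mean(other) relative to mean(baseline)."""
    return (mean(baseline) - mean(other)) / mean(baseline)


drop = relative_drop(off_tps, on_tps)
print(f"off mean: {mean(off_tps):.1f} TPS")
print(f"on mean:  {mean(on_tps):.1f} TPS")
print(f"drop with wal_init_zero on: {drop:.1%}")
```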

Now the wal_init_zero work happens in the user backend, and other backends also
need to wait for it, which does not look good to me. I find that walwriter doesn't
do much work, so I'd like to try offloading wal_init_zero
to the walwriter.

About wal_recycle: IIUC, a WAL file can only be recycled during a
checkpoint, but checkpoints don't happen often.

--
Best Regards
Andy Fan
