Re: Purpose of wal_init_zero - Mailing list pgsql-hackers

From Ritu Bhandari
Subject Re: Purpose of wal_init_zero
Date
Msg-id CAPNLunXuOc_Oyrr-pRVRcjCwV-G28vC2g6P-23BjVxRoNf9vRg@mail.gmail.com
In response to Re: Purpose of wal_init_zero  (Andy Fan <zhihuifan1213@163.com>)
List pgsql-hackers
Hi, 

Adding to Andy Fan's point above:

If we increase the WAL segment size from 16MB to 64MB, initializing a 64MB WAL segment inline can freeze all write transactions for several seconds each time it happens. On smaller disks, writing out a newly zero-filled 64MB WAL segment takes several seconds:

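 Disk size (GB) | Throughput per GiB (MiBps) | Throughput (MiBps) | Time to write 64MB (seconds)
----------------+----------------------------+--------------------+------------------------------
             10 |                       0.48 |                  5 |                        13.33
             32 |                       0.48 |                 15 |                         4.17
             64 |                       0.48 |                 31 |                         2.08
            128 |                       0.48 |                 61 |                         1.04
            256 |                       0.48 |                123 |                         0.52
            500 |                       0.48 |                240 |                         0.27
            834 |                       0.48 |                400 |                         0.16
          1,000 |                       0.48 |                480 |                         0.13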
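
The write times in the table follow directly from the disk size and the per-GiB throughput: time = segment size / (disk size x 0.48 MiBps). A minimal sketch of that arithmetic, assuming throughput scales linearly with provisioned size and treating 64MB as 64 MiB (the function name `zero_fill_seconds` is made up for illustration):

```shell
#!/usr/bin/env bash
# Estimate how long zero-filling one WAL segment takes on a volume whose
# throughput scales with its provisioned size.
# Assumption: 0.48 MiB/s per GiB, the figure used in the table above.

# zero_fill_seconds DISK_GB [SEGMENT_MIB] [MIBPS_PER_GIB]
zero_fill_seconds() {
    local disk_gb=$1 segment_mib=${2:-64} per_gib=${3:-0.48}
    LC_ALL=C awk -v d="$disk_gb" -v s="$segment_mib" -v p="$per_gib" \
        'BEGIN { printf "%.2f\n", s / (d * p) }'
}

# Recompute the last column of the table for each disk size.
for size in 10 32 64 128 256 500 834 1000; do
    printf '%5d GB: %s s\n' "$size" "$(zero_fill_seconds "$size")"
done
```

Passing a second argument of 16 gives the corresponding estimate for the default 16MB segment size, which is a quarter of the 64MB figure on the same disk.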


Writing a full 64MB of zeroes at every WAL file switch will not just degrade performance in general; more concerningly, it also makes the workload more "jittery", because it stops all WAL writes, and therefore all write workloads, at every WAL switch for as long as the zero-fill takes.
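
To illustrate the difference in I/O volume, here is a rough shell analogy, not what the server literally runs. It assumes that with wal_init_zero = on the whole segment is written with zeroes, and that with wal_init_zero = off only a single byte at the last offset is written (as far as I understand the current code). The helper name `make_segment_files` is made up for this sketch, and it skips the fsync that PostgreSQL performs.

```shell
#!/usr/bin/env bash
# make_segment_files DIR [SIZE_MIB]: create DIR/zeroed and DIR/sparse,
# both SIZE_MIB MiB long (default 64).
make_segment_files() {
    local dir=$1 size_mib=${2:-64}
    local bytes=$((size_mib * 1024 * 1024))

    # Analogy for wal_init_zero = on: every block of the new file is written.
    dd if=/dev/zero of="$dir/zeroed" bs=1048576 count="$size_mib" 2>/dev/null

    # Analogy for wal_init_zero = off: one byte at the last offset sets the
    # file length; filesystems with sparse-file support leave the rest
    # unallocated.
    printf '\0' | dd of="$dir/sparse" bs=1 seek=$((bytes - 1)) 2>/dev/null
}

dir=$(mktemp -d)
make_segment_files "$dir"
wc -c "$dir/zeroed" "$dir/sparse"   # same apparent size
du -k "$dir/zeroed" "$dir/sparse"   # allocated size usually differs
rm -rf "$dir"
```

Both files report the same length, but only the first one costs a full segment's worth of writes up front, which is the stall described above.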

Also, regarding WAL recycling: during our performance benchmarking we noticed that a high volume of updates or inserts tends to generate WAL faster than the standard checkpoint process can keep up with. This results in more WAL files being created (instead of recycled) and zero-filled, which significantly degrades performance.

I see that PG once had fallocate [1], which was reverted by [2] due to a performance regression concern. The original OSS discussion was in [3].
The perf regression was reported in [4]. It looks like this was due to how ext4 handled extents and uninitialized data [5], and that seems to be fixed in [6]. I'll check with Theodore Ts'o to confirm on [6].


Thanks,
-Ritu

On Thu, 16 Jan 2025 at 12:01, Andy Fan <zhihuifan1213@163.com> wrote:

Hi,

>
> c=1 && \
>   psql -c checkpoint -c 'select pg_switch_wal()' && \
>   pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 10000
>
> wal_init_zero = 1: 885 TPS
> wal_init_zero = 0: 286 TPS.

Your theory looks clear and the result is promising. I can reproduce a
similar result in my setup.

on: tps = 1588.538378 (without initial connection time)
off: tps = 857.755343 (without initial connection time) 

> Of course I chose this case to be intentionally extreme - each transaction
> fills a bit more than one page of WAL and immediately flushes it. That
> guarantees that each commit needs a separate filesystem metadata flush and a
> flush of the data for the fdatasync() at commit.

However, if I increase the number of clients from 1 to 64 (which may break
this extreme case because of group commit), then we can see that
wal_init_zero causes a noticeable regression.

c=64 && \
   psql -c checkpoint -c 'select pg_switch_wal()' && \
   pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 10000

off:
tps = 12135.110730 (without initial connection time)
tps = 11964.016277 (without initial connection time)
tps = 12078.458724 (without initial connection time)

on:
tps = 9392.374563 (without initial connection time)
tps = 9391.916410 (without initial connection time)
tps = 9390.503777 (without initial connection time)
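
To put a number on the gap between the two sets of runs, the means and the relative drop can be computed like this. It is only a sketch of the arithmetic on the six figures above; the helper names `mean` and `pct_drop` are made up for illustration.

```shell
#!/usr/bin/env bash
# mean VALUE...: arithmetic mean of the arguments, two decimals
mean() {
    printf '%s\n' "$@" |
        LC_ALL=C awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
}

# pct_drop BASELINE VALUE: how far VALUE falls below BASELINE, in percent
pct_drop() {
    LC_ALL=C awk -v b="$1" -v v="$2" \
        'BEGIN { printf "%.1f\n", (b - v) / b * 100 }'
}

# The three c=64 runs for each setting, copied from the results above.
off=$(mean 12135.110730 11964.016277 12078.458724)
on=$(mean 9392.374563 9391.916410 9390.503777)
echo "off: $off tps, on: $on tps, drop with wal_init_zero=on: $(pct_drop "$off" "$on")%"
```

For these runs that works out to a drop of roughly 22% with wal_init_zero = on at 64 clients.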

Now the zero-filling happens in a user backend and other backends also
have to wait for it, which doesn't look good to me. I find that walwriter
doesn't do much work, so I'd like to try offloading the wal_init_zero
work to the walwriter.

About wal_recycle: IIUC, a WAL file can only be recycled during a
checkpoint, but checkpoints don't happen often.

--
Best Regards
Andy Fan


