Re: [PING] fallocate() causes btrfs to never compress postgresql files - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: [PING] fallocate() causes btrfs to never compress postgresql files |
Date | |
Msg-id | CA+hUKGJT+jczya=2s4D4GYXVWOFJ-qAkJDkjVPk-PAbqNL3x9A@mail.gmail.com |
In response to | Re: [PING] fallocate() causes btrfs to never compress postgresql files (Dimitrios Apostolou <jimis@gmx.net>) |
List | pgsql-hackers |
On Mon, Jun 2, 2025 at 10:14 PM Dimitrios Apostolou <jimis@gmx.net> wrote:
> On Sun, 1 Jun 2025, Thomas Munro wrote:
> > Or for a completely different approach: I wonder if ftruncate() would
> > be more efficient on COW systems anyway. The minimum thing we need is
> > for the file system to remember the new size, 'cause, erm, we don't.
> > All the rest is probably a waste of cycles, since they reserve real
> > space (or fail to) later in the checkpointer or whatever process
> > eventually writes the data out.
>
> FWIW I asked the btrfs devs. From
> https://github.com/kdave/btrfs-progs/pull/976
> I quote Qu Wenruo:
>
> > Only for falloc(), not ftruncate().
> >
> > The PREALLOC inode flag is added for any preallocated file extent,
> > meanwhile truncate only creates holes.
> >
> > truncate is fast but it's really different from fallocate by there is
> > nothing really allocated.
> >
> > This means the later writes will need to allocate their own data
> > extents. This is fine and even preferred for btrfs, but may lead to
> > performance drop for more traditional fses.
> >
> > We're in an era that fs features are not longer that generic, fallocate
> > is just one example, in fact fallocate will cause more problems more
> > than no compression.
> >
> > It's really a deep rabbit hole, and is not something simple true or
> > false questions.
>
>
> In other words, btrfs will not try to allocate anything with ftruncate(),
> it will just mark the new space as a "hole". As such, the file is not
> marked as "PREALLOC" which is what disables compression. Of course there
> is no guarantee that further writes will succeed, and as quoted above,
> other (non-COW) filesystems might be slower writing the
> ftruncate()-allocated space.

Yeah, right, I know. But PostgreSQL has at least two different goals
when extending a relation:

1. Remember the new size of the relation somewhere*.
2. Reserve space now, so that we can report ENOSPC and roll back the
   transaction that wants to extend the relation when the disk is full,
   instead of causing a checkpoint or buffer eviction to fail later (see
   https://wiki.postgresql.org/wiki/ENOSPC for longer version).

But the second thing just can't work on a COW system by definition, so
the whole notion is bogus, which is why I wondered if ftruncate() is
actually a reasonable option to have, even though it just creates holes
(on Unixen).

I also know of another completely different reason to want to use
ftruncate(): NTFS, which *doesn't* create holes (NTFS supports holes via
other syscalls, but ftruncate() or rather _chsize_s() as they spell it
doesn't make them), making it more like posix_fallocate() in this usage.

So I was beginning to wonder if we might want to experiment with a patch
that adds file_extend_method=fallocate,ftruncate,write. Perhaps
accompanied by a threshold setting below which it always writes. Then
we could experiment with various COW file systems (zfs, btrfs, apfs,
refs, ???) and NTFS to see how that speculation works out in reality.

Wild speculation: To actually achieve the second thing on a COW file
system, you'd probably need some totally new kind of interface, because
that POSIX interface has the wrong shape. I have wondered about a new
fcntl() or whatever that would let you reserve the right to write N
blocks (ie just once!) without ENOSPC on a given descriptor, that a
database could conceptually acquire when dirtying buffers, since that's
the point at which we know that a write must eventually happen (then
probably amortise that accounting a lot), including but not limited to
this relation-extension case, and that way you could achieve goal #2, ie
transferring ENOSPC errors to transaction time. But that's just a
daydream about vapourware. One problem is that PostgreSQL has many
processes with separate file descriptors, so that'd make the bookkeeping
trickier but not impossible.

(*That has a few known issues...)
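
For anyone who wants to poke at this outside the tree, here is a rough standalone sketch of the three strategies a file_extend_method=fallocate,ftruncate,write setting would choose between. To be clear about what is assumed: this is not PostgreSQL code and not a patch; ExtendMethod, extend_file(), file_usage() and SKETCH_BLCKSZ are made-up names for illustration, and it assumes a POSIX system that provides posix_fallocate() (st_blocks is in 512-byte units on Linux, which is also an assumption elsewhere).

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names mirroring the proposed file_extend_method values. */
typedef enum
{
    EXTEND_FALLOCATE,
    EXTEND_FTRUNCATE,
    EXTEND_WRITE
} ExtendMethod;

#define SKETCH_BLCKSZ 8192      /* illustrative block size */

/*
 * Grow fd from old_size to new_size bytes using the chosen method.
 * Returns 0 on success, or an errno value (for example ENOSPC) on failure.
 */
static int
extend_file(int fd, off_t old_size, off_t new_size, ExtendMethod method)
{
    assert(old_size >= 0 && new_size >= old_size);

    if (new_size == old_size)
        return 0;               /* posix_fallocate() rejects len == 0 */

    switch (method)
    {
        case EXTEND_FALLOCATE:
            /* Reserves space; returns the error number, errno is not set. */
            return posix_fallocate(fd, old_size, new_size - old_size);

        case EXTEND_FTRUNCATE:
            /* Only records the new size; creates a hole on most Unix fses. */
            return ftruncate(fd, new_size) == 0 ? 0 : errno;

        case EXTEND_WRITE:
            {
                static const char zeros[SKETCH_BLCKSZ]; /* zero-filled */
                off_t       pos = old_size;

                /* Write real zeroes, one block at a time. */
                while (pos < new_size)
                {
                    size_t      want = sizeof(zeros);
                    ssize_t     done;

                    if ((off_t) want > new_size - pos)
                        want = (size_t) (new_size - pos);
                    done = pwrite(fd, zeros, want, pos);
                    if (done < 0)
                    {
                        if (errno == EINTR)
                            continue;
                        return errno;
                    }
                    pos += done;
                }
                return 0;
            }
    }
    return EINVAL;
}

/*
 * Report the logical size and the allocated block count, so holes show up
 * as a size that is larger than the allocated space.
 */
static int
file_usage(int fd, off_t *size, long long *blocks)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return errno;
    *size = st.st_size;
    *blocks = (long long) st.st_blocks;
    return 0;
}
```

All three leave the file system remembering the new size (goal 1). Only fallocate and write even try to reserve space (goal 2), and on a COW file system that reservation may not mean much for later overwrites anyway. Comparing file_usage() output after each method on btrfs, zfs, ext4 and so on is a cheap way to see which ones leave holes.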