Thread: [PING] fallocate() causes btrfs to never compress postgresql files

From
Dimitrios Apostolou
Date:
Hello, sorry for mass sending this, but I didn't get any response to my
first email [1] so I'm now CC'ing the commit's 4d330a6 [2] author and the
reviewers. I think it's an important issue, because I need to
custom-compile postgresql to have what I had before: a transparently
compressed database.

[1] https://www.postgresql.org/message-id/d0f4fc11-969d-7b3a-aacf-00f86450e738@gmx.net
[2] https://github.com/postgres/postgres/commit/4d330a61bb1969df31f2cebfe1ba9d1d004346d8

My previous message follows:

Hi,

this is just a heads-up about files being generated by PostgreSQL 17 not
being compressed by Btrfs, even when mounted with the compress-force mount
option. I have this occurring aggressively when restoring a database via
pg_restore. I think this is caused by mdzeroextend() calling
FileFallocate(), which in turn invokes posix_fallocate().

I also verified that turning off the use of fallocate causes the database
to write compressed files again, like it did in older versions.
Unfortunately the only way I found was to configure with a "hack" so that
autoconf thinks the feature is not available:

    ./configure ac_cv_func_posix_fallocate=no

There have been discussions on the btrfs mailing list about why it does
that. The summary is that it is very difficult to guarantee that
compressed writes will not fail with ENOSPC on a CoW filesystem, thus
files with fallocate()d ranges are treated as being marked NOCOW,
effectively disabling compression.

Should PostgreSQL provide a setting to avoid the use of fallocate()? Or is
it the filesystem at fault for not returning EOPNOTSUPP, in which case
postgres would use its fallback code?

BTW even in the latter case, PostgreSQL would not notice the lack of
fallocate() support, as glibc implements a userspace fallback in
posix_fallocate(). That fallback has its own issues that hopefully will
not affect postgres (see CAVEATS in man 3 posix_fallocate).
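
To make the fallback question above concrete, here is a minimal C sketch.
It is not PostgreSQL's actual code: extend_file(), extend_with_zeros(),
file_size() and the use_fallocate switch are hypothetical names invented
for illustration. It tries posix_fallocate() and falls back to writing
zeros when the call reports EOPNOTSUPP or EINVAL. Note that
posix_fallocate() returns the error number directly instead of setting
errno.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: extend the file by writing real zero bytes. */
static int
extend_with_zeros(int fd, off_t offset, off_t len)
{
    char zeros[8192];

    memset(zeros, 0, sizeof(zeros));
    while (len > 0)
    {
        size_t  chunk = len < (off_t) sizeof(zeros) ? (size_t) len : sizeof(zeros);
        ssize_t n = pwrite(fd, zeros, chunk, offset);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted, retry the same chunk */
            return errno;
        }
        if (n == 0)
            return EIO;         /* no progress, avoid looping forever */
        offset += n;
        len -= n;
    }
    return 0;
}

/*
 * Hypothetical helper: returns 0 on success, otherwise an errno value.
 * use_fallocate stands in for an assumed setting; it is not a real
 * PostgreSQL parameter.
 */
static int
extend_file(int fd, off_t offset, off_t len, int use_fallocate)
{
    assert(offset >= 0 && len > 0);

    if (use_fallocate)
    {
        /* posix_fallocate() returns the error code, it does not set errno */
        int rc = posix_fallocate(fd, offset, len);

        if (rc != EOPNOTSUPP && rc != EINVAL)
            return rc;          /* success, or an error such as ENOSPC */
        /* otherwise the filesystem refused: fall through to zero writes */
    }
    return extend_with_zeros(fd, offset, len);
}

/* Hypothetical helper: current file size, or -1 if fstat() fails. */
static off_t
file_size(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? st.st_size : (off_t) -1;
}
```

Calling extend_file(fd, 0, 16384, 0) always extends the file with
ordinary writes, which is roughly the effect the configure hack has for
the whole build; passing 1 takes the posix_fallocate() path first.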

Regards,
Dimitris



On 5/28/25 16:22, Dimitrios Apostolou wrote:
> Hello, sorry for mass sending this, but I didn't get any response to my
> first email [1] so I'm now CC'ing the commit's 4d330a6 [2] author and
> the reviewers. I think it's an important issue, because I need to
> custom-compile postgresql to have what I had before: a transparently
> compressed database.
> 

That message arrived a couple days before the feature freeze, so
everyone was busy with getting PG18 patches over the line. I assume
that's why no one responded to a message about an issue that already
affects PG17. We're in the quieter part of the dev cycle, people are
recovering etc. Hence the delay.

> [1] https://www.postgresql.org/message-id/d0f4fc11-969d-7b3a-aacf-00f86450e738@gmx.net
> [2] https://github.com/postgres/postgres/commit/4d330a61bb1969df31f2cebfe1ba9d1d004346d8
> 
> My previous message follows:
> 
> Hi,
> 
> this is just a heads-up about files being generated by PostgreSQL 17 not
> being compressed by Btrfs, even when mounted with the compress-force mount
> option. I have this occurring aggressively when restoring a database via
> pg_restore. I think this is caused by mdzeroextend() calling
> FileFallocate(), which in turn invokes posix_fallocate().
> 

Right, I don't think we're really using posix_fallocate() in other
places, or at least not in places that would matter. And this code comes
from commit 4d330a61bb in PG17:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4d330a61bb1969df31f2cebfe1ba9d1d004346d8

The commit message explains why we do that - it has advantages when
allocating a large number of blocks. FWIW it's general code, used
whenever we need to add space to a relation, not just for pg_restore.


> I also verified that turning off the use of fallocate causes the database
> to write compressed files again, like it did in older versions.
> Unfortunately the only way I found was to configure with a "hack" so that
> autoconf thinks the feature is not available:
> 
>    ./configure ac_cv_func_posix_fallocate=no
> 

Unfortunately, that seems pretty heavy-handed, because it will affect
the whole build, no matter which filesystem it gets used with. And I
guess we don't want to disable posix_fallocate() just because one
filesystem does something ... strange.

> There have been discussions on the btrfs mailing list about why it does
> that. The summary is that it is very difficult to guarantee that
> compressed writes will not fail with ENOSPC on a CoW filesystem, thus
> files with fallocate()d ranges are treated as being marked NOCOW,
> effectively disabling compression.
> 

Isn't guaranteeing the success of a write a general issue with
compressed filesystems? Why is posix_fallocate() special in this regard?
Shouldn't the filesystem be defensive and assume the data is not
compressible? Or maybe just return EOPNOTSUPP when in doubt.

> Should PostgreSQL provide a setting to avoid the use of fallocate()? Or is
> it the filesystem at fault for not returning EOPNOTSUPP, in which case
> postgres would use its fallback code?
> 

I don't have a clear opinion on whether it's a filesystem issue. Maybe
we should be handling this differently, not sure.

> BTW even in the latter case, PostgreSQL would not notice the lack of
> fallocate() support, as glibc implements a userspace fallback in
> posix_fallocate(). That fallback has its own issues that hopefully will
> not affect postgres (see CAVEATS in man 3 posix_fallocate).
> 

Well, if btrfs starts returning EOPNOTSUPP, and glibc switches to the
userspace fallback, we wouldn't notice. But it's up to btrfs to decide
whether they want to support fallocate. We still need our fallback
anyway, because of other OSes.
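
One way to see what the filesystem itself reports, without the glibc
emulation getting in the way, would be to call the Linux-specific
fallocate() directly. The sketch below is only an illustration under that
assumption; native_fallocate_supported() is a hypothetical name, not an
existing PostgreSQL function. With mode 0 the probe really allocates the
requested range, so it is not side-effect free.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for the fallocate() declaration */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical probe: returns 1 if the filesystem performed a native
 * fallocate(), 0 if it reported EOPNOTSUPP, and -1 on any other error
 * (errno is left set by fallocate() in that case).
 */
static int
native_fallocate_supported(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);

    /* mode 0: allocate the range and extend the file size if needed */
    if (fallocate(fd, 0, offset, len) == 0)
        return 1;
    return (errno == EOPNOTSUPP) ? 0 : -1;
}
```

A result of 0 here is exactly the case where posix_fallocate() would
quietly switch to its userspace emulation and the caller would see
success.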


regards

-- 
Tomas Vondra




Thomas Munro <thomas.munro@gmail.com> writes:
> It's slightly tricky to get smgr to behave differently because of the
> contents of a system catalogue!

The mere thought makes me blanch.  I'm okay with the GUC part,
but I do not think we should put in 0002 --- the odds of
causing serious problems greatly outweigh the value, IMO.
Fundamental layering violations tend to bite you on tender
parts of your anatomy.

            regards, tom lane