Re: Some thoughts on NFS - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Some thoughts on NFS |
Date | |
Msg-id | CA+hUKGJ3J_ZYKpOFM9EF2BOA8y71MfP5_ipLPsSwpB+dTt+GBQ@mail.gmail.com |
In response to | Re: Some thoughts on NFS (Andres Freund <andres@anarazel.de>) |
Responses | Re: Some thoughts on NFS |
List | pgsql-hackers |
On Wed, Feb 20, 2019 at 5:52 AM Andres Freund <andres@anarazel.de> wrote:
> > 1. Figure out how to get the ALLOCATE command all the way through the
> > stack from PostgreSQL to the remote NFS server, and know for sure that
> > it really happened. On the Debian buster Linux 4.18 system I checked,
> > fallocate() reports EOPNOTSUPP for fallocate(), and posix_fallocate()
> > appears to succeed but it doesn't really do anything at all (though I
> > understand that some versions sometimes write zeros to simulate
> > allocation, which in this case would be equally useless as it doesn't
> > reserve anything on an NFS server). We need the server and NFS client
> > and libc to be of the right version and cooperate and tell us that
> > they have really truly reserved space, but there isn't currently a way
> > as far as I can tell. How can we achieve that, without writing our
> > own NFS client?
> >
> > 2. Deal with the resulting performance suckage. Extending 8kb at a
> > time with synchronous network round trips won't fly.
>
> I think I'd just go for fsync();pwrite();fsync(); as the extension
> mechanism, iff we're detecting a tablespace is on NFS. The first fsync()
> to make sure there's no previous errors that we could mistake for
> ENOSPC, the pwrite to extend, the second fsync to make sure there's
> actually space. Then we can detect ENOSPC properly. That possibly does
> leave some errors where we could mistake ENOSPC as something more benign
> than it is, but the cases seem pretty narrow, due to the previous
> fsync() (maybe the other side could be thin provisioned and get an
> ENOSPC there - but in that case we didn't actually lose any data. The
> only dangerous scenario I can come up with is that the remote side is on
> thinly provisioned CoW system, and a concurrent write to an earlier
> block runs out of space - but seriously, good riddance to you).

This seems to make sense, and has the advantage that it uses
interfaces that exist right now.
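
A minimal sketch of the fsync();pwrite();fsync(); sequence quoted above, to
make the three steps concrete. This is not PostgreSQL code:
extend_one_block() and BLOCK_SIZE are hypothetical names, the 8192-byte
block size is taken from the "8kb" mentioned earlier, and mapping a short
write to ENOSPC is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192		/* assumed block size ("8kb" above) */

/*
 * Hypothetical helper: extend fd by one zero-filled block at "offset",
 * bracketing the write with fsync() calls.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
extend_one_block(int fd, off_t offset)
{
	static const char zeros[BLOCK_SIZE];	/* zero-initialized */
	ssize_t		written;

	assert(offset >= 0);

	/* 1. Flush first, so an older write-back error is reported here. */
	if (fsync(fd) != 0)
		return -1;

	/* 2. Extend the file by writing one block of zeros. */
	written = pwrite(fd, zeros, sizeof(zeros), offset);
	if (written < 0)
		return -1;
	if ((size_t) written != sizeof(zeros))
	{
		/* Assumption of this sketch: treat a short write as ENOSPC. */
		errno = ENOSPC;
		return -1;
	}

	/* 3. Flush again; on NFS an ENOSPC may only surface at this point. */
	if (fsync(fd) != 0)
		return -1;

	return 0;
}
```

A caller would open the file with open() and call the helper once per new
block, treating a -1 return with errno == ENOSPC from either the pwrite()
or the second fsync() as "out of space". The sketch pays two fsync() calls
per block, so it does nothing about the round-trip cost raised in point 2.
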
But it seems a bit like we'll have to wait for them to finish
building out the errseq_t support for NFS to avoid various races
around the mapping's AS_EIO flag (A: fsync() -> EIO, B: fsync() ->
SUCCESS, log checkpoint; A: panic), and then maybe we'd have to get at
least one of { fd-passing, direct IO, threads } working on our side ...

-- 
Thomas Munro
https://enterprisedb.com