Re: [HACKERS] Clock with Adaptive Replacement - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: [HACKERS] Clock with Adaptive Replacement
Date:
Msg-id: CAEepm=0LEKwuTP0oeRuipGaW4EHuTcK1rE5gbbwCjeGugoF2KA@mail.gmail.com
In response to: Re: [HACKERS] Clock with Adaptive Replacement (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses: Re: [HACKERS] Clock with Adaptive Replacement
List: pgsql-hackers
On Thu, Apr 26, 2018 at 1:31 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> ... I suppose when you read a page in, you could tell the kernel
> that you POSIX_FADV_DONTNEED it, and when you steal a clean PG
> buffer you could tell the kernel that you POSIX_FADV_WILLNEED its
> former contents (in advance somehow), on the theory that the coldest
> stuff in the PG cache should now become the hottest stuff in the OS
> cache.  Of course that sucks, because the best the kernel can do
> then is go and read it from disk, and the goal is to avoid IO.
> Given a hypothetical way to "write" "clean" data to the kernel (so
> it wouldn't mark it dirty and generate IO, but it would let you read
> it back without generating IO if you're lucky), then perhaps you
> could actually achieve exclusive caching at the two levels, and use
> all your physical RAM without duplication.

Craig said essentially the same thing on the nearby fsync()
reliability thread:

On Sun, Apr 29, 2018 at 1:50 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> ... I'd kind of hoped to go in the other direction if anything, with
> some kind of pseudo-write op that let us swap a dirty shared_buffers
> entry from our shared_buffers into the OS dirty buffer cache (on
> Linux at least) and let it handle writeback, so we reduce
> double-buffering.

Ha!  So much for that!  I would like to reply to it here, on this
thread that discusses double buffering and performance, to avoid
distracting the fsync() thread from its main topic of reliability.

I think that idea has potential.  Even though I believe that direct
IO is generally the right way to go (that has been RDBMS orthodoxy
for a decade or more, AFAIK), we'll always want to support buffered
IO too, as other RDBMSs do.  For one thing, not every filesystem
supports direct IO; ZFS is one that doesn't.  I love ZFS, and its
caching is not simply a dumb extension to shared_buffers that you
have to go through syscalls to reach: it has state-of-the-art page
reclamation, cached data can be LZ4-compressed, and there is an
optional second-level cache that can live on fast storage.

Perhaps if you patched PostgreSQL to tell the OS that you won't need
pages you've just read, and that you will need pages you've just
evicted, you might be able to straighten out some of that U shape by
getting more exclusive caching at the two levels (a rough sketch of
that dance appears at the end of this mail).  Queued writes would
still be double-buffered, of course, at least until they complete.
Telling the OS to prefetch something you already have a copy of is
annoying and expensive, though.

The pie-in-the-sky version of this idea would let you "swap" pages
with the kernel, as you put it, though I was thinking of clean pages,
not dirty ones.  Then there'd be a non-overlapping set of pages from
your select-only pgbench in each cache.
<crackpot-vapourware-OS>Maybe that would look like punread(fd, buf,
size, offset) (!), or maybe write(fd, buf, size) followed by
fadvise(fd, offset, size,
FADV_I_PERSONALLY_GUARANTEE_THIS_DATA_IS_CLEAN_AND_I_CONSIDERED_CONCURRENCY_VERY_CAREFULLY),
or maybe pswap(read params ..., unread params ...) to read the new
buffer and unread the old buffer at the same time (also sketched
below).</crackpot-vapourware-OS>

Sadly, even if the simple non-pie-in-the-sky version of the above
were to work out and be beneficial on your favourite non-COW
filesystem (on which you might as well use direct IO and larger
shared_buffers, some day), it may currently be futile on ZFS, because
I think the fadvise machinery might not even be hooked up there
(Solaris didn't believe in fadvise on any filesystem, IIRC).  Not
sure; I hope I'm wrong about that.
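To make the first idea concrete, here's roughly what the fadvise
dance could look like, using the real posix_fadvise() interface.  To
be clear, this is only a sketch: the fd/block-number plumbing and the
helper names are invented for illustration, and this is not how
PostgreSQL's storage manager actually exposes these things.

    /*
     * Sketch only: the fadvise dance described above.  posix_fadvise()
     * is the real POSIX interface; everything else here is made up.
     */
    #define _POSIX_C_SOURCE 200112L
    #include <sys/types.h>
    #include <fcntl.h>

    #define BLCKSZ 8192             /* PostgreSQL's default block size */

    /*
     * After reading a block into shared_buffers: the kernel's copy is
     * now redundant, so hint that it can be dropped.
     */
    static void
    hint_after_read(int fd, off_t blocknum)
    {
        (void) posix_fadvise(fd, blocknum * BLCKSZ, BLCKSZ,
                             POSIX_FADV_DONTNEED);
    }

    /*
     * Before evicting a clean buffer: its contents are about to leave
     * shared_buffers, so ask the kernel to start caching them again.
     * This is where the problem described above bites: the kernel can
     * only fetch the block from disk, which is exactly the IO we were
     * trying to avoid.
     */
    static void
    hint_before_evict(int fd, off_t blocknum)
    {
        (void) posix_fadvise(fd, blocknum * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
    }

The "(in advance somehow)" caveat from the quote above is the hard
part: by the time you issue WILLNEED at eviction, the data is cold in
the OS cache and prefetching it back costs a disk read.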
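And for completeness, the crackpot-vapourware interfaces written out
as C declarations, so the intended semantics are explicit.  None of
these exist on any OS; the parameter lists are just my guess at what
they'd have to carry.

    #include <sys/types.h>

    /*
     * Vapourware: hand a *clean* page back to the kernel page cache.
     * The kernel may keep (fd, offset) readable later without IO, but
     * must never mark it dirty or write it back.
     */
    ssize_t punread(int fd, const void *buf, size_t size, off_t offset);

    /*
     * Vapourware: combined read + unread, filling one buffer while
     * handing another back, i.e. swapping a page between
     * shared_buffers and the kernel cache in a single call.
     */
    ssize_t pswap(int rfd, void *rbuf, size_t rsize, off_t roffset,
                  int ufd, const void *ubuf, size_t usize, off_t uoffset);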
-- 
Thomas Munro
http://www.enterprisedb.com