Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers
From | Kevin Grittner |
---|---|
Subject | Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance |
Date | |
Msg-id | 1389710563.31874.YahooMailNeo@web122303.mail.ne1.yahoo.com |
In response to | Re: Linux kernel impact on PostgreSQL performance (Kevin Grittner <kgrittn@ymail.com>) |
Responses | Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance |
List | pgsql-hackers |
First off, I want to give a +1 on everything in the recent posts from Heikki and Hannu.

Jan Kara <jack@suse.cz> wrote:

> Now the aging of pages marked as volatile as it is currently
> implemented needn't be perfect for your needs but you still have
> time to influence what gets implemented... Actually developers of
> the vrange() syscall were specifically looking for some ideas
> what to base aging on. Currently I think it is first marked -
> first evicted.

The "first marked - first evicted" seems like what we would want. The ability to "unmark" and have the page no longer be considered preferred for eviction would be very nice. That seems to me like it would cover the multiple layers of buffering *clean* pages very nicely (although I know nothing more about vrange() than what has been said on this thread, so I could be missing something).

The other side of that is related to avoiding multiple writes of the same page as much as possible, while avoiding write gluts. The issue here is that PostgreSQL tries to hang on to dirty pages for as long as possible before "writing" them to the OS cache, while the OS tries to avoid writing them to storage for as long as possible until they reach a (configurable) threshold or are fsync'd. The problem is that under various conditions PostgreSQL may need to write and fsync a lot of dirty pages it has accumulated in a short time. That has an "avalanche" effect, creating a "write glut" which can stall all I/O for a period of many seconds up to a few minutes. If the OS were aware of the dirty pages pending write in the application, and counted those for purposes of calculating when and how much to write, the glut could be avoided. Currently, people configure the PostgreSQL background writer to be very aggressive, configure a small PostgreSQL shared_buffers setting, and/or set the OS thresholds low enough to minimize the problem; but all of these mitigation strategies have their own costs.
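To make the "first marked - first evicted" behavior with an "unmark" operation concrete, here is a minimal toy model in Python. It is only a sketch of the policy as described on this thread; the class and method names (`VolatileSet`, `mark`, `unmark`, `evict_one`) are hypothetical and are not the vrange() interface.

```python
from collections import OrderedDict


class VolatileSet:
    """Toy model of pages flagged as preferred eviction candidates.

    Hypothetical illustration only; this is not the vrange() interface.
    """

    def __init__(self):
        # Insertion order records the order in which pages were marked.
        self._marked = OrderedDict()

    def mark(self, page):
        """Flag a page as preferred for eviction (no-op if already marked)."""
        self._marked.setdefault(page, True)

    def unmark(self, page):
        """Withdraw the flag; returns True if the page was still marked."""
        return self._marked.pop(page, None) is not None

    def evict_one(self):
        """Evict the earliest-marked page, or return None if none is marked."""
        if not self._marked:
            return None
        page, _ = self._marked.popitem(last=False)
        return page


vs = VolatileSet()
for page in ("A", "B", "C"):
    vs.mark(page)
vs.unmark("A")             # page A is needed again
first = vs.evict_one()     # "B": the earliest page still marked
second = vs.evict_one()    # "C"
third = vs.evict_one()     # None: nothing left to evict
```

In this model an unmarked page simply leaves the candidate queue, so it no longer competes for eviction with pages that are still marked.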
A new hint that the application has dirtied a page could be used by the OS to improve things this way: When the OS is notified that a page is dirty, it takes action depending on whether the page is considered dirty by the OS.

If it is not dirty, the page is immediately discarded from the OS cache. It is known that the application has a modified version of the page that it intends to write, so the version in the OS cache has no value. We don't want this page forcing eviction of vrange()-flagged pages.

If it is dirty, any write ordering to storage by the OS based on when the page was written to the OS would be pushed back as far as possible without crossing any write barriers, in hopes that the writes could be combined.

Either way, this page is counted toward dirty pages for purposes of calculating how much to write from the OS to storage, and the later write of the page doesn't redundantly add to this number.

The combination of these two changes could boost PostgreSQL performance quite a bit, at least for some common workloads.

The MMAP approach always seems tempting on first blush, but the need to "pin" pages and the need to assure that dirty pages are not written ahead of the WAL-logging of those pages makes it hard to see how we can use it. The "pin" means that we need to ensure that a particular 8KB page remains available for direct reference by all PostgreSQL processes until it is "unpinned". The other thing we would need is the ability to modify a page with a solid assurance that the modified page would *not* be written to disk until we authorize it. The page would remain pinned until we do authorize the write, at which point the changes are available to be written, but can wait for an fsync or an accumulation of sufficient dirty pages to cross the write threshold.

Next comes the hard part. The page may or may not be unpinned after that, and if it remains pinned or is pinned again, there may be further changes to the page.
While the prior changes can be written (and *must* be written for an fsync), these new changes must *not* be written until we authorize it. If MMAP can be made to handle that, we could probably use it (and some of the previously-discussed techniques might not be needed), but my understanding is that there is currently no way to do so.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
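The dirty-page hint proposed in the message above could be modeled roughly as follows. This is a toy simulation under stated assumptions, not a kernel interface: `OSCacheModel` and its methods are hypothetical names, and write barriers are not modeled at all.

```python
class OSCacheModel:
    """Toy model of the proposed hint; all names are hypothetical."""

    def __init__(self):
        self.cached = set()      # pages the OS holds, clean or dirty
        self.os_dirty = set()    # pages the OS itself considers dirty
        self.app_dirty = set()   # pages the application reported as dirty
        self.deferred = set()    # OS-dirty pages whose write-out is pushed back

    def app_dirtied(self, page):
        """Handle the hint that the application modified its copy of `page`."""
        if page in self.os_dirty:
            # Delay the OS write (within write barriers, not modeled here)
            # in the hope that it can be combined with the later write.
            self.deferred.add(page)
        else:
            # The OS copy is clean and now stale, so it has no value: drop it.
            self.cached.discard(page)
        self.app_dirty.add(page)

    def app_wrote(self, page):
        """The application writes the page into the OS cache."""
        self.app_dirty.discard(page)
        self.deferred.discard(page)
        self.cached.add(page)
        self.os_dirty.add(page)

    def dirty_count(self):
        """Pages counted toward write-back thresholds, each counted once."""
        return len(self.os_dirty | self.app_dirty)


cache = OSCacheModel()
cache.cached.update({1, 2})
cache.os_dirty.add(2)
cache.app_dirtied(1)                 # clean in the OS: dropped, but counted
cache.app_dirtied(2)                 # dirty in the OS: deferred, counted once
count_before = cache.dirty_count()   # 2
cache.app_wrote(1)                   # the later write adds nothing new
count_after = cache.dirty_count()    # still 2
```

Taking the union of the two dirty sets is what keeps a page from being counted twice when the application eventually writes it.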