Thread: Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
> I don't think the bgwriter is going to be able to keep up with I/O bound
> backends, but I do think it can scan and set those booleans fast enough
> for the backends to then perform the writes.

As long as the bgwriter does not do sync writes (which it does not, since
that would need a whole lot of work to be performant), it calls write(),
which returns more or less at once. So the bottleneck can only be the
fsync. Of those you would want at least one in flight per pg disk, in
parallel. But I think it should really be left to the OS when it actually
does the IO for the writes issued by the bgwriter in between checkpoints.
So Imho the target should be to have not much IO open for the checkpoint,
so the fsync is fast enough, even if serial.

Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> So Imho the target should be to have not much IO open for the checkpoint,
> so the fsync is fast enough, even if serial.

The best we can do is push out dirty pages with write() via the bgwriter
and hope that the kernel will see fit to write them before checkpoint time
arrives. I am not sure if that hope has basis in fact or if it's just
wishful thinking. Most likely, if it does have basis in fact, it's because
there is a standard syncer daemon forcing a sync() every thirty seconds.

That means that instead of an I/O storm every checkpoint interval, we get
a smaller I/O storm every 30 seconds. Not sure this is a big improvement.
Jan already found out that issuing very frequent sync()s isn't a win.

People keep saying that the bgwriter mustn't write pages synchronously
because it'd be bad for performance, but I think that analysis is faulty.
Performance of what --- the bgwriter? Nonsense, the *point* of the
bgwriter is to do the slow tasks. The only argument that has any merit is
that O_SYNC or immediate fsync will prevent us from having multiple writes
outstanding and thus reduce the efficiency of disk write scheduling. This
is a valid point, but there is a limit to how many writes we need to have
in flight to keep things flowing smoothly.

What I'm thinking now is that the bgwriter should issue frequent fsyncs
for its writes --- not immediate, but a lot more often than once per
checkpoint. Perhaps take one recently written, unsynced file and fsync it
every time it is about to sleep. You could imagine various rules for
deciding which one to sync; perhaps the one with the most writes issued
against it since the last sync. When we have tablespaces it'd make sense
to try to distribute the syncs across tablespaces, on the assumption that
the tablespaces are probably on different drives.

			regards, tom lane
On Thursday 05 February 2004 20:24, Tom Lane wrote:
> "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> > So Imho the target should be to have not much IO open for the checkpoint,
> > so the fsync is fast enough, even if serial.
>
> The best we can do is push out dirty pages with write() via the bgwriter
> and hope that the kernel will see fit to write them before checkpoint
> time arrives. I am not sure if that hope has basis in fact or if it's
> just wishful thinking. Most likely, if it does have basis in fact it's
> because there is a standard syncer daemon forcing a sync() every thirty
> seconds.

There are other benefits of writing pages earlier even though they might
not get synced immediately. It would tell the kernel that this is the
latest copy of the updated buffer. The kernel VFS should make that copy
visible to every other backend as well, and the buffer manager would
fetch the updated copy from the VFS cache next time --- all without
actually going to disk (within the 30-second window, of course).

> People keep saying that the bgwriter mustn't write pages synchronously
> because it'd be bad for performance, but I think that analysis is
> faulty. Performance of what --- the bgwriter? Nonsense, the *point*
> of the bgwriter is to do the slow tasks. The only argument that has
> any merit is that O_SYNC or immediate fsync will prevent us from having
> multiple writes outstanding and thus reduce the efficiency of disk
> write scheduling. This is a valid point but there is a limit to how
> many writes we need to have in flight to keep things flowing smoothly.

Is it a valid assumption, for the platforms that PostgreSQL supports,
that a write call makes changes visible across processes?

> What I'm thinking now is that the bgwriter should issue frequent fsyncs
> for its writes --- not immediate, but a lot more often than once per
> checkpoint.

Frequent fsyncs overall, or frequent fsyncs per file descriptor written
to? I thought it was the latter. Just a thought.

Shridhar
Shridhar Daithankar <shridhar@frodo.hserus.net> writes:
> There are other benefits of writing pages earlier even though they might
> not get synced immediately.

Such as?

> It would tell the kernel that this is the latest copy of the updated
> buffer. The kernel VFS should make that copy visible to every other
> backend as well, and the buffer manager would fetch the updated copy
> from the VFS cache next time --- all without actually going to disk
> (within the 30-second window, of course).

This seems quite irrelevant given the way we handle shared buffers.

> Frequent fsyncs overall, or frequent fsyncs per file descriptor written
> to? I thought it was the latter.

You can only fsync one FD at a time (too bad ... if there were a
multi-file-fsync API it'd solve the overspecified-write-ordering issue).

			regards, tom lane
Tom Lane wrote:
> "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
>> So Imho the target should be to have not much IO open for the checkpoint,
>> so the fsync is fast enough, even if serial.
>
> The best we can do is push out dirty pages with write() via the bgwriter
> and hope that the kernel will see fit to write them before checkpoint
> time arrives. I am not sure if that hope has basis in fact or if it's
> just wishful thinking. Most likely, if it does have basis in fact it's
> because there is a standard syncer daemon forcing a sync() every thirty
> seconds.

Looking at the response time charts I did to show how vacuum delay is
doing, it seems that, at least on Linux, there is hope that that is the
case. Those charts were made with just a regular 5-minute checkpoint,
enough checkpoint segments for that, and no other sync effort at all.

The system has a hard time handling a larger scaled test DB, so it is
definitely well saturated with IO. The charts are here:

    http://developer.postgresql.org/~wieck/vacuum_cost/

> That means that instead of an I/O storm every checkpoint interval,
> we get a smaller I/O storm every 30 seconds. Not sure this is a big
> improvement. Jan already found out that issuing very frequent sync()s
> isn't a win.

In none of those charts can I see any checkpoint-caused IO storm anymore.
Charts I'm currently doing for 7.4.1 show extremely clear spikes at
checkpoints. If someone is interested in those as well, I will put them
up.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck wrote:
> Tom Lane wrote:
> > The best we can do is push out dirty pages with write() via the bgwriter
> > and hope that the kernel will see fit to write them before checkpoint
> > time arrives.
>
> Looking at the response time charts I did for showing how vacuum delay
> is doing, it seems at least on Linux there is hope that that is the
> case. Those charts have just a regular 5 minute checkpoint with enough
> checkpoint segments for that, and no other sync effort done at all.
>
> In none of those charts I can see any checkpoint caused IO storm any
> more. Charts I'm currently doing for 7.4.1 show extremely clear spikes
> at checkpoints. If someone is interested in those as well I will put
> them up.

So, Jan, are you basically saying that the background writer has solved
the checkpoint I/O flood problem, and we just need to deal with changing
sync to multiple fsync's at checkpoint?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
> Jan Wieck wrote:
>> In none of those charts I can see any checkpoint caused IO storm any
>> more. Charts I'm currently doing for 7.4.1 show extremely clear spikes
>> at checkpoints. If someone is interested in those as well I will put
>> them up.
>
> So, Jan, are you basically saying that the background writer has solved
> the checkpoint I/O flood problem, and we just need to deal with changing
> sync to multiple fsync's at checkpoint?

ISTM that the background writer at least has the ability to lower the
impact of a checkpoint significantly enough that one might not care
about it any more.

"Has the ability" means it needs to be adjusted to the actual DB usage.
The charts I produced were not done with the default settings, but
rather after making the bgwriter a bit more aggressive against dirty
pages.

The whole sync() vs. fsync() discussion is, in my opinion, nonsense at
this point. Without the ability to limit the number of files to
something reasonable, by employing tablespaces in the form of larger
container files, the risk of forcing excessive head movement is simply
too high.

Jan
Jan Wieck <JanWieck@Yahoo.com> writes:
> The whole sync() vs. fsync() discussion is in my opinion nonsense at
> this point.

The sync vs. fsync discussion is not about performance, it is about
correctness. You can't simply dismiss the fact that we don't know
whether a checkpoint is really complete when we write the checkpoint
record.

I liked the idea put forward by (I think) Kevin Brown, that we issue
sync() to start the I/O and then a bunch of fsyncs to wait for it to
finish. If sync behaves per spec ("all the I/O is scheduled upon
return") then the fsyncs will not affect I/O ordering in the least, but
they will ensure that we don't proceed until the I/O is all done.

Also there is the Windows-port problem of not having sync() available.
Doing only the fsyncs will provide an adequate, if possibly
lower-performing, solution there.

			regards, tom lane
Jan Wieck <JanWieck@Yahoo.com> writes:
> The whole sync() vs. fsync() discussion is in my opinion nonsense at this
> point. Without the ability to limit the amount of files to a reasonable
> number, by employing tablespaces in the form of larger container files,
> the risk of forcing excessive head movement is simply too high.

I don't think there was any suggestion of conflating tablespaces with
implementing a filesystem in postgres.

Tablespaces are just a database entity with which database stored
objects like tables and indexes are associated. They group database
stored objects and control the storage method and location.

The existing storage mechanism, namely a directory with a file for each
database object, is perfectly adequate and doesn't have to be replaced
to implement tablespaces. All that's needed is that the location of the
directory be associated with the "tablespace" of the object rather than
be a global constant.

Implementing an Oracle-style filesystem is just one more temptation to
reimplement OS services in the database. Personally I think it's an
awful idea. But even if postgres did it as an option, it wouldn't
necessarily have anything to do with tablespaces.

--
greg
Greg Stark wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>
>> The whole sync() vs. fsync() discussion is in my opinion nonsense at this
>> point. Without the ability to limit the amount of files to a reasonable
>> number, by employing tablespaces in the form of larger container files,
>> the risk of forcing excessive head movement is simply too high.
>
> The existing storage mechanism, namely a directory with a file for each
> database object, is perfectly adequate and doesn't have to be replaced to
> implement tablespaces. All that's needed is that the location of the
> directory be associated with the "tablespace" of the object rather than
> be a global constant.

This is not just a question of what you call it. In a system with, let's
say, 500 active backends on a database with, let's say, 1000 things that
are represented as a file, you'll need half a million virtual file
descriptors.

Jan
Jan Wieck <JanWieck@Yahoo.com> writes:
> In a system with, let's say, 500 active backends on a database with,
> let's say, 1000 things that are represented as a file, you'll need half
> a million virtual file descriptors.

[shrug] We've been dealing with virtual file descriptors for years.
I've seen no indication that they create any performance bottlenecks.

			regards, tom lane
Tom Lane wrote:
> You can only fsync one FD at a time (too bad ... if there were a
> multi-file-fsync API it'd solve the overspecified-write-ordering issue).

What about aio_fsync()?
Florian Weimer <fw@deneb.enyo.de> writes:
> Tom Lane wrote:
>> You can only fsync one FD at a time (too bad ... if there were a
>> multi-file-fsync API it'd solve the overspecified-write-ordering issue).

> What about aio_fsync()?

(1) It's unportable; (2) it's not clear that it's any improvement over
fsync(). The Single Unix Spec says aio_fsync "returns when the
synchronisation request has been initiated or queued to the file or
device". Depending on how the implementation works, this may mean that
all the dirty blocks have been scheduled for I/O and will be written
ahead of subsequently scheduled blocks --- if so, the results are not
really different from fsync()'ing the files in the same order.

The best idea I've heard so far is the one about sync() followed by a
bunch of fsync()s. That seems to be correct, efficient, and dependent
only on very-long-established Unix semantics.

			regards, tom lane
Tom Lane wrote:
> The best idea I've heard so far is the one about sync() followed by
> a bunch of fsync()s. That seems to be correct, efficient, and dependent
> only on very-long-established Unix semantics.

Agreed.
> Jan Wieck wrote:
>> In a system with let's say 500 active backends on a database with
>> let's say 1000 things that are represented as a file, you'll need
>> half a million virtual file descriptors.

I'm sort of a purist: I think that operating systems should be operating
systems and applications should be applications. Whenever you try to do
application-like things in an OS, it is a mistake. Whenever you try to
do OS-like things in an application, it, too, is a mistake.

Say a database has close to a thousand active files and you have 100
concurrent users. Why do you think that this could be handled better in
an application?
Are you saying that PostgreSQL could do a better job at managing 1/2 million shared file descriptors than the OS? Your example, IMHO, points out why you *shouldn't* try to have a dedicated file system.