Thread: TODO item
In the TODO file:

	* -Allow transaction commits with rollback with no-fsync performance [fsync](Vadim)

Has this been done in current? I see almost no performance
improvement on copying data into a table.
--
Tatsuo Ishii
> In the TODO file:
>
> 	* -Allow transaction commits with rollback with no-fsync performance [fsync](Vadim)
>
> Has this been done in current? I see almost no performance
> improvement on copying data into a table.

TODO updated. That was part of MVCC which originally was supposed to be
in 7.0.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > In the TODO file:
> >
> > 	* -Allow transaction commits with rollback with no-fsync performance [fsync](Vadim)
> >
> > Has this been done in current? I see almost no performance
> > improvement on copying data into a table.
>
> TODO updated. That was part of MVCC which originally was supposed to be
> in 7.0.

Thanks.

BTW, I have worked a little bit on this item. The idea is pretty
simple. Instead of doing a real fsync() in pg_fsync(), just marking it
so that we remember to do fsync() at the commit time. Following
patches illustrate the idea. An experience shows that it dramatically
boosts the performance of copy. Unfortunately I see virtually no
difference for TPC-B like small many concurrent transactions. Maybe we
would need WAL for this. Comments?

Index: access/transam/xact.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/access/transam/xact.c,v
retrieving revision 1.60
diff -c -r1.60 xact.c
*** access/transam/xact.c	2000/01/29 16:58:29	1.60
--- access/transam/xact.c	2000/02/06 06:12:58
***************
*** 639,644 ****
--- 639,646 ----
  		if (SharedBufferChanged)
  		{
  			FlushBufferPool();
+ 			pg_fsync_pending();
+ 
  			if (leak)
  				ResetBufferPool();
  
***************
*** 653,658 ****
--- 655,661 ----
  			 */
  			leak = BufferPoolCheckLeak();
  			FlushBufferPool();
+ 			pg_fsync_pending();
  		}
  
  		if (leak)
Index: storage/file/fd.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/storage/file/fd.c,v
retrieving revision 1.52
diff -c -r1.52 fd.c
*** storage/file/fd.c	2000/01/26 05:56:55	1.52
--- storage/file/fd.c	2000/02/06 06:13:01
***************
*** 189,202 ****
  static File fileNameOpenFile(FileName fileName, int fileFlags, int fileMode);
  static char *filepath(char *filename);
  static long pg_nofile(void);
  
  /*
   * pg_fsync --- same as fsync except does nothing if -F switch was given
   */
  int
  pg_fsync(int fd)
  {
! 	return disableFsync ? 0 : fsync(fd);
  }
  
  /*
--- 189,238 ----
  static File fileNameOpenFile(FileName fileName, int fileFlags, int fileMode);
  static char *filepath(char *filename);
  static long pg_nofile(void);
+ static void alloc_fsync_info(void);
  
+ static char *fsync_request;
+ static int nfds;
+ 
  /*
   * pg_fsync --- same as fsync except does nothing if -F switch was given
   */
  int
  pg_fsync(int fd)
+ {
+ 	if (fsync_request == NULL)
+ 		alloc_fsync_info();
+ 	fsync_request[fd] = 1;
+ 	return 0;
+ }
+ 
+ static void alloc_fsync_info(void)
+ {
+ 	nfds = pg_nofile();
+ 	fsync_request = malloc(nfds);
+ 	if (fsync_request == NULL) {
+ 		elog(ERROR, "alloc_fsync_info: cannot allocate memory");
+ 		return;
+ 	}
+ }
+ 
+ void
+ pg_fsync_pending(void)
  {
! 	int	i;
! 
! 	if (disableFsync)
! 		return;
! 
! 	if (fsync_request == NULL)
! 		alloc_fsync_info();
! 
! 	for (i = 0; i < nfds; i++) {
! 		if (fsync_request[i]) {
! 			fsync(i);
! 			fsync_request[i] = 0;
! 		}
! 	}
  }
  
  /*
> BTW, I have worked a little bit on this item. The idea is pretty
> simple. Instead of doing a real fsync() in pg_fsync(), just marking it
> so that we remember to do fsync() at the commit time. Following
> patches illustrate the idea. An experience shows that it dramatically
> boosts the performance of copy. Unfortunately I see virtually no
> difference for TPC-B like small many concurrent transactions. Maybe we
> would need WAL for this. Comments?

Can you be more specific? How does fsync work now vs. your proposed
change? I did not see that here. Sorry.

--
  Bruce Momjian
> -----Original Message-----
> From: owner-pgsql-hackers@postgresql.org
> [mailto:owner-pgsql-hackers@postgresql.org]On Behalf Of Tatsuo Ishii
>
> > > In the TODO file:
> > >
> > > 	* -Allow transaction commits with rollback with no-fsync
> > > 	  performance [fsync](Vadim)
> > >
> > > Has this been done in current? I see almost no performance
> > > improvement on copying data into a table.
> >
> > TODO updated. That was part of MVCC which originally was supposed to be
> > in 7.0.
>
> Thanks.
>
> BTW, I have worked a little bit on this item. The idea is pretty
> simple. Instead of doing a real fsync() in pg_fsync(), just marking it
> so that we remember to do fsync() at the commit time. Following

This seems not good, unfortunately. Note that the backend which calls
pg_fsync() for a relation file may be different from the backend which
updated shared buffers of the file. The former backend wouldn't
necessarily be committed when the latter backend is committed.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
> > BTW, I have worked a little bit on this item. The idea is pretty
> > simple. Instead of doing a real fsync() in pg_fsync(), just marking it
> > so that we remember to do fsync() at the commit time. Following
> > patches illustrate the idea. An experience shows that it dramatically
> > boosts the performance of copy. Unfortunately I see virtually no
> > difference for TPC-B like small many concurrent transactions. Maybe we
> > would need WAL for this. Comments?
>
> Can you be more specific. How does fsync work now vs. your proposed
> change. I did not see that here. Sorry.

As already pointed out by many people, the current buffer manager is not
very smart about flushing out dirty pages. From TODO.detail/fsync:

>This is the problem of buffer manager, known for very long time:
>when copy eats all buffers, manager begins write/fsync each
>durty buffer to free buffer for new data. All updated relations
>should be fsynced _once_ @ transaction commit. You would get
>the same results without -F...

With my changes, pg_fsync would just mark the relation (actually its
file descriptor) as needing fsync, instead of calling the real fsync.
Upon transaction commit, the mark would be checked and relations are
fsynced if necessary.

BTW, Hiroshi has raised a question about my changes, and I have written
to him (in Japanese, of course :-) to make sure what I'm missing here.
I will let you know the result later.
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>>>> BTW, I have worked a little bit on this item. The idea is pretty
>>>> simple. Instead of doing a real fsync() in pg_fsync(), just marking it
>>>> so that we remember to do fsync() at the commit time. Following
>>>> patches illustrate the idea.

In the form you have shown it, it would be completely useless, for two
reasons:

1. It doesn't guarantee that the right files are fsync'd. It would
in fact fsync whichever files happen to be using the same kernel file
descriptor numbers at the close of the transaction as the ones you
really wanted to fsync were using at the time fsync was requested.

2. It doesn't guarantee that the files are fsync'd in the right order.
Per my discussion a few days ago, the only reason for doing fsync at
all is to guarantee that the data pages touched by a transaction get
flushed to disk before the pg_log update claiming that the transaction
is done gets flushed to disk. A change like this completely destroys
that ordering, since pg_fsync_pending has no idea which fd is pg_log.

You could possibly fix #1 by logging fsync requests at the vfd level;
then, whenever a vfd is closed to free up a kernel fd, check the fsync
flag and execute the pending fsync before closing the file. You could
possibly fix #2 by having transaction commit invoke the pg_fsync_pending
scan before it updates pg_log (and then fsyncing pg_log itself again
after).

(Actually, you could probably eliminate the notion of "fsync request"
entirely, and simply have each vfd get marked "dirty" automatically when
written to. Both closing a vfd and the scan at xact commit would look
at the dirty bit to decide to do fsync.)

What would still need to be thought about is whether this scheme
preserves the ordering guarantee when a group of concurrent backends
is considered, rather than one backend in isolation. (I believe that
fsync() will apply to all dirty kernel buffers for a file, not just
those dirtied by the requesting process, so each backend's fsyncs can
affect the order in which other backends' writes hit the disk.)
Offhand I do not see any problems there, but it's the kind of thing
that requires more than offhand thought...

			regards, tom lane
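[Editor's note: the per-vfd dirty-bit scheme Tom sketches above can be illustrated as follows. All names here (my_vfd, my_file_write, and so on) are hypothetical stand-ins for illustration, not PostgreSQL's actual fd.c code. A write marks the vfd dirty; both closing a vfd and the commit-time scan consult the dirty bit, so the pending fsync can never attach to a recycled descriptor number.]

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define MAX_VFD 64

typedef struct
{
	int fd;     /* kernel fd, or -1 if this vfd slot is unused */
	int dirty;  /* written to since last fsync? */
} my_vfd;

static my_vfd vfd_table[MAX_VFD];

void my_vfd_init(void)
{
	for (int v = 0; v < MAX_VFD; v++)
		vfd_table[v].fd = -1;
}

/* Any write automatically marks the vfd dirty ("fsync request"). */
ssize_t my_file_write(int v, const void *buf, size_t len)
{
	vfd_table[v].dirty = 1;
	return write(vfd_table[v].fd, buf, len);
}

/* Closing a vfd to recycle its kernel fd executes the pending fsync
 * first, so the request never attaches to a stale descriptor number. */
void my_file_close(int v)
{
	if (vfd_table[v].dirty)
		fsync(vfd_table[v].fd);
	close(vfd_table[v].fd);
	vfd_table[v].fd = -1;
	vfd_table[v].dirty = 0;
}

/* At commit, before pg_log is updated: fsync every dirty vfd. */
void my_fsync_pending(void)
{
	for (int v = 0; v < MAX_VFD; v++)
		if (vfd_table[v].fd >= 0 && vfd_table[v].dirty)
		{
			fsync(vfd_table[v].fd);
			vfd_table[v].dirty = 0;
		}
}
```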
> You could possibly fix #1 by logging fsync requests at the vfd level;
> then, whenever a vfd is closed to free up a kernel fd, check the fsync
> flag and execute the pending fsync before closing the file. You could
> possibly fix #2 by having transaction commit invoke the pg_fsync_pending
> scan before it updates pg_log (and then fsyncing pg_log itself again
> after).
>
> (Actually, you could probably eliminate the notion of "fsync request"
> entirely, and simply have each vfd get marked "dirty" automatically when
> written to. Both closing a vfd and the scan at xact commit would look
> at the dirty bit to decide to do fsync.)
>
> What would still need to be thought about is whether this scheme
> preserves the ordering guarantee when a group of concurrent backends
> is considered, rather than one backend in isolation. (I believe that
> fsync() will apply to all dirty kernel buffers for a file, not just
> those dirtied by the requesting process, so each backend's fsyncs can
> affect the order in which other backends' writes hit the disk.)
> Offhand I do not see any problems there, but it's the kind of thing
> that requires more than offhand thought...

Glad someone is looking into this. Seems the above concern about
ordering is fine, because it is only marking the pg_log transactions as
committed that is important. You can fsync anything you want; you just
need to make sure your current transaction's buffers are fsync'ed before
you mark the transaction as complete.

--
  Bruce Momjian
Tom Lane wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> >>>> BTW, I have worked a little bit on this item. The idea is pretty
> >>>> simple. Instead of doing a real fsync() in pg_fsync(), just marking it
> >>>> so that we remember to do fsync() at the commit time. Following
> >>>> patches illustrate the idea.
>
> What would still need to be thought about is whether this scheme
> preserves the ordering guarantee when a group of concurrent backends
> is considered, rather than one backend in isolation. (I believe that
> fsync() will apply to all dirty kernel buffers for a file, not just
> those dirtied by the requesting process, so each backend's fsyncs can
> affect the order in which other backends' writes hit the disk.)
> Offhand I do not see any problems there, but it's the kind of thing
> that requires more than offhand thought...

The following is an example of what I first pointed out. I am talking
about PostgreSQL shared buffers, not kernel buffers.

Session-1
	begin;
	update A ...;

Session-2
	begin;
	select * from B ...;
		There's no PostgreSQL shared buffer available.
		This backend has to force the flush of a free buffer
		page. Unfortunately the page was dirtied by the
		above operation of Session-1, so it calls pg_fsync()
		for the table A. However fsync() is postponed until
		commit of this backend.

Session-1
	commit;
		There's no dirty buffer page for the table A,
		so pg_fsync() isn't called for the table A.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
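[Editor's note: Hiroshi's race can be made concrete with a toy model. Nothing below is real PostgreSQL code; "relations" are bit positions, and each backend keeps its own deferred-fsync set, as in the proposed patch. The point is that the fsync request lands in the evicting backend's set, not in the set of the transaction that dirtied the page.]

```c
#include <assert.h>

enum { NBACKENDS = 2, REL_A = 0 };

static int pending[NBACKENDS];          /* per-backend deferred fsync requests */
static int synced_at_commit[NBACKENDS]; /* what each backend fsync'd at commit */

/* Backend `evictor` steals a free buffer by flushing a dirty page of
 * relation `rel`; the deferred-fsync request is recorded in the
 * EVICTOR's set, even though another backend dirtied the page. */
void flush_dirty_buffer(int evictor, int rel)
{
	pending[evictor] |= (1 << rel);
}

void commit(int backend)
{
	/* "fsync" everything this backend marked -- and nothing else */
	synced_at_commit[backend] = pending[backend];
	pending[backend] = 0;
}
```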
> 1. It doesn't guarantee that the right files are fsync'd. It would
> in fact fsync whichever files happen to be using the same kernel
> file descriptor numbers at the close of the transaction as the ones
> you really wanted to fsync were using at the time fsync was requested.

Right. If a VFD is reused, the fd would not point to the same file
anymore.

> You could possibly fix #1 by logging fsync requests at the vfd level;
> then, whenever a vfd is closed to free up a kernel fd, check the fsync
> flag and execute the pending fsync before closing the file. You could
> possibly fix #2 by having transaction commit invoke the pg_fsync_pending
> scan before it updates pg_log (and then fsyncing pg_log itself again
> after).

I do not understand #2. I call pg_fsync_pending twice in
RecordTransactionCommit: one is after FlushBufferPool, and the other
is after TransactionIdCommit and FlushBufferPool. Or am I missing
something?

> What would still need to be thought about is whether this scheme
> preserves the ordering guarantee when a group of concurrent backends
> is considered, rather than one backend in isolation. (I believe that
> fsync() will apply to all dirty kernel buffers for a file, not just
> those dirtied by the requesting process, so each backend's fsyncs can
> affect the order in which other backends' writes hit the disk.)
> Offhand I do not see any problems there, but it's the kind of thing
> that requires more than offhand thought...

I thought about that too. If the ordering were that important, a
database managed by backends with -F on could be seriously corrupted.
I've never heard of such disasters caused by -F, so my conclusion was
that it's safe, or I have been very lucky. Note that I'm not talking
about pg_log vs. relations but the ordering among relations.

BTW, Hiroshi has pointed out an excellent point #3:

>Session-1
>begin;
>update A ...;
>
>Session-2
>begin;
>select * from B ...;
>	There's no PostgreSQL shared buffer available.
>	This backend has to force the flush of a free buffer
>	page. Unfortunately the page was dirtied by the
>	above operation of Session-1 and calls pg_fsync()
>	for the table A. However fsync() is postponed until
>	commit of this backend.
>
>Session-1
>commit;
>	There's no dirty buffer page for the table A.
>	So pg_fsync() isn't called for the table A.

Seems there's no easy solution for this. Maybe now is the time to give
up my idea...
--
Tatsuo Ishii
> BTW, Hiroshi has pointed out an excellent point #3:
>
> >Session-1
> >begin;
> >update A ...;
> >
> >Session-2
> >begin;
> >select * from B ...;
> >	There's no PostgreSQL shared buffer available.
> >	This backend has to force the flush of a free buffer
> >	page. Unfortunately the page was dirtied by the
> >	above operation of Session-1 and calls pg_fsync()
> >	for the table A. However fsync() is postponed until
> >	commit of this backend.
> >
> >Session-1
> >commit;
> >	There's no dirty buffer page for the table A.
> >	So pg_fsync() isn't called for the table A.
>
> Seems there's no easy solution for this. Maybe now is the time to give
> up my idea...

I hate to see you give up on this.

Don't tell me we fsync on every buffer write, and not just at
transaction commit? That is terrible.

What if we set a flag on the file descriptor stating we dirtied/wrote
one of its buffers during the transaction, and cycle through the file
descriptors on buffer commit and fsync all involved in the transaction?
We also fsync if we close a file descriptor that was involved in the
transaction. We clear the "involved in this transaction" flag on commit
too.

--
  Bruce Momjian
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> possibly fix #2 by having transaction commit invoke the pg_fsync_pending
>> scan before it updates pg_log (and then fsyncing pg_log itself again
>> after).

> I do not understand #2. I call pg_fsync_pending twice in
> RecordTransactionCommit, one is after FlushBufferPool, and the other
> is after TransactionIdCommit and FlushBufferPool. Or am I missing
> something?

Oh, OK. That's what I meant. The snippet you posted didn't show where
you were calling the fsync routine from.

> I thought about that too. If the ordering was that important, a
> database managed by backends with -F on could be seriously
> corrupted. I've never heard of such disasters caused by -F.

This is why I think that fsync actually offers very little extra
protection ;-)

> BTW, Hiroshi has noticed me an excellent point #3:
>> This backend has to force the flush of a free buffer
>> page. Unfortunately the page was dirtied by the
>> above operation of Session-1 and calls pg_fsync()
>> for the table A. However fsync() is postponed until
>> commit of this backend.
>>
>> Session-1
>> commit;
>> There's no dirty buffer page for the table A.
>> So pg_fsync() isn't called for the table A.

Oooh, right. Backend A dirties the page, but leaves it sitting in
shared buffer. Backend B needs the buffer space, so it does the fwrite
of the page. Now if backend A wants to commit, it can fsync everything
it's written --- but does that guarantee the page that was actually
written by B will get flushed to disk? Not sure.

If the pending-fsync logic is based on either physical fds or vfds then
it definitely *won't* work; A might have found the desired page sitting
in buffer cache to begin with, and never have opened the underlying
file at all! So it seems you would need to keep a list of all the
relation files (and segments) you've written to in the current xact,
and open and fsync each one just before writing/fsyncing pg_log.

Even then, you're assuming that fsync applied to a file via an fd
belonging to one backend will flush disk buffers written to the same
file via *other* fds belonging to *other* processes. I'm not sure that
that is true on all Unixes... heck, I'm not sure it's true on any.
The fsync(2) man page here isn't real specific.

			regards, tom lane
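[Editor's note: Tom's question can be probed directly on any given platform: write through one descriptor and fsync through a second, independent open of the same file. On the systems one can easily check, fsync() acts on the underlying file rather than on the descriptor's private writes, but since the man pages are vague, treat this as a test to run, not a guarantee. cross_fd_fsync() is a hypothetical helper for illustration.]

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

int cross_fd_fsync(const char *path)
{
	int wfd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
	int sfd = open(path, O_WRONLY);   /* second, independent descriptor */
	int rc = -1;

	if (wfd >= 0 && sfd >= 0 &&
		write(wfd, "hello", 5) == 5 &&
		fsync(sfd) == 0)              /* sync via the fd that did no writing */
		rc = 0;
	if (wfd >= 0)
		close(wfd);
	if (sfd >= 0)
		close(sfd);
	return rc;
}
```

If fsync() only covered the calling descriptor's own writes, deferring fsyncs across backends could never be made safe, since the page may have been written out by a different process entirely.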
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Don't tell me we fsync on every buffer write, and not just at
> transaction commit? That is terrible.

If you don't have -F set, yup. Why did you think fsync mode was
so slow?

> What if we set a flag on the file descriptor stating we dirtied/wrote
> one of its buffers during the transaction, and cycle through the file
> descriptors on buffer commit and fsync all involved in the transaction.

That's exactly what Tatsuo was describing, I believe. I think Hiroshi
has pointed out a serious problem that would make it unreliable when
multiple backends are running: if some *other* backend fwrites the page
instead of your backend, and it doesn't fsync until *its* transaction is
done (possibly long after yours), then you lose the ordering guarantee
that is the point of the whole exercise...

			regards, tom lane
At 11:31 AM 2/7/00 -0500, Bruce Momjian wrote:
>I hate to see you give up on this.

>Don't tell me we fsync on every buffer write, and not just at
>transaction commit? That is terrible.

Won't we have many more options in this area, i.e. increasing
performance while maintaining on-disk data integrity, once WAL is
implemented?

snapshot + WAL = your database

so in theory -F on tables and the transaction log would be safe as long
as you have a snapshot, and as long as the WAL is being fsync'd and you
have the disk space to hold the WAL until you update your snapshot, no?

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Don't tell me we fsync on every buffer write, and not just at
> > transaction commit? That is terrible.
>
> If you don't have -F set, yup. Why did you think fsync mode was
> so slow?
>
> > What if we set a flag on the file descriptor stating we dirtied/wrote
> > one of its buffers during the transaction, and cycle through the file
> > descriptors on buffer commit and fsync all involved in the transaction.
>
> That's exactly what Tatsuo was describing, I believe. I think Hiroshi
> has pointed out a serious problem that would make it unreliable when
> multiple backends are running: if some *other* backend fwrites the page
> instead of your backend, and it doesn't fsync until *its* transaction is
> done (possibly long after yours), then you lose the ordering guarantee
> that is the point of the whole exercise...

OK, I understand now. You are saying if my backend dirties a buffer,
but another backend does the write, would my backend fsync() that buffer
that the other backend wrote?

I can't imagine how fsync could flush _only_ the file descriptor buffers
modified by the current process. It would have to affect all buffers
for the file descriptor. BSDI says:

	Fsync() causes all modified data and attributes of fd to be moved
	to a permanent storage device. This normally results in all
	in-core modified copies of buffers for the associated file to be
	written to a disk.

Looking at the BSDI kernel, there is a user-mode file descriptor table,
which maps to a kernel file descriptor table. This table can be shared,
so a file descriptor can be opened multiple times, as in a fork() call.
The kernel table maps to an actual file inode/vnode that maps to a file.
The only thing that is kept in the file descriptor table is the current
offset in the file (struct file in BSD). There is no mapping of who
wrote which blocks.

In fact, I would suggest that any kernel implementation that could track
such things would be pretty broken. I can imagine some cases where the
use of that mapping of blocks to file descriptors would cause
compatibility problems. Those buffers have to be shared by all
processes.

So, I think we are safe if we can either keep that file descriptor open
until commit, or re-open it and fsync it on commit. That assumes a
re-open is hitting the same file. My opinion is that we should just
fsync it on close and not worry about a reopen.

--
  Bruce Momjian
* Bruce Momjian <pgman@candle.pha.pa.us> [000207 10:14] wrote:
> OK, I understand now. You are saying if my backend dirties a buffer,
> but another backend does the write, would my backend fsync() that buffer
> that the other backend wrote.
>
> I can't imagine how fsync could flush _only_ the file descriptor buffers
> modified by the current process. It would have to affect all buffers
> for the file descriptor.
>
> In fact, I would suggest that any kernel implementation that could track
> such things would be pretty broken. I can imagine some cases the use of
> that mapping of blocks to file descriptors would cause compatibility
> problems. Those buffers have to be shared by all processes.
>
> So, I think we are safe if we can either keep that file descriptor open
> until commit, or re-open it and fsync it on commit. That assume a
> re-open is hitting the same file. My opinion is that we should just
> fsync it on close and not worry about a reopen.

I'm pretty sure that the standard is that a close on a file _should_
fsync it.

In re the fsync problems...

I came across this option when investigating implementing range fsync()
for FreeBSD: 'O_FSYNC'/'O_SYNC'.

Why not keep 2 file descriptors open for each datafile, one opened
with O_FSYNC (exists but not documented in FreeBSD) and one normal?
This guarantees sync writes for all write operations on that fd. Most
unices offer an open flag for this type of access, although the name
will vary (Linux/Solaris uses O_SYNC afaik). When a sync write is
needed, use that file descriptor to do the writing, and use the normal
one for non-sync writes. This would fix the problem where another
backend causes an out-of-order or unsafe fsync to occur.

Another option is using mmap() and msync() to achieve the same effect.
The only problem with mmap() is that under most i386 systems you are
limited to a < 4 gig (2 gig with FreeBSD) mapping that would have to be
'windowed' over the datafiles; however, depending on the locality of
accesses this may be much more efficient than read/write semantics.
Not to mention that a lot of unices have broken mmap() implementations
and problems with merged vm/buffercache.

Yes, I haven't looked at the backend code, just hoping to offer some
useful suggestions.

-Alfred
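[Editor's note: Alfred's suggestion sketched in code. A descriptor opened with O_SYNC (spelled O_FSYNC on older BSDs) makes every write on it synchronous, so no separate fsync call, and hence no cross-backend fsync side effect, is involved. write_sync() is a hypothetical helper for illustration, not backend code.]

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef O_SYNC
#define O_SYNC O_FSYNC   /* older BSD spelling of the same flag */
#endif

int write_sync(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_CREAT | O_WRONLY | O_APPEND | O_SYNC, 0600);
	ssize_t n;

	if (fd < 0)
		return -1;
	/* with O_SYNC, write() returns only once the data is on stable storage */
	n = write(fd, buf, len);
	close(fd);
	return (n == (ssize_t) len) ? 0 : -1;
}
```

In Alfred's two-descriptor scheme the backend would keep one such fd per datafile alongside a normal one, and route only the writes that must be durable through the synchronous descriptor.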
> > So, I think we are safe if we can either keep that file descriptor open
> > until commit, or re-open it and fsync it on commit. That assume a
> > re-open is hitting the same file. My opinion is that we should just
> > fsync it on close and not worry about a reopen.
>
> I'm pretty sure that the standard is that a close on a file _should_
> fsync it.

This is not true. close() flushes the user buffers to kernel buffers;
it does not force to physical disk in all cases, I think. There is
really no need to force them to disk on close. The only time they have
to be forced to disk is when the system shuts down, or on an fsync call.

> In re the fsync problems...
>
> I came across this option when investigating implementing range fsync()
> for FreeBSD, 'O_FSYNC'/'O_SYNC'.
>
> Why not keep 2 file descriptors open for each datafile, one opened
> with O_FSYNC (exists but not documented in FreeBSD) and one normal?
> This guarantees sync writes for all write operations on that fd.

We actually don't want this. We'd like to just fsync the file
descriptor and retroactively fsync all our writes. fsync allows us to
decouple the write and the fsync, which is what we really are attempting
to do. Our current behavior is to do write/fsync together, which is
wasteful.

--
  Bruce Momjian
* Bruce Momjian <pgman@candle.pha.pa.us> [000207 11:00] wrote: > > > So, I think we are safe if we can either keep that file descriptor open > > > until commit, or re-open it and fsync it on commit. That assume a > > > re-open is hitting the same file. My opinion is that we should just > > > fsync it on close and not worry about a reopen. > > > > I'm pretty sure that the standard is that a close on a file _should_ > > fsync it. > > This is not true. close flushes the user buffers to kernel buffers. It > does not force to physical disk in all cases, I think. There is really > no need to force them to disk on close. The only time they have to be > forced to disk is when the system shuts down, or on an fsync call. > > > > > In re the fsync problems... > > > > I came across this option when investigating implementing range fsync() > > for FreeBSD, 'O_FSYNC'/'O_SYNC'. > > > > Why not keep 2 file descritors open for each datafile, one opened > > with O_FSYNC (exists but not documented in FreeBSD) and one normal? > > This garantees sync writes for all write operations on that fd. > > We actually don't want this. We like to just fsync the file descriptor > and retroactively fsync all our writes. fsync allows us to decouple the > write and the fsync, which is what we really are attempting to do. Our > current behavour is to do write/fsync together, which is wasteful. Yes, the way I understand it is that one backend doing the fsync will sync the entire file perhaps forcing a sync in the middle of a somewhat critical update being done by another instance of the backend. Since the current behavior seems to be write/fsync/write/fsync... instead of write/write/write/fsync you may as well try opening the filedescriptor with O_FSYNC on operating systems that support it to avoid the cross-fsync problem. Another option is to use O_FSYNC descriptiors and aio_write to allow a sync writes to be 'backgrounded'. More and more unix OS's are supporting aio nowadays. 
I'm aware of the performance implications sync writes cause, but using
fsync after every write seems to cause massive amounts of unnecessary
disk IO that could be avoided by using explicit sync descriptors, with
little increase in complexity considering what I understand of the
current implementation.

Basically it would seem to be a good hack until you get the algorithm
to batch fsyncs working (write/write/write.../fsync). At that point
you may want to window over the files using msync(), but there may be
a better way, one that allows a vector of io to be scheduled for sync
write in one go, rather than a buffer at a time.

-Alfred
> Yes, the way I understand it is that one backend doing the fsync
> will sync the entire file perhaps forcing a sync in the middle of
> a somewhat critical update being done by another instance of the
> backend.

We don't mind that. Until the transaction is marked as complete, they
can fsync anything we want. We just want all stuff modified by a
transaction fsynced before a transaction is marked as completed.

> I'm aware of the performance implications sync writes cause, but
> using fsync after every write seems to cause massive amounts of
> unnecessary disk IO that could be avoided with using explicit
> sync descriptors with little increase in complexity considering
> what I understand of the current implementation.

Yes.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I can't imagine how fsync could flush _only_ the file descriptor buffers
> modified by the current process. It would have to affect all buffers
> for the file descriptor.

Yeah, you're probably right. After thinking about it, I can't believe
that a disk block buffer inside the kernel has any record of which FD it
was written by (after all, it could have been dirtied through more than
one FD since it was last synced to disk). All it's got is a file inode
number and a block number within the file. Presumably fsync() searches
the buffer cache for blocks that match the FD's inode number and
schedules I/O for all the ones that are dirty.

> So, I think we are safe if we can either keep that file descriptor open
> until commit, or re-open it and fsync it on commit. That assumes a
> re-open is hitting the same file. My opinion is that we should just
> fsync it on close and not worry about a reopen.

There's still the problem that your backend might never have opened the
relation file at all, still less done a write through its fd or vfd.
I think we would need to have a separate data structure saying "these
relations were dirtied in the current xact" that is not tied to fd's or
vfd's. Maybe the relcache would be a good place to keep such a flag.

Transaction commit would look like:

* scan buffer cache for dirty buffers, fwrite each one that belongs
  to one of the relations I'm trying to commit;

* open and fsync each segment of each rel that I'm trying to commit
  (or maybe just the dirtied segments, if we want to do the bookkeeping
  at that level of detail);

* make pg_log entry;

* write and fsync pg_log.

fsync-on-close is probably a waste of cycles. The only way that would
matter is if someone else were doing a RENAME TABLE on the rel, thus
preventing you from reopening it.
I think we could just put the responsibility on the renamer to fsync the file while he's doing it (in fact I think that's already in there, at least to the extent of flushing the buffer cache). regards, tom lane
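Tom's proposed commit sequence can be sketched in outline. This is purely an illustrative sketch, not PostgreSQL source: the names XactDirtyList, mark_rel_dirtied, and rels_needing_fsync are hypothetical stand-ins for the "relations dirtied in the current xact" bookkeeping he describes.

```c
#include <assert.h>

/*
 * Hypothetical per-transaction bookkeeping (not real PostgreSQL code):
 * remember which relations this xact dirtied, independent of fds/vfds.
 */
#define MAX_RELS 64

typedef struct
{
    int relid[MAX_RELS];    /* relation identifiers dirtied this xact */
    int nrels;              /* number of distinct relations recorded */
} XactDirtyList;

/* Record that the current transaction dirtied a relation (idempotent). */
static void
mark_rel_dirtied(XactDirtyList *x, int relid)
{
    for (int i = 0; i < x->nrels; i++)
        if (x->relid[i] == relid)
            return;             /* already recorded */
    x->relid[x->nrels++] = relid;
}

/*
 * At commit we would: flush dirty buffers belonging to each recorded
 * relation, open and fsync each of its segments, then write and fsync
 * pg_log.  Here we only report how many relations need that pass.
 */
static int
rels_needing_fsync(const XactDirtyList *x)
{
    return x->nrels;
}
```

The point of the structure is that it is keyed by relation, not by file descriptor, so it survives fds being closed and recycled mid-transaction.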
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Bruce Momjian
>
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Don't tell me we fsync on every buffer write, and not just at
> > > transaction commit? That is terrible.
> >
> > If you don't have -F set, yup. Why did you think fsync mode was
> > so slow?
> >
> > > What if we set a flag on the file descriptor stating we dirtied/wrote
> > > one of its buffers during the transaction, and cycle through the file
> > > descriptors on buffer commit and fsync all involved in the transaction.
> >
> > That's exactly what Tatsuo was describing, I believe. I think Hiroshi
> > has pointed out a serious problem that would make it unreliable when
> > multiple backends are running: if some *other* backend fwrites the page
> > instead of your backend, and it doesn't fsync until *its* transaction is
> > done (possibly long after yours), then you lose the ordering guarantee
> > that is the point of the whole exercise...
>
> OK, I understand now. You are saying if my backend dirties a buffer,
> but another backend does the write, would my backend fsync() that buffer
> that the other backend wrote.
>
> I can't imagine how fsync could flush _only_ the file descriptor buffers
> modified by the current process. It would have to affect all buffers
> for the file descriptor.
>
> BSDI says:
>
>     Fsync() causes all modified data and attributes of fd to be moved
>     to a permanent storage device. This normally results in all
>     in-core modified copies of buffers for the associated file to be
>     written to a disk.
>
> Looking at the BSDI kernel, there is a user-mode file descriptor table,
> which maps to a kernel file descriptor table. This table can be shared,
> so a file descriptor opened multiple times, like in a fork() call. The
> kernel table maps to an actual file inode/vnode that maps to a file.
> The only thing that is kept in the file descriptor table is the current
> offset in the file (struct file in BSD). There is no mapping of who
> wrote which blocks.
>
> In fact, I would suggest that any kernel implementation that could track
> such things would be pretty broken. I can imagine some cases the use of
> that mapping of blocks to file descriptors would cause compatibility
> problems. Those buffers have to be shared by all processes.
>
> So, I think we are safe if we can either keep that file descriptor open
> until commit, or re-open it and fsync it on commit. That assumes a
> re-open is hitting the same file. My opinion is that we should just
> fsync it on close and not worry about a reopen.
>

I asked about this question 4 months ago but got no answer. Obviously
this needs not only md/fd stuff changes but also bufmgr changes.
Keeping a dirtied list of segments for each backend seems to work, but
I'm afraid of other oversights. The problem is that this feature is
very difficult to verify. In addition, WAL would solve this item
naturally. Is it still valuable to solve this item in current spec ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > Is it still valuable to solve this item in current spec ? I'd be inclined to forget about it for now, and see what happens with WAL. It looks like a fair amount of work for a problem that will go away anyway in a release or so... regards, tom lane
Bruce Momjian wrote: > > > > So, I think we are safe if we can either keep that file descriptor open > > > until commit, or re-open it and fsync it on commit. That assume a > > > re-open is hitting the same file. My opinion is that we should just > > > fsync it on close and not worry about a reopen. > > > > I'm pretty sure that the standard is that a close on a file _should_ > > fsync it. > > This is not true. close flushes the user buffers to kernel buffers. It > does not force to physical disk in all cases, I think. fclose flushes user buffers to kernel buffers. close only frees the file descriptor for re-use.
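The distinction drawn here can be demonstrated directly. In this hedged sketch (write_durably is a made-up helper, not a PostgreSQL function), fflush() moves stdio buffers to the kernel, fsync() forces the kernel buffers to stable storage, and fclose() merely releases the descriptor without forcing anything to disk.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * The three buffering layers being discussed in this thread:
 *   fflush():  user (stdio) buffers -> kernel buffers
 *   fsync():   kernel buffers       -> physical disk
 *   fclose()/close(): releases the descriptor; forces neither to disk.
 *
 * Returns 0 on success, -1 on any failure.
 */
int
write_durably(const char *path, const char *data)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    fputs(data, f);
    if (fflush(f) != 0)             /* stdio buffer -> kernel */
    {
        fclose(f);
        return -1;
    }
    if (fsync(fileno(f)) != 0)      /* kernel buffer -> disk */
    {
        fclose(f);
        return -1;
    }
    return fclose(f);               /* just releases the descriptor */
}
```

Dropping the fsync() call leaves the data in kernel buffers, which is exactly the state a crash can lose: neither fflush nor fclose promises durability.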
> > So, I think we are safe if we can either keep that file descriptor open
> > until commit, or re-open it and fsync it on commit. That assumes a
> > re-open is hitting the same file. My opinion is that we should just
> > fsync it on close and not worry about a reopen.
>
> There's still the problem that your backend might never have opened the
> relation file at all, still less done a write through its fd or vfd.
> I think we would need to have a separate data structure saying "these
> relations were dirtied in the current xact" that is not tied to fd's or
> vfd's. Maybe the relcache would be a good place to keep such a flag.
>
> Transaction commit would look like:
>
> * scan buffer cache for dirty buffers, fwrite each one that belongs
>   to one of the relations I'm trying to commit;
>
> * open and fsync each segment of each rel that I'm trying to commit
>   (or maybe just the dirtied segments, if we want to do the bookkeeping
>   at that level of detail);

By fsync'ing on close, we need not worry about file descriptors that
were forced out of the file descriptor cache during the transaction.

If we dirty a buffer, we have to mark the buffer as dirty, and the file
descriptor associated with that buffer as needing fsync. If someone
else writes and removes that buffer from the cache before we get to
commit it, the file descriptor flag will tell us the file descriptor
needs fsync.

We have to:

	write our dirty buffers
	fsync all file descriptors marked as "written" during our transaction
	fsync all file descriptors on close when being cycled out of fd cache
	(fd close has to write dirty buffers before fsync)

So we have three states for a write:

	still in dirty buffer
	file descriptor marked as dirty/need fsync
	file descriptor removed from cache, fsync'ed on close

Seems this covers all the cases.

> * make pg_log entry;
>
> * write and fsync pg_log.

Yes.

> fsync-on-close is probably a waste of cycles.
> The only way that would matter is if someone else were doing a RENAME
> TABLE on the rel, thus preventing you from reopening it. I think we
> could just put the responsibility on the renamer to fsync the file
> while he's doing it (in fact I think that's already in there, at least
> to the extent of flushing the buffer cache).

I hadn't thought of that case. I was thinking of file descriptor cache
removal, or don't they get removed if they are in use? If not, you can
skip my close examples.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
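Bruce's three states for a write can be captured as a tiny state machine. A minimal sketch with hypothetical names; the point is that only a write whose fd was already fsync'ed on close is durable before commit, so commit must fsync everything else.

```c
#include <assert.h>

/* Hypothetical lifecycle of one written page under the scheme above. */
typedef enum
{
    WRITE_IN_DIRTY_BUFFER,      /* still sitting in a dirty shared buffer */
    WRITE_FD_NEEDS_FSYNC,       /* written; fd flagged as needing fsync */
    WRITE_FSYNCED_ON_CLOSE      /* fd cycled out of cache; fsync'd at close */
} WriteState;

/*
 * Commit must write-and-fsync anything in the first state and fsync
 * anything in the second; only the third state is already on disk.
 */
static int
needs_fsync_at_commit(WriteState s)
{
    return s != WRITE_FSYNCED_ON_CLOSE;
}
```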
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > > Is it still valuable to solve this item in current spec ? > > I'd be inclined to forget about it for now, and see what happens > with WAL. It looks like a fair amount of work for a problem that > will go away anyway in a release or so... But is seems Tatsuo is pretty close to it. I personally would like to see it in 7.0. Even with WAL, we may decide to allow non-WAL mode, and if so, this code would still be useful. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > So, I think we are safe if we can either keep that file descriptor open
> > until commit, or re-open it and fsync it on commit. That assumes a
> > re-open is hitting the same file. My opinion is that we should just
> > fsync it on close and not worry about a reopen.
> >
>
> I asked about this question 4 months ago but got no answer.
> Obviously this needs not only md/fd stuff changes but also bufmgr
> changes. Keeping a dirtied list of segments for each backend seems
> to work. But I'm afraid of other oversights.

I don't think so. We can just mark file descriptors as needing
fsync(). By doing that, we can spin through the buffer cache for each
need_fsync file descriptor, perform any writes needed, and fsync the
descriptor.

Seems like little redesign needed, except for adding the need_fsync
flag. Should be no more than about 20 lines.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Bruce Momjian
>
> > > So, I think we are safe if we can either keep that file
> > > descriptor open until commit, or re-open it and fsync it on
> > > commit. That assumes a re-open is hitting the same file. My
> > > opinion is that we should just fsync it on close and not worry
> > > about a reopen.
> >
> > There's still the problem that your backend might never have opened the
> > relation file at all, still less done a write through its fd or vfd.
> > I think we would need to have a separate data structure saying "these
> > relations were dirtied in the current xact" that is not tied to fd's or
> > vfd's. Maybe the relcache would be a good place to keep such a flag.
> >
> > Transaction commit would look like:
> >
> > * scan buffer cache for dirty buffers, fwrite each one that belongs
> >   to one of the relations I'm trying to commit;
> >
> > * open and fsync each segment of each rel that I'm trying to commit
> >   (or maybe just the dirtied segments, if we want to do the bookkeeping
> >   at that level of detail);
>
> By fsync'ing on close, we need not worry about file descriptors that were
> forced out of the file descriptor cache during the transaction.
>
> If we dirty a buffer, we have to mark the buffer as dirty, and the file
> descriptor associated with that buffer needing fsync. If someone else

What are the file descriptors associated with buffers?
Would you call heap_open() etc each time when a buffer is about
to be dirtied?

I don't object to you strongly but I ask again.
There's already the -F option for speeding up.
Who would want non-WAL mode with strict reliability after WAL
is implemented ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
At 10:04 AM 2/8/00 +0900, Hiroshi Inoue wrote:
>There's already the -F option for speeding up.
>Who would want non-WAL mode with strict reliability after WAL
>is implemented ?

Exactly. I suspect WAL will actually run faster, or at least will have
that potential when its existence is fully exploited, than non-WAL
non -F.

And it seems to me that touching something as crucial as disk
management in a fundamental way one week before the release of a
hopefully solid beta is pushing things a bit.

But, then again, I'm the resident paranoid conservative, I guess.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.
> > > * open and fsync each segment of each rel that I'm trying to commit
> > >   (or maybe just the dirtied segments, if we want to do the bookkeeping
> > >   at that level of detail);
> >
> > By fsync'ing on close, we need not worry about file descriptors that were
> > forced out of the file descriptor cache during the transaction.
> >
> > If we dirty a buffer, we have to mark the buffer as dirty, and the file
> > descriptor associated with that buffer needing fsync. If someone else
>
> What are the file descriptors associated with buffers?
> Would you call heap_open() etc each time when a buffer is about
> to be dirtied?

WriteBuffer -> FlushBuffer to flush the buffer. The buffer can be
either marked dirty or written/fsync'ed to disk. If written/fsync'ed,
smgr_flush -> mdflush -> _mdfd_getseg gets the MdfdVec structure of the
file descriptor. When doing the flush here, mark the MdfdVec
structure's new element needs_fsync true. Don't do fsync yet. If just
marked dirty, also mark MdfdVec.needs_fsync as true.

Do we currently write all dirty buffers on transaction commit? We
certainly must already do that in fsync mode.

On commit, run through the virtual file descriptor table and do fsyncs
on file descriptors. No need to find the buffers attached to file
descriptors. They have already been written by other code. They just
need fsync.

> There's already the -F option for speeding up.
> Who would want non-WAL mode with strict reliability after WAL
> is implemented ?

Let's see what Vadim says. Seems like a nice performance boost, and
7.0 could be 6 months away. If we didn't ship with fsync enabled, I
wouldn't care. Also, Vadim has a new job, so we really can't be sure
about WAL in 7.1.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
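The flow Bruce describes could look roughly like this. MdfdVec is a real structure in md.c, but the needs_fsync member and both helper functions here are hypothetical, illustrating only the proposed bookkeeping: flag a segment when one of its pages is written or dirtied, then sweep and fsync only flagged segments at commit.

```c
#include <stdbool.h>
#include <unistd.h>

/*
 * Illustrative sketch of the proposed change (not PostgreSQL source):
 * each open segment carries a needs_fsync flag, set wherever
 * mdflush()/FlushBuffer() would have done an immediate fsync today.
 */
#define MAX_SEGS 32

typedef struct
{
    int  fd;            /* kernel descriptor for this segment (-1 if closed) */
    bool needs_fsync;   /* the proposed new member */
} SegEntry;

/* Called where a page of segment i is written or dirtied. */
static void
mark_segment_written(SegEntry *segs, int i)
{
    segs[i].needs_fsync = true;
}

/*
 * At commit: fsync every flagged segment, clear the flags, and return
 * how many fsyncs were issued -- one per touched file, not one per page.
 */
static int
fsync_pending_segments(SegEntry *segs, int nsegs)
{
    int n = 0;

    for (int i = 0; i < nsegs; i++)
    {
        if (segs[i].needs_fsync)
        {
            if (segs[i].fd >= 0)
                (void) fsync(segs[i].fd);
            segs[i].needs_fsync = false;
            n++;
        }
    }
    return n;
}
```

The win is that a transaction touching one file many times pays one fsync instead of one per page write; the thread's unresolved question is what happens when another backend writes the page through a different descriptor.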
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: >> If we dirty a buffer, we have to mark the buffer as dirty, and the file >> descriptor associated with that buffer needing fsync. If someone else > What is the file descriptors associated with buffers ? I was about to make exactly that remark. A shared buffer doesn't have an "associated file descriptor", certainly not one that's valid across multiple backends. AFAICS no bookkeeping based on file descriptors (either kernel FDs or vfds) can possibly work correctly in the multiple-backend case. We *have to* do the bookkeeping on a relation basis, and that potentially means (re)opening the relation's file at xact commit in order to do an fsync. There is no value in having one backend fsync an FD before closing the FD, because that does not take account of what other backends may have done or do later with that same file through their own FDs for it. If we do not do an fsync at end of transaction, we cannot be sure that writes initiated by *other* backends will be complete. > There's already -F option for speeding up. > Who would want non-WAL mode with strict reliabilty after WAL > is implemented ? Yes. We have a better solution in the pipeline, so ISTM it's not worth expending a lot of effort on a stopgap. regards, tom lane
> -----Original Message-----
> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
>
> wouldn't care. Also, Vadim has a new job, so we really can't be sure
> about WAL in 7.1.
>

Oops, it's a big problem. If so, we may have to do something about
this item. However, it seems too late for 7.0. This isn't the kind of
item that a beta could verify.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Seems like little redesign needed, except for adding the need_fsync > flag. Should be no more than about 20 lines. If you think this is a twenty line fix, take a deep breath and back away slowly. You have not understood the problem. The problem comes in when *some other* backend has written out a shared buffer that contained a change that our backend made as part of the transaction that it now wants to commit. Without immediate- fsync-on-write (the current solution), there is no guarantee that the other backend will do an fsync any time soon; it might be busy in a very long-running transaction. Our backend must fsync that file, and it must do so after the other backend flushed the buffer. But there is no existing data structure that our backend can use to discover that it must do this. The shared buffer cannot record it; it might belong to some other file entirely by now (and in any case, the shared buffer is noplace to record per-transaction status info). Our backend cannot use either FD or VFD to record it, since it might never have opened the relation file at all, and certainly might have closed it again (and recycled the FD or VFD) before the other backend flushed the shared buffer. The relcache might possibly work as a place to record the need for fsync --- but I am concerned about the relcache's willingness to drop entries if they are not currently heap_open'd; also, md/fd don't currently use the relcache at all. This is not a trivial change. regards, tom lane
At 10:26 PM 2/7/00 -0500, Tom Lane wrote:
>Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Seems like little redesign needed, except for adding the need_fsync
>> flag. Should be no more than about 20 lines.
>
>If you think this is a twenty line fix, take a deep breath and back
>away slowly. You have not understood the problem.

And, again, thank you.

>This is not a trivial change.

I was actually through that code months ago, wondering why (ahem) PG
was so stupid about disk I/O, and reached the same conclusion.
Therefore, I was more than pleased when a simple fix to get rid of
fsync's on read-only transactions arose. In my application space, this
alone gave a huge performance boost.

WAL...that's it. If Vadim is going to be unavailable because of his
new job, we'll need to figure out another way to do it.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.
At 10:00 PM 2/7/00 -0500, Tom Lane wrote:
>"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> There's already the -F option for speeding up.
>> Who would want non-WAL mode with strict reliability after WAL
>> is implemented ?

>Yes. We have a better solution in the pipeline, so ISTM it's not
>worth expending a lot of effort on a stopgap.

Thanks to both of you.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.
> The problem comes in when *some other* backend has written out a
> shared buffer that contained a change that our backend made as part
> of the transaction that it now wants to commit. Without immediate-
> fsync-on-write (the current solution), there is no guarantee that the
> other backend will do an fsync any time soon; it might be busy in
> a very long-running transaction. Our backend must fsync that file,
> and it must do so after the other backend flushed the buffer. But
> there is no existing data structure that our backend can use to
> discover that it must do this. The shared buffer cannot record it;
> it might belong to some other file entirely by now (and in any case,
> the shared buffer is noplace to record per-transaction status info).
> Our backend cannot use either FD or VFD to record it, since it might
> never have opened the relation file at all, and certainly might have
> closed it again (and recycled the FD or VFD) before the other backend
> flushed the shared buffer. The relcache might possibly work as a
> place to record the need for fsync --- but I am concerned about the
> relcache's willingness to drop entries if they are not currently
> heap_open'd; also, md/fd don't currently use the relcache at all.

OK, I will admit I must be wrong, but I would like to understand why.

I am suggesting opening and marking a file descriptor as needing fsync
even if I only dirty the buffer and not write it. I understand another
backend may write my buffer and remove it before I commit my
transaction. However, I will be the one to fsync it. I am also
suggesting that such file descriptors never get recycled until
transaction commit.

Is that wrong?

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I am suggesting opening and marking a file descriptor as needing fsync > even if I only dirty the buffer and not write it. I understand another > backend may write my buffer and remove it before I commit my > transaction. However, I will be the one to fsync it. I am also > suggesting that such file descriptors never get recycled until > transaction commit. > Is that wrong? I see where you're going, and you could possibly make it work, but there are a bunch of problems. One objection is that kernel FDs are a very finite resource on a lot of platforms --- you don't really want to tie up one FD for every dirty buffer, and you *certainly* don't want to get into a situation where you can't release kernel FDs until end of xact. You might be able to get around that by associating the fsync-needed bit with VFDs instead of FDs. What may turn out to be a nastier problem is the circular dependency this creates between shared-buffer management and md.c/fd.c. Right now (IIRC at 3am) md/fd are clearly at a lower level than bufmgr, but that would stop being true if you make FDs be proxies for dirtied buffers. Here is one off-the-top-of-the-head trouble scenario: bufmgr wants to dump a buffer that was dirtied by another backend -> needs to open FD -> fd.c has no free FDs, needs to close one -> needs to dump and fsync a buffer so it can forget the FD -> bufmgr needs to get I/O lock on two different buffers at once -> potential deadlock against another backend doing the reverse. (Assuming you even get that far, and don't hang up at the recursive entry to bufmgr trying to get a spinlock you already hold...) Possibly with close study you can prove that no such problem can happen. My point is just that this isn't a trivial change. Is it worth investing substantial effort on what will ultimately be a dead end? regards, tom lane
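For what it's worth, the classic way to rule out the two-buffer deadlock Tom sketches is a fixed global lock order. A minimal sketch, hypothetical and not PostgreSQL code: if every backend always takes the I/O lock on the lower-numbered buffer first, no two backends can each hold one lock while waiting for the other's.

```c
/*
 * Deadlock-avoidance sketch (illustration only): given two buffer ids
 * a backend must I/O-lock, emit them in the globally agreed order.
 * Locking *first* before *second* everywhere makes a wait cycle
 * between two backends impossible.
 */
static void
order_buffer_locks(int a, int b, int *first, int *second)
{
    if (a <= b)
    {
        *first = a;
        *second = b;
    }
    else
    {
        *first = b;
        *second = a;
    }
}
```

Whether such an ordering discipline could actually be imposed on the bufmgr/md/fd call paths described above is exactly the "close study" Tom says would be required.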
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am suggesting opening and marking a file descriptor as needing fsync
> > even if I only dirty the buffer and not write it. I understand another
> > backend may write my buffer and remove it before I commit my
> > transaction. However, I will be the one to fsync it. I am also
> > suggesting that such file descriptors never get recycled until
> > transaction commit.
> >
> > Is that wrong?
>
> I see where you're going, and you could possibly make it work, but
> there are a bunch of problems. One objection is that kernel FDs
> are a very finite resource on a lot of platforms --- you don't really
> want to tie up one FD for every dirty buffer, and you *certainly*
> don't want to get into a situation where you can't release kernel
> FDs until end of xact. You might be able to get around that by
> associating the fsync-needed bit with VFDs instead of FDs.

OK, at least I was thinking correctly. Yes, there are serious
drawbacks that make this pretty hard to implement. Unless Vadim
revives this, we can drop it.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> BTW, Hiroshi has pointed out to me an excellent point #3:
>
> >Session-1
> >begin;
> >update A ...;
> >
> >Session-2
> >begin;
> >select * from B ..;
> >  There's no PostgreSQL shared buffer available.
> >  This backend has to force the flush of a free buffer
> >  page. Unfortunately the page was dirtied by the
> >  above operation of Session-1 and calls pg_fsync()
> >  for the table A. However fsync() is postponed until
> >  commit of this backend.
> >
> >Session-1
> >commit;
> >  There's no dirty buffer page for the table A.
> >  So pg_fsync() isn't called for the table A.
>
> Seems there's no easy solution for this. Maybe now is the time to give
> up my idea...

Thinking about it a little bit more, I have come across yet another
possible solution. It is actually *very* simple. Details as follows.

In xact.c:RecordTransactionCommit() there are two FlushBufferPool
calls. One is for relation files and the other is for pg_log. I add
sync() right after these FlushBufferPool calls. It will force any
pending kernel buffers to be physically written onto disk, thus should
guarantee the ACID of the transaction (see attached code fragment).

There are two things that we should worry about with sync, however.

1. Does sync really wait for the completion of data being written onto
disk?

I looked into the man page of sync(2) on Linux 2.0.36:

    According to the standard specification (e.g., SVID), sync()
    schedules the writes, but may return before the actual writing
    is done. However, since version 1.3.20 Linux does actually
    wait. (This still does not guarantee data integrity: modern
    disks have large caches.)

It seems that sync(2) blocks until data is written. So it would be ok
at least with Linux. I'm not sure about other platforms, though.

2. Do we suffer any performance penalty from sync?

Since sync forces *all* dirty buffers on the system to be written onto
disk, it might be slower than fsync. So I did some testing using
contrib/pgbench.
Starting postmaster with -F on (and with the sync modification), I ran
32 concurrent clients, each performing 10 transactions. In total 320
transactions are performed. Each transaction contains an UPDATE and a
SELECT to a table that has 1000k tuples, and an INSERT to another
small table.

The result showed that -F + sync was actually faster than the default
mode (no -F, no modifications). The system is a Red Hat 5.2, with
128MB RAM.

	                    -F + sync    normal mode
	--------------------------------------------------------
	transactions/sec       3.46         2.93

Of course if there are disk activities other than PostgreSQL, sync
would suffer from them. However, in most cases the system is dedicated
to PostgreSQL only, and I don't think this is a big problem in the
real world. Note that large COPY or INSERT was much faster than in
the normal mode due to no per-page fsync.

Thinking about all these, I would like to propose we add a new switch
to postgres to run with -F + sync.

------------------------------------------------------------------------
	/*
	 * If no one shared buffer was changed by this transaction then
	 * we don't flush shared buffers and don't record commit status.
	 */
	if (SharedBufferChanged)
	{
		FlushBufferPool();
		sync();

		if (leak)
			ResetBufferPool();

		/*
		 * have the transaction access methods record the status
		 * of this transaction id in the pg_log relation.
		 */
		TransactionIdCommit(xid);

		/*
		 * Now write the log info to the disk too.
		 */
		leak = BufferPoolCheckLeak();
		FlushBufferPool();
		sync();
	}
* Tatsuo Ishii <t-ishii@sra.co.jp> [000209 00:51] wrote:
> > BTW, Hiroshi has pointed out to me an excellent point #3:
> >
> > >Session-1
> > >begin;
> > >update A ...;
> > >
> > >Session-2
> > >begin;
> > >select * from B ..;
> > >  There's no PostgreSQL shared buffer available.
> > >  This backend has to force the flush of a free buffer
> > >  page. Unfortunately the page was dirtied by the
> > >  above operation of Session-1 and calls pg_fsync()
> > >  for the table A. However fsync() is postponed until
> > >  commit of this backend.
> > >
> > >Session-1
> > >commit;
> > >  There's no dirty buffer page for the table A.
> > >  So pg_fsync() isn't called for the table A.
> >
> > Seems there's no easy solution for this. Maybe now is the time to give
> > up my idea...
>
> Thinking about it a little bit more, I have come across yet another
> possible solution. It is actually *very* simple. Details as follows.
>
> In xact.c:RecordTransactionCommit() there are two FlushBufferPool
> calls. One is for relation files and the other is for pg_log. I add
> sync() right after these FlushBufferPool calls. It will force any
> pending kernel buffers to be physically written onto disk, thus should
> guarantee the ACID of the transaction (see attached code fragment).
>
> There are two things that we should worry about with sync, however.
>
> 1. Does sync really wait for the completion of data being written onto
> disk?
>
> I looked into the man page of sync(2) on Linux 2.0.36:
>
>     According to the standard specification (e.g., SVID),
>     sync() schedules the writes, but may return before the
>     actual writing is done. However, since version 1.3.20
>     Linux does actually wait. (This still does not guarantee
>     data integrity: modern disks have large caches.)
>
> It seems that sync(2) blocks until data is written. So it would be ok
> at least with Linux. I'm not sure about other platforms, though.
It is incorrect to assume that sync() waits until all buffers are
flushed on any platform other than Linux; I didn't think that Linux
even did so, but the kernel sources say yes.

Solaris doesn't do this and neither does FreeBSD/NetBSD.

I guess if you wanted to implement this for Linux only then it would
work; you ought to then also warn people that a non-dedicated db
server could experience different performance using this code.

-Alfred
> > It seems that sync(2) blocks until data is written. So it would be ok
> > at least with Linux. I'm not sure about other platforms, though.
>
> It is incorrect to assume that sync() waits until all buffers are
> flushed on any platform other than Linux; I didn't think
> that Linux even did so, but the kernel sources say yes.

Right. I have looked at the Linux kernel sources and confirmed it.

> Solaris doesn't do this and neither does FreeBSD/NetBSD.

I'm not sure about Solaris since I don't have access to its source
code. Will look at the FreeBSD kernel sources.

> I guess if you wanted to implement this for Linux only then it would
> work; you ought to then also warn people that a non-dedicated db
> server could experience different performance using this code.

I just want to have more choices other than with/without -F. -F loses
ACID, and without it we get a per-page fsync. Both choices are
painful. But switching to expensive commercial DBMSs is much more
painful, at least for me.

Even if it would be useful on Linux only and in a certain situation,
it would be better than nothing IMHO (until WAL comes up).
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> [ use a global sync instead of fsync ]

> 1. Does sync() really wait for the completion of data being written
>    onto disk?

Linux is *alone* among Unix platforms in waiting; every other
implementation of sync() returns as soon as the last dirty buffer
is scheduled to be written.

> 2. Do we suffer any performance penalty from sync()?

A global sync at the completion of every xact would be disastrous for
the performance of anything else on the system.

> However, in most cases the system is dedicated to only PostgreSQL,

"Most cases"? Do you have any evidence for that?

			regards, tom lane
> Thinking about it a little bit more, I have come across yet another
> possible solution. It is actually *very* simple. Details follow.
>
> In xact.c:RecordTransactionCommit() there are two FlushBufferPool
> calls. One is for relation files and the other is for pg_log. I add
> sync() right after these FlushBufferPool calls. It will force any
> pending kernel buffers to be physically written onto disk, and thus
> should guarantee the ACID properties of the transaction (see attached
> code fragment).

Interesting idea. I had proposed this solution long ago. My idea was to
buffer pg_log writes every 30 seconds: every 30 seconds, do a sync, then
write/sync pg_log. Seemed like a good solution at the time, but Vadim
didn't like it. I think he preferred to do logging, but honestly, it was
over a year ago, and we could have been benefiting from it all this
time.

Second, I had another idea. What if we fsync()'ed a file descriptor only
when we were writing the _last_ dirty buffer for that file? Seems in
many cases this would be a win. I just don't know how hard that is to
figure out. Seems there is no need to fsync() if we still have dirty
buffers around.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
* Tatsuo Ishii <t-ishii@sra.co.jp> [000209 07:32] wrote:
> > > It seems that sync(2) blocks until data is written. So it would be ok
> > > at least with Linux. I'm not sure about other platforms, though.
> >
> > It is incorrect to assume that sync() waits until all buffers are
> > flushed on any platform other than Linux. I didn't think that Linux
> > even did so, but the kernel sources say yes.
>
> Right. I have looked at the Linux kernel sources and confirmed it.
>
> > Solaris doesn't do this and neither does FreeBSD/NetBSD.
>
> I'm not sure about Solaris since I don't have access to its source
> code. Will look at the FreeBSD kernel sources.
>
> > I guess if you wanted to implement this for Linux only then it would
> > work; you ought to then also warn people that a non-dedicated db
> > server could experience different performance using this code.
>
> I just want to have more choices other than with/without -F. With -F we
> lose ACID; without it we get per-page fsync. Both choices are painful.
> But switching to expensive commercial DBMSs is much more painful, at
> least for me.
>
> Even if it would be useful on Linux only and in certain situations, it
> would be better than nothing IMHO (until WAL comes up).

Ok, here's a nifty idea: a slave process called pgsyncer.

At the end of a transaction a backend asks the syncer to fsync all
files.

Now here's the cool part: this avoids the non-portability of the Linux
sync() approach, and at the same time restricts the syncing to
postgresql and reduces 'cross-fsync' issues.

Imagine:

postgresql has 3 files open (a, b, c); so will the syncer.
backend 1 completes a request, communicates to the syncer that a flush
  is needed.
syncer starts by fsync'ing 'a'
backend 2 completes a request, communicates to the syncer
syncer continues with 'b' then 'c'
syncer responds to backend 1 that it's safe to proceed.
syncer fsyncs 'a' again
syncer responds to backend 2 that it's all completed.

Effectively the fsyncs of 'b' and 'c' have been batched.

It's just an elevator algorithm; perhaps this can be done without a
separate slave process?

-Alfred
Alfred Perlstein <bright@wintelcom.net> writes:
> postgresql has 3 files open (a, b, c); so will the syncer.

The syncer must have all the files open that are open in any backend?
What happens when it runs into the FDs-per-process limit?

> backend 1 completes a request, communicates to the syncer that a flush
>   is needed.
> syncer starts by fsync'ing 'a'
> backend 2 completes a request, communicates to the syncer
> syncer continues with 'b' then 'c'
> syncer responds to backend 1 that it's safe to proceed.
> syncer fsyncs 'a' again
> syncer responds to backend 2 that it's all completed.
>
> Effectively the fsyncs of 'b' and 'c' have been batched.

And it's safe to update pg_log when? I'm failing to see where the
advantage is compared to the backends issuing their own fsyncs...

			regards, tom lane
> Ok, here's a nifty idea: a slave process called pgsyncer.
>
> At the end of a transaction a backend asks the syncer to fsync all
> files.
>
> Now here's the cool part: this avoids the non-portability of the Linux
> sync() approach, and at the same time restricts the syncing to
> postgresql and reduces 'cross-fsync' issues.
>
> Imagine:
>
> postgresql has 3 files open (a, b, c); so will the syncer.
> backend 1 completes a request, communicates to the syncer that a flush
>   is needed.
> syncer starts by fsync'ing 'a'
> backend 2 completes a request, communicates to the syncer
> syncer continues with 'b' then 'c'
> syncer responds to backend 1 that it's safe to proceed.
> syncer fsyncs 'a' again
> syncer responds to backend 2 that it's all completed.
>
> Effectively the fsyncs of 'b' and 'c' have been batched.
>
> It's just an elevator algorithm; perhaps this can be done without a
> separate slave process?

If you go to the hackers archive, you will see an implementation under
the subject "Bufferd loggins/pg_log" dated November 1997. We have gone
over 2 years without this option, and it is going to be even longer
before it is available via WAL.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> -----Original Message-----
> From: owner-pgsql-hackers@postgresql.org
> [mailto:owner-pgsql-hackers@postgresql.org]On Behalf Of Tom Lane
>
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > [ use a global sync instead of fsync ]
>
> > 1. Does sync() really wait for the completion of data being written
> >    onto disk?
>
> Linux is *alone* among Unix platforms in waiting; every other
> implementation of sync() returns as soon as the last dirty buffer
> is scheduled to be written.
>
> > 2. Do we suffer any performance penalty from sync()?
>
> A global sync at the completion of every xact would be disastrous for
> the performance of anything else on the system.
>
> > However, in most cases the system is dedicated to only PostgreSQL,
>
> "Most cases"? Do you have any evidence for that?

Tatsuo is afraid of the delay of WAL. OTOH, it's not so easy to solve
this item under the current spec, and he probably wants a quick and
simple solution. His solution is only for a limited set of OSes, but it
is very simple. Moreover, it would make FlushBufferPool() more reliable
(I don't understand why FlushBufferPool() is allowed to not call
fsync() per page). The implementation would be in time for 7.0. Is a
temporary option until WAL arrives such a bad thing?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp