Thread: TODO item
In the TODO file:

	* -Allow transaction commits with rollback with no-fsync performance [fsync](Vadim)

Has this been done in current? I see almost no performance
improvement on copying data into a table.
--
Tatsuo Ishii
> In the TODO file:
>
> 	* -Allow transaction commits with rollback with no-fsync performance [fsync](Vadim)
>
> Has this been done in current? I see almost no performance
> improvement on copying data into a table.

TODO updated. That was part of MVCC which originally was supposed to be
in 7.0.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > In the TODO file:
> >
> > 	* -Allow transaction commits with rollback with no-fsync performance [fsync](Vadim)
> >
> > Has this been done in current? I see almost no performance
> > improvement on copying data into a table.
>
> TODO updated. That was part of MVCC which originally was supposed to be
> in 7.0.

Thanks.

BTW, I have worked a little bit on this item. The idea is pretty
simple. Instead of doing a real fsync() in pg_fsync(), just marking it
so that we remember to do fsync() at the commit time. Following
patches illustrate the idea. An experience shows that it dramatically
boosts the performance of copy. Unfortunately I see virtually no
difference for TPC-B like small many concurrent transactions. Maybe we
would need WAL for this. Comments?

Index: access/transam/xact.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/access/transam/xact.c,v
retrieving revision 1.60
diff -c -r1.60 xact.c
*** access/transam/xact.c	2000/01/29 16:58:29	1.60
--- access/transam/xact.c	2000/02/06 06:12:58
***************
*** 639,644 ****
--- 639,646 ----
  		if (SharedBufferChanged)
  		{
  			FlushBufferPool();
+ 			pg_fsync_pending();
+ 
  			if (leak)
  				ResetBufferPool();
  
***************
*** 653,658 ****
--- 655,661 ----
  			 */
  			leak = BufferPoolCheckLeak();
  			FlushBufferPool();
+ 			pg_fsync_pending();
  		}
  
  		if (leak)
Index: storage/file/fd.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/storage/file/fd.c,v
retrieving revision 1.52
diff -c -r1.52 fd.c
*** storage/file/fd.c	2000/01/26 05:56:55	1.52
--- storage/file/fd.c	2000/02/06 06:13:01
***************
*** 189,202 ****
  static File fileNameOpenFile(FileName fileName, int fileFlags, int fileMode);
  static char *filepath(char *filename);
  static long pg_nofile(void);
  
  /*
   * pg_fsync --- same as fsync except does nothing if -F switch was given
   */
  int
  pg_fsync(int fd)
  {
! 	return disableFsync ? 0 : fsync(fd);
  }
  
  /*
--- 189,238 ----
  static File fileNameOpenFile(FileName fileName, int fileFlags, int fileMode);
  static char *filepath(char *filename);
  static long pg_nofile(void);
+ static void alloc_fsync_info(void);
  
+ static char *fsync_request;
+ static int nfds;
+ 
  /*
   * pg_fsync --- same as fsync except does nothing if -F switch was given
   */
  int
  pg_fsync(int fd)
+ {
+ 	if (fsync_request == NULL)
+ 		alloc_fsync_info();
+ 	fsync_request[fd] = 1;
+ 	return 0;
+ }
+ 
+ static void alloc_fsync_info(void)
+ {
+ 	nfds = pg_nofile();
+ 	fsync_request = malloc(nfds);
+ 	if (fsync_request == NULL) {
+ 		elog(ERROR, "alloc_fsync_info: cannot allocate memory");
+ 		return;
+ 	}
+ }
+ 
+ void
+ pg_fsync_pending(void)
  {
! 	int	i;
! 
! 	if (disableFsync)
! 		return;
! 
! 	if (fsync_request == NULL)
! 		alloc_fsync_info();
! 
! 	for (i = 0; i < nfds; i++) {
! 		if (fsync_request[i]) {
! 			fsync(i);
! 			fsync_request[i] = 0;
! 		}
! 	}
  }
  
  /*
> BTW, I have worked a little bit on this item. The idea is pretty
> simple. Instead of doing a real fsync() in pg_fsync(), just marking it
> so that we remember to do fsync() at the commit time. Following
> patches illustrate the idea. An experience shows that it dramatically
> boosts the performance of copy. Unfortunately I see virtually no
> difference for TPC-B like small many concurrent transactions. Maybe we
> would need WAL for this. Comments?

Can you be more specific? How does fsync work now vs. your proposed
change? I did not see that here. Sorry.

--
  Bruce Momjian
> -----Original Message-----
> From: owner-pgsql-hackers@postgresql.org
> [mailto:owner-pgsql-hackers@postgresql.org]On Behalf Of Tatsuo Ishii
>
> > > In the TODO file:
> > >
> > > 	* -Allow transaction commits with rollback with no-fsync
> > > 	  performance [fsync](Vadim)
> > >
> > > Has this been done in current? I see almost no performance
> > > improvement on copying data into a table.
> >
> > TODO updated. That was part of MVCC which originally was supposed to be
> > in 7.0.
>
> Thanks.
>
> BTW, I have worked a little bit on this item. The idea is pretty
> simple. Instead of doing a real fsync() in pg_fsync(), just marking it
> so that we remember to do fsync() at the commit time. Following

This seems not good, unfortunately. Note that the backend which calls
pg_fsync() for a relation file may be different from the backend which
updated shared buffers of the file. The former backend wouldn't
necessarily be committed when the latter backend is committed.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
> > BTW, I have worked a little bit on this item. The idea is pretty
> > simple. Instead of doing a real fsync() in pg_fsync(), just marking it
> > so that we remember to do fsync() at the commit time. Following
> > patches illustrate the idea. An experience shows that it dramatically
> > boosts the performance of copy. Unfortunately I see virtually no
> > difference for TPC-B like small many concurrent transactions. Maybe we
> > would need WAL for this. Comments?
>
> Can you be more specific. How does fsync work now vs. your proposed
> change. I did not see that here. Sorry.

As already pointed out by many people, the current buffer manager is not
very smart about flushing out dirty pages. From TODO.detail/fsync:

>This is the problem of buffer manager, known for very long time:
>when copy eats all buffers, manager begins write/fsync each
>durty buffer to free buffer for new data. All updated relations
>should be fsynced _once_ @ transaction commit. You would get
>the same results without -F...

With my changes, pg_fsync would just mark the relation (actually its
file descriptor) as needing fsync, instead of calling the real fsync.
Upon transaction commit, the mark would be checked and relations are
fsynced if necessary.

BTW, Hiroshi has raised a question about my changes, and I have written
to him (in Japanese, of course :-) to make sure what I'm missing here.
I will let you know the result later.
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>>>> BTW, I have worked a little bit on this item. The idea is pretty
>>>> simple. Instead of doing a real fsync() in pg_fsync(), just marking it
>>>> so that we remember to do fsync() at the commit time. Following
>>>> patches illustrate the idea.

In the form you have shown it, it would be completely useless, for two
reasons:

1. It doesn't guarantee that the right files are fsync'd. It would
in fact fsync whichever files happen to be using the same kernel file
descriptor numbers at the close of the transaction as the ones you
really wanted to fsync were using at the time fsync was requested.

2. It doesn't guarantee that the files are fsync'd in the right order.
Per my discussion a few days ago, the only reason for doing fsync at
all is to guarantee that the data pages touched by a transaction get
flushed to disk before the pg_log update claiming that the transaction
is done gets flushed to disk. A change like this completely destroys
that ordering, since pg_fsync_pending has no idea which fd is pg_log.

You could possibly fix #1 by logging fsync requests at the vfd level;
then, whenever a vfd is closed to free up a kernel fd, check the fsync
flag and execute the pending fsync before closing the file. You could
possibly fix #2 by having transaction commit invoke the pg_fsync_pending
scan before it updates pg_log (and then fsyncing pg_log itself again
after).

(Actually, you could probably eliminate the notion of "fsync request"
entirely, and simply have each vfd get marked "dirty" automatically when
written to. Both closing a vfd and the scan at xact commit would look
at the dirty bit to decide to do fsync.)

What would still need to be thought about is whether this scheme
preserves the ordering guarantee when a group of concurrent backends
is considered, rather than one backend in isolation. (I believe that
fsync() will apply to all dirty kernel buffers for a file, not just
those dirtied by the requesting process, so each backend's fsyncs can
affect the order in which other backends' writes hit the disk.)
Offhand I do not see any problems there, but it's the kind of thing
that requires more than offhand thought...

			regards, tom lane
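[Editor's note: the per-vfd dirty-bit scheme Tom sketches above can be illustrated as follows. All names here (my_vfd, my_file_write, and so on) are hypothetical stand-ins for illustration, not PostgreSQL's actual fd.c code. A write marks the vfd dirty; both closing a vfd and the commit-time scan consult the dirty bit, so the pending fsync can never attach to a recycled descriptor number.]

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define MAX_VFD 64

typedef struct
{
	int fd;     /* kernel fd, or -1 if this vfd slot is unused */
	int dirty;  /* written to since last fsync? */
} my_vfd;

static my_vfd vfd_table[MAX_VFD];

void my_vfd_init(void)
{
	for (int v = 0; v < MAX_VFD; v++)
		vfd_table[v].fd = -1;
}

/* Any write automatically marks the vfd dirty ("fsync request"). */
ssize_t my_file_write(int v, const void *buf, size_t len)
{
	vfd_table[v].dirty = 1;
	return write(vfd_table[v].fd, buf, len);
}

/* Closing a vfd to recycle its kernel fd executes the pending fsync
 * first, so the request never attaches to a stale descriptor number. */
void my_file_close(int v)
{
	if (vfd_table[v].dirty)
		fsync(vfd_table[v].fd);
	close(vfd_table[v].fd);
	vfd_table[v].fd = -1;
	vfd_table[v].dirty = 0;
}

/* At commit, before pg_log is updated: fsync every dirty vfd. */
void my_fsync_pending(void)
{
	for (int v = 0; v < MAX_VFD; v++)
		if (vfd_table[v].fd >= 0 && vfd_table[v].dirty)
		{
			fsync(vfd_table[v].fd);
			vfd_table[v].dirty = 0;
		}
}
```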
> You could possibly fix #1 by logging fsync requests at the vfd level;
> then, whenever a vfd is closed to free up a kernel fd, check the fsync
> flag and execute the pending fsync before closing the file. You could
> possibly fix #2 by having transaction commit invoke the pg_fsync_pending
> scan before it updates pg_log (and then fsyncing pg_log itself again
> after).
>
> (Actually, you could probably eliminate the notion of "fsync request"
> entirely, and simply have each vfd get marked "dirty" automatically when
> written to. Both closing a vfd and the scan at xact commit would look
> at the dirty bit to decide to do fsync.)
>
> What would still need to be thought about is whether this scheme
> preserves the ordering guarantee when a group of concurrent backends
> is considered, rather than one backend in isolation. (I believe that
> fsync() will apply to all dirty kernel buffers for a file, not just
> those dirtied by the requesting process, so each backend's fsyncs can
> affect the order in which other backends' writes hit the disk.)
> Offhand I do not see any problems there, but it's the kind of thing
> that requires more than offhand thought...

Glad someone is looking into this. Seems the above concern about
ordering is fine, because it is only marking the pg_log transactions as
committed that is important. You can fsync anything you want; you just
need to make sure your current transaction's buffers are fsync'ed before
you mark the transaction as complete.

--
  Bruce Momjian
Tom Lane wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> >>>> BTW, I have worked a little bit on this item. The idea is pretty
> >>>> simple. Instead of doing a real fsync() in pg_fsync(), just marking it
> >>>> so that we remember to do fsync() at the commit time. Following
> >>>> patches illustrate the idea.
>
> What would still need to be thought about is whether this scheme
> preserves the ordering guarantee when a group of concurrent backends
> is considered, rather than one backend in isolation. (I believe that
> fsync() will apply to all dirty kernel buffers for a file, not just
> those dirtied by the requesting process, so each backend's fsyncs can
> affect the order in which other backends' writes hit the disk.)
> Offhand I do not see any problems there, but it's the kind of thing
> that requires more than offhand thought...

The following is an example of what I first pointed out. I am talking
about PostgreSQL shared buffers, not kernel buffers.

Session-1
	begin;
	update A ...;

Session-2
	begin;
	select * from B ...;
		There's no PostgreSQL shared buffer available.
		This backend has to force the flush of a free buffer
		page. Unfortunately the page was dirtied by the
		above operation of Session-1, so it calls pg_fsync()
		for the table A. However fsync() is postponed until
		commit of this backend.

Session-1
	commit;
		There's no dirty buffer page for the table A,
		so pg_fsync() isn't called for the table A.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
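[Editor's note: Hiroshi's race can be made concrete with a toy model. Nothing below is real PostgreSQL code; "relations" are bit positions, and each backend keeps its own deferred-fsync set, as in the proposed patch. The point is that the fsync request lands in the evicting backend's set, not in the set of the transaction that dirtied the page.]

```c
#include <assert.h>

enum { NBACKENDS = 2, REL_A = 0 };

static int pending[NBACKENDS];          /* per-backend deferred fsync requests */
static int synced_at_commit[NBACKENDS]; /* what each backend fsync'd at commit */

/* Backend `evictor` steals a free buffer by flushing a dirty page of
 * relation `rel`; the deferred-fsync request is recorded in the
 * EVICTOR's set, even though another backend dirtied the page. */
void flush_dirty_buffer(int evictor, int rel)
{
	pending[evictor] |= (1 << rel);
}

void commit(int backend)
{
	/* "fsync" everything this backend marked -- and nothing else */
	synced_at_commit[backend] = pending[backend];
	pending[backend] = 0;
}
```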
> 1. It doesn't guarantee that the right files are fsync'd. It would
> in fact fsync whichever files happen to be using the same kernel
> file descriptor numbers at the close of the transaction as the ones
> you really wanted to fsync were using at the time fsync was requested.

Right. If a VFD is reused, the fd would not point to the same file
anymore.

> You could possibly fix #1 by logging fsync requests at the vfd level;
> then, whenever a vfd is closed to free up a kernel fd, check the fsync
> flag and execute the pending fsync before closing the file. You could
> possibly fix #2 by having transaction commit invoke the pg_fsync_pending
> scan before it updates pg_log (and then fsyncing pg_log itself again
> after).

I do not understand #2. I call pg_fsync_pending twice in
RecordTransactionCommit: one is after FlushBufferPool, and the other
is after TransactionIdCommit and FlushBufferPool. Or am I missing
something?

> What would still need to be thought about is whether this scheme
> preserves the ordering guarantee when a group of concurrent backends
> is considered, rather than one backend in isolation. (I believe that
> fsync() will apply to all dirty kernel buffers for a file, not just
> those dirtied by the requesting process, so each backend's fsyncs can
> affect the order in which other backends' writes hit the disk.)
> Offhand I do not see any problems there, but it's the kind of thing
> that requires more than offhand thought...

I thought about that too. If the ordering were that important, a
database managed by backends with -F on could be seriously corrupted.
I've never heard of such disasters caused by -F, so my conclusion was
that it's safe, or I have been very lucky. Note that I'm not talking
about pg_log vs. relations but the ordering among relations.

BTW, Hiroshi has pointed out an excellent point #3:

>Session-1
>begin;
>update A ...;
>
>Session-2
>begin;
>select * from B ...;
>	There's no PostgreSQL shared buffer available.
>	This backend has to force the flush of a free buffer
>	page. Unfortunately the page was dirtied by the
>	above operation of Session-1 and calls pg_fsync()
>	for the table A. However fsync() is postponed until
>	commit of this backend.
>
>Session-1
>commit;
>	There's no dirty buffer page for the table A.
>	So pg_fsync() isn't called for the table A.

Seems there's no easy solution for this. Maybe now is the time to give
up my idea...
--
Tatsuo Ishii
> BTW, Hiroshi has pointed out an excellent point #3:
>
> >Session-1
> >begin;
> >update A ...;
> >
> >Session-2
> >begin;
> >select * from B ...;
> >	There's no PostgreSQL shared buffer available.
> >	This backend has to force the flush of a free buffer
> >	page. Unfortunately the page was dirtied by the
> >	above operation of Session-1 and calls pg_fsync()
> >	for the table A. However fsync() is postponed until
> >	commit of this backend.
> >
> >Session-1
> >commit;
> >	There's no dirty buffer page for the table A.
> >	So pg_fsync() isn't called for the table A.
>
> Seems there's no easy solution for this. Maybe now is the time to give
> up my idea...

I hate to see you give up on this.

Don't tell me we fsync on every buffer write, and not just at
transaction commit? That is terrible.

What if we set a flag on the file descriptor stating we dirtied/wrote
one of its buffers during the transaction, and cycle through the file
descriptors on buffer commit and fsync all involved in the transaction?
We also fsync if we close a file descriptor that was involved in the
transaction. We clear the "involved in this transaction" flag on commit
too.

--
  Bruce Momjian
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> possibly fix #2 by having transaction commit invoke the pg_fsync_pending
>> scan before it updates pg_log (and then fsyncing pg_log itself again
>> after).

> I do not understand #2. I call pg_fsync_pending twice in
> RecordTransactionCommit, one is after FlushBufferPool, and the other
> is after TransactionIdCommit and FlushBufferPool. Or am I missing
> something?

Oh, OK. That's what I meant. The snippet you posted didn't show where
you were calling the fsync routine from.

> I thought about that too. If the ordering was that important, a
> database managed by backends with -F on could be seriously
> corrupted. I've never heard of such disasters caused by -F.

This is why I think that fsync actually offers very little extra
protection ;-)

> BTW, Hiroshi has noticed me an excellent point #3:
>> This backend has to force the flush of a free buffer
>> page. Unfortunately the page was dirtied by the
>> above operation of Session-1 and calls pg_fsync()
>> for the table A. However fsync() is postponed until
>> commit of this backend.
>>
>> Session-1
>> commit;
>> There's no dirty buffer page for the table A.
>> So pg_fsync() isn't called for the table A.

Oooh, right. Backend A dirties the page, but leaves it sitting in
shared buffer. Backend B needs the buffer space, so it does the fwrite
of the page. Now if backend A wants to commit, it can fsync everything
it's written --- but does that guarantee the page that was actually
written by B will get flushed to disk? Not sure.

If the pending-fsync logic is based on either physical fds or vfds then
it definitely *won't* work; A might have found the desired page sitting
in buffer cache to begin with, and never have opened the underlying
file at all! So it seems you would need to keep a list of all the
relation files (and segments) you've written to in the current xact,
and open and fsync each one just before writing/fsyncing pg_log.

Even then, you're assuming that fsync applied to a file via an fd
belonging to one backend will flush disk buffers written to the same
file via *other* fds belonging to *other* processes. I'm not sure that
that is true on all Unixes... heck, I'm not sure it's true on any.
The fsync(2) man page here isn't real specific.

			regards, tom lane
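[Editor's note: Tom's question can be probed directly on any given platform: write through one descriptor and fsync through a second, independent open of the same file. On the systems one can easily check, fsync() acts on the underlying file rather than on the descriptor's private writes, but since the man pages are vague, treat this as a test to run, not a guarantee. cross_fd_fsync() is a hypothetical helper for illustration.]

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

int cross_fd_fsync(const char *path)
{
	int wfd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
	int sfd = open(path, O_WRONLY);   /* second, independent descriptor */
	int rc = -1;

	if (wfd >= 0 && sfd >= 0 &&
		write(wfd, "hello", 5) == 5 &&
		fsync(sfd) == 0)              /* sync via the fd that did no writing */
		rc = 0;
	if (wfd >= 0)
		close(wfd);
	if (sfd >= 0)
		close(sfd);
	return rc;
}
```

If fsync() only covered the calling descriptor's own writes, deferring fsyncs across backends could never be made safe, since the page may have been written out by a different process entirely.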
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Don't tell me we fsync on every buffer write, and not just at
> transaction commit? That is terrible.

If you don't have -F set, yup. Why did you think fsync mode was
so slow?

> What if we set a flag on the file descriptor stating we dirtied/wrote
> one of its buffers during the transaction, and cycle through the file
> descriptors on buffer commit and fsync all involved in the transaction.

That's exactly what Tatsuo was describing, I believe. I think Hiroshi
has pointed out a serious problem that would make it unreliable when
multiple backends are running: if some *other* backend fwrites the page
instead of your backend, and it doesn't fsync until *its* transaction is
done (possibly long after yours), then you lose the ordering guarantee
that is the point of the whole exercise...

			regards, tom lane
At 11:31 AM 2/7/00 -0500, Bruce Momjian wrote:
>I hate to see you give up on this.

>Don't tell me we fsync on every buffer write, and not just at
>transaction commit? That is terrible.

Won't we have many more options in this area, i.e. increasing
performance while maintaining on-disk data integrity, once WAL is
implemented?

snapshot + WAL = your database

so in theory -F on tables and the transaction log would be safe as long
as you have a snapshot, and as long as the WAL is being fsync'd and you
have the disk space to hold the WAL until you update your snapshot, no?

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Don't tell me we fsync on every buffer write, and not just at
> > transaction commit? That is terrible.
>
> If you don't have -F set, yup. Why did you think fsync mode was
> so slow?
>
> > What if we set a flag on the file descriptor stating we dirtied/wrote
> > one of its buffers during the transaction, and cycle through the file
> > descriptors on buffer commit and fsync all involved in the transaction.
>
> That's exactly what Tatsuo was describing, I believe. I think Hiroshi
> has pointed out a serious problem that would make it unreliable when
> multiple backends are running: if some *other* backend fwrites the page
> instead of your backend, and it doesn't fsync until *its* transaction is
> done (possibly long after yours), then you lose the ordering guarantee
> that is the point of the whole exercise...

OK, I understand now. You are saying if my backend dirties a buffer,
but another backend does the write, would my backend fsync() that buffer
that the other backend wrote?

I can't imagine how fsync could flush _only_ the file descriptor buffers
modified by the current process. It would have to affect all buffers
for the file descriptor. BSDI says:

	Fsync() causes all modified data and attributes of fd to be moved
	to a permanent storage device. This normally results in all
	in-core modified copies of buffers for the associated file to be
	written to a disk.

Looking at the BSDI kernel, there is a user-mode file descriptor table,
which maps to a kernel file descriptor table. This table can be shared,
so a file descriptor can be opened multiple times, as in a fork() call.
The kernel table maps to an actual file inode/vnode that maps to a file.
The only thing that is kept in the file descriptor table is the current
offset in the file (struct file in BSD). There is no mapping of who
wrote which blocks.

In fact, I would suggest that any kernel implementation that could track
such things would be pretty broken. I can imagine some cases where the
use of that mapping of blocks to file descriptors would cause
compatibility problems. Those buffers have to be shared by all
processes.

So, I think we are safe if we can either keep that file descriptor open
until commit, or re-open it and fsync it on commit. That assumes a
re-open is hitting the same file. My opinion is that we should just
fsync it on close and not worry about a reopen.

--
  Bruce Momjian
* Bruce Momjian <pgman@candle.pha.pa.us> [000207 10:14] wrote:
> OK, I understand now. You are saying if my backend dirties a buffer,
> but another backend does the write, would my backend fsync() that buffer
> that the other backend wrote.
>
> I can't imagine how fsync could flush _only_ the file descriptor buffers
> modified by the current process. It would have to affect all buffers
> for the file descriptor.
>
> In fact, I would suggest that any kernel implementation that could track
> such things would be pretty broken. I can imagine some cases the use of
> that mapping of blocks to file descriptors would cause compatibility
> problems. Those buffers have to be shared by all processes.
>
> So, I think we are safe if we can either keep that file descriptor open
> until commit, or re-open it and fsync it on commit. That assume a
> re-open is hitting the same file. My opinion is that we should just
> fsync it on close and not worry about a reopen.

I'm pretty sure that the standard is that a close on a file _should_
fsync it.

In re the fsync problems...

I came across this option when investigating implementing range fsync()
for FreeBSD: 'O_FSYNC'/'O_SYNC'.

Why not keep 2 file descriptors open for each datafile, one opened
with O_FSYNC (exists but not documented in FreeBSD) and one normal?
This guarantees sync writes for all write operations on that fd. Most
unices offer an open flag for this type of access, although the name
will vary (Linux/Solaris uses O_SYNC afaik). When a sync write is
needed, use that file descriptor to do the writing, and use the normal
one for non-sync writes. This would fix the problem where another
backend causes an out-of-order or unsafe fsync to occur.

Another option is using mmap() and msync() to achieve the same effect.
The only problem with mmap() is that under most i386 systems you are
limited to a < 4 gig (2 gig with FreeBSD) mapping that would have to be
'windowed' over the datafiles; however, depending on the locality of
accesses this may be much more efficient than read/write semantics.
Not to mention that a lot of unices have broken mmap() implementations
and problems with merged vm/buffercache.

Yes, I haven't looked at the backend code, just hoping to offer some
useful suggestions.

-Alfred
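[Editor's note: Alfred's suggestion sketched in code. A descriptor opened with O_SYNC (spelled O_FSYNC on older BSDs) makes every write on it synchronous, so no separate fsync call, and hence no cross-backend fsync side effect, is involved. write_sync() is a hypothetical helper for illustration, not backend code.]

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef O_SYNC
#define O_SYNC O_FSYNC   /* older BSD spelling of the same flag */
#endif

int write_sync(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_CREAT | O_WRONLY | O_APPEND | O_SYNC, 0600);
	ssize_t n;

	if (fd < 0)
		return -1;
	/* with O_SYNC, write() returns only once the data is on stable storage */
	n = write(fd, buf, len);
	close(fd);
	return (n == (ssize_t) len) ? 0 : -1;
}
```

In Alfred's two-descriptor scheme the backend would keep one such fd per datafile alongside a normal one, and route only the writes that must be durable through the synchronous descriptor.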
> > So, I think we are safe if we can either keep that file descriptor open
> > until commit, or re-open it and fsync it on commit. That assume a
> > re-open is hitting the same file. My opinion is that we should just
> > fsync it on close and not worry about a reopen.
>
> I'm pretty sure that the standard is that a close on a file _should_
> fsync it.

This is not true. close() flushes the user buffers to kernel buffers;
it does not force to physical disk in all cases, I think. There is
really no need to force them to disk on close. The only time they have
to be forced to disk is when the system shuts down, or on an fsync call.

> In re the fsync problems...
>
> I came across this option when investigating implementing range fsync()
> for FreeBSD, 'O_FSYNC'/'O_SYNC'.
>
> Why not keep 2 file descriptors open for each datafile, one opened
> with O_FSYNC (exists but not documented in FreeBSD) and one normal?
> This guarantees sync writes for all write operations on that fd.

We actually don't want this. We'd like to just fsync the file
descriptor and retroactively fsync all our writes. fsync allows us to
decouple the write and the fsync, which is what we really are attempting
to do. Our current behavior is to do write/fsync together, which is
wasteful.

--
  Bruce Momjian
* Bruce Momjian <pgman@candle.pha.pa.us> [000207 11:00] wrote: > > > So, I think we are safe if we can either keep that file descriptor open > > > until commit, or re-open it and fsync it on commit. That assume a > > > re-open is hitting the same file. My opinion is that we should just > > > fsync it on close and not worry about a reopen. > > > > I'm pretty sure that the standard is that a close on a file _should_ > > fsync it. > > This is not true. close flushes the user buffers to kernel buffers. It > does not force to physical disk in all cases, I think. There is really > no need to force them to disk on close. The only time they have to be > forced to disk is when the system shuts down, or on an fsync call. > > > > > In re the fsync problems... > > > > I came across this option when investigating implementing range fsync() > > for FreeBSD, 'O_FSYNC'/'O_SYNC'. > > > > Why not keep 2 file descritors open for each datafile, one opened > > with O_FSYNC (exists but not documented in FreeBSD) and one normal? > > This garantees sync writes for all write operations on that fd. > > We actually don't want this. We like to just fsync the file descriptor > and retroactively fsync all our writes. fsync allows us to decouple the > write and the fsync, which is what we really are attempting to do. Our > current behavour is to do write/fsync together, which is wasteful. Yes, the way I understand it is that one backend doing the fsync will sync the entire file perhaps forcing a sync in the middle of a somewhat critical update being done by another instance of the backend. Since the current behavior seems to be write/fsync/write/fsync... instead of write/write/write/fsync you may as well try opening the filedescriptor with O_FSYNC on operating systems that support it to avoid the cross-fsync problem. Another option is to use O_FSYNC descriptiors and aio_write to allow a sync writes to be 'backgrounded'. More and more unix OS's are supporting aio nowadays. 
I'm aware of the performance implications sync writes cause, but using
fsync after every write seems to cause massive amounts of unnecessary
disk IO that could be avoided by using explicit sync descriptors, with
little increase in complexity considering what I understand of the
current implementation.

Basically it would seem to be a good hack until you get the algorithm
to batch fsyncs working (write/write/write.../fsync). At that point
you may want to window over the files using msync(), but there may be
a better way, one that allows a vector of io to be scheduled for sync
write in one go, rather than a buffer at a time.

-Alfred
> Yes, the way I understand it is that one backend doing the fsync
> will sync the entire file perhaps forcing a sync in the middle of
> a somewhat critical update being done by another instance of the
> backend.

We don't mind that. Until the transaction is marked as complete, they
can fsync anything we want. We just want all stuff modified by a
transaction fsynced before a transaction is marked as completed.

> I'm aware of the performance implications sync writes cause, but
> using fsync after every write seems to cause massive amounts of
> unnecessary disk IO that could be avoided with using explicit
> sync descriptors with little increase in complexity considering
> what I understand of the current implementation.

Yes.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I can't imagine how fsync could flush _only_ the file descriptor buffers
> modified by the current process. It would have to affect all buffers
> for the file descriptor.

Yeah, you're probably right. After thinking about it, I can't believe
that a disk block buffer inside the kernel has any record of which FD it
was written by (after all, it could have been dirtied through more than
one FD since it was last synced to disk). All it's got is a file inode
number and a block number within the file. Presumably fsync() searches
the buffer cache for blocks that match the FD's inode number and
schedules I/O for all the ones that are dirty.

> So, I think we are safe if we can either keep that file descriptor open
> until commit, or re-open it and fsync it on commit. That assumes a
> re-open is hitting the same file. My opinion is that we should just
> fsync it on close and not worry about a reopen.

There's still the problem that your backend might never have opened the
relation file at all, still less done a write through its fd or vfd.
I think we would need to have a separate data structure saying "these
relations were dirtied in the current xact" that is not tied to fd's or
vfd's. Maybe the relcache would be a good place to keep such a flag.

Transaction commit would look like:

* scan buffer cache for dirty buffers, fwrite each one that belongs
  to one of the relations I'm trying to commit;

* open and fsync each segment of each rel that I'm trying to commit
  (or maybe just the dirtied segments, if we want to do the bookkeeping
  at that level of detail);

* make pg_log entry;

* write and fsync pg_log.

fsync-on-close is probably a waste of cycles. The only way that would
matter is if someone else were doing a RENAME TABLE on the rel, thus
preventing you from reopening it.
I think we could just put the responsibility on the renamer to fsync the file while he's doing it (in fact I think that's already in there, at least to the extent of flushing the buffer cache). regards, tom lane
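Tom's proposed commit sequence can be sketched in outline. This is purely an illustrative sketch, not PostgreSQL source: the names XactDirtyList, mark_rel_dirtied, and rels_needing_fsync are hypothetical stand-ins for the "relations dirtied in the current xact" bookkeeping he describes.

```c
#include <assert.h>

/*
 * Hypothetical per-transaction bookkeeping (not real PostgreSQL code):
 * remember which relations this xact dirtied, independent of fds/vfds.
 */
#define MAX_RELS 64

typedef struct
{
    int relid[MAX_RELS];    /* relation identifiers dirtied this xact */
    int nrels;              /* number of distinct relations recorded */
} XactDirtyList;

/* Record that the current transaction dirtied a relation (idempotent). */
static void
mark_rel_dirtied(XactDirtyList *x, int relid)
{
    for (int i = 0; i < x->nrels; i++)
        if (x->relid[i] == relid)
            return;             /* already recorded */
    x->relid[x->nrels++] = relid;
}

/*
 * At commit we would: flush dirty buffers belonging to each recorded
 * relation, open and fsync each of its segments, then write and fsync
 * pg_log.  Here we only report how many relations need that pass.
 */
static int
rels_needing_fsync(const XactDirtyList *x)
{
    return x->nrels;
}
```

The point of the structure is that it is keyed by relation, not by file descriptor, so it survives fds being closed and recycled mid-transaction.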
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Bruce Momjian
>
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Don't tell me we fsync on every buffer write, and not just at
> > > transaction commit? That is terrible.
> >
> > If you don't have -F set, yup. Why did you think fsync mode was
> > so slow?
> >
> > > What if we set a flag on the file descriptor stating we dirtied/wrote
> > > one of its buffers during the transaction, and cycle through the file
> > > descriptors on buffer commit and fsync all involved in the transaction.
> >
> > That's exactly what Tatsuo was describing, I believe. I think Hiroshi
> > has pointed out a serious problem that would make it unreliable when
> > multiple backends are running: if some *other* backend fwrites the page
> > instead of your backend, and it doesn't fsync until *its* transaction is
> > done (possibly long after yours), then you lose the ordering guarantee
> > that is the point of the whole exercise...
>
> OK, I understand now. You are saying if my backend dirties a buffer,
> but another backend does the write, would my backend fsync() that buffer
> that the other backend wrote.
>
> I can't imagine how fsync could flush _only_ the file descriptor buffers
> modified by the current process. It would have to affect all buffers
> for the file descriptor.
>
> BSDI says:
>
>     Fsync() causes all modified data and attributes of fd to be moved
>     to a permanent storage device. This normally results in all
>     in-core modified copies of buffers for the associated file to be
>     written to a disk.
>
> Looking at the BSDI kernel, there is a user-mode file descriptor table,
> which maps to a kernel file descriptor table. This table can be shared,
> so a file descriptor opened multiple times, like in a fork() call. The
> kernel table maps to an actual file inode/vnode that maps to a file.
> The only thing that is kept in the file descriptor table is the current
> offset in the file (struct file in BSD). There is no mapping of who
> wrote which blocks.
>
> In fact, I would suggest that any kernel implementation that could track
> such things would be pretty broken. I can imagine some cases the use of
> that mapping of blocks to file descriptors would cause compatibility
> problems. Those buffers have to be shared by all processes.
>
> So, I think we are safe if we can either keep that file descriptor open
> until commit, or re-open it and fsync it on commit. That assumes a
> re-open is hitting the same file. My opinion is that we should just
> fsync it on close and not worry about a reopen.
>

I asked about this question 4 months ago but got no answer. Obviously
this needs not only md/fd stuff changes but also bufmgr changes.
Keeping a dirtied list of segments for each backend seems to work, but
I'm afraid of other oversights. The problem is that this feature is
very difficult to verify. In addition, WAL would solve this item
naturally. Is it still valuable to solve this item in current spec ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > Is it still valuable to solve this item in current spec ? I'd be inclined to forget about it for now, and see what happens with WAL. It looks like a fair amount of work for a problem that will go away anyway in a release or so... regards, tom lane
Bruce Momjian wrote: > > > > So, I think we are safe if we can either keep that file descriptor open > > > until commit, or re-open it and fsync it on commit. That assume a > > > re-open is hitting the same file. My opinion is that we should just > > > fsync it on close and not worry about a reopen. > > > > I'm pretty sure that the standard is that a close on a file _should_ > > fsync it. > > This is not true. close flushes the user buffers to kernel buffers. It > does not force to physical disk in all cases, I think. fclose flushes user buffers to kernel buffers. close only frees the file descriptor for re-use.
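The distinction drawn here can be demonstrated directly. In this hedged sketch (write_durably is a made-up helper, not a PostgreSQL function), fflush() moves stdio buffers to the kernel, fsync() forces the kernel buffers to stable storage, and fclose() merely releases the descriptor without forcing anything to disk.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * The three buffering layers being discussed in this thread:
 *   fflush():  user (stdio) buffers -> kernel buffers
 *   fsync():   kernel buffers       -> physical disk
 *   fclose()/close(): releases the descriptor; forces neither to disk.
 *
 * Returns 0 on success, -1 on any failure.
 */
int
write_durably(const char *path, const char *data)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    fputs(data, f);
    if (fflush(f) != 0)             /* stdio buffer -> kernel */
    {
        fclose(f);
        return -1;
    }
    if (fsync(fileno(f)) != 0)      /* kernel buffer -> disk */
    {
        fclose(f);
        return -1;
    }
    return fclose(f);               /* just releases the descriptor */
}
```

Dropping the fsync() call leaves the data in kernel buffers, which is exactly the state a crash can lose: neither fflush nor fclose promises durability.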
> > So, I think we are safe if we can either keep that file descriptor open
> > until commit, or re-open it and fsync it on commit. That assumes a
> > re-open is hitting the same file. My opinion is that we should just
> > fsync it on close and not worry about a reopen.
>
> There's still the problem that your backend might never have opened the
> relation file at all, still less done a write through its fd or vfd.
> I think we would need to have a separate data structure saying "these
> relations were dirtied in the current xact" that is not tied to fd's or
> vfd's. Maybe the relcache would be a good place to keep such a flag.
>
> Transaction commit would look like:
>
> * scan buffer cache for dirty buffers, fwrite each one that belongs
>   to one of the relations I'm trying to commit;
>
> * open and fsync each segment of each rel that I'm trying to commit
>   (or maybe just the dirtied segments, if we want to do the bookkeeping
>   at that level of detail);

By fsync'ing on close, we need not worry about file descriptors that
were forced out of the file descriptor cache during the transaction.

If we dirty a buffer, we have to mark the buffer as dirty, and the file
descriptor associated with that buffer as needing fsync. If someone
else writes and removes that buffer from the cache before we get to
commit it, the file descriptor flag will tell us the file descriptor
needs fsync.

We have to:

	write our dirty buffers
	fsync all file descriptors marked as "written" during our transaction
	fsync all file descriptors on close when being cycled out of fd cache
	(fd close has to write dirty buffers before fsync)

So we have three states for a write:

	still in dirty buffer
	file descriptor marked as dirty/need fsync
	file descriptor removed from cache, fsync'ed on close

Seems this covers all the cases.

> * make pg_log entry;
>
> * write and fsync pg_log.

Yes.

> fsync-on-close is probably a waste of cycles.
> The only way that would matter is if someone else were doing a RENAME
> TABLE on the rel, thus preventing you from reopening it. I think we
> could just put the responsibility on the renamer to fsync the file
> while he's doing it (in fact I think that's already in there, at least
> to the extent of flushing the buffer cache).

I hadn't thought of that case. I was thinking of file descriptor cache
removal, or don't they get removed if they are in use? If not, you can
skip my close examples.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
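Bruce's three states for a write can be captured as a tiny state machine. A minimal sketch with hypothetical names; the point is that only a write whose fd was already fsync'ed on close is durable before commit, so commit must fsync everything else.

```c
#include <assert.h>

/* Hypothetical lifecycle of one written page under the scheme above. */
typedef enum
{
    WRITE_IN_DIRTY_BUFFER,      /* still sitting in a dirty shared buffer */
    WRITE_FD_NEEDS_FSYNC,       /* written; fd flagged as needing fsync */
    WRITE_FSYNCED_ON_CLOSE      /* fd cycled out of cache; fsync'd at close */
} WriteState;

/*
 * Commit must write-and-fsync anything in the first state and fsync
 * anything in the second; only the third state is already on disk.
 */
static int
needs_fsync_at_commit(WriteState s)
{
    return s != WRITE_FSYNCED_ON_CLOSE;
}
```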
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > > Is it still valuable to solve this item in current spec ? > > I'd be inclined to forget about it for now, and see what happens > with WAL. It looks like a fair amount of work for a problem that > will go away anyway in a release or so... But is seems Tatsuo is pretty close to it. I personally would like to see it in 7.0. Even with WAL, we may decide to allow non-WAL mode, and if so, this code would still be useful. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > So, I think we are safe if we can either keep that file descriptor open
> > until commit, or re-open it and fsync it on commit. That assumes a
> > re-open is hitting the same file. My opinion is that we should just
> > fsync it on close and not worry about a reopen.
> >
>
> I asked about this question 4 months ago but got no answer.
> Obviously this needs not only md/fd stuff changes but also bufmgr
> changes. Keeping a dirtied list of segments for each backend seems
> to work. But I'm afraid of other oversights.

I don't think so. We can just mark file descriptors as needing
fsync(). By doing that, we can spin through the buffer cache for each
need_fsync file descriptor, perform any writes needed, and fsync the
descriptor.

Seems like little redesign needed, except for adding the need_fsync
flag. Should be no more than about 20 lines.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Bruce Momjian
>
> > > So, I think we are safe if we can either keep that file
> > > descriptor open until commit, or re-open it and fsync it on
> > > commit. That assumes a re-open is hitting the same file. My
> > > opinion is that we should just fsync it on close and not worry
> > > about a reopen.
> >
> > There's still the problem that your backend might never have opened the
> > relation file at all, still less done a write through its fd or vfd.
> > I think we would need to have a separate data structure saying "these
> > relations were dirtied in the current xact" that is not tied to fd's or
> > vfd's. Maybe the relcache would be a good place to keep such a flag.
> >
> > Transaction commit would look like:
> >
> > * scan buffer cache for dirty buffers, fwrite each one that belongs
> >   to one of the relations I'm trying to commit;
> >
> > * open and fsync each segment of each rel that I'm trying to commit
> >   (or maybe just the dirtied segments, if we want to do the bookkeeping
> >   at that level of detail);
>
> By fsync'ing on close, we need not worry about file descriptors that were
> forced out of the file descriptor cache during the transaction.
>
> If we dirty a buffer, we have to mark the buffer as dirty, and the file
> descriptor associated with that buffer needing fsync. If someone else

What are the file descriptors associated with buffers?
Would you call heap_open() etc each time when a buffer is about
to be dirtied?

I don't object to you strongly but I ask again.
There's already the -F option for speeding up.
Who would want non-WAL mode with strict reliability after WAL
is implemented ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
At 10:04 AM 2/8/00 +0900, Hiroshi Inoue wrote:
>There's already the -F option for speeding up.
>Who would want non-WAL mode with strict reliability after WAL
>is implemented ?

Exactly. I suspect WAL will actually run faster, or at least will have
that potential when its existence is fully exploited, than non-WAL
non -F.

And it seems to me that touching something as crucial as disk
management in a fundamental way one week before the release of a
hopefully solid beta is pushing things a bit.

But, then again, I'm the resident paranoid conservative, I guess.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.
> > > * open and fsync each segment of each rel that I'm trying to commit
> > >   (or maybe just the dirtied segments, if we want to do the bookkeeping
> > >   at that level of detail);
> >
> > By fsync'ing on close, we need not worry about file descriptors that were
> > forced out of the file descriptor cache during the transaction.
> >
> > If we dirty a buffer, we have to mark the buffer as dirty, and the file
> > descriptor associated with that buffer needing fsync. If someone else
>
> What are the file descriptors associated with buffers?
> Would you call heap_open() etc each time when a buffer is about
> to be dirtied?

WriteBuffer -> FlushBuffer to flush the buffer. The buffer can be
either marked dirty or written/fsync'ed to disk. If written/fsync'ed,
smgr_flush -> mdflush -> _mdfd_getseg gets the MdfdVec structure of the
file descriptor. When doing the flush here, mark the MdfdVec
structure's new element needs_fsync true. Don't do fsync yet. If just
marked dirty, also mark MdfdVec.needs_fsync as true.

Do we currently write all dirty buffers on transaction commit? We
certainly must already do that in fsync mode.

On commit, run through the virtual file descriptor table and do fsyncs
on file descriptors. No need to find the buffers attached to file
descriptors. They have already been written by other code. They just
need fsync.

> There's already the -F option for speeding up.
> Who would want non-WAL mode with strict reliability after WAL
> is implemented ?

Let's see what Vadim says. Seems like a nice performance boost, and
7.0 could be 6 months away. If we didn't ship with fsync enabled, I
wouldn't care. Also, Vadim has a new job, so we really can't be sure
about WAL in 7.1.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
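The flow Bruce describes could look roughly like this. MdfdVec is a real structure in md.c, but the needs_fsync member and both helper functions here are hypothetical, illustrating only the proposed bookkeeping: flag a segment when one of its pages is written or dirtied, then sweep and fsync only flagged segments at commit.

```c
#include <stdbool.h>
#include <unistd.h>

/*
 * Illustrative sketch of the proposed change (not PostgreSQL source):
 * each open segment carries a needs_fsync flag, set wherever
 * mdflush()/FlushBuffer() would have done an immediate fsync today.
 */
#define MAX_SEGS 32

typedef struct
{
    int  fd;            /* kernel descriptor for this segment (-1 if closed) */
    bool needs_fsync;   /* the proposed new member */
} SegEntry;

/* Called where a page of segment i is written or dirtied. */
static void
mark_segment_written(SegEntry *segs, int i)
{
    segs[i].needs_fsync = true;
}

/*
 * At commit: fsync every flagged segment, clear the flags, and return
 * how many fsyncs were issued -- one per touched file, not one per page.
 */
static int
fsync_pending_segments(SegEntry *segs, int nsegs)
{
    int n = 0;

    for (int i = 0; i < nsegs; i++)
    {
        if (segs[i].needs_fsync)
        {
            if (segs[i].fd >= 0)
                (void) fsync(segs[i].fd);
            segs[i].needs_fsync = false;
            n++;
        }
    }
    return n;
}
```

The win is that a transaction touching one file many times pays one fsync instead of one per page write; the thread's unresolved question is what happens when another backend writes the page through a different descriptor.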
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: >> If we dirty a buffer, we have to mark the buffer as dirty, and the file >> descriptor associated with that buffer needing fsync. If someone else > What is the file descriptors associated with buffers ? I was about to make exactly that remark. A shared buffer doesn't have an "associated file descriptor", certainly not one that's valid across multiple backends. AFAICS no bookkeeping based on file descriptors (either kernel FDs or vfds) can possibly work correctly in the multiple-backend case. We *have to* do the bookkeeping on a relation basis, and that potentially means (re)opening the relation's file at xact commit in order to do an fsync. There is no value in having one backend fsync an FD before closing the FD, because that does not take account of what other backends may have done or do later with that same file through their own FDs for it. If we do not do an fsync at end of transaction, we cannot be sure that writes initiated by *other* backends will be complete. > There's already -F option for speeding up. > Who would want non-WAL mode with strict reliabilty after WAL > is implemented ? Yes. We have a better solution in the pipeline, so ISTM it's not worth expending a lot of effort on a stopgap. regards, tom lane
> -----Original Message-----
> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
>
> wouldn't care. Also, Vadim has a new job, so we really can't be sure
> about WAL in 7.1.
>

Oops, it's a big problem. If so, we may have to do something about
this item. However, it seems too late for 7.0. This isn't the kind of
item that a beta could verify.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Seems like little redesign needed, except for adding the need_fsync > flag. Should be no more than about 20 lines. If you think this is a twenty line fix, take a deep breath and back away slowly. You have not understood the problem. The problem comes in when *some other* backend has written out a shared buffer that contained a change that our backend made as part of the transaction that it now wants to commit. Without immediate- fsync-on-write (the current solution), there is no guarantee that the other backend will do an fsync any time soon; it might be busy in a very long-running transaction. Our backend must fsync that file, and it must do so after the other backend flushed the buffer. But there is no existing data structure that our backend can use to discover that it must do this. The shared buffer cannot record it; it might belong to some other file entirely by now (and in any case, the shared buffer is noplace to record per-transaction status info). Our backend cannot use either FD or VFD to record it, since it might never have opened the relation file at all, and certainly might have closed it again (and recycled the FD or VFD) before the other backend flushed the shared buffer. The relcache might possibly work as a place to record the need for fsync --- but I am concerned about the relcache's willingness to drop entries if they are not currently heap_open'd; also, md/fd don't currently use the relcache at all. This is not a trivial change. regards, tom lane
At 10:26 PM 2/7/00 -0500, Tom Lane wrote:
>Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Seems like little redesign needed, except for adding the need_fsync
>> flag. Should be no more than about 20 lines.
>
>If you think this is a twenty line fix, take a deep breath and back
>away slowly. You have not understood the problem.

And, again, thank you.

>This is not a trivial change.

I was actually through that code months ago, wondering why (ahem) PG
was so stupid about disk I/O, and reached the same conclusion.
Therefore, I was more than pleased when a simple fix to get rid of
fsync's on read-only transactions arose. In my application space, this
alone gave a huge performance boost.

WAL...that's it. If Vadim is going to be unavailable because of his
new job, we'll need to figure out another way to do it.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.
At 10:00 PM 2/7/00 -0500, Tom Lane wrote:
>"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> There's already the -F option for speeding up.
>> Who would want non-WAL mode with strict reliability after WAL
>> is implemented ?

>Yes. We have a better solution in the pipeline, so ISTM it's not
>worth expending a lot of effort on a stopgap.

Thanks to both of you.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.
> The problem comes in when *some other* backend has written out a
> shared buffer that contained a change that our backend made as part
> of the transaction that it now wants to commit. Without immediate-
> fsync-on-write (the current solution), there is no guarantee that the
> other backend will do an fsync any time soon; it might be busy in
> a very long-running transaction. Our backend must fsync that file,
> and it must do so after the other backend flushed the buffer. But
> there is no existing data structure that our backend can use to
> discover that it must do this. The shared buffer cannot record it;
> it might belong to some other file entirely by now (and in any case,
> the shared buffer is noplace to record per-transaction status info).
> Our backend cannot use either FD or VFD to record it, since it might
> never have opened the relation file at all, and certainly might have
> closed it again (and recycled the FD or VFD) before the other backend
> flushed the shared buffer. The relcache might possibly work as a
> place to record the need for fsync --- but I am concerned about the
> relcache's willingness to drop entries if they are not currently
> heap_open'd; also, md/fd don't currently use the relcache at all.

OK, I will admit I must be wrong, but I would like to understand why.

I am suggesting opening and marking a file descriptor as needing fsync
even if I only dirty the buffer and not write it. I understand another
backend may write my buffer and remove it before I commit my
transaction. However, I will be the one to fsync it. I am also
suggesting that such file descriptors never get recycled until
transaction commit.

Is that wrong?

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I am suggesting opening and marking a file descriptor as needing fsync > even if I only dirty the buffer and not write it. I understand another > backend may write my buffer and remove it before I commit my > transaction. However, I will be the one to fsync it. I am also > suggesting that such file descriptors never get recycled until > transaction commit. > Is that wrong? I see where you're going, and you could possibly make it work, but there are a bunch of problems. One objection is that kernel FDs are a very finite resource on a lot of platforms --- you don't really want to tie up one FD for every dirty buffer, and you *certainly* don't want to get into a situation where you can't release kernel FDs until end of xact. You might be able to get around that by associating the fsync-needed bit with VFDs instead of FDs. What may turn out to be a nastier problem is the circular dependency this creates between shared-buffer management and md.c/fd.c. Right now (IIRC at 3am) md/fd are clearly at a lower level than bufmgr, but that would stop being true if you make FDs be proxies for dirtied buffers. Here is one off-the-top-of-the-head trouble scenario: bufmgr wants to dump a buffer that was dirtied by another backend -> needs to open FD -> fd.c has no free FDs, needs to close one -> needs to dump and fsync a buffer so it can forget the FD -> bufmgr needs to get I/O lock on two different buffers at once -> potential deadlock against another backend doing the reverse. (Assuming you even get that far, and don't hang up at the recursive entry to bufmgr trying to get a spinlock you already hold...) Possibly with close study you can prove that no such problem can happen. My point is just that this isn't a trivial change. Is it worth investing substantial effort on what will ultimately be a dead end? regards, tom lane
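For what it's worth, the classic way to rule out the two-buffer deadlock Tom sketches is a fixed global lock order. A minimal sketch, hypothetical and not PostgreSQL code: if every backend always takes the I/O lock on the lower-numbered buffer first, no two backends can each hold one lock while waiting for the other's.

```c
/*
 * Deadlock-avoidance sketch (illustration only): given two buffer ids
 * a backend must I/O-lock, emit them in the globally agreed order.
 * Locking *first* before *second* everywhere makes a wait cycle
 * between two backends impossible.
 */
static void
order_buffer_locks(int a, int b, int *first, int *second)
{
    if (a <= b)
    {
        *first = a;
        *second = b;
    }
    else
    {
        *first = b;
        *second = a;
    }
}
```

Whether such an ordering discipline could actually be imposed on the bufmgr/md/fd call paths described above is exactly the "close study" Tom says would be required.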
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am suggesting opening and marking a file descriptor as needing fsync
> > even if I only dirty the buffer and not write it. I understand another
> > backend may write my buffer and remove it before I commit my
> > transaction. However, I will be the one to fsync it. I am also
> > suggesting that such file descriptors never get recycled until
> > transaction commit.
> >
> > Is that wrong?
>
> I see where you're going, and you could possibly make it work, but
> there are a bunch of problems. One objection is that kernel FDs
> are a very finite resource on a lot of platforms --- you don't really
> want to tie up one FD for every dirty buffer, and you *certainly*
> don't want to get into a situation where you can't release kernel
> FDs until end of xact. You might be able to get around that by
> associating the fsync-needed bit with VFDs instead of FDs.

OK, at least I was thinking correctly. Yes, there are serious
drawbacks that make this pretty hard to implement. Unless Vadim
revives this, we can drop it.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> BTW, Hiroshi has pointed out to me an excellent point #3:
>
> >Session-1
> >begin;
> >update A ...;
> >
> >Session-2
> >begin;
> >select * from B ..;
> >  There's no PostgreSQL shared buffer available.
> >  This backend has to force the flush of a free buffer
> >  page. Unfortunately the page was dirtied by the
> >  above operation of Session-1 and calls pg_fsync()
> >  for the table A. However fsync() is postponed until
> >  commit of this backend.
> >
> >Session-1
> >commit;
> >  There's no dirty buffer page for the table A.
> >  So pg_fsync() isn't called for the table A.
>
> Seems there's no easy solution for this. Maybe now is the time to give
> up my idea...

Thinking about it a little bit more, I have come across yet another
possible solution. It is actually *very* simple. Details as follows.

In xact.c:RecordTransactionCommit() there are two FlushBufferPool
calls. One is for relation files and the other is for pg_log. I add
sync() right after these FlushBufferPool calls. It will force any
pending kernel buffers to be physically written onto disk, thus should
guarantee the ACID of the transaction (see attached code fragment).

There are two things that we should worry about with sync, however.

1. Does sync really wait for the completion of data being written onto
disk?

I looked into the man page of sync(2) on Linux 2.0.36:

    According to the standard specification (e.g., SVID), sync()
    schedules the writes, but may return before the actual writing
    is done. However, since version 1.3.20 Linux does actually
    wait. (This still does not guarantee data integrity: modern
    disks have large caches.)

It seems that sync(2) blocks until data is written. So it would be ok
at least with Linux. I'm not sure about other platforms, though.

2. Do we suffer any performance penalty from sync?

Since sync forces *all* dirty buffers on the system to be written onto
disk, it might be slower than fsync. So I did some testing using
contrib/pgbench.
Starting postmaster with -F on (and with the sync modification), I ran
32 concurrent clients, each performing 10 transactions. In total 320
transactions are performed. Each transaction contains an UPDATE and a
SELECT to a table that has 1000k tuples, and an INSERT to another
small table.

The result showed that -F + sync was actually faster than the default
mode (no -F, no modifications). The system is a Red Hat 5.2, with
128MB RAM.

	                    -F + sync    normal mode
	--------------------------------------------------------
	transactions/sec       3.46         2.93

Of course if there are disk activities other than PostgreSQL, sync
would suffer from them. However, in most cases the system is dedicated
to PostgreSQL only, and I don't think this is a big problem in the
real world. Note that large COPY or INSERT was much faster than in
the normal mode due to no per-page fsync.

Thinking about all these, I would like to propose we add a new switch
to postgres to run with -F + sync.

------------------------------------------------------------------------
	/*
	 * If no one shared buffer was changed by this transaction then
	 * we don't flush shared buffers and don't record commit status.
	 */
	if (SharedBufferChanged)
	{
		FlushBufferPool();
		sync();

		if (leak)
			ResetBufferPool();

		/*
		 * have the transaction access methods record the status
		 * of this transaction id in the pg_log relation.
		 */
		TransactionIdCommit(xid);

		/*
		 * Now write the log info to the disk too.
		 */
		leak = BufferPoolCheckLeak();
		FlushBufferPool();
		sync();
	}
* Tatsuo Ishii <t-ishii@sra.co.jp> [000209 00:51] wrote:
> > BTW, Hiroshi has pointed out to me an excellent point #3:
> >
> > >Session-1
> > >begin;
> > >update A ...;
> > >
> > >Session-2
> > >begin;
> > >select * from B ..;
> > >  There's no PostgreSQL shared buffer available.
> > >  This backend has to force the flush of a free buffer
> > >  page. Unfortunately the page was dirtied by the
> > >  above operation of Session-1 and calls pg_fsync()
> > >  for the table A. However fsync() is postponed until
> > >  commit of this backend.
> > >
> > >Session-1
> > >commit;
> > >  There's no dirty buffer page for the table A.
> > >  So pg_fsync() isn't called for the table A.
> >
> > Seems there's no easy solution for this. Maybe now is the time to give
> > up my idea...
>
> Thinking about it a little bit more, I have come across yet another
> possible solution. It is actually *very* simple. Details as follows.
>
> In xact.c:RecordTransactionCommit() there are two FlushBufferPool
> calls. One is for relation files and the other is for pg_log. I add
> sync() right after these FlushBufferPool calls. It will force any
> pending kernel buffers to be physically written onto disk, thus should
> guarantee the ACID of the transaction (see attached code fragment).
>
> There are two things that we should worry about with sync, however.
>
> 1. Does sync really wait for the completion of data being written onto
> disk?
>
> I looked into the man page of sync(2) on Linux 2.0.36:
>
>     According to the standard specification (e.g., SVID),
>     sync() schedules the writes, but may return before the
>     actual writing is done. However, since version 1.3.20
>     Linux does actually wait. (This still does not guarantee
>     data integrity: modern disks have large caches.)
>
> It seems that sync(2) blocks until data is written. So it would be ok
> at least with Linux. I'm not sure about other platforms, though.
It is incorrect to assume that sync() waits until all buffers are
flushed on any platform other than Linux; I didn't think that Linux
even did so, but the kernel sources say yes.

Solaris doesn't do this and neither does FreeBSD/NetBSD.

I guess if you wanted to implement this for Linux only then it would
work; you ought to then also warn people that a non-dedicated db
server could experience different performance using this code.

-Alfred
> > It seems that sync(2) blocks until data is written. So it would be ok
> > at least with Linux. I'm not sure about other platforms, though.
>
> It is incorrect to assume that sync() waits until all buffers are
> flushed on any platform other than Linux; I didn't think
> that Linux even did so, but the kernel sources say yes.

Right. I have looked at the Linux kernel sources and confirmed it.

> Solaris doesn't do this and neither does FreeBSD/NetBSD.

I'm not sure about Solaris since I don't have access to its source
code. Will look at the FreeBSD kernel sources.

> I guess if you wanted to implement this for Linux only then it would
> work; you ought to then also warn people that a non-dedicated db
> server could experience different performance using this code.

I just want to have more choices other than with/without -F. -F loses
ACID, and without it we get a per-page fsync. Both choices are
painful. But switching to expensive commercial DBMSs is much more
painful, at least for me.

Even if it would be useful on Linux only and in a certain situation,
it would be better than nothing IMHO (until WAL comes up).
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> [ use a global sync instead of fsync ]

> 1. Does sync() really wait for the completion of data being written
>    onto disk?

Linux is *alone* among Unix platforms in waiting; every other
implementation of sync() returns as soon as the last dirty buffer
is scheduled to be written.

> 2. Do we suffer any performance penalty from sync()?

A global sync at the completion of every xact would be disastrous for
the performance of anything else on the system.

> However, in most cases the system is dedicated to only PostgreSQL,

"Most cases"? Do you have any evidence for that?

			regards, tom lane
> Thinking about it a little bit more, I have come across yet another
> possible solution. It is actually *very* simple. Details follow.
>
> In xact.c:RecordTransactionCommit() there are two FlushBufferPool
> calls. One is for relation files and the other is for pg_log. I add
> sync() right after these FlushBufferPool calls. It will force any
> pending kernel buffers to be physically written onto disk, and thus
> should guarantee the ACID properties of the transaction (see attached
> code fragment).

Interesting idea. I had proposed this solution long ago. My idea was to
buffer pg_log writes every 30 seconds: every 30 seconds, do a sync, then
write/sync pg_log. Seemed like a good solution at the time, but Vadim
didn't like it. I think he preferred to do logging, but honestly, it was
over a year ago, and we could have been benefiting from it all this
time.

Second, I had another idea. What if we fsync()'ed a file descriptor only
when we were writing the _last_ dirty buffer for that file? Seems in
many cases this would be a win. I just don't know how hard that is to
figure out. Seems there is no need to fsync() if we still have dirty
buffers around.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
* Tatsuo Ishii <t-ishii@sra.co.jp> [000209 07:32] wrote:
> > > It seems that sync(2) blocks until data is written. So it would be ok
> > > at least with Linux. I'm not sure about other platforms, though.
> >
> > It is incorrect to assume that sync() waits until all buffers are
> > flushed on any platform other than Linux. I didn't think that Linux
> > even did so, but the kernel sources say yes.
>
> Right. I have looked at the Linux kernel sources and confirmed it.
>
> > Solaris doesn't do this and neither does FreeBSD/NetBSD.
>
> I'm not sure about Solaris since I don't have access to its source
> code. Will look at the FreeBSD kernel sources.
>
> > I guess if you wanted to implement this for Linux only then it would
> > work; you ought to then also warn people that a non-dedicated db
> > server could experience different performance using this code.
>
> I just want to have more choices other than with/without -F. With -F we
> lose ACID; without it we get per-page fsync. Both choices are painful.
> But switching to expensive commercial DBMSs is much more painful, at
> least for me.
>
> Even if it would be useful on Linux only and in certain situations, it
> would be better than nothing IMHO (until WAL comes up).

Ok, here's a nifty idea: a slave process called pgsyncer.

At the end of a transaction a backend asks the syncer to fsync all
files.

Now here's the cool part: this avoids the non-portability of the Linux
sync() approach, and at the same time restricts the syncing to
postgresql and reduces 'cross-fsync' issues.

Imagine:

postgresql has 3 files open (a, b, c); so will the syncer.
backend 1 completes a request, communicates to the syncer that a flush
  is needed.
syncer starts by fsync'ing 'a'
backend 2 completes a request, communicates to the syncer
syncer continues with 'b' then 'c'
syncer responds to backend 1 that it's safe to proceed.
syncer fsyncs 'a' again
syncer responds to backend 2 that it's all completed.

Effectively the fsyncs of 'b' and 'c' have been batched.

It's just an elevator algorithm; perhaps this can be done without a
separate slave process?

-Alfred
Alfred Perlstein <bright@wintelcom.net> writes:
> postgresql has 3 files open (a, b, c); so will the syncer.

The syncer must have all the files open that are open in any backend?
What happens when it runs into the FDs-per-process limit?

> backend 1 completes a request, communicates to the syncer that a flush
>   is needed.
> syncer starts by fsync'ing 'a'
> backend 2 completes a request, communicates to the syncer
> syncer continues with 'b' then 'c'
> syncer responds to backend 1 that it's safe to proceed.
> syncer fsyncs 'a' again
> syncer responds to backend 2 that it's all completed.
>
> Effectively the fsyncs of 'b' and 'c' have been batched.

And it's safe to update pg_log when? I'm failing to see where the
advantage is compared to the backends issuing their own fsyncs...

			regards, tom lane
> Ok, here's a nifty idea: a slave process called pgsyncer.
>
> At the end of a transaction a backend asks the syncer to fsync all
> files.
>
> Now here's the cool part: this avoids the non-portability of the Linux
> sync() approach, and at the same time restricts the syncing to
> postgresql and reduces 'cross-fsync' issues.
>
> Imagine:
>
> postgresql has 3 files open (a, b, c); so will the syncer.
> backend 1 completes a request, communicates to the syncer that a flush
>   is needed.
> syncer starts by fsync'ing 'a'
> backend 2 completes a request, communicates to the syncer
> syncer continues with 'b' then 'c'
> syncer responds to backend 1 that it's safe to proceed.
> syncer fsyncs 'a' again
> syncer responds to backend 2 that it's all completed.
>
> Effectively the fsyncs of 'b' and 'c' have been batched.
>
> It's just an elevator algorithm; perhaps this can be done without a
> separate slave process?

If you go to the hackers archive, you will see an implementation under
the subject "Bufferd loggins/pg_log" dated November 1997. We have gone
over 2 years without this option, and it is going to be even longer
before it is available via WAL.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> -----Original Message-----
> From: owner-pgsql-hackers@postgresql.org
> [mailto:owner-pgsql-hackers@postgresql.org]On Behalf Of Tom Lane
>
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > [ use a global sync instead of fsync ]
>
> > 1. Does sync() really wait for the completion of data being written
> >    onto disk?
>
> Linux is *alone* among Unix platforms in waiting; every other
> implementation of sync() returns as soon as the last dirty buffer
> is scheduled to be written.
>
> > 2. Do we suffer any performance penalty from sync()?
>
> A global sync at the completion of every xact would be disastrous for
> the performance of anything else on the system.
>
> > However, in most cases the system is dedicated to only PostgreSQL,
>
> "Most cases"? Do you have any evidence for that?

Tatsuo is afraid of the delay of WAL. OTOH, it's not so easy to solve
this item under the current spec, and he probably wants a quick and
simple solution. His solution is only for a limited set of OSes, but it
is very simple. Moreover, it would make FlushBufferPool() more reliable
(I don't understand why FlushBufferPool() is allowed to not call
fsync() per page). The implementation would be in time for 7.0. Is a
temporary option until WAL arrives such a bad thing?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp