Thread: Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
> I don't think the bgwriter is going to be able to keep up with I/O bound
> backends, but I do think it can scan and set those booleans fast enough
> for the backends to then perform the writes.

As long as the bgwriter does not do sync writes (which it does not, since
that would need a whole lot of work to be performant), it calls write(),
which returns more or less at once. So the bottleneck can only be the
fsync. Of those you would want at least one in flight per pg disk, in
parallel. But I think it should really be left to the OS when it actually
does the IO for the writes issued by the bgwriter in between checkpoints.
So Imho the target should be to have not much IO open for the checkpoint,
so the fsync is fast enough, even if serial.

Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> So Imho the target should be to have not much IO open for the checkpoint,
> so the fsync is fast enough, even if serial.

The best we can do is push out dirty pages with write() via the bgwriter
and hope that the kernel will see fit to write them before checkpoint time
arrives. I am not sure if that hope has basis in fact or if it's just
wishful thinking. Most likely, if it does have basis in fact, it's because
there is a standard syncer daemon forcing a sync() every thirty seconds.

That means that instead of an I/O storm every checkpoint interval, we get
a smaller I/O storm every 30 seconds. Not sure this is a big improvement.
Jan already found out that issuing very frequent sync()s isn't a win.

People keep saying that the bgwriter mustn't write pages synchronously
because it'd be bad for performance, but I think that analysis is faulty.
Performance of what --- the bgwriter? Nonsense, the *point* of the
bgwriter is to do the slow tasks. The only argument that has any merit is
that O_SYNC or immediate fsync will prevent us from having multiple writes
outstanding and thus reduce the efficiency of disk write scheduling. This
is a valid point, but there is a limit to how many writes we need to have
in flight to keep things flowing smoothly.

What I'm thinking now is that the bgwriter should issue frequent fsyncs
for its writes --- not immediate, but a lot more often than once per
checkpoint. Perhaps take one recently written, unsynced file and fsync it
every time it is about to sleep. You could imagine various rules for
deciding which one to sync; perhaps the one with the most writes issued
against it since the last sync. When we have tablespaces it'd make sense
to try to distribute the syncs across tablespaces, on the assumption that
the tablespaces are probably on different drives.

			regards, tom lane
On Thursday 05 February 2004 20:24, Tom Lane wrote:
> "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> > So Imho the target should be to have not much IO open for the checkpoint,
> > so the fsync is fast enough, even if serial.
>
> The best we can do is push out dirty pages with write() via the bgwriter
> and hope that the kernel will see fit to write them before checkpoint
> time arrives. I am not sure if that hope has basis in fact or if it's
> just wishful thinking. Most likely, if it does have basis in fact it's
> because there is a standard syncer daemon forcing a sync() every thirty
> seconds.

There are other benefits of writing pages earlier even though they might
not get synced immediately. It would tell the kernel that this is the
latest copy of the updated buffer. The kernel VFS should make that copy
visible to every other backend as well, and the buffer manager would
fetch the updated copy from the VFS cache next time --- all without
actually going to disk (within the 30-second window, of course).

> People keep saying that the bgwriter mustn't write pages synchronously
> because it'd be bad for performance, but I think that analysis is
> faulty. Performance of what --- the bgwriter? Nonsense, the *point*
> of the bgwriter is to do the slow tasks. The only argument that has
> any merit is that O_SYNC or immediate fsync will prevent us from having
> multiple writes outstanding and thus reduce the efficiency of disk
> write scheduling. This is a valid point but there is a limit to how
> many writes we need to have in flight to keep things flowing smoothly.

Is it a valid assumption, for the platforms that PostgreSQL supports,
that a write call makes changes visible across processes?

> What I'm thinking now is that the bgwriter should issue frequent fsyncs
> for its writes --- not immediate, but a lot more often than once per
> checkpoint.

Frequent fsyncs overall, or frequent fsyncs per file descriptor written
to? I thought it was the latter. Just a thought.

Shridhar
Shridhar Daithankar <shridhar@frodo.hserus.net> writes:
> There are other benefits of writing pages earlier even though they might
> not get synced immediately.

Such as?

> It would tell the kernel that this is the latest copy of the updated
> buffer. The kernel VFS should make that copy visible to every other
> backend as well, and the buffer manager would fetch the updated copy
> from the VFS cache next time --- all without actually going to disk
> (within the 30-second window, of course).

This seems quite irrelevant given the way we handle shared buffers.

> Frequent fsyncs overall, or frequent fsyncs per file descriptor written
> to? I thought it was the latter.

You can only fsync one FD at a time (too bad ... if there were a
multi-file-fsync API it'd solve the overspecified-write-ordering issue).

			regards, tom lane
Tom Lane wrote:
> "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
>> So Imho the target should be to have not much IO open for the checkpoint,
>> so the fsync is fast enough, even if serial.
>
> The best we can do is push out dirty pages with write() via the bgwriter
> and hope that the kernel will see fit to write them before checkpoint
> time arrives. I am not sure if that hope has basis in fact or if it's
> just wishful thinking. Most likely, if it does have basis in fact it's
> because there is a standard syncer daemon forcing a sync() every thirty
> seconds.

Looking at the response time charts I did to show how vacuum delay is
doing, it seems that, at least on Linux, there is hope that that is the
case. Those charts were made with just a regular 5-minute checkpoint,
enough checkpoint segments for that, and no other sync effort at all.

The system has a hard time handling a larger scaled test DB, so it is
definitely well saturated with IO. The charts are here:

    http://developer.postgresql.org/~wieck/vacuum_cost/

> That means that instead of an I/O storm every checkpoint interval,
> we get a smaller I/O storm every 30 seconds. Not sure this is a big
> improvement. Jan already found out that issuing very frequent sync()s
> isn't a win.

In none of those charts can I see any checkpoint-caused IO storm anymore.
Charts I'm currently doing for 7.4.1 show extremely clear spikes at
checkpoints. If someone is interested in those as well, I will put them
up.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck wrote:
> Tom Lane wrote:
> > The best we can do is push out dirty pages with write() via the bgwriter
> > and hope that the kernel will see fit to write them before checkpoint
> > time arrives.
>
> Looking at the response time charts I did for showing how vacuum delay
> is doing, it seems at least on Linux there is hope that that is the
> case. Those charts have just a regular 5 minute checkpoint with enough
> checkpoint segments for that, and no other sync effort done at all.
>
> In none of those charts I can see any checkpoint caused IO storm any
> more. Charts I'm currently doing for 7.4.1 show extremely clear spikes
> at checkpoints. If someone is interested in those as well I will put
> them up.

So, Jan, are you basically saying that the background writer has solved
the checkpoint I/O flood problem, and we just need to deal with changing
sync to multiple fsync's at checkpoint?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
> Jan Wieck wrote:
>> In none of those charts I can see any checkpoint caused IO storm any
>> more. Charts I'm currently doing for 7.4.1 show extremely clear spikes
>> at checkpoints. If someone is interested in those as well I will put
>> them up.
>
> So, Jan, are you basically saying that the background writer has solved
> the checkpoint I/O flood problem, and we just need to deal with changing
> sync to multiple fsync's at checkpoint?

ISTM that the background writer at least has the ability to lower the
impact of a checkpoint significantly enough that one might not care
about it any more.

"Has the ability" means it needs to be adjusted to the actual DB usage.
The charts I produced were not done with the default settings, but
rather after making the bgwriter a bit more aggressive against dirty
pages.

The whole sync() vs. fsync() discussion is, in my opinion, nonsense at
this point. Without the ability to limit the number of files to
something reasonable, by employing tablespaces in the form of larger
container files, the risk of forcing excessive head movement is simply
too high.

Jan
Jan Wieck <JanWieck@Yahoo.com> writes:
> The whole sync() vs. fsync() discussion is in my opinion nonsense at
> this point.

The sync vs. fsync discussion is not about performance, it is about
correctness. You can't simply dismiss the fact that we don't know
whether a checkpoint is really complete when we write the checkpoint
record.

I liked the idea put forward by (I think) Kevin Brown, that we issue
sync() to start the I/O and then a bunch of fsyncs to wait for it to
finish. If sync behaves per spec ("all the I/O is scheduled upon
return") then the fsyncs will not affect I/O ordering in the least, but
they will ensure that we don't proceed until the I/O is all done.

Also there is the Windows-port problem of not having sync() available.
Doing only the fsyncs will provide an adequate, if possibly
lower-performing, solution there.

			regards, tom lane
Jan Wieck <JanWieck@Yahoo.com> writes:
> The whole sync() vs. fsync() discussion is in my opinion nonsense at this
> point. Without the ability to limit the amount of files to a reasonable
> number, by employing tablespaces in the form of larger container files,
> the risk of forcing excessive head movement is simply too high.

I don't think there was any suggestion of conflating tablespaces with
implementing a filesystem in postgres.

Tablespaces are just a database entity with which database stored
objects like tables and indexes are associated. They group database
stored objects and control the storage method and location.

The existing storage mechanism, namely a directory with a file for each
database object, is perfectly adequate and doesn't have to be replaced
to implement tablespaces. All that's needed is that the location of the
directory be associated with the "tablespace" of the object rather than
be a global constant.

Implementing an Oracle-style filesystem is just one more temptation to
reimplement OS services in the database. Personally I think it's an
awful idea. But even if postgres did it as an option, it wouldn't
necessarily have anything to do with tablespaces.

--
greg
Greg Stark wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>
>> The whole sync() vs. fsync() discussion is in my opinion nonsense at this
>> point. Without the ability to limit the amount of files to a reasonable
>> number, by employing tablespaces in the form of larger container files,
>> the risk of forcing excessive head movement is simply too high.
>
> The existing storage mechanism, namely a directory with a file for each
> database object, is perfectly adequate and doesn't have to be replaced to
> implement tablespaces. All that's needed is that the location of the
> directory be associated with the "tablespace" of the object rather than
> be a global constant.

This is not just a question of what you call it. In a system with, let's
say, 500 active backends on a database with, let's say, 1000 things that
are represented as a file, you'll need half a million virtual file
descriptors.

Jan
Jan Wieck <JanWieck@Yahoo.com> writes:
> In a system with, let's say, 500 active backends on a database with,
> let's say, 1000 things that are represented as a file, you'll need half
> a million virtual file descriptors.

[shrug] We've been dealing with virtual file descriptors for years.
I've seen no indication that they create any performance bottlenecks.

			regards, tom lane
Tom Lane wrote:
> You can only fsync one FD at a time (too bad ... if there were a
> multi-file-fsync API it'd solve the overspecified-write-ordering issue).

What about aio_fsync()?
Florian Weimer <fw@deneb.enyo.de> writes:
> Tom Lane wrote:
>> You can only fsync one FD at a time (too bad ... if there were a
>> multi-file-fsync API it'd solve the overspecified-write-ordering issue).

> What about aio_fsync()?

(1) It's unportable; (2) it's not clear that it's any improvement over
fsync(). The Single Unix Spec says aio_fsync "returns when the
synchronisation request has been initiated or queued to the file or
device". Depending on how the implementation works, this may mean that
all the dirty blocks have been scheduled for I/O and will be written
ahead of subsequently scheduled blocks --- if so, the results are not
really different from fsync()'ing the files in the same order.

The best idea I've heard so far is the one about sync() followed by a
bunch of fsync()s. That seems to be correct, efficient, and dependent
only on very-long-established Unix semantics.

			regards, tom lane
Tom Lane wrote:
> The best idea I've heard so far is the one about sync() followed by
> a bunch of fsync()s. That seems to be correct, efficient, and dependent
> only on very-long-established Unix semantics.

Agreed.
> Jan Wieck wrote:
>> In a system with let's say 500 active backends on a database with
>> let's say 1000 things that are represented as a file, you'll need
>> half a million virtual file descriptors.

I'm sort of a purist: I think that operating systems should be operating
systems and applications should be applications. Whenever you try to do
application-like things in an OS, it is a mistake. Whenever you try to
do OS-like things in an application, it, too, is a mistake.

Say a database has close to a thousand active files and you have 100
concurrent users. Why do you think that this could be handled better in
an application?
Are you saying that PostgreSQL could do a better job at managing 1/2 million shared file descriptors than the OS? Your example, IMHO, points out why you *shouldn't* try to have a dedicated file system.