Re: fsync method checking - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: fsync method checking |
Date | |
Msg-id | 200403181746.i2IHkDA00975@candle.pha.pa.us |
In response to | fsync method checking (Bruce Momjian <pgman@candle.pha.pa.us>) |
Responses |
Re: fsync method checking
Re: fsync method checking |
List | pgsql-hackers |
I have been poking around with our fsync default options to see if I can improve them. One issue is that we never default to O_SYNC, but default to O_DSYNC if it exists, which seems strange.

What I did was to beef up my test program and get it into CVS for folks to run. What I found was that different operating systems have different optimal defaults. On BSD/OS and FreeBSD, fdatasync/fsync was better, but on Linux, O_DSYNC/O_SYNC was faster.

BSD/OS 4.3:

    Simple write timing:
        write                    0.000055

    Compare fsync before and after write's close:
        write, fsync, close      0.000707
        write, close, fsync      0.000808

    Compare one o_sync write to two:
        one 16k o_sync write     0.009762
        two 8k o_sync writes     0.008799

    Compare file sync methods with one 8k write:
        (o_dsync unavailable)
        open o_sync, write       0.000658
        (fdatasync unavailable)
        write, fsync,            0.000702

    Compare file sync methods with 2 8k writes:
    (The fastest should be used for wal_sync_method)
        (o_dsync unavailable)
        open o_sync, write       0.010402
        (fdatasync unavailable)
        write, fsync,            0.001025

This shows terrible O_SYNC performance for 2 8k writes, but is faster for a single 8k write. Strange.

FreeBSD 4.9:

    Simple write timing:
        write                    0.000083

    Compare fsync before and after write's close:
        write, fsync, close      0.000412
        write, close, fsync      0.000453

    Compare one o_sync write to two:
        one 16k o_sync write     0.000409
        two 8k o_sync writes     0.000993

    Compare file sync methods with one 8k write:
        (o_dsync unavailable)
        open o_sync, write       0.000683
        (fdatasync unavailable)
        write, fsync,            0.000405

    Compare file sync methods with 2 8k writes:
        (o_dsync unavailable)
        open o_sync, write       0.000789
        (fdatasync unavailable)
        write, fsync,            0.000414

This shows fsync to be fastest in both cases.
Linux 2.4.9:

    Simple write timing:
        write                    0.000061

    Compare fsync before and after write's close:
        write, fsync, close      0.000398
        write, close, fsync      0.000407

    Compare one o_sync write to two:
        one 16k o_sync write     0.000570
        two 8k o_sync writes     0.000340

    Compare file sync methods with one 8k write:
        (o_dsync unavailable)
        open o_sync, write       0.000166
        write, fdatasync         0.000462
        write, fsync,            0.000447

    Compare file sync methods with 2 8k writes:
        (o_dsync unavailable)
        open o_sync, write       0.000334
        write, fdatasync         0.000445
        write, fsync,            0.000447

This shows O_SYNC to be fastest, even for 2 8k writes.

This unapplied patch:

    ftp://candle.pha.pa.us/pub/postgresql/mypatches/fsync

adds DEFAULT_OPEN_SYNC to the bsdi/freebsd/linux template files, which controls the default for those platforms. Platforms with no template default to fdatasync/fsync.

Would other users run src/tools/fsync and report their findings so I can update the template files for their OS's? This is a process similar to our thread testing.

Thanks.

---------------------------------------------------------------------------

Bruce Momjian wrote:
> Mark Kirkwood wrote:
> > This is a well-worn thread title - apologies, but these results seemed
> > interesting, and hopefully useful in the quest to get better performance
> > on Solaris:
> >
> > I was curious to see if the rather uninspiring pgbench performance
> > obtained from a Sun 280R (see General: ATA Disks and RAID controllers
> > for database servers) could be improved if more time was spent
> > tuning.
> >
> > With the help of a fellow workmate who is a bit of a Solaris guy, we
> > decided to have a go.
> >
> > The major performance killer appeared to be mounting the filesystem with
> > the logging option. The next most significant seemed to be the choice of
> > sync_method for Pg - the default (open_datasync), which we initially
> > thought should be the best - appears noticeably slower than fdatasync.
>
> I thought the default was fdatasync, but looking at the code it seems
> the default is open_datasync if O_DSYNC is available.
>
> I assume the logic is that we usually do only one write() before
> fsync(), so open_datasync should be faster.  Why do we not use O_FSYNC
> over fsync().
>
> Looking at the code:
>
>     #if defined(O_SYNC)
>     #define OPEN_SYNC_FLAG     O_SYNC
>     #else
>     #if defined(O_FSYNC)
>     #define OPEN_SYNC_FLAG    O_FSYNC
>     #endif
>     #endif
>
>     #if defined(OPEN_SYNC_FLAG)
>     #if defined(O_DSYNC) && (O_DSYNC != OPEN_SYNC_FLAG)
>     #define OPEN_DATASYNC_FLAG    O_DSYNC
>     #endif
>     #endif
>
>     #if defined(OPEN_DATASYNC_FLAG)
>     #define DEFAULT_SYNC_METHOD_STR    "open_datasync"
>     #define DEFAULT_SYNC_METHOD        SYNC_METHOD_OPEN
>     #define DEFAULT_SYNC_FLAGBIT       OPEN_DATASYNC_FLAG
>     #else
>     #if defined(HAVE_FDATASYNC)
>     #define DEFAULT_SYNC_METHOD_STR   "fdatasync"
>     #define DEFAULT_SYNC_METHOD       SYNC_METHOD_FDATASYNC
>     #define DEFAULT_SYNC_FLAGBIT      0
>     #else
>     #define DEFAULT_SYNC_METHOD_STR   "fsync"
>     #define DEFAULT_SYNC_METHOD       SYNC_METHOD_FSYNC
>     #define DEFAULT_SYNC_FLAGBIT      0
>     #endif
>     #endif
>
> I think the problem is that we prefer O_DSYNC over fdatasync, but do not
> prefer O_FSYNC over fsync.
>
> Running the attached test program shows on BSD/OS 4.3:
>
>     write                  0.000360
>     write & fsync          0.001391
>     write, close & fsync   0.001308
>     open o_fsync, write    0.000924
>
> showing O_FSYNC faster than fsync().
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

> /*
>  *  test_fsync.c
>  *      tests if fsync can be done from another process than the original write
>  */
>
> #include <sys/types.h>
> #include <sys/time.h>       /* gettimeofday(), struct timeval */
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <time.h>
> #include <unistd.h>
>
> void die(char *str);
> void print_elapse(struct timeval start_t, struct timeval elapse_t);
>
> int main(int argc, char *argv[])
> {
>     struct timeval start_t;
>     struct timeval elapse_t;
>     int tmpfile;
>     char *strout = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
>
>     /* write only */
>     gettimeofday(&start_t, NULL);
>     /* O_CREAT requires a mode argument */
>     if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
>         die("can't open /var/tmp/test_fsync.out");
>     /* pass the string itself, not the address of the pointer variable */
>     write(tmpfile, strout, 200);
>     close(tmpfile);
>     gettimeofday(&elapse_t, NULL);
>     unlink("/var/tmp/test_fsync.out");
>     printf("write                  ");
>     print_elapse(start_t, elapse_t);
>     printf("\n");
>
>     /* write & fsync */
>     gettimeofday(&start_t, NULL);
>     if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
>         die("can't open /var/tmp/test_fsync.out");
>     write(tmpfile, strout, 200);
>     fsync(tmpfile);
>     close(tmpfile);
>     gettimeofday(&elapse_t, NULL);
>     unlink("/var/tmp/test_fsync.out");
>     printf("write & fsync          ");
>     print_elapse(start_t, elapse_t);
>     printf("\n");
>
>     /* write, close & fsync */
>     gettimeofday(&start_t, NULL);
>     if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
>         die("can't open /var/tmp/test_fsync.out");
>     write(tmpfile, strout, 200);
>     close(tmpfile);
>     /* reopen file */
>     if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
>         die("can't open /var/tmp/test_fsync.out");
>     fsync(tmpfile);
>     close(tmpfile);
>     gettimeofday(&elapse_t, NULL);
>     unlink("/var/tmp/test_fsync.out");
>     printf("write, close & fsync   ");
>     print_elapse(start_t, elapse_t);
>     printf("\n");
>
>     /* open_fsync, write */
>     gettimeofday(&start_t, NULL);
>     if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT | O_FSYNC, 0600)) == -1)
>         die("can't open /var/tmp/test_fsync.out");
>     write(tmpfile, strout, 200);
>     close(tmpfile);
>     gettimeofday(&elapse_t, NULL);
>     unlink("/var/tmp/test_fsync.out");
>     printf("open o_fsync, write    ");
>     print_elapse(start_t, elapse_t);
>     printf("\n");
>
>     return 0;
> }
>
> void print_elapse(struct timeval start_t, struct timeval elapse_t)
> {
>     /* borrow one second when the microsecond field would go negative */
>     if (elapse_t.tv_usec < start_t.tv_usec)
>     {
>         elapse_t.tv_sec--;
>         elapse_t.tv_usec += 1000000;
>     }
>
>     printf("%ld.%06ld", (long) (elapse_t.tv_sec - start_t.tv_sec),
>            (long) (elapse_t.tv_usec - start_t.tv_usec));
> }
>
> void die(char *str)
> {
>     fprintf(stderr, "%s", str);
>     exit(1);
> }
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073