Weird XFS WAL problem - Mailing list pgsql-performance
| From | Craig James |
|---|---|
| Subject | Weird XFS WAL problem |
| Date | |
| Msg-id | 4C06E994.2080905@emolecules.com |
| In response to | Re: Random Page Cost and Planner (Cédric Villemain <cedric.villemain.debian@gmail.com>) |
| Responses | Re: Weird XFS WAL problem; Re: Weird XFS WAL problem; Re: Weird XFS WAL problem |
| List | pgsql-performance |
I'm testing/tuning a new midsize server and ran into an inexplicable problem. With a RAID10 drive, when I move the
WAL to a separate RAID1 drive, TPS drops from over 1200 to less than 90! I've checked everything and can't find a
reason.
Here are the details.
8 cores (2x4 Intel Nehalem 2 GHz)
12 GB memory
12 x 7200 SATA 500 GB disks
3WARE 9650SE-12ML RAID controller with bbu
2 disks: RAID1 500GB ext4 blocksize=4096
8 disks: RAID10 2TB, stripe size 64K, blocksize=4096 (ext4 or xfs - see below)
2 disks: hot swap
Ubuntu 10.04 LTS (Lucid)
With xfs or ext4 on the RAID10 I got decent bonnie++ and pgbench results (this one is for xfs):
Version 1.03e ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
argon 24064M 70491 99 288158 25 129918 16 65296 97 428210 23 558.9 1
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 23283 81 +++++ +++ 13775 56 20143 74 +++++ +++ 15152 54
argon,24064M,70491,99,288158,25,129918,16,65296,97,428210,23,558.9,1,16,23283,81,+++++,+++,13775,56,20143\
,74,+++++,+++,15152,54
pgbench -i -s 100 -U test
pgbench -c 10 -t 10000 -U test
scaling factor: 100
query mode: simple
number of clients: 10
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1046.104635 (including connections establishing)
tps = 1046.337276 (excluding connections establishing)
Now the mystery: I moved the pg_xlog directory to a RAID1 array (same 3WARE controller, two more SATA 7200 disks). I ran
the same tests and ...
tps = 82.325446 (including connections establishing)
tps = 82.326874 (excluding connections establishing)
I thought I'd made a mistake, like maybe I moved the whole database to the RAID1 array, but I checked and double
checked. I even watched the lights blink - the WAL was definitely on the RAID1 and the rest of Postgres on the RAID10.
So I moved the WAL back to the RAID10 array, and performance jumped right back up to the >1200 TPS range.
Next I checked the RAID1 itself:
dd if=/dev/zero of=./bigfile bs=8192 count=2000000
which yielded 98.8 MB/sec - not bad. bonnie++ on the RAID1 pair showed good performance too:
Version 1.03e ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
argon 24064M 68601 99 110057 18 46534 6 59883 90 123053 7 471.3 1
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
argon,24064M,68601,99,110057,18,46534,6,59883,90,123053,7,471.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,\
+++,+++++,+++,+++++,+++
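One thing the dd and bonnie++ runs above do not measure is synchronous write latency: both report buffered throughput, while each committing transaction has to wait for its WAL record to be flushed to disk. A minimal sketch of a flush-rate check follows, assuming Python 3 is available on the server. The function name `fsync_rate` and its parameters are made up for illustration, and the number it prints is only a rough upper bound on single-client commits per second for that filesystem, not a diagnosis of this problem.

```python
import os
import tempfile
import time


def fsync_rate(directory, writes=200, block_size=8192):
    """Return the number of fsync'd block writes per second in `directory`.

    Hypothetical helper: it appends `writes` blocks to a scratch file and
    calls fsync after each one, roughly imitating one WAL flush per commit.
    """
    block = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)  # scratch file on the target filesystem
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, block)
            os.fsync(fd)  # block until the data is reported as on stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)  # always remove the scratch file
    return writes / elapsed


# Example (paths are placeholders for the two mount points):
#   print(fsync_rate("/mnt/raid1"))
#   print(fsync_rate("/mnt/raid10"))
```

Running it once against each array gives two figures that can be compared directly; a large gap between them would suggest that flushes, rather than raw bandwidth, behave differently on the two volumes.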
So ... anyone have any idea at all how TPS drops to below 90 when I move the WAL to a separate RAID1 disk? Does this
make any sense at all? It's repeatable. It happens for both ext4 and xfs. It's weird.
You can even watch the disk lights and see it: the RAID10 disks are on almost constantly when the WAL is on the RAID10,
but when you move the WAL over to the RAID1, its lights are dim and flicker a lot, like it's barely getting any data,
and the RAID10 disks' lights barely go on at all.
Thanks,
Craig