Thread: RAID and SSD configuration question
Hi,

I have a Supermicro SYS-1028R-MCTR with an integrated LSI 3108 and a SuperCap module (BTR-TFM8G-LSICVM02), configured as:

- 2x300GB 10k spinning drives, RAID 1 (OS)
- 2x300GB 15k spinning drives, RAID 1 (xlog)
- 2x200GB Intel DC S3710 SSDs, RAID 1 (DB)

So which is better for the SSDs: mdraid or the controller's RAID? I have read a couple of times that mdraid is better. In that case the SSDs should be configured as write-through in the RAID controller's BIOS, with the disk cache enabled, right?

Also, what's the difference between Write Back and Always Write Back with the supercap module?

Thanks

--
Levi
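P.S. For reference, this is roughly what I had in mind for the mdraid variant - just a sketch, and I'm assuming the two SSDs show up as /dev/sdc and /dev/sdd (they will likely be named differently on the real box):

    # Check whether the drives' own volatile write cache is enabled
    hdparm -W /dev/sdc /dev/sdd

    # Create the RAID 1 mirror for the DB volume
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc /dev/sdd

    # Watch the initial resync and confirm the array state
    cat /proc/mdstat
    mdadm --detail /dev/md0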
On Tue, Oct 20, 2015 at 3:14 AM, Birta Levente <blevi.linux@gmail.com> wrote:
> Hi,
>
> I have a Supermicro SYS-1028R-MCTR with an integrated LSI 3108 and a
> SuperCap module (BTR-TFM8G-LSICVM02), configured as:
> - 2x300GB 10k spinning drives, RAID 1 (OS)
> - 2x300GB 15k spinning drives, RAID 1 (xlog)
> - 2x200GB Intel DC S3710 SSDs, RAID 1 (DB)
>
> So which is better for the SSDs: mdraid or the controller's RAID?

I personally always prefer mdraid if given a choice, especially when you have a dedicated boot drive. It's better in DR scenarios and for hardware migrations. Personally I find dedicated RAID controllers to be baroque. Flash SSDs (at least the good ones) are basically big RAID 0s with their own dedicated cache, supercap, and controller optimized to the underlying storage peculiarities.

> What's the difference between Write Back and Always Write Back with the
> supercap module?

No clue. With spinning drives simple performance tests would make the caching behavior obvious but with SSD that's not always the case. I'm guessing(!) 'Always Write Back' allows the controller to buffer writes beyond what the devices do.

merlin
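Just to make the "good SSDs carry their own cache and supercap" point concrete, you can usually see it from the drive itself. A rough sketch only - the device name is made up and the exact SMART attribute names vary by vendor (Intel reports power-loss-protection status under its own attribute names), so treat it as a starting point:

    # Drive identity / firmware
    smartctl -i /dev/sdc

    # Scan the vendor attributes for power-loss-protection and cache-related entries
    smartctl -a /dev/sdc | grep -iE 'power|cache'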
On Tue, Oct 20, 2015 at 7:30 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Oct 20, 2015 at 3:14 AM, Birta Levente <blevi.linux@gmail.com> wrote:
>> So which is better for the SSDs: mdraid or the controller's RAID?
>
> I personally always prefer mdraid if given a choice, especially when
> you have a dedicated boot drive. It's better in DR scenarios and for
> hardware migrations.

We're running LSI MegaRAIDs at work with 10-SSD RAID-5 arrays, and we can get ~5k to 7k tps on a -s 10000 pgbench with the write cache on. When we turn the write cache off, we get 15k to 20k tps. This is on a 120GB pgbench db that fits in memory, so it's all writes.

Final answer: test it for yourself; you won't know which is faster until you do.
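For anyone who wants to reproduce that comparison, the procedure is roughly the following - a sketch only, assuming a scratch database called "bench" and client/thread counts that fit your hardware:

    createdb bench
    pgbench -i -s 10000 bench    # build the scale-10000 dataset mentioned above

    # Run the identical workload once with the controller cache in write-back
    # and once in write-through, then compare the tps that pgbench reports.
    pgbench -c 32 -j 8 -T 600 bench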
Hi,

On 10/20/2015 03:30 PM, Merlin Moncure wrote:
> On Tue, Oct 20, 2015 at 3:14 AM, Birta Levente <blevi.linux@gmail.com> wrote:
>> So which is better for the SSDs: mdraid or the controller's RAID?
>
> I personally always prefer mdraid if given a choice, especially when
> you have a dedicated boot drive. It's better in DR scenarios and for
> hardware migrations. Personally I find dedicated RAID controllers to
> be baroque. Flash SSDs (at least the good ones) are basically big
> RAID 0s with their own dedicated cache, supercap, and controller
> optimized to the underlying storage peculiarities.

I don't know - I've always treated mdraid with a bit of suspicion, as it does not have any "global" write cache, which might allow failure modes akin to the RAID-5 write hole (similar issues exist for non-parity RAID levels like RAID-1 or RAID-10). I don't think the write cache on the devices prevents this, as it does not protect against an interruption between the writes to the two drives.

>> What's the difference between Write Back and Always Write Back with the
>> supercap module?
>
> No clue. With spinning drives simple performance tests would make the
> caching behavior obvious but with SSD that's not always the case. I'm
> guessing(!) 'Always Write Back' allows the controller to buffer writes
> beyond what the devices do.

AFAIK there's no practical difference here. "Write Back" is the option that disables the write cache in case the battery/supercap on the BBU dies for some reason (so that the cache does not silently become volatile), while "Always Write Back" keeps caching regardless. With capacitors this is not really applicable.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
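For reference, this is how the three policies typically show up in LSI's own tooling. A sketch only - I'm assuming storcli is installed, the 3108 is controller 0 and the SSD mirror is virtual drive 2; adjust the numbers for the real box:

    # Show the current cache policy of all virtual drives
    storcli64 /c0/vall show all | grep -i cache

    # Write Back: cached writes, but the controller falls back to write-through
    # if the supercap/BBU is missing, discharged or failed.
    storcli64 /c0/v2 set wrcache=wb

    # Always Write Back: keep caching writes even without a healthy supercap/BBU,
    # i.e. without the safety fallback described above.
    storcli64 /c0/v2 set wrcache=awb

    # Write Through: hand caching over to the drives (e.g. the S3710s) entirely.
    storcli64 /c0/v2 set wrcache=wt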
Hi,

On 10/20/2015 04:33 PM, Scott Marlowe wrote:
> We're running LSI MegaRAIDs at work with 10-SSD RAID-5 arrays, and we
> can get ~5k to 7k tps on a -s 10000 pgbench with the write cache on.
>
> When we turn the write cache off, we get 15k to 20k tps. This is on
> a 120GB pgbench db that fits in memory, so it's all writes.

I'm not really surprised that the performance increased so much, as the SSDs have large amounts of DRAM on them - with 10 devices it may easily be 10GB (compared to the 1 or 2GB that is common on RAID controllers). So the write cache on the controller may be a bottleneck.

But the question is how disabling the write cache (on the controller) affects the reliability of the whole RAID array. The write cache is there not only because it improves performance, but also because it protects against some failure modes - you mentioned RAID-5, which is vulnerable to the "write hole" problem.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 20, 2015 at 10:14 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 10/20/2015 03:30 PM, Merlin Moncure wrote:
>> I personally always prefer mdraid if given a choice, especially when
>> you have a dedicated boot drive. It's better in DR scenarios and for
>> hardware migrations.
>
> I don't know - I've always treated mdraid with a bit of suspicion, as it
> does not have any "global" write cache, which might allow failure modes
> akin to the RAID-5 write hole (similar issues exist for non-parity RAID
> levels like RAID-1 or RAID-10).

mdadm is pretty smart. It knows when it was shut down uncleanly and recalculates parity as needed. There are some theoretical edge-case failure scenarios, but they are well understood. That is really md's main advantage: its transparency and the huge body of lore around it.

I have a tiny data-recovery side business (cost: $0, invitation only) doing DR on NAS systems that, in some cases, commercial DR companies said were irrecoverable. By simply googling and following guides I was able to come up with the data, or at least most of it, every time. Good luck with that on proprietary RAID systems.

In fact, there is no reason to believe that proprietary systems cover the write hole even if they have a centralized cache. They may claim to, and in fact do so 99 times out of 100, but how do you know it's really covered? Basically, you don't. I kind of trust Intel (now - it's been a journey), but I don't have a lot of confidence in certain enterprise gear vendors.

On Tue, Oct 20, 2015 at 9:33 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> We're running LSI MegaRAIDs at work with 10-SSD RAID-5 arrays, and we
> can get ~5k to 7k tps on a -s 10000 pgbench with the write cache on.
>
> When we turn the write cache off, we get 15k to 20k tps. This is on a
> 120GB pgbench db that fits in memory, so it's all writes.

This matches my findings exactly. I'll double down on my statement: caching RAID controllers are essentially obsolete technology. They are designed to solve a problem that simply doesn't exist any more because of SSDs. Unless your database is very, very busy, it's pretty hard to saturate a single low- to mid-tier SSD with zero engineering effort. It's time to let go: spinning drives are obsolete in the database world, at least in any scenario where you're measuring IOPS.

merlin
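To give a feel for what that kind of recovery looks like - a rough sketch with made-up device names, assuming the member disks from the dead box are attached to any Linux machine:

    # The RAID metadata lives on the member disks themselves,
    # so it can be inspected on any machine
    mdadm --examine /dev/sdb1 /dev/sdc1

    # Let mdadm find and assemble whatever arrays it recognizes
    mdadm --assemble --scan

    # Check what came up
    cat /proc/mdstat
    mdadm --detail /dev/md0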
> On Tue, Oct 20, 2015 at 9:33 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> We're running LSI MegaRAIDs at work with 10-SSD RAID-5 arrays, and we
>> can get ~5k to 7k tps on a -s 10000 pgbench with the write cache on.
>>
>> When we turn the write cache off, we get 15k to 20k tps. This is on a
>> 120GB pgbench db that fits in memory, so it's all writes.
>
> This matches my findings exactly. I'll double down on my statement:
> caching RAID controllers are essentially obsolete technology. They
> are designed to solve a problem that simply doesn't exist any more
> because of SSDs. Unless your database is very, very busy, it's pretty
> hard to saturate a single low- to mid-tier SSD with zero engineering
> effort. It's time to let go: spinning drives are obsolete in the
> database world, at least in any scenario where you're measuring IOPS.

Here's what's REALLY messed up: the older the firmware on the MegaRAID, the faster it ran with caching on. We had 3- to 4-year-old firmware and were getting 7 to 8k tps. As we upgraded the firmware it dropped all the way down to 3k tps, then the very latest got it back up to 4k or so. No matter which version of the firmware we ran, turning off caching got us to 15 to 18k easily. So it appears that more aggressive and complex caching algorithms just made things worse and worse.
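If you want to see where your own card sits, this is roughly what we poke at - a sketch, assuming the MegaCli64 binary is installed (package and binary names vary by distro) and the card is adapter 0:

    # Report the adapter's firmware package / version
    MegaCli64 -AdpAllInfo -a0 | grep -iE 'firmware|fw'

    # Show the current write policy of the logical drives, then switch to
    # write-through for the comparison run
    MegaCli64 -LDGetProp -Cache -LAll -a0
    MegaCli64 -LDSetProp WT -LAll -a0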
On Tue, Oct 20, 2015 at 12:28 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Here's what's REALLY messed up: the older the firmware on the MegaRAID,
> the faster it ran with caching on. We had 3- to 4-year-old firmware and
> were getting 7 to 8k tps. As we upgraded the firmware it dropped all the
> way down to 3k tps, then the very latest got it back up to 4k or so. No
> matter which version of the firmware we ran, turning off caching got us
> to 15 to 18k easily. So it appears that more aggressive and complex
> caching algorithms just made things worse and worse.

Another plausible explanation is that they fixed edge-case concurrency issues in the firmware that came at the cost of performance, invalidating the engineering trade-offs made against the cheapo CPU they stuck on the controller next to the old, slow 1GB of DRAM. Of course, we'll never know, because the source code is proprietary and closed.

I'll stick to mdadm, thanks.

merlin