Thread: RAID and SSD configuration question
Hi,

I have a Supermicro SYS-1028R-MCTR with an integrated LSI 3108 and a SuperCap module (BTR-TFM8G-LSICVM02), configured as:

- 2x300GB 10k spinning drives, RAID 1 (OS)
- 2x300GB 15k spinning drives, RAID 1 (xlog)
- 2x200GB Intel DC S3710 SSDs, RAID 1 (DB)

So which is better for the SSDs: mdraid or the controller's RAID? I have read a couple of times that mdraid is better. In that case the SSDs should be configured as write-through in the RAID controller's BIOS, with the disk cache enabled, right?

Also, what's the difference between Write Back and Always Write Back with the supercap module?

Thanks

--
Levi
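P.S. For reference, this is roughly what I had in mind for the mdraid variant - just a sketch, and I'm assuming the two SSDs show up as /dev/sdc and /dev/sdd (they will likely be named differently on the real box):

    # Check whether the drives' own volatile write cache is enabled
    hdparm -W /dev/sdc /dev/sdd

    # Create the RAID 1 mirror for the DB volume
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc /dev/sdd

    # Watch the initial resync and confirm the array state
    cat /proc/mdstat
    mdadm --detail /dev/md0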
On Tue, Oct 20, 2015 at 3:14 AM, Birta Levente <blevi.linux@gmail.com> wrote:
> Hi,
>
> I have a Supermicro SYS-1028R-MCTR with an integrated LSI 3108 and a
> SuperCap module (BTR-TFM8G-LSICVM02), configured as:
> - 2x300GB 10k spinning drives, RAID 1 (OS)
> - 2x300GB 15k spinning drives, RAID 1 (xlog)
> - 2x200GB Intel DC S3710 SSDs, RAID 1 (DB)
>
> So which is better for the SSDs: mdraid or the controller's RAID?

I personally always prefer mdraid if given a choice, especially when you have a dedicated boot drive. It's better in DR scenarios and for hardware migrations. Personally I find dedicated RAID controllers to be baroque. Flash SSDs (at least the good ones) are basically big RAID 0s with their own dedicated cache, supercap, and controller optimized to the underlying storage peculiarities.

> What's the difference between Write Back and Always Write Back with the
> supercap module?

No clue. With spinning drives simple performance tests would make the caching behavior obvious but with SSD that's not always the case. I'm guessing(!) 'Always Write Back' allows the controller to buffer writes beyond what the devices do.

merlin
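Just to make the "good SSDs carry their own cache and supercap" point concrete, you can usually see it from the drive itself. A rough sketch only - the device name is made up and the exact SMART attribute names vary by vendor (Intel reports power-loss-protection status under its own attribute names), so treat it as a starting point:

    # Drive identity / firmware
    smartctl -i /dev/sdc

    # Scan the vendor attributes for power-loss-protection and cache-related entries
    smartctl -a /dev/sdc | grep -iE 'power|cache'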
On Tue, Oct 20, 2015 at 7:30 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Oct 20, 2015 at 3:14 AM, Birta Levente <blevi.linux@gmail.com> wrote:
>> So which is better for the SSDs: mdraid or the controller's RAID?
>
> I personally always prefer mdraid if given a choice, especially when
> you have a dedicated boot drive. It's better in DR scenarios and for
> hardware migrations.

We're running LSI MegaRAIDs at work with 10-SSD RAID-5 arrays, and we can get ~5k to 7k tps on a -s 10000 pgbench with the write cache on. When we turn the write cache off, we get 15k to 20k tps. This is on a 120GB pgbench db that fits in memory, so it's all writes.

Final answer: test it for yourself; you won't know which is faster until you do.
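For anyone who wants to reproduce that comparison, the procedure is roughly the following - a sketch only, assuming a scratch database called "bench" and client/thread counts that fit your hardware:

    createdb bench
    pgbench -i -s 10000 bench    # build the scale-10000 dataset mentioned above

    # Run the identical workload once with the controller cache in write-back
    # and once in write-through, then compare the tps that pgbench reports.
    pgbench -c 32 -j 8 -T 600 bench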
Hi,

On 10/20/2015 03:30 PM, Merlin Moncure wrote:
> On Tue, Oct 20, 2015 at 3:14 AM, Birta Levente <blevi.linux@gmail.com> wrote:
>> So which is better for the SSDs: mdraid or the controller's RAID?
>
> I personally always prefer mdraid if given a choice, especially when
> you have a dedicated boot drive. It's better in DR scenarios and for
> hardware migrations. Personally I find dedicated RAID controllers to
> be baroque. Flash SSDs (at least the good ones) are basically big
> RAID 0s with their own dedicated cache, supercap, and controller
> optimized to the underlying storage peculiarities.

I don't know - I've always treated mdraid with a bit of suspicion, as it does not have any "global" write cache, which might allow failure modes akin to the RAID-5 write hole (similar issues exist for non-parity RAID levels like RAID-1 or RAID-10). I don't think the write cache on the devices prevents this, as it does not protect against an interruption between the writes to the two drives.

>> What's the difference between Write Back and Always Write Back with the
>> supercap module?
>
> No clue. With spinning drives simple performance tests would make the
> caching behavior obvious but with SSD that's not always the case. I'm
> guessing(!) 'Always Write Back' allows the controller to buffer writes
> beyond what the devices do.

AFAIK there's no practical difference here. "Write Back" is the option that disables the write cache in case the battery/supercap on the BBU dies for some reason (so that the cache does not silently become volatile), while "Always Write Back" keeps caching regardless. With capacitors this is not really applicable.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
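For reference, this is how the three policies typically show up in LSI's own tooling. A sketch only - I'm assuming storcli is installed, the 3108 is controller 0 and the SSD mirror is virtual drive 2; adjust the numbers for the real box:

    # Show the current cache policy of all virtual drives
    storcli64 /c0/vall show all | grep -i cache

    # Write Back: cached writes, but the controller falls back to write-through
    # if the supercap/BBU is missing, discharged or failed.
    storcli64 /c0/v2 set wrcache=wb

    # Always Write Back: keep caching writes even without a healthy supercap/BBU,
    # i.e. without the safety fallback described above.
    storcli64 /c0/v2 set wrcache=awb

    # Write Through: hand caching over to the drives (e.g. the S3710s) entirely.
    storcli64 /c0/v2 set wrcache=wt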
Hi,

On 10/20/2015 04:33 PM, Scott Marlowe wrote:
> We're running LSI MegaRAIDs at work with 10-SSD RAID-5 arrays, and we
> can get ~5k to 7k tps on a -s 10000 pgbench with the write cache on.
>
> When we turn the write cache off, we get 15k to 20k tps. This is on
> a 120GB pgbench db that fits in memory, so it's all writes.

I'm not really surprised that the performance increased so much, as the SSDs have large amounts of DRAM on them - with 10 devices it may easily be 10GB (compared to the 1 or 2GB that is common on RAID controllers). So the write cache on the controller may be a bottleneck.

But the question is how disabling the write cache (on the controller) affects the reliability of the whole RAID array. The write cache is there not only because it improves performance, but also because it protects against some failure modes - you mentioned RAID-5, which is vulnerable to the "write hole" problem.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 20, 2015 at 10:14 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 10/20/2015 03:30 PM, Merlin Moncure wrote:
>> I personally always prefer mdraid if given a choice, especially when
>> you have a dedicated boot drive. It's better in DR scenarios and for
>> hardware migrations.
>
> I don't know - I've always treated mdraid with a bit of suspicion, as it
> does not have any "global" write cache, which might allow failure modes
> akin to the RAID-5 write hole (similar issues exist for non-parity RAID
> levels like RAID-1 or RAID-10).

mdadm is pretty smart. It knows when it was shut down uncleanly and recalculates parity as needed. There are some theoretical edge-case failure scenarios, but they are well understood. That is really md's main advantage: its transparency and the huge body of lore around it.

I have a tiny data-recovery side business (cost: $0, invitation only) doing DR on NAS systems that, in some cases, commercial DR companies said were irrecoverable. By simply googling and following guides I was able to come up with the data, or at least most of it, every time. Good luck with that on proprietary RAID systems.

In fact, there is no reason to believe that proprietary systems cover the write hole even if they have a centralized cache. They may claim to, and in fact do so 99 times out of 100, but how do you know it's really covered? Basically, you don't. I kind of trust Intel (now - it's been a journey), but I don't have a lot of confidence in certain enterprise gear vendors.

On Tue, Oct 20, 2015 at 9:33 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> We're running LSI MegaRAIDs at work with 10-SSD RAID-5 arrays, and we
> can get ~5k to 7k tps on a -s 10000 pgbench with the write cache on.
>
> When we turn the write cache off, we get 15k to 20k tps. This is on a
> 120GB pgbench db that fits in memory, so it's all writes.

This matches my findings exactly. I'll double down on my statement: caching RAID controllers are essentially obsolete technology. They are designed to solve a problem that simply doesn't exist any more because of SSDs. Unless your database is very, very busy, it's pretty hard to saturate a single low- to mid-tier SSD with zero engineering effort. It's time to let go: spinning drives are obsolete in the database world, at least in any scenario where you're measuring IOPS.

merlin
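To give a feel for what that kind of recovery looks like - a rough sketch with made-up device names, assuming the member disks from the dead box are attached to any Linux machine:

    # The RAID metadata lives on the member disks themselves,
    # so it can be inspected on any machine
    mdadm --examine /dev/sdb1 /dev/sdc1

    # Let mdadm find and assemble whatever arrays it recognizes
    mdadm --assemble --scan

    # Check what came up
    cat /proc/mdstat
    mdadm --detail /dev/md0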
> On Tue, Oct 20, 2015 at 9:33 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> We're running LSI MegaRAIDs at work with 10-SSD RAID-5 arrays, and we
>> can get ~5k to 7k tps on a -s 10000 pgbench with the write cache on.
>>
>> When we turn the write cache off, we get 15k to 20k tps. This is on a
>> 120GB pgbench db that fits in memory, so it's all writes.
>
> This matches my findings exactly. I'll double down on my statement:
> caching RAID controllers are essentially obsolete technology. They
> are designed to solve a problem that simply doesn't exist any more
> because of SSDs. Unless your database is very, very busy, it's pretty
> hard to saturate a single low- to mid-tier SSD with zero engineering
> effort. It's time to let go: spinning drives are obsolete in the
> database world, at least in any scenario where you're measuring IOPS.

Here's what's REALLY messed up: the older the firmware on the MegaRAID, the faster it ran with caching on. We had 3- to 4-year-old firmware and were getting 7 to 8k tps. As we upgraded the firmware it dropped all the way down to 3k tps, then the very latest got it back up to 4k or so. No matter which version of the firmware we ran, turning off caching got us to 15 to 18k easily. So it appears that more aggressive and complex caching algorithms just made things worse and worse.
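If you want to see where your own card sits, this is roughly what we poke at - a sketch, assuming the MegaCli64 binary is installed (package and binary names vary by distro) and the card is adapter 0:

    # Report the adapter's firmware package / version
    MegaCli64 -AdpAllInfo -a0 | grep -iE 'firmware|fw'

    # Show the current write policy of the logical drives, then switch to
    # write-through for the comparison run
    MegaCli64 -LDGetProp -Cache -LAll -a0
    MegaCli64 -LDSetProp WT -LAll -a0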
On Tue, Oct 20, 2015 at 12:28 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Here's what's REALLY messed up: the older the firmware on the MegaRAID,
> the faster it ran with caching on. We had 3- to 4-year-old firmware and
> were getting 7 to 8k tps. As we upgraded the firmware it dropped all the
> way down to 3k tps, then the very latest got it back up to 4k or so. No
> matter which version of the firmware we ran, turning off caching got us
> to 15 to 18k easily. So it appears that more aggressive and complex
> caching algorithms just made things worse and worse.

Another plausible explanation is that they fixed edge-case concurrency issues in the firmware that came at the cost of performance, invalidating the engineering trade-offs made against the cheapo CPU they stuck on the controller next to the old, slow 1GB of DRAM. Of course, we'll never know, because the source code is proprietary and closed.

I'll stick to mdadm, thanks.

merlin