Reliability is one of the most critical factors for customers when choosing a SAN. The reasoning is simple: a lack of system reliability translates directly into data loss and downtime. The most vital component in a SAN is the disk drive, where the data is physically stored. Since data loss is unacceptable, protection against disk drive failure is a must. LeftHand's storage systems deliver advanced data protection levels that safeguard the SAN from multiple disk failures in the same array, complete array failures, and site failures without losing data or availability.
If we take a closer look at how the reliability model for disk drives has changed over the years, I believe you will start to understand why legacy data protection technology is no longer sufficient.
Most of you are probably familiar with the term MTTF (Mean Time To Failure). Storage system availability and reliability are calculated based on the MTTF of all the components that make up the storage system along with their level of redundancy. In spite of its proven inaccuracy as a predictor of reliability, MTTF is the basis for most vendors’ reliability calculations.
Bianca Schroeder and Garth Gibson at Carnegie Mellon University presented an analysis of field-gathered disk replacement data from a number of production systems totaling about 100,000 disks. Their analysis revealed that replacement rates in every year except year 1 were higher than the failure rate implied by the disk drive datasheet MTTF. In years 4 and 5 (still within the nominal lifetime of these disks), the actual replacement rates were 7 to 10 times higher than the failure rates expected from the datasheet MTTF.
This simply means disk drives fail more often than the disk drive and storage system manufacturers say they do. So now let's talk about Bit Error Rate and how it influences reliability. Bit Error Rate, or BER, is the most critical factor of all. BER means that, while reading data from a disk drive, you will encounter on average one non-recoverable error for every so many bits read, as specified by the manufacturer. This ratio is independent of the MTTF.
Rebuilding the data onto a replacement drive with most RAID algorithms requires that the data on all the other drives be pristine and error-free. If there is a single error in a single sector, then the data for the corresponding sector on the replacement drive cannot be reconstructed; the RAID rebuild fails and data is lost. The frequency of this disastrous occurrence is derived from the BER. Simple calculations show that the chance of data loss due to BER is much greater than that from all other causes combined. The probability that data will be lost in the course of a rebuild operation can be estimated from the total capacity of the array combined with the probability of a bit error occurring on an individual drive. The bit error rates provided by drive manufacturers typically range from 1:10^14 for SATA drives to 1:10^16 for Enterprise Class SAS drives. The diagram below shows that an order-of-magnitude difference in drive BER has serious consequences for large-capacity arrays.
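To make that estimate concrete, here is a minimal sketch of the standard calculation in Python; the function name and the assumption of independent per-bit errors at the datasheet rate are mine, not a published LeftHand model:

```python
import math

def rebuild_ure_probability(bytes_read, bits_per_error):
    """Illustrative helper: chance of hitting at least one non-recoverable
    read error while reading `bytes_read` bytes during a RAID rebuild, for
    drives rated at one error per `bits_per_error` bits
    (e.g. 1e14 for typical SATA, 1e16 for Enterprise SAS)."""
    bits_read = bytes_read * 8
    # 1 - (1 - p)^n with p = 1/bits_per_error; log1p/expm1 keep the result
    # accurate even though p is astronomically small.
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))
```

For the array sizes discussed here the exponent is small, so the result is close to the simpler "bits read divided by BER" figure; that linear shortcut is what the worked examples below appear to use.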
The data storage industry has followed a trend similar to Moore's law for semiconductor memory: capacity doubles every 12 to 18 months. This has held true in hard disk storage for the past three decades or more. Disk drive capacities have gone from 5 MB to 1 TB in just 25 years, a 200,000-fold increase, and are up 1,000-fold from 1 GB to 1 TB in just the last 10 years. Alarmingly, as drive capacities have increased, BERs have remained relatively constant. So a drive with 1,000 times the capacity will experience a non-recoverable read error 1,000 times more frequently when the entire disk is read, which is exactly what happens when a hot spare is rebuilt or a failed drive is replaced in a RAID set.
This data is very alarming when considering the proliferation of SATA drives in the enterprise. Customers who are aware of this exposure may choose to implement RAID 1 or RAID 10 protection schemes to address the shortcomings of SATA drives. What we do know about older SATA drives is that they have low MTTFs and a Bit Error Rate of 1:10^14. That's approximately 1 bit error per 11.6 TB read. Compare this to SAS drives with a BER as low as 1:10^16, which is 1 bit error per 1,164 TB read.
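Those per-error volumes are easy to sanity-check. The snippet below assumes a terabyte of 1000 × 2^30 bytes, a convention chosen only because it reproduces the 11.6 TB and 1,164 TB figures quoted above:

```python
TB = 1000 * 2**30  # assumed terabyte convention; matches the figures above

for label, bits_per_error in (("SATA (1:10^14)", 1e14), ("SAS (1:10^16)", 1e16)):
    bytes_per_error = bits_per_error / 8
    print(f"{label}: ~{bytes_per_error / TB:,.1f} TB read per bit error")
# SATA (1:10^14): ~11.6 TB read per bit error
# SAS (1:10^16): ~1,164.2 TB read per bit error
```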
Looking at a real-life example, a 14-drive array with 750 GB SATA drives configured as two 7-drive RAID 5 sets has a 38.7% probability of experiencing a non-recoverable read error during a RAID rebuild. This means there is a 38.7% probability of losing all your data when a drive fails, even though the array is protected by RAID 5! With 1 TB drives the probability is 51.5%. Even in RAID 10, where drives are typically configured in mirrored pairs, the probability of a read error during a rebuild is still 8.6%. That's pretty frightening.
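These percentages line up with the per-bit model sketched earlier if we assume the rebuild reads every surviving drive in the set and that 1 GB means 2^30 bytes; both conventions are assumptions on my part, but they reproduce the figures in the text:

```python
GB = 2**30       # assumed capacity convention
SATA_BER = 1e14  # one non-recoverable error per 10^14 bits read

def rebuild_risk(drives_read, drive_gb, bits_per_error=SATA_BER):
    # Illustrative back-of-the-envelope estimate: expected number of bit
    # errors while reading every surviving drive in the set (close to the
    # true probability when the result is well below 1).
    return drives_read * drive_gb * GB * 8 / bits_per_error

print(f"7-drive RAID 5, 750 GB drives: {rebuild_risk(6, 750):.1%}")   # ~38.7%
print(f"7-drive RAID 5, 1 TB drives:   {rebuild_risk(6, 1000):.1%}")  # ~51.5%
print(f"RAID 10 mirror, 1 TB drive:    {rebuild_risk(1, 1000):.1%}")  # ~8.6%
```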
Another alarming fact is that some storage vendors allow RAID groups to span multiple disk shelves, where tens of drives can reside in a single RAID 5 set. Take a 12-drive RAID 5 set of 1 TB drives; it has nearly a 100% probability of data loss during a RAID reconstruct!
Customers should be asking hard questions of any storage vendor selling SATA drives without some form of double-error-protecting RAID. Even with 400 GB Enterprise Class Fibre Channel drives with a BER of 1:10^15, you still lose data in about 2% of 7-drive RAID set reconstructs. Even a 1% probability of data loss is unacceptable to most customers, especially when considering that dreaded backup window.
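The same back-of-the-envelope arithmetic covers both of these cases, again assuming the rebuild reads every surviving drive and that 1 GB means 2^30 bytes:

```python
GB = 2**30  # assumed capacity convention

def rebuild_risk(drives_read, drive_gb, bits_per_error):
    # Illustrative estimate: expected non-recoverable read errors during the rebuild.
    return drives_read * drive_gb * GB * 8 / bits_per_error

print(f"12-drive RAID 5, 1 TB SATA (BER 1:10^14): {rebuild_risk(11, 1000, 1e14):.0%}")  # ~94%, effectively near-certain
print(f"7-drive RAID 5, 400 GB FC (BER 1:10^15):  {rebuild_risk(6, 400, 1e15):.1%}")    # ~2.1%
```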
How LeftHand Networks is Changing the Data Protection Paradigm
LeftHand Networks offers multiple options for data protection:
- RAID 5 and 10: LeftHand supports these widely used RAID levels. When combined with Network RAID, a customer can protect the SAN from multiple disk and array failures without sacrificing performance.
- RAID 6: LeftHand supports hardware RAID 6 in the array, which protects against double disk faults and BER events during rebuild.
- Background Scrubbing: LeftHand uses non-intrusive background disk scrubbing that corrects defective drive disk sectors and BER events before they cause a drive failure.
- Network RAID: Stripes and replicates data across the SAN with local and remote options. Network RAID protects data from double disk faults in a RAID set, complete array/node failures, and even site failures in a Campus or Multi-site SAN configuration. LeftHand is the only storage vendor offering this advanced level of data protection.
Since SATA drives have historically had much higher BERs than Fibre Channel and SAS drives, many vendors recommend RAID 6 for SATA drives if they're used for important data. LeftHand not only provides RAID 6, but also offers even more advanced data protection with Network RAID. These features can protect your data from a multitude of failure scenarios.