RAID Array Failures & Recovery

A hardware RAID implementation requires at minimum a RAID controller. On a desktop system this may be a PCI expansion card, a PCI Express expansion card, or a controller built into the motherboard. Controllers supporting most types of drives may be used: IDE/ATA, SATA, SCSI, SSA, Fibre Channel, and sometimes even a combination. The controller and disks may be in a stand-alone disk enclosure rather than inside a computer. The enclosure may be directly attached to a computer or connected via a SAN. The controller hardware handles drive management and performs any parity calculations required by the chosen RAID level.

Hardware RAID Failures:

  • Actuator Failure
  • Bad sectors
  • Controller Failure
  • Controller Malfunction
  • Corrupted RAID
  • Lightning, Flood and Fire Damage
  • Damaged Motor
  • Drive physical abuse
  • Hard disk component failure and crashes
  • Hard disk drive component failure
  • Hard drive crashes
  • Hard drive failure
  • Head Crash
  • Intermittent drive failure
  • Media Damage
  • Media surface contamination
  • Multiple drive failure
  • Power Spike
  • Power Supply Burn out or failure
  • RAID controller failure
  • RAID corruption
  • RAID disk failure
  • RAID disk overheat
  • RAID drive incompatibility
  • RAID drive overheat
  • RAID Array failed
  • Vibration damage

Hardware RAID Failures (Human Error):

  • Unintended deletion of files
  • Reformatting of drives / Array
  • Reformatting of partitions
  • Incorrect replacement of media components
  • Accidentally deleted records
  • Mistakenly overwritten database files
  • Employee sabotage
  • Lost/Forgotten password
  • Overwritten files
  • Overwritten RAID config files
  • Overwritten RAID settings
  • Incorrect RAID setup
  • RAID user error

Software RAID implementations are now provided by many operating systems. Software RAID can be implemented as:

  • a layer that abstracts multiple devices, thereby providing a single virtual device (e.g. Linux’s md).
  • a more generic logical volume manager (provided with most server-class operating systems, e.g. Veritas or LVM).
  • a component of the file system (e.g. ZFS or Btrfs).

Software RAID Failures:

  • Backup failures
  • Computer virus and worm damage
  • Corrupt files / data
  • Damaged files or folders
  • Directory corruption
  • Firmware corruption
  • Repartition
  • Server registry configuration
  • Missing partitions
  • RAID configuration
  • Reformatting

Software RAID Failures (Application Failure):

  • Applications that are unable to run or load files
  • Corrupted files
  • Corrupted database files
  • Data corrupted
  • Locked databases preventing access
  • Deleted tables

About RAID Data Recovery

The majority of small-to-medium enterprises across the globe have turned to RAID-configured systems for their storage solutions. The most frequently cited reasons for using RAID arrays in business today are the high level of fault tolerance the solution offers and the cost-effectiveness of acquisition and maintenance.

However, if a RAID array does fail due to component malfunctions (including hard drives and controller cards) or operating system and application corruption, it leaves the data unusable and, in most cases, corrupted.

RAID data recovery is an intricate task, since RAID configurations are often custom-built implementations developed by competing manufacturers. Without in-depth knowledge of how RAID arrays are configured at the hardware, firmware, and software levels, data recovery attempts will not only fail but may result in further data corruption.


RAID Failures & Recovery

Correlated failures
The theory behind the error correction in RAID assumes that failures of drives are independent. Given this assumption, it is possible to calculate how often they can fail and to arrange the array to make data loss arbitrarily improbable.

In practice, the drives are often the same age, with similar wear, and subject to the same environment. Since many drive failures are due to mechanical issues, which are more likely on older drives, this violates those assumptions, and failures are in fact statistically correlated. In practice, then, the chance of a second failure before the first has been recovered is not nearly as small as might be supposed, and data loss can, in practice, occur at significant rates.
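
To make the independence assumption concrete, here is a rough back-of-the-envelope sketch in Python. The drive count, annual failure rate, and rebuild window are illustrative assumptions rather than measured figures; the point is only that correlated wear (modelled crudely here as a higher effective failure rate during the rebuild window) multiplies the risk of a second failure.

    def p_second_failure_during_rebuild(n_drives, annual_failure_rate, rebuild_hours):
        """Probability that at least one surviving drive fails during the
        rebuild window, treating drive failures as independent."""
        hours_per_year = 24 * 365
        p_fail_in_window = annual_failure_rate * (rebuild_hours / hours_per_year)
        survivors = n_drives - 1
        return 1 - (1 - p_fail_in_window) ** survivors

    # Independence assumption: 8 drives, 2% annual failure rate, 24-hour rebuild.
    print(p_second_failure_during_rebuild(8, 0.02, 24))   # roughly 0.0004

    # Correlated wear (same batch, same age, rebuild stress): if the effective
    # failure rate of the survivors is ten times higher, the risk rises accordingly.
    print(p_second_failure_during_rebuild(8, 0.20, 24))   # roughly 0.004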

A common misconception is that “server-grade” drives fail less frequently than consumer-grade drives. Two independent studies, one by Carnegie Mellon University and the other by Google, have shown that the “grade” of the drive does not relate to failure rates.

Atomicity
This is a little understood and rarely mentioned failure mode for redundant storage systems that do not utilize transactional features. Database researcher Jim Gray wrote “Update in Place is a Poison Apple”[28] during the early days of relational database commercialization. However, this warning largely went unheeded and fell by the wayside upon the advent of RAID, which many software engineers mistook as solving all data storage integrity and reliability problems. Many software programs update a storage object “in-place”; that is, they write a new version of the object on to the same disk addresses as the old version of the object. While the software may also log some delta information elsewhere, it expects the storage to present “atomic write semantics,” meaning that the write of the data either occurred in its entirety or did not occur at all.

However, very few storage systems provide support for atomic writes, and even fewer specify their rate of failure in providing this semantic. Note that during the act of writing an object, a RAID storage device will usually be writing all redundant copies of the object in parallel, although overlapped or staggered writes are more common when a single RAID processor is responsible for multiple drives. Hence an error that occurs during the process of writing may leave the redundant copies in different states, and furthermore may leave the copies in neither the old nor the new state. The little known failure mode is that delta logging relies on the original data being either in the old or the new state so as to enable backing out the logical change, yet few storage systems provide an atomic write semantic on a RAID disk.
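
As an illustration of what “atomic write semantics” means at the application level, the following Python sketch updates an object by writing a complete new copy and renaming it over the old one, rather than updating in place. This is a generic copy-then-rename pattern, assuming a POSIX-style file system; it is not something a RAID layer provides on its own, and the file names are purely illustrative.

    import os
    import tempfile

    def atomic_replace(path, data: bytes):
        """Write 'data' to 'path' so that readers see either the old version or
        the new version, never a partially written mix (no update in place)."""
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(data)
                tmp.flush()
                os.fsync(tmp.fileno())    # force the new copy to stable storage
            os.replace(tmp_path, path)    # atomic rename within one file system
        except Exception:
            os.unlink(tmp_path)           # clean up the temporary copy on failure
            raise

    atomic_replace("record.db", b"new version of the object")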

While the battery-backed write cache may partially solve the problem, it is applicable only to a power failure scenario.

Since transactional support is not universally present in hardware RAID, many operating systems include transactional support to protect against data loss during an interrupted write. Novell NetWare, starting with version 3.x, included a transaction tracking system. Microsoft introduced transaction tracking via the journaling feature in NTFS. ext4 has journaling with checksums; ext3 has journaling without checksums but offers an “append-only” option, or ext3cow (copy-on-write). If the journal itself in a file system is corrupted, though, this can be problematic. The journaling in the NetApp WAFL file system provides atomicity by never updating the data in place, as does ZFS. An alternative to journaling is soft updates, which are used in some BSD-derived systems’ implementations of UFS.

Unrecoverable data
This can present as a sector read failure. Some RAID implementations protect against this failure mode by remapping the bad sector, using the redundant data to retrieve a good copy of the data, and rewriting that good data to the newly mapped replacement sector. The UBE (Unrecoverable Bit Error) rate is typically specified at 1 bit in 10^15 for enterprise-class disk drives (SCSI, FC, SAS), and 1 bit in 10^14 for desktop-class disk drives (IDE/ATA/PATA, SATA). Increasing disk capacities and large RAID 5 redundancy groups have led to an increasing inability to successfully rebuild a RAID group after a disk failure, because an unrecoverable sector is found on the remaining drives. Double protection schemes such as RAID 6 attempt to address this issue, but suffer from a very high write penalty.
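
The arithmetic behind that rebuild problem can be sketched as follows. The drive size and drive count are illustrative assumptions; the UBE rates are the ones quoted above.

    import math

    def p_unrecoverable_read_during_rebuild(surviving_drives, drive_bytes, ube_per_bit):
        """Chance of hitting at least one unrecoverable read error while
        reading every surviving drive in full during a rebuild."""
        bits_read = surviving_drives * drive_bytes * 8
        # 1 - (1 - p)^n, computed in a numerically safe way for tiny p and huge n.
        return -math.expm1(bits_read * math.log1p(-ube_per_bit))

    four_tb = 4e12  # illustrative 4 TB drives in a six-drive RAID 5 (five survivors)
    print(p_unrecoverable_read_during_rebuild(5, four_tb, 1e-14))  # desktop class: roughly 0.8
    print(p_unrecoverable_read_during_rebuild(5, four_tb, 1e-15))  # enterprise class: roughly 0.15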

Write cache reliability
The disk system can acknowledge the write operation as soon as the data is in the cache, not waiting for the data to be physically written. This typically occurs in old, non-journaled systems such as FAT32, or if the Linux/Unix “writeback” option is chosen without any protections like the “soft updates” option (to promote I/O speed whilst trading-away data reliability). A power outage or system hang such as a BSOD can mean a significant loss of any data queued in such a cache.

Often the write cache is protected by a battery, which mostly solves the problem: if a write fails because of a power failure, the controller can complete the pending writes as soon as it is restarted. This solution still has potential failure cases: the battery may have worn out, the power may be off for too long, the disks could be moved to another controller, or the controller itself could fail. Some disk systems provide the capability of testing the battery periodically; however, this leaves the system without a fully charged battery for several hours.

An additional concern about write cache reliability exists, specifically regarding devices equipped with a write-back cache—a caching system which reports the data as written as soon as it is written to cache, as opposed to the non-volatile medium. The safer cache technique is write-through, which reports transactions as written when they are written to the non-volatile medium.
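
The difference between the two policies can be shown with a toy Python model; the class names and the dictionary standing in for the disk are purely illustrative.

    # Toy model of the two cache policies described above.

    class WriteThroughCache:
        """Acknowledges a write only after it reaches the non-volatile medium."""
        def __init__(self, disk):
            self.disk, self.cache = disk, {}

        def write(self, block, data):
            self.cache[block] = data
            self.disk[block] = data      # persist first...
            return "acknowledged"        # ...then report success

    class WriteBackCache:
        """Acknowledges a write as soon as it lands in the (volatile) cache."""
        def __init__(self, disk):
            self.disk, self.cache = disk, {}

        def write(self, block, data):
            self.cache[block] = data
            return "acknowledged"        # data is not yet on disk

        def flush(self):
            self.disk.update(self.cache) # done later; lost if power fails first
            self.cache.clear()

    disk = {}
    cache = WriteBackCache(disk)
    cache.write("sector-7", b"payload")  # reported as written...
    # ...but 'disk' is still empty until flush(); a power cut here loses the data.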

Equipment compatibility
The methods used to store data by various RAID controllers are not necessarily compatible, so it may not be possible to read a RAID array on different hardware, with the exception of RAID 1, which is typically represented as plain identical copies of the original data on each disk. Consequently, a non-disk hardware failure may require the use of identical hardware to recover the data, and furthermore an identical configuration has to be reassembled without triggering a rebuild and overwriting the data. Software RAID, however, such as that implemented in the Linux kernel, alleviates this concern, as the setup is not hardware dependent but runs on ordinary disk controllers and allows the reassembly of an array. Additionally, individual RAID 1 disks (software, and most hardware implementations) can be read like normal disks when removed from the array, so no RAID system is required to retrieve the data. Inexperienced data recovery firms typically have a difficult time recovering data from RAID drives, with the exception of RAID 1 drives with a conventional data structure.

Data recovery in the event of a failed array
With larger disk capacities the odds of a disk failure during rebuild are not negligible. In that event the difficulty of extracting data from a failed array must be considered. Only RAID 1 stores all data on each disk. Although it may depend on the controller, some RAID 1 disks can be read as a single conventional disk. This means a dropped RAID 1 disk, although damaged, can often be reasonably easily recovered using a software recovery program. If the damage is more severe, data can often be recovered by professional data recovery specialists. RAID 5 and other striped or distributed arrays present much more formidable obstacles to data recovery in the event the array fails.


Tips For Replacing A Hard Drive From A Failed RAID

There are some items to consider when replacing a hard drive from a failed RAID. If you are building a new RAID, then all hard drives in the array should be the identical model if at all possible. However, if you must replace a failed hard drive, it can sometimes be difficult to find the same model if that model is out of production.

Below are some tips to follow when selecting a replacement:

Keep in mind that the controller may or may not allow different models in a RAID, so check the RAID controller documentation.

Product life: What is the expected life of the remaining drives? If the other drives are approaching the end of their useful life, then it may be time to replace the entire RAID.

Capacity: The replacement drive should be the same or higher capacity than the original drive. Do not just look at the capacity printed on the box, since a few megabytes can determine whether the drive will work or not.

(You should check the number of LBAs (or sectors) on the hard drive. Some RAID controllers will allow you to substitute larger drives if the exact capacity is not available, while other controllers require an exact match. Check with the controller manufacturer if the documentation doesn’t make it clear!)
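
As a rough aid, that comparison can be scripted; the sector size and LBA counts below are illustrative assumptions, and the real values should come from the drive label or the controller's own tools.

    # Compare sector (LBA) counts rather than the marketing capacity on the box.
    SECTOR_BYTES = 512  # assumed logical sector size

    def compare_drives(original_lbas, replacement_lbas):
        print(f"original:    {original_lbas * SECTOR_BYTES / 1e9:.2f} GB")
        print(f"replacement: {replacement_lbas * SECTOR_BYTES / 1e9:.2f} GB")
        diff = replacement_lbas - original_lbas
        if diff < 0:
            print("Replacement is smaller: most controllers will reject it.")
        elif diff == 0:
            print("Exact match: safe even for controllers that require identical capacity.")
        else:
            print("Replacement is larger: fine only if the controller allows it.")

    # Illustrative LBA counts for two nominal 500 GB drives.
    compare_drives(original_lbas=976_773_168, replacement_lbas=976_773_168)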

Performance: The replacement drive should match the performance of the remaining drives as closely as possible. If your failed drive was 15,000 RPM, avoid replacing it with a 10,000 RPM drive. RAID arrays depend on the timing between drives to write data. Thus, if one drive doesn’t keep up, it may cause the entire array to fail or at least experience irritating problems.

Interface: Make sure the replacement drive uses the same type of interface connection as the failed drive. If the failed drive used a SCSI SCA (80-Pin) interface then don’t try to replace it with a 68-pin SCSI interface. With Seagate products the last two digits of the model number indicate the interface. For example: LW = 68-Pin, LC = 80-Pin.

The 80-pin LC drives are hot-swappable with backplane connections.

Cache Buffer: It is recommended that the cache buffer for each drive be the same value.  Most RAID controllers will consider drives with mismatching cache buffers to be ineligible for addition to a striped or parity array.


RAID Data Recovery Tips

A large number of users have been led to believe that RAID should not fail, as a result of over-emphasis on RAID’s fault tolerance and auto-rebuild functions. As a result, up-to-date backups are seldom in place when the data disaster nightmare unfolds.

RAID may be implemented by a hardware- or software-based method, differentiated by the presence or absence of a dedicated RAID controller. Basically, a number of independent hard disks are connected to form a single, and often larger, virtual volume. Depending on the RAID configuration, there may be an increase in simultaneous reading and writing across drives, along with the fault tolerance feature.

Popular RAID manufacturers such as Mylex, Adaptec, Compaq, HP, and IBM promote the idea of extended data availability and protection when a failed hard disk is detected. In a typical RAID 5 configuration, without even powering off, the RAID controller can rebuild the data volume onto a hot standby drive or a replacement drive through hot swapping. Supposedly, the only time it will fail is when two disks fail simultaneously, and such a probability is claimed to be one in a million! As a result, one may tend to believe that RAID cannot fail.

The reality: RAID fails

In reality, and to the surprise of most, RAID can fail, and it fails often. See some typical scenarios below:

When one hard disk fails, very often there is no hot standby. As a result, the RAID array runs in degraded mode. While waiting for the replacement drive, which may take a day or two, the likelihood of the next drive failure disabling the RAID volume is very high. It is reasonable to assume that all the drives in the array are from the same batch and subject to an equal amount of working stress, so if one disk fails, the others are also near imminent failure, and that failure often comes.

Most RAID servers have a single controller. Its failure is a catastrophic single point of failure.

Frequently, due to a power surge, the controller or a number of disk elements can fail, resulting in total loss of data. A power surge may also corrupt the RAID configuration settings stored in the NVRAM of the controller card.

It is also commonly found that, while replacing a faulty drive in an attempt to rebuild the RAID volume to a healthy state, wrong procedures are performed, resulting in a wrong or partial rebuild, or a complete system breakdown upon completion of the rebuild.

Do not forget that a fault-tolerant RAID configuration at best only protects against physical failure, not logical corruption such as system corruption, virus infection, or inadvertent deletion.

Types Of RAID failures

To summarize, RAID servers often fail as a result of the following situations and, frequently, a combination of them:

  • Malfunctioning controller
  • RAID rebuild error or volume reconstruction problem
  • Missing RAID partition
  • Multiple disk failure in off-line state resulting in loss of the RAID volume
  • Wrong replacement of a good disk element belonging to a working RAID volume
  • Power Surge
  • Data Deletion or reformat
  • Virus Attack
  • Loss of RAID configuration settings or system registry
  • Inadvertent reconfiguration of RAID volume
  • Loss of RAID disk access after system or application upgrade

RAID Pricing

In general, pricing of RAID recovery starts from $1,500 and increases as the situation becomes more complex.

RAID Recovery Process

Though RAID disk arrays offer more redundancy, capacity, and performance than standard disk systems, once they fail they are often complex and more difficult to recover.

Normally, we only require the hard disks making up the RAID volume in order to recover the lost data.

The process begins by looking at the kind of failure that occurred in the RAID volume. If the RAID server failure is due to multiple failed disks, effort will first be spent on getting the failed disks back to a ready state.

The disk image, i.e. the low-level binary contents of each disk, is then copied out. Next, analysis is performed on the disk images. A process of de-striping is carried out on each of the extracted disk images once the RAID type, the correct order of the disk elements forming the RAID volume, the RAID stripe block size, the associated parity locations, and so on have been confirmed. Different manufacturers may use slightly different RAID settings, so additional fine-tuning may be needed. Very often, file system repair must also be performed before the data locations can be mapped out correctly.
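
The parity relationship that de-striping ultimately relies on (in the RAID 5 case) can be illustrated with a short Python sketch. The stripe geometry, disk order, and parity rotation are assumed to be already known here; determining them is the bulk of the real analysis.

    # Simplified illustration of RAID 5 XOR parity: the missing strip of any
    # stripe can be rebuilt by XOR-ing the surviving strips with the parity strip.

    def xor_blocks(*blocks: bytes) -> bytes:
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    # Three data strips and their parity strip, as they would sit in one stripe.
    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
    parity = xor_blocks(d0, d1, d2)

    # If the disk holding d1 fails, its strip is recovered from the others.
    recovered_d1 = xor_blocks(d0, d2, parity)
    assert recovered_d1 == d1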

Once the data layout pattern making up the RAID logical volume has been identified and confirmed, the critical data is lifted onto other disk media. The data integrity is then evaluated to ensure that the data is of acceptable quality before a file list is finally produced for customer review.

Raid Data Recovery Software: Getway Raid Recovery V2.1

Getway Raid Recovery is professional RAID data recovery software that can extract data from the multiple hard disks in a RAID system and rebuild the correct data. It can get data back from various types of arrays, including RAID 0, RAID 5, RAID 5E, RAID 5EE, and RAID 6.
