HDD malfunctions

 “Nothing is eternal”: the saying applies to hard disk drives as well. No matter how reliable an HDD is, destructive processes still degrade it over time.

 First, a drive is an electromechanical device, and all mechanical parts gradually wear out. Over time the connections between mechanical parts become slack. The magnetic heads take off from and land on the disk surface every time disk rotation starts and stops, and these repeated contacts wear away the protective coating on the heads. Nevertheless, modern manufacturing technology gives hard drives a rather long life. According to the technical reference manual for Western Digital drives (Caviar BB/JB family), the heads withstand at least 50,000 contacts with the disk surface during start and stop (Contact Start/Stop Cycles, CSS), while unrecoverable read errors (Error Rate – Unrecoverable) occur less often than once per 10 to the 14th power bytes read. In everyday terms this means the following: if the drive is switched on and off ten times a day, about 14 years will pass before head contacts cause any deterioration of the heads or surfaces; and one error will occur per more than 32 TB of data read (roughly equivalent to watching movies in MP4 format non-stop for 7 to 10 years).
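The service-life estimate above is simple arithmetic and can be reproduced with a short sketch. The function names are hypothetical and chosen only for illustration. The second function converts an error rate of one per 10 to the 14th power into a data volume; whether the manual counts bits or bytes is treated here as an assumption, so both readings are supported.

```python
def css_lifetime_years(rated_cycles, cycles_per_day):
    """Years until the rated number of start/stop cycles is used up."""
    return rated_cycles / cycles_per_day / 365.0


def data_per_error_tb(exponent, unit):
    """Data volume (decimal TB) read per one unrecoverable error.

    `unit` is an assumption about what the error rate counts:
    "bits" or "bytes".
    """
    bits_per_unit = {"bits": 1, "bytes": 8}[unit]
    total_bits = 10 ** exponent * bits_per_unit
    return total_bits / 8 / 1e12  # bits -> bytes -> terabytes


if __name__ == "__main__":
    # 50,000 rated cycles at ten power cycles per day.
    print(round(css_lifetime_years(50_000, 10), 1), "years")
    for unit in ("bits", "bytes"):
        print(unit, data_per_error_tb(14, unit), "TB per error")
```

With 50,000 cycles and ten power cycles a day the first function gives about 13.7 years, which the text rounds to 14.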

Still, in real life we often face a very different situation: a recently purchased, brand-new drive fails after a few months of operation. Many drives do not even outlast the warranty period set by their manufacturer. Note that all manufacturers except Samsung have reduced that period from three years to one. What are the reasons for this situation?

Normal HDD ageing malfunctions
 When a properly assembled drive is operated correctly, in line with all requirements of its Technical Reference Manual, it still undergoes a normal ageing process. Ageing affects the magnetic disks most of all. First, the magnetization of the smallest magnetic “prints” (dibits) weakens over time, so the drive has to re-read some portions of the disks that used to read flawlessly, or those portions even begin to produce read errors. Second, the magnetic layer on the disks deteriorates, accumulating scratches, chips, cracks and so on. All of these cause BAD sectors to appear.

Normal drive ageing is a slow process that usually takes 3 to 5 years. Note that continuous operation is actually more favourable for an HDD than a mode in which the drive starts and stops frequently. That is why drives last quite long in dedicated servers that run round the clock and are housed in a separate room or enclosure with mandatory climate control.

Malfunctions resulting from incorrect mode of operation
 The most frequent cause of HDD malfunctions is incorrect operation. The main destructive factors are overheating, mechanical impacts and voltage surges in the HDD power supply.

Overheating is caused by insufficient cooling of the drive case and PCB. According to the technical reference manual for Western Digital drives (Caviar BB/JB family), the allowed operating temperature ranges from 5 °C to 55 °C, provided that air circulates around the drive at all times. The latter condition exists because some chips on the control board (motor controllers, etc.) become much hotter than that, and their heat must be dissipated. Now imagine that it is summer and the room temperature reaches 30 °C. Inside the computer case the temperature rises by another 20 to 25 °C, reaching the limit values, while there is no normal air circulation: the only exhaust fan, in the power supply, is clogged with dust, the flat cables inside form a tight knot, and the drive is squeezed between a CD drive and an FDD. Opening the computer case does not remedy the situation, because it does not improve air flow around the HDD.
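The summer scenario above can be checked with a minimal sketch. The limits are the ones quoted from the manual; the function names and the idea of treating the in-case rise as a simple addition to room temperature are illustrative assumptions.

```python
# Operating limits quoted in the text (WD Caviar BB/JB family).
MIN_OPERATING_C = 5.0
MAX_OPERATING_C = 55.0


def drive_ambient_c(room_c, case_rise_c):
    """Air temperature around the drive: room temperature plus in-case rise."""
    return room_c + case_rise_c


def thermal_margin_c(room_c, case_rise_c, limit_c=MAX_OPERATING_C):
    """Degrees left before the upper operating limit; <= 0 means no margin."""
    return limit_c - drive_ambient_c(room_c, case_rise_c)


if __name__ == "__main__":
    for rise in (20.0, 25.0):
        margin = thermal_margin_c(30.0, rise)
        print(f"room 30 C, rise {rise} C -> margin {margin} C")
```

At a 20 °C rise only 5 °C of margin remains, and at 25 °C the drive sits exactly at its upper limit, before counting the extra heat of the on-board chips.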

Another important temperature parameter is its gradient, which should not exceed 20 °C per hour during operation and 30 °C per hour when the drive is not operating. Exceeding the latter is very dangerous for the drive mechanics; the phenomenon is called thermal shock. Suppose you bring an HDD home in winter from a store or from a friend (where you had to read some data), it is frosty outside and 20 °C inside. If you power up the drive immediately, individual mechanical parts of the HDA heat up suddenly and locally, which may cause micro-deformations of the drive's precision mechanics. Such a sharp temperature change is very harmful for electronic components, too.
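The gradient limits can be applied to a series of temperature readings, as in the sketch below. The limits are those quoted in the text; the sample readings (a drive brought in from an assumed -10 °C) and the function names are hypothetical.

```python
# Gradient limits quoted in the text, in degrees Celsius per hour.
MAX_GRADIENT_OPERATING = 20.0
MAX_GRADIENT_NON_OPERATING = 30.0


def max_gradient_c_per_hour(samples):
    """Largest absolute rate of change between consecutive readings.

    `samples` is a list of (time_in_hours, temperature_c) pairs
    sorted by time, with at least two entries.
    """
    return max(
        abs(t2 - t1) / (h2 - h1)
        for (h1, t1), (h2, t2) in zip(samples, samples[1:])
    )


def exceeds_limit(samples, operating):
    """True if the readings violate the limit for the given drive state."""
    limit = MAX_GRADIENT_OPERATING if operating else MAX_GRADIENT_NON_OPERATING
    return max_gradient_c_per_hour(samples) > limit


if __name__ == "__main__":
    # Assumed example: from -10 C to 20 C within half an hour.
    fast = [(0.0, -10.0), (0.5, 20.0)]
    # Same change spread over two hours, e.g. drive left in its packaging.
    slow = [(0.0, -10.0), (2.0, 20.0)]
    print(exceeds_limit(fast, operating=False))
    print(exceeds_limit(slow, operating=False))
```

The practical conclusion is the same as in the text: let a cold drive warm up slowly before powering it on.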

Mechanical impacts on the HDA are just as dangerous for the drive's precision mechanics. During operation, as described in the previous article, the spring-loaded magnetic heads fly at a very low height above disks rotating at high speed. An impact on the HDA in that state makes the heads vibrate and strike the disks repeatedly, which in turn chips both the disk surface and the surface of the magnetic heads.

The power supply unit that feeds the whole PC, and therefore the drive, poses a very serious danger to HDD electronics. To cut costs, manufacturers frequently omit filtering circuitry both in the primary 220 V circuit and in the secondary circuits. Very often the rated power does not match the actual value, and the stabilized voltage turns out to be not so stable, although those parameters are strictly specified for disk drives. According to the technical reference manual for Western Digital drives (Caviar BB/JB family), the allowed supply voltage is +5 V ±5% and +12 V ±10%, and the allowed voltage fluctuation (ripple) is 100 mV on the +5 V circuit and 200 mV on the +12 V circuit. Most computer service specialists test power supply units with a voltmeter only, but voltage fluctuations, which are an important parameter, can be checked only with an oscilloscope.
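The tolerances quoted above translate into concrete voltage windows. The following sketch computes them and checks a measurement against both the window and the ripple limit; the rail labels, function names and sample readings are illustrative assumptions, and the ripple value is assumed to come from an oscilloscope measurement.

```python
# Supply limits quoted in the text (WD Caviar BB/JB family).
RAILS = {
    "+5V": {"nominal": 5.0, "tolerance": 0.05, "max_ripple_mv": 100.0},
    "+12V": {"nominal": 12.0, "tolerance": 0.10, "max_ripple_mv": 200.0},
}


def allowed_range(rail):
    """Return (low, high) allowed voltage for the given rail."""
    spec = RAILS[rail]
    delta = spec["nominal"] * spec["tolerance"]
    return spec["nominal"] - delta, spec["nominal"] + delta


def rail_ok(rail, mean_v, ripple_mv):
    """True if both the mean voltage and the ripple are within limits."""
    low, high = allowed_range(rail)
    return low <= mean_v <= high and ripple_mv <= RAILS[rail]["max_ripple_mv"]


if __name__ == "__main__":
    for rail in RAILS:
        low, high = allowed_range(rail)
        print(f"{rail}: {low:.2f} V to {high:.2f} V")
    # A voltmeter alone would pass this rail; the ripple reading fails it.
    print(rail_ok("+5V", mean_v=5.1, ripple_mv=150.0))
```

The windows come out as 4.75 V to 5.25 V for the +5 V rail and 10.8 V to 13.2 V for the +12 V rail. The last call illustrates why a voltmeter reading is not sufficient: the mean voltage is within range, yet the rail fails on ripple.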

Construction-related malfunctions
 The quality of HDDs has decreased lately, as confirmed by the shorter warranty periods offered by many manufacturers. To some extent this is caused by stiff competition and the resulting race to produce cheap drives. It is also connected with the race for higher recording density and greater capacity per disk. As a consequence, vendors frequently use solutions, materials and technologies that have not been thoroughly tested and verified, so imperfect products reach the market and then end users. Later, manufacturers analyze the malfunctions of drives returned under warranty and attempt to eliminate the design flaws, but those attempts are not always successful.

In theory, such an approach to drive design and production may cause problems with any part of a drive. The most frequent troubles are the following:

Bad contact in the pin connector between the PCB and the preamplifier chip connected to the magnetic head assembly. A poor contact can have many consequences. First of all, it causes bad sectors to appear, but these differ from ordinary defects caused by poor surface quality: the surface remains intact, while the bad contact causes invalid data to be written to the service bytes of some sectors, e.g. to the field containing the sector's CRC code. The problem may also corrupt firmware data, which the drive cannot restore by itself at the next power-up; moreover, no user mode exists for such restoration. Firmware data can be restored in factory mode only.

Poor soldering of chips at the factory. This workmanship flaw usually becomes apparent after about a year of operation. It typically shows up as a loss of contact: after a period of normal operation the drive either switches off and does not start again (“hangs”) or begins to knock with its heads; the latter may damage its mechanical parts. Like the previous flaw, it may also cause firmware corruption.

Poor-quality chips, which fail even at temperatures within the allowed limits. The fault can be repaired by replacing the defective chip with an identical working one.

Imperfect design of fluid dynamic bearings, which lets wear particles accumulate in the lubricant and eventually causes the spindle motor to seize.

There are also cases when the disks are not fixed on the spindle properly. As a result, disk runout (wobble) grows steadily and destroys the spindle motor bearing. Drive operation becomes considerably noisier, and after some time defective sectors appear because the runout leads to incorrect reading of some tracks.

Poor-quality Flash ROM chips, which may lose the firmware code stored in them because of charge leakage when heated. The ROM can be rewritten either in a dedicated ROM chip programmer or by the drive itself in factory mode.

Errors in the drive firmware microcode. Manufacturers do not disclose the nature of such errors. However, firmware updates are issued quite regularly. It would be a mistake to believe that these errors have no effect on drive operability, because in some cases they may result in damage to the drive mechanics.