Let’s just face it – your hard disks WILL fail. You know it; you just don’t know when. If you do not set up at least RAID-1 (a mirror) you WILL lose some data and your service WILL certainly go down.
OK, we all know it, and you are most probably running RAID-1/5/6 already, ready to face the e-mail saying:
A DegradedArray event had been detected on md device /dev/mdX.
However, before this happens – when you set up your mdadm monitor you should also set up smartctl (or rather the smartd daemon) to periodically test and check the state of your drives. Then, before the above happens, you might get:
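If you have not done so yet, smartd can run those periodic checks for you. A minimal /etc/smartd.conf sketch (the schedule and mail address are assumptions – adjust to taste):

```
# Scan all drives; -a enables all default monitoring,
# -o/-S turn on automatic offline testing and attribute autosave.
# The -s regex schedules a short self-test daily at 02:00
# and a long self-test every Sunday at 03:00.
# Warnings go to root by mail.
DEVICESCAN -a -o on -S on \
  -s (S/../.././02|L/../../7/03) \
  -m root
```

Remember to enable and start the smartd service after editing the file.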
Device: /dev/sdX [SAT], 1 Currently unreadable (pending) sectors
Device: /dev/sdX [SAT], Self-Test Log error count increased from 0 to 1
This way you will buy some time. How much? Probably not much, really, but you get the picture before your array degrades.
Did you set up your hot spares? If so, you can simply remove the faulty disk from the array(s) and let the spare resynchronize and take over. This again buys you some time, since your array will be up and running, minus the spare. Depending on your strategy – don’t forget to get a new spare.
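The removal itself goes roughly like this (md0, sdX1 and sdY1 are placeholders for your own array and partitions; once the faulty disk is marked as failed, the hot spare kicks in automatically):

```shell
# Mark the suspect disk as failed – the hot spare starts resyncing:
mdadm --manage /dev/md0 --fail /dev/sdX1

# Then pull it out of the array:
mdadm --manage /dev/md0 --remove /dev/sdX1

# Watch the resync progress:
cat /proc/mdstat

# Later, add the replacement drive (or a new spare):
mdadm --manage /dev/md0 --add /dev/sdY1
```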
Bear in mind, however, that if you happened to buy your disks as identical models and at the same time – chances are that while the array is resynchronizing you might be hit by a failure of one of your healthy (but equally old) drives. This has happened to many people. So be careful and do not underestimate a good (accessible and readable) backup along with a disaster recovery plan. Do you check your backups’ health?
Another thing you might stumble upon when a disk fails is its physical placement. On a server platform this will not be much of an issue, but if your array runs in a PC-based setup you will certainly be thinking: “which one of these is it, actually?” Try checking:
# ls -la /dev/disk/by-path
first. If you are able to enumerate your controllers and your controllers’ ports, the problem is solved. But what if the ports on your motherboard or controller are marked with tiny symbols you simply cannot see in the web of cables and cards? And yes, you see the model and serial number in the /dev/disk tree, but you don’t see the stickers on your drives – and you did not take notes when putting them in, did you? Moreover, hard drives do not ship with indicator LEDs these days, so generating some activity and watching which one blinks is no longer an option.
Or is it? Actually, it might be. Here’s a hint: temperature. Seriously. But use these techniques wisely on a degraded array, as you certainly don’t want to break the other (probably equally old) disks by stressing them. You could play with dd on your array’s components and watch your drives’ temperatures rise. If you can raise the temperature of one disk about 20 degrees above the other drives’, you might just touch each drive and feel out the right one. Repeat as necessary. Then replace.
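A sketch of the idea (sdX is a placeholder for the suspect component; run it only against the drive you want to warm up):

```shell
# Read-stress ONE drive to warm it up; iflag=direct bypasses the
# page cache so the reads actually hit the disk.
dd if=/dev/sdX of=/dev/null bs=1M iflag=direct &
DD_PID=$!

# Meanwhile, watch the temperature attribute climb:
watch -n 30 'smartctl -A /dev/sdX | grep -i temperature'

# Stop the stress once the drive is clearly warmer than its neighbours:
kill $DD_PID
```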
Another method based on temperature would be to spin down the faulty drive with something like:
# sdparm --command=stop /dev/sdX
Remember to make sure nothing wakes it up accidentally (e.g. a periodic smartctl check). An advantage of this method is that you do not stress your risky disks by heating them up. A disadvantage is that the drive will only cool down to ambient temperature, which might not be different enough from the other drives in the box – in other words, you might not feel the difference by touch.
And some footnotes:
On your RAID-1 mirror, don’t forget to install your bootloader on the new drive if you mirror your boot drive(s). People forget this far too often. Example for (legacy) GRUB with drives sda and sdb running a mirrored /boot partition (as the first partition):
device (hd0) /dev/sda
root (hd0,0)
setup (hd0)
device (hd0) /dev/sdb
root (hd0,0)
setup (hd0)
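If you run GRUB 2 instead of legacy GRUB, the same idea is a sketch along these lines (device names are placeholders; on BIOS systems GRUB 2 installs to the MBR of each disk):

```shell
# Install the GRUB 2 boot code on both mirrored drives:
grub-install /dev/sda
grub-install /dev/sdb

# Regenerate the configuration (on Debian-style systems):
update-grub
```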
Another hint for when you recover from an array incident – don’t forget to label your cables if you did not do so before. And if it’s not too late for reassembly – take notes and make a sticker with your drives’ information (model, serial number) and put it in the box for future reference.
Post image by xav.com