Friday, November 16, 2007

Sometimes there is a first time: Disk crash in a RAID array

I cannot say that I have never experienced a disk crash. It happens extremely seldom, though, and a disk crash inside a RAID array was definitely a first for me. To my own shame I cannot claim that I handled it with grace. Well, sometimes there is a first time and you have to learn.

Two days ago a customer's mail server deadlocked. I got the call early the next morning, and indeed I was unable to do anything with the machine; the only choice left was rebooting. The machine came up, but it was suspiciously slow. It turned out that the RAID array was constantly trying to resynchronize a central partition of the system. I made an error here: I assumed I should not fiddle with the hard disk at this point, since the workday was just beginning and the customer would have been quite unhappy had his own customers been left without mail. In fact I should simply have run smartctl right away. It would have told me something like this:
> smartctl -a /dev/sdb

...

SMART Error Log Version: 1
ATA Error Count: 206 (device log contains only the most recent five errors)

...

 Error 206 occurred at disk power-on lifetime: 3931 hours (163 days + 19 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 00 50 06 49 e0  Error: UNC at LBA = 0x00490650 = 4785744

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 25 00 08 50 06 49 e0 00      06:40:08.182  READ DMA EXT
 27 00 00 00 00 00 e0 00      06:40:08.128  READ NATIVE MAX ADDRESS EXT
 ec 00 00 00 00 00 a0 02      06:40:08.124  IDENTIFY DEVICE
 ef 03 46 00 00 00 a0 02      06:40:06.213  SET FEATURES [Set transfer mode]
 27 00 00 00 00 00 e0 00      06:40:06.145  READ NATIVE MAX ADDRESS EXT

...
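Incidentally, you don't need the full error log for a first verdict. If you just want the disk's own opinion of itself, the health flag alone would have done; a minimal check (it answers with a plain PASSED or FAILED, although a disk can log plenty of errors and still claim PASSED, so the error log is worth reading either way):
# Overall health self-assessment only -- FAILED means: replace the disk
> smartctl -H /dev/sdb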
Since the partition /dev/sdb8 was causing the problem, I should then have removed it from the RAID array by issuing:
> mdadm /dev/md8 --fail /dev/sdb8
> mdadm /dev/md8 --remove /dev/sdb8
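To make sure the kick actually worked, the array status can be checked right afterwards; both commands only read state, so they are safe to run at any time. The failed partition should no longer be listed, and the array should show up as degraded:
> mdadm --detail /dev/md8
> cat /proc/mdstat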
That would have saved me some trouble yesterday. Because the machine constantly tried to resync the broken partition, the system was highly unstable and I had to reboot twice more. The following night the machine went to hell again, requiring yet another reboot in the morning. But since the weekend had started by then, I no longer had any reservations about taking the machine offline for a closer look. I informed the server support and they swapped the disk. After that I had to copy the partition table from /dev/sda to the fresh /dev/sdb using fdisk:
> fdisk /dev/sda

Command (m for help): p

Disk /dev/sda: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1         128     1028128+  fd  Linux raid autodetect
/dev/sda2             129         383     2048287+  82  Linux swap / Solaris
/dev/sda4             384       30401   241119585    5  Extended
/dev/sda5             384        1021     5124703+  fd  Linux raid autodetect
/dev/sda6            1022        1659     5124703+  fd  Linux raid autodetect
/dev/sda7            1660        4209    20482843+  fd  Linux raid autodetect
/dev/sda8            4210       30401   210387208+  fd  Linux raid autodetect
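As an aside, the copying could also have been scripted instead of typed. A minimal sketch using sfdisk; be very sure that /dev/sdb really is the blank replacement, because this overwrites its partition table:
# Dump the table of the healthy disk and replay it onto the new one
> sfdisk -d /dev/sda | sfdisk /dev/sdb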
I did it by hand though: using n to create new partitions and t to set the correct partition types, the layout got copied to /dev/sdb. Back in the rescue system, the final step was adding the new disk to the RAID arrays again:
rescue:~# mdadm --manage /dev/md1 --add /dev/sdb1
rescue:~# mdadm --manage /dev/md5 --add /dev/sdb5
rescue:~# mdadm --manage /dev/md6 --add /dev/sdb6
rescue:~# mdadm --manage /dev/md7 --add /dev/sdb7
rescue:~# mdadm --manage /dev/md8 --add /dev/sdb8
While the resynchronisation runs, /proc/mdstat shows the progress. Note that the kernel rebuilds only one array on a given pair of disks at a time, so the remaining arrays are queued as resync=DELAYED:
rescue:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [faulty]
md1 : active raid1 sdb1[1] sda1[0]
1028032 blocks [2/2] [UU]

md5 : active raid1 sdb5[1] sda5[0]
5124608 blocks [2/2] [UU]

md6 : active raid1 sdb6[2] sda6[0]
5124608 blocks [2/1] [U_]
resync=DELAYED

md7 : active raid1 sdb7[2] sda7[0]
20482752 blocks [2/1] [U_]
resync=DELAYED

md8 : active raid1 sdb8[2] sda8[0]
210387136 blocks [2/1] [U_]
[==============>......] recovery = 70.1% (147615104/210387136) finish=18.5min speed=56316K/sec

unused devices: <none>
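If you want to follow the rebuild without retyping the command, watch is a small convenience; it simply reruns the status check until you interrupt it with Ctrl-C:
rescue:~# watch -n 10 cat /proc/mdstat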
Looking back, I must say it was nothing dramatic at all. The only real problem was my lack of knowledge about how to handle the situation. So I figured it was worth posting.