Removing Failed RAID Devices
Written by Tom Hirt   
Wednesday, 17 June 2009 08:34

 

How-to Remove Failed RAID Devices


In this KB article, we will discuss how to recover from the loss of a device in a Linux software RAID array.  We will demonstrate how to manually fail a disk, remove it, and then re-add it and rebuild the array.

It's inevitable that a device in your RAID array will eventually fail.  Replace the failed device as soon as possible, because RAID levels differ in how many device losses they can survive (see our Linux RAID How-to for a description of the different RAID levels and their tolerance of failed devices).  Let's begin by inspecting our array:

[root@Linux01 /]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[1] sdb1[0] sdd1[2]
      8385664 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
[root@Linux01 /]#

The UUU indicates that all three devices in the array are up and online.
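
The same check can be scripted.  The following is a minimal sketch rather than part of the original procedure: degraded_arrays is a hypothetical helper name, and the parsing assumes the /proc/mdstat layout shown above, where the status map (such as [UUU]) appears on the line after the array's header line.

```shell
#!/bin/sh
# degraded_arrays (hypothetical helper): reads mdstat-formatted text on stdin
# and prints the name of every array whose status map contains "_",
# which marks a missing or failed member.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }             # remember the current array
        name != "" && match($0, /\[[U_]+\]/) {  # status map such as [UUU] or [U_U]
            if (index(substr($0, RSTART, RLENGTH), "_") > 0) print name
            name = ""
        }
    '
}
```

Running "degraded_arrays < /proc/mdstat" should print nothing while every member is up, and one array name per degraded array otherwise.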

We are now going to simulate a disk failure by manually failing the second device (sdc1) in the array (/dev/md0).  Under most circumstances, should a device fail, the Linux RAID subsystem should detect the failure and automatically mark the disk failed.  You should not have to manually set the device as failed unless you already suspect issues with the disk.

[root@Linux01 /]# mdadm /dev/md0 --fail /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md0
[root@Linux01 /]#

Note: A device that is still active in the array cannot be removed; it must first be marked as faulty.

We can now verify that the disk has been marked faulty by inspecting /proc/mdstat:

[root@Linux01 /]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[3](F) sdb1[0] sdd1[2]
      8385664 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]

unused devices: <none>
[root@Linux01 /]#

Note: The (F) next to sdc1 marks it as failed.  [3/2] means the array expects three devices but only two are active, and the underscore in [U_U] shows that the second slot is down.
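
To pick out the failed members without reading the output by eye, something like the sketch below could be used.  It is an illustration only: failed_members is a hypothetical helper name, and prefixing /dev/ to the kernel device name is an assumption that holds for the simple partition names used in this article.

```shell
#!/bin/sh
# failed_members (hypothetical helper): reads mdstat-formatted text on stdin
# and prints "array device" for every member flagged (F).
failed_members() {
    awk '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[.*/, "", dev)     # strip the "[3](F)" suffix
                    print $1, "/dev/" dev    # assumption: node lives under /dev
                }
            }
        }
    '
}
```

For the output shown above, "failed_members < /proc/mdstat" would report md0 together with /dev/sdc1.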

We can inspect the array further with the --detail switch of the mdadm command:

[root@Linux01 ~]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Tue Jun 16 13:24:49 2009
     Raid Level : raid5
     Array Size : 8385664 (8.00 GiB 8.59 GB)
  Used Dev Size : 4192832 (4.00 GiB 4.29 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Jun 17 15:51:02 2009
          State : clean, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 25ab199d:9cf31f9d:5fefdf2f:865b1a2e
         Events : 0.10

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       0        0        1      removed
       2       8       49        2      active sync   /dev/sdd1

       3       8       33        -      faulty spare   /dev/sdc1

Note: This array is degraded because /dev/sdc1 is marked faulty.
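
If a script needs a single value from this report, such as State or Failed Devices, a small filter along these lines may help.  It is a sketch under the assumption that the report keeps the "Key : value" layout produced by mdadm --detail; detail_field is a hypothetical helper name.

```shell
#!/bin/sh
# detail_field (hypothetical helper): reads "mdadm --detail" output on stdin
# and prints the value of the field whose key is given as the first argument.
detail_field() {
    awk -F ' : ' -v key="$1" '
        { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }   # trim padding around the key
        k == key { print $2; exit }
    '
}
```

For example, piping mdadm --detail /dev/md0 into detail_field "Failed Devices" would print the failed device count.  The mdadm man page also documents a --test option for --detail that reflects the array state in the exit status, which may be simpler when only a pass/fail answer is needed.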

We can now remove the failed device (/dev/sdc1) from the array:

[root@Linux01 /]# mdadm /dev/md0 --remove /dev/sdc1
mdadm: hot removed /dev/sdc1
[root@Linux01 /]#
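
The fail and remove steps can be wrapped in one function.  The sketch below is only an illustration: run and fail_and_remove are hypothetical names, and the DRY_RUN variable is an assumption added here so the commands can be previewed before anything touches the array.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# fail_and_remove ARRAY MEMBER: mark MEMBER faulty, then remove it.
# The remove step runs only if the fail step succeeded.
fail_and_remove() {
    array=$1
    member=$2
    run mdadm "$array" --fail "$member" &&
    run mdadm "$array" --remove "$member"
}
```

With DRY_RUN=1 set, calling fail_and_remove /dev/md0 /dev/sdc1 prints the two mdadm commands used above instead of running them.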

Once the device has been removed, you can replace the faulty disk.  Be warned that hot swapping a disk on hardware that does not support it can damage the hardware or, even worse, cost you the entire array.  If at all possible, stop the array and replace the failed disk with the server powered off.  If you must perform a hot swap, most SCSI controllers should support it (use with caution); SATA support for hot swapping, however, is still limited to a handful of device drivers (see http://linux.yyz.us/sata/sata-status.html for the status of each SATA driver).  That said, SATA hot swapping is strongly discouraged, so proceed with extreme caution.


Adding Device to a RAID array


After you have replaced the failed device, you can add the new device to the array, which automatically initiates a rebuild.

Begin by creating a primary partition of type fd (Linux raid autodetect) on the new device.

Note: the md driver can use any block device as an array member, but the kernel's boot-time RAID autodetection only recognizes partitions of type fd.

[root@Linux01 /]# fdisk /dev/sdc

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-522, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-522, default 522):
Using default value 522

Command (m for help): t
Selected partition 1
Hex code (type L to list codes): fd
Changed system type of partition 1 to fd (Linux raid autodetect)

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
[root@Linux01 /]#

Once the device has been partitioned, you can add it back to the array:

[root@Linux01 /]# mdadm /dev/md0 --add /dev/sdc1
mdadm: re-added /dev/sdc1
[root@Linux01 /]#

Monitor the rebuild of the array by watching /proc/mdstat:

[root@Linux01 /]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[1] sdb1[0] sdd1[2]
      8385664 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
      [=========>...........]  recovery = 47.8% (2007428/4192832) finish=0.5min speed=62732K/sec

unused devices: <none>
[root@Linux01 /]#
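
To follow the rebuild from a script, the progress figure can be pulled out of the same file.  This is a rough sketch that assumes the progress line format shown above; recovery_progress is a hypothetical helper name.

```shell
#!/bin/sh
# recovery_progress (hypothetical helper): reads mdstat-formatted text on stdin
# and prints "array percent" for every array with a recovery or resync running.
recovery_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/) {
                    pct = $i
                    sub(/^=/, "", pct)       # handles "=100.0%" with no space
                    print name, pct
                }
            }
        }
    '
}
```

Calling "recovery_progress < /proc/mdstat" in a loop with a short sleep gives a simple progress monitor.  mdadm also provides a --wait option that blocks until recovery on the given array has finished.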

Once the rebuild has completed, the array should once again be protected against a device failure.  Good luck!

 


Last Updated on Thursday, 18 June 2009 15:35
 
