Linux Software Raid disk upgrades

by BenV on Dec.16, 2012, under Software

Every now and then you find out that the huge disk you’ve been using (you know, the one you bought thinking “How on earth am I ever going to fill this one up? My biggest game can fit on this disk 100 times!”) isn’t so huge anymore. Or at least all the free space on it has disappeared and Nagios is whining that your disk is full or about to explode.
Some background info: My fileserver here at home has 3 linux software raid arrays (raid-1 mirrors) on top of 4 physical disks. The first and also smallest array is used as root filesystem to boot from into Slackware linux. The second and third arrays are both big and simply for storage of games, music, series, etc.
When I created that first array a few years ago I figured “Hm, 20GB should be enough for a Slackware install, right? Well, let’s just make it 50GB to be sure, we have plenty of space anyway on this huge disk”. Back then the ‘huge’ disks were 500GB. Meanwhile those 500GB disks have been replaced with 1TB ones, but that array remained the same size. Today I have a set of 1.5TB drives to replace the 1TB ones. Not a huge upgrade, but I didn’t have to buy these disks since they came from a server that had its drives upgraded as well. Anyhow, the 50GB partition managed to get filled with over 40GB of stuff that I can’t trash (mostly user home directories). I could move them to a different partition of course, but today we’re going to resize that partition to 100GB and put the rest in the storage partition.
Off-topic note: Do you also hate it when you’re typing in a browser and hit CTRL-w to delete your last word and realize you just closed your tab? I sure as heck do, good thing WordPress saves these drafts every now and then 🙂

Old situation

First an overview of the old situation. Take a look at our friend /proc/mdstat:
root@server:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sdc1[2] sdb1[1]
      48829440 blocks [2/2] [UU]

md4 : active raid1 sdc2[0] sdb2[1]
      927737600 blocks [2/2] [UU]

md2 : active raid1 sdd1[1] sda1[0]
      1464846720 blocks [2/2] [UU]

unused devices: <none>
Don’t ask about the weird raid device numbering, that’s a remnant from the last upgrade and a fight with mdadm on a (PXE booted) Partition Magic 😉
Ignore md2 today, we’ll focus on md1 which is my Slackware boot/root partition and md4 which is the storage partition. Both disks (sdb and sdc) eventually need to be removed from this PC, but the machine has room for temporarily adding another harddisk.

Upgrade Plan

The new disks I have are 1.5TB WD Greens. Some people will claim that these are no good for RAID arrays, but in my experience they’re just fine. I’ve been using WD Greens for quite a while now, starting with the 500GB ones. They did have a firmware issue or two with auto-shutdown every 8 seconds when they were brand new, but that’s been resolved.
The plan is simple:

1. Add new disk
2. Partition new disk
3. Add new disk to old array
4. Remove old disk. Repeat steps 1-4 for disk 2.
5. Resize array to new disk size

Sector size

One interesting thing to note about the newer WD Green disks, the ones with a model number ending in EARS like my WD15EARS-00MVWB0, is that they use a 4KB physical sector size. This shows up in hdparm like this:
root@server:~# hdparm -I /dev/sda | egrep 'Model|Sector|Firm'
        Model Number:       WDC WD15EARS-00MVWB0
        Firmware Revision:  51.0AB51
        Logical  Sector size:                   512 bytes
        Physical Sector size:                  4096 bytes
        Logical Sector-0 offset:                  0 bytes
root@server:~# hdparm -I /dev/sde | egrep 'Model|Sector|Firm'
        Model Number:       WDC WD15EARS-00MVWB0
        Firmware Revision:  51.0AB51
        Logical/Physical Sector size:           512 bytes
Note the difference between the two drives: apparently it’s possible for a drive to lie to the OS about its sector size. There are no jumpers on the physical drive, so this is not a forced compatibility mode, but it could be a BIOS thing. I attached sde while the system was already running, on a SATA port that hadn’t been used so far.
The sector size has a noteworthy implication though: if your partition is not aligned on a 4KB boundary, your disk might get pretty slow, because a filesystem block can then straddle two physical sectors and the drive has to read or write both of them. Some more info on 4KB sector sizes can be found on Wikipedia under Advanced Format.
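To make the alignment rule concrete, here is a minimal sketch of the arithmetic. The function name `is_aligned` is made up for this example; it assumes the start sector is counted in logical sectors (512 bytes by default) and checks it against a 4096-byte physical sector.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a partition starting at the given
# logical sector begins exactly on a physical-sector boundary.
# Arguments: start sector, logical sector size, physical sector size (bytes).
is_aligned() {
    start_sector=$1
    logical=${2:-512}
    physical=${3:-4096}
    [ $(( (start_sector * logical) % physical )) -eq 0 ]
}

# Sector 63 is the traditional DOS start of the first partition.
is_aligned 63 && echo "63: aligned" || echo "63: misaligned"
# Sector 2048 is a 1 MiB offset with 512-byte logical sectors.
is_aligned 2048 && echo "2048: aligned" || echo "2048: misaligned"

# On a live system the start sector of a partition can be read from
# sysfs, for example /sys/block/sde/sde1/start (in 512-byte units).
```

Sector 63 comes out misaligned and sector 2048 aligned, which is why a 1 MiB offset is a safe choice for the first partition. As far as I know, parted can do a similar check itself with its `align-check` command.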

Action!

Enough theory, time to make some changes. First I’ve attached the new drive. As you can see above, the new drive is called sde. Time to partition it. For partitioning I’ll use parted, which is in the Slackware package repository these days. I’ll put the first partition at an offset of 1 megabyte. This fixes the 4KB alignment and also makes sure Grub has enough room to embed its core image in the gap between the MBR and the first partition (I had some issues with this after upgrading to grub 1.99 from 1.98).
root@server:~# parted /dev/sde
(parted) unit GiB
(parted) mkpart primary 1MiB 100GiB
(parted) mkpart primary 100GiB 1395GiB
(parted) set 1 boot on
(parted) set 1 raid on
(parted) set 2 raid on
(parted) print
Model: ATA WDC WD15EARS-00M (scsi)
Disk /dev/sde: 1397GiB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start    End      Size     Type     File system  Flags
 1      0.00GiB  100GiB   100GiB   primary               boot, raid
 2      100GiB   1395GiB  1295GiB  primary               raid

(parted) q
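Since the same layout has to go onto two disks, the interactive session above could also be written as a single scripted parted call. This is only a sketch under assumptions: the helper name `parted_layout_cmd` is made up, it only prints the command (dry run), and it assumes the disk already has an msdos partition table, as in the session above. The `-s` (script) and `-a optimal` (alignment) options are standard parted options, but I have not run this exact command on these disks.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the parted command that would
# create the two-partition layout from this post on the given disk.
# Nothing is written to disk; copy the printed line once it looks right.
parted_layout_cmd() {
    disk=$1
    echo parted -s -a optimal "$disk" \
        mkpart primary 1MiB 100GiB \
        mkpart primary 100GiB 1395GiB \
        set 1 boot on \
        set 1 raid on \
        set 2 raid on
}

parted_layout_cmd /dev/sde
parted_layout_cmd /dev/sdf
```

Printing the command first gives you a chance to double check the device name before anything destructive happens.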
Next I’ll add the new partitions to the raid-1 arrays and let the arrays get back in sync with the new member. This will only use part of the new member’s partition, since the array doesn’t care that the partition is bigger (and can’t use the extra space at the moment, even if it wanted to).
root@server:~# mdadm --add /dev/md1 /dev/sde1
mdadm: added /dev/sde1
root@server:~# mdadm --fail /dev/md1 /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md1
# This is a good moment to check if your mdadm monitoring works,
# you should receive an email now ;)
root@server:~# mdadm --remove /dev/md1 /dev/sdc1
mdadm: hot removed /dev/sdc1 from /dev/md1
# Note that above commands can be combined, but it's possible to
# receive a 'mdadm: hot remove failed for /dev/sdc1: Device or
# resource busy' error because mdadm tries to update the
# superblock of sdc1.
root@server:~# watch -n1 cat /proc/mdstat
# -- wait for resync to complete --
# Now do the same for other arrays this disk is in
root@server:~# mdadm --add /dev/md4 /dev/sde2
mdadm: added /dev/sde2
root@server:~# mdadm --fail /dev/md4 /dev/sdc2
mdadm: set /dev/sdc2 faulty in /dev/md4
root@server:~# mdadm --remove /dev/md4 /dev/sdc2
mdadm: hot removed /dev/sdc2 from /dev/md4
root@server:~# watch -n1 cat /proc/mdstat
# -- wait for resync to complete --
root@server:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sde1[0] sdb1[1]
      48829440 blocks [2/2] [UU]

md4 : active raid1 sde2[0] sdb2[1]
      927737600 blocks [2/2] [UU]

md2 : active raid1 sdd1[1] sda1[0]
      1464846720 blocks [2/2] [UU]

unused devices: <none>
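Staring at `watch` works, but the state of an array can also be pulled out of /proc/mdstat with a bit of awk. The sketch below is an illustration, not a tested tool: the function name `md_state` is made up, and the sample text is stitched together from mdstat outputs shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: reads mdstat-formatted text on stdin and prints
# "syncing", "degraded", "clean" or "unknown" for the named array.
md_state() {
    awk -v md="$1" '
        $1 == md && $2 == ":"           { in_md = 1; state = "clean"; next }
        in_md && /^md[0-9]+ :/          { in_md = 0 }
        in_md && /^unused devices/      { in_md = 0 }
        # A "_" in the [UU] map means a member is missing.
        in_md && /\[[U_]*_[U_]*\]/      { if (state == "clean") state = "degraded" }
        # A progress line means a rebuild is running.
        in_md && /(recovery|resync) *=/ { state = "syncing" }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Sample input assembled from outputs in this post.
sample_mdstat='Personalities : [raid1]
md1 : active raid1 sdf1[2] sde1[0]
      48829440 blocks [2/1] [U_]
      [>....................]  recovery =  0.2% (127552/48829440) finish=6.3min speed=127552K/sec

md4 : active raid1 sde2[0]
      927737600 blocks [2/1] [U_]

md2 : active raid1 sdd1[1] sda1[0]
      1464846720 blocks [2/2] [UU]

unused devices: <none>'

for md in md1 md4 md2; do
    echo "$md: $(printf '%s\n' "$sample_mdstat" | md_state "$md")"
done

# On a live system: md_state md1 < /proc/mdstat
```

If all you want is to block until a rebuild has finished, `mdadm --wait /dev/md1` does that without any parsing.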
Note that first adding a new disk and then failing and removing a member is exactly the same as first failing and removing a member and then adding the new disk. Either way there is a period during which your array runs from a single disk. If that disk happens to fail at that moment: kaboom, you’re in trouble. When you add a third disk to a raid-1 array that expects only 2 disks, it’ll simply be marked as a hot spare that doesn’t do jack until one of the other members fails.
In theory you can increase the number of raid-devices by doing something like ‘mdadm /dev/md1 --grow --raid-devices=3’, but I haven’t tried. I have backups in case it goes wrong :-p
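For the curious, this is roughly what that three-way variant would look like. It is an untested sketch, just like the idea itself: the helper names `run` and `swap_member_safely` are made up, and the script only prints the commands (dry run) so nothing touches a real array. The mdadm options used (`--add`, `--grow --raid-devices`, `--wait`, `--fail`, `--remove`) do exist, but check the man page of your mdadm version before trusting the sequence.

```shell
#!/bin/sh
# Dry run: print each command instead of executing it.
run() { echo "$@"; }

# Hypothetical helper: swap one raid-1 member while keeping two synced
# copies at all times. $1 = array, $2 = old member, $3 = new member.
swap_member_safely() {
    run mdadm "$1" --add "$3"                # new partition joins as a spare
    run mdadm --grow "$1" --raid-devices=3   # spare becomes an active third mirror
    run mdadm --wait "$1"                    # block until the rebuild is done
    run mdadm "$1" --fail "$2"
    run mdadm "$1" --remove "$2"
    run mdadm --grow "$1" --raid-devices=2   # back to a two-way mirror
}

swap_member_safely /dev/md1 /dev/sdc1 /dev/sde1
```

The point of the detour is that the old member is only failed after the new one is fully in sync, so the array never runs on a single disk.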

At this point you’ll either have to add another disk, or first remove one of the old disks. Mind you that sdb is still in the array, so I’ll get rid of that first and then remove both old disks at the same time.
root@server:~# mdadm /dev/md1 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md1
root@server:~# mdadm /dev/md1 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md1
root@server:~# mdadm /dev/md4 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md4
root@server:~# mdadm /dev/md4 --remove /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md4
Now it’s time to get those old 1TB disks out of my machine. Physically, because as far as Linux is concerned they’re already unused. The cool part is that we can do all of this while the machine stays up and running. Don’t you love SATA sometimes? 😉 Of course we’ll have to be careful not to accidentally pull the plug on the other drives 😉 It helps to have a decent case on your system, with drive bays made for hot swap if at all possible.

After attaching the new disk you should see it being recognized by Linux, something like this in dmesg:
[11851086.343396] ata5: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
[11851086.343400] ata5: irq_stat 0x00000040, connection status changed
[11851086.343403] ata5: SError: { CommWake DevExch }
[11851086.343410] ata5: hard resetting link
[11851087.063439] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[11851087.066759] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120320/psargs-359)
[11851087.066768] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT4._GTF] (Node ffff880215060be0), AE_NOT_FOUND (20120320/psparse-536)
[11851087.075738] ata5.00: ATA-8: WDC WD15EARS-00MVWB0, 51.0AB51, max UDMA/133
[11851087.075742] ata5.00: 2930277168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[11851087.079146] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120320/psargs-359)
[11851087.079154] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT4._GTF] (Node ffff880215060be0), AE_NOT_FOUND (20120320/psparse-536)
[11851087.079310] ata5.00: configured for UDMA/133
[11851087.079316] ata5: EH complete
[11851087.079424] scsi 4:0:0:0: Direct-Access ATA WDC WD15EARS-00M 51.0 PQ: 0 ANSI: 5
[11851087.079558] sd 4:0:0:0: Attached scsi generic sg1 type 0
[11851087.079564] sd 4:0:0:0: [sdf] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
[11851087.079692] sd 4:0:0:0: [sdf] Write Protect is off
[11851087.079697] sd 4:0:0:0: [sdf] Mode Sense: 00 3a 00 00
[11851087.079768] sd 4:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[11851087.092760] sdf: sdf1 sdf2
[11851087.093153] sd 4:0:0:0: [sdf] Attached SCSI disk
[11851087.301767] md: bind
[11851087.309624] md: bind
Don’t ask me about those ACPI errors, I haven’t seen those before. As you can see this disk already has cruft on it since it has been used in another server, so we’ll wipe that first and give it a partition table just like the other new disk. Once again we’ll use parted.
root@server:~# parted /dev/sdf
(parted) print
Model: ATA WDC WD15EARS-00M (scsi)
Disk /dev/sdf: 1500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      32.3kB  40.0GB  40.0GB  primary  ext4         boot, raid
 2      40.0GB  1500GB  1460GB  primary               raid

(parted) rm 2
Error: Partition(s) 2 on /dev/sdf have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.
Ignore/Cancel? i
(parted) rm 1
Error: Partition(s) 1, 2 on /dev/sdf have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.
Ignore/Cancel? i
(parted) q
Information: You may need to update /etc/fstab.

# We need to make it rescan the partitions so we don't get issues
# with the kernel. Parted normally does them automagically, but
# it failed somehow. Let's see why.
root@server:~# cat /proc/partitions | grep sdf
   8       80 1465138584 sdf
   8       81   39062016 sdf1
   8       82 1426073985 sdf2
root@server:~# partprobe
Error: Partition(s) 1, 2 on /dev/sdf have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.
Warning: Error fsyncing/closing /dev/md126: Input/output error
Warning: Error fsyncing/closing /dev/md127: Input/output error
# Oh swell, mdadm autobooted the existing arrays. Stop them first,
# then retry partprobe.
root@server:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md126 : inactive sdf2[1](S)
      449345495 blocks super 1.2

md127 : inactive sdf1[1](S)
      39061952 blocks

# And the rest of my arrays
root@server:~# mdadm --stop /dev/md127
mdadm: stopped /dev/md127
root@server:~# mdadm --stop /dev/md126
mdadm: stopped /dev/md126
root@server:~# cat /proc/partitions  | grep sdf
   8       80 1465138584 sdf
# Much better
root@server:~# parted /dev/sdf
(parted) unit GiB
(parted) mkpart primary 1MiB 100GiB
(parted) mkpart primary 100GiB 1395GiB
(parted) set 1 boot on
(parted) set 1 raid on
(parted) set 2 raid on
(parted) print
Model: ATA WDC WD15EARS-00M (scsi)
Disk /dev/sdf: 1397GiB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start    End      Size     Type     File system  Flags
 1      0.00GiB  100GiB   100GiB   primary               boot, raid
 2      100GiB   1395GiB  1295GiB  primary               raid

(parted) q
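The auto-assembly surprise above can be spotted up front by checking which md devices hold partitions of the disk you are about to repartition. A minimal sketch, assuming mdstat-formatted input; the function name `arrays_using_disk` is made up, and the sample text is copied from the mdstat output above.

```shell
#!/bin/sh
# Hypothetical helper: reads mdstat-formatted text on stdin and prints
# every md device that has a partition of the given disk as a member.
arrays_using_disk() {
    awk -v disk="$1" '
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++)
                if ($i ~ ("^" disk "[0-9]+\\[")) { print $1; break }
        }
    '
}

# Sample input copied from the mdstat output earlier in this post.
stale_mdstat='md126 : inactive sdf2[1](S)
      449345495 blocks super 1.2

md127 : inactive sdf1[1](S)
      39061952 blocks'

printf '%s\n' "$stale_mdstat" | arrays_using_disk sdf

# On a live system, something like this would stop them (as root):
#   for md in $(arrays_using_disk sdf < /proc/mdstat); do
#       mdadm --stop "/dev/$md"
#   done
```

Be careful with the loop in the comment: it stops every array that uses the disk, including healthy ones, so only run it against a disk that is not part of anything you care about. I did not do this here, but wiping the old RAID metadata with `mdadm --zero-superblock` on the old partitions before deleting them should also keep the arrays from being assembled again.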
Parted already ran partprobe for us, so in /proc/partitions you should see your brand new partitions. Now we can add the new partitions to our arrays and wait another day for the resync. Yay, fun!
root@server:~# mdadm /dev/md1 --add /dev/sdf1
mdadm: added /dev/sdf1
root@server:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sdf1[2] sde1[0]
      48829440 blocks [2/1] [U_]
      [>....................]  recovery =  0.2% (127552/48829440) finish=6.3min speed=127552K/sec

# After about 10 minutes:
root@server:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sdf1[1] sde1[0]
      48829440 blocks [2/2] [UU]

md4 : active raid1 sde2[0]
      927737600 blocks [2/1] [U_]

md2 : active raid1 sdd1[1] sda1[0]
      1464846720 blocks [2/2] [UU]

unused devices: <none>

# Next array :)
root@server:~# mdadm /dev/md4 --add /dev/sdf2
mdadm: added /dev/sdf2
root@server:~# watch -n1 cat /proc/mdstat
# Watch the progress while the array syncs again...
# taking a day or so ;)
Now it’s time for the actual resizing. Up until now all we’ve done is make two backup disks. To actually make the arrays use their extra space we’ll need to grow the raid arrays and make them take up the entire partition. Warning: If you’re using internal bitmaps on your array, this would be a good time to remove them. Leaving them there might corrupt your data! I don’t have those bitchmaps in use, so I don’t need to remove them 😉
root@server:~# mdadm /dev/md1 --grow --size=max
mdadm: component size of /dev/md1 has been set to 104856512K
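If your array does have an internal bitmap, the grow step from the warning above would get two extra commands around it. I don’t use bitmaps, so this is an untested sketch: the helper names `run` and `grow_with_bitmap_removed` are made up and the script only prints the commands (dry run). The `--bitmap=none` and `--bitmap=internal` options of `mdadm --grow` are real, but whether your mdadm and kernel still need this dance is something to check in their documentation.

```shell
#!/bin/sh
# Dry run: print each command instead of executing it.
run() { echo "$@"; }

# Hypothetical helper: grow an array that carries an internal bitmap.
# $1 = array device.
grow_with_bitmap_removed() {
    run mdadm --grow "$1" --bitmap=none       # drop the internal bitmap first
    run mdadm --grow "$1" --size=max          # let members use their whole partition
    run mdadm --wait "$1"                     # wait for the new space to sync
    run mdadm --grow "$1" --bitmap=internal   # put the bitmap back afterwards
}

grow_with_bitmap_removed /dev/md1

# An array with a bitmap shows a line starting with "bitmap:" in its
# /proc/mdstat stanza, so that is the place to check before growing.
```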
Once again a resync will occur… always the same story huh 😉
After finishing another cup of coffee the array is resized (see the new block count in /proc/mdstat) and we can continue to resize the filesystem that’s on top of it. Here’s mdstat after the resize (with 104856512 blocks instead of 48829440!):
root@server:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sdf1[1] sde1[0]
      104856512 blocks [2/2] [UU]
# snip
The ability to do an online resize — yes, resizing a mounted read/write filesystem with open files etc — is present these days, but not for all filesystems. Fortunately I use ext4 which does have support for online resizing (at least since linux 2.6), so here goes:
root@server:~# resize2fs /dev/md1
resize2fs 1.42.6 (21-Sep-2012)
Filesystem at /dev/md1 is mounted on /; on-line resizing required
old_desc_blocks = 3, new_desc_blocks = 7
Performing an on-line resize of /dev/md1 to 26214128 (4k) blocks.
# Here it'll take a while :)
The filesystem on /dev/md1 is now 26214128 blocks long.
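A quick sanity check on those numbers: mdadm reports the array size in 1K blocks and resize2fs reports the filesystem in 4K blocks, so the two should line up. The variable names below are made up for the example; the values are copied from the outputs above.

```shell
#!/bin/sh
md_component_kib=104856512   # from "mdadm --grow --size=max" (1K blocks)
fs_blocks_4k=26214128        # from resize2fs (4K blocks)

# Convert the filesystem size to KiB and compare it with the array.
fs_kib=$(( fs_blocks_4k * 4 ))
echo "filesystem: ${fs_kib} KiB"
echo "array:      ${md_component_kib} KiB"
echo "unused:     $(( md_component_kib - fs_kib )) KiB"
```

The difference is zero, so the filesystem fills the grown array completely.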
Repeat for the other raid array and filesystem and we’re done with all our data still there 🙂

Cool huh, we just removed the disks our operating system was booted on, and everything is still up and running as if nothing happened… well, except that the drives got bigger 😉