BenV's notes

Linux Software Raid disk upgrades

Posted on Dec. 16, 2012, under Software

Every now and then you find out that this huge disk you’ve been using — you know, the one that when you bought it you thought “How on earth am I ever going to fill this one up? My biggest game can fit on this disk 100 times!” — … isn’t as huge anymore. Or at least all the free space on it has disappeared and nagios is whining that your disk is full or about to explode.
Some background info: My fileserver here at home has 3 linux software raid arrays (raid-1 mirrors) on top of 4 physical disks. The first and also smallest array is used as root filesystem to boot from into Slackware linux. The second and third arrays are both big and simply for storage of games, music, series, etc.
When I created that first array a few years ago I figured “Hm, 20GB should be enough for a Slackware install, right? Well, let’s just make it 50GB to be sure, we have plenty of space anyway on this huge disk”. Back then the ‘huge’ disks were 500GB. Meanwhile those 500GB disks have been replaced with 1TB ones, but that array remained the same size. Today I have a set of 1.5TB drives to replace the 1TB ones. Not a huge upgrade, but I didn’t have to buy these disks since they came from a server that had its drives upgraded as well. Anyhow, the 50GB partition managed to get filled with over 40GB of stuff that I can’t trash (mostly user home directories). I could move them to a different partition of course, but today we’re going to resize that partition to 100GB and give the rest of the disk to the storage partition.
Off-topic note: Do you also hate it when you’re typing in a browser and hit Ctrl-W to delete your last word and realize you just closed your tab? I sure as heck do, good thing WordPress saves these drafts every now and then 🙂

Old situation

First an overview of the old situation. Take a look at our friend /proc/mdstat:

root@server:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sdc1[2] sdb1[1]
48829440 blocks [2/2] [UU]

md4 : active raid1 sdc2[0] sdb2[1]
927737600 blocks [2/2] [UU]

md2 : active raid1 sdd1[1] sda1[0]
1464846720 blocks [2/2] [UU]

unused devices: <none>

Don’t ask about the weird raid device numbering, that’s a remnant from the last upgrade and a fight with mdadm on a (PXE booted) Parted Magic 😉
Ignore md2 today, we’ll focus on md1 which is my Slackware boot/root partition and md4 which is the storage partition. Both disks (sdb and sdc) eventually need to be removed from this PC, but the machine has room for temporarily adding another harddisk.

Upgrade Plan

The new disks I have are 1.5TB WD Greens. Some people will claim that these are no good for raid arrays, but in my experience they’re just fine. I’ve been using WD Greens for quite a while now, starting with the 500GB ones. They did have a firmware issue or two with auto-shutdown every 8 seconds when they were brand new, but that’s been resolved.
The plan is simple:

  1. Add new disk
  2. Partition new disk
  3. Add new disk to old array
  4. Remove old disk. Repeat steps 1-4 for disk 2.
  5. Resize array to new disk size
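As a rough dry-run sketch of steps 3 and 4, the snippet below only prints the mdadm commands for swapping one raid-1 member; it does not run them. The function name `swap_member_cmds` and its argument order are made up for illustration, and the device names are the ones from this machine.

```sh
#!/bin/sh
# Dry-run sketch: print (do NOT run) the mdadm commands for swapping one
# raid-1 member. swap_member_cmds is a hypothetical helper name.
# $1 = array, $2 = old member partition, $3 = new member partition
swap_member_cmds() {
    md=$1; old=$2; new=$3
    echo "mdadm $md --add $new"
    echo "mdadm $md --fail $old"
    echo "mdadm $md --remove $old"
    echo "# then wait for the resync to finish (watch /proc/mdstat)"
}

swap_member_cmds /dev/md1 /dev/sdc1 /dev/sde1
```

Printing the commands first makes it easy to double-check the device names before anything destructive happens.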

Sector size

One interesting thing to note about the newer WD Green disks, the ones with EARS in the model number like my WD15EARS-00MVWB0, is that they use a 4KB physical sector size. This shows up in hdparm like this:

root@server:~# hdparm -I /dev/sda | egrep 'Model|Sector|Firm'
Model Number: WDC WD15EARS-00MVWB0
Firmware Revision: 51.0AB51
Logical Sector size: 512 bytes
Physical Sector size: 4096 bytes
Logical Sector-0 offset: 0 bytes
root@server:~# hdparm -I /dev/sde | egrep 'Model|Sector|Firm'
Model Number: WDC WD15EARS-00MVWB0
Firmware Revision: 51.0AB51
Logical/Physical Sector size: 512 bytes

Note the difference between the two drives: apparently it’s possible for a drive to lie to the OS about its sector size. There are no jumpers on the physical drive, so this is not a jumper-forced compatibility mode, but it could be a BIOS thing. I attached sde while the system was already running, on a SATA port that hadn’t been used so far.
The sector size has a noteworthy implication though: if your partition is not aligned on a 4KB boundary your disk might get pretty slow, because a filesystem block can then straddle two physical sectors and the drive has to read/write both of them for every such block. Some more info on 4KB sector sizes is on Wikipedia under Advanced Format.
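The alignment rule is simple arithmetic, so here is a minimal sketch of it. It assumes the start offset is given in 512-byte logical sectors (which is what sysfs reports, e.g. in `/sys/block/sde/sde1/start`); the function name `is_4k_aligned` is made up for this example. The two sample values are the classic DOS offset of 63 sectors and the 1MiB offset (2048 sectors).

```sh
#!/bin/sh
# Succeeds (exit 0) when a partition start, given in 512-byte logical
# sectors, falls on a 4096-byte physical sector boundary.
# is_4k_aligned is a hypothetical helper name.
is_4k_aligned() {
    [ $(( $1 * 512 % 4096 )) -eq 0 ]
}

for start in 63 2048; do
    if is_4k_aligned "$start"; then
        echo "start sector $start: aligned"
    else
        echo "start sector $start: NOT aligned"
    fi
done
```

Sector 63 sits at byte 32256, which is not a multiple of 4096, while sector 2048 sits at exactly 1MiB and is.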

Action!

Enough theory, time to make some changes. First I’ve attached the new drive. As you can see above, the new drive is called sde. Time to partition it. For partitioning I’ll use parted, which is in the Slackware package repository these days. I’ll put the first partition at an offset of 1 megabyte. This fixes the 4KB alignment and also makes sure Grub has enough room to put its core image in the gap right after the MBR (I had some issues with this after upgrading to grub 1.99 from 1.98).

root@server:~# parted /dev/sde
(parted) unit GiB
(parted) mkpart primary 1MiB 100GiB
(parted) mkpart primary 100GiB 1395GiB
(parted) set 1 boot on
(parted) set 1 raid on
(parted) set 2 raid on
(parted) print
Model: ATA WDC WD15EARS-00M (scsi)
Disk /dev/sde: 1397GiB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number Start End Size Type File system Flags
1 0.00GiB 100GiB 100GiB primary boot, raid
2 100GiB 1395GiB 1295GiB primary raid

(parted) q

Next I’ll add the new partitions to the raid-1 arrays and let the arrays get back in sync with the new member. This will only use part of the new member’s partition, since the array doesn’t care that the partition is bigger (and can’t at the moment, even if it wanted to).

root@server:~# mdadm --add /dev/md1 /dev/sde1
mdadm: added /dev/sde1
root@server:~# mdadm --fail /dev/md1 /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md1
# This is a good moment to check if your mdadm monitoring works,
# you should receive an email now 😉
root@server:~# mdadm --remove /dev/md1 /dev/sdc1
mdadm: hot removed /dev/sdc1 from /dev/md1
# Note that above commands can be combined, but it's possible to
# receive a 'mdadm: hot remove failed for /dev/sdc1: Device or
# resource busy' error because mdadm tries to update the
# superblock of sdc1.

root@server:~# watch -n1 cat /proc/mdstat
# -- wait for resync to complete --

# Now do the same for other arrays this disk is in
root@server:~# mdadm --add /dev/md4 /dev/sde2
mdadm: added /dev/sde2
root@server:~# mdadm --fail /dev/md4 /dev/sdc2
mdadm: set /dev/sdc2 faulty in /dev/md4
root@server:~# mdadm --remove /dev/md4 /dev/sdc2
mdadm: hot removed /dev/sdc2 from /dev/md4

root@server:~# watch -n1 cat /proc/mdstat
# -- wait for resync to complete --

root@server:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sde1[0] sdb1[1]
48829440 blocks [2/2] [UU]

md4 : active raid1 sde2[0] sdb2[1]
927737600 blocks [2/2] [UU]

md2 : active raid1 sdd1[1] sda1[0]
1464846720 blocks [2/2] [UU]

unused devices: <none>

Note that first adding a new disk and then failing/removing a member is effectively the same as first failing/removing a member and then adding the new disk. When you add a third disk to a raid-1 array that expects only 2 disks, it’ll simply be marked as a hot spare that doesn’t do jack until one of the other members fails. Either way, while the new member is syncing your array is running from a single disk. If that disk happens to fail at that moment: kaboom, you’re in trouble.
In theory you can avoid that by increasing the number of raid devices with something like ‘mdadm /dev/md1 --grow --raid-devices=3’, but I haven’t tried. I have backups in case it goes wrong :-p

At this point you’ll either have to add another disk, or first remove one of the old disks. Mind you that sdb is still in the array, so I’ll get rid of that first and then remove both old disks at the same time.

root@server:~# mdadm /dev/md1 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md1
root@server:~# mdadm /dev/md1 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md1
root@server:~# mdadm /dev/md4 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md4
root@server:~# mdadm /dev/md4 --remove /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md4

Now it’s time to get those old 1TB disks out of my machine. Physically, because as far as Linux is concerned they’re already unused. The cool part is that we can do all of this while the machine stays up and running. Don’t you love SATA sometimes? 😉 Of course we’ll have to be careful not to accidentally pull the plug on the other drives 😉 It helps to have a decent case on your system, with drive bays made for hot swap if at all possible.

After attaching the new disk you should see it being recognized by Linux, something like this in dmesg:

[11851086.343396] ata5: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
[11851086.343400] ata5: irq_stat 0x00000040, connection status changed
[11851086.343403] ata5: SError: { CommWake DevExch }
[11851086.343410] ata5: hard resetting link
[11851087.063439] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[11851087.066759] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120320/psargs-359)
[11851087.066768] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT4._GTF] (Node ffff880215060be0), AE_NOT_FOUND (20120320/psparse-536)
[11851087.075738] ata5.00: ATA-8: WDC WD15EARS-00MVWB0, 51.0AB51, max UDMA/133
[11851087.075742] ata5.00: 2930277168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[11851087.079146] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120320/psargs-359)
[11851087.079154] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT4._GTF] (Node ffff880215060be0), AE_NOT_FOUND (20120320/psparse-536)
[11851087.079310] ata5.00: configured for UDMA/133
[11851087.079316] ata5: EH complete
[11851087.079424] scsi 4:0:0:0: Direct-Access ATA WDC WD15EARS-00M 51.0 PQ: 0 ANSI: 5
[11851087.079558] sd 4:0:0:0: Attached scsi generic sg1 type 0
[11851087.079564] sd 4:0:0:0: [sdf] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
[11851087.079692] sd 4:0:0:0: [sdf] Write Protect is off
[11851087.079697] sd 4:0:0:0: [sdf] Mode Sense: 00 3a 00 00
[11851087.079768] sd 4:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[11851087.092760] sdf: sdf1 sdf2
[11851087.093153] sd 4:0:0:0: [sdf] Attached SCSI disk
[11851087.301767] md: bind
[11851087.309624] md: bind

Don’t ask me about those ACPI errors, I haven’t seen those before. As you can see this disk already has cruft on it since it has been used in another server, so we’ll wipe that first and give it a partition table just like the other new disk. Once again we’ll use parted.

root@server:~# parted /dev/sdf
(parted) print
Model: ATA WDC WD15EARS-00M (scsi)
Disk /dev/sdf: 1500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number Start End Size Type File system Flags
1 32.3kB 40.0GB 40.0GB primary ext4 boot, raid
2 40.0GB 1500GB 1460GB primary raid

(parted) rm 2
Error: Partition(s) 2 on /dev/sdf have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You
should reboot now before making further changes.
Ignore/Cancel? i
(parted) rm 1
Error: Partition(s) 1, 2 on /dev/sdf have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You
should reboot now before making further changes.
Ignore/Cancel? i
(parted) q
Information: You may need to update /etc/fstab.

# We need to make it rescan the partitions so we don't get issues
# with the kernel. Parted normally does them automagically, but
# it failed somehow. Let's see why.
root@server:~# cat /proc/partitions | grep sdf
8 80 1465138584 sdf
8 81 39062016 sdf1
8 82 1426073985 sdf2
root@server:~# partprobe
Error: Partition(s) 1, 2 on /dev/sdf have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You
should reboot now before making further changes. Warning: Error fsyncing/closing /dev/md126: Input/output error
Warning: Error fsyncing/closing /dev/md127: Input/output error
# Oh swell, mdadm autobooted the existing arrays. Stop them first,
# then retry partprobe.
root@server:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md126 : inactive sdf2[1](S)
449345495 blocks super 1.2

md127 : inactive sdf1[1](S)
39061952 blocks
# And the rest of my arrays
root@server:~# mdadm --stop /dev/md127
mdadm: stopped /dev/md127
root@server:~# mdadm --stop /dev/md126
mdadm: stopped /dev/md126
root@server:~# cat /proc/partitions | grep sdf
8 80 1465138584 sdf
# Much better
root@server:~# parted /dev/sdf
(parted) unit GiB
(parted) mkpart primary 1MiB 100GiB
(parted) mkpart primary 100GiB 1395GiB
(parted) set 1 boot on
(parted) set 1 raid on
(parted) set 2 raid on
(parted) print
Model: ATA WDC WD15EARS-00M (scsi)
Disk /dev/sdf: 1397GiB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number Start End Size Type File system Flags
1 0.00GiB 100GiB 100GiB primary boot, raid
2 100GiB 1395GiB 1295GiB primary raid

(parted) q
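The stale-partition problem from a moment ago can be spotted with a small filter over `/proc/partitions`. This is only a sketch: `kernel_parts` is a made-up helper name, it reads the table from stdin, and it assumes plain `sdX` style names where partitions are the disk name plus a number. The sample input is the stale state shown earlier in this post.

```sh
#!/bin/sh
# List the partitions the kernel currently knows for one disk.
# Reads /proc/partitions-format text on stdin; $1 = disk name (e.g. sdf).
# kernel_parts is a hypothetical helper name.
kernel_parts() {
    awk -v disk="$1" '$4 ~ ("^" disk "[0-9]+$") { print $4 }'
}

# Sample input: the stale table from before the leftover arrays were stopped.
kernel_parts sdf <<'EOF'
   8       80 1465138584 sdf
   8       81   39062016 sdf1
   8       82 1426073985 sdf2
EOF
# On the live system: kernel_parts sdf < /proc/partitions
```

If this still lists partitions right after you removed them all, something (here: auto-assembled md arrays) is holding the disk open.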

Parted already ran partprobe for us, so in /proc/partitions you should see your brand new partitions. So now we can add the new partitions to our arrays and wait another day for the resync. Yay, fun!

root@server:~# mdadm /dev/md1 --add /dev/sdf1
mdadm: added /dev/sdf1
root@server:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sdf1[2] sde1[0]
48829440 blocks [2/1] [U_]
[>....................] recovery = 0.2% (127552/48829440) finish=6.3min speed=127552K/sec

# After about 10 minutes:
root@server:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sdf1[1] sde1[0]
48829440 blocks [2/2] [UU]

md4 : active raid1 sde2[0]
927737600 blocks [2/1] [U_]

md2 : active raid1 sdd1[1] sda1[0]
1464846720 blocks [2/2] [UU]

unused devices: <none>

# Next array 🙂
root@server:~# mdadm /dev/md4 --add /dev/sdf2
mdadm: added /dev/sdf2
root@server:~# watch -n1 cat /proc/mdstat
# Watch the progress while the array syncs again...
# taking a day or so 😉
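If you’d rather not stare at watch for a day, the waiting can be scripted. A minimal sketch, assuming a rebuild shows up in /proc/mdstat as a line containing “recovery =” or “resync =” like in the output above; the function names `md_syncing` and `wait_for_sync` are made up for this example.

```sh
#!/bin/sh
# Succeeds (exit 0) while the given mdstat-format file shows a rebuild.
# Defaults to the real /proc/mdstat. Note: an unreadable file counts as
# "not syncing", so only trust this on a box that actually has md arrays.
md_syncing() {
    grep -Eq '(recovery|resync) *=' "${1:-/proc/mdstat}"
}

# Poll once a minute until no array is rebuilding any more.
wait_for_sync() {
    while md_syncing "$@"; do
        sleep 60
    done
    echo "no rebuild in progress"
}
```

Calling `wait_for_sync` before touching the next array or pulling a disk keeps you from degrading an array that is still rebuilding.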

Now it’s time for the actual resizing. Up until now all we’ve done is move the arrays onto the new disks at their old size. To actually make the arrays use their extra space we’ll need to grow the raid arrays and make them take up the entire partition. Warning: If you’re using internal bitmaps on your array, this would be a good time to remove them (something like ‘mdadm --grow /dev/md1 --bitmap=none’). Leaving them there might corrupt your data! I don’t have those bitchmaps in use, so I don’t need to remove them 😉

root@server:~# mdadm /dev/md1 --grow --size=max
mdadm: component size of /dev/md1 has been set to 104856512K

Once again a resync will occur… always the same story huh 😉
After finishing another cup of coffee the array is resized (see the new block count in /proc/mdstat) and we can continue to resize the filesystem that’s on top of it. Here’s mdstat after the resize (with 104856512 blocks instead of 48829440!):

root@server:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sdf1[1] sde1[0]
104856512 blocks [2/2] [UU]
# snip

The ability to do an online resize — yes, resizing a mounted read/write filesystem with open files etc — is present these days, but not for all filesystems. Fortunately I use ext4 which does have support for online resizing (at least since linux 2.6), so here goes:

root@server:~# resize2fs /dev/md1
resize2fs 1.42.6 (21-Sep-2012)
Filesystem at /dev/md1 is mounted on /; on-line resizing required
old_desc_blocks = 3, new_desc_blocks = 7
Performing an on-line resize of /dev/md1 to 26214128 (4k) blocks.
# Here it'll take a while 🙂
The filesystem on /dev/md1 is now 26214128 blocks long.
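The numbers in those two outputs can be sanity-checked with a bit of shell arithmetic. This sketch assumes md1 uses the old 0.90 superblock format, which sits in the last 64KiB-aligned 64KiB chunk of the partition; that is my reading of the mdstat output (no “super 1.x” shown for md1), not something mdadm printed. The function name `md090_size_kib` is made up.

```sh
#!/bin/sh
# Usable component size in KiB for a partition of $1 KiB, ASSUMING the old
# 0.90 md superblock: round down to a 64 KiB boundary, then reserve 64 KiB.
# md090_size_kib is a hypothetical helper name.
md090_size_kib() {
    echo $(( $1 / 64 * 64 - 64 ))
}

part_kib=$(( 100 * 1024 * 1024 - 1024 ))    # partition from 1MiB to 100GiB, in KiB
md_kib=$(md090_size_kib "$part_kib")
echo "md component size: ${md_kib}K"         # compare with the mdadm --grow output
echo "ext4 4k blocks:    $(( md_kib / 4 ))"  # compare with the resize2fs output
```

With a 1MiB to 100GiB partition this works out to 104856512K and 26214128 blocks, the same figures mdadm and resize2fs reported.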

Repeat for the other raid array and filesystem and we’re done with all our data still there 🙂

Cool huh, we just removed the disks our operating system was booted from, and everything is still up and running as if nothing happened… well, except that the drives got bigger 😉



