Monday, September 16, 2013

My Ubuntu Upgrade Adventures with Raid

I solved the problem, but I never expected the help to come from Wikipedia.

I run a server for our music, recorded TV, photos and the like. Back in 2007, I set up software raid by reading the software raid how-to. I didn't know what I was doing -- still don't, actually -- but back then I was even more clueless. I put Ubuntu 7.04 on the first drive, with separate partitions for boot, usr and var. Then I created a raid 5 array from the remaining drives, basically just following the directions in the how-to. I did set it up to email me if a drive failed (which a few did over the years), and every once in a while I'd check on it with mdadm --detail. It worked, and it kept humming along without too much trouble. Every so often I'd back it up, and I was careful never to put any system-critical stuff on it. The /dev/md0 device was mounted as /big, and /big was where you stuck your "big" stuff, like your entire collection of Futurama videos.

About two terabytes later, I upgraded to Ubuntu 8.04 and had to assemble the array by hand, but by Ubuntu 9 it was automatically recognizing the array, and the upgrade process simply took care of things. Guys, this is the dangerous part! When some nice helpful feature just takes care of things and then suddenly fails, you get to figure out why, and that's harder if it's been coddling you all along!

Last weekend, I decided to upgrade to 12.04, which I did following the release notes. At first the machine simply froze and did not appear to boot at all. It turned out that the "quiet" option, helpfully added to the kernel's boot command line, kept an important message from appearing: an error from mdadm about the array not working. I was able to boot with GRML and fix that. But now the server booted, gave the error, and went into single-user mode. You could type exit and get it to boot fully, but to get it booting without any interaction, I removed /dev/md0 from the fstab.
I had to fix a few more upgrade glitches, but the system was back up and running. All except for the raid array; it had completely disappeared. I googled around and trawled through thousands of forum posts. I read about people who foolishly installed their entire system on a raid array and was glad I had the sense not to do that. I fooled with udev and UUIDs and mdadm.conf and a lot of other twisty passages which did not turn in the direction of success! I tried lots of things, and it would be a waste to detail them all here.

The problem was that dmesg would randomly show that the kernel had found some, but not all, of the drives:

    [ 4.214684] sd 3:0:0:0: [sdc] 976773168 512-byte logical blocks: (500 GB/465 GiB)
    [ 4.214908] sd 3:0:0:0: [sdc] Write Protect is off
    [ 4.214915] sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
    [ 4.214950] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [ 4.218076] sdc: sdc1
    [ 4.222451] sd 3:0:0:0: [sdc] Attached SCSI disk
    [ 4.709196] md: bind
    [ 7.537396] md/raid:md0: device sdc1 operational as raid disk 1
    [ 7.538315] disk 1, o:1, dev:sdc1

That is what's supposed to happen, but most of my other drives showed up as unpartitioned. And running mdadm --detail /dev/md0 gave output like this:

    State : active, FAILED, Not Started
    Active Devices : 1
    Working Devices : 1
    Failed Devices : 0
    Spare Devices : 0

Sometimes two of the drives would come up, but that was it. Fdisk, however, showed each drive with a single raid partition:

       Device Boot      Start         End      Blocks   Id  System
    /dev/sdb1              63   976768064   488384001   fd  Linux raid autodetect

I had resigned myself to booting with GRML and backing everything up, but alas, GRML couldn't see the array either. What was weird was that GRML saw *DIFFERENT* drives, but still not all of them! Had I lost my husband's beloved South Park collection, I wondered? In desperation I read the Wikipedia article on mdadm, which is where my answer lay.
> A common error when creating RAID devices is that the dmraid-driver has taken control of all the devices that are to be used in the new RAID device. Error-messages like this will occur:
>
> mdadm: Cannot open /dev/sdb1: Device or resource busy
>
> Typically, the solution to this problem involves adding the "nodmraid" kernel parameter to the boot loader config. Another way this error can present itself is if the device mapper has its way with the drives. Issue 'dmsetup table' see if the drive in question is listed. 'dmsetup remove ' will remove the drive from device mapper and the "Device or resource busy" error will go away as well.

This is what programmers call a "race" condition, where two processes "race" to see which one gets there first. In this case, the dmraid driver was racing to claim the drives while mdadm was trying to assemble the array from those same drives, which would explain why a different handful of drives turned up on each boot. I have such a love-hate relationship with Linux -- we really should see a marriage counselor!

After more futzing I figured out which file to edit for the new grub (hint: it's not menu.lst), and I was able to add nodmraid to the kernel command line.

At last, I finally feel like a real system administrator. I still don't understand device mapper, and I wish I had a better grasp of UUIDs, udev, chroot and all the newest Linux bafflegab. Sometimes it seems like most of the documentation, and most of what I thought I used to know, has been deprecated.
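For anyone on the same hunt: on Ubuntu releases that use GRUB 2, the file to edit is normally /etc/default/grub, and the kernel options usually sit on the GRUB_CMDLINE_LINUX_DEFAULT line. Both names are assumptions about a stock install, so check your own system first. Here is a minimal Python sketch of the edit. The helper add_kernel_param is made up for illustration, and it only transforms text you hand it, so nothing on disk gets touched.

```python
import re


def add_kernel_param(grub_text, param="nodmraid",
                     var="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with param appended to the double-quoted value of var.

    Hypothetical helper: assumes the variable is written as VAR="..." on a
    single line. The text is returned unchanged if the variable is missing
    or the parameter is already there.
    """
    pattern = re.compile(r'^(%s=")([^"]*)(")' % re.escape(var), re.MULTILINE)

    def _append(match):
        # Split the existing options on whitespace and add ours only once.
        words = match.group(2).split()
        if param not in words:
            words.append(param)
        return match.group(1) + " ".join(words) + match.group(3)

    return pattern.sub(_append, grub_text)


if __name__ == "__main__":
    # Sample contents, not read from a real /etc/default/grub.
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(sample))
```

After saving the real file, running sudo update-grub regenerates the boot menu so the new parameter takes effect on the next boot.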
