Hardware

RAID repair successful

Posted in Geek, Hacks, Hardware on December 21st, 2008 by leodirac

For everybody who has been waiting with bated breath to hear whether the repair of the RAID array worked: it did. It took several days, but since we were away on vacation seeing my dad receive the Fleming Medal from the American Geophysical Union, the waiting was pretty easy.

To convince myself that the repair was successful, I unplugged one of the previously functional drives, and saw that all my files were still there when the array was running just on the new drive and the other previous drive. I recommend this to anybody who thinks they’re running a RAID system — until you’ve seen the RAID array work with a drive removed, how can you be sure it’s really working? If your system is set up better than mine is, you’ll get some kind of warning message too.
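One way to get that kind of warning, sketched here under the assumption that the array is a Linux software (MD) array whose status shows up in /proc/mdstat. The function name and the sample text are made up for illustration; the sample is not output from my machine. In the status map, "U" means a member is up and "_" means one is missing.

```python
import re

# Illustrative /proc/mdstat contents for a 3-disk RAID 5 with one member missing.
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdc1[2] sdb1[1]
      943718400 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]

unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return {array_name: member_map} for every array with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)  # e.g. "md1"
            continue
        # Status looks like "[3/2] [_UU]": 3 members expected, 2 active.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active, members = status.groups()
            if int(active) < int(expected) or "_" in members:
                degraded[current] = members
    return degraded

print(degraded_arrays(SAMPLE_MDSTAT))
```

On a real box you would feed it open("/proc/mdstat").read() from a cron job and send yourself mail whenever the result is non-empty.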

Repairing a degraded EVMS RAID 5 array

Posted in Geek, Hacks, Hardware on December 14th, 2008 by leodirac

A while back, lightning scrambled one of the disks in my home RAID 5 array.  I figured out how to recover it.  And I got the critical data off.   Here I describe the steps I took to add a new drive and get it working with the old RAID array.  I share this with the net in hopes it will make it easier for somebody else who has to go through this process themselves, and selfishly as notes for me to refer to.  It’s a testament to the power of EVMS and a warning to anybody who thinks it might be fun to run their own open-source software RAID server at home. 

My advice for people seeking reliable storage: go with a hosted solution.  Understanding the arcane nuances of these software systems is an extremely specific skill that doesn’t translate well to many real-life necessities.  If you’re smart, you can figure it out, but it doesn’t teach you much of anything except how to do exactly that.  Each person who understands this stuff should be keeping petabytes of data happy, rather than one couple’s pictures and music collections.  I hear Microsoft’s "home server" actually makes this pretty easy, but I can’t recommend anybody willingly lock themselves into Microsoft’s business model.

Background

So I bought a new drive, following my own advice about picking drives from different manufacturers when building a RAID array, plugged it into the mobo and booted the machine.  After futzing with /etc/fstab to get it to find the boot disk and load up, I logged into evms and got these messages:

MDRaid5RegMgr: RAID5 array md/md1 is missing the member  with RAID index 0.  The array is running in degrade mode.

and

MDRaid5RegMgr: Region md/md1 is currently in degraded mode.  To bring it back to normal state, add 1 new spare device to replace the faulty or missing device.

Conceptually easy.  I’ve got a new 500 GB drive in the system.  Linux sees it.  It didn’t take me too long to figure out it’s called /dev/sda, while the previous 2 disks in the array are sdb and sdc, with a small boot drive at sdd.  Now the fun part is figuring out EVMS terminology enough to tell it to use the new disk.

The hierarchy of the array in EVMS land seems to be as follows:

  • Logical Volume teraraid (contains)
  • Region md/md1 (which contains)
  • Segments sdb1 and sdc1 (which are built on)
  • Logical disks sdb, sdc.

What I tried, and what seems to have worked

I see that logical disk sda has no segments.  So I try Action -> Create -> Segment.  It only gives me one choice for "Segment Manager" which is "GPT Segment Manager."  But when I choose it, it doesn’t let me make a segment on sda.  Only the tiny free space on sdb and sdc.  So sda needs something else done to it before we can use it.  What?

sda also shows up in the list of Logical Volumes, next to Teraraid and the formatted boot partition.  Hmmm.

Well I tried converting it to an EVMS Volume.  It complained that sda does not have a File System Interface Module (FSIM) associated with it, but it made the new logical volume anyway.  This wasn’t getting me anywhere.  So I erased it.

Next I tried "Add" -> "Segment Manager to Storage Object".  I noticed that all of the Disk Segments associated with the array were listed as using "Plug-in" "GptSegM" and this gave me the choice of adding Gpt Segment Manager to sda.  W00t.  I said "No" to making this a system disk.  This seems to be working.  Now I see a bunch of Disk Segments starting with sda, including a big one (465 GB) labelled sda_freespace1.

Now when I tried to Create -> Segment, it let me use GPT Segment Manager on sda_freespace1 and allocate a 450 GB disk segment to match the others.  (I left 15 GB off each disk with the idea I could put a boot segment in that space, but I’ve never gotten around to it.)

Now in "Available Objects" there is sda1 with 450.0 ready for me.  Alrighty we’re getting there.

Now I look at "Storage Regions" and in the context menu for md/md1 I see an option that says "Add spare to fix degraded array…"  I didn’t see it there before — it might have not shown up when there weren’t any spares, or maybe I was just being thick.  In any case, selecting it now gives me a menu with one choice — sda1. 

Now in details of md/md1 it shows:


┌──────────────────── Detailed Information - md/md1 ─────────────────────┐
│                                                                        │
│     Name                    Value                                      │
│ ────────────────────────────────────────────────────────────────────── │
│     Major Number            9                                          │
│     Minor Number            1                                          │
│     Name                    md/md1                                     │
│     State                   Discovered, Degraded, Active               │
│     Personality             RAID5                                      │
│ +   Working SuperBlock                                                 │
│     Number of disks         3                                          │
│ +   Disk 1                  sdb1                                       │
│ +   Disk 2                  sdc1                                       │
│     Number of stale disks   1                                          │
│ +   Stale disk 0            sda1                                       │
│                                                                        │
│    Use spacebar on fields marked with "+" to view more information     │
│                                                                        │
│ [Help]                                                          [OK]   │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

That last line about the Stale disk is new.

Actions -> Save commits these changes to disk.  Now looking at Detailed information for md/md1 shows


┌──────────────────── Detailed Information - md/md1 ─────────────────────┐
│                                                                        │
│     Name                    Value                                      │
│ ────────────────────────────────────────────────────────────────────── │
│     Major Number            9                                          │
│     Minor Number            1                                          │
│     Name                    md/md1                                     │
│     State                   Discovered, Degraded, Active, Syncing =  0 │
│     Personality             RAID5                                      │
│ +   Working SuperBlock                                                 │
│     Number of disks         3                                          │
│ +   Disk 1                  sdb1                                       │
│ +   Disk 2                  sdc1                                       │
│     Number of stale disks   1                                          │
│ +   Stale disk 0            sda1                                       │
│                                                                        │
│    Use spacebar on fields marked with "+" to view more information     │
│                                                                        │
│ [Help]                                                          [OK]   │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Emotionally I feel like I should be done now.  But I don’t hear the thrashing noise of a half-terabyte of checksums being unwound and copied onto a fresh disk.  And it says "Syncing = 0".  Hmmm.
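For the curious, here is roughly what that unwinding amounts to. The "checksums" in RAID 5 are XOR parity: for each stripe, the parity block is the XOR of the data blocks, so any one missing block can be recomputed from the ones that survive. This is a toy sketch of a single stripe on a three-disk array, with a made-up helper name and made-up block contents. A real array rotates the parity block across the disks and works on much larger chunks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: two data blocks plus their parity, each on a different disk.
data_a = b"pictures"  # pretend this lived on the disk that died
data_b = b"music!!!"
parity = xor_blocks([data_a, data_b])

# Rebuild the lost block from the two surviving ones.
rebuilt_a = xor_blocks([data_b, parity])
print(rebuilt_a == data_a)
```

Lose two of the three blocks and there is nothing left to XOR against, which is why a second failure during the rebuild is the thing to worry about.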

I quit evmsn and reload it to see two new messages.  One familiar:

MDRaid5RegMgr: Region md/md1 is currently in degraded mode.  To bring it back to normal state, add 1 new spare device to replace the faulty or missing device.

And one novel:

MDRaid5RegMgr: RAID5 array md/md1 is missing the member  with RAID index 0.  The array is running in degrade mode.  The MD recovery process is running, please wait…

But this novel message saying it’s recovering is "Number 0" implying that it came before the other message (Number 1) which tells me I need to take action for it to fix itself.  And the drives are not thrashing.  Again I look at the details for md/md1 and now I see:


┌──────────────────── Detailed Information - md/md1 ─────────────────────┐
│                                                                        │
│     Name                 Value                                         │
│ ────────────────────────────────────────────────────────────────────── │
│     Major Number         9                                             │
│     Minor Number         1                                             │
│     Name                 md/md1                                        │
│     State                Discovered, Degraded, Active, Syncing =  0.3% │
│     Personality          RAID5                                         │
│ +   Working SuperBlock                                                 │
│     Number of disks      3                                             │
│ +   Disk 1               sdb1                                          │
│ +   Disk 2               sdc1                                          │
│ +   Disk 3               sda1                                          │
│                                                                        │
│    Use spacebar on fields marked with "+" to view more information     │
│                                                                        │
│ [Help]                                                          [OK]   │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Which really seems to say it’s doing its thing.  Maybe I don’t hear the disks because it’s formatting the disk first, which is a linear process.  Or maybe the whole copy process is very linear and I won’t hear it thrashing.  Its progress implies it’s going to take a couple/few days to finish, which is what I’d expect.  So maybe it’s working. I’ll let it run for a while and see what happens to the array if I try to unplug one of the previously working drives.
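The back-of-envelope estimate is just a linear extrapolation from two readings of the "Syncing" percentage. A minimal sketch, assuming the rebuild keeps a steady rate (it may not); the function name and the second reading are hypothetical, since I only wrote down the 0.3% figure.

```python
def estimate_remaining_hours(pct_then, pct_now, hours_between):
    """Extrapolate hours left from two progress readings, assuming a constant rate."""
    rate = (pct_now - pct_then) / hours_between  # percent per hour
    if rate <= 0:
        raise ValueError("no progress between the two readings")
    return (100.0 - pct_now) / rate

# Hypothetical readings: 0.3% at first look, 1.8% an hour later.
hours_left = estimate_remaining_hours(0.3, 1.8, 1.0)
print(f"about {hours_left / 24:.1f} days to go")
```

With those made-up numbers the rate is 1.5% per hour, which works out to roughly 65 hours, in the same couple/few days ballpark.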

Pretty cool that I didn’t even need to unmount the array to do this.

Now if I could just figure out why my laser printer periodically decides it needs to print its internal test page, I’d be even happier.

Greening up the Home Office

Posted in Cloud Computing, Electronic Security, Hardware, Sustainability, Tech Industry on April 22nd, 2008 by leodirac

It was pretty late at night at my friend Miller’s birthday party last week.  She had asked everybody to do something good for the world in lieu of birthday presents.  The awake were discussing options as I was dozing off.  I overheard somebody say "If you’ve got an old Linux box that you’re using as a firewall drawing 400 watts continuously, consider spending $30 on a dedicated router."  I thought about the headless Pentium 3 box in my office closet which is running the IPCop Linux firewall distro.  I thought about the four matching ethernet cards I’d put in it and the rainbow of color-coded cat-5 coming off it: red for untrusted outside world, green for safe, orange for servers and blue for wifi.  I thought about all the time I’d spent configuring the thing perfectly and routing cables throughout the house and I thought, yeah it draws a lot of power, but I NEED all that.

When I sobered up the next afternoon it occurred to me that I’d pulled my file server off the orange DMZ network for performance and simplicity, and that the other server box had long since been virtualized into the file server.  I moved my local public wifi off the blue network onto the red to make its security brain-dead simple.  So despite all the pretty color-coded cables and corresponding hubs, all I really had was a big loud NAT box with a few key port holes in it.  And since I’ve switched from Outlook to Gmail, I never even RAS into my home XP boxes any more.  And since I do all my personal development on EC2 or some other host, I never use my home dev servers any more.  So in fact, I don’t need to tunnel home for anything.  Cloud computing.  For real.  All this stuff I used to need I don’t any more.  I could replace that old Linux box with a cheap low-power firewall.

But that got me thinking.  There’s this li’l XP box sitting next to the printer that I have configured never to go to sleep because otherwise I can’t print from my laptops.  Print servers are similarly small and low-power and sometimes come in the same box as the firewall.  Then my eye turned to the terabyte file server in the corner and next thing you know I’ve got an Apple Time Capsule in the mail to replace all three permanently powered-on PCs in my house.

Happy BEarthday, Miller!

Heavy laptops: there’s no excuse

Posted in Gadgets, Hardware on October 30th, 2007 by leodirac

The way I see it, there’s no compelling reason to buy a heavy laptop.  Light laptops are great because they’re portable.  Their processors might be a little slower, but local processing power rarely limits what you can do with a computer these days.  And unless you get a really tiny laptop they’re hardly slower.  If you do get a tiny one then you’re trading reduced HCI-bandwidth for increased access to that bandwidth, which is often worthwhile.  Today I’d probably argue that the iPhone or iPod Touch is a Pareto-optimal choice (sweet-spot) in this trade-off, beating out things like OQO and FlipStart.

But think about the longevity of these devices.  Computers always slow down.  In a few years, any laptop is going to feel really slow, no matter how fast it feels today.  But if it’s a light, small laptop, then you’ll have something which is slow, but at least nice and portable.  Some friends of mine have this ancient Pentium II Vaio laptop kicking around their living room.  It barely runs a browser, but it’s so small and portable that it’s still a reasonable computing device today.  If your laptop is heavy to start with, then in a few years when it slows down you’re stuck with a heavy, slow laptop, which nobody wants.

A Small Tip when Setting up a RAID array

Posted in Hardware on January 7th, 2007 by leodirac

If you’re building a RAID array for your home, or somewhere else that isn’t super-industrial-strength enterprise, here’s a tip.  Get each hard drive as a different brand.  That way it’s way easier to tell them apart.  If your drives are identical save for a serial number, and one of them crashes, the RAID controller will tell you the serial number of the crashed drive.  Then you need to figure out which physical drive to pull and replace based on that, and you’ll probably end up unplugging them one at a time until you find the one the system doesn’t miss.  But if they’re different brands of drive, then the BIOS or whatever is running your RAID system will tell you exactly which one is dead, by name.
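If the box runs Linux with udev (an assumption, and not something this tip depends on), the symlinks under /dev/disk/by-id carry each drive's model and serial number in their names, which gives you a brand-to-device lookup without pulling cables. A rough sketch; the helper names are mine and the sample entries use made-up serial numbers.

```python
import os
import re

def parse_by_id(entries):
    """Group by-id link names by the device they point at.

    entries: iterable of (link_name, link_target) pairs,
    e.g. ("ata-SomeModel_SERIAL", "../../sda").
    """
    devices = {}
    for name, target in entries:
        if re.search(r"-part\d+$", name):  # skip per-partition links
            continue
        device = os.path.basename(target)  # "../../sda" -> "sda"
        devices.setdefault(device, []).append(name)
    return devices

def read_by_id(path="/dev/disk/by-id"):
    """Read the real symlinks; only works where that directory exists."""
    names = sorted(os.listdir(path))
    return parse_by_id((n, os.readlink(os.path.join(path, n))) for n in names)

# Made-up entries showing why different brands are easy to tell apart.
SAMPLE = [
    ("ata-BrandA_Model500_EXAMPLE0001", "../../sda"),
    ("ata-BrandB_Model500_EXAMPLE0002", "../../sdb"),
    ("ata-BrandB_Model500_EXAMPLE0002-part1", "../../sdb1"),
]
print(parse_by_id(SAMPLE))
```

You still have to match the name to a physical drive by reading its label, but with different brands that takes a glance rather than a magnifying glass.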

Another good reason is that whole batches of drives sometimes get manufacturing defects.  So if you get 3 or 4 drives from the same batch with sequential serial numbers, they might all have the same defect.  So the odds of 2 drives crashing at once are much lower if they came from different factories.  Most RAID schemes can survive a single drive failure, but few can survive multiple.

That’s all.