Geek

Learning to do Math in your Head

Posted in Geek, Personal Growth on January 23rd, 2010 by leodirac – 1 Comment

I recently picked up a book called Secrets of Mental Math written by one of my college math professors.  It has very practical advice on how to learn to multiply large numbers in your head, and on the necessary supporting skills like addition, subtraction, and related mathematical trivia.  To practice multiplying numbers in your head, I’ve created a fast, simple JavaScript tool which you can access from your phone at http://leodirac.com/mathquiz .

The author of the book is Arthur Benjamin.  He gave a demonstration of his mad skillz at TED a while back, which I’m embedding here because it’s awesome.

Migrating this blog has been fun because it’s forced me to look over a lot of the old content I’ve written.  A couple years ago I found Benjamin’s TED talk, which has inspired all this craziness.  I think it’s good to keep the brain fresh by taxing skills that one might not have used in a while.

Dinocams – The legacy of SLR cameras in the 21st century

Posted in Gadgets, Geek, Travel on March 1st, 2009 by leodirac – 6 Comments

DSLR cameras make very little sense today.  Modern imaging technology is rapidly turning them into dinosaurs.  The forces keeping them alive are a combination of a physical legacy in hunks of glass, and aspirational marketing.  I’ll explain, but first, what’s a DSLR and why don’t they make sense?

Background on SLRs and DSLRs

(If you know what “f-stop” means, feel free to skip ahead to the next section.)

SLR stands for Single Lens Reflex.  Practically speaking it refers to a camera where you can change the lens.  You look through the same lens that actually takes the picture, letting you put any lens from an ultra-wide angle fisheye to a telescope-length zoom lens.  You can also put filters on the front like star filters or color shifters or polarizers.  Imagine a classic 35mm camera — like what a P.I. would carry to snap pictures of your wife having an affair — that’s an SLR.

SLRs require a mirror that physically moves to divert the light into one of two places: your eye, or the film / CCD.  The mirror was important when the only technology for capturing images was chemical film.  But nowadays we have various electronic devices like CCDs that digitize an image.  DSLR cameras use a CCD to get many of the benefits of digital imaging, but still have the same physical form factor as an old chemical-film SLR.  They can use the old lenses, which is one of their big appeals.  But so many things about these cameras just don’t make sense.

The problems with DSLR cameras

First there’s the noise.  The sound of the mirror slapping against its stops as it switches positions is very recognizable.  We used to accept sounds like that as a necessary part of taking pictures.  Today it just annoys me.  Especially when I’m at a small event and some photographer is there making loud clicking noises all the time while I’m trying to enjoy whatever it is they’re digitizing with their dinocam.  In 99% of all use cases, it’s totally unnecessary.  CCDs can continuously capture images and display them on a screen, creating a digital light path that doesn’t require loud expensive mechanical assemblies.  These displays aren’t as good as what a human eye can pick out, so this doesn’t work all the time.  But if you don’t need interchangeable lenses, then the camera can have a second optical path just for the eye, which doesn’t need to be as good.

One argument against a separate optical viewfinder is that you can’t put filters in front of the lens.  This is very true, but filters are also obsolete.  With few exceptions, everything that a physical filter does can be done later in Photoshop with more control and accuracy.  Color tinting, sparkle, gradients, soft, mist, etc: these all used to be rendered in physical glass out of necessity.  Polarizing filters are probably the most important exception to this.  Since CCDs don’t record a light’s polarization state, it can’t be adjusted later.  But for the most part, filters aren’t necessary anymore, meaning you don’t need the whole single-lens thing.

But what about interchangeable lenses?  Isn’t it useful to have the same camera body and be able to change lenses?  (I hear you cry.)  Yes, sorta.  There are definitely situations where one lens won’t be able to do everything you want.  But those situations are getting rarer and rarer.  And in the few exception cases, I’ll argue that interchangeable lenses aren’t the right solution.  The reason these cases are getting less and less common is that zoom lenses are getting better.  When SLR cameras first came on the scene zoom lenses basically didn’t exist because they sucked when they did.  You needed a different lens for each amount of magnification you wanted, so people had lots of lenses.  But with computers to help us design the lenses, and vastly improved manufacturing processes, zoom lenses are getting better all the time.  Nowadays a lens with a huge 10x zoom can even win accolades from camera snobs.  And lenses as versatile as 26x cover every situation most of us would ever want, and at a quality we’ll be thrilled with.  So for almost all situations, a single zoom lens is good enough today.

What about the situations where that’s not quite good enough?  Where you need that 14mm fisheye that captures people standing immediately to the left or right side of the lens?  Or that 8000mm super-long telephoto telescope?  It turns out that in either of these challenging cases, getting the lens to fit the standard SLR form factor becomes the hardest part.

Why SLRs cripple even the extreme lens cases

With ultra-wide fisheye lenses, the problem is the space reserved for that stupid mirror.  In this case, the focal length is very short, so as a lens designer, you’d naturally want the focal plane to be very close to the glass.  (Like about 14mm.)  But the place where the lens attaches to the camera body necessarily needs to be a certain distance away from the imaging plane.  That distance was determined by the size of the mirror, which was determined by the size of your chemical film: 35mm, which is more than you’d really want for a 14mm lens.  Even on today’s 2009 DSLR cameras, that distance is exactly the same as it was a generation ago in order to ensure backwards compatibility with old lenses.  The literal tons of carefully polished glass represent a very real barrier to improvement since people have invested lots of money in them.

So if you really want a camera that’s good at taking super-wide angle pictures, you don’t want your lens to have to be that far away from the imaging plane.  You’re better off with a specially built camera.  The lens will be simpler, cheaper and higher quality.  But super-wide starts to look funny, no matter what.  Funny meaning distorted, because if your eye is more than a couple of inches away from the reproduced super-wide image, then it won’t look right.  And it’s not super useful to capture 360 degrees in one shot; you can shoot a dozen pictures and stitch them together later in software, and it’ll look more natural.  This is all why people don’t pay a lot of attention to how super-wide lenses get anymore.

On the super-telephoto side of things, the SLR legacy is even worse.  To get a super-long telephoto lens you need lots of big glass.  This gets expensive quickly simply because it’s a large mass of carefully manufactured stuff.  The amount of glass you need for a lens is proportional to the cube of the length of your imaging plane, which for legacy chemical film is 35mm.  But CCDs just don’t need to be that big.  On almost every DSLR they’re only about 20mm across, and on high-quality non-SLR cameras they are typically about 6mm across.  So that size legacy means you would need almost 40x the amount of physical glass to make a good telephoto lens for a DSLR vs a non-SLR camera (and closer to 200x for a full 35mm frame).  This ridiculous discrepancy is just going to get worse.
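As a rough check on that arithmetic, here is a small sketch.  It assumes, as the paragraph above does, that the glass needed scales with the cube of the imaging plane's width, and it uses the approximate widths quoted in this post.  The function name is made up for illustration.

```python
def glass_ratio(big_mm, small_mm):
    """Relative amount of lens glass, assuming it scales as width cubed."""
    return (big_mm / small_mm) ** 3

# Approximate imaging-plane widths quoted in the post.
film_vs_compact = glass_ratio(35, 6)   # 35mm film vs ~6mm compact sensor
dslr_vs_compact = glass_ratio(20, 6)   # ~20mm DSLR sensor vs ~6mm compact sensor

print(f"35mm film vs 6mm sensor: about {film_vs_compact:.0f}x")
print(f"20mm DSLR vs 6mm sensor: about {dslr_vs_compact:.0f}x")
```

The cube law is a rule of thumb rather than a lens-design formula, so treat the output as an order-of-magnitude estimate.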

CCDs are silicon devices, so they share manufacturing improvements with CPUs and follow a Moore’s law-like improvement curve for performance.  A key way they improve is in pixel density, but also by simply getting smaller.  As they get smaller, high-quality zoom lenses get smaller and cheaper too.  But only if the lenses are specifically designed for the new smaller CCDs.  With an SLR system they can’t be: the size must be fixed in order to maintain backwards compatibility.  So while sensor technology improves at Moore’s law speed, lenses for non-SLR cameras improve as well, but SLR lenses do not.  Zoom lenses for modern cameras just don’t need to be that big or expensive.  It’s like having to build a cell phone big enough to hold floppy disks.

To illustrate this point, consider the popular Canon SX10IS camera, which does not feature interchangeable lenses.  It features a zoom lens that goes from pretty wide (28mm equivalent) to really very far zoom (560mm equivalent).  Because its CCD is only 6mm across, it can do all this for under $400 and weigh in under a pound for the whole camera.  For comparison, a comparable SLR lens weighs in at over 11 lbs and costs upwards of $7,000, just for the lens.  No doubt this lens can take better pictures than the tiny Canon, but a smaller lens built for a modern CCD could take pictures that are every bit as good for a fraction of the price.

I would be remiss if I didn’t mention the noise floor on these sensors.  When the scene is dark, you need more light to get a good image.  A bigger hunk of glass captures more light.  This all makes intuitive sense and is mostly accurate.  CCD sensors can take more accurate pictures in low light when they are bigger.  But the limit here is electronic noise, which is also improving.  At some point we’ll hit some other barrier like the thermal noise in the sensor, although a Peltier (thermoelectric) cooler could work around that.  Ultimately there’s the quantization of photons, but if you’re taking pictures in a scene that dark, you probably can’t see what you’re pointing at anyway.  My point is that while there are advantages in low light for larger glass and sensors, technology is eroding those too.  We’re seeing ISO equivalents of 6400 as fairly common in cameras these days, with an economic competitive pressure to improve that.

In summary, the problem with the SLR format is that it ties its owner to a physical legacy that denies them the advantages of advancing technology.  There are cases where specialized lenses are still important.  But those cases are dwindling.  Personally, I’m going to be happier carrying around a full-featured small camera that can transform itself into whatever I want without needing interchangeable parts than a bag full of bits that were standardized before email.

RAID repair successful

Posted in Geek, Hacks, Hardware on December 21st, 2008 by leodirac – Be the first to comment

For everybody who has been waiting with bated breath to hear whether the repair of the RAID array worked, it did.  It took several days, but since we were away on vacation seeing my dad receive the Fleming Medal from the American Geophysical Union, the waiting was pretty easy.

To convince myself that the repair was successful, I unplugged one of the previously functional drives, and saw that all my files were still there when the array was running just on the new drive and the other previous drive. I recommend this to anybody who thinks they’re running a RAID system — until you’ve seen the RAID array work with a drive removed, how can you be sure it’s really working? If your system is set up better than mine is, you’ll get some kind of warning message too.
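For anybody wondering why the files are still there with a drive pulled, here is a toy sketch of the parity idea behind RAID 5.  It is a simplification: a real array works on fixed-size stripes and rotates the parity block across the drives, and the block contents and helper name below are invented for illustration.

```python
from functools import reduce

def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# One stripe on a hypothetical 3-drive array: two data blocks plus parity.
data_a = b"pictures"
data_b = b"musicmp3"
parity = xor_blocks(data_a, data_b)

# Pull the drive holding data_a, then rebuild it from the two survivors.
rebuilt_a = xor_blocks(data_b, parity)
print(rebuilt_a == data_a)
```

Any one of the three blocks can be recomputed from the other two, which is why the array keeps running (in degraded mode) with a single drive missing.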

Repairing a degraded EVMS RAID 5 array

Posted in Geek, Hacks, Hardware on December 14th, 2008 by leodirac – Be the first to comment

A while back, lightning scrambled one of the disks in my home RAID 5 array.  I figured out how to recover it.  And I got the critical data off.   Here I describe the steps I took to add a new drive and get it working with the old RAID array.  I share this with the net in hopes it will make it easier for somebody else who has to go through this process themselves, and selfishly as notes for me to refer to.  It’s a testament to the power of EVMS and a warning to anybody who thinks it might be fun to run their own open-source software RAID server at home. 

My advice for people seeking reliable storage: go with a hosted solution.  Understanding the arcane nuances of these software systems is an extremely specific skill that doesn’t translate well to many real-life necessities.  If you’re smart, you can figure it out, but it doesn’t teach you much of anything except how to do exactly that.  Each person who understands this stuff should be keeping petabytes of data happy, rather than one couple’s pictures and music collections.  I hear Microsoft’s "home server" actually makes this pretty easy, but I can’t recommend anybody willingly lock themselves into Microsoft’s business model.

Background

So I bought a new drive, following my own advice about picking drives from different manufacturers when building a raid array, and plugged it in to the mobo and booted the machine.  After futzing with /etc/fstab to get it to find the boot disk and load up, I logged into evms and got these messages:

MDRaid5RegMgr: RAID5 array md/md1 is missing the member  with RAID index 0.  The array is running in degrade mode.

and

MDRaid5RegMgr: Region md/md1 is currently in degraded mode.  To bring it back to normal state, add 1 new spare device to replace the faulty or missing device.

Conceptually easy.  I’ve got a new 500 GB drive in the system.  Linux sees it.  It didn’t take me too long to figure out it’s called /dev/sda, while the previous 2 disks in the array are sdb and sdc, with a small boot drive at sdd.  Now the fun part is figuring out EVMS terminology enough to tell it to use the new disk.

The hierarchy of the array in EVMS land seems to be as follows:

  • Logical Volume teraraid (contains)
  • Region md/md1 (which contains)
  • Segments sdb1 and sdc1 (which are built on)
  • Logical disks sdb, sdc.

What I tried, and what seems to have worked

I see that logical disk sda has no segments.  So I try Action -> Create -> Segment.  It only gives me one choice for "Segment Manager" which is "GPT Segment Manager."  But when I choose it, it doesn’t let me make a segment on sda.  Only the tiny free space on sdb and sdc.  So sda needs something else done to it before we can use it.  What?

sda also shows up in the list of Logical Volumes, next to Teraraid and the formatted boot partition.  Hmmm.

Well I tried converting it to an EVMS Volume.  It complained that sda does not have a File System Interface Module (FSIM) associated with it, but it made the new logical volume anyway.  This wasn’t getting me anywhere.  So I erased it.

Next I tried "Add" -> "Segment Manager to Storage Object".  I noticed that all of the Disk Segments associated with the array were listed as using "Plug-in" "GptSegM" and this gave me the choice of adding Gpt Segment Manager to sda.  W00t.  I said "No" to make this a system disk.  This seems to be working.  Now I see a bunch of Disk Segments starting with sda, including a big one (465 GB) labelled sda_freespace1. 

Now when I tried to Create -> Segment, it let me use GPT Segment Manager on sda_freespace1 and allocate a 450 GB disk segment to match the others.  (I left 15 GB off each disk with the idea I could put a boot segment in that space, but I’ve never gotten around to it.)

Now in "Available Objects" there is sda1 with 450.0 ready for me.  Alrighty we’re getting there.

Now I look at "Storage Regions" and in the context menu for md/md1 I see an option that says "Add spare to fix degraded array…"  I didn’t see it there before — it might have not shown up when there weren’t any spares, or maybe I was just being thick.  In any case, selecting it now gives me a menu with one choice — sda1. 

Now in details of md/md1 it shows:


    ┌──────────────────── Detailed Information - md/md1 ─────────────────────┐
    │                                                                        │
    │     Name                    Value                                      │
    │ ────────────────────────────────────────────────────────────────────── │
    │     Major Number            9                                          │
    │     Minor Number            1                                          │
    │     Name                    md/md1                                     │
    │     State                   Discovered, Degraded, Active               │
    │     Personality             RAID5                                      │
    │ +   Working SuperBlock                                                 │
    │     Number of disks         3                                          │
    │ +   Disk 1                  sdb1                                       │
    │ +   Disk 2                  sdc1                                       │
    │     Number of stale disks   1                                          │
    │ +   Stale disk 0            sda1                                       │
    │                                                                        │
    │    Use spacebar on fields marked with "+" to view more information     │
    │                                                                        │
    │ [Help]                                                          [OK]   │
    │                                                                        │
    └────────────────────────────────────────────────────────────────────────┘

That last line about the Stale disk is new.

Actions -> Save commits these changes to disk.  Now looking at Detailed information for md/md1 shows


    ┌──────────────────── Detailed Information - md/md1 ─────────────────────┐
    │                                                                        │
    │     Name                    Value                                      │
    │ ────────────────────────────────────────────────────────────────────── │
    │     Major Number            9                                          │
    │     Minor Number            1                                          │
    │     Name                    md/md1                                     │
    │     State                   Discovered, Degraded, Active, Syncing =  0 │
    │     Personality             RAID5                                      │
    │ +   Working SuperBlock                                                 │
    │     Number of disks         3                                          │
    │ +   Disk 1                  sdb1                                       │
    │ +   Disk 2                  sdc1                                       │
    │     Number of stale disks   1                                          │
    │ +   Stale disk 0            sda1                                       │
    │                                                                        │
    │    Use spacebar on fields marked with "+" to view more information     │
    │                                                                        │
    │ [Help]                                                          [OK]   │
    │                                                                        │
    └────────────────────────────────────────────────────────────────────────┘

Emotionally I feel like I should be done now.  But I don’t hear the thrashing noise of a half-terabyte of checksums being unwound and copied onto a fresh disk.  And it says "Syncing = 0".  Hmmm.

I quit evmsn and reload it to see two new messages.  One familiar:

MDRaid5RegMgr: Region md/md1 is currently in degraded mode.  To bring it back to normal state, add 1 new spare device to replace the faulty or missing device.

And one novel:

MDRaid5RegMgr: RAID5 array md/md1 is missing the member  with RAID index 0.  The array is running in degrade mode.  The MD recovery process is running, please wait…

But this novel message saying it’s recovering is "Number 0" implying that it came before the other message (Number 1) which tells me I need to take action for it to fix itself.  And the drives are not thrashing.  Again I look at the details for md/md1 and now I see:


    ┌──────────────────── Detailed Information - md/md1 ─────────────────────┐
    │                                                                        │
    │     Name                 Value                                         │
    │ ────────────────────────────────────────────────────────────────────── │
    │     Major Number         9                                             │
    │     Minor Number         1                                             │
    │     Name                 md/md1                                        │
    │     State                Discovered, Degraded, Active, Syncing =  0.3% │
    │     Personality          RAID5                                         │
    │ +   Working SuperBlock                                                 │
    │     Number of disks      3                                             │
    │ +   Disk 1               sdb1                                          │
    │ +   Disk 2               sdc1                                          │
    │ +   Disk 3               sda1                                          │
    │                                                                        │
    │    Use spacebar on fields marked with "+" to view more information     │
    │                                                                        │
    │ [Help]                                                          [OK]   │
    │                                                                        │
    └────────────────────────────────────────────────────────────────────────┘

Which really seems to say it’s doing its thing.  Maybe I don’t hear the disks because it’s formatting the disk first, which is a linear process.  Or maybe the whole copy process is very linear and I won’t hear it thrashing.  Its progress implies it’s going to take a couple/few days to finish, which is what I’d expect.  So maybe it’s working.  I’ll let it run for a while and see what happens to the array if I try to unplug one of the previously working drives.

Pretty cool that I didn’t even need to unmount the array to do this.

Now if I could just figure out why my laser printer periodically decides it needs to print its internal test page, I’d be even happier.

XMPP PubSub: a great complement to Atom/RSS

Posted in Geek, Social Computing, System Architecture, XMPP on July 22nd, 2008 by leodirac – 8 Comments

I spent the day yesterday at XMPP Summit #5 alongside OSCON in Portland.  It was a great chance to catch up with old friends and meet a few new ones.  But my favorite part was the break-out discussion of XMPP PubSub as it relates to micro-blogging.  We discussed what hopefully will emerge as a standard way to associate an existing Atom/RSS feed with an XMPP PubSub Node.  First some background on the relevant technologies.  Feel free to skip ahead if you understand this stuff.

PubSub 101: Push vs Pull

PubSub is short for "publish subscribe" which is a common design pattern describing a way to distribute information to interested parties.  The publisher tells a server about new information, and the server fans the information out to everybody who has registered interest in that topic or channel.  Data consumers find out about the new information very quickly, with relatively little load on the whole system, since the pubsub mechanism provides a means to "push" data to them. 
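To make the pattern concrete, here is a toy in-memory sketch of the publish/subscribe fan-out described above.  It is not XMPP and not any real library's API; the class and method names are invented for illustration, and a real server would push over network connections rather than call local functions.

```python
from collections import defaultdict

class PubSubServer:
    """Toy in-memory publish/subscribe hub (hypothetical, not XMPP)."""

    def __init__(self):
        # topic name -> list of callbacks registered by subscribers
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register interest in a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, item):
        """Push the item to every subscriber; return how many were notified."""
        callbacks = self._subscribers[topic]
        for callback in callbacks:
            callback(item)
        return len(callbacks)

server = PubSubServer()
inbox_a, inbox_b = [], []
server.subscribe("users/leopd", inbox_a.append)
server.subscribe("users/leopd", inbox_b.append)
delivered = server.publish("users/leopd", "Heading to Broadway for a bite")
print(delivered, inbox_a, inbox_b)
```

Note that the subscribers never ask whether anything is new.  The only work happens when something is actually published.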

By contrast, almost all of the web today uses a "pull model" where a data consumer only finds out about new information when it gets around to checking if there is something new.  This data distribution model is significantly simpler because the server only needs to keep track of the content, not who is interested in knowing about it.  Modern networks are optimized for this kind of query-based traffic where data consumers (clients, web browsers) initiate connections to servers, such that it’s often impossible for servers to initiate connections to clients because of firewalls or NAT.

The downside of the pull model is that the only way a data consumer can find out if anything is new on the server is to "check back frequently" or "poll" the server for changes.  If you want to know within 15 minutes if anything new has been posted, you have to ask the server at least every 15 minutes "anything new?"  No.  "How about now?"  No.  "Got anything yet?"  No.  Multiply this by potentially millions of interested data consumers and you end up spending a lot of network bandwidth and server resources doing very little.  Even worse, the problem scales horribly.  If clients want to know about changes within 5 minutes instead of 15, that puts 3 times the load on the server.  Want to know within a few seconds?  Forget it: the servers would crash.  There’s an intrinsic delay in distributing information in this model, and reducing that delay is very expensive.
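The scaling is easy to put numbers on.  This back-of-the-envelope sketch assumes every client polls exactly once per freshness window; the audience size is a made-up figure, not a measurement of any real service.

```python
def polls_per_hour(clients, max_delay_minutes):
    """Requests per hour if every client polls once per max_delay_minutes."""
    return clients * 60 / max_delay_minutes

clients = 1_000_000  # hypothetical audience size
for delay in (15, 5, 0.1):  # 0.1 minutes is 6 seconds
    rate = polls_per_hour(clients, delay)
    print(f"{delay:>4} min freshness -> {rate:,.0f} requests/hour")
```

Load is inversely proportional to the delay you are willing to tolerate, and almost all of those requests come back with "nothing new".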

XMPP as an alternative to polling

XMPP is the protocol used for Instant Messaging by Google Talk and Jabber and a large number of small servers.  In order to deliver instant messages, XMPP systems maintain persistent connections between all machines that allow packets of data to be pushed with very low latency — IM messages are typically delivered within a second of sending them.  So it’s natural to want to use this infrastructure to deliver other data more efficiently than through polling HTTP.

The XMPP PubSub spec known as XEP-0060 describes how to do exactly this at the protocol level.  But for a variety of reasons, this standard has not gained wide adoption.  IMHO the biggest reason is that there isn’t a very pressing need.  The current system is horribly inefficient, but it works.  Moreover, it puts the burden of inefficiency squarely on the shoulders of the information publishers.  Popular publishers are just expected to shell out for the necessary hardware to meet the demands of their readers, and with advertising they can typically recoup the necessary investment.

Another way to state that is that pubsub hasn’t found its niche yet.  IMHO this is partly because the mechanism is so useful it can be applied to almost anything.  Not just breaking news, but everything from e-mail mailing lists to doorbell chimes get used as examples of how XMPP pubsub technology could be applied.  Not wanting to exclude any of these potentially interesting uses, the protocol remains very generic.

Micro-blogging, Atom and Yesterday’s Realization

One place where the current HTTP model breaks down is micro-blogging, which is the generic term for services like Twitter or Facebook’s updates.  Here, the payload of actual content is very small, so the overhead of checking far outweighs the "useful data" which is delivered.  Also, because the information is social (i.e. "Heading to Broadway for a bite, wanna come?") consumers demand it be delivered quickly.  Nonetheless, current micro-blogging services still rely on polling clients, and their servers suffer as a result.

Yesterday, a group of us including Blaine Cook, Anders Conbere, Evan Prodromou, and XEP-0060 co-author Ralph Meijer were discussing how to apply XMPP PubSub to micro-blogging.  This was likely obvious to many there already, but during the discussion I had a realization.  We aren’t solving this problem from whole cloth.  RSS and Atom feeds already describe all the information we need.  We just need to find a way to substitute XMPP for the assumed transport HTTP.

So we discussed mechanisms for mapping an Atom URL to an XMPP PubSub Node.  (We pretty much ignored RSS because RSS isn’t as cool for reasons I really don’t understand.)  We talked about putting a link-rel tag in the feed to point to the XMPP PubSub node.  This would look something like 

   <link rel="alternate" type="xmpp/pubsub" href="xmpp:twitter.com?;node=users/leopd" />

Even better, the URL for these nodes should be guessable from the URL for the HTTP feed.  The above node would be the normal place to look for the pubsub version of http://twitter.com/leopd.  Even though it’s not as generic and robust to have a standard mapping like this, I think it’s an important way to speed adoption of a new standard.  The code to do a bit of string manipulation is vastly easier than fetching the URL and looking for a link-rel tag.  And developers are intrinsically lazy (for good reasons!) so making things easier for them means they’ll succeed a lot more.
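That bit of string manipulation could look something like this.  The mapping itself (http://HOST/NAME becomes xmpp:HOST?;node=users/NAME) is only the convention floated in this post, not an agreed standard, and the function name and "users/" prefix parameter are assumptions for illustration.

```python
from urllib.parse import urlparse

def guess_pubsub_uri(feed_url, node_prefix="users/"):
    """Guess an XMPP PubSub URI from an HTTP feed URL.

    Assumes the hypothetical convention sketched in this post:
    http://HOST/NAME  ->  xmpp:HOST?;node=users/NAME
    """
    parsed = urlparse(feed_url)
    name = parsed.path.strip("/")
    if not parsed.netloc or not name:
        raise ValueError(f"cannot derive a node from {feed_url!r}")
    return f"xmpp:{parsed.netloc}?;node={node_prefix}{name}"

print(guess_pubsub_uri("http://twitter.com/leopd"))
```

A client could try the guessed node first and fall back to fetching the feed and reading its link-rel tag only when the guess fails.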

Ever pragmatic, Blaine pointed out that we should use HTTP for things it is good at, and not re-invent them in XMPP.  I wholeheartedly agree.  Re-transmission is a key example.  What happens if a client is offline when a new post happens, and so never hears about it?  Answer: The clients should fetch the historic archive of the feed over HTTP.  These feeds exist today — no need to improve on them.  If all the posts have sequence numbers on them, then it’s easy to figure out if you’ve missed one.  So all the posts from a user should have sequence numbers.  I don’t think this is standard in Atom feeds today.
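Here is a minimal sketch of how a client might spot the gap, assuming posts carry consecutive integer sequence numbers.  As noted above, that is not something Atom feeds are known to provide today, and the function name is invented for illustration.

```python
def missing_sequence_numbers(last_seen, received):
    """Return sequence numbers skipped between last_seen and the newest received."""
    if not received:
        return []
    seen = set(received)
    newest = max(received)
    return [n for n in range(last_seen + 1, newest + 1) if n not in seen]

# The client last saw post 41, went offline, then got 42, 45 and 46 pushed.
missed = missing_sequence_numbers(last_seen=41, received=[42, 45, 46])
print(missed)  # these are the posts to fetch from the HTTP archive
```

Only when the list is non-empty does the client need to touch HTTP at all, which keeps the push path and the archive path cleanly separated.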

The story unfolds…

There’s a lot more to be worked out and standardized here.  And clearly many more people need to voice their opinions before we can reach consensus.  Sadly I can’t be down in Portland today to continue the discussion, so this post will have to take my place as I return to my regular daily commitments.  If you’d like to stay tuned as the story unfolds, you’ll have to poll this site, as I can’t yet give you a PubSub node to subscribe to for updates.  If I could it would probably be something like xmpp:embracingchaos.com?;node=xmpp — try it.  By the time you read this, it might be working!

Recovering a RAID Array after Lightning

Posted in Geek, Hacks on July 9th, 2008 by leodirac – Be the first to comment

The EVMS RAID 5 array in my Linux fileserver crashed recently due to a lightning storm, and I thought I’d lost everything.  But with some luck and intuition I was able to recover all my files.  I’ll tell you how I did it, so hopefully others who run into similar problems can recover their data too.  But first, a little background.

Last week Seattle had some crazy electrical storms.  In recent years’ storms, my block has done better than most with respect to power failures making me think we’re either lucky or in a particularly robust section of the grid.  So I was a little surprised to find my whole house offline on Wednesday morning.  After a bit of debugging I figured out that the small UPS that runs all my networking gear got toasted, and for some reason the file server was down.

I left it alone for several days, and when I got around to turning it back on, I was happy that the whole stack through the samba server came up by itself.  (It doesn’t always!)  But when I started looking around I quickly realized things were amiss.  The media/video directory normally has 4 subdirectories: movies, episodic TV, imake and other.  But today it listed:

    leo@elephant:/raid/shares/media/video$ ls
    dpisndic TV  hmakd  movies  nther

WTF!?  A few bits had been scrambled in the directory names.  This sounds really bad.  Moreover, even though the first couple levels of the directory hierarchy were there, no files were to be found.  Definitely a problem.
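
To see just how few bits, here's a small sketch that compares the expected directory names with the ones in the listing, byte by byte.  The helper name is made up, and it only looks at the names as ASCII text; it says nothing about where on disk the damage actually happened.

```python
def bit_differences(expected, observed):
    """List (index, expected char, observed char, XOR mask) for each differing byte."""
    a = expected.encode("ascii")
    b = observed.encode("ascii")
    if len(a) != len(b):
        raise ValueError("names must be the same length to compare byte by byte")
    return [(i, chr(x), chr(y), x ^ y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Expected name, name as it appeared in the ls output
pairs = [("episodic TV", "dpisndic TV"), ("imake", "hmakd"), ("other", "nther")]
for expected, observed in pairs:
    for i, good, bad, mask in bit_differences(expected, observed):
        print(f"{expected!r}[{i}]: {good!r} -> {bad!r}, XOR mask {mask:#04x}")
```

Every changed character comes out with a mask of 0x01: 'e' became 'd', 'o' became 'n', 'i' became 'h'.  Each one differs from the original by a single bit, the lowest one.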

Step 1: As soon as you suspect your RAID array has a problem, stop writing to it until you know what’s going on.  Writing changes can make things worse.  Stop the bleeding.   

I unmounted the drive from my Mac, not trusting Finder or Spotlight to refrain from sprinkling damaging meta-files over the array.  Once I remembered how to ssh into the box, I stopped the samba daemon,

    leo@elephant:/$ sudo /etc/init.d/samba stop

unmounted the filesystem

    leo@elephant:/$ sudo umount /raid

and changed fstab so the array would be mounted read-only when it comes back, and so it wouldn’t come back without me asking.

    leo@elephant:/$ sudo vi /etc/fstab

changing

    /dev/evms/teraraid500 /raid ext3 defaults  0 0

to

    /dev/evms/teraraid500 /raid ext3 ro,noauto  0 0
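
In that line, `ro` mounts the filesystem read-only and `noauto` keeps it from being mounted at boot or by `mount -a`.  I made the change by hand in vi, but if you'd rather script it, here's a rough sketch.  The function name is made up, it only works on a string (it doesn't touch /etc/fstab itself), and it collapses the whitespace on the line it changes.

```python
def set_mount_options(fstab_text, mount_point, options):
    """Return fstab_text with the options field replaced for the given mount point."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # fstab fields: device, mount point, fs type, options, dump, pass
        is_comment = line.lstrip().startswith("#")
        if len(fields) >= 4 and not is_comment and fields[1] == mount_point:
            fields[3] = options
            line = " ".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


before = "/dev/evms/teraraid500 /raid ext3 defaults  0 0\n"
print(set_mount_options(before, "/raid", "ro,noauto"), end="")
```

Lines for other mount points, comments and blank lines pass through unchanged.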

I tried poking around in EVMS by running

    leo@elephant:/$ evmsn

But it hung during initialization with a blue dialog saying "Discovering segments…"  I’m thinking EVMS can’t help me.  After a bit of googling I thought I should try e2fsck or some such.  First, I tried to mount it again read-only and see what’s there.

    mount: wrong fs type, bad option, bad superblock on /dev/evms/teraraid500,
           missing codepage or other error
           In some cases useful info is found in syslog - try
           dmesg | tail  or so

Bad superblock.  Uh oh.  Well this guy managed to recover a drive with a bad superblock.  Lots of things were pushing me in this direction — fix the filesystem.  But I realized that was a mistake.

Step 2: Do not make changes at the filesystem level until you’re confident that the RAID array is working properly.  You set up RAID for a reason.  You’ve still got a chance to recover everything, but if you start making changes to it in a broken state, you’re almost certainly going to make things worse.

Me to self: Think about it.  EVMS is confused.  Linux is confused.  Ext2 and ext3 are messed up complaining about bad superblocks.  The problem was caused by lightning.  When the drive was mounted there were weird bit-level corruptions in the data that were still there.  Maybe one of the drives in the array got its data scrambled, but didn’t get fragged so totally that it went offline.  RAID 5 is designed to survive total loss of a single drive.  But if a drive gets corrupted, who knows what will happen.  So I came up with this plan:

Step 3: Try physically disconnecting the drives in your array, one at a time.  If only one of them is scrambled, disconnecting it should restore all the data in the array.
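
Why should pulling a drive help?  Here's a toy sketch of the parity idea behind RAID 5, under heavy simplifying assumptions: one stripe, three drives, two data blocks and one parity block.  A real array rotates parity across the drives and works in much larger chunks, and the block contents here are invented.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# One stripe of a simplified three-drive array: two data blocks plus parity.
d0 = b"movies"
d1 = b"other!"
parity = xor_blocks(d0, d1)

# Case 1: the drive holding d1 is gone.  The survivors rebuild it exactly.
rebuilt = xor_blocks(d0, parity)
print(rebuilt == d1)

# Case 2: the drive holding d1 is present but scrambled.  The bad block
# no longer agrees with the parity computed from the good data.
scrambled_d1 = b"nther!"
print(xor_blocks(d0, scrambled_d1) == parity)
```

The first check prints True and the second prints False.  As I understand it, software RAID 5 doesn't verify parity on ordinary reads, so a drive that is present but scrambled just hands back its bad block.  Take that drive away and the array has to rebuild the block from the good drives instead.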

Having followed my own advice, it’s easy for me to tell the drives in my array apart since each drive in the RAID array is from a different manufacturer (which makes array failure due to manufacturing defects far less likely). 

This plan actually worked perfectly!  Removing a drive caused a bit of a hassle in getting the machine back up, because when I booted, it couldn’t find the /boot partition, complaining:

     * Starting Enterprise Volume Management System...
    [42949392.340000] raid5: raid level 5 set md1 active with 2 out of 3 devices, algorithm 0

     * Checking all filesystems...
    fsck.ext3: No such file or directory while trying to open /dev/sdd5
    /dev/sdd5:
    The superblock could not be read or does not describe a correct ext2
    filesystem.  If the device is valid and it really contains an ext2
    filesystem (and not swap or ufs or something else), then the superblock
    is corrupt, and you might try running e2fsck with an alternate superblock:
        e2fsck -b 8193 <device>

Notice the complaint about the superblock again.  Don’t trust it, and don’t do what it says!  What really happened was that the boot drive letter had been changed from /dev/sdd to /dev/sdc, so I had to change /etc/fstab to mount /boot from /dev/sdc5 instead of /dev/sdd5.  In my system, I boot off a non-RAID disk attached to the mobo, which for some annoying reason gets the last drive letter after all the drives on the SATA card.

But once I got past this, it quickly turned out that the Samsung drive was the culprit.  With it removed, the software RAID kicked in and plugged all the holes.  Everything in the array looked completely normal again.  All the directories.  All the files.  Hooray!

Flashbacks to College Math

Posted in Geek on January 18th, 2008 by leodirac – 2 Comments

Callin’ out to all the Mudders in the audience.  Ever take math class from Dr. Benjamin?  I did.  Sophomore year I think.  Real Analysis maybe?  It bent my brain sideways.  Fun class.  Got an A.  I felt like I was starting to “get math” after that class.

He does this trick where he multiplies 5 digit numbers together in his head.  He was a nationally ranked backgammon player in his spare time.  #40 or so in the US at the time IIRC.  It turns out that serious backgammon is all about estimating your fractional odds of winning at any given point and raising the bet when you think you’re ahead.  So for him it was a “live probability lab.”

I was reminded of this when a friend was watching a video and sure enough it was one of my old math profs.  So here’s the video of him showing off at Ted.  You go!  Thanks for the great math.

Watch Professor Benjamin’s Mathemagics show at Ted