2009-06-12

It Is Written

What I'll call a first draft of the RAIDframe parity map stuff is written; it compiles and links, and if run it will actually do something. That something will, in practice, probably involve bugs.

Now to get my QEMU setup into something more resembling a useful state: the time I spend on that will almost certainly be paid back by not having to wait for my test box to reboot, and I've been meaning to deal with it anyway. Once this gets to the point of serious benchmarking I'll need to use actual hardware for the most part, of course.

The RAIDframe codebase, incidentally, is… not unelaborate.

2009-06-11

The RAID Project: Things Not To Do

  1. Let writes to the RAID hit the disk before the corresponding parity map bit is set on disk.

  2. Let writes to the RAID hit the disk after the corresponding parity map bit is cleared on disk. (That is, updates which just mark regions clean again still need a barrier.)

  3. Have one write see that its region needs to be marked unclean and start doing that; then, before that update actually gets committed to the disk, have another write to the same region see that it's allegedly already marked and just do its write, which happens to hit the disk before the parity map update does; and then have the power go out at that exact moment.

    This may not even be possible — I think I'll only ever be starting writes from one particular thread, given the RAIDframe architecture, though I'm not sure of that yet — and even if it is it sounds stunningly unlikely. Which is to say that if I get this wrong I may never find out; so don't do that.

    Point also being that it's important to keep invariants in mind when dealing with shared-state concurrency, including invariants that involve the state of secondary storage and the potential behavior of loosely specified hardware, not just the program's data structures proper. There's a rough sketch below of the ordering I have in mind.
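
Here's roughly that ordering, as a user-space sketch rather than anything resembling the real code: all the names here are invented, and the stubs stand in for the actual parity map I/O. The point is that a region's dirty bit has to be committed to disk, cache flush and all, before any data write to that region is issued, and that a map update still in flight can't be treated as already being on disk.

    /*
     * Illustrative sketch only, not RAIDframe code; all names invented.
     */
    #include <stdbool.h>
    #include <stddef.h>

    enum region_state {
        REGION_CLEAN,     /* on-disk dirty bit is clear */
        REGION_DIRTYING,  /* bit set in memory; not yet committed to disk */
        REGION_DIRTY      /* on-disk dirty bit is known to be set */
    };

    struct region {
        enum region_state state;
    };

    /* Placeholders for the real parity map write, cache flush, and data I/O. */
    static void write_parity_map_bit(struct region *r, bool dirty) { (void)r; (void)dirty; }
    static void flush_disk_cache(void) { }
    static void issue_data_write(struct region *r, const void *buf, size_t len)
    { (void)r; (void)buf; (void)len; }

    /* Call this before letting a data write for the region reach the disk. */
    void
    pm_begin_write(struct region *r, const void *buf, size_t len)
    {
        if (r->state != REGION_DIRTY) {
            /*
             * A second writer arriving while the state is REGION_DIRTYING
             * would have to wait here rather than assume the bit is already
             * on disk (that's mistake #3 above); with a single I/O thread
             * the question doesn't come up.
             */
            r->state = REGION_DIRTYING;
            write_parity_map_bit(r, true);  /* mistake #1: bit goes first... */
            flush_disk_cache();             /* ...and is actually committed... */
            r->state = REGION_DIRTY;
        }
        issue_data_write(r, buf, len);      /* ...before the data write. */
    }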

Completely unrelatedly, I've just learned that the posting interface here rejects ill-formed HTML.

2009-06-01

Parity

I didn't go into too much detail about my Google Summer of Code project last time. It is: improving RAIDframe parity handling. And now I'm going to be excessively verbose about it.

Specifically: the thing about RAID levels that provide redundancy (i.e., not RAID 0) is that there's some kind of invariant over what's on the disk: both halves of a mirror are the same, or each parity block is the XOR of its corresponding data blocks, &c. And the thing about software RAID is that, if the power goes out (or the system crashes) while you're in the middle of writing stuff to each of the disks, some of those writes might happen while others don't. Then, when the lights come back on, the invariant may no longer hold for any stripe that was being written.

This is of particular concern for RAID 5, because if the parity is still wrong when (not if) a disk fails and one of the data blocks needs to be reconstructed by XORing the parity with the remaining data, you will get complete garbage instead of the data you lost. This is bad.
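
To make the invariant concrete, here's a toy illustration in plain C (nothing to do with the actual RAIDframe code; the block size and disk count are arbitrary): the parity block is the XOR of the data blocks, and a lost data block is rebuilt by XORing the parity with the surviving data. If the invariant didn't hold when the disk died, the "rebuilt" block is garbage.

    #include <stddef.h>
    #include <stdint.h>

    #define NDATA 4    /* data disks in the stripe (arbitrary) */
    #define BLKSZ 512  /* bytes per block (arbitrary) */

    /* The invariant: each parity byte is the XOR of the corresponding data bytes. */
    void
    compute_parity(uint8_t data[NDATA][BLKSZ], uint8_t parity[BLKSZ])
    {
        for (size_t i = 0; i < BLKSZ; i++) {
            parity[i] = 0;
            for (int d = 0; d < NDATA; d++)
                parity[i] ^= data[d][i];
        }
    }

    /*
     * Rebuild the block from the failed disk by XORing the parity with the
     * surviving data blocks.  This yields the lost data only if the invariant
     * actually held when the disk died; otherwise it yields garbage.
     */
    void
    reconstruct(uint8_t data[NDATA][BLKSZ], uint8_t parity[BLKSZ],
        int failed, uint8_t out[BLKSZ])
    {
        for (size_t i = 0; i < BLKSZ; i++) {
            out[i] = parity[i];
            for (int d = 0; d < NDATA; d++)
                if (d != failed)
                    out[i] ^= data[d][i];
        }
    }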

One solution, and the one currently used in NetBSD, is to set a flag on each disk making up the RAID when it's configured, and clear it when it's unconfigured. If that flag is already set when the set is brought up, then there might have been an unclean shutdown requiring the parity to be recomputed.

That is, requiring the entire array to be read from beginning to end. Which, as magnetic disk drives pack more and more tracks onto their platters, inevitably takes longer and longer. As it is, each unclean shutdown requires many hours of parity rewriting, during which the disk I/O load interferes with whatever the system's actual job is. This is also kind of bad.

It is said that the Solaris Volume Manager (which I briefly administered an instance of, but didn't have to care how it worked in this much detail) divides the RAID into some number of regions and records for each one whether its parity might be out of sync. This seems like a simple enough idea.
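
For concreteness, the sort of thing I have in mind, sketched with made-up names and sizes (the real on-disk format is still to be settled): a bitmap with one bit per fixed-size region, plus a trivial mapping from a write's starting sector to its region. After an unclean shutdown, only the regions whose bits are set need their parity rewritten, rather than the whole array end to end.

    #include <stdint.h>

    #define PM_REGIONS 4096  /* regions per RAID set (arbitrary) */

    struct parity_map {
        uint64_t pm_region_size;            /* sectors per region */
        uint8_t  pm_dirty[PM_REGIONS / 8];  /* one bit per region: parity suspect */
    };

    /* Which region does a write starting at this sector land in? */
    unsigned
    pm_region_of(const struct parity_map *pm, uint64_t sector)
    {
        return (unsigned)(sector / pm->pm_region_size);
    }

    int
    pm_isdirty(const struct parity_map *pm, unsigned region)
    {
        return pm->pm_dirty[region / 8] & (1 << (region % 8));
    }

    void
    pm_setdirty(struct parity_map *pm, unsigned region)
    {
        pm->pm_dirty[region / 8] |= (uint8_t)(1 << (region % 8));
    }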

Except it's kind of not. Ideally, you'd like as many of these regions as possible to be marked clean, to cut down on the parity rewriting time. On the other hand, setting or clearing a region's dirty bit costs disk seeks (and probably disk cache flushes too, plus some hope that the firmware isn't too broken), and it's absolutely essential that the bit-setting hit the disk before any writes to the region do; so you also want to hold off on marking clean any region that you think might be written to again soon.

So, if you're getting truly random I/O, you're kind of stuck. But if what's on top of the RAID is some halfway reasonable filesystem that's been painstakingly designed for locality of reference, then recent write activity should (at the region level) be a decent predictor of the near future. I hope.
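
One simple way to turn "written recently" into "don't mark it clean yet", sketched with arbitrary names and numbers (whether this is what I'll actually end up doing is an open question): give each region a cooldown counter that's reset by every write and ticked down periodically, and only consider clearing a region's on-disk dirty bit once its counter has run out.

    #define PM_COOLDOWN 8  /* ticks a region must sit idle first (arbitrary) */

    struct pm_counter {
        unsigned ctr;  /* 0 means the region has been idle long enough */
    };

    /* Every write to the region pushes its cleaning further into the future. */
    void
    pm_note_write(struct pm_counter *c)
    {
        c->ctr = PM_COOLDOWN;
    }

    /*
     * Run periodically; returns nonzero once the region has sat idle for
     * PM_COOLDOWN consecutive ticks and is a candidate for having its
     * on-disk dirty bit cleared.
     */
    int
    pm_tick(struct pm_counter *c)
    {
        if (c->ctr > 0)
            c->ctr--;
        return c->ctr == 0;
    }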

And then there's the part of the project where I get all this integrated into the kernel, which is beyond the scope of this post.