2009-06-01

Parity

I didn't go into too much detail about my Google Summer of Code project last time. It is: improving RAIDframe parity handling. And now I'm going to be excessively verbose about it.

Specifically: the thing about RAID levels that provide redundancy (i.e., not RAID 0) is that there's some kind of invariant over what's on the disk: both halves of a mirror are the same, or each parity block is the XOR of its corresponding data blocks, &c. And the thing about software RAID is that, if the power goes out (or the system crashes) while you're in the middle of writing stuff to each of the disks, some of those writes might happen while others don't. Then, when the lights come back on, the invariant may no longer hold for any stripe that was being written.

This is of particular concern for RAID 5, because if the parity is still wrong when (not if) a disk fails and one of the data blocks needs to be reconstructed by XORing the parity with the remaining data, you will get complete garbage instead of the data you lost. This is bad.
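
To make the XOR relationship concrete, here's a minimal standalone sketch; the block size and function names are mine for illustration, not anything out of RAIDframe:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4096    /* illustrative block size, not RAIDframe's */

    /* Parity block = byte-wise XOR of all the data blocks in the stripe. */
    static void
    compute_parity(uint8_t *parity, uint8_t *const data[], size_t ndata)
    {
        memset(parity, 0, BLOCK_SIZE);
        for (size_t i = 0; i < ndata; i++)
            for (size_t j = 0; j < BLOCK_SIZE; j++)
                parity[j] ^= data[i][j];
    }

    /*
     * Reconstructing a failed data block is the same operation: XOR the
     * parity with every surviving data block.  If the parity was stale
     * when the disk died, the "reconstructed" block is garbage.
     */
    static void
    reconstruct(uint8_t *lost, const uint8_t *parity,
        uint8_t *const surviving[], size_t nsurviving)
    {
        memcpy(lost, parity, BLOCK_SIZE);
        for (size_t i = 0; i < nsurviving; i++)
            for (size_t j = 0; j < BLOCK_SIZE; j++)
                lost[j] ^= surviving[i][j];
    }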

One solution, and the one currently used in NetBSD, is to set a flag on each disk making up the RAID when it's configured, and clear it when it's unconfigured. If that flag is already set when the set is brought up, then there might have been an unclean shutdown requiring the parity to be recomputed.
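
In outline, the flag handling amounts to something like the following; the names here are hypothetical, and the real thing lives in RAIDframe's component labels:

    /* Hypothetical whole-array clean flag; not the actual component-label code. */
    struct raid_set {
        int clean;    /* 1 = parity was known good at last unconfigure */
        /* ... */
    };

    void
    raid_configure(struct raid_set *rs)
    {
        if (!rs->clean)
            schedule_full_parity_rewrite(rs);   /* unclean shutdown: check everything */
        rs->clean = 0;                          /* dirty for as long as the set is up */
        write_labels_to_components(rs);
    }

    void
    raid_unconfigure(struct raid_set *rs)
    {
        rs->clean = 1;                          /* no writes can be in flight now */
        write_labels_to_components(rs);
    }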

That is, requiring the entire array to be read from beginning to end. Which, as magnetic disk drives pack more and more tracks onto their platters, inevitably takes longer and longer. As it is, each unclean shutdown requires many hours of parity rewriting, during which the disk I/O load interferes with whatever the system's actual job is. This is also kind of bad.

It is said that the Solaris Volume Manager (an instance of which I briefly administered, without having to care how it worked in this much detail) divides the RAID into some number of regions and records for each one whether its parity might be out of sync. This seems like a simple enough idea.
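
A sketch of what that could look like, with a made-up region count and a bitmap stored alongside the rest of the array metadata:

    #include <stdint.h>

    #define NREGIONS 128    /* made-up region count */

    struct parity_map {
        uint64_t total_sectors;          /* size of the whole array */
        uint8_t  dirty[NREGIONS / 8];    /* one "parity may be stale" bit per region */
    };

    /* Map an array sector to the region that covers it. */
    static unsigned
    sector_to_region(const struct parity_map *pm, uint64_t sector)
    {
        uint64_t region_size = (pm->total_sectors + NREGIONS - 1) / NREGIONS;
        return (unsigned)(sector / region_size);
    }

After an unclean shutdown, only the regions whose bits are set need their parity recomputed; everything else can be trusted.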

Except it's kind of not. Ideally, you'd like as many of these regions marked clean as possible, to cut down on the parity rewriting time. On the other hand, setting or clearing a region's dirty bit costs disk seeks (and probably disk cache flushes, too, plus hoping the firmware isn't too broken), and it's absolutely essential that the bit-setting hit the disk before any writes to that region are issued. So you also want to hold off on marking clean any region that you think might be getting written to sometime soon.
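
Building on the sketch above (and assuming the map lives inside the hypothetical raid_set, with equally hypothetical helpers), the ordering constraint on the write path looks roughly like this:

    /*
     * The region's dirty bit must be durable on disk before any data or
     * parity write to that region is issued.
     */
    int
    raid_write(struct raid_set *rs, uint64_t sector, const void *buf, size_t len)
    {
        unsigned r = sector_to_region(&rs->pmap, sector);

        if (!region_is_dirty(&rs->pmap, r)) {
            region_mark_dirty(&rs->pmap, r);
            write_parity_map(rs);      /* seek and write the on-disk map... */
            flush_disk_caches(rs);     /* ...and make sure it actually hit the platters */
        }
        return do_stripe_write(rs, sector, buf, len);   /* only now do the real I/O */
    }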

So, if you're getting truly random I/O, then you're kind of stuck. But if what's on top of the RAID is a halfway reasonable filesystem, one that's been painstakingly designed for locality of reference, then recent write activity should (at the region level) be a decent predictor of the future. I hope.
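
One way to turn that hunch into a policy is a simple idle timeout per region; the threshold and the bookkeeping below are entirely made up, just to show the shape of it:

    #define CLEAN_IDLE_SECS 60      /* arbitrary threshold, for illustration only */

    void
    raid_lazy_clean(struct raid_set *rs, time_t now)
    {
        for (unsigned r = 0; r < NREGIONS; r++) {
            if (region_is_dirty(&rs->pmap, r) &&
                region_parity_known_good(rs, r) &&    /* never clear a bit while parity may be stale */
                now - rs->last_write_time[r] >= CLEAN_IDLE_SECS) {
                region_mark_clean(&rs->pmap, r);
                write_parity_map(rs);
            }
        }
    }

The bet is that a region the filesystem just wrote is likely to be written again soon, so cleaning it only to immediately re-dirty it (and pay another seek and cache flush) would be a waste.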

And then there's the part of the project where I get all this integrated into the kernel, which is beyond the scope of this post.
