Data integrity: the Prime Directive

Published: Jun 10, 2021 by luxagen

While I’d love to debate the moral implications of a certain android’s adherence to Starfleet’s highest law, what I want to talk about today has to do with my 15-plus-year mission to never again lose data unintentionally — specifically a principle whose importance I only fully grasped quite recently.

Yule regret this

Christmas is traditionally a time for family and friends, but it’s also a convenient lull in the year’s pace, and I’ve developed a habit of using the period to get around to some of the more mundane IT-infrastructure tasks that are rarely urgent enough to do during the rest of the year, especially data management. This last Christmas, therefore, I took the step of trying BTRFS for the first time.

In retrospect, I made two mistakes. For one thing, the BTRFS version shipped with Ubuntu 16 was in the last stage of relative immaturity, and I really should have upgraded to Ubuntu 20 before trying it. The far greater mistake, though, was to use its spanning feature to join two 8 TB SMR hard drives into a single 16 TB volume. This was a Very Bad Idea™ because some of the things you need to be able to do in a spanning/RAID setup — e.g. btrfs balance — cannot be rate-controlled. Not only do SMR drives tend to slow down after a few hundred gigabytes of continuous writes, but some of them eventually grind to a halt, potentially leaving you with no way to get the filesystem into a consistent state.

These mistakes led to a nightmarish episode in which I got multiple levels down a stack of yaks, convinced myself that I had at least one faulty drive, and ended up with a broken filesystem that I didn’t have enough free space to run btrfs restore on. In the end I solved the latter problem by splurging on server-grade WD Gold drives, and managed to emerge from the chaos traumatised but without any actual data loss.
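For anyone hitting the same dead end: btrfs restore copies files out of an unmountable filesystem to a separate destination, so the escape route is simply a big enough spare disk. A rough sketch of the invocation — the device and mount point here are placeholders, not my actual setup:

```shell
# List what's recoverable first, without writing anything (-D = dry run).
btrfs restore -D /dev/sdb1 /mnt/recovery

# Then copy everything out; -i ignores errors so one bad file doesn't abort the run.
btrfs restore -i /dev/sdb1 /mnt/recovery
```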

I threw BTRFS firmly in the sea, got everything shipshape again, and got on with my life.

Recently, the trauma faded enough for me to learn the correct lesson from all this.

A simple philosophy

When it comes to data management, there’s a whole category of concerns that seem both diverse and complex — until you realise that they all have the same answer.

Q: What filesystem should I use to minimise the chances of data loss?
Q: Should I get a UPS to guard against filesystem corruption during blackouts?
Q: Should I buy server-grade drives and/or make sure they’re CMR rather than SMR to preserve my data?

A: If you’re worried about this, you don’t have enough redundancy!

To illustrate:

“I want to use the most trustworthy filesystem possible.”
No you don’t; you want to use one that isn’t obviously hazardous, and think of it as disposable because you have enough redundancy.

“I want to make sure my power supply is as reliable as possible to minimise uncontrolled shutdowns.”
No you don’t; you want to evaluate the danger of acute data loss versus the real-world frequency of power-loss events, and assess the cost-to-benefit ratio of buying a UPS based on that.

“I want to use the most trustworthy drives money can buy.”
No you don’t; you want to make that decision based on rational considerations like (a) the amortised time cost of replacing faulty drives, and (b) the financial cost, over time, of replacing cheap drives that might not last as long versus that cost for drives with a decent warranty.

“I really must get around to organising my decade-plus of accumulated digital junk so I can find things again and free up some space, and I need to do that first.”
NO, NO, NO!! (whacks speaker with rolled-up newspaper) What you really must do is urgently replicate that digital slagheap and achieve redundancy. Once that’s in place, THEN you can sift through it all knowing that the inevitable missteps won’t lose you anything.

I should make clear that the four points above are not strawmen but actually factor into my thinking. The important thing is that by removing the data-loss penalty for mistaken decisions, I don’t have to agonise so much and can take a “live and learn” approach.

How much?

While the irony of repeatedly banging on about redundancy isn’t lost on me, it bears repeating because it’s such an important paradigm shift. As for how much is enough, my current policy is 3 copies. Of course, this can be varied depending on the importance of the data, but I generally only do that for offlining data to tape. For online storage, I find it easier to apply a uniform policy, not least because it avoids incessant knapsacking difficulties.
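A uniform copy count also makes it trivial to check that the policy is actually holding. As a toy illustration — the paths are invented for the demo, standing in for real drives — hash the same file in each replica and confirm they all agree:

```shell
# Set up three hypothetical replicas of one file.
mkdir -p /tmp/replicas/a /tmp/replicas/b /tmp/replicas/c
echo "irreplaceable data" > /tmp/replicas/a/photo.raw
cp /tmp/replicas/a/photo.raw /tmp/replicas/b/
cp /tmp/replicas/a/photo.raw /tmp/replicas/c/

# Hash all three copies; exactly one unique hash means every replica agrees.
sha256sum /tmp/replicas/a/photo.raw /tmp/replicas/b/photo.raw /tmp/replicas/c/photo.raw \
  | awk '{print $1}' | sort -u | wc -l
```

If that final command prints anything other than 1, a replica is stale or corrupt, and it’s time to re-sync before the effective copy count silently drops below three.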

Outlook: chilled

Learning the correct lesson has meant overcoming my fear of BTRFS and other “experimental” filesystems, so I’m trying it again. This time it’s on Ubuntu 22, and definitely without any spanning or RAID that would get in the way of recovering from a single-drive failure. Then again, who cares? I have enough redundancy, so if at some future point spanning offers any value, I can reconsider it.