RotKraken bugfix

Published: Jan 12, 2023 by luxagen

I’ve been dealing with a macOS backup disk containing both creative work and system backups. The disk’s contents total 1.6 TB, which would eat quite a few LTO-3 tapes (especially with redundant copies), but the degree of data duplication allowed me to get the whole dataset down to only 330 GB by using rdfind in hardlink mode. BTRFS snapshots made it easy to do this in a controlled way and verify the results.

Since developing RotKraken, I’ve found it to have some fairly life-changing use cases, not least that it makes the job of organising my filesystem ridiculously fast and easy; suspected duplicate trees can be compared instantly with the included rkdiff tool before removal.

In this case, though, RK’s standard functionality was a lifesaver for checking both layout and content of rdfind’s dedupe results, except for one thing: there was a subtle library bug in the way.

Some time after running rk -i on the initial snapshot containing the raw filesystem copy, I discovered that some of the files had never initialised and, in fact, wouldn’t do so. They were all Icon^M files, i.e. files named Icon followed by a carriage return, which macOS software seems to use frequently.

After 60-90 minutes of investigation, I traced the problem to the Digest::MD5::File Perl library; it barfs on filenames with a trailing carriage return, returning undef from its ->file_md5_hex method. The bug quickly turned out to lie in the Digest::MD5->addfile API, to which the library delegates the hashing under the hood.

This hadn’t shown up in RK’s (fairly comprehensive) tests because, although they deliberately use pathological filenames containing all kinds of unusual characters, the names don’t end with a CR, which is where the problem actually shows up.

I worked around the problem by replacing the Digest::MD5::File->file_md5_hex call with an explicit open/slurp/close pattern that uses an instance of Digest::MD5 and calls ->md5_hex at the end. I verified correctness with the tests, which I have now amended to include filenames with trailing CRs.
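For readers who want to see the shape of the workaround, here is a minimal sketch of the same open/read/hash pattern. It is a Python analogue, not RotKraken's actual Perl code: the function names `md5_hex_of_file` and `demo_trailing_cr` are made up for illustration, and reading in chunks rather than slurping the whole file is my own substitution. The point is only that the caller opens the file itself, so the path reaches the OS untouched, trailing CR included.

```python
import hashlib
import os
import tempfile


def md5_hex_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.md5()
    # Open the file ourselves in binary mode; the path is passed through
    # as-is, including any trailing carriage return.
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo_trailing_cr():
    """Hash a temporary file whose name ends in CR, like macOS 'Icon\\r' files."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "Icon\r")
        with open(path, "wb") as handle:
            handle.write(b"hello")
        return md5_hex_of_file(path)
```

Calling `demo_trailing_cr()` on a filesystem that allows CR in filenames (as Linux and macOS filesystems generally do) doubles as a regression test of the kind described above: it fails loudly if the trailing CR is lost anywhere between the caller and the `open` call.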

After all this, RK worked perfectly on the dataset, allowing me (via the bundled rkdiff-stdin tool) to verify no difference between the copied files and the initial hash log I took from the source disk with md5deep. As a side benefit, RotKraken is now more practical to use on Mac system backups.
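The final comparison can be sketched roughly as follows. This is not how rkdiff-stdin works internally (I'm not describing its input format here); it is a hypothetical Python illustration that assumes two logs in the md5sum-style layout of a hash, two spaces, then the path. The helper names `parse_hash_log` and `diff_hash_logs` are invented for the example.

```python
def parse_hash_log(text):
    """Parse 'HASH  PATH' lines into a {path: hash} dict (assumed log layout)."""
    entries = {}
    # Split on "\n" only: str.splitlines() would also break on the CR
    # inside names such as "Icon\r" and corrupt those entries.
    for line in text.split("\n"):
        if not line:
            continue
        digest, path = line.split("  ", 1)
        entries[path] = digest.lower()
    return entries


def diff_hash_logs(source, copy):
    """Return sorted lists of (missing, extra, changed) paths."""
    source_paths, copy_paths = set(source), set(copy)
    missing = sorted(source_paths - copy_paths)  # in source, not in copy
    extra = sorted(copy_paths - source_paths)    # in copy, not in source
    changed = sorted(
        path for path in source_paths & copy_paths
        if source[path] != copy[path]
    )
    return missing, extra, changed
```

Three empty lists mean the copy matches the source log exactly. Note that the line-splitting comment is the same class of problem as the original bug: any tool that treats CR as a line or name terminator will mishandle Icon^M files.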