Mitigating bitrot and corruption in a Linux md raid-1 array

A while ago I had the misfortune of running into data corruption issues with two different motherboards, both of which corrupted data in my mdadm raid-1 array. This is an overview of how I identified the corrupt files and healed the array, which I hope will be a useful signpost for anyone else who runs into similar issues. At the end of this article I link to another post with more detail about the problems with a particular Gigabyte motherboard.

ASUS P5QL-Pro

The first, an ASUS P5QL-Pro, would occasionally misread bytes 14 or 15 (mod 16). It had been doing this for several months before I fully understood what was happening. Processes would occasionally crash for no clear reason, and a downloaded ISO turned out to be corrupt when I wrote it to a USB stick and booted from it. Around the same time I noticed that Transmission (the torrent client I was using) would occasionally report that already-downloaded files were corrupt, so I had to recheck them and download the corrupt parts again. I blamed Transmission for having buggy torrent code. Eventually I noticed that if I copied a large file, the copy's md5sum didn't match the original's. In the end I reduced the issue to running md5sum twice on the same 64GB file and observing a different checksum each time.
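The double-checksum test is easy to script. The following is a minimal sketch rather than the exact commands used at the time: the function names are made up for illustration, and it uses Python's `hashlib` instead of the md5sum binary. A second pass over a file that fits in RAM may be served from the page cache rather than the disc, so the check is most telling on a file larger than memory, like the 64GB one above.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files need little memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def repeated_digests(path, passes=2, algorithm="md5"):
    """Hash the same file several times and return every digest."""
    return [file_digest(path, algorithm) for _ in range(passes)]


def reads_are_consistent(path, passes=2, algorithm="md5"):
    """Return True if every pass over the file produced the same digest."""
    return len(set(repeated_digests(path, passes, algorithm))) == 1
```

On healthy hardware `reads_are_consistent("/path/to/large.iso")` should always return True (the path is a placeholder). A False result points at the read path rather than the file, since nothing wrote to the file between passes.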

After trying various combinations of hard discs, SATA cables, SATA ports and filesystems, I ruled out a failing hard disc, a bad SATA cable, a damaged SATA port and a Linux filesystem bug. I also used the most recent versions of md5sum and sha256sum, ran memtest86, used a SATA card instead of the onboard SATA ports, and reproduced the issue after booting from a USB stick. This eliminated a bug in md5sum, bad memory, a bad SATA controller chip and a bad Linux installation. I suspected that the motherboard might be at fault, so I swapped everything over to another motherboard. This solved the immediate problem, but...

Gigabyte GA-P35-DS4

After swapping everything over to the Gigabyte board, the system would no longer boot! GRUB failed to load the raid partition. After some investigation it turned out that the BIOS was confiscating part of the disc to store a backup of itself in a "Host Protected Area" (HPA), and there was no option to disable this feature in the BIOS! After removing the HPA and changing which SATA ports the affected discs were connected to, I was able to boot from the partition, but a chunk of data on one half of the raid-1 array had been overwritten. Raid-1 keeps two copies of the data, so losing part of one copy need not be fatal.

How bad was it?

A quick google didn't turn up many people who'd recovered from incidents like this. There didn't seem to be much support within the mdadm toolset for this kind of raid-1 recovery, nor many pre-existing tools to help.

I needed to catalogue the extent of the damage, so I wrote a programme that compares the two constituent partitions of the raid array and prints a message wherever they differ. Once I had identified the differing sectors, it was clear that the damage was confined to a few megabytes at the end of one disc.
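A comparison of this kind can be sketched in a few lines. This is not the original programme, only an illustration of the idea under some assumptions: 512-byte sectors, both members readable as ordinary files (for example `/dev/sdX1` opened as root with the array stopped), and a hypothetical function name. As far as I know each member's md superblock holds some per-device fields, so a small difference in the metadata area would be expected even on a healthy array.

```python
SECTOR_SIZE = 512  # assumed logical sector size


def differing_ranges(path_a, path_b, sector_size=SECTOR_SIZE, sectors_per_read=2048):
    """Yield (first_sector, last_sector) for each run of sectors that differ."""
    chunk_size = sector_size * sectors_per_read
    sector = 0          # index of the next sector to examine
    run_start = None    # first sector of the current run of differences
    with open(path_a, "rb") as dev_a, open(path_b, "rb") as dev_b:
        while True:
            chunk_a = dev_a.read(chunk_size)
            chunk_b = dev_b.read(chunk_size)
            if not chunk_a and not chunk_b:
                break
            if chunk_a == chunk_b:
                # Fast path: the whole chunk matches, so close any open run.
                if run_start is not None:
                    yield (run_start, sector - 1)
                    run_start = None
                sector += -(-len(chunk_a) // sector_size)  # ceiling division
                continue
            # Slow path: walk the chunk one sector at a time.
            longest = max(len(chunk_a), len(chunk_b))
            for offset in range(0, longest, sector_size):
                end = offset + sector_size
                if chunk_a[offset:end] == chunk_b[offset:end]:
                    if run_start is not None:
                        yield (run_start, sector - 1)
                        run_start = None
                elif run_start is None:
                    run_start = sector
                sector += 1
    if run_start is not None:
        yield (run_start, sector - 1)
```

Printing each range from `differing_ranges("/dev/sdX1", "/dev/sdY1")` (placeholder device names) gives a quick picture of whether the damage is scattered or confined to one region. If one input is shorter than the other, the missing tail is reported as a difference.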

It was then a simple matter to replace the corrupted data with the data from the good disc, after which the md array loaded properly!
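For illustration, copying a known-bad sector range from the good member over the damaged one could look like the sketch below. It is an assumption about how such a repair might be scripted, not the method actually used here: the function name is hypothetical, sectors are assumed to be 512 bytes, and the array is assumed to be stopped with a backup taken first, since a wrong range or the wrong direction destroys the only good copy.

```python
import os


def copy_sectors(good_path, bad_path, first_sector, last_sector, sector_size=512):
    """Overwrite sectors first..last (inclusive) of bad_path with good_path's data.

    Returns the number of bytes copied.
    """
    if first_sector < 0 or last_sector < first_sector:
        raise ValueError("invalid sector range")
    offset = first_sector * sector_size
    remaining = (last_sector - first_sector + 1) * sector_size
    copied = 0
    # "r+b" opens the damaged copy for in-place writes without truncating it.
    with open(good_path, "rb") as good, open(bad_path, "r+b") as bad:
        good.seek(offset)
        bad.seek(offset)
        while remaining > 0:
            chunk = good.read(min(remaining, 1024 * 1024))
            if not chunk:
                raise EOFError("good copy ended before the requested range")
            bad.write(chunk)
            copied += len(chunk)
            remaining -= len(chunk)
        bad.flush()
        os.fsync(bad.fileno())  # push the repaired data out to the device
    return copied
```

Trying this on image files before touching real partitions is a cheap safeguard, and a range that overlaps a member's md metadata deserves extra care rather than a blind copy.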

The next challenge was to prevent this happening again. The simplest method turned out to be changing the order in which my disc drives were connected to the SATA controllers, so that the BIOS "stole" part of a different disc for its backup. Have a look at the following post if you want a lot more detail about Gigabyte's unfortunate BIOS.

By ff

Systems software engineer with interests in C/C++/Rust on Linux, electronic music and games.
