Categories
Linux

Mitigating bitrot and corruption in a Linux md raid-1 array

A while ago I had the misfortune of running into data corruption with two different motherboards, both of which corrupted my mdadm raid-1 array. This is an overview of how I identified the corrupt files and healed the array, which I hope will be a useful signpost for anyone else who runs into similar issues. At the end of this article I link to another post with more detail about the problems with a particular Gigabyte motherboard.

ASUS P5QL-Pro

The first, an ASUS P5QL-Pro, would occasionally misread bytes 14 or 15 (mod 16). It had been doing this for several months before I fully understood what was happening. Processes would occasionally crash for no clear reason, and a downloaded iso turned out to be corrupt when written to a USB stick and booted. Around the same time I noticed that transmission (the torrent client I used) would occasionally report that already-downloaded files were corrupt, so I had to recheck them and download the corrupt parts again. I blamed transmission for having buggy torrent code. Eventually I noticed that if I copied a large file, its md5sum wouldn't match the original. In the end I reduced the problem to simply running md5sum twice on the same 64GB file and observing a different checksum each time.
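The repeated-checksum test is easy to script. What follows is only a minimal Python sketch of the idea, not the exact commands I ran; the names `file_md5` and `checksums_agree` are made up for illustration. It assumes the file is large enough (or the page cache has been dropped) that each pass really re-reads from disc instead of being served from memory.

```python
import hashlib


def file_md5(path, chunk=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def checksums_agree(path, runs=2):
    """Hash the same file `runs` times; True if every digest is identical."""
    digests = {file_md5(path) for _ in range(runs)}
    return len(digests) == 1
```

On healthy hardware `checksums_agree` should always return True for a file that isn't being modified; a False result points at the read path rather than at the file.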

After trying various combinations of hard discs, sata cables, sata ports and filesystems, I ruled out the possibility of a failing hard disc, a bad sata cable, a damaged sata port or a Linux filesystem bug. I also used the most recent version of md5sum and sha256sum, ran memtest86, used a sata card instead of the onboard sata ports, and reproduced the issue after booting from a USB stick. This eliminated software issues in md5sum, bad memory, a bad sata controller chip and a bad Linux installation. I suspected that the motherboard might be at fault, so I swapped everything over to another motherboard. This solved the immediate problem but...

Gigabyte GA-P35-DS4

After swapping everything over to the Gigabyte board, it would no longer boot! Grub failed to load the raid partition. After some investigation it turned out that the BIOS was confiscating part of the disc in order to store a backup of itself in a "Host Protected Area" (HPA), and there was no option to disable this feature in the BIOS! After removing the HPA and changing the SATA ports the affected discs were connected to, I was able to boot from the partition, but a chunk of data on one half of the raid-1 array had been overwritten. Raid-1 keeps two copies of the data, so losing part of one copy need not be fatal.

How bad was it?

A quick google didn't reveal many people who'd recovered from incidents like this. There didn't seem to be much support within the mdadm toolset for this kind of raid-1 recovery, nor many pre-existing tools to help.

I needed to catalogue the extent of the damage, so I wrote a programme that compares the two constituent partitions of the raid array and prints a message wherever they differ. Once I had identified the sectors that differed, it was clear that the damage was confined to a few megabytes at the end of one disc.
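A comparison along those lines might look like the following Python sketch. It is an illustration rather than the original programme: the function name `differing_ranges` and the 1 MiB chunk size are assumptions, and it reads the two paths as plain files (reading a partition such as a /dev/sdX device this way normally needs root). Expect the md superblock area of each member to differ even on a healthy array, since it holds per-device metadata.

```python
import sys

CHUNK = 1024 * 1024  # compare 1 MiB at a time (arbitrary choice)


def differing_ranges(path_a, path_b, chunk=CHUNK):
    """Yield (start, end) byte offsets where the two inputs differ.

    Adjacent differing chunks are merged into one range; `end` is exclusive.
    Granularity is one chunk, so a range may include some matching bytes.
    """
    start = None
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk)
            block_b = b.read(chunk)
            if not block_a and not block_b:
                break
            if block_a != block_b:
                if start is None:
                    start = offset
            elif start is not None:
                yield (start, offset)
                start = None
            offset += max(len(block_a), len(block_b))
    if start is not None:
        yield (start, offset)


if __name__ == "__main__" and len(sys.argv) == 3:
    for lo, hi in differing_ranges(sys.argv[1], sys.argv[2]):
        print(f"mismatch: bytes {lo}-{hi - 1} ({hi - lo} bytes)")
```

Run it with the two member partitions as arguments while the array is stopped, so that nothing is writing to either side during the comparison.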

It was then a simple matter to replace the corrupted data with data from the good disc, and the md array would load properly!
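For illustration, copying a known-bad byte range from the good member to the damaged one could be sketched as below. This is a hedged example, not the tool I used: `copy_range` is a hypothetical helper, and it assumes the array is stopped, the HPA has already been removed so the whole disc is visible, and you have double-checked which side is the good one. Writing to the wrong device destroys the surviving copy.

```python
def copy_range(good_path, bad_path, start, end, chunk=1024 * 1024):
    """Copy bytes [start, end) from good_path over the same offsets in bad_path.

    Returns the number of bytes actually copied.
    """
    copied = 0
    with open(good_path, "rb") as good, open(bad_path, "r+b") as bad:
        good.seek(start)
        bad.seek(start)
        remaining = end - start
        while remaining > 0:
            block = good.read(min(chunk, remaining))
            if not block:
                break  # source ended early; stop rather than guess
            bad.write(block)
            copied += len(block)
            remaining -= len(block)
    return copied
```

Trying this on image files of the two partitions first is a cheap way to confirm the offsets before touching the real discs.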

The next challenge was to prevent this happening again. The simplest method ended up being to change the order in which my disc drives were connected to the sata controllers, so that the BIOS "stole" part of a different disc for its backup. Have a look at the following post if you want a lot more detail about Gigabyte's unfortunate BIOS.

Warning about large hard discs, GPT, and Gigabyte Motherboards such as GA-P35-DS4

Or, Gigabyte BIOS considered harmful

After changing the motherboard, a computer became unbootable because the Gigabyte BIOS created a Host Protected Area (HPA) using sectors already allocated to a disc partition, corrupting the partition table and overwriting data.

What Gigabyte are trying to do is store a copy of the BIOS onto disc in order to secure the computer against viruses. The process is explained over here in the section titled "GIGABYTE Xpress BIOS Rescue™ Technology" (and is referred to in the specification of the motherboard as "Virtual Dual BIOS"). At boot time, a copy of the BIOS is stored into an unused section of the disc by creating a Host Protected Area. If a virus corrupts the BIOS then it can apparently detect this and restore the safe copy of the BIOS from the disc. Magical!

This process goes badly wrong when you have discs larger than 2TB. The traditional partition table format, which dates back to the early 80s, runs into a limit when discs reach 2TB. A new partition table format was created called GUID Partition Table (GPT) which handles much larger discs.
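The 2TB figure follows from the traditional (MBR) partition table storing sector addresses and sizes in 32-bit fields. A quick check of the arithmetic, assuming the usual 512-byte logical sector:

```python
SECTOR_BYTES = 512   # assumed logical sector size
MBR_LBA_BITS = 32    # width of the start/length fields in an MBR entry

max_sectors = 2 ** MBR_LBA_BITS
max_bytes = max_sectors * SECTOR_BYTES

print(max_sectors)                     # 4294967296
print(max_bytes)                       # 2199023255552
print(max_bytes / 2 ** 40)             # 2.0  (TiB)
print(round(max_bytes / 10 ** 12, 2))  # 2.2  (TB, decimal)
```

A 3TB disc therefore cannot be fully described by the old format, which is why such discs end up with a GPT.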

If you have certain Gigabyte motherboards with the Virtual Dual BIOS feature, then as the computer boots it will try to create an HPA on the first hard disc and save the current BIOS there. One assumes that if the partition table indicates that the disc is already full, then the BIOS will avoid creating this HPA. What seems to be happening with large hard discs with GPT is that the BIOS doesn't understand the partition table, doesn't realise that all the space on the disc is already accounted for, then grabs some of the space from the end of the disc and overwrites it with a copy of the BIOS.

Once the HPA has been created, the size of the disc that is reported to the OS changes. This means that the values stored in the GPT structures no longer match those reported by the disc, which can make it impossible to boot from the disc. That is what happened to me.
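To make the mismatch concrete: a GPT keeps its backup header in the very last sector of the disc and records the last sector that partitions may use. The sketch below checks those two values against the size the disc reports, using the sector counts that gdisk and hdparm print further down this post. The function name `gpt_problems` is made up for illustration, and this is only a small subset of what a real tool such as gdisk verifies.

```python
def gpt_problems(reported_sectors, backup_header_lba, last_usable_lba):
    """Return a list of mismatches between GPT fields and the reported disc size."""
    problems = []
    last_lba = reported_sectors - 1  # sectors are numbered from 0
    if backup_header_lba != last_lba:
        problems.append("backup header is not in the last sector")
    if last_usable_lba > last_lba:
        problems.append("usable area extends past the end of the disc")
    return problems


# Values taken from the gdisk output for the affected disc.
BACKUP_HEADER = 5860533167
LAST_USABLE = 5860533134

print(gpt_problems(5860533168, BACKUP_HEADER, LAST_USABLE))  # original size
print(gpt_problems(5860531055, BACKUP_HEADER, LAST_USABLE))  # size with HPA
```

With the original size the list is empty; with the reduced size both checks fail, even though nothing in the GPT itself has changed.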

How do I tell if I have this problem?

You are likely to see this problem if you have a disc larger than 2TB that you partitioned with a GPT on a different motherboard and then attached to the Gigabyte motherboard as the first hard disc.

In my case, I had a 1.5TB disc with a /boot partition and two 3TB discs in raid-1 via Linux software raid (mdadm). I had partitioned these discs on a different motherboard and the computer booted without problems, but when I switched to the Gigabyte motherboard it would no longer boot. Specifically, I would receive a grub error like this:

error: disk 'mduuid/3c620ba3b6ebc2ba2dec4bdc61f7191b' not found.
Entering rescue mode...
grub rescue>

When I booted from a usb stick and attempted to mount the raid partition it was only able to load one of the discs:

ubuntu@ubuntu:~$ sudo mdadm --assemble /dev/md0
mdadm: /dev/md0 has been started with 1 drive (out of 2).
ubuntu@ubuntu:~$

I then ran gdisk to inspect each of the two discs forming the raid array. It printed various errors about the disc that the BIOS had interfered with:

ubuntu@ubuntu:~$ sudo gdisk /dev/sdb
GPT fdisk (gdisk) version 0.8.8

Warning! Disk size is smaller than the main header indicates! Loading
secondary header from the last sector of the disk! You should use 'v' to
verify disk integrity, and perhaps options on the experts' menu to repair
the disk.
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.

Warning! One or more CRCs don't match. You should repair the disk!

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: damaged

****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************

Command (? for help):

If one performs the recommended verification step, 5 errors are detected:

Command (? for help): v

Caution: The CRC for the backup partition table is invalid. This table may
be corrupt. This program will automatically create a new backup partition
table when you save your partitions.

Problem: The secondary header's self-pointer indicates that it doesn't reside
at the end of the disk. If you've added a disk to a RAID array, use the 'e'
option on the experts' menu to adjust the secondary header's and partition
table's locations.

Problem: Disk is too small to hold all the data!
(Disk size is 5860531055 sectors, needs to be 5860533168 sectors.)
The 'e' option on the experts' menu may fix this problem.

Problem: GPT claims the disk is larger than it is! (Claimed last usable
sector is 5860533134, but backup header is at
5860533167 and disk size is 5860531055 sectors.
The 'e' option on the experts' menu will probably fix this problem

Problem: partition 2 is too big for the disk.

Identified 5 problems!

Command (? for help):

In fact there is only one underlying error, which is that part of the disc has been co-opted by the BIOS. This can be seen with hdparm (contrast sdb, which has been corrupted, with sdd, which is in its original state):

ubuntu@ubuntu:~$ sudo hdparm -N /dev/sdb

/dev/sdb:
 max sectors   = 5860531055/5860533168, HPA is enabled
ubuntu@ubuntu:~$ sudo hdparm -N /dev/sdd

/dev/sdd:
 max sectors   = 5860533168/5860533168, HPA is disabled
ubuntu@ubuntu:~$
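The two figures on the sdb line give the size of the hidden area directly. A quick calculation, assuming 512-byte logical sectors (which is consistent with a 3TB disc reporting this many sectors):

```python
SECTOR_BYTES = 512  # assumed logical sector size

visible = 5860531055  # sectors the OS can see with the HPA in place
native = 5860533168   # sectors the disc really has

hidden = native - visible
print(hidden)                                     # 2113
print(hidden * SECTOR_BYTES)                      # 1081856
print(round(hidden * SECTOR_BYTES / 2 ** 20, 2))  # 1.03 (MiB)
```

So the BIOS has taken roughly one mebibyte from the end of the disc, which is exactly where the backup GPT header and partition table live.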

How do I fix the problem?

You must either live with the HPA, or remove it and ensure you never again boot with the GPT disc as the primary disc. Even if you decide to live with the HPA, you will need to remove it temporarily so that you can back up your data or resize the partition and filesystem.

So whichever option you choose, you will need to disable the HPA and make the entire disc available. This can be done with hdparm, by setting the number of visible sectors to the disc's full sector count with the parameter "-N p<full amount of sectors>" as below:

ubuntu@ubuntu:~$ sudo hdparm -N /dev/sdb

/dev/sdb:
 max sectors   = 5860531055/5860533168, HPA is enabled
ubuntu@ubuntu:~$ sudo hdparm -N p5860533168 /dev/sdb

/dev/sdb:
 setting max visible sectors to 5860533168 (permanent)
 max sectors   = 5860533168/5860533168, HPA is disabled
ubuntu@ubuntu:~$

Now that this has been done, the GPT partition table will correspond to the number of sectors reported by the disc (although some data has been unavoidably lost when the BIOS overwrote the end of the disc). However, if you reboot and this disc is still the primary disc, the BIOS will just create the HPA again!
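Because the HPA can come back after any reboot, it may be worth scripting the check. The sketch below parses the text that `hdparm -N` prints (in the format shown in the transcripts above) and, if an HPA is present, builds the same `-N p<sectors>` command used above. The helper name `hpa_restore_command` is hypothetical, and the sketch deliberately only returns the command instead of running it, since setting the sector count is a destructive operation if the number is wrong.

```python
import re

# Matches e.g. " max sectors   = 5860531055/5860533168, HPA is enabled"
_MAX_SECTORS = re.compile(r"max sectors\s*=\s*(\d+)/(\d+)")


def hpa_restore_command(hdparm_output, device):
    """Return the hdparm argument list that removes the HPA, or None if none is set."""
    match = _MAX_SECTORS.search(hdparm_output)
    if match is None:
        raise ValueError("no 'max sectors' line found in hdparm output")
    visible, native = (int(group) for group in match.groups())
    if visible == native:
        return None  # nothing hidden, nothing to do
    return ["hdparm", "-N", f"p{native}", device]


sample = "/dev/sdb:\n max sectors   = 5860531055/5860533168, HPA is enabled\n"
print(hpa_restore_command(sample, "/dev/sdb"))
```

Feeding it the sdd output from earlier returns None, since that disc has no HPA.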