Non-destructively testing repairs to damaged filesystems

Let's say you have a damaged, unmountable btrfs filesystem. You run btrfs check which tell you about some problems, but the documentation strongly advises you not to run btrfs check --repair.

WARNING: the repair mode is considered dangerous and should not be used without prior analysis of problems found on the filesystem.

While btrfs check tells you the problems it finds, it does not tell you what changes it would make. It would be great if "repair" had a mode that would tell you what changes it would make, without committing yourself to those changes. Sadly there is no such mode - so what options do we have?

Brute force?

The brute force approach is to back up the devices comprising the damaged btrfs filesystem to another location, so that they could be restored if btrfs check --repair doesn't work out. This could be a lot of data to backup and restore, especially if the filesystem consists of multiple volumes comprising multiple terabytes of data.

What about...

Instead of brute force we could use virtualised discs and a hypervisor. Qemu's qcow2 disc format lets us set up virtual discs with a backing volume set to the damaged drive/partition. This means that virtual disc reads come from the backing volume, but any writes are made to the virtual disc rather than the backing volume. This means that we don't need huge quantities of storage to back up the data, and if we don't like the changes that were made by "repair" or other tools we can easily reset by deleting the virtual discs or by using qemu's snapshot feature.

This copy-on-write technique can be used not just with btrfs and "btrfs check --repair" but with other filesystems and other tools to repair those filesystems. For example, were you to edit some of the filesystem structure with a programme you'd written yourself, as I talk about in another post.

A complication with btrfs filesystems is that a device id is stored inside volumes of a filesystem. When mounting/checking btrfs will scan all discs and partitions to find devices, and mount them if the device ids match, so you do not want both the original disc and the virtual disc to be visible. You do this inside a virtual machine simply by ensuring only the virtual discs are mapped into the virtual machine. That is, don't map in the original file systems as raw discs!

Setting up a kvm virtual machine

You'll need a virtual machine to mount these virtual filesystems. If you don't already have a virtual machine set up, you'll need to create one. As we're using qemu virtual discs, this means you need to use the kvm/qemu/libvirtd hypervisor stack. There are several ways to set up a virtual machine and many good guides on the web about how to do this, so I won't repeat them here but I'll just say that on my Ubuntu system I used uvtool to quickly set up the VM to test my filesystem changes.

Setting up copy-on-write virtual discs

The qemu-img command creates these virtual filesystems. If a btrfs filesystem comprises multiple discs or partitions, we need to create a separate virtual disc for each physical disc/partition.

The following command creates a virtual disc that uses /dev/sdd2 as its backing file, and any changes to this virtual disc are written to the file ssd2_cow.qcow2.

qemu-img create -f qcow2 -o backing_file=/dev/sdd2 -o backing_fmt=raw sdd2_cow.qcow2

I ran similar comands once for each underlying disc/partition that was part of the btrfs filesystem.

To attach one of these discs disc to an existing VM the basic command is:

virsh attach-disk <VM NAME> /home/user/sdd2_cow.qcow2 vde --config --subdriver qcow2

However on ubuntu, I ran into some (reasonable) restrictions imposed by apparmor. I would receive an error attempting to attach the virtual discs, with an error from apparmor showing up in the logs:

[56880.068334] audit: type=1400 audit(1673655958.288:119): apparmor="DENIED" operation="open" profile="libvirt-c712b749-0f68-413f-9bd7-76ea061808eb" name="/dev/sde2" pid=58733 comm="qemu-system-x86" requested_mask="r" denied_mask="r" fsuid=64055 ouid=64055

The problem is that the hypervisor is by default prevented from accessing the host's raw discs. When the hypervisor attempts to read the backing file (or later, to lock it) apparmor prevents this. We solve this by adding lines to /etc/apparmor.d/local/abstractions/libvirt-qemu for each raw disc or partition that backs one of our virtual discs.

/dev/sde2 rk,

This line allows libvirt/qemu to open the raw disc for read access (r), and allows the raw disc to be locked (k). We do not grant write access to libvirt/qemu so this provides an extra line of defence against unintended changes to our original filesystems.

Once apparmor was configured, I could attach these discs to my virtual machine, and access the virtual devices at locations such as /dev/vde.

Final thoughts

With this or similar techniques you can safely test fixes to damaged filesystems without risking the underlying data. Once you're happy that your fix will have the desired effect, you can apply it to the real filesystem.


Mitigating bitrot and corruption in a linux md raid-1 array

A while ago I had the misfortune of running into data corruption issues with two different motherboards both of which caused corruption in my mdadm raid-1 array. This is the overview of how I identified corrupt files and healed the array, which I hope may provide a useful signpost to anyone else who runs into similar issues. At the end of this article I link to another post where I provide more detail about the problems with a particular Gigabyte motherboard.


The first, an ASUS P5QL-Pro would occasionally misread bytes 14 or 15 (mod 16). It had been doing this for several months before I gained a full understanding of what was happening. I experienced occasional processes crashing for no clear reason, a downloaded iso which,when burnt to a USB stick and booted, was corrupt. I had also noticed around the same time, that transmission (the torrent client I had used) would occasionally state that already downloaded files were corrupt, so I would have to recheck them and download any corrupt parts. I blamed transmission for having buggy torrent code. Eventually I noticed that if I copied a large file, its md5sum wouldn't be the same as the original. I was able to simplify the issue in the end merely by running md5sum twice on the same 64GB file and observing different checksums each time.

After trying various combinations of hard discs, sata cables, sata ports and filesystems, I ruled out the possibility of a failing hard disc, a bad sata cable, a damaged sata port or a Linux filesystem bug. I also used the most recent version of md5sum and sha256sum, ran memtest86, used a sata card instead of the onboard sata ports, and reproduced the issue after booting from a usb stick. This eliminated software issues in md5sum, bad memory, a bad sata controller chip and a bad Linux installation. I suspected that the motherboard might be at fault, so I swapped everything over to another motherboard. This solved the immediate problem but..

Gigabyte GA-P35-DS4

After swapping everything over to the Gigabyte board, it would no longer boot! Grub failed to load the raid partition. After some investigation it turned out that the BIOS was confiscating part of the disc in order to store a backup of itself in a "Host Protected" Area and there was no option to disable this feature in the BIOS! After removing the HPA and changing the SATA ports the affected discs were connected to, I was able to boot from the partition, but a chunk of data on one half of the raid-1 array had been overwritten. Raid-1 has two copies of data, so losing one of them might not be fatal.

How bad was it?

A quick google didn't reveal many people who'd recovered from incidents like this but there didn't seem to be much support within the mdadm toolset for raid-1 recovery, and there didn't seem to be many pre-existing tools to help.

I needed to catalogue the extent of the damage, so I wrote a programme to compare the two constituent partitions that made up the raid array and printed a message when they differ. After identifying the sectors that differed, it was clear that it was just a few megabytes at the end of 1 disc.

It was then a simple matter to replace the corrupted data with data from the good disc, and the md array would load properly!

The next challenge was to prevent this happening again, and the simplest method ended up being changing the order that my disc drives were connected to the sata controllers, so that the bios "stole" some of a different disc for its bios backup. Have a look at the following post if you want a lot more detail about Gigabyte's unfortunate bios.


Warning about large hard discs, GPT, and Gigabyte Motherboards such as GA-P35-DS4

Or, Gigabyte BIOS considered harmful

After changing the motherboard, a computer became unbootable because the Gigabyte BIOS created a Host Protected Area (HPA) using sectors already allocated to a disc partition, corrupting the partition table and overwriting data.

What Gigabyte are trying to do is store a copy of the BIOS onto disc in order to secure the computer against viruses. The process is explained over here in the section titled "GIGABYTE Xpress BIOS Rescue™ Technology" (and is referred to in the specification of the motherboard as "Virtual Dual BIOS"). At boot time, a copy of the BIOS is stored into an unused section of the disc by creating a Host Protected Area. If a virus corrupts the BIOS then it can apparently detect this and restore the safe copy of the BIOS from the disc. Magical!

This process goes badly wrong when you have discs larger than 2TB. The traditional partition table format, which dates back to the early 80s, runs into a limit when discs reach 2TB. A new partition table format was created called GUID Partition Table (GPT) which handles much larger discs.

If you have certain Gigabyte motherboards with the Virtual Dual BIOS feature, then as the computer boots it will try and create an HPA on the first hard disc, and save the current BIOS there. One assumes that if the partition table indicates that the disc is already full, then the BIOS will avoid creating this HPA. What seems to be happening with large hard discs with GPT, is that the BIOS doesn't understand the partition table, doesn't realise that the all space on the disc is entirely accounted for already, then grabs some of the space from the end of the disc and overwrites it with a copy of the BIOS.

Once the HPA has been created, the size of the disc that is reported to the OS changes. This means that the values stored in the GPT structures don't match those reported by the disc so may make it impossible to boot from the disc - which is what happened to me.

How do I tell if I have this problem?

You are likely to see this problem if you have a disc larger than 2TB that you partitioned with a GPT on a different motherboard then attached to the Gigabyte motherboard as the first hard disc.

For me, I had a 1.5 TB disc with a /boot partition and 2 3TB discs in raid-1 via Linux software raid (mdadm). I had partitioned these discs on a different motherboard and the computer would boot without problems, however when I switched to the Gigabyte motherboard it would no longer boot. Specifically, I would receive a grub error like this:

error: disk 'mduuid/3c620ba3b6ebc2ba2dec4bdc61f7191b' not found.
Entering rescue mode...
grub rescue>

When I booted from a usb stick and attempted to mount the raid partition it was only able to load one of the discs:

ubuntu@ubuntu:~$ sudo mdadm --assemble /dev/md0
mdadm: /dev/md0 has been started with 1 drive (out of 2).

I then ran gdisk to inspect each of the two discs forming the raid array, gdisk printed various errors about the disc that the BIOS had interfered with:

ubuntu@ubuntu:~$ sudo gdisk /dev/sdb
GPT fdisk (gdisk) version 0.8.8

Warning! Disk size is smaller than the main header indicates! Loading
secondary header from the last sector of the disk! You should use 'v' to
verify disk integrity, and perhaps options on the experts' menu to repair
the disk.
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.

Warning! One or more CRCs don't match. You should repair the disk!

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: damaged

Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.

Command (? for help):

If one performs the recommended verification step, 5 errors are detected:

Command (? for help): v

Caution: The CRC for the backup partition table is invalid. This table may
be corrupt. This program will automatically create a new backup partition
table when you save your partitions.

Problem: The secondary header's self-pointer indicates that it doesn't reside
at the end of the disk. If you've added a disk to a RAID array, use the 'e'
option on the experts' menu to adjust the secondary header's and partition
table's locations.

Problem: Disk is too small to hold all the data!
(Disk size is 5860531055 sectors, needs to be 5860533168 sectors.)
The 'e' option on the experts' menu may fix this problem.

Problem: GPT claims the disk is larger than it is! (Claimed last usable
sector is 5860533134, but backup header is at
5860533167 and disk size is 5860531055 sectors.
The 'e' option on the experts' menu will probably fix this problem

Problem: partition 2 is too big for the disk.

Identified 5 problems!

Command (? for help):

In fact there is only 1 error, which is that part of the disc has been co-opted by the BIOS. This can be seen with hdparm (contrast sdb which has been corrupted with sdd which is in its original state):

ubuntu@ubuntu:~$ sudo hdparm -N /dev/sdb

 max sectors   = 5860531055/5860533168, HPA is enabled
ubuntu@ubuntu:~$ sudo hdparm -N /dev/sdd

 max sectors   = 5860533168/5860533168, HPA is disabled


How do I fix the problem?

You must either live with the HPA, or remove it and ensure you never again boot with a the GPT disc as primary. Even if you decide to live with the HPA you will need to temporarily remove it so that you can backup your data or resize the partition and filesystem.

So regardless of what you choose to deal with this problem, you will need to disable the HPA and make the entire disc available. This can be done with hdparm, by setting the visible sectors to the full amount with the parameter "-N p<full amount of sectors>" as below:

ubuntu@ubuntu:~$ sudo hdparm -N /dev/sdb

 max sectors   = 5860531055/5860533168, HPA is enabled
ubuntu@ubuntu:~$ sudo hdparm -N p5860533168 /dev/sdb

 setting max visible sectors to 5860533168 (permanent)
 max sectors   = 5860533168/5860533168, HPA is disabled

Now that this has been done, the GPT partition table will correspond to the number of sectors reported by the disc (although some data has been unavoidably lost when the BIOS overwrote the end of the disc). However if you reboot and this disc is still the primary disc then the BIOS will just create the HPA again!