Ever since buying my first digital camera in the 1990s I’ve loved the idea of building a “master archive” of all my digital media. Over the years that archive has expanded to include music, audiobooks, video, backups of retired computers, and much more. It didn’t take long for my files to expand beyond the scope of a single hard drive, and from that moment my interest in multi-disk arrays and file systems was born.
Over the years I’ve experimented with a variety of hardware and software RAID systems. Everything I tried had annoying quirks and frustrating limitations that eventually forced me to start over with a blank array. Then I discovered BTRFS and quickly realized that my ideal file system had finally become a reality.
The Ideal File System
An ideal file system should have the following characteristics:
Free and Open Source. Anything that depends on closed-source code is tied to a single maintainer for fixes and updates over the years. If they go out of business, sell their technology to someone else, or simply change their tech strategy, users may find themselves unable to get support when they run into problems down the road.
Simple and foolproof. The more complicated a file system is, the more likely it is to be set up or used incorrectly, potentially with dire consequences.
Portable. A disk array should be portable enough to pull out of one machine and move easily to another. This is particularly problematic with proprietary solutions like the ones often found in hardware RAID setups. If the controller fails and an identical controller is no longer available, you’re out of luck.
Efficient use of space. A multi-disk filesystem should ideally give access to as much space as possible without compromising redundancy. Most RAID systems require that all disks in an array be the same size to achieve this. That requirement can be very limiting when trying to expand an array over time. An ideal filesystem will make efficient use of whatever storage you give it.
Error-correcting. Modern hard drives hold a truly staggering amount of data. Over time data corruption is bound to occur, for reasons ranging from electrical surges, physical impact, friction, and heat to cosmic rays. An ideal filesystem should assume that errors will appear periodically and be able to identify and correct them without data loss.
BTRFS is the closest thing we have to my ideal file system. It’s not perfect but it’s close, and getting closer with each new release.
Hands-on with BTRFS
I set up a Raspberry Pi 3 with a set of drives to use as a BTRFS testing station. Six drives are connected via USB2. This kind of arrangement represents an extremely low-budget configuration of old, mismatched drives connected over a slow bus to an underpowered device. If BTRFS performs well on this hardware it will perform well anywhere.
The drive list:
[0:0:0:0] disk HP External HDD 2002 /dev/sda (500GB)
[1:0:0:0] disk TOSHIBA External USB 3.0 0 /dev/sdb (1000GB)
[2:0:0:0] disk ST375064 0AS /dev/sdc (750GB)
[3:0:0:0] disk Generic External 1.04 /dev/sdd (320GB)
[4:0:0:0] disk WDC WD10 EADS-00L5B1 /dev/sde (1000GB)
[4:0:0:1] disk WDC WD10 EACS-00ZJB0 /dev/sdf (1000GB)
Before beginning this exercise, I wiped each drive:

dd if=/dev/zero of=/dev/sdX
Create a BTRFS filesystem on /dev/sda:

mkfs.btrfs /dev/sda
Mount the new filesystem and copy a few files over to it. Copying files is not strictly needed but it serves to showcase how BTRFS can be manipulated while mounted and active.
mount /dev/sda /tank
I’ve copied over a video file of Big Buck Bunny, which is 277MB.
Take a look at the filesystem data:
btrfs filesystem df /tank
Data, single: total=1.01GiB, used=271.76MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=400.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B
The Data line shows that data written to the drive will use a single copy, meaning there is no redundancy. Since there’s only a single drive in use there’s not much point in keeping multiple copies of everything since the drive hardware still provides a single point of failure. However, Metadata is configured to maintain a duplicate copy of metadata (“DUP”) even in a single-drive configuration. Since metadata takes up a small fraction of the drive space it’s considered worthwhile to protect against random corruption.
Notice the Data line reports a total of 1.01 GiB despite this being a 500GB drive. Unlike most RAID systems, BTRFS applies RAID levels to block groups rather than entire devices. The default block group size for data is 1 GiB. When I copied a file over to the drive, BTRFS allocated its first 1 GiB block group to store the data. As more data is copied, more block groups will be allocated. This method allows multiple block groups to span devices of different sizes efficiently.
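The block-group mechanism can be sketched in a few lines of Python. This is a toy model rather than btrfs code — the function name and the one-GiB granularity are purely illustrative:

```python
# Toy model of how btrfs allocates 1 GiB data block groups in the "single"
# profile: each new block group lands on the drive with the most
# unallocated space. (The names here are illustrative, not btrfs APIs.)
def allocate_block_groups(free_gib, data_gib):
    free = list(free_gib)
    placements = []
    for _ in range(data_gib):           # one 1 GiB block group at a time
        target = free.index(max(free))  # drive with the most free space
        free[target] -= 1
        placements.append(target)
    return placements

# With a 500 GB and a 1000 GB drive, the first several hundred block
# groups all land on the larger drive before allocation starts alternating.
print(allocate_block_groups([500, 1000], 3))  # -> [1, 1, 1]
```

Because allocation happens per block group rather than per device, mismatched drives fill up in proportion to their free space instead of being truncated to the smallest member.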
The flexibility of BTRFS can make it difficult to determine exactly how much free space is available. BTRFS provides a utility to break down filesystem usage to help address this problem:
btrfs filesystem usage /tank
For a full explanation of free space reporting, consult the BTRFS wiki.
Let’s add a second drive and convert the filesystem into a multi-drive array.
btrfs device add /dev/sdb /tank
The new disk is now part of the pool and new space is available. The filesystem is now the equivalent of a traditional RAID group in span mode. BTRFS allows us to change raid levels to add redundancy on the fly. Let’s convert the array into a mirror, or RAID1 group. In BTRFS, the file data and metadata can each have different RAID levels. Use the following command to convert both to RAID1.
btrfs balance start -dconvert=raid1 -mconvert=raid1 /tank
Now let’s take another look at the filesystem data:
btrfs filesystem df /tank
Data, RAID1: total=1.00GiB, used=271.76MiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=400.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
Since we’ve requested RAID1 and the two drives are different sizes, BTRFS makes as much drive space available as possible while guaranteeing redundancy. That means BTRFS will mirror up to the size of the smaller drive and leave the remaining space on the larger drive unallocated.
You can use the BTRFS filesystem calculator to find out in advance how much space any particular configuration will give you.
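For RAID1 specifically, the calculator's result can be approximated with a simple rule: every chunk needs a second copy on a different drive, so usable space is capped both at half the pool and at the total space outside the largest drive. A sketch of that rule (my own formula, not a btrfs tool):

```python
# Approximate usable space for btrfs RAID1 (two copies of every chunk)
# over mixed-size drives. Capped at half the pool, and every chunk's
# mirror must fit somewhere outside the drive holding the first copy.
# Illustrative formula only, in whole-GB units.
def raid1_usable(sizes_gb):
    return min(sum(sizes_gb) // 2, sum(sizes_gb) - max(sizes_gb))

print(raid1_usable([500, 1000]))  # -> 500: mirrors only up to the smaller drive
```

With a third drive in the mix the cap shifts: for 500/750/1000 GB drives the half-pool limit (1125 GB) is the binding one, so RAID1 over three mixed drives can actually exceed the smallest drive's size.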
Let’s add another drive.
btrfs device add /dev/sdc /tank
Now we have a 3 drive array. The “filesystem df” command won’t list the actual drives, but “filesystem show” does.
btrfs filesystem show
Label: none  uuid: 81cb3d1d-f9e3-4082-8b5c-35d7245ed479
        Total devices 3 FS bytes used 272.17MiB
        devid 1 size 465.11GiB used 2.03GiB path /dev/sda
        devid 2 size 931.51GiB used 2.03GiB path /dev/sdb
        devid 3 size 698.64GiB used 0.00B path /dev/sdc
We have even more options available with a 3 drive array. Let’s convert the data to RAID5. We’ll keep the metadata at RAID1 since it takes up so little space.
btrfs balance start -dconvert=raid5 /tank
btrfs filesystem df /tank
Data, RAID5: total=2.00GiB, used=272.01MiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=400.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
This is where BTRFS really gets interesting. Most RAID systems operate on entire disks and require matching disk sizes to create an array. If the disks differ in size, the system truncates the larger ones to match the smallest disk in the set.
The three disks in my array are 500GB, 750GB, and 1000GB. If I were using ZFS, md, or a hardware RAID controller, the system would treat the array as three 500GB drives and create a RAID5 array with 1000GB of available space. Since BTRFS operates on smaller block groups, it can build a three-disk RAID5 using 500GB of each drive, then create a second two-disk RAID5 (essentially a RAID1) over the remaining space on the two larger drives. The result is a combined RAID5 array of 1250GB over the three drives.
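The chunk-level allocation described above can be modeled with a short greedy simulation. This is my own simplified model of the allocator (the function name and GB units are illustrative, not btrfs code), but it reproduces the calculator's numbers for the arrays in this article:

```python
# Simplified model of btrfs chunk allocation over mixed-size drives for
# the striped profiles (RAID5: parity=1, RAID6: parity=2). Each pass
# stripes chunks across every drive that still has free space; a stripe
# needs at least parity+1 drives. Illustrative only, in whole-GB units.
def flexible_usable(sizes_gb, parity=1):
    drives = sorted(sizes_gb)
    usable = 0
    while len(drives) > parity:
        smallest = drives[0]
        # Stripe 'smallest' GB across all remaining drives; all but
        # 'parity' of each stripe holds data.
        usable += smallest * (len(drives) - parity)
        # Consume that space from every drive and drop the ones now full.
        drives = [d - smallest for d in drives if d > smallest]
    return usable

print(flexible_usable([500, 750, 1000]))  # -> 1250, as described above
```

The same function predicts the capacities quoted later in this article as drives are added and swapped.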
BTRFS is the only free, open source RAID system that allows this kind of flexibility. It allows you to essentially cobble together a fault-tolerant array using whatever drives you have lying around. When it’s time to add a new drive you can buy whatever’s available and incorporate it into your array without having to rebuild the array or even take it offline.
Let’s add another disk:
btrfs device add /dev/sdd /tank
The new disk is 320GB. Now we’ve got 4 drives, all of different sizes.
Label: none  uuid: 81cb3d1d-f9e3-4082-8b5c-35d7245ed479
        Total devices 4 FS bytes used 272.42MiB
        devid 1 size 465.11GiB used 2.03GiB path /dev/sda
        devid 2 size 931.51GiB used 2.03GiB path /dev/sdb
        devid 3 size 698.64GiB used 1.00GiB path /dev/sdc
        devid 4 size 298.09GiB used 0.00B path /dev/sdd
Notice that the new disk shows that no storage is being used. There’s no need to restripe the array when adding a new disk, and the operation completed in a few seconds. I can start using the new space immediately, or I can start a balance operation to spread the data evenly onto the new disk.
btrfs balance start /tank
Using a traditional RAID system that operates on the disk level I would have 960GB of space available. With BTRFS I have 1570GB.
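The traditional figure is easy to derive: a whole-disk RAID5 treats every member as if it were the smallest disk, so one disk's worth of capacity goes to parity and the rest is truncated away.

```python
# Where the 960GB figure comes from: a whole-disk RAID5 caps every
# member at the smallest disk's size, then spends one disk on parity.
sizes_gb = [320, 500, 750, 1000]
traditional_raid5 = min(sizes_gb) * (len(sizes_gb) - 1)
print(traditional_raid5)  # -> 960
```

The 610GB difference is space a whole-disk RAID5 simply cannot reach on the three larger drives.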
Let’s say I only have 4 slots available for drives and I want to get rid of the 320GB drive to make room for a new larger drive. As long as you have enough free space available on the remaining drives, BTRFS will let you remove a drive and redistribute the data to preserve parity.
btrfs device delete /dev/sdd /tank
Label: none  uuid: 81cb3d1d-f9e3-4082-8b5c-35d7245ed479
        Total devices 3 FS bytes used 272.43MiB
        devid 1 size 465.11GiB used 1.00GiB path /dev/sda
        devid 2 size 931.51GiB used 2.03GiB path /dev/sdb
        devid 3 size 698.64GiB used 2.03GiB path /dev/sdc
Now that we’re back to three drives I can add a new larger one…
btrfs device add /dev/sde /tank
Label: none  uuid: 81cb3d1d-f9e3-4082-8b5c-35d7245ed479
        Total devices 4 FS bytes used 272.43MiB
        devid 1 size 465.11GiB used 1.00GiB path /dev/sda
        devid 2 size 931.51GiB used 2.03GiB path /dev/sdb
        devid 3 size 698.64GiB used 2.03GiB path /dev/sdc
        devid 4 size 931.51GiB used 0.00B path /dev/sde
Again, I can rebalance the array but it’s completely optional.
Now that I have incorporated a larger drive, BTRFS gives me even more usable space: 2250GB.
This flexibility means you can choose the highest RAID level available for the amount of space you actually need in your array. If I knew I only needed 1000GB at the moment, I could convert to RAID6 for two disk redundancy. Once my array fills up I can revert to RAID5 to make more space available. Or I can replace a smaller drive, or add a new one.
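The RAID6 option can be estimated with the same chunk-level arithmetic as before, just with two parity blocks per stripe instead of one. This is my own back-of-the-envelope estimate for the current drive set, not btrfs output:

```python
# Rough RAID6 estimate for the current 500/750/1000/1000 GB set: two
# parity blocks per stripe, stripes spanning every drive with free space.
gb = sorted([500, 750, 1000, 1000])
first_pass = gb[0] * (len(gb) - 2)            # 500 GB striped over 4 drives
remaining = [s - gb[0] for s in gb[1:]]       # [250, 500, 500] left over
second_pass = min(remaining) * (len(remaining) - 2)  # one 3-drive stripe
# The final 250 GB on each of the two largest drives sits below RAID6's
# three-device minimum, so it stays unused.
print(first_pass + second_pass)  # -> 1250 GB with two-disk redundancy
```

Roughly 1250GB with two-disk redundancy — comfortably above the 1000GB I'd need in that scenario.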
The roadmap for BTRFS includes even more flexible options, such as the ability to create separate directories on an array with different RAID levels, or even the ability to set differing RAID levels for an individual file!
In addition to flexible RAID levels, BTRFS features writable snapshots, subvolumes, fault-tolerant writing (copy-on-write), and a variety of other modern features. The only real downside is that it’s relatively new software under active development so it can be a little flaky under certain conditions. However, for a home media server it’s reliable, flexible, and easy to use. Keep an offsite backup to be safe, and give BTRFS a try today!
RockStor, a BTRFS-powered home NAS
The BTRFS Sysadmin’s Guide