LLRFS is a design for a file system that sits on top of an array of storage devices and provides the features expected of modern file systems, such as snapshots/reflinking, snapshot differencing, atomic multi-block writes, and so on.
It is composed of multiple layers, from top to bottom:
- Human File System Interface.
- Cow Storage Sharing Infrastructure.
- Hay Device Cache.
- Grass Redundant Storage.
- Turf Transactional Storage Array.
Log structuring is done solely by the Turf layer. Higher layers simply rely on the transactions provided by Turf, removing the log-on-a-log issues caused by layering, at least within the file system.
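To make this concrete, here is a minimal sketch of what a Turf transaction interface might look like from a higher layer's point of view. The names (turf_txn_begin, turf_txn_write, turf_txn_commit, turf_txn_abort) and the block-addressed write model are assumptions made purely for illustration; they are not the actual LLRFS API.

```c
/* Hypothetical sketch of a Turf transaction interface, as a higher
 * layer (e.g. Grass) might use it.  All names and signatures are
 * illustrative assumptions, not the actual LLRFS API. */
#include <stdint.h>
#include <stddef.h>

typedef struct turf     turf_t;     /* the Turf transactional array */
typedef struct turf_txn turf_txn_t; /* one open transaction         */

/* Open a transaction; Turf batches everything written under it into
 * its own log, so higher layers never need a log of their own. */
turf_txn_t *turf_txn_begin(turf_t *turf);

/* Stage a write of one logical block inside the transaction. */
int turf_txn_write(turf_txn_t *txn, uint64_t block_addr,
                   const void *data, size_t len);

/* Commit atomically: either every staged block becomes visible, or
 * none of them do (this is what gives atomic multi-block writes). */
int turf_txn_commit(turf_txn_t *txn);

/* Abort and discard all staged writes. */
void turf_txn_abort(turf_txn_t *txn);

/* Example: a higher layer updating two blocks atomically. */
int update_pair(turf_t *turf, uint64_t a, uint64_t b,
                const void *buf_a, const void *buf_b, size_t len)
{
    turf_txn_t *txn = turf_txn_begin(turf);
    if (!txn)
        return -1;
    if (turf_txn_write(txn, a, buf_a, len) != 0 ||
        turf_txn_write(txn, b, buf_b, len) != 0) {
        turf_txn_abort(txn);
        return -1;
    }
    return turf_txn_commit(txn);
}
```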
A design goal of LLRFS is to be flexible and allow adding new storage devices, of varying sizes and types, to the storage array. In principle, a user could start LLRFS on a single storage device, then upgrade by adding new devices, and replace or remove dying devices.
While the array is being reshaped, the user can continue to read and write the filesystem, at reduced speed. The user cannot start a further reshape until the current one is finished, but multiple changes can be combined into a single reshape.
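As a sketch of how several changes might be bundled into one reshape, the following hypothetical request structure combines device additions, removals, and parameter changes; the struct, field names, and function are assumptions for illustration only.

```c
/* Hypothetical sketch of a combined reshape request.  All names here
 * are illustrative assumptions, not the actual LLRFS API. */
struct llrfs;  /* opaque LLRFS object */

struct llrfs_reshape_request {
    const char **devices_to_add;     /* new backing devices          */
    int          num_devices_to_add;
    const char **devices_to_remove;  /* dying or retired devices     */
    int          num_devices_to_remove;
    int          new_max_device_fails;  /* -1 = keep current value   */
    int          new_num_spare_devices; /* -1 = keep current value   */
    int          new_num_device_groups; /* -1 = keep current value   */
};

/* Start the reshape in the background; reads and writes keep working
 * (at reduced speed) while it runs.  Returns an error if another
 * reshape is already in progress, since reshapes cannot be stacked. */
int llrfs_reshape_begin(struct llrfs *fs,
                        const struct llrfs_reshape_request *req);
```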
Rather than asking the user for a mode like "RAID1", "RAID10", "RAID5", "RAID6", etc., LLRFS asks the user to define a number of parameters:
- max_device_fails: the maximum number of devices that can fail before the array loses data. Can be set from 0 to 128.
- num_spare_devices: how much storage to keep "on standby" as "hot spares" in case a device loses data and has to be recovered. Can be set to 0 or 1.
- num_device_groups: how to group devices into stripes, i.e. form RAID50/RAID60-like structures. Can be set from 1 to num_devices / 2, subject to some constraints. LLRFS imposes a maximum limit on the number of devices in a stripe, so for large arrays this parameter will need to be increased to keep each group within that limit.
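For illustration, these parameters could be gathered into a small structure and validated against the ranges above; the struct, field names, and helper below are assumptions, not the actual libllrfs API.

```c
/* Hypothetical sketch of the user-facing array parameters described
 * above.  Struct and function names are illustrative assumptions. */
#include <stdbool.h>

struct llrfs_array_params {
    int max_device_fails;   /* 0..128: failures tolerated without data loss */
    int num_spare_devices;  /* 0 or 1: hot-spare capacity kept on standby   */
    int num_device_groups;  /* 1..num_devices/2: RAID50/60-like grouping    */
};

/* Check the ranges given in the text; the per-stripe device limit is
 * left out here because its exact value is not specified above. */
bool llrfs_array_params_valid(const struct llrfs_array_params *p,
                              int num_devices)
{
    if (p->max_device_fails < 0 || p->max_device_fails > 128)
        return false;
    if (p->num_spare_devices != 0 && p->num_spare_devices != 1)
        return false;
    if (p->num_device_groups < 1 || p->num_device_groups > num_devices / 2)
        return false;
    return true;
}
```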
In addition, LLRFS allows backing devices of varying sizes. In principle, given the values of max_device_fails and num_spare_devices, there exists some combination of partitions, RAID1/RAID5/RAID6, and SPAN that fulfills the constraints set by the user.
For example, suppose the user has a 1 TB disk and two 2 TB disks, and wants to tolerate the failure of at most one device. The user can split each 2 TB disk into two 1 TB partitions, put one partition from each 2 TB disk in a RAID5 with the 1 TB disk, put the other pair of 1 TB partitions in a RAID1, and then combine the RAID5 and RAID1 with SPAN, getting 3 TB of usable space while spending only 2 TB on redundancy. LLRFS will automatically handle unequal disk sizes in this way for the user.
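Spelling out the arithmetic of that example (the layout is taken from the text above; the code itself is only an illustration):

```c
/* Worked arithmetic for the 1 TB + 2 TB + 2 TB example above.
 * Each 2 TB disk is split into two 1 TB slices; three 1 TB slices
 * (one per disk) form a RAID5, the remaining two form a RAID1, and
 * the two arrays are joined with SPAN. */
#include <stdio.h>

int main(void)
{
    int raid5_members = 3, raid5_member_tb = 1; /* 1 TB disk + one slice of each 2 TB disk */
    int raid1_member_tb = 1;                    /* the two leftover slices                 */

    int raid5_usable = (raid5_members - 1) * raid5_member_tb; /* 2 TB, 1 TB of parity */
    int raid1_usable = raid1_member_tb;                       /* 1 TB, 1 TB mirrored  */

    int usable = raid5_usable + raid1_usable; /* SPAN concatenates: 3 TB */
    int total  = 1 + 2 + 2;                   /* raw capacity: 5 TB      */

    printf("usable %d TB, redundancy overhead %d TB\n", usable, total - usable);
    return 0;
}
```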
Our priorities are:
- Reliability.
- Hardware cost reduction.
- Flexibility (see above).
- Performance.
As an example, with MD RAID6, in order to close the RAID6 write hole (reliability), you need to use a separate, fast device as a journal (fails hardware cost reduction). For reliability, that fast device has to actually be a RAID1 of fast devices (fails hardware cost reduction even more). MD does get very good performance from the use of such a journal, but note that our priorities are different. In particular, MD is perfectly willing to keep the write hole open if you do not want to buy a separate journal device (fails reliability).
As another example, with ZFS, the RAID5/6 write hole is closed by simply treating all writes as full stripe writes. However, if you use the RAID5-equivalent, you cannot extend it later with a few more drives and switch to the RAID6-equivalent (fails flexibility). LLRFS seeks to let you change your array shape as needed, and change your reliability decisions later.
LLRFS will be implemented with most of its algorithms in a libllrfs library, which will be licensed under the MIT / Expat license for maximum compatibility with various OS kernels.
This library exposes the construction and use of an LLRFS object, which connects on one end to some filesystem interface and on the other end to some block device interface. It also connects to memory management and task/thread management.
The intent is that much of libllrfs can be written and tested in userspace programs, especially since reliability is a high-priority design goal. Then, kernel-specific projects can link to the common libllrfs, filling in the filesystem interface, the block device interface, the memory management interface, and the task management interface.
For maximum chances of being accepted into various kernels, the library will be written in C. Interfaces are implemented by function pointers.
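As a minimal sketch, such function-pointer interfaces might look like the following; every name and signature here is an assumption for illustration, not the actual libllrfs API.

```c
/* Hypothetical sketch of libllrfs-style function-pointer interfaces.
 * Names and signatures are illustrative assumptions only. */
#include <stdint.h>
#include <stddef.h>

/* Block device interface: filled in by the host kernel (or by a
 * userspace test harness backed by plain files). */
struct llrfs_bdev_ops {
    void *ctx;
    int (*read)(void *ctx, uint64_t offset, void *buf, size_t len);
    int (*write)(void *ctx, uint64_t offset, const void *buf, size_t len);
    int (*flush)(void *ctx);
};

/* Memory management interface. */
struct llrfs_mem_ops {
    void *ctx;
    void *(*alloc)(void *ctx, size_t len);
    void  (*free)(void *ctx, void *ptr);
};

/* Task/thread management interface. */
struct llrfs_task_ops {
    void *ctx;
    int (*spawn)(void *ctx, void (*fn)(void *arg), void *arg);
};

struct llrfs;  /* opaque LLRFS object */

/* Construct an LLRFS object wired to the host's interfaces; the
 * filesystem interface it exposes on the other end is omitted here. */
struct llrfs *llrfs_create(const struct llrfs_bdev_ops *bdev,
                           const struct llrfs_mem_ops *mem,
                           const struct llrfs_task_ops *task);
void llrfs_destroy(struct llrfs *fs);
```

In a userspace test program, the same structs could be filled with implementations backed by ordinary files, malloc/free, and pthreads, which is what makes testing most of the library outside a kernel practical.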