Linux Filesystems and Storage – Chris Mason
Chris is the lead btrfs developer
History of Linux –
2.2 kernels: mostly SMP capable, 1024 process limit, no journaled fs’s (no recovery after a system crash), no source control
2.4: Enterprise ready! SMP scalability (start), journaled fs’s, 4g RAM, Raw IO (DBs could talk more directly to storage), no source control
2.6: even more enterprise ready, long dev series, MUCH ARGUING. Source control at last, got scaling. NUMA, block layer, page cache, networking, mm subsystem, AIO/DIO
Git: Written for kernel devs, by kernel devs. Constant incremental changes, new APIs and functionality, across all subsystems
Enterprise kernels are 2.5 years behind mainline, so big features are backported from mainline into enterprise kernels
Why are there so many filesystems!!!???
It's a reflection of how many people use Linux and the many different ways they store things
It's easy to add a filesystem
Where we are now:
ext4: modernized ext, used at Google, targets embedded (cell phones) and large systems, some static limitations (biggest = # of inodes has to be pre-allocated)
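The pre-allocated inode limit is visible from user space. A minimal sketch, assuming Python 3 on a POSIX system; `inode_usage` is a hypothetical helper name, and the only real API it relies on is `os.statvfs`:

```python
import os

def inode_usage(path="/"):
    """Return (total, free) inode counts for the filesystem holding path."""
    st = os.statvfs(path)
    # f_files is the total number of inodes, f_ffree the number still free.
    # Filesystems that allocate inodes dynamically may report 0 for both.
    return st.f_files, st.f_ffree

if __name__ == "__main__":
    total, free = inode_usage("/")
    if total:
        print(f"inodes: {total} total, {free} free")
    else:
        print("this filesystem does not report a fixed inode count")
```

On ext4 the total is fixed when the filesystem is created, so a volume holding many small files can run out of inodes while it still has free blocks.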
xfs: the only one of these filesystems that cannot shrink, significant metadata performance improvements, best scalability for large files & large systems
btrfs: made to provide features not found in other fs’s: snapshots are writable (copy on write for metadata and data), add/remove devices, shrink the fs online, support for volume mgmt. Good overall performance, working on scalability issues.
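The idea behind writable copy-on-write snapshots can be sketched with a toy in-memory model. This is only an illustration and not how btrfs is implemented (btrfs uses copy-on-write B-trees); `CowVolume` and all of its methods are hypothetical names:

```python
class CowVolume:
    """Toy copy-on-write volume: a block map pointing into a shared block store."""

    def __init__(self, store=None, block_map=None):
        # store: block id -> bytes, shared between a volume and its snapshots
        self.store = store if store is not None else {}
        # block_map: logical block number -> block id, private to each volume
        self.block_map = dict(block_map or {})

    def write(self, logical, data):
        # Never overwrite in place: allocate a new block and repoint the map.
        block_id = len(self.store)
        self.store[block_id] = data
        self.block_map[logical] = block_id

    def read(self, logical):
        return self.store[self.block_map[logical]]

    def snapshot(self):
        # A snapshot copies only the map (metadata); data blocks stay shared.
        return CowVolume(self.store, self.block_map)


if __name__ == "__main__":
    vol = CowVolume()
    vol.write(0, b"original")
    snap = vol.snapshot()
    snap.write(0, b"changed in snapshot")   # the snapshot is writable
    print(vol.read(0), snap.read(0))
```

Because writes always go to fresh blocks, a write to the snapshot leaves the original volume's data untouched, and unmodified blocks are stored only once.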
Device mapper (backbone of LVM): thin provisioning, better snapshots, ssd front ends (under dev), simplified mgmt tools – storage pool out of LVM (interesting)
Compact flash: need to be better at not wrecking phones. Right now normal fs on compact flash doesn’t always work.
Block: the highest-performance storage (SSD) sends bios directly via the device driver. Linear scalability is coming, but the downside is that this bypasses features in the disk elevators & SCSI layer (multipathing, T10 PI, etc).
SCSI: Big downside now is performance. But lots of good – strong support for every device type.
NFS: still the network FS, revisions introduce new features (pNFS, NFS 4.1, 4.2..)
Atomic writes – storage is smart enough to give atomic writes down to the media (write is all or nothing). Harder to use than you may think (esp from an FS point of view).
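For contrast, this is roughly what applications do today to get all-or-nothing updates without help from the storage. A minimal sketch, assuming Python 3 on Linux; `atomic_replace` is a hypothetical helper, and it shows the common write-to-temp-then-rename workaround, not the storage-level atomic writes described in the talk:

```python
import os
import tempfile

def atomic_replace(path, data):
    """Replace the contents of path so readers see the old or new data, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())    # push the new data to stable storage
        os.replace(tmp_path, path)    # atomic rename over the old file
    except BaseException:
        # Something failed before the rename: remove the temporary file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The extra copy, two fsyncs, and rename are the overhead that atomic writes in the device could in principle remove.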
Copy offload: block range cloning in storage, or copy offload by the storage. Standards work is in progress (token based)
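From user space, the closest existing interface is the `copy_file_range` system call, which asks the kernel to copy data without passing it through the application. A minimal sketch, assuming Python 3.8+ on Linux; `offloaded_copy` is a hypothetical helper. Whether the copy is actually cloned or offloaded depends on the filesystem, and this does not expose the token-based standards work mentioned above:

```python
import os
import shutil

def offloaded_copy(src_path, dst_path):
    """Copy src to dst in the kernel if possible. Returns True if that path was used."""
    try:
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            remaining = os.fstat(src.fileno()).st_size
            while remaining > 0:
                # The kernel may copy fewer bytes than requested, so loop.
                copied = os.copy_file_range(src.fileno(), dst.fileno(), remaining)
                if copied == 0:
                    break
                remaining -= copied
        return True
    except (AttributeError, OSError):
        # Not available or not supported here: fall back to an ordinary copy.
        shutil.copyfile(src_path, dst_path)
        return False
```

The return value only says which code path ran; it cannot tell whether the filesystem shared blocks or copied them.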
Shingled devices: drive packs things close enough together that it no longer promises the magnet won’t write outside the lines.
Hinting: data tiers. Connect blocks likely to be freed at the same time (think flash, erasure coding…). Basically give hints about how the data is going to be used (to help with data tiering). Work needed on feedback loops so filesystems understand how well they are doing with the hints.
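An existing, much simpler hint interface gives a feel for the shape of such an API. A minimal sketch, assuming Python 3 on Linux; `hint_sequential` is a hypothetical helper built on `os.posix_fadvise`. Note that this hints at the read access pattern for the page cache, which is a different kind of hint from the data-lifetime and tiering hints discussed in the talk:

```python
import os

def hint_sequential(fd):
    """Tell the kernel that fd will be read sequentially. Returns True if accepted."""
    try:
        # offset 0 and length 0 mean "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        return True
    except (AttributeError, OSError):
        # Hints are advisory: if the call is unavailable or refused, carry on.
        return False
```

Like the hints in the talk, the call gives no feedback on whether the hint helped, which is the feedback-loop gap noted above.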
Flash: Fundamentally different way of doing IO – not just faster. Removes a big constraint – locality.
And he mentioned Ceph! Right on!!