I recently started to replace the HDD storage of my home server, since my three WD RED 4TB drives got rather old and I required more space. After lots of experimenting I ended up with ZFS, three new HGST 10TB drives and a shiny Optane 900p. Here is my story so far.
There are many videos, articles and other documentation out there describing in detail what ZFS is. Let's keep this brief. ZFS is a copy-on-write file system created by Sun Microsystems for Solaris and available under an open-source(-ish) license for Linux and other operating systems. It combines the abilities of a volume manager (like LVM) with a file system (like ext4). Compared to most other file systems, it natively handles multi-device setups by creating all kinds of stripes, mirrors and parity-based constructs for data redundancy. Unlike most other file systems (yes, I know about BTRFS…) it prioritizes data consistency, self-healing capabilities and error prevention, and it has a proven track record in the enterprise storage industry.
ZFS works best when disks are exposed to it directly, favouring a "JBOD" configuration over RAID controllers. It strictly is NOT "Software RAID / Ghetto RAID"; in fact it offers features no other file system or hardware RAID controller can offer. Let's face it, RAID controllers are just expensive, optimized computers with crappy, often incompatible firmware and a bunch of SATA/SAS connectors. Since I evaluated multiple solutions (Linux MD, an LSI 9260-8i hardware controller, BTRFS and ZFS) I dare to have an opinion on that topic. The only thing ZFS does not have is a battery-backup unit ("BBU"); however, the risk of losing any data during a power outage is extremely low, and ZFS is designed so that a power loss cannot corrupt data already committed to disk. An external UPS is a lot cheaper than an entry-level RAID controller with BBU. This only leaves PSU failures, cable errors and software bugs as risks.
As usual there are concessions to make: for ZFS that is higher resource usage (and subsequently potentially lower performance) compared to file systems that care less about data integrity. It has to go many extra miles to make sure data is not just received from disks, but is actually correct, intact and unmodified, and gets repaired in case it is corrupted. This, by the way, means using ECC RAM is a very good idea, as faulty data in RAM would lead to "incorrectly repaired" (aka. corrupted) data. Optional features like compression, de-duplication and encryption take an extra toll. ZFS has intelligent caches which are quite memory hungry and can easily use 16GB of available RAM even on small systems. That being said, unused RAM is wasted RAM, and it's important to understand what ZFS is using it for. To offload some of this resource usage, ZFS allows a second level of caching to be written to non-volatile memory, called the L2ARC ("level 2 adaptive replacement cache"), which acts similar to a "read cache". Next there is a mechanism called the ZIL ("ZFS intent log"), which acts similar to a "write cache" for synchronous writes: it collects and streamlines write operations, which ZFS then flushes to disk every couple of seconds.
Performance of ZFS can be greatly enhanced by using a SLOG ("separate log device") for the ZIL and by offloading the L2ARC to high-speed, low-latency storage. Since DRAM is volatile, it's not an option, except for some super expensive battery/capacitor-buffered DRAM devices. SSDs are a lot more affordable, non-volatile by nature and really fast compared to hard drives. However, compared to DRAM, SSDs are still orders of magnitude slower. Just recently a new technology has been released, claiming to fit between DRAM and traditional SSDs and therefore be an obvious choice for ZIL and L2ARC: Intel Optane.
- It’s a product range based on 3D-XPoint memory
- It’s built for very specific use-cases like caching
- It’s cheaper than DRAM but more expensive than typical SSDs
- It uses proprietary memory tech from Intel and Micron
- It’s NOT a typical SSD, since it’s not based on NAND flash
- It’s NOT DRAM, since it’s non-volatile
3D-XPoint ("3D cross-point") memory technology was announced years ago, and the first products, called "Optane", hit the market in early 2017. The first release was a datacenter-grade memory product called "Optane SSD DC P4800X", available in 375GB and 750GB capacities and in U.2 drive and PCIe card form factors. Roughly at the same time, some much more consumer-oriented "Optane Memory" M.2 cards became available in 16GB and 32GB configurations. In late 2017 Intel released the "Optane SSD 900p" with capacities of 280GB and 480GB, as PCIe card and U.2 drive.
While all Optane products are based on 3D-XPoint memory, their scope and performance vary a lot. The small "Optane Memory" M.2 cards are meant to serve as a system cache/accelerator for HDD-based desktop and mobile computers, while the P4800X and 900p target server and enthusiast desktop computing. The latter two use much more power but also deliver significantly better performance, as they pack more 3D-XPoint modules and speedier controllers. The P4800X is Intel's top-of-the-line offering and comes with more integrity checks, a capacitor-backed buffer to avoid data loss, and better durability. Performance-wise it's rather close to the 900p, and both share stunning specs:
- 2500MB/s read, 2000MB/s write
- 500,000 IOPS read and write
- 10µs latency for read and write
- 5PBW endurance, 1.6M hours MTBF
- 1 uncorrectable sector per 10^17 bits read
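To put that last figure into perspective, a quick back-of-the-envelope calculation shows how much data can be read, on average, before hitting one uncorrectable sector:

```shell
# 10^17 bits converted to bytes, expressed in decimal petabytes (10^15 bytes)
awk 'BEGIN { printf "%.1f\n", 1e17 / 8 / 1e15 }'   # → 12.5 (petabytes per expected uncorrectable sector)
```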
Intel claims that those cards require a 7th generation Intel Core CPU, which is only half the truth. In fact those drives use the NVMe protocol and can be used as a regular block device with any current CPU and platform. To run Intel's software for automated caching, a 7th generation Intel Core CPU is indeed enforced, which appears to be a sales-oriented decision. Anyway, for my use-case the 900p meets a Xeon E3 v5 CPU on a C232 chipset, and it just works fine.
Now, what's the fuss about? Why is Optane spectacular? When looking at typical benchmarks, Optane-based products deliver okay-ish performance compared to NAND-based NVMe SSDs like a Samsung 960 Pro, but come at a steep price premium. SSD benchmarks usually assume large block sizes (>=1M) and high queue depths (>=16). These values do not represent typical server workloads; in fact I dare to claim they represent almost no relevant workload and are made up by vendors to present large numbers. NAND-based SSDs are great at producing high throughput when reading large quantities off many NAND chips in parallel (sequential access), and that is a good thing. However, the fun starts at small block sizes (e.g. 4K) and low queue depths (e.g. 2 or 4), as often seen in server workloads like databases. Consumer-grade NAND SSDs are usually also terrible at random write performance. Intel claims Optane can fix that.
Disclaimer: I've not received any freebies or been in contact with any of the brands mentioned here. All hardware was bought with my own money. I understand these benchmarks are not comprehensive, and I admit that the SM951 has been in use for some years, so it might not produce perfect results anymore. Also, the system was running some load during the benchmarks and potentially lacks optimization. While my results might not be scientifically rigorous, they represent a real-world configuration.
Let's have a look at a Samsung SM951 running in the same system as an Intel Optane SSD 900p, both connected via PCIe x4:
1M blocksize, QD16, random read
$ fio --name=test1M --filename=test1M --size=10000M --direct=1 --bs=1M --ioengine=libaio --iodepth=16 --rw=randread --numjobs=2 --group_reporting --runtime=5
* 900p: 2563 IOPS, 2536 MB/s, 1247 usec avg. latency
* SM951: 2005 IOPS, 2005 MB/s, 1594 usec avg. latency
So far so good: both products are almost toe to toe, while the 900p delivers a bit more performance, justifying its higher price point. Note that both products appear to be maxed out regarding bandwidth. Now, let's write some data.
1M blocksize, QD16, random write
$ fio --name=test1M --filename=test1M --size=10000M --direct=1 --bs=1M --ioengine=libaio --iodepth=16 --rw=randwrite --numjobs=2 --group_reporting --runtime=5
* 900p: 2152 IOPS, 2152 MB/s, 1485 usec avg. latency
* SM951: 399 IOPS, 409 MB/s, 7981 usec avg. latency
Things start to become interesting, as the 900p suddenly pulls away with 5x higher IOPS while still being maxed out on bandwidth. Write-intense workloads are obviously an issue for consumer NAND SSDs.
As said before, 1M block sizes and a queue depth of 16 are unusual for server workloads, so let's lower the block size to 4K:
4K blocksize, QD16, random read
$ fio --name=test4k --filename=test4k --size=10000M --direct=1 --bs=4k --ioengine=libaio --iodepth=16 --rw=randread --randrepeat=1 --rwmixread=75
* 900p: 310227 IOPS, 1211 MB/s, 51 usec avg. latency
* SM951: 177432 IOPS, 710 MB/s, 90 usec avg. latency
Again, the SM951 does a good job at reading, however the gap becomes a lot bigger: the 900p now delivers 75% higher throughput. Let's write some data…
4K blocksize, QD16, random write
$ fio --name=test4k --filename=test4k --size=10000M --direct=1 --bs=4k --ioengine=libaio --iodepth=16 --rw=randwrite --randrepeat=1 --rwmixread=75
* 900p: 188632 IOPS, 755 MB/s, 84 usec avg. latency
* SM951: 22012 IOPS, 88 MB/s, 712 usec avg. latency
While 22k IOPS from the SM951 are still very respectable, the 900p again obliterates it, now delivering about 9x higher performance.
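For the record, the "5x" and "9x" figures above follow directly from the measured IOPS:

```shell
# 4K random write: 900p vs SM951 IOPS ratio
awk 'BEGIN { printf "%.1f\n", 188632 / 22012 }'   # → 8.6
# 1M random write: same comparison
awk 'BEGIN { printf "%.1f\n", 2152 / 399 }'       # → 5.4
```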
With those numbers crunched: NAND-based SSDs remain great products, just not for every workload and use-case. 3D-XPoint clearly sets a new standard for such workloads, somewhere in between DRAM and NAND.
Back to specs: the 900p's endurance is rated at 5PBW (five petabytes written), compared to 400TBW (four hundred terabytes written) for the SM951. The datacenter-focused P4800X is even rated at 20PBW. To be fair on specs, the 900p uses a lot more power (5W idle, 14W load) compared to 40mW idle and 5W load for the Samsung and other NAND SSDs.
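To translate those endurance ratings into something tangible, here is how long each drive would last at an assumed (and already generous) 100GB written per day:

```shell
# Years until the rated endurance is reached at 100GB/day (decimal units)
awk 'BEGIN { printf "%.0f\n", 5e15 / (100e9 * 365) }'     # 900p, 5PBW    → 137
awk 'BEGIN { printf "%.0f\n", 400e12 / (100e9 * 365) }'   # SM951, 400TBW → 11
```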
Both the latency advantage and the higher durability make 3D-XPoint based products very interesting for enterprise workloads and caching. Therefore I decided to get a 900p and use it as a cache device for my home server. Before doing so yourself, consider that Optane is a first-generation product; improved cards are likely around the corner.
The server runs a bunch of KVM virtual machines managed by Proxmox and sports an E3-1260L CPU, 32GB of DDR4 ECC memory and a P10S-I board.
Creating the primary storage pool is quite straightforward:
$ zpool create -O compression=lz4 -O normalization=formD -o ashift=12 storage raidz1 ata-HGST_HUH721010ALN600_1SJ5HXXX ata-HGST_HUH721010ALN600_1SJ5JXXX ata-HGST_HUH721010ALN600_1SJ6KXXX
- compression=lz4 means LZ4 compression is used on compressible data. ZFS will find out if a block is actually compressible.
- normalization=formD means file names are stored as normalized UTF-8.
- ashift=12 means native 4K blocks are used, which my drives feature.
- raidz1 means the provided drives are organized in a way similar to traditional RAID5, storing parity as redundancy to allow recovering from one failed drive.
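With raidz1 across three 10TB drives, one drive's worth of capacity goes to parity, so the usable space (before file-system overhead) can be estimated with a one-liner:

```shell
# raidz1 usable capacity: (number of drives - 1) * drive size
awk 'BEGIN { printf "%d TB\n", (3 - 1) * 10 }'   # → 20 TB
```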
ZFS is quite reasonably configured by default, however there are a few useful knobs to adjust for both workload and hardware. Always verify that a change has a positive impact and adjust accordingly; there is no perfect universal config, otherwise it would be the default anyway. I'll write a separate post about file-system tuning in a broader scope.
To use the Optane 900p as caching device, I created a GPT partition table on it with a 10GB ZIL ("log") partition and a 120GB L2ARC ("cache") partition. Adding them to the pool is easy:
$ zpool add storage log nvme-INTEL_SSDPED1D280GA_PHMXXX2301DU280CGN-part1
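The L2ARC partition is added the same way, as a "cache" device. The partition number here is an assumption based on my layout; note that, unlike a log device, losing an L2ARC device is harmless, as it only holds copies of data:

```shell
$ zpool add storage cache nvme-INTEL_SSDPED1D280GA_PHMXXX2301DU280CGN-part2
```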
Now my pool looks like this:
$ zpool status -v
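The output should look roughly like this (an illustrative sketch based on the commands above, not a verbatim capture):

```
  pool: storage
 state: ONLINE
config:
        NAME                                                 STATE     READ WRITE CKSUM
        storage                                              ONLINE       0     0     0
          raidz1-0                                           ONLINE       0     0     0
            ata-HGST_HUH721010ALN600_1SJ5HXXX                ONLINE       0     0     0
            ata-HGST_HUH721010ALN600_1SJ5JXXX                ONLINE       0     0     0
            ata-HGST_HUH721010ALN600_1SJ6KXXX                ONLINE       0     0     0
        logs
          nvme-INTEL_SSDPED1D280GA_PHMXXX2301DU280CGN-part1  ONLINE       0     0     0
```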
I was previously using the "qcow2" disk format on ext4, which is now a bad idea, since ZFS is already a copy-on-write system. Those images can easily be converted to RAW images and dd'ed back to the ZFS dataset.
$ qemu-img convert -f qcow2 -O raw vm-100-disk-1.qcow2 vm-100-disk-1.raw
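Writing the RAW image back can then look like this, assuming the VM disks live on ZFS volumes (the dataset name and size are examples; the zvol has to be at least as large as the image):

```shell
# create a ZFS volume and copy the raw image onto it
$ zfs create -V 32G storage/vm-100-disk-1
$ dd if=vm-100-disk-1.raw of=/dev/zvol/storage/vm-100-disk-1 bs=1M
```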
ZFS allows creating sparse datasets, which only grow when their space is actually used. Since zeros are highly compressible, writing and then deleting a large "zero file" within the VMs can actually free up ZFS storage. After moving to RAW images, run the following within the VM (the dd will intentionally run until the disk is full, so the file has to be removed afterwards):
$ dd if=/dev/zero of=zerofile bs=1M; sync; rm zerofile
Since I'm running virtual machines, there is another thing that should go onto low-latency storage: swap. I try to conserve as much memory as possible, which means VMs sometimes use their swap space, and that gets horribly slow when it resides on spinning disks. For that reason I created another partition, a separate ZFS pool and disk images that hold the VMs' swap data.
Creating a new pool is very simple, and as I don't need redundancy for swap, it will consist of just one "device", actually a partition. Using unique hardware identifiers instead of device paths (e.g. "/dev/nvme0n1p3") is quite helpful, as PCIe enumeration and partition order may change.
$ zpool create -O normalization=formD -O sync=always swaps INTEL_SSDPED1D280GA_PHMXXX2301DU280CGN-part4
Now new virtual disks are created on this ZFS pool and get attached to their virtual machine.
$ zfs list
Replacing the old swap and re-claiming that space for the root partition is easy if the VMs are using LVM.
/dev/sdb is the new virtual device available to the VM, stored on the ZFS "swaps" pool on the Optane.
Add the new swap space to LVM:
$ pvcreate /dev/sdb
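The following steps reference a logical volume at /dev/vm-optane/swap, so the volume group and logical volume have to exist first (the names here are chosen to match the paths used below):

```shell
$ vgcreate vm-optane /dev/sdb
$ lvcreate -n swap -l 100%FREE vm-optane
```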
Create the swap file system; the UUID it prints can be used as the device identifier in /etc/fstab:
$ mkswap /dev/vm-optane/swap
Disable and remove the old swap partition:
$ swapoff /dev/vm-system/swap
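For lvextend to actually be able to claim the space, the old swap logical volume also has to be removed (assuming nothing references it anymore):

```shell
$ lvremove /dev/vm-system/swap
```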
Extend the root volume and its file system to use the freed-up space (the -r flag resizes the file system along with the logical volume):
$ lvextend -r -l +100%FREE /dev/vm-system/root
…and reboot the VM, just to be sure the file system is undamaged.