A generic piece of advice on tuning
ZFS is a mature piece of software, engineered by file- and storage-system experts with lots of practical experience. Sun invested a lot of money in it and built enterprise-grade appliances around it for a decade. Always keep this in mind when optimizing it: there is reason to trust that the defaults were chosen very reasonably, even if they may not appear obvious at first glance. Also keep in mind that enterprise-class storage systems are primarily about safety, not about speed at all cost. If there were settings that were better in every regard, someone would already have figured that out and made them the default.
That being said, there is usually a chance to optimize software to fit a specific workload and environment. Doing so can save a lot of money compared to throwing more expensive hardware at a problem. This process is based on knowledge, context, a testing methodology and goals. Always verify that a change has an actual, holistic, positive impact and revert it if it doesn't. Optimizing complex systems requires a systematic approach, which is not pasting every setting that has been suggested on the internet. It's very likely that random settings which worked for one specific case won't yield any improvement for another, but instead introduce problems or even data loss. The same applies to any suggestion in this article: it could be totally worthless for you to replicate settings that worked well for me.
Before actually changing anything, make sure you understand the underlying concepts, have read or listened to all relevant documentation and are in a position to second-guess suggestions made by others. Make sure you understand the context in which ZFS is operating and define plausible success criteria. It does not make sense to aim for 2000 IOPS at 4K blocks out of a single HDD, or to expect 1GB/s throughput on encrypted storage on a Raspberry Pi. It's also not useful to expect the same kind of performance for every workload, since each configuration and optimization stands on its own. If you don't know how block storage works in general or which parameters are relevant for measuring and rating storage systems, then please gather that knowledge first. Only if you can say with confidence that you understand what you are doing, why you are doing it and what you roughly expect, and have found a proper testing methodology, should you attempt to “tune” a complex system such as ZFS.
Context
I’m using a 3-way RAIDZ1 array with HGST HUH721010ALN600 disks (10TB, 7200rpm, 4Kn) and an Intel Optane 900p card as ZIL/L2ARC within an entry-level server (E3-1260L, 32GB, 2x1Gbps) running Debian Linux and Proxmox/KVM for virtualization. Virtual machines (currently 10) run headless Debian Linux and provide general-purpose residential services such as mail, file, web, VPN, authentication, monitoring etc. This article was written while running ZFS on Linux (“ZoL”) 0.7.6.
Current situation
Storage access within VMs is terribly slow and the host system shows high IOwait numbers. Encrypted disks in particular almost flat-line when moving some data around.
Defining success criteria
My goal is to fully saturate one of the server's 1Gbps links with a 10GB file transfer from within a virtual machine that uses full-disk encryption. I also want my VMs to be snappy and deliver their services without significant latency, even if other VMs are busy. The first is an objective goal regarding throughput which can easily be measured, the second a subjective one regarding latency.
Storage benchmark background
Benchmark parameters
Storage benchmarks have a few important variables:
- Test file size, which should be large enough to get past cache sizes and represent real-world usage.
- IO request size, usually between 4K and 1M, depending on the workload. Databases tend toward the 4K side, while moving large files tends toward the 1M side.
- Access pattern, random or sequential
- Queue depth, which is the number of IO commands that are issued by an application and queued within the controller at the same time. Depending on the drive, those commands can be executed in parallel (SSDs). Some queue saturation can be beneficial for performance, but too much parallelism can severely impact latency, especially for HDDs.
- Distribution of write and read access, based on application type. Web servers usually trigger 95% reads, databases are typically around 75% reads and 25% writes, and specific applications like log servers can even be 95% writes. This heavily influences how efficiently caches are used, for example.
Benchmark results
A very relevant value we get out of such a test is latency, which translates to IOPS, which translates to throughput. As a rough example, if an IO request takes 1ms (latency) and we use a request size of 4KiB, we will get 4000KiB per second (about 4MiB/s) of throughput out of this device in a perfect scenario. 1ms of latency is already very low for an HDD, which is why HDDs suck at small request sizes.
When running random access on spindle storage, throughput can drop even further as the read/write heads need to reposition all the time. Solid-state storage does not have that mechanical impairment. If we crank up the request size to 64KB, we suddenly get 64MB/s out of the same drive. Latency is not always the same due to storage device characteristics, especially for random access. Therefore the latency percentile is more interesting than the average: a 99th percentile of 1ms means that 99% of all IO requests finished within 1ms and 1% took longer. This gives an idea of how consistent latency is.
Limitations
At some point with lower latency or higher request size we will hit a throughput limit defined by mechanical constraints, internal transfer to cache or by the external interface that handles transfer to the storage controller, usually 6 or 12Gb/s for SATA/SAS, 16 or 32Gb/s for PCIe. Even high-end HDDs are capped by their rotation speed, which affects both latency and throughput. Modern SSDs are usually capped by their external storage interface or by thermal issues when doing sequential access. Random access is usually limited by memory cell technology (NAND, 3D-XPoint) or controller characteristics.
Storage layout decisions introduce limitations as well. Ten disks in a mirror will provide the same write performance as one disk would. Actually, it depends on the slowest disk within the array; drives should of course be matched in such arrays, but there are always variances and drive performance tends to degrade over time. Running the same 10 disks as a stripe, we can expect almost 10x the performance of a single drive, assuming other components can handle it. A RAIDZ1 with three disks can in theory provide the same level of performance as a single drive. On top of checksums, ZFS will calculate parity and store it on a second drive. This means RAIDZ1 is quite CPU- and memory-hungry and will occupy two disks for a single write request.
File systems themselves have characteristics that impact performance. There are simple file systems like ext2 or FAT which just put a block on disk and read it back. Others are more advanced to avoid data loss, for example by keeping a journal or checksumming the data that was written. All those extra features require resources and can reduce file-system performance. Last but certainly not least, properties like sector sizes should be aligned between the file system and the physical hardware to avoid unnecessary operations like read-modify-write.
Caches
Caches are very helpful to speed things up, but they are also something that needs to be taken into consideration when benchmarking. After all, we want results for the storage system, not for system RAM or other caches. Caches are there for a reason, so they should not be disabled for benchmarking; instead, real-world data and access patterns should be used for testing.
HDDs and NAND SSDs usually have a very quick but small internal cache of 128MB to 1GB. This is not just used for buffering but also for internal organization, especially on SSDs, which need to take care of things like wear leveling and compression.
Some HBAs have additional caches themselves, which are much larger and support the storage array instead of individual drives.
For ZFS specifically there is a whole range of caches and logs (ARC, L2ARC, ZIL) independent of the hardware, as ZFS expects to access drives directly with no “intelligent” controller in between. The way they work can be changed but is already optimized for most workloads; their size, however, can and should be matched to the system configuration.
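A quick way to see how large the ARC currently is and where its limits sit, and to watch log and cache device activity (a sketch using the paths and field names found on ZFS on Linux, not part of the original article):

```
# current ARC size (bytes), target and hard limit on ZFS on Linux
grep -E '^(size|c|c_max) ' /proc/spl/kstat/zfs/arcstats

# live view of pool, log and cache device activity
zpool iostat -v 1
```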
Analysis
First benchmarking
File transfers from and to the server are very unstable, bouncing between 20 and 60MB/s. Those values are not very helpful and involve a lot of unnecessary moving parts (client computer, network…), so I decided to locally benchmark the VM for random and sequential reads and writes. To do so I chose fio, a handy IO benchmarking tool for Linux and other platforms.
To find out what my array is actually capable of, I started benchmarking ZFS directly on the host system. This removes several layers of indirection which could hide potential root causes of the bad performance. I also started there to find out how different benchmark settings would affect my results.
I created a matrix of benchmark settings and IOPS/throughput results, starting with request sizes of 4KiB, 64KiB and 1MiB at queue depths of 1, 4, 8 and 16, for random read, random write, sequential read and sequential write patterns. At this point I kept my application profile simple since I was more interested in how reads and writes perform in general, again reducing the complexity of mixed workloads that could hide bottlenecks.
The results told me that there is a negligible difference between queue depths, so I stuck with QD4 for all further tests. Second, read performance is crazy high, indicating that the ZFS caches are doing what they are supposed to do. The test first creates a data block - which ZFS stores in ARC (i.e. DRAM) or L2ARC (the Intel Optane 900p) - and then reads that very same block back from those caches. This is not a usual real-world scenario, so I put more emphasis on write performance.
fio commands
During my benchmarks I used the following fio parameters. Adjust the block size (bs) accordingly:
Pattern | Command |
---|---|
Random read | fio --filename=test --sync=1 --rw=randread --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test |
Random write | fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test |
Sequential read | fio --filename=test --sync=1 --rw=read --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test |
Sequential write | fio --filename=test --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test |
Results from ZFS
Pattern | IOPS | MB/s |
---|---|---|
4K QD4 rnd read | 47464 | 190 |
4K QD4 rnd write | 10644 | 43 |
4K QD4 seq read | 347210 | 1356 |
4K QD4 seq write | 16020 | 64 |
64K QD4 rnd read | 62773 | 3923 |
64K QD4 rnd write | 5039 | 323 |
64K QD4 seq read | 58514 | 3657 |
64K QD4 seq write | 5497 | 352 |
1M QD4 rnd read | 6872 | 6872 |
1M QD4 rnd write | 645 | 661 |
1M QD4 seq read | 2348 | 2348 |
1M QD4 seq write | 664 | 680 |
Not so shabby! My system is able to do random writes at up to 660MB/s for large request sizes and serve 10k IOPS at small request sizes. This is certainly helped a lot by the ZFS caches and the Optane card, but hey, that's what they're supposed to do. For a 3-disk system I'd call it a day, since performance is much better than my success criteria even with default ZFS settings.
However, there is still the fact that performance within the VMs is terrible, and with the results so far I have pretty much ruled out ZFS as the root cause. So what could it be?
Results from VM
Measuring IO within the VM confirms my impression. There is a huge gap compared to the numbers I see on the host, ranging from 85x at 4K to 6x at 1M request sizes.
Pattern | IOPS | MB/s |
---|---|---|
4K QD4 rnd read | 126 | 0.5 |
4K QD4 rnd write | 124 | 0.5 |
4K QD4 seq read | 28192 | 113 |
4K QD4 seq write | 125 | 0.5 |
64K QD4 rnd read | 9626 | 616 |
64K QD4 rnd write | 126 | 8 |
64K QD4 seq read | 17925 | 1120 |
64K QD4 seq write | 126 | 8 |
1M QD4 rnd read | 1087 | 1088 |
1M QD4 rnd write | 94 | 97 |
1M QD4 seq read | 1028 | 1028 |
1M QD4 seq write | 96 | 99 |
What the heck is going on here?
Working theories
ZFS
The following parameters help adjust ZFS behavior to a specific system. The size of the ARC should be defined based on spare DRAM; in my case about 16 of the 32GB RAM are assigned to VMs, so I chose to limit the ZFS ARC to 12GB. Doing that requires a Linux kernel module option, which takes effect after reloading the module.
```
$ vim /etc/modprobe.d/zfs.conf
```
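The file's content is not reproduced above; a minimal sketch of the relevant line, assuming the standard zfs_arc_max module parameter (which takes a byte value, here 12GiB):

```
# /etc/modprobe.d/zfs.conf - limit the ARC to 12 GiB (12 * 1024^3 bytes)
options zfs zfs_arc_max=12884901888
```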
I assigned a quite speedy Intel Optane 900p card as ZIL and L2ARC. The default write throughput limit for the L2ARC is a rather low 8MB/s, chosen with slower cache devices in mind. Since the Optane card is independent of my HDDs, I raised this limit to 1GB/s instead. Note that this can harm pool performance if the L2ARC does not live on a dedicated device.
```
$ vim /etc/modprobe.d/zfs.conf
```
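Again the exact file content isn't shown above; assuming the standard l2arc_write_max parameter (default 8MB per feed cycle), the lines could look roughly like this:

```
# /etc/modprobe.d/zfs.conf - allow up to 1 GiB per L2ARC feed cycle
options zfs l2arc_write_max=1073741824
# optionally raise the warm-up boost to the same value
options zfs l2arc_write_boost=1073741824
```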
Further low-level tuning seems unnecessary until the VM comes close to the numbers seen at the host. So what can cause this? Looking at the architecture, data within VMs uses the following path:
HDDs <-> Cache <-> ZFS <-> Dataset <-> VM image <-> KVM <-> LVM <-> Encryption <-> VM file system
Dataset
Disks, ZFS and caches are ruled out, so let's do a sanity check on my datasets. My VM images are stored on ZFS using datasets like storage/vm-100-disk-1 instead of being stored as files on the pool directly. This setup allows specifying some per-VM settings in ZFS, for example compression. One dataset property in particular made me curious:
```
$ zfs get all storage/vm-100-disk-1
```
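The full output is long; an illustrative excerpt of the property in question (the SOURCE column is an assumption) looks roughly like this:

```
NAME                   PROPERTY      VALUE  SOURCE
storage/vm-100-disk-1  volblocksize  8K     default
```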
The volblocksize property is relevant for aligning the dataset's block size with the physical disks' sector size. Since I'm using 4Kn disks, my sector size is 4K, not 8K - leading to a misalignment and potentially wasted storage accesses.
```
$ cat /sys/block/sda/queue/hw_sector_size
```
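For a more complete picture, the physical and logical sector sizes can be checked separately (on 4Kn drives both should report 4096):

```
# physical vs. logical sector size as reported by the kernel
cat /sys/block/sda/queue/physical_block_size
cat /sys/block/sda/queue/logical_block_size
```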
I don't know exactly why the dataset was created with an 8K volblocksize, but since I migrated some datasets around it's possible that this was set when the dataset was originally created on an SSD; SSDs tend to report 8K blocks. Setting this to an aligned value just makes sense in every way. Note that volblocksize can normally only be set while a volume is still empty, so in practice this may mean recreating the dataset and migrating the data:
```
$ zfs set volblocksize=4K storage/vm-100-disk-1
```
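If the set command is refused because the volume already holds data, a rough sketch of the recreate-and-copy route (size and names are placeholders):

```
# create a new zvol with the desired block size, then copy the old one over
zfs create -V 32G -o volblocksize=4K storage/vm-100-disk-1-new
dd if=/dev/zvol/storage/vm-100-disk-1 of=/dev/zvol/storage/vm-100-disk-1-new bs=1M status=progress
```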
Compression
Next up is compression. It's common sense that compression consumes some resources, and ZFS is no exception here. It already uses a quite fast and efficient default (LZ4), and I benchmarked the performance impact of switching compression off to be around 10%. Choosing this setting is really not just about speed; depending on the data, it can help save a lot of space and money. Benchmarks create random data, which is hard to compress. I decided to keep it enabled for all datasets, since ZFS already figures out whether the data it writes is compressible or not. However, if maximum performance is the goal, it can be disabled:
```
$ zfs set compression=off storage/vm-100-disk-1
```
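Before disabling it, it is worth checking what compression actually achieves on the dataset, for example:

```
# current compression algorithm and achieved ratio for this dataset
zfs get compression,compressratio storage/vm-100-disk-1
```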
Sync
ZFS can force every write request to be synchronous instead of leaving that choice to the issuing application. Synchronous writes make sure data has actually reached non-volatile storage before the IO request is confirmed. If even minimal “in-flight” data loss is unacceptable, one can use sync=always at the expense of some throughput. I found the effect on write performance to be almost 20%, and since I have a UPS running I decided to stay with the default, which allows asynchronous writes. This of course will not save me from PSU or cable failures, but I take the chance.
```
$ zfs set sync=standard storage/vm-100-disk-1
```
atime
By default, ZFS stores the last access time of files. For datasets that just hold a RAW image this does not make a lot of sense, and disabling it saves an extra write whenever the image is accessed.
```
$ zfs set atime=off storage/vm-100-disk-1
```
VM image
The RAW image of the VM is pretty much off the table, since it's just a bunch of blocks. I'd be careful with using qcow2 images on top of ZFS, though: ZFS is already a copy-on-write system, and two levels of CoW don't mix that well.
KVM
I manage my virtual machines using Proxmox and have chosen KVM as the hypervisor. Since it emulates hardware, including mapping the RAW image to a configurable storage interface, there is a good chance of a big impact here. Based on some posts I had chosen virtio-scsi as the storage device, since I thought its discard feature would help move orphaned data out of ZFS. I had also chosen the writeback cache because its description sounded promising, without ever testing its impact. So I played around with the options and found that virtio-block as the device and none as the cache mode lead to massive performance improvements! Just look at the benchmark results after this change:
Pattern | IOPS | MB/s |
---|---|---|
4K QD4 rnd read | 19634 | 79 |
4K QD4 rnd write | 3256 | 13 |
4K QD4 seq read | 151791 | 607 |
4K QD4 seq write | 2529 | 10 |
64K QD4 rnd read | 7922 | 507 |
64K QD4 rnd write | 909 | 58 |
64K QD4 seq read | 18044 | 1128 |
64K QD4 seq write | 1533 | 98 |
1M QD4 rnd read | 657 | 673 |
1M QD4 rnd write | 264 | 271 |
1M QD4 seq read | 805 | 824 |
1M QD4 seq write | 291 | 299 |
The iothread option had a minor but still noticeable impact as well:
Pattern | IOPS | MB/s |
---|---|---|
4K QD4 rnd read | 26240 | 105 |
4K QD4 rnd write | 4011 | 16 |
4K QD4 seq read | 158395 | 634 |
4K QD4 seq write | 3067 | 12 |
64K QD4 rnd read | 10422 | 667 |
64K QD4 rnd write | 1495 | 96 |
64K QD4 seq read | 9087 | 582 |
64K QD4 seq write | 1557 | 100 |
1M QD4 rnd read | 908 | 930 |
1M QD4 rnd write | 254 | 261 |
1M QD4 seq read | 1650 | 1650 |
1M QD4 seq write | 303 | 311 |
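For reference, this is roughly what the relevant part of the Proxmox VM configuration could look like after these changes (the storage ID and VM ID are taken from the dataset name above and are otherwise assumptions):

```
# /etc/pve/qemu-server/100.conf (excerpt, illustrative)
# VirtIO block device, no host-side cache, dedicated IO thread
virtio0: storage:vm-100-disk-1,cache=none,iothread=1
```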
Getting from 124 to 4011 random write IOPS at 4K is quite an impressive improvement already. It turns out that blindly tweaking ZFS/dataset properties can get you into trouble very easily. The biggest issue, however, was the KVM storage controller setting, which I believe comes down to a bug in the virtio-scsi controller emulation.
File systems
Next in the stack are the file system and volume manager of the virtual machine, which sit on the virtual storage device. I used Debian's defaults of LVM and ext4, because defaults are always great, right? Wrong! Even though LVM is actually just a thin layer, it turned out to have quite some effect: testing with and without LVM showed that using a plain old GPT, or no partition table at all (if that's an option), led to a 10% improvement. Looking at file systems, xfs and ext4 appear to be bad choices for my environment; switching to ext3 (or ext2) improved performance by another 30% in some cases!
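As a reference, a minimal sketch of how a guest disk could be set up without LVM (the device name /dev/vda is an assumption, adjust to your VM):

```
# inside the VM: plain GPT partition table and ext3 instead of LVM + ext4
parted -s /dev/vda mklabel gpt mkpart primary ext3 1MiB 100%
mkfs.ext3 /dev/vda1
```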
Pattern | IOPS | MB/s |
---|---|---|
4K QD4 rnd read | 30393 | 122 |
4K QD4 rnd write | 4222 | 17 |
4K QD4 seq read | 164456 | 658 |
4K QD4 seq write | 3281 | 13 |
64K QD4 rnd read | 9256 | 592 |
64K QD4 rnd write | 1813 | 116 |
64K QD4 seq read | 694 | 711 |
64K QD4 seq write | 1877 | 120 |
1M QD4 rnd read | 1207 | 1207 |
1M QD4 rnd write | 385 | 395 |
1M QD4 seq read | 1965 | 1966 |
1M QD4 seq write | 419 | 430 |
Encryption
When enabling full-disk encryption (LUKS) for the virtual drive, performance dropped a lot again. Of course that's expected to a certain degree, but the numbers went below my acceptance criteria:
Pattern | IOPS | MB/s |
---|---|---|
4K QD4 rnd read | 10530 | 42 |
4K QD4 rnd write | 3637 | 15 |
4K QD4 seq read | 52819 | 211 |
4K QD4 seq write | 4216 | 17 |
64K QD4 rnd read | 1710 | 109 |
64K QD4 rnd write | 1178 | 75 |
64K QD4 seq read | 3269 | 209 |
64K QD4 seq write | 1217 | 78 |
1M QD4 rnd read | 141 | 145 |
1M QD4 rnd write | 94 | 97 |
1M QD4 seq read | 155 | 159 |
1M QD4 seq write | 94 | 96 |
There actually is a catch with encryption: the encryption layer tries to be as fast as possible and therefore encrypts blocks in parallel, which can mess up optimizations for writing blocks sequentially. I have not validated this in detail, but going single-core within the VM did in fact show a 25% improvement at small request sizes. Anyway, I don't want to sacrifice CPU cores, especially not when encrypting all the time. Since encryption is not really storage related, I compared encryption speed on the host and in the VM:
```
$ cryptsetup benchmark
```
Quite consistent results; however, looking at the host revealed a different truth:
```
$ cryptsetup benchmark
```
The numbers for AES-based algorithms are through the roof on the host. The reason is the native AES implementation (AES-NI) of recent Intel CPUs. Proxmox defaults the KVM “CPU model” to “kvm64”, which does not pass AES-NI through. Using the host CPU type exposes the host's CPU directly to the VM, which led to a huge boost again. Note that this might be a security risk on shared systems; in my case I'm in full control of the system, so it does not matter.
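In Proxmox the CPU type can be changed in the VM's hardware options or on the command line; a sketch (the VM ID is taken from the dataset name above):

```
# expose the host CPU, including AES-NI, to VM 100
qm set 100 --cpu host
```

So let's check the final results: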
Pattern | IOPS | MB/s |
---|---|---|
4K QD4 rnd read | 26449 | 106 |
4K QD4 rnd write | 6308 | 25 |
4K QD4 seq read | 158490 | 634 |
4K QD4 seq write | 6387 | 26 |
64K QD4 rnd read | 9092 | 582 |
64K QD4 rnd write | 2317 | 148 |
64K QD4 seq read | 17847 | 1116 |
64K QD4 seq write | 2308 | 148 |
1M QD4 rnd read | 454 | 466 |
1M QD4 rnd write | 240 | 246 |
1M QD4 seq read | 806 | 826 |
1M QD4 seq write | 223 | 229 |
Finally my VM reaches the goal of saturating a 1Gbps link. 150-250MB/s of random writes on 3 disks, while using encryption and compression, is pretty neat!
Lessons learned
- Always question and validate changes done to complex systems
- Use virtio-blk, host CPU, iothread and no storage cache on KVM
- Make sure dataset block size is aligned to hardware
- Consider disabling compression and access time (atime) on datasets
- Avoid using LVM within VMs, consider ext3 over ext4