ZFS performance tuning

A generic piece of advice on tuning

ZFS is a mature piece of software, engineered by file- and storage-system experts with lots of practical experience. Sun invested a lot of money in it and built enterprise-grade appliances around it for a decade. Always keep this in mind when optimizing it: there is good reason to trust that the defaults were chosen very deliberately, even if they do not appear obvious at first glance. Keep in mind that enterprise-class storage systems are primarily about safety, not about speed at all cost. If there were settings that were better in every regard, someone would already have figured that out and made them the default.

That being said, there is usually room to optimize software to fit a specific workload and environment. Doing so can save a lot of money compared to throwing more expensive hardware at a problem. This process is based on knowledge, context, testing methodology and goals. Always verify that a change has an actual, holistic positive impact and roll it back if not. Optimizing complex systems requires a systematic approach, which is not pasting in every setting that has ever been suggested on the internet. It is very likely that random settings which worked for one specific case won't yield any improvement in another, but will instead introduce problems or even data loss. The same applies to the suggestions in this article: replicating settings that worked well for me could be totally worthless for you.

Before actually changing anything, make sure you understand the underlying concepts, have read or listened to the relevant documentation and are in a position to second-guess suggestions made by others. Make sure you understand the context in which ZFS is operating and define plausible success criteria. It does not make sense to aim for 2000 IOPS at 4K blocks from a single HDD or to expect 1GB/s throughput on encrypted storage on a Raspberry Pi. It is also not useful to expect the same kind of performance for every workload, since each configuration and optimization stands on its own. If you don't know how block storage works in general, or which parameters are relevant to measure and rate storage systems, gather that knowledge first. Only when you can say with confidence that you understand what you are doing, why you are doing it and what you roughly expect, and have found a proper testing methodology, should you attempt to “tune” a complex system such as ZFS.

Context

I'm using a 3-way RAIDZ1 array of HGST HUH721010ALN600 disks (10TB, 7200rpm, 4Kn) and an Intel Optane 900p card as ZIL/L2ARC in an entry-level server (E3-1260L, 32GB, 2x1Gbps) running Debian Linux and Proxmox/KVM for virtualization. The virtual machines (currently 10) run headless Debian Linux and provide general-purpose residential services such as mail, file, web, VPN, authentication and monitoring. This article was written while running ZFS on Linux (“ZoL”) 0.7.6.

Current situation

Storage access within the VMs is terribly slow and the host system shows high IOwait numbers. Encrypted disks in particular almost flat-line when moving data around.

Defining success criteria

My goal is to fully saturate one of the server's 1Gbps links with a 10GB file transfer from within a virtual machine that does full-disk encryption. I also want my VMs to be snappy and to deliver their services without significant latency, even when other VMs are busy. The first is an objective goal regarding throughput which can easily be measured, the second a subjective one regarding latency.

Storage benchmark background

Benchmark parameters

Storage benchmarks have a few important variables:

  • Test file size, which should be large enough to get past cache sizes and represent real-world usage.
  • IO request size, usually between 4K and 1M, depending on the workload. Databases are more at the 4K side while moving large files is more at the 1M side.
  • Access pattern, random or sequential
  • Queue depth, which is the number of IO commands that are issued by an application and queued in the controller at the same time. Depending on the drive, those commands can be executed in parallel (SSDs). Some queue saturation can be beneficial for performance, but too much parallelism can severely impact latency, especially on HDDs.
  • Distribution of write and read access, based on application type. Web servers usually trigger about 95% reads, databases are usually around 75% reads and 25% writes, and specific applications like log servers can even be 95% writes. This heavily influences how effectively caches are used, for example (an example of modeling such a mix is shown below).
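
The fio tool used later in this article can model such mixed profiles directly. As a sketch, a hypothetical 75% read / 25% write random workload at 4K, using the same parameters as the benchmarks below:

$ fio --filename=test --sync=1 --rw=randrw --rwmixread=75 --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test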

Benchmark results

A very relevant value we get out of such a test is latency, which translates into IOPS, which translates into throughput. As a rough example, if an IO request takes 1ms (latency) and we use a request size of 4KiB, we will get about 4000KiB per second (roughly 4MiB/s) of throughput out of this device in a perfect scenario. 1ms of latency is already very low for an HDD, which is why HDDs suck at small request sizes.

When running random access on spindle storage, throughput can drop even further as the read/write heads need to reposition all the time. Solid-state storage does not have that mechanical impairment. If we crank up the request size to 64KiB, we suddenly get 64MiB/s out of the same drive. Latency is not constant either, due to storage device characteristics, especially for random access. Therefore the latency percentiles are more interesting than the average: a 99th percentile of 1ms means that 99% of all IO requests finished within 1ms and 1% took longer. This gives an idea of how consistent latency is.
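
As a quick sanity check of that rule of thumb (throughput ≈ request size × 1/latency at queue depth 1), the two cases above can be recomputed on the shell:

$ echo "(1/0.001) * 4 / 1024" | bc -l     # 1ms latency, 4KiB requests  -> ~3.9 MiB/s
$ echo "(1/0.001) * 64 / 1024" | bc -l    # 1ms latency, 64KiB requests -> ~62.5 MiB/s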

Limitations

At some point, with lower latency or larger request sizes, we hit a throughput limit defined by mechanical constraints, internal transfers to cache, or the external interface to the storage controller, usually 6 or 12Gb/s for SATA/SAS and 16 or 32Gb/s for PCIe. Even high-end HDDs are capped by their rotation speed, which affects both latency and throughput. Modern SSDs are usually capped by their external storage interface or by thermal throttling during sequential access. Random access is usually limited by the memory cell technology (NAND, 3D XPoint) or by controller characteristics.

Storage layout decisions introduce limitations as well. Ten disks in a mirror provide the same write performance as a single disk; actually, it depends on the slowest disk in the array. Of course drives should be matched in such arrays, but there are always variances and drive performance tends to degrade over time. Running the same 10 disks as a stripe, we can expect almost 10x the performance of a single drive, assuming the other components can handle it. A RAIDZ1 with three disks can in theory provide the same level of performance as a single drive. On top of checksums, ZFS calculates parity and stores it on a second drive, which makes RAIDZ1 quite CPU/memory hungry and means a single write request occupies two disks.
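
For illustration, this is roughly how those three layouts are created with zpool; the pool name and device names are placeholders (a real pool should reference stable /dev/disk/by-id/ paths), and only one of these would be run on a given set of disks:

$ zpool create tank mirror sda sdb        # mirror: write performance limited to the slowest disk
$ zpool create tank sda sdb               # stripe: writes spread across both disks
$ zpool create tank raidz1 sda sdb sdc    # raidz1: one disk worth of parity written alongside the data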

File systems themselves have characteristics that impact performance. There are simple file systems like ext2 or FAT which just put a block on disk and read it back. Other systems are more advanced to avoid data loss, for example by keeping a journal or checksumming the data that was written. All those extra features require resources and can reduce file-system performance. Last but certainly not least, properties like sector size should be aligned between the file system and the physical hardware to avoid unnecessary operations like read-modify-write.

Caches

Caches are very helpful to speed things up, but they are also a complication for benchmarks that needs to be taken into account. After all, we want results for the storage system, not for system RAM or other caches. Caches are there for a reason, so they should not be disabled for benchmarking; instead, real-world data and access patterns need to be used for testing.

HDDs and NAND SSDs usually have a very quick but small internal cache of 128MB to 1GB. This is not just used for buffering but also for internal organization, especially on SSDs, which have to take care of wear leveling and compression.
Some HBAs have additional, much larger caches of their own that support the storage array as a whole instead of individual drives.
ZFS specifically adds a whole range of caches (ZIL, ARC, L2ARC) independent of the hardware, as ZFS expects to access drives directly with no “intelligent” controller in between. The way they work can be changed but is already optimized for most workloads; their size, however, can and should be matched to the system configuration.
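
On the host, the pool layout (including any dedicated log and cache devices) and the ARC hit counters can be inspected directly; the pool name storage matches the datasets used later in this article:

$ zpool status storage                                                  # lists logs (ZIL/SLOG) and cache (L2ARC) vdevs
$ awk '$1 == "hits" || $1 == "misses"' /proc/spl/kstat/zfs/arcstats     # ARC hit/miss counters on ZFS on Linux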

Analysis

First benchmarking

File transfers from and to the server are very unstable, bouncing between 20 and 60MB/s. Those values are not very helpful and include a lot of unnecessary moving parts (client computer, network, ...), so I decided to benchmark the VM locally for random and sequential reads and writes. For this I chose fio, a handy IO benchmarking tool for Linux and other platforms.

To find out what my array is actually capable of, I started by benchmarking ZFS directly on the host system. This removes several layers of indirection which could hide the root cause of the bad performance. Starting there also let me find out how different benchmark settings would affect my results.

I created a matrix of benchmark settings and IOPS/throughput results and started with request sizes of 4KiB, 64KiB and 1MiB at queue depths of 1, 4, 8 and 16, for random read, random write, sequential read and sequential write patterns. At this point I kept my application profile simple, since I was more interested in how reads and writes perform in general, again reducing the complexity of mixed workloads that could hide bottlenecks.

The results told me that there is a negligible difference between queue depths, so I stuck with QD4 for all further tests. Second, read performance is crazy high, indicating that the ZFS caches are doing what they are supposed to do. The test first creates a data block - which ZFS stores in the ARC (i.e. DRAM) or L2ARC (Intel Optane 900p) - and then reads that very same block back from those caches. This is not a usual real-world scenario, so I put more emphasis on write performance.

fio commands

During my benchmarks I used the following fio parameters. Adjust the block size bs accordingly:

Pattern           Command
Random read       fio --filename=test --sync=1 --rw=randread --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
Random write      fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
Sequential read   fio --filename=test --sync=1 --rw=read --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
Sequential write  fio --filename=test --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
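
While a benchmark is running, it can be useful to watch from a second shell on the host how the load is distributed across the data disks, the log and the cache device (pool name as used in this setup):

$ zpool iostat -v storage 1     # per-vdev bandwidth and IOPS, refreshed every second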

Results from ZFS

Pattern             IOPS    MB/s
4K QD4 rnd read    47464     190
4K QD4 rnd write   10644      43
4K QD4 seq read   347210    1356
4K QD4 seq write   16020      64
64K QD4 rnd read   62773    3923
64K QD4 rnd write   5039     323
64K QD4 seq read   58514    3657
64K QD4 seq write   5497     352
1M QD4 rnd read     6872    6872
1M QD4 rnd write     645     661
1M QD4 seq read     2348    2348
1M QD4 seq write     664     680

Not too shabby! My system can do random writes of up to 660MB/s at large request sizes and serve 10k IOPS at small request sizes. This is certainly helped a lot by the ZFS caches and the Optane card, but hey, that's what they're there for. For a 3-disk system I'd call it a day, since performance is much better than my success criteria even with default ZFS settings.

However, there is still the fact that performance within the VMs is terrible, and with the results so far I have pretty much ruled out ZFS as the root cause. So what could it be?

Results from VM

Measuring IO within the VM confirms my impression. There is a huge gap compared to the numbers I see on the host, ranging from 85x at 4K to 6x at 1M request sizes.

Pattern             IOPS    MB/s
4K QD4 rnd read      126     0.5
4K QD4 rnd write     124     0.5
4K QD4 seq read    28192     113
4K QD4 seq write     125     0.5
64K QD4 rnd read    9626     616
64K QD4 rnd write    126       8
64K QD4 seq read   17925    1120
64K QD4 seq write    126       8
1M QD4 rnd read     1087    1088
1M QD4 rnd write      94      97
1M QD4 seq read     1028    1028
1M QD4 seq write      96      99

What the heck is going on here?

Working theories

ZFS

The following parameters help to adjust ZFS behavior to a specific system. The size of the ARC should be defined based on spare DRAM; in my case about 16 of the 32GB RAM are assigned to VMs, so I chose to limit the ZFS ARC to 12GB. Doing so requires a Linux kernel module option, which takes effect after reloading the module.

$ vim /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=12884901888
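
On ZFS on Linux this limit can usually also be applied at runtime by writing to the module parameter, avoiding a reboot or module reload; whether the new limit took effect can be checked against the c_max field in arcstats (12GB = 12884901888 bytes):

$ echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_max
$ awk '$1 == "c_max"' /proc/spl/kstat/zfs/arcstats     # should now report 12884901888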

I assigned a quite speedy Intel Optane 900p card as ZIL and L2ARC. By default, ZFS throttles L2ARC writes to a rather conservative 8MB/s, a limit sized for cache devices that are not much faster than the pool. Since the Optane card is independent from my HDDs, I raised this limit to 1GB/s instead. Note that such a high limit can harm pool performance if the L2ARC does not sit on a dedicated device.

$ vim /etc/modprobe.d/zfs.conf
options zfs l2arc_write_max=1048576000
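
Whether the new limit is active, and whether the L2ARC is actually being used, can be checked the same way as for the ARC:

$ cat /sys/module/zfs/parameters/l2arc_write_max                                    # current limit in bytes
$ awk '$1 == "l2_hits" || $1 == "l2_misses" || $1 == "l2_size"' /proc/spl/kstat/zfs/arcstats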

Further low-level tuning seems unnecessary until the VM gets close to the numbers seen on the host. So what can cause this? Looking at the architecture, data within the VMs travels the following path:

HDDs <-> Cache <-> ZFS <-> Dataset <-> VM image <-> KVM <-> LVM <-> Encryption <-> VM file system

Dataset

Disks, caches and ZFS are ruled out, so let's do a sanity check on my datasets. My VM images are stored on ZFS as volume datasets (zvols) like storage/vm-100-disk-1 instead of being stored as files directly on the pool. This setup allows per-VM settings in ZFS, for example compression. One dataset property in particular made me curious:

$ zfs get all storage/vm-100-disk-1
storage/vm-100-disk-1  volsize       10G  local
storage/vm-100-disk-1  volblocksize  8K   -
storage/vm-100-disk-1  checksum      on   default

The volblocksize property aligns the dataset's block size with the physical disks' sector size. Since I'm using 4Kn disks, my sector size is 4K, not 8K - leading to a misalignment and potentially wasted storage accesses.

$ cat /sys/block/sda/queue/hw_sector_size
4096

I don't know exactly why the dataset was created with an 8K volblocksize, but since I migrated some datasets around, it is possible this was set when the dataset was originally created on an SSD; SSDs tend to have 8K blocks. Setting this to an aligned value just makes sense in every way:

$ zfs set volblocksize=4K storage/vm-100-disk-1
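
One caveat: as far as I know, volblocksize is normally fixed at creation time, so the set command above may be rejected for an existing zvol. In that case the volume has to be recreated with the desired block size and the data copied over while the VM is stopped; a rough sketch with an assumed target name:

$ zfs create -V 10G -o volblocksize=4K storage/vm-100-disk-1-4k
$ dd if=/dev/zvol/storage/vm-100-disk-1 of=/dev/zvol/storage/vm-100-disk-1-4k bs=1M status=progress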

Compression

Next up is compression. It's common sense that compression consumes resources, and ZFS is no exception. It already uses a fast and efficient default (LZ4), and I benchmarked the impact of switching compression off at around 10%. Choosing this setting is not just about speed; depending on the data it can save a lot of space and money. Benchmarks create random data, which is hard to compress. I decided to keep compression enabled for all datasets, since ZFS already figures out whether the data it writes is compressible. If raw performance is the priority, however, it can be disabled:

$ zfs set compression=off storage/vm-100-disk-1

Sync

ZFS honors an application's request to make a write synchronous instead of asynchronous. A synchronous write makes sure data has actually reached non-volatile storage before the IO request is acknowledged. If even minimal loss of “in-flight” data is unacceptable, one can use sync=always to force this for every write, at the expense of some throughput. I found the effect on write performance to be almost 20%, and since I have a UPS running I decided to stay with the default, which allows asynchronous writes. This of course will not save me from PSU or cable failures, but I'll take the chance.

$ zfs set sync=standard storage/vm-100-disk-1

atime

By default, ZFS records the last access time of files. For datasets that just hold a RAW image, this does not make a lot of sense; disabling it saves an extra write after storage requests.

$ zfs set atime=off storage/vm-100-disk-1
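
The dataset-level settings discussed in this section can be verified in one go:

$ zfs get volblocksize,compression,sync,atime storage/vm-100-disk-1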

VM image

The RAW image of the VM is pretty much off the table since it's just a bunch of blocks. I'd be careful with qcow2 images on top of ZFS, though: ZFS is already a copy-on-write system, and two layers of CoW don't mix that well.

KVM

I manage my virtual machines with Proxmox and use KVM as the hypervisor. Since it emulates hardware, including mapping the RAW image to a configurable storage interface, there is a good chance of a big impact here. Based on some posts I had chosen virtio-scsi as the storage device because I thought its discard feature would help move orphaned data out of ZFS. I had also chosen the writeback cache because its description sounded promising, without ever testing its impact. So I played around with the options and found that virtio-blk as the device and none as the cache mode lead to massive performance improvements (a config sketch follows the results below)! Just look at the benchmark results after this change:

Pattern             IOPS    MB/s
4K QD4 rnd read    19634      79
4K QD4 rnd write    3256      13
4K QD4 seq read   151791     607
4K QD4 seq write    2529      10
64K QD4 rnd read    7922     507
64K QD4 rnd write    909      58
64K QD4 seq read   18044    1128
64K QD4 seq write   1533      98
1M QD4 rnd read      657     673
1M QD4 rnd write     264     271
1M QD4 seq read      805     824
1M QD4 seq write     291     299

The iothread option had a minor but still noticeable impact as well:

Pattern             IOPS    MB/s
4K QD4 rnd read    26240     105
4K QD4 rnd write    4011      16
4K QD4 seq read   158395     634
4K QD4 seq write    3067      12
64K QD4 rnd read   10422     667
64K QD4 rnd write   1495      96
64K QD4 seq read    9087     582
64K QD4 seq write   1557     100
1M QD4 rnd read      908     930
1M QD4 rnd write     254     261
1M QD4 seq read     1650    1650
1M QD4 seq write     303     311
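
For reference, this is roughly what the relevant disk line of the VM configuration looks like in Proxmox after the change; the storage name and disk match the dataset used earlier, and the exact option syntax may differ between Proxmox versions:

# /etc/pve/qemu-server/100.conf (excerpt, sketch)
virtio0: storage:vm-100-disk-1,cache=none,iothread=1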

Getting from 124 to 4011 random write IOPS at 4K is quite an impressive improvement already. It turns out that blindly tweaking ZFS/dataset properties can get you into trouble very easily. The biggest issue, however, was the KVM storage controller setting, which I believe has to be a bug in the virtio-scsi controller emulation.

File systems

Next in the stack are the volume manager and file system of the virtual machine, which sit on top of the virtual storage device. I had used Debian's defaults of LVM and ext4, because defaults are always great, right? Wrong! Even though LVM is just a thin layer, it turned out to have quite an effect: testing with and without it showed that a plain old GPT, or no partition table at all (if that's an option), led to a 10% improvement. Looking at file systems, xfs and ext4 appear to be bad choices for my environment; switching to ext3 (or ext2) improved performance by another 30% in some cases (a sketch of the simpler layout follows the results below)!

Pattern             IOPS    MB/s
4K QD4 rnd read    30393     122
4K QD4 rnd write    4222      17
4K QD4 seq read   164456     658
4K QD4 seq write    3281      13
64K QD4 rnd read    9256     592
64K QD4 rnd write   1813     116
64K QD4 seq read     694     711
64K QD4 seq write   1877     120
1M QD4 rnd read     1207    1207
1M QD4 rnd write     385     395
1M QD4 seq read     1965    1966
1M QD4 seq write     419     430
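
For completeness, a minimal sketch of the simpler in-guest layout described above, assuming the virtual disk shows up as /dev/vda inside the guest; this wipes the disk, so it is only meant for a fresh VM:

$ parted -s /dev/vda mklabel gpt mkpart primary 1MiB 100%   # plain GPT, no LVM
$ mkfs.ext3 /dev/vda1                                       # ext3 performed best in my tests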

Encryption

When enabling full-disk encryption (LUKS) on the virtual drive, performance dropped a lot again. Of course that is expected to a certain degree, but the numbers went below my acceptance criteria:

Pattern             IOPS    MB/s
4K QD4 rnd read    10530      42
4K QD4 rnd write    3637      15
4K QD4 seq read    52819     211
4K QD4 seq write    4216      17
64K QD4 rnd read    1710     109
64K QD4 rnd write   1178      75
64K QD4 seq read    3269     209
64K QD4 seq write   1217      78
1M QD4 rnd read      141     145
1M QD4 rnd write      94      97
1M QD4 seq read      155     159
1M QD4 seq write      94      96

There is a catch with encryption: the encryption layer tries to be as fast as possible and therefore encrypts blocks in parallel, which can defeat optimizations that write blocks sequentially. I have not validated this in detail, but going single-core within the VM did show a 25% improvement at small request sizes. Anyway, I don't want to sacrifice CPU cores, especially not when encrypting all the time. Since encryption is not really storage related, I compared encryption speed in the VM and on the host. First, inside the VM:

$ cryptsetup benchmark
# Algorithm | Key | Encryption | Decryption
aes-cbc       128b   207.6 MiB/s   243.0 MiB/s
serpent-cbc   128b    82.0 MiB/s   310.6 MiB/s
twofish-cbc   128b   168.7 MiB/s   192.0 MiB/s
aes-cbc       256b   191.4 MiB/s   199.6 MiB/s
serpent-cbc   256b    88.3 MiB/s   278.8 MiB/s
twofish-cbc   256b   151.6 MiB/s   171.5 MiB/s
aes-xts       256b   266.2 MiB/s   251.4 MiB/s
serpent-xts   256b   286.3 MiB/s   285.9 MiB/s
twofish-xts   256b   191.7 MiB/s   195.6 MiB/s
aes-xts       512b   201.8 MiB/s   197.8 MiB/s
serpent-xts   512b   276.3 MiB/s   261.3 MiB/s
twofish-xts   512b   187.0 MiB/s   185.7 MiB/s

Quite consistent results; however, looking at the host revealed a different truth:

$ cryptsetup benchmark
# Algorithm | Key | Encryption | Decryption
aes-cbc       128b  1036.2 MiB/s  3206.6 MiB/s
serpent-cbc   128b    83.9 MiB/s   658.9 MiB/s
twofish-cbc   128b   192.5 MiB/s   316.4 MiB/s
aes-cbc       256b   767.6 MiB/s  2538.9 MiB/s
serpent-cbc   256b    83.9 MiB/s   657.0 MiB/s
twofish-cbc   256b   198.2 MiB/s   356.7 MiB/s
aes-xts       256b  3152.5 MiB/s  3165.3 MiB/s
serpent-xts   256b   612.8 MiB/s   541.7 MiB/s
twofish-xts   256b   343.1 MiB/s   351.5 MiB/s
aes-xts       512b  2361.9 MiB/s  2483.2 MiB/s
serpent-xts   512b   632.8 MiB/s   622.9 MiB/s
twofish-xts   512b   349.5 MiB/s   352.1 MiB/s

Numbers for AES-based algorithms are through the roof on the host. The reason is AES-NI, the native AES instruction set on recent Intel CPUs. Proxmox defaults the KVM “CPU model” to “kvm64”, which does not pass AES-NI through. Using the host CPU type exposes the CPU directly to the VM, which led to another huge boost. Note that this might be a security risk on shared systems; in my case I'm in full control of the system, so it does not matter (a sketch of the change follows the results below). So let's check the final results:

Pattern             IOPS    MB/s
4K QD4 rnd read    26449     106
4K QD4 rnd write    6308      25
4K QD4 seq read   158490     634
4K QD4 seq write    6387      26
64K QD4 rnd read    9092     582
64K QD4 rnd write   2317     148
64K QD4 seq read   17847    1116
64K QD4 seq write   2308     148
1M QD4 rnd read      454     466
1M QD4 rnd write     240     246
1M QD4 seq read      806     826
1M QD4 seq write     223     229
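
For reference, the CPU type change can be applied from the Proxmox CLI (VM 100 used as an example); once the VM has been restarted, the aes CPU flag should be visible inside the guest:

$ qm set 100 --cpu host           # expose the host CPU, including AES-NI, to the VM
$ grep -m1 -o aes /proc/cpuinfo   # run inside the VM; prints "aes" if AES-NI is available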

Finally my VM reaches the goal of saturating a 1Gbps link. 150-250MB/s random write on 3 disks, with encryption and compression, is pretty neat!

Lessons learned

  1. Always question and validate changes made to complex systems
  2. Use virtio-blk, the host CPU type, iothread and no storage cache on KVM
  3. Make sure the dataset block size is aligned with the hardware sector size
  4. Consider disabling compression and access time (atime) on datasets
  5. Avoid using LVM within VMs, and consider ext3 over ext4