ZFS performance tuning

Feb 23 2018

A generic piece of advice on tuning

ZFS is a mature piece of software, engineered by file- and storage-system experts with lots of knowledge from practical experience. Sun invested a lot of money and built enterprise grade appliances around it for a decade. Always keep this in mind when optimizing it: there is reason to trust that defaults are chosen very reasonably, even though they may not appear obvious at first look. Mind that enterprise class storage systems are primarily about safety and not about speed at all cost. If there were settings that were better in all regards, someone would have already figured them out and made them the default.

That being said, there is usually a chance to optimize software in order to fit the specific workload and environment. Doing so can save a lot of money compared to throwing more expensive hardware at a problem. This process is based on knowledge, context, testing methodology and goals. Always verify that a change has an actual, holistic positive impact and revert it if not. Optimizing complex systems requires a systematic approach, which is not pasting every setting that has been suggested on the internet. It’s very likely that random settings which worked for a specific case won’t yield any improvement for another case but instead introduce problems or even data loss. The same applies to any suggestion given in this article: it could be totally worthless for you to replicate settings that worked well for me.

Before actually changing anything, make sure you understand the underlying concepts, have read or listened to all relevant documentation and are in a position to second-guess suggestions made by others. Make sure you understand the context in which ZFS is operating and define plausible success criteria. It does not make sense to aim for 2000 IOPS at 4K blocks out of a single HDD or expect 1GB/s throughput on encrypted storage on a Raspberry Pi. It’s also not useful to expect the same kind of performance for any given workload, since each configuration and optimization stands for itself. If you don’t know how block storage works in general or what parameters are relevant to measure and rate storage systems, then please gather this knowledge first. Only if you can say with confidence that you understand what you are doing, why you are doing it, what you roughly expect and have found a proper testing methodology, only then should you attempt to “tune” a complex system such as ZFS.

Context

I’m using a 3-disk RAIDZ1 array with HGST HUH721010ALN600 disks (10TB, 7200rpm, 4Kn) and an Intel Optane 900p card as ZIL/L2ARC within an entry-level server (E3-1260L, 32GB, 2x1Gbps) running Debian Linux and Proxmox/KVM for virtualization. Virtual machines (currently 10) run headless Debian Linux and provide general purpose residential services such as Mail, File, Web, VPN, Authentication, Monitoring etc. This article was written while running ZFS on Linux “ZoL” 0.7.6.

Current situation

Storage access within VMs is terribly slow and the host system shows high IOwait numbers. Encrypted disks in particular almost flat-line when moving some data around.

Defining success criteria

My goal is to fully saturate one of the server’s 1Gbps links with a 10GB file transfer from within a virtual machine doing full-disk encryption. I want my VMs to be snappy and deliver their service without significant latency, even if other VMs are busy. The first is an objective goal regarding throughput which can be easily measured, the second a subjective one regarding latency.

Storage benchmark background

Benchmark parameters

Storage benchmarks have a few important variables:

Test file size, which should be large enough to get past cache sizes and represent real-world usage.
IO request size, usually between 4K and 1M, depending on the workload. Databases are closer to the 4K end while moving large files is closer to the 1M end.
Access pattern, random or sequential.
Queue depth, which is the number of IO commands that are issued by an application and queued within the controller at the same time. Depending on the drive, those commands can get executed in parallel (SSDs). Some queue saturation can be beneficial to improve performance, however too much parallelism can severely impact latency, especially for HDDs.
Distribution of write and read access based on application type. Web servers usually trigger 95% read, databases are usually 75% read and 25% write and specific applications like log servers can even use 95% write. This heavily influences how efficiently caches are used, for example.
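
As a rough illustration of how these variables map onto a benchmark tool, here is a minimal sketch that only prints a fio command line for a mixed workload. The helper name fio_mixed_cmd, the job name and the choice of the libaio engine are assumptions made for this example and are not the settings used for the measurements in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build (but do not run) a fio command for a mixed workload.
# Args: file size, request size, access pattern (rw|randrw), queue depth, read percentage
fio_mixed_cmd() {
    echo "fio --name=mixed --filename=test --size=$1 --bs=$2 --rw=$3" \
         "--iodepth=$4 --rwmixread=$5 --ioengine=libaio --runtime=300"
}

# Database-like profile: 4K random access, 75% read / 25% write, queue depth 4
fio_mixed_cmd 10G 4k randrw 4 75
```

Printing the command first makes it easy to review the parameters before actually putting load on a system.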

Benchmark results

A very relevant value we get out of this test is latency, which translates to IOPS, which in turn translates to throughput. As a rough example, if an IO request takes 1ms (latency), the device completes 1000 requests per second; at a request size of 4KiB, this means we will get 4000KiB per second (or roughly 4MiB/s) of throughput out of this device in a perfect scenario. 1ms of latency is already very low for an HDD, which is why HDDs suck at small request sizes.
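
The arithmetic can be sketched in a few lines of shell. The function name ideal_throughput is made up for this example, and the model assumes one request in flight at a time with perfectly constant latency.

```shell
#!/usr/bin/env bash
# Hypothetical helper: ideal throughput with one request in flight.
# $1 = latency per request in microseconds, $2 = request size in KiB
ideal_throughput() {
    local latency_us=$1 size_kib=$2
    local iops=$(( 1000000 / latency_us ))   # requests completed per second
    echo "${iops} IOPS, $(( iops * size_kib )) KiB/s"
}

ideal_throughput 1000 4    # 1ms latency, 4KiB requests  -> 1000 IOPS, 4000 KiB/s
ideal_throughput 1000 64   # 1ms latency, 64KiB requests -> 1000 IOPS, 64000 KiB/s
```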

When running random access on spindle storage, throughput can go down even more as read/write heads need to reposition all the time. Solid-state storage does not have that mechanical impairment. If we crank up the request size to 64KB, we suddenly get 64MB/s out of the same drive. Latency is not always the same due to storage device characteristics, especially for random access. Therefore the percentile for latency is more interesting than the average: having a 99th percentile of 1ms means that 99% of all IO requests finished within 1ms, 1% took longer. This gives an idea about consistency of latency.
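
To make the percentile idea concrete, here is a small sketch using the nearest-rank method on a list of latency samples. The function name and the sample values are invented for illustration; benchmark tools such as fio report percentiles themselves, possibly with a different interpolation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: nearest-rank percentile.
# Reads one latency value per line on stdin, $1 = percentile (1-100).
percentile() {
    LC_ALL=C sort -n | LC_ALL=C awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)   # ceil(p / 100 * NR)
            if (rank < 1) rank = 1
            print v[rank]
        }'
}

# Five made-up latencies in ms: four fast requests and one outlier
printf '%s\n' 0.8 0.9 1.0 0.7 12.5 | percentile 50   # median
printf '%s\n' 0.8 0.9 1.0 0.7 12.5 | percentile 99   # dominated by the outlier
```

With these samples the median stays below 1ms while the 99th percentile lands on the single slow request, which is exactly the inconsistency an average would hide.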

Limitations

At some point with lower latency or higher request size we will hit a throughput limit defined by mechanical constraints, internal transfer to cache or by the external interface that handles transfer to the storage controller, usually 6 or 12Gb/s for SATA/SAS, 16 or 32Gb/s for PCIe. Even high-end HDDs are capped by their rotation speed, which affects both latency and throughput. Modern SSDs are usually capped by their external storage interface or by thermal issues when doing sequential access. Random access is usually limited by memory cell technology (NAND, 3D-XPoint) or controller characteristics.

Storage layout decisions introduce limitations as well. When running 10 disks in mirror mode they will provide the same write performance as one disk would. Actually it depends on the slowest disk within the array. Of course drives should be matched in such arrays but there are always variances and drive performance tends to degrade over time. Running the same 10 disks as a stripe, we can expect almost 10x the performance of a single drive, assuming other components can handle it. A RAIDZ1 with three disks can in theory provide the same level of performance as a single drive would. On top of checksums, ZFS will calculate parity and store it on another drive. This means RAIDZ1 is quite CPU/memory hungry and will occupy at least two disks for a single write request.

File systems themselves have characteristics that impact performance. There are simple file systems like ext2 or FAT which just put a block to disk and read it. Other systems are more advanced to avoid data loss, for example keeping a journal or creating checksums of data which got written. All those extra features require resources and can reduce file-system performance. Last but certainly not least, properties like sector sizes should be aligned between file system and physical hardware to avoid unnecessary operations like read-modify-write.

Caches

Caches are very helpful to speed things up, however they are also a complication for benchmarks that needs to be taken into consideration. After all, we want to get results for the storage system, not system RAM or other caches. Caches are there for a reason, so they should not be disabled for benchmarking; instead, real-world data and patterns need to be used for testing.

HDDs and NAND SSDs usually have a very quick but small internal cache of 128MB to 1GB. This is not just used for buffering but also for internal organization, especially for SSDs which need to take care of wear leveling and compression a lot.
Some HBAs have additional caches themselves which are much larger and support the storage array instead of individual drives.
For ZFS specifically there is a whole range of caches (ZIL, ARC, L2ARC) independent of hardware, as ZFS expects to directly access drives with no “intelligent” controller in between. The way they work could be changed but is optimized for most workloads already; their size, however, can and should be matched to the system configuration.

Analysis

First benchmarking

File transfers from and to the server are very unstable, bouncing between 20 and 60MB/s. Those values are not very helpful and include a lot of unnecessary moving parts (client computer, network…) so I decided to locally benchmark the VM for random and sequential read and write. To do so I chose fio, which is a handy IO benchmarking tool for Linux and other platforms.

To find out what my array is actually capable of, I started benchmarking ZFS directly at the host system. This removes several layers of indirection, which could hide potential root causes for bad performance. I also started there to find out how different benchmark settings would affect my results.

I created a matrix of benchmark settings and IOPS/throughput results and started with request sizes of 4KiB, 64KiB and 1MiB at a queue depth of 1, 4, 8 and 16 with random read, random write, sequential read and sequential write patterns. At this point I kept my application profile simple since I was more interested in how read and write perform in general, again reducing the complexity of mixed workloads that could hide bottlenecks.

Results did tell me that there is negligible difference between queue depths, so I stuck with QD4 for all future tests. Second, read performance is crazy high, indicating that ZFS caches are doing what they are supposed to do. The test first creates a data block - which ZFS stores in ARC (aka. DRAM) or L2ARC (Intel Optane 900p) - and then reads the very same block from those caches. This is not a usual real-world scenario so I put more emphasis on write performance.

fio commands

During my benchmarks I used the following fio parameters. Adjust block size bs accordingly:

Pattern	Command
Random read	`fio --filename=test --sync=1 --rw=randread --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test`
Random write	`fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test`
Sequential read	`fio --filename=test --sync=1 --rw=read --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test`
Sequential write	`fio --filename=test --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test`
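
The combinations of pattern and block size from the table above can be generated with a small loop. This is only a sketch: the function name fio_matrix is made up, and the script prints the commands instead of running them. One caveat worth knowing is that fio's documentation states that raising iodepth above 1 does not affect synchronous ioengines, which may be one reason why the queue depths showed so little difference.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) one fio command per block size and pattern,
# using the same parameters as the table above.
fio_matrix() {
    local rw bs
    for bs in 4k 64k 1M; do
        for rw in randread randwrite read write; do
            echo "fio --filename=test --sync=1 --rw=${rw} --bs=${bs}" \
                 "--numjobs=1 --iodepth=4 --group_reporting --name=test" \
                 "--filesize=10G --runtime=300 && rm test"
        done
    done
}

fio_matrix
```

After reviewing the list, the output could be piped into a shell to run all twelve jobs one after another.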

Results from ZFS

Pattern	IOPS	MB/s
4K QD4 rnd read	47464	190
4K QD4 rnd write	10644	43
4K QD4 seq read	347210	1356
4K QD4 seq write	16020	64
64K QD4 rnd read	62773	3923
64K QD4 rnd write	5039	323
64K QD4 seq read	58514	3657
64K QD4 seq write	5497	352
1M QD4 rnd read	6872	6872
1M QD4 rnd write	645	661
1M QD4 seq read	2348	2348
1M QD4 seq write	664	680

Not so shabby! My system is able to do random writes up to 660MB/s on large request sizes and serve 10k IOPS on small request sizes. This certainly gets supported a lot by ZFS caches and the Optane card, but hey, that’s what they’re supposed to do. For a 3-disk system I’d call it a day since performance is much better than my success criteria even with default ZFS settings.

However, there still is the fact that performance within VMs is terrible, and with the results so far I pretty much ruled out ZFS as the root cause. So what could it be?

Results from VM

Measuring IO within the VM confirms my impression. There is a huge gap compared to the numbers I see at the host, ranging from 85x at 4K to 6x at 1M request sizes.

Pattern	IOPS	MB/s
4K QD4 rnd read	126	0.5
4K QD4 rnd write	124	0.5
4K QD4 seq read	28192	113
4K QD4 seq write	125	0.5
64K QD4 rnd read	9626	616
64K QD4 rnd write	126	8
64K QD4 seq read	17925	1120
64K QD4 seq write	126	8
1M QD4 rnd read	1087	1088
1M QD4 rnd write	94	97
1M QD4 seq read	1028	1028
1M QD4 seq write	96	99
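
As a cross-check, the size of the gap can be recomputed from the random write IOPS in the host and VM tables. The helper name slowdown is invented for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: how many times slower is the VM than the host?
# $1 = host IOPS, $2 = VM IOPS
slowdown() {
    LC_ALL=C awk -v host="$1" -v vm="$2" 'BEGIN { printf "%.1fx\n", host / vm }'
}

slowdown 10644 124   # 4K QD4 rnd write, host vs. VM
slowdown 645 94      # 1M QD4 rnd write, host vs. VM
```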

What the heck is going on here?

Working theories

ZFS

The following parameters help to adjust ZFS behavior to a specific system. The size of ARC should be defined based on spare DRAM; in my case about 16 out of 32GB RAM are assigned to VMs, so I chose to limit ZFS ARC to 12GB. Doing that requires a Linux kernel module option, which takes effect after reloading the module.

$ vim /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=12884901888
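
The value is given in bytes, so 12GB here means 12 x 1024^3. A minimal sketch for deriving the number and comparing it with what the loaded module currently uses could look like this. The helper name gib_to_bytes is made up, and the sysfs path is only read if it exists.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert GiB to bytes for module options.
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

arc_max=$(gib_to_bytes 12)
echo "options zfs zfs_arc_max=${arc_max}"

# Show the value the loaded module currently uses, if ZFS is loaded
param=/sys/module/zfs/parameters/zfs_arc_max
if [ -r "$param" ]; then
    echo "current: $(cat "$param")"
fi
```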

I assigned a quite speedy Intel Optane 900p card as ZIL and L2ARC. By default, ZFS limits the rate at which L2ARC is filled to a rather low 8MB/s (l2arc_write_max), presumably chosen with slower cache devices in mind. Since the Optane card is independent from my HDDs I set this to 1GB/s instead. Note that this can harm pool performance in case L2ARC is not using a dedicated device.

$ vim /etc/modprobe.d/zfs.conf
options zfs l2arc_write_max=1048576000

Further low-level tuning seems unnecessary until the VM comes close to the numbers seen at the host. So what can cause this? Looking at the architecture, data within VMs uses the following path:

HDDs <-> Cache <-> ZFS <-> Dataset <-> VM image <-> KVM <-> LVM <-> Encryption <-> VM file system

Dataset

Disk, ZFS and cache are ruled out, so let’s do a sanity check on my datasets. My VM images are stored on ZFS using datasets like storage/vm-100-disk-1 instead of storing them as files in the pool directly. This setup allows specifying some per-VM settings in ZFS, for example compression. One dataset property in particular made me curious:

$ zfs get all storage/vm-100-disk-1
storage/vm-100-disk-1  volsize               10G                    local
storage/vm-100-disk-1  volblocksize          8K                     -
storage/vm-100-disk-1  checksum              on                     default

The volblocksize property is relevant to align the dataset’s block size with the physical disks’ sector size. Since I’m using 4Kn disks, my sector size is 4K, not 8K - leading to a mismatch and potentially wasted storage access, for example when a 4K write has to update a whole 8K block.

$ cat /sys/block/sda/queue/hw_sector_size
4096
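
One way to picture the cost of such a mismatch is a deliberately simplified model in which every write has to update whole volume blocks. The function name write_amplification is made up, and the model ignores parity, compression, caching and metadata, so treat the numbers as an illustration rather than a measurement.

```shell
#!/usr/bin/env bash
# Hypothetical helper, simplified model: bytes touched per byte written when
# every write updates whole blocks. $1 = write size, $2 = block size (bytes)
write_amplification() {
    LC_ALL=C awk -v w="$1" -v b="$2" 'BEGIN {
        blocks = int((w + b - 1) / b)          # blocks touched, rounded up
        printf "%.1fx\n", blocks * b / w
    }'
}

write_amplification 4096 8192   # 4K write into 8K blocks
write_amplification 4096 4096   # 4K write into 4K blocks
```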

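I don’t know exactly why the dataset was created with an 8K volblocksize, but since I migrated some datasets around it’s possible that this was set when originally creating the dataset on SSD. SSDs tend to have 8K blocks. Setting this to an aligned value just makes sense in every way. Note that volblocksize can only be set when a volume is created, so this means creating a new volume (the name vm-100-disk-2 is just an example) and copying the data over:

$ zfs create -V 10G -o volblocksize=4K storage/vm-100-disk-2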

Compression

Next up is compression. It’s common sense that compression consumes some resources and ZFS is no exception here. It already uses a quite fast and efficient default (LZ4) and I benchmarked the performance impact of switching off compression to be around 10%. Choosing this setting is really not just about speed; depending on the data it can save a lot of space and money. Benchmarks create random data which is hard to compress. I decided to keep it enabled for all datasets since ZFS already figures out if the data it writes can be compressed or not. If raw performance matters more than space, however, it can be disabled:

$ zfs set compression=off storage/vm-100-disk-1

Sync

By default (sync=standard), ZFS writes synchronously only if the issuing application asks for it; everything else is written asynchronously. A synchronous write makes sure data is actually on non-volatile storage before the IO request is confirmed. In case even minimal “in-flight” data loss is unacceptable, one can use sync=always to force every write to be synchronous, at the expense of some throughput. I found the effect on write performance to be almost 20%, and since I have a UPS running I decided to go back to the default, which allows asynchronous writes. This of course will not save me from PSU or cable failures, but I take the chance.

$ zfs set sync=standard storage/vm-100-disk-1

atime

ZFS by default stores the last access time of files. For datasets with a RAW image inside, this does not make a lot of sense. Disabling it can save an extra write after any storage request.

$ zfs set atime=off storage/vm-100-disk-1

VM image

The RAW image of the VM is pretty much off the table since it’s just a bunch of blocks. I’d be careful with using qcow2 images on top of ZFS. ZFS already is a copy-on-write system and two levels of CoW don’t mix that well.

KVM

I manage my virtual machines using Proxmox and have chosen KVM as hypervisor. Since it’s emulating hardware, including mapping the RAW image to a configurable storage interface, there is a good chance for it to have a big impact. Based on some posts I chose virtio-scsi as storage device since I thought its discard feature helps with moving orphaned data out of ZFS. I also chose the writeback cache since its description sounded promising, without ever testing its impact. So I played around with some options and found that virtio-blk as device and none as cache leads to massive performance improvements! Just look at the benchmark results after this change:

Pattern	IOPS	MB/s
4K QD4 rnd read	19634	79
4K QD4 rnd write	3256	13
4K QD4 seq read	151791	607
4K QD4 seq write	2529	10
64K QD4 rnd read	7922	507
64K QD4 rnd write	909	58
64K QD4 seq read	18044	1128
64K QD4 seq write	1533	98
1M QD4 rnd read	657	673
1M QD4 rnd write	264	271
1M QD4 seq read	805	824
1M QD4 seq write	291	299

The iothread option had minor but still noticeable impact as well:

Pattern	IOPS	MB/s
4K QD4 rnd read	26240	105
4K QD4 rnd write	4011	16
4K QD4 seq read	158395	634
4K QD4 seq write	3067	12
64K QD4 rnd read	10422	667
64K QD4 rnd write	1495	96
64K QD4 seq read	9087	582
64K QD4 seq write	1557	100
1M QD4 rnd read	908	930
1M QD4 rnd write	254	261
1M QD4 seq read	1650	1650
1M QD4 seq write	303	311
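
On Proxmox, disk options like these can be set in the web interface or with the qm command line tool. The following sketch only prints the command. The helper name qm_disk_cmd is made up, VM ID 100 is inferred from the volume name, and the storage ID local-zfs is an assumption that has to be replaced with the storage name configured in Proxmox.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build (but do not run) the qm command that attaches a
# volume as virtio-blk disk without host cache and with a dedicated IO thread.
# $1 = VM ID, $2 = Proxmox volume ID (storage:volume)
qm_disk_cmd() {
    echo "qm set $1 --virtio0 $2,cache=none,iothread=1"
}

# local-zfs is an assumed storage ID, adjust to your setup
qm_disk_cmd 100 local-zfs:vm-100-disk-1
```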

Getting from 124 to 4011 random write IOPS at 4K is quite an impressive improvement already. Turns out that blindly tweaking ZFS/dataset properties can get you in trouble very easily. The biggest issue however was the KVM storage controller setting, which I suspect is a bug in the controller emulation of virtio-scsi.

File systems

Next in the stack would be the file system and volume manager of the virtual machine, which connects to the virtual storage device. I used Debian’s defaults of LVM and ext4 because defaults are always great, right? Wrong! Even though LVM is actually just a thin layer it turned out to have quite some effect. Testing with and without LVM has shown that using a plain old GPT or no partition table (if that’s an option) led to a 10% improvement. Looking at file systems, xfs and ext4 appear to be bad choices for my environment; switching to ext3 (or ext2) improved performance by another 30% in some cases!

Pattern	IOPS	MB/s
4K QD4 rnd read	30393	122
4K QD4 rnd write	4222	17
4K QD4 seq read	164456	658
4K QD4 seq write	3281	13
64K QD4 rnd read	9256	592
64K QD4 rnd write	1813	116
64K QD4 seq read	694	711
64K QD4 seq write	1877	120
1M QD4 rnd read	1207	1207
1M QD4 rnd write	385	395
1M QD4 seq read	1965	1966
1M QD4 seq write	419	430

Encryption

When enabling full-disk encryption (LUKS) for the virtual drive, performance dropped a lot again. Of course that’s expected to a certain degree but numbers went down below my acceptance criteria:

Pattern	IOPS	MB/s
4K QD4 rnd read	10530	42
4K QD4 rnd write	3637	15
4K QD4 seq read	52819	211
4K QD4 seq write	4216	17
64K QD4 rnd read	1710	109
64K QD4 rnd write	1178	75
64K QD4 seq read	3269	209
64K QD4 seq write	1217	78
1M QD4 rnd read	141	145
1M QD4 rnd write	94	97
1M QD4 seq read	155	159
1M QD4 seq write	94	96

There actually is a catch with encryption: the encryption layer tries to be as fast as possible and therefore encrypts blocks in parallel, which can mess up optimizations of writing blocks sequentially. I have not validated this in detail but in fact going single-core within the VM did show a 25% improvement on small request sizes. Anyway, I don’t want to sacrifice CPU cores, especially not when doing encryption all the time. Since encryption is not really storage related, I compared encryption speed on the host and on the VM. First the VM:

$ cryptsetup benchmark
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   207.6 MiB/s   243.0 MiB/s
 serpent-cbc   128b    82.0 MiB/s   310.6 MiB/s
 twofish-cbc   128b   168.7 MiB/s   192.0 MiB/s
     aes-cbc   256b   191.4 MiB/s   199.6 MiB/s
 serpent-cbc   256b    88.3 MiB/s   278.8 MiB/s
 twofish-cbc   256b   151.6 MiB/s   171.5 MiB/s
     aes-xts   256b   266.2 MiB/s   251.4 MiB/s
 serpent-xts   256b   286.3 MiB/s   285.9 MiB/s
 twofish-xts   256b   191.7 MiB/s   195.6 MiB/s
     aes-xts   512b   201.8 MiB/s   197.8 MiB/s
 serpent-xts   512b   276.3 MiB/s   261.3 MiB/s
 twofish-xts   512b   187.0 MiB/s   185.7 MiB/s

Quite consistent results, however looking at the host did reveal a different truth:

$ cryptsetup benchmark
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b  1036.2 MiB/s  3206.6 MiB/s
 serpent-cbc   128b    83.9 MiB/s   658.9 MiB/s
 twofish-cbc   128b   192.5 MiB/s   316.4 MiB/s
     aes-cbc   256b   767.6 MiB/s  2538.9 MiB/s
 serpent-cbc   256b    83.9 MiB/s   657.0 MiB/s
 twofish-cbc   256b   198.2 MiB/s   356.7 MiB/s
     aes-xts   256b  3152.5 MiB/s  3165.3 MiB/s
 serpent-xts   256b   612.8 MiB/s   541.7 MiB/s
 twofish-xts   256b   343.1 MiB/s   351.5 MiB/s
     aes-xts   512b  2361.9 MiB/s  2483.2 MiB/s
 serpent-xts   512b   632.8 MiB/s   622.9 MiB/s
 twofish-xts   512b   349.5 MiB/s   352.1 MiB/s

Numbers for AES based algorithms are through the roof on the host. The reason for this is a native AES implementation on recent Intel CPUs called AES-NI. Proxmox defaults the KVM “CPU model” to “kvm64”, which does not pass through AES-NI. Using the “host” CPU type exposes the host CPU’s features, including AES-NI, to the VM, which led to a huge boost again. Note that this might be a security risk on shared systems. In my case I’m in full control of the system so it does not matter. So let’s check the final results:

Pattern	IOPS	MB/s
4K QD4 rnd read	26449	106
4K QD4 rnd write	6308	25
4K QD4 seq read	158490	634
4K QD4 seq write	6387	26
64K QD4 rnd read	9092	582
64K QD4 rnd write	2317	148
64K QD4 seq read	17847	1116
64K QD4 seq write	2308	148
1M QD4 rnd read	454	466
1M QD4 rnd write	240	246
1M QD4 seq read	806	826
1M QD4 seq write	223	229
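
Whether AES-NI is visible inside a guest can be checked by looking for the aes flag in /proc/cpuinfo. This is a small sketch for x86 Linux systems; the function name has_aes_flag is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the CPU flags contain "aes".
# $1 = cpuinfo file to inspect (defaults to /proc/cpuinfo)
has_aes_flag() {
    grep -m1 '^flags' "${1:-/proc/cpuinfo}" 2>/dev/null | grep -qw aes
}

if has_aes_flag; then
    echo "AES-NI visible"
else
    echo "AES-NI not visible"
fi
```

Running this inside the VM before and after switching the CPU type should show the difference, and cryptsetup benchmark then confirms the effect on throughput.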

Finally my VM is reaching the goal of saturating a 1Gbps link. 150 - 250MB/s random write on 3 disks while using encryption and compression is pretty neat!
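
The comparison behind that statement is simple: a 1Gbps link carries at most 125MB/s, ignoring protocol overhead. A sketch of the check, using the 1M sequential write result from the final table as the number that matters for a large file transfer (the helper name link_mb_per_s is made up):

```shell
#!/usr/bin/env bash
# Hypothetical helper: upper bound of a link in MB/s (decimal), no protocol overhead.
# $1 = link speed in Mbit/s
link_mb_per_s() { echo $(( $1 / 8 )); }

link=$(link_mb_per_s 1000)   # 1Gbps
seq_write_1m=229             # 1M QD4 seq write from the final table, in MB/s

if [ "$seq_write_1m" -ge "$link" ]; then
    echo "write path (${seq_write_1m} MB/s) can saturate the link (${link} MB/s)"
else
    echo "write path (${seq_write_1m} MB/s) is below the link (${link} MB/s)"
fi
```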

Lessons learned

Always question and validate changes done to complex systems
Use virtio-blk, host CPU, iothread and no storage cache on KVM
Make sure dataset block size is aligned to hardware
Consider disabling compression and access time on datasets
Avoid using LVM within VMs, consider ext3 over ext4
