“7 For All Mankind” (7FAM) is a California-based brand that sells premium denim wear worldwide. Beyond minor issues with their online shop, they publicly expose sensitive information about their clients, like addresses, phone numbers and order details. The flaw was found by chance while using their online shop as a customer.

I will not take credit for any fancy research; anyone could have found this, starting with 7FAM's own testers, operators and auditors. The overall experience with this case makes it very clear to me that there are no business processes in place to prevent and handle incidents like this. The company appears to consist primarily of marketing, sales and customer service. I run a bug bounty program myself and don't expect response times like those of an IT company that deals with vulnerabilities on a daily basis. However, the actual response beyond “oh, that's unfortunate” was effectively zero. I don't blame people for doing their jobs, but I am very disappointed by the organization's failure to create any awareness.


7FAM was informed back in November 2017, and I offered to pinpoint and eventually stop the leak. Although I communicated over multiple channels, I never got any response indicating that someone was taking action or at least cared about the topic. Apparently protecting customer data and IT systems in general has no priority for 7FAM, and their security strategy is to kick the can down the road. Since the issue was reported half a year ago and is trivial to find, I decided to publish this information even though the issue is not fixed to date.

My hope is to get the attention of someone within the organization who will take this seriously, since bad guys have discovered, or will potentially discover, this flaw anyway. At the same time I feel sorry for those individuals whose data got exposed by a weak security concept and even worse “management” by 7FAM. On several occasions I mentioned that I would publish this issue six WEEKS after calling it in. Now, after six MONTHS, I have run out of patience.

Disclosure timeline

  • 2017-11-25 Discovered the flaw, created demo script
  • 2017-11-25 Attempted to establish contact (mail), no response
  • 2017-11-27 Attempted to establish contact (Twitter), no response
  • 2017-11-28 Attempted to establish contact (contact form), no response
  • 2017-11-28 Attempted to establish contact (website chat), no response
  • 2017-11-28 Attempted to establish contact (LinkedIn), no response
  • 2017-11-28 Attempted to establish contact (Facebook Messenger), no response
  • 2017-12-01 First response by 7FAM via Facebook Messenger, rather useless
  • 2017-12-11 More useless responses via Twitter
  • 2017-12-16 Telephone call with customer service
  • 2017-12-17 Several e-mails to an address given to me, no response
  • 2018-01 Many more e-mails with customer service; got responses, but no action was taken


When you order something at the 7FAM online shop, you create an account, set a password and get access to your customer area. When purchasing, you get a confirmation mail and later a receipt with a link to an invoice. Your order appears to be safely accessible through your customer area; however, this is not the whole story.

For whatever reason, 7FAM undermines their “customer area” and offers a PDF version of each invoice without authentication. I noticed this when clicking the “download PDF invoice” link in one of their mails and wondered: “Wait a minute, I logged out - why can I still access my invoice?”

Taking a deeper look made it clear that the “Download invoice” link does not contain any one-time token, or even a pseudo-random path which would make it harder to guess. In fact, every invoice gets an identifier and - yes, you guessed correctly - it is an incrementing number.

Now the fun part starts: it's not just my own invoice I can access, but every invoice of every online purchase made at their European online store since 2011. Ugh! Needless to say, there is no rate limit, expiration or other countermeasure to slow down access to invoices.

So I wrote a script and tried some combinations of identifiers. I ended up with a list of all valid countries and the range of invoices. In total, 7FAM publishes about 250,000 invoices.
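Since the IDs are just a static country prefix plus an incrementing counter, candidate URLs can be generated with a trivial loop. A minimal sketch of the idea - the host and path here are hypothetical placeholders, not the real endpoint:

```shell
# Generate candidate invoice URLs from sequential IDs.
# "shop.example" and the path are placeholders; the real URL is withheld.
for seq in $(seq -f '%07g' 12340 12345); do
  printf 'https://shop.example/invoice/70%s.pdf\n' "$seq"
done
```

Fetching each URL and checking the HTTP status code is then enough to map out which IDs exist, which is essentially what the demo script did.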

Playing with numbers

A typical 7FAM order ID looks like 700012345 - sorry Ruth, please complain to 7FAM.

The first part, 70, is static and identifies the country in which the purchase was made, for example:

  • 70 (Germany)
  • 60 (UK)
  • 11 (France)

The second part 0012345 is the unique and incrementing order identifier.

When downloading the invoice, a URL like this is called:


As a fun fact, HTTPS is used to secure data in transit, which is completely useless since there is no authentication. At least we can be sure that the invoice originates from the advertised source :-D

There are plenty of unprotected invoices out there, judging from the last valid invoice ID.

Country Amount
Germany 75,000
United Kingdom 63,000
Netherlands 32,000
Belgium 21,000
France 14,000
Italy 6,000
Ireland 1,500
Spain 1,000
Total 231,500+


Now, asking the obvious: what could someone do with this? Such invoices contain personal information like addresses, phone numbers, payment and shipping data, gender, body metrics and financial capabilities. Taking the typical demographic of 7FAM customers into account, this makes the leaked data very valuable for unsolicited advertising, data validation, data enrichment, identity theft, harassment and stalking. The flaw also provides insight into internal 7FAM sales metrics for the region.

One aspect is really ugly: stalking. There are a lot of sick fucks out there who would really like a list of wealthy women around them who wear size 27 skinny jeans. 7FAM just hands them this data, with a phone number on top. I'm quite sure 7FAM customers would agree that this is a major breach.

Despite my mentioning it multiple times during conversations with 7FAM, their “Internal IT Team” never contacted me. Interestingly enough, my offers to help and provide details were shut down with “IT is working on it” nonetheless. It's beyond my imagination what they were working on while apparently not knowing anything about the leak.

This whole process is a sad prime example of how NOT to handle security incidents. Such things will not just go away, and rejecting help without having any information at hand is rather pointless. Denial will not get you anywhere; it's all about having the right means, knowledge and awareness within your organization. Not just because GDPR tells you to…

A generic piece of advice on tuning

ZFS is a mature piece of software, engineered by file- and storage-system experts with lots of knowledge from practical experience. Sun invested a lot of money and built enterprise-grade appliances around it for a decade. Always keep this in mind when optimizing it; there is reason to trust that the defaults are chosen very reasonably, even if they do not appear obvious at first look. Mind that enterprise-class storage systems are primarily about safety, not about speed at all cost. If there were settings that were better in all regards, someone would already have figured that out and made them the default.

That being said, there is usually a chance to optimize software to fit a specific workload and environment. Doing so can save a lot of money compared to throwing more expensive hardware at a problem. This process is based on knowledge, context, testing methodology and goals. Always verify that a change has an actual, holistic, positive impact, and revert it if not. Optimizing complex systems requires a systematic approach, which is not pasting in every setting that has been suggested on the internet. It's very likely that random settings which worked for one specific case won't yield any improvement for another, but instead introduce problems or even data loss. The same applies to every suggestion in this article: replicating settings that worked well for me could be totally worthless for you.

Before actually changing anything, make sure you understand the underlying concepts, have read or listened to all relevant documentation and are in a position to second-guess suggestions made by others. Make sure you understand the context in which ZFS is operating and define plausible success criteria. It does not make sense to aim for 2000 IOPS at 4K blocks out of a single HDD, or to expect 1GB/s throughput on encrypted storage on a Raspberry Pi. It's also not useful to expect the same kind of performance for any given workload, since each configuration and optimization stands for itself. If you don't know how block storage works in general, or which parameters are relevant to measure and rate storage systems, then please gather that knowledge first. Only when you can say with confidence that you understand what you are doing, why you are doing it and what you roughly expect, and have found a proper testing methodology, should you attempt to “tune” a complex system such as ZFS.


I'm using a 3-way RAIDZ1 array with HGST HUH721010ALN600 disks (10TB, 7200rpm, 4Kn) and an Intel Optane 900p card as ZIL/L2ARC within an entry-level server (E3-1260L, 32GB, 2x1Gbps) running Debian Linux and Proxmox/KVM for virtualization. Virtual machines (currently 10) run headless Debian Linux and provide general-purpose residential services such as mail, file, web, VPN, authentication, monitoring etc. This article was written while running ZFS on Linux (“ZoL”) 0.7.6.

Current situation

Storage access within VMs is terribly slow and the host system shows high IOwait numbers. Especially with encrypted disks, throughput almost flat-lines when moving some data around.

Defining success criteria

My goal is to fully saturate one of the server's 1Gbps links with a 10GB file transfer from within a virtual machine doing full-disk encryption. I also want my VMs to be snappy and deliver their services without significant latency, even when other VMs are busy. The first is an objective goal regarding throughput which can easily be measured, the second a subjective one regarding latency.

Storage benchmark background

Benchmark parameters

Storage benchmarks have a few important variables:

  • Test file size, which should be large enough to get past cache sizes and represent real-world usage.
  • IO request size, usually between 4K and 1M, depending on the workload. Databases are more at the 4K side while moving large files is more at the 1M side.
  • Access pattern, random or sequential
  • Queue depth, which is the amount of IO commands that are issued by an application and queued within the controller at the same time. Depending on the drive, those commands can get executed in parallel (SSDs). Some queue saturation can be beneficial to improve performance, however too much parallelism can severely impact latency especially for HDDs.
  • Distribution of write and read access, based on application type. Web servers usually trigger 95% reads, databases typically 75% reads and 25% writes, and specific applications like log servers can even be 95% writes. This heavily influences how efficiently caches are used, for example.

Benchmark results

A very relevant value we get out of such a test is latency, which translates to IOPS, which translates to throughput. As a rough example, if an IO request takes 1ms (latency) and we apply a request size of 4KiB, we will get 4000KiB per second (about 4MiB/s) of throughput out of this device in a perfect scenario. 1ms of latency is already very low for an HDD, which is why HDDs suck at small request sizes.
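The latency → IOPS → throughput conversion from the example can be written down directly:

```shell
# Back-of-the-envelope: IOPS = 1s / latency, throughput = IOPS * request size.
latency_us=1000                          # 1 ms per IO request
request_kib=4                            # 4 KiB request size
iops=$((1000000 / latency_us))           # 1000 IOPS at best
throughput_kib=$((iops * request_kib))   # 4000 KiB/s, roughly 4 MiB/s
echo "${iops} IOPS -> ${throughput_kib} KiB/s"
```

Plugging in 64KiB instead of 4KiB yields the 64MB/s figure mentioned below.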

When running random access on spindle storage, throughput can drop even further, as the read/write heads need to reposition all the time. Solid-state storage does not have that mechanical impairment. If we crank up the request size to 64KB, we suddenly get 64MB/s out of the same drive. Latency is not always the same due to storage device characteristics, especially for random access. Therefore the latency percentile is more interesting than the average: a 99th percentile of 1ms means that 99% of all IO requests finished within 1ms, while 1% took longer. This gives an idea of the consistency of latency.


At some point, with lower latency or higher request sizes, we will hit a throughput limit defined by mechanical constraints, internal transfer to cache, or the external interface that handles transfer to the storage controller - usually 6 or 12Gb/s for SATA/SAS, 16 or 32Gb/s for PCIe. Even high-end HDDs are capped by their rotation speed, which affects both latency and throughput. Modern SSDs are usually capped by their external storage interface, or by thermal issues when doing sequential access. Random access is usually limited by memory cell technology (NAND, 3D-XPoint) or controller characteristics.

Storage layout decisions introduce limitations as well. When running 10 disks in mirror mode, they provide the same write performance as a single disk - actually, that of the slowest disk in the array. Of course drives should be matched in such arrays, but there are always variances, and drive performance tends to degrade over time. Running the same 10 disks as a stripe, we can expect almost 10x the performance of a single drive, assuming other components can handle it. A RAIDZ1 with three disks can in theory provide the same level of performance as a single drive. On top of checksums, ZFS calculates parity and stores it to a second drive. This means RAIDZ1 is quite CPU/memory hungry and occupies two disks for a single write request.
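As a rough illustration of these layout limits - the 150MB/s per-drive figure is an assumption for a typical 7200rpm HDD, ignoring caches and parity CPU cost:

```shell
drive_mbs=150   # assumed sequential write speed of one HDD (placeholder value)
disks=10
echo "10-disk mirror: ~${drive_mbs} MB/s, bound by the slowest disk"
echo "10-disk stripe: ~$((drive_mbs * disks)) MB/s, scales with disk count"
echo "3-disk RAIDZ1:  ~${drive_mbs} MB/s, each write also occupies a parity disk"
```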

File systems themselves have characteristics that impact performance. There are simple file systems like ext2 or FAT which just put a block on disk and read it back. Other systems are more advanced to avoid data loss, for example by keeping a journal or checksumming the data that got written. All those extra features require resources and can reduce file-system performance. Last but certainly not least, properties like sector sizes should be aligned between file system and physical hardware to avoid unnecessary operations like read-modify-write.


Caches are very helpful to speed things up; however, they are also a drawback when doing benchmarks and need to be taken into consideration. After all, we want results for the storage system, not for system RAM or other caches. Caches are there for a reason, so they should not be disabled for benchmarking; instead, real-world data and patterns need to be used for testing.

HDDs and NAND SSDs usually have a very quick but small internal cache of 128MB to 1GB. This is not just used for buffering but also for internal organization, especially in SSDs, which need to take care of wear leveling and compression a lot.
Some HBAs have additional caches themselves, which are much larger and support the storage array instead of individual drives.
For ZFS specifically, there is a whole range of caches (ZIL, ARC, L2ARC) independent of hardware, as ZFS expects to access drives directly with no “intelligent” controller in between. The way they work can be changed, but is optimized for most workloads already; however, their size can and should be matched to the system configuration.


First benchmarking

File transfers from and to the server are very unstable, bouncing between 20 and 60MB/s. Those values are not very helpful and include a lot of unnecessary moving parts (client computer, network…), so I decided to locally benchmark the VM for random and sequential reads and writes. To do so I chose fio, which is a handy IO benchmarking tool for Linux and other platforms.

To find out what my array is actually capable of, I started benchmarking ZFS directly on the host system. This removes several layers of indirection which could hide potential root causes of bad performance. I also started there to find out how different benchmark settings would affect my results.

I created a matrix of benchmark settings and IOPS/throughput results, starting with request sizes of 4KiB, 64KiB and 1MiB at queue depths of 1, 4, 8 and 16, using random read, random write, sequential read and sequential write patterns. At this point I kept my application profile simple, since I was more interested in how reads and writes perform in general - again reducing the complexity of mixed workloads that could hide bottlenecks.

The results told me that there is negligible difference between queue depths, so I stuck with QD4 for all future tests. Second, read performance is crazy high, indicating that the ZFS caches are doing what they are supposed to do. The test first creates a data block - which ZFS stores in ARC (aka DRAM) or L2ARC (Intel Optane 900p) - and then reads the very same block from those caches. This is not a usual real-world scenario, so I put more emphasis on write performance.

fio commands

During my benchmarks I used the following fio parameters. Adjust the block size bs accordingly:

Pattern Command
Random read fio --filename=test --sync=1 --rw=randread --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
Random write fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
Sequential read fio --filename=test --sync=1 --rw=read --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
Sequential write fio --filename=test --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
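The commands above test pure read or pure write patterns. fio can also approximate the mixed application profiles mentioned earlier via its randrw mode; a sketch for a database-style 75/25 read/write split (a variant I did not use in my runs):

```shell
fio --filename=test --sync=1 --rw=randrw --rwmixread=75 --bs=4k --numjobs=1 \
    --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
```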

Results from ZFS

Pattern IOPS MB/s
4K QD4 rnd read 47464 190
4K QD4 rnd write 10644 43
4K QD4 seq read 347210 1356
4K QD4 seq write 16020 64
64K QD4 rnd read 62773 3923
64K QD4 rnd write 5039 323
64K QD4 seq read 58514 3657
64K QD4 seq write 5497 352
1M QD4 rnd read 6872 6872
1M QD4 rnd write 645 661
1M QD4 seq read 2348 2348
1M QD4 seq write 664 680

Not too shabby! My system is able to do random writes at up to 660MB/s for large request sizes and serve 10k IOPS at small request sizes. This is certainly helped a lot by the ZFS caches and the Optane card, but hey, that's what they're supposed to do. For a 3-disk system I'd call it a day, since performance is much better than my success criteria even with default ZFS settings.

However, there still is the fact that performance within VMs is terrible, and with the results so far I have pretty much ruled out ZFS as the root cause. So what could it be?

Results from VM

Measuring IO within the VM confirms my impression. There is a huge gap compared to the numbers I see on the host, ranging from 85x at 4K to 6x at 1M request sizes.

Pattern IOPS MB/s
4K QD4 rnd read 126 0,5
4K QD4 rnd write 124 0,5
4K QD4 seq read 28192 113
4K QD4 seq write 125 0,5
64K QD4 rnd read 9626 616
64K QD4 rnd write 126 8
64K QD4 seq read 17925 1120
64K QD4 seq write 126 8
1M QD4 rnd read 1087 1088
1M QD4 rnd write 94 97
1M QD4 seq read 1028 1028
1M QD4 seq write 96 99

What the heck is going on here?

Working theories


The following parameters help to adjust ZFS behavior to a specific system. The size of the ARC should be defined based on spare DRAM; in my case about 16 of 32GB RAM are assigned to VMs, so I chose to limit the ZFS ARC to 12GB. Doing that requires a Linux kernel module option, which becomes effective after reloading the module.

$ vim /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=12884901888
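As a quick sanity check, the value above is exactly 12 GiB expressed in bytes:

```shell
# 12 GiB in bytes, matching the zfs_arc_max value above
echo $((12 * 1024 * 1024 * 1024))
```

After reloading the module, the active value can be read back from /sys/module/zfs/parameters/zfs_arc_max.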

I assigned a quite speedy Intel Optane 900p card as ZIL and L2ARC. By default the L2ARC would be stored in the pool, which explains the rather low default throughput limit of 8MB/s for it. Since the Optane card is independent of my HDDs, I set this to 1GB/s instead. Note that this can harm pool performance in case the L2ARC is not using dedicated memory.

$ vim /etc/modprobe.d/zfs.conf
options zfs l2arc_write_max=1048576000

Further low-level tuning seems unnecessary until the VM comes close to the numbers seen on the host. So what can cause this? Looking at the architecture, data within VMs takes the following path:

HDDs <-> Cache <-> ZFS <-> Dataset <-> VM image <-> KVM <-> LVM <-> Encryption <-> VM file system


Disks, ZFS and caches are ruled out, so let's do a sanity check on my datasets. My VM images are stored on ZFS using datasets like storage/vm-100-disk-1 instead of being stored as files directly in the pool. This setup allows specifying some per-VM settings in ZFS, for example compression. One dataset property in particular made me curious:

$ zfs get all storage/vm-100-disk-1
storage/vm-100-disk-1 volsize 10G local
storage/vm-100-disk-1 volblocksize 8K -
storage/vm-100-disk-1 checksum on default

The volblocksize property is relevant for aligning the dataset's block size with the physical disk's sector size. Since I'm using 4Kn disks, my sector size is 4K, not 8K - leading to a misalignment and potentially wasted storage accesses.

$ cat /sys/block/sda/queue/hw_sector_size

I don't know exactly why the dataset was created with an 8K volblocksize, but since I migrated some datasets around, it's possible this was set when the dataset was originally created on an SSD; SSDs tend to have 8K blocks. Setting this to an aligned value just makes sense in every way:

$ zfs set volblocksize=4K storage/vm-100-disk-1


Next up is compression. It's common sense that compression consumes some resources, and ZFS is no exception here. It already uses a quite fast and efficient default (LZ4), and I benchmarked the performance impact of switching compression off at around 10%. Choosing this setting is really not just about speed; depending on the data, it can help save a lot of space and money. Benchmarks create random data, which is hard to compress. I decided to keep it enabled for all datasets, since ZFS already figures out whether the data it writes can be compressed or not. However, for maximum performance it can be disabled:

$ zfs set compression=off storage/vm-100-disk-1


ZFS can make every write request synchronous instead of asynchronous if the issuing application chooses so. Synchronous writes make sure data is actually written to non-volatile memory before the IO request is confirmed. In case even minimal “in-flight” data loss is unacceptable, one can use sync=always at the expense of some throughput. I found the effect on write performance to be almost 20%, and since I have a UPS running, I decided to stay with the default, which allows asynchronous writes. This of course will not save me from PSU or cable failures, but I'll take the chance.

$ zfs set sync=standard storage/vm-100-disk-1


By default, ZFS stores the last access time of files. For datasets with a RAW image inside, this does not make a lot of sense. Disabling it saves an extra write after every storage request.

$ zfs set atime=off storage/vm-100-disk-1

VM image

The RAW image of the VM is pretty much off the table, since it's just a bunch of blocks. I'd be careful with using qcow2 images on top of ZFS, though: ZFS is already a copy-on-write system, and two levels of CoW don't mix that well.


I manage my virtual machines using Proxmox and have chosen KVM as hypervisor. Since it emulates hardware, including mapping the RAW image to a configurable storage interface, there is a good chance of it having a big impact. Based on some posts, I had chosen virtio-scsi as the storage device, since I thought its discard feature would help move orphaned data out of ZFS. I had also chosen the writeback cache, since its description sounded promising, without ever testing its impact. So I played around with some options and found that virtio-block as the device and none as the cache leads to massive performance improvements! Just look at the benchmark results after this change:

Pattern IOPS MB/s
4K QD4 rnd read 19634 79
4K QD4 rnd write 3256 13
4K QD4 seq read 151791 607
4K QD4 seq write 2529 10
64K QD4 rnd read 7922 507
64K QD4 rnd write 909 58
64K QD4 seq read 18044 1128
64K QD4 seq write 1533 98
1M QD4 rnd read 657 673
1M QD4 rnd write 264 271
1M QD4 seq read 805 824
1M QD4 seq write 291 299

The iothread option had a minor but still noticeable impact as well:

Pattern IOPS MB/s
4K QD4 rnd read 26240 105
4K QD4 rnd write 4011 16
4K QD4 seq read 158395 634
4K QD4 seq write 3067 12
64K QD4 rnd read 10422 667
64K QD4 rnd write 1495 96
64K QD4 seq read 9087 582
64K QD4 seq write 1557 100
1M QD4 rnd read 908 930
1M QD4 rnd write 254 261
1M QD4 seq read 1650 1650
1M QD4 seq write 303 311

Getting from 124 to 4011 random write IOPS at 4K is quite an impressive improvement already. It turns out that blindly tweaking ZFS/dataset properties can get you in trouble very easily. The biggest issue, however, was the KVM storage controller setting, which I believe has to be a bug in the controller emulation of virtio-scsi.
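For reference, these hypervisor-side settings can also be applied from the Proxmox CLI; a hedged sketch, assuming VM ID 100 and a storage named local-zfs (both placeholders - check the qm man page of your Proxmox version):

```shell
# attach the disk as virtio-block with no host-side cache and a dedicated IO thread
qm set 100 --virtio0 local-zfs:vm-100-disk-1,cache=none,iothread=1
```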

File systems

Next in the stack are the file system and volume manager of the virtual machine, which connect to the virtual storage device. I used Debian's defaults of LVM and ext4, because defaults are always great, right? Wrong! Even though LVM is actually just a thin layer, it turned out to have quite some effect. Testing with and without LVM showed that using a plain old GPT, or no partition table at all (if that's an option), led to a 10% improvement. Looking at file systems, xfs and ext4 appear to be bad choices for my environment; switching to ext3 (or ext2) improved performance by another 30% in some cases!

Pattern IOPS MB/s
4K QD4 rnd read 30393 122
4K QD4 rnd write 4222 17
4K QD4 seq read 164456 658
4K QD4 seq write 3281 13
64K QD4 rnd read 9256 592
64K QD4 rnd write 1813 116
64K QD4 seq read 694 711
64K QD4 seq write 1877 120
1M QD4 rnd read 1207 1207
1M QD4 rnd write 385 395
1M QD4 seq read 1965 1966
1M QD4 seq write 419 430


When enabling full-disk encryption (LUKS) for the virtual drive, performance dropped a lot again. Of course that's expected to a certain degree, but the numbers went down below my acceptance criteria:

Pattern IOPS MB/s
4K QD4 rnd read 10530 42
4K QD4 rnd write 3637 15
4K QD4 seq read 52819 211
4K QD4 seq write 4216 17
64K QD4 rnd read 1710 109
64K QD4 rnd write 1178 75
64K QD4 seq read 3269 209
64K QD4 seq write 1217 78
1M QD4 rnd read 141 145
1M QD4 rnd write 94 97
1M QD4 seq read 155 159
1M QD4 seq write 94 96

There actually is a catch with encryption: the encryption layer tries to be as fast as possible and therefore encrypts blocks in parallel, which can mess up optimizations for writing blocks sequentially. I have not validated this in detail, but going single-core within the VM did in fact show a 25% improvement at small request sizes. Anyway, I don't want to sacrifice CPU cores, especially not when doing encryption all the time. Since encryption is not really storage related, I compared encryption speed on the host and in the VM:

$ cryptsetup benchmark
# Algorithm | Key | Encryption | Decryption
aes-cbc 128b 207.6 MiB/s 243.0 MiB/s
serpent-cbc 128b 82.0 MiB/s 310.6 MiB/s
twofish-cbc 128b 168.7 MiB/s 192.0 MiB/s
aes-cbc 256b 191.4 MiB/s 199.6 MiB/s
serpent-cbc 256b 88.3 MiB/s 278.8 MiB/s
twofish-cbc 256b 151.6 MiB/s 171.5 MiB/s
aes-xts 256b 266.2 MiB/s 251.4 MiB/s
serpent-xts 256b 286.3 MiB/s 285.9 MiB/s
twofish-xts 256b 191.7 MiB/s 195.6 MiB/s
aes-xts 512b 201.8 MiB/s 197.8 MiB/s
serpent-xts 512b 276.3 MiB/s 261.3 MiB/s
twofish-xts 512b 187.0 MiB/s 185.7 MiB/s

Quite consistent results; however, looking at the host revealed a different truth:

$ cryptsetup benchmark
# Algorithm | Key | Encryption | Decryption
aes-cbc 128b 1036.2 MiB/s 3206.6 MiB/s
serpent-cbc 128b 83.9 MiB/s 658.9 MiB/s
twofish-cbc 128b 192.5 MiB/s 316.4 MiB/s
aes-cbc 256b 767.6 MiB/s 2538.9 MiB/s
serpent-cbc 256b 83.9 MiB/s 657.0 MiB/s
twofish-cbc 256b 198.2 MiB/s 356.7 MiB/s
aes-xts 256b 3152.5 MiB/s 3165.3 MiB/s
serpent-xts 256b 612.8 MiB/s 541.7 MiB/s
twofish-xts 256b 343.1 MiB/s 351.5 MiB/s
aes-xts 512b 2361.9 MiB/s 2483.2 MiB/s
serpent-xts 512b 632.8 MiB/s 622.9 MiB/s
twofish-xts 512b 349.5 MiB/s 352.1 MiB/s

Numbers for AES-based algorithms are through the roof on the host. The reason is the native AES implementation in recent Intel CPUs, called AES-NI. Proxmox defaults the KVM “CPU model” to “kvm64”, which does not pass through AES-NI. Using the host CPU type exposes the CPU directly to the VM, which led to a huge boost again. Note that this might be a security risk on shared systems; in my case I'm in full control of the system, so it does not matter. So let's check the final results:

Pattern IOPS MB/s
4K QD4 rnd read 26449 106
4K QD4 rnd write 6308 25
4K QD4 seq read 158490 634
4K QD4 seq write 6387 26
64K QD4 rnd read 9092 582
64K QD4 rnd write 2317 148
64K QD4 seq read 17847 1116
64K QD4 seq write 2308 148
1M QD4 rnd read 454 466
1M QD4 rnd write 240 246
1M QD4 seq read 806 826
1M QD4 seq write 223 229

Finally my VM is reaching the goal of saturating a 1Gbps link. 150-250MB/s random write on 3 disks while using encryption and compression is pretty neat!
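For completeness, the CPU model change boils down to a single Proxmox command, and AES-NI visibility can then be verified from inside the guest (VM ID 100 is a placeholder):

```shell
qm set 100 --cpu host            # on the host: pass the host CPU model, including AES-NI, to the VM
grep -m1 -o aes /proc/cpuinfo    # inside the guest: prints "aes" once the flag is exposed
```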

Lessons learned

  1. Always question and validate changes done to complex systems
  2. Use virtio-blk, host CPU, iothread and no storage cache on KVM
  3. Make sure dataset block size is aligned to hardware
  4. Consider disabling compression, and access time on datasets
  5. Avoid using LVM within VMs, consider ext3 over ext4

I recently started to replace the HDD storage in my home server, since my three WD RED 4TB drives were getting rather old and I needed more space. After lots of experimenting I ended up with ZFS, three new HGST 10TB drives and a shiny Optane 900p. Here is my story so far.

What is ZFS?

There are many videos, articles and other pieces of documentation out there describing in detail what ZFS is, so let's make this brief. ZFS is a copy-on-write file system created by Sun Microsystems for Solaris and available under an open-source (-ish) license for Linux and other operating systems. It combines the abilities of a volume manager (like LVM) with a file system (like ext4). Compared to most other file systems, it natively handles multi-device setups by creating all kinds of stripes, mirrors and parity-based constructs for data redundancy. Unlike almost any other file system (yes, I know, BTRFS…) it prioritizes data consistency, self-healing capabilities and error prevention, and has a proven track record in the enterprise storage industry.

ZFS works best when disks are exposed to it directly, favouring a “JBOD” configuration over RAID controllers. It is strictly NOT “software RAID / ghetto RAID”; in fact it offers features no other file system or hardware RAID controller can offer. Let's face it: RAID controllers are just expensive, optimized computers with crappy, often incompatible firmware and a bunch of SATA/SAS connectors. Since I evaluated multiple solutions (Linux MD, an LSI 9260-8i hardware controller, BTRFS and ZFS), I dare to have an opinion on the topic. The only thing ZFS does not have is a battery backup unit (“BBU”); however, the risk of losing any data during a power outage is extremely low, and data corruption cannot happen with ZFS. An external UPS is a lot cheaper than an entry-level RAID controller with BBU. This only leaves PSU failures, cable errors and software bugs as risks.

As usual there are concessions to make - for ZFS that means higher resource usage (and consequently, potentially lower performance) compared to file systems that care less about data integrity. It has to go many extra miles to make sure data is not just received from disks, but is actually correct, intact and unmodified, and gets repaired in case it's corrupted. This, by the way, means using ECC RAM is a very good idea, as faulty data in RAM would lead to “incorrectly repaired” (aka corrupted) data. Optional features like compression, de-duplication and encryption take an extra toll. ZFS has intelligent caches which are quite memory hungry and can easily use 16GB of available RAM even on small systems. That being said, unused RAM is wasted RAM, and it's important to understand what ZFS is using it for. To offload some of this resource usage, ZFS allows a second level of caching to be written to non-volatile memory, called the L2ARC (“level 2 adaptive replacement cache”), which acts similarly to a “read cache”. Then there is a mechanism called the ZIL (“ZFS intent log”), which is similar to a “write cache”: it collects and streamlines write operations, and ZFS then flushes them to disk every couple of seconds.

Performance of ZFS can be greatly enhanced by placing the ZIL on a SLOG (“separate log device”) and offloading the L2ARC to high-speed, low-latency storage. Since DRAM is volatile it’s not an option, except for some super expensive battery- or capacitor-buffered DRAM devices. SSDs are a lot more affordable, non-volatile by nature and really fast compared to hard drives; compared to DRAM, however, they are orders of magnitude slower. Just recently a new technology has been released that claims to fit between DRAM and traditional SSDs, and should therefore be an obvious choice for ZIL and L2ARC: Intel Optane.

What is Optane?

  • It’s a product range based on 3D-XPoint memory
  • It’s built for very specific use-cases like caching
  • It’s cheaper than DRAM but more expensive than typical SSDs
  • It uses proprietary memory tech from Intel and Micron
  • It’s NOT a typical SSD, since it’s not based on NAND flash
  • It’s NOT DRAM, since it’s non-volatile

3D-XPoint (“3D cross-point”) memory technology was announced years ago, and the first products, called “Optane”, hit the market in early 2017. The first release was a datacenter-grade product called “Optane SSD DC P4800X”, available in 375GB and 750GB capacities as U.2 drive and PCIe card. Roughly at the same time, much more consumer-oriented “Optane Memory” M.2 cards became available in 16GB and 32GB configurations. In late 2017 Intel released the “Optane SSD 900p” with capacities of 280GB and 480GB, again as PCIe card and U.2 drive.

While all Optane products are based on 3D-XPoint memory, their scope and performance vary a lot. The small “Optane Memory” M.2 cards are meant to serve as system cache/accelerator for HDD-based desktop and mobile computers, while the P4800X and 900p target servers and enthusiast desktops. The latter two use much more power but also deliver significantly better performance, as they pack more 3D-XPoint modules and speedier controllers. The P4800X is Intel’s top-of-the-line offering and comes with more integrity checks, a capacitor-backed buffer to avoid data loss, and better durability. Performance-wise it’s rather close to the 900p, and both share stunning specs:

  • 2500MB/s read, 2000MB/s write
  • 500,000 IOPS read and write
  • 10µs latency for read and write
  • 5PBW endurance, 1.6M hours MTBF
  • 1 uncorrectable sector per 10^17 bits read

Intel claims that those cards require a 7th-generation Intel Core CPU, which is only half the truth. In fact the drives speak the NVMe protocol and can be used as a regular block device with any current CPU and platform. Only Intel’s software for automated caching enforces a 7th-generation Intel Core CPU, which appears to be a sales-oriented decision. Anyway, in my use-case the 900p meets a 5th-generation Xeon E3 CPU on a C232 chipset, and it just works fine.

Now, what’s the fuss about? Why is Optane spectacular? Looking at typical benchmarks, Optane-based products deliver okay-ish performance compared to NAND-based NVMe SSDs like a Samsung 960 Pro, but come at a steep price premium. SSD benchmarks usually assume large block sizes (>=1M) and high queue depths (>=16). These values do not represent typical server workloads; in fact I dare to claim they represent almost no relevant workload and are made up by vendors to present large numbers. NAND-based SSDs are great at producing high throughput when reading large quantities off many NAND chips in parallel (sequential access), and this is a good thing. However, the fun starts at small block sizes (e.g. 4K) and low queue depths (e.g. 2 or 4), as often seen in server workloads like databases. Consumer-grade NAND SSDs are usually also terrible at random write performance. Intel claims Optane can fix that.

Benchmarking the beast

Disclaimer: I have not received any freebies or been in contact with any of the brands mentioned here; all hardware was bought with my own money. I understand these benchmarks are not comprehensive, and I admit the SM951 has been in use for some years, so it might not produce perfect results anymore. The system was also running some load during the benchmarks and potentially lacks optimization. While my results might not be scientifically perfect, they represent a real-world configuration.

Let’s have a look at a Samsung SM951 running in the same system as an Intel Optane SSD 900p, both connected via PCIe x4:

1M blocksize, QD16, random read
$ fio --name=test1M --filename=test1M --size=10000M --direct=1 --bs=1M --ioengine=libaio --iodepth=16 --rw=randread --numjobs=2 --group_reporting --runtime=5
* 900p: 2563 IOPS, 2536 MB/s, 1247 usec avg. latency
* SM951: 2005 IOPS, 2005 MB/s, 1594 usec avg. latency

So far so good: both products are almost toe to toe, while the 900p delivers a bit more performance, justifying its higher price point. Note that both products appear to be maxed out regarding bandwidth. Now, let’s write some data.

1M blocksize, QD16, random write
$ fio --name=test1M --filename=test1M --size=10000M --direct=1 --bs=1M --ioengine=libaio --iodepth=16 --rw=randwrite --numjobs=2 --group_reporting --runtime=5
* 900p: 2152 IOPS, 2152 MB/s, 1485 usec avg. latency
* SM951: 399 IOPS, 409 MB/s, 7981 usec avg. latency

Things start to become interesting: the 900p suddenly pulls away with 5x higher IOPS while still being maxed out on bandwidth. Write-intense workloads are obviously an issue for consumer NAND SSDs.

As said before, 1M block sizes and a queue depth of 16 are unusual for server workloads, so let’s lower the block size to 4K:

4K blocksize, QD16, random read
$ fio --name=test4k --filename=test4k --size=10000M --direct=1 --bs=4k --ioengine=libaio --iodepth=16 --rw=randread --randrepeat=1 --rwmixread=75
* 900p: 310227 IOPS, 1211 MB/s, 51 usec avg. latency
* SM951: 177432 IOPS, 710 MB/s, 90 usec avg. latency

Again, the SM951 does a good job at reading, however the gap becomes a lot bigger: the 900p now delivers about 75% higher IOPS. Let’s write some data…

4K blocksize, QD16, random write
$ fio --name=test4k --filename=test4k --size=10000M --direct=1 --bs=4k --ioengine=libaio --iodepth=16 --rw=randwrite --randrepeat=1 --rwmixread=75
* 900p: 188632 IOPS, 755 MB/s, 84 usec avg. latency
* SM951: 22012 IOPS, 88 MB/s, 712 usec avg. latency

While 22k IOPS from the SM951 are still very respectable, the 900p again obliterates it, now delivering about 9x higher performance.
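
As a quick sanity check of these numbers (plain arithmetic, nothing vendor-specific): bandwidth should equal IOPS times block size, and the 4K figures above line up once fio’s “MB/s” is read as binary MiB/s:

```shell
# 900p, 4K random read from above: 310227 IOPS at 4 KiB blocks
iops=310227
bs=4096
echo "$(( iops * bs / 1024 / 1024 )) MiB/s"   # prints: 1211 MiB/s
```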


With those numbers crunched: NAND-based SSDs remain great products, just not for every workload and use-case. For the workloads above, 3D-XPoint clearly sets a new standard, sitting somewhere between DRAM and NAND.

Back to specs: the 900p’s endurance is rated at 5PBW (five petabytes written), compared to 400TBW (four hundred terabytes written) for the SM951. The datacenter-focused P4800X is even rated at 20PBW. To be fair on specs, the 900p also uses a lot more power (5W idle, 14W load) compared to 40mW idle and 5W load for the Samsung and other NAND SSDs.
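
To put those endurance ratings into perspective, a back-of-the-envelope calculation: how long each drive could sustain non-stop writes before exhausting its rating. This uses the 900p’s 2000MB/s spec write speed and the SM951’s 409MB/s measured above, with decimal units throughout, a deliberate simplification:

```shell
# Days of continuous writing to exhaust the rated endurance
# 900p: 5 PBW at 2000 MB/s spec write speed
# SM951: 400 TBW at the 409 MB/s measured earlier
optane_days=$(( 5000000000000000 / 2000000000 / 86400 ))
sm951_days=$((  400000000000000 /  409000000 / 86400 ))
echo "900p: ${optane_days} days, SM951: ${sm951_days} days"   # prints: 900p: 28 days, SM951: 11 days
```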

Both the latency advantage and the higher durability make 3D-XPoint-based products very interesting for enterprise workloads and caching. Therefore I decided to get a 900p and use it as a cache device for my home server. Before doing so yourself, consider that Optane is a first-generation product; improved cards are likely around the corner.

Upgrading my home server

The server runs a bunch of KVM virtual machines managed by Proxmox and sports an E3-1260L CPU, 32GB of DDR4 ECC memory and a P10S-I board.

Spinning up ZFS

Creating the primary storage pool is quite straightforward:
$ zpool create -O compression=lz4 -O normalization=formD -o ashift=12 storage raidz1 ata-HGST_HUH721010ALN600_1SJ5HXXX ata-HGST_HUH721010ALN600_1SJ5JXXX ata-HGST_HUH721010ALN600_1SJ6KXXX


  • compression=lz4 enables LZ4 compression; ZFS figures out by itself whether a block is actually compressible
  • normalization=formD stores file names as normalized UTF-8 (form D)
  • ashift=12 uses native 4K blocks, which my drives feature
  • raidz1 organizes the provided drives the way traditional RAID5 does, storing parity as redundancy to allow recovering from one failed drive


ZFS is quite reasonably configured by default; however, there are a few useful knobs to adjust to both workload and hardware. Always verify that a change has a positive impact, and keep adjusting: there is no perfect universal config, otherwise it would be the default anyway. I’ll write a separate post about file-system tuning in a broader scope.
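
One commonly adjusted knob on ZFS on Linux is the ARC size cap, set via a module parameter; a sketch (16GiB is an example value, pick one that fits your RAM and workload):

```shell
# /etc/modprobe.d/zfs.conf -- cap the ARC at 16 GiB (example value)
# 17179869184 = 16 * 2^30
options zfs zfs_arc_max=17179869184
```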

Adding Optane ZIL/L2ARC

To use the Optane 900p as caching device, I created a GPT partition table with a 10GB ZIL (“log”) and a 120GB L2ARC (“cache”) partition. Adding them to the pool is easy:

$ zpool add storage log nvme-INTEL_SSDPED1D280GA_PHMXXX2301DU280CGN-part1
$ zpool add storage cache nvme-INTEL_SSDPED1D280GA_PHMXXX2301DU280CGN-part2
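
For reference, a partition layout like the one used above can be created with sgdisk. The device path is an assumption; double-check it against /dev/disk/by-id/ before running anything destructive:

```shell
# ASSUMED device path -- verify with "ls -l /dev/disk/by-id/" first!
DEV=/dev/nvme0n1
sgdisk --new=1:0:+10G  --change-name=1:zil   "$DEV"   # 10GB log (SLOG/ZIL)
sgdisk --new=2:0:+120G --change-name=2:l2arc "$DEV"   # 120GB cache (L2ARC)
```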

Now my pool looks like this:

$ zpool status -v
  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 20h37m with 0 errors on Sun Feb 11 21:01:09 2018
config:

        NAME                                                 STATE  READ WRITE CKSUM
        storage                                              ONLINE     0     0     0
          raidz1-0                                           ONLINE     0     0     0
            ata-HGST_HUH721010ALN600_1SJ5HXXX                ONLINE     0     0     0
            ata-HGST_HUH721010ALN600_1SJ5JXXX                ONLINE     0     0     0
            ata-HGST_HUH721010ALN600_1SJ6KXXX                ONLINE     0     0     0
        logs
          nvme-INTEL_SSDPED1D280GA_PHMXXX2301DU280CGN-part1  ONLINE     0     0     0
        cache
          nvme-INTEL_SSDPED1D280GA_PHMXXX2301DU280CGN-part2  ONLINE     0     0     0

errors: No known data errors

Migrating images

I was previously using the “qcow2” disk format on ext4, which is now redundant since ZFS is itself a copy-on-write system. Those images can easily be converted to RAW and written with dd to a ZFS volume (“zvol”). The target zvol has to exist first; its size must at least match the image’s virtual size (32G is just an example here):

$ zfs create -s -V 32G storage/vm-100-disk-1
$ qemu-img convert -f qcow2 -O raw vm-100-disk-1.qcow2 vm-100-disk-1.raw
$ dd if=vm-100-disk-1.raw of=/dev/zvol/storage/vm-100-disk-1 bs=1M

ZFS supports sparse volumes, which only grow when their space is actually used. Since zeros are highly compressible, writing and then deleting a large “zero file” within a VM can actually free up ZFS storage. After moving to RAW images, run the following within the VM:

$ dd if=/dev/zero of=zerofile bs=1M
$ rm zerofile

Swapping to Optane

Since I’m running virtual machines, there is another thing that should go to low-latency storage: swap. I try to conserve as much memory as possible, which means VMs sometimes use their swap space, and that gets horribly slow when it resides on spinning disks. For that reason I created another partition, set up a separate ZFS pool and created disk images that hold the VMs’ swap data.

Creating the new pool is very simple; since I don’t need redundancy for swap, it consists of just one “device”, actually a partition. Using unique hardware identifiers instead of device paths (e.g. “/dev/nvme0n1p3”) is quite helpful, as PCIe enumeration and partition order may change.

$ zpool create -O normalization=formD -O sync=always swaps INTEL_SSDPED1D280GA_PHMXXX2301DU280CGN-part4

Now new virtual disks are created on this ZFS pool and get attached to their virtual machine.

$ zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
swaps                 33.1M  96.8G    24K  /swaps
swaps/vm-100-disk-1    30K  96.8G    30K   -
swaps/vm-101-disk-1  1.02M  96.8G  1.02M   -

Replacing the old swap and re-claiming that space for the root partition is easy if the VMs are using LVM. /dev/sdb is the new virtual device available to the VM, stored on the ZFS “swaps” pool on Optane.

Add the new swap space to LVM:

$ pvcreate /dev/sdb
$ vgcreate vm-optane /dev/sdb
$ lvcreate -l 100%FREE -n swap vm-optane

Create the swap file system and use the UUID as device identifier in /etc/fstab:

$ mkswap /dev/vm-optane/swap 
$ vim /etc/fstab

Disable and remove the old swap partition:

$ swapoff /dev/vm-system/swap 
$ lvremove /dev/vm-system/swap

Extend the root partition and file system to use the freed-up space:

$ lvextend -l +100%FREE /dev/vm-system/root
$ resize2fs /dev/vm-system/root

…and reboot the VM, just to be sure the file system is undamaged.
