Here’s a second post focusing on the performance of MySQL on ZFS in cloud environments. In the first post, MySQL/ZFS Performance Update, we compared the performance of ZFS and ext4. This time we’ll look at the benefits of using ephemeral storage devices. These devices, called ephemeral in AWS, local in Google Cloud, and temporary in Azure, are provided directly by the virtualization host. They are not network-attached and are not IO throttled, at least compared to regular storage. Not only can they handle a high number of IOPS, but their IO latency is also very low. For simplicity, we’ll name these devices local ephemeral. They can be quite large: Azure Lsv2, Google Cloud n2, and AWS i3 instance types offer TBs of fast NVMe local ephemeral storage.
The main drawback of local ephemeral devices is the loss of all the data if the VM is terminated. For that reason, the usage of local ephemeral devices is limited with databases like MySQL. Typical use cases are temporary reporting servers and Percona XtraDB Cluster (PXC)/Galera cluster nodes. PXC is a bit of a wild case here: the well polished and automated full state transfer of Galera overcomes the issue caused by having to reload the dataset when a cluster node is recycled. Because of data compression, much more data can be stored on an ephemeral device. Actually, our TPCC dataset fits on the 75GB of temporary storage when compressed. Under such circumstances, the TPCC performance is stellar as shown below.
On the local ephemeral device, the TPCC transaction rate is much higher, hovering close to 200 per minute. The ZFS results on the regular SSD Premium are included as a reference. The transaction rate during the last hour was around 50 per minute. Essentially, with the use of the local ephemeral device, the load goes from IO-bound to CPU-bound.
Of course, it is not always possible to only use ephemeral devices. We’ll now explore a use case for an ephemeral device, as a caching device for the filesystem, using the ZFS L2ARC.
What is the ZFS L2ARC?
Like all filesystems, ZFS has a memory cache, called the ARC, to avoid disk IOs when retrieving frequently used pieces of data. The ZFS ARC has a few additional tricks up its sleeve. First, when data compression is used on the filesystem, the compressed form is stored in the ARC. This helps store more data. The second ZFS trick is the ability to connect the ARC LRU eviction to a fast storage device, the L2ARC. L2 stands for “Level 2”, a bit like the leveled caches of CPUs.
Essentially, the ZFS ARC is a level 1 cache, and records evicted from it can be inserted into a level 2 cache, the L2ARC. For the L2ARC to be efficient, the device used must have a low latency and be able to perform a high number of IOPs. Those are characteristics of cloud ephemeral devices.
Configuration for the L2ARC
The ZFS L2ARC has many tunables, and many of these have been inherited from the recent past when flash devices were much slower for writes than for reads. So, let’s start at the beginning. Here is how we add an L2ARC to the ZFS pool bench, using the local ephemeral device /dev/sdb:
# zpool add bench cache /dev/sdb
Then, the cache device appears in the zpool:
# zpool status
  pool: bench
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        bench       ONLINE       0     0     0
          sdc       ONLINE       0     0     0
        cache
          sdb       ONLINE       0     0     0
Once the L2ARC is created, if we want data in it, we must start storing data in the ARC with:
# zfs set primarycache=all bench/data
This is all that is needed to get data flowing to the L2ARC, but the default parameters controlling the L2ARC have conservative values and it can be quite slow to warm up the L2ARC. In order to improve the L2ARC performance, I modified the following kernel module parameters:
l2arc_headroom=4
l2arc_write_boost=134217728
l2arc_write_max=67108864
zfs_arc_max=4294967296
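For readers who want to try these values, here is a minimal sketch of one way to apply them at runtime on Linux. It is not necessarily the procedure used for the benchmark. It assumes the module parameters are exposed under /sys/module/zfs/parameters, which is the usual location with ZFS on Linux. The function name apply_zfs_params and its optional directory argument are illustrative additions, the latter so the function can be pointed at a scratch directory for a dry run.

```shell
# Sketch: write the tunables listed above to the ZFS module parameter
# directory. The directory argument is a hypothetical convenience that
# defaults to the usual Linux location.
apply_zfs_params() {
    param_dir="${1:-/sys/module/zfs/parameters}"
    while read -r name value; do
        # Skip parameters that this ZFS version does not expose
        if [ ! -e "$param_dir/$name" ]; then
            echo "skipping $name: not found in $param_dir" >&2
            continue
        fi
        echo "$value" > "$param_dir/$name"
    done <<EOF
l2arc_headroom 4
l2arc_write_boost 134217728
l2arc_write_max 67108864
zfs_arc_max 4294967296
EOF
}

# Usage (as root): apply_zfs_params
```

Values written this way are lost when the module is reloaded. To keep them across reboots, the usual approach is an "options zfs name=value ..." line in a file under /etc/modprobe.d.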
Essentially, I am boosting the ingestion rate of the L2ARC. I am also slightly increasing the size of the ARC because the pointers to the L2ARC data are kept in the ARC. If you don’t use a large enough ARC, you won’t be able to add data to the L2ARC. That ceiling frustrated me a few times until I realized the entry l2_hdr_size in /proc/spl/kstat/zfs/arcstats is data stored in the metadata section of the ARC. The ARC must be large enough to accommodate the L2ARC pointers.
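To keep an eye on that ceiling, a small helper along these lines can report how much of the ARC is taken by the L2ARC pointers. This is only a sketch: the function name l2arc_header_usage and its optional file argument are illustrative, and it assumes the usual arcstats layout of name, type, and value columns with the size, c_max, and l2_hdr_size entries present.

```shell
# Sketch: compare the L2ARC header size with the ARC size and its maximum.
# The file argument is a hypothetical convenience for testing; by default
# the live kstat file is read.
l2arc_header_usage() {
    stats_file="${1:-/proc/spl/kstat/zfs/arcstats}"
    awk '
        $1 == "l2_hdr_size" { hdr = $3 }
        $1 == "size"        { size = $3 }
        $1 == "c_max"       { cmax = $3 }
        END {
            if (cmax == "" || cmax == 0) { print "c_max not found"; exit 1 }
            # Values are printed as strings to avoid integer overflow in some awks
            printf "l2_hdr_size=%s size=%s c_max=%s hdr_pct_of_c_max=%.1f\n", hdr, size, cmax, 100 * hdr / cmax
        }
    ' "$stats_file"
}

# Usage: l2arc_header_usage
```

If the reported percentage keeps growing while the L2ARC stops filling, the ARC is probably too small for the amount of data you are trying to cache.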
L2ARC Impacts on TPCC Results
So, what happens to the TPCC transaction rate when we add an L2ARC? Since the dataset is copied over every time, the L2ARC is fully warm at the beginning of a run. The figure below shows the ZFS results with and without an L2ARC in front of SSD premium Azure storage.
The difference is almost incredible. Since the whole compressed dataset fits into the L2ARC, the behavior is somewhat similar to the direct use of the local ephemeral device. Actually, since the write load is now sent to the SSD premium storage, the performance is even higher. However, after 4000s, the performance starts to degrade.
From what I found, this is caused by the thread feeding the L2ARC (l2arc_feed). As pages are updated by the TPCC workload, they are eventually flushed at a high rate to the storage. The L2ARC feed thread has to scan the ARC LRU to find suitable records before they are evicted. This thread then writes them to the local ephemeral device and updates the pointers in the ARC. Even if the write latency of the local ephemeral device is low, it is still significant, and it greatly limits the amount of work a single feed thread can do. Ideally, ZFS should be able to use more than a single L2ARC feed thread.
In the event you end up in such a situation with a degraded L2ARC, you can refresh it when the write load goes down. Just run the following command when activity is low:
# tar c /var/lib/mysql/data > /dev/null
It is important to keep in mind that a read-intensive or a moderately write-intensive workload will not degrade as much over time as the TPCC benchmark used here. Essentially, if a replica with one or even a few (2 or 3) replication threads can keep up with the write load, the ZFS L2ARC feed thread will also be able to keep up.
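One possible way to judge whether the L2ARC has degraded enough to justify a refresh is to watch its hit ratio. The sketch below assumes the l2_hits and l2_misses counters are present in arcstats. The function name l2arc_hit_ratio and its optional file argument are illustrative, and using the hit ratio as the signal is my suggestion, not something measured in the benchmark.

```shell
# Sketch: print the L2ARC hit ratio, in percent, from arcstats.
# The counters are cumulative since the module was loaded, so compare two
# readings taken some time apart to see the recent trend.
l2arc_hit_ratio() {
    stats_file="${1:-/proc/spl/kstat/zfs/arcstats}"
    awk '
        $1 == "l2_hits"   { hits = $3 }
        $1 == "l2_misses" { misses = $3 }
        END {
            total = hits + misses
            if (total == 0) { print "no L2ARC reads recorded"; exit }
            printf "%.1f\n", 100 * hits / total
        }
    ' "$stats_file"
}

# Usage: l2arc_hit_ratio
```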
Comparison with bcache
The ZFS L2ARC is not the only option to use a local ephemeral device as a read cache; there are other options like bcache and flashcache. Since bcache is now part of the Linux kernel, we’ll focus on it.
bcache is used as an ext4 read cache extension. Its content is uncompressed, unlike the L2ARC. The dataset is much larger than the size of the local ephemeral device so the impacts are expected to be less important.
As we can see in the above figure, this is exactly what we observe. The transaction rate with bcache is lower than with the L2ARC because less data is cached. Over the 2h period, the L2ARC yielded more than twice as many transactions as bcache. However, bcache is not without merit: it did help ext4 increase its performance by about 43%.
How to Recreate L2ARC if Missing
By nature, local ephemeral devices are… ephemeral. When a virtual machine is restarted, it could end up on a different host. In such a case, the L2ARC data on the local ephemeral device is lost. Since it is only a read cache, it doesn’t prevent ZFS from starting but you get a pool status similar to this:
# zpool status
  pool: bench
 state: ONLINE
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://zfsonlinux.org/msg/ZFS-8000-2Q
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        bench       ONLINE       0     0     0
          sdc       ONLINE       0     0     0
        cache
          /dev/sdb  UNAVAIL      0     0     0  cannot open
In such a case, the L2ARC can easily be fixed with:
# zpool remove bench /dev/sdb
# zpool add bench cache /dev/sdb
These commands should be called from a startup script to ensure the L2ARC is sane after a restart.
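As a starting point for such a startup script, here is a minimal sketch. It only acts when zpool status reports a cache device as UNAVAIL. The function names unavailable_cache_devices and recreate_l2arc are illustrative, and the sketch assumes the device name shown by zpool status (/dev/sdb in the example above) is still the right path for the new local ephemeral device after the restart.

```shell
# Sketch: list cache devices that `zpool status` reports as UNAVAIL.
# Reads the status output on stdin.
unavailable_cache_devices() {
    awk '
        # A blank line ends the current section
        NF == 0 { in_cache = 0; next }
        # One-word lines are section headers such as cache, logs, or spares
        NF == 1 { in_cache = ($1 == "cache"); next }
        in_cache && $2 == "UNAVAIL" { print $1 }
    '
}

# Sketch: remove and re-add every missing cache device of a pool (run as root).
recreate_l2arc() {
    pool="$1"
    zpool status "$pool" | unavailable_cache_devices | while read -r dev; do
        zpool remove "$pool" "$dev" && zpool add "$pool" cache "$dev"
    done
}

# Usage, for example from a startup script: recreate_l2arc bench
```

Keeping the parsing separate from the zpool calls makes it easy to check the detection logic against saved status output before letting the script modify a pool.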
In this post, we have explored the great potential of local ephemeral devices. These devices are a means to improve MySQL performance and reduce the costs of cloud hosting. Whether the device is used directly or as a caching device, ZFS data compression and architecture allow nearly triple the number of TPCC transactions to be executed over a 2-hour period.
There are still a few ZFS-related topics I’d like to cover in the near future. The posts may not come in this order, but the topics are: “Comparison with InnoDB compression”, “Comparison with BTRFS”, and “ZFS tuning for MySQL”. If some of these titles pique your interest, stay tuned.