Offloading Hash Computations in MinIO* to Intel® QuickAssist Technology
Today, AI and big data are flourishing, and object storage is becoming increasingly popular. MinIO* is one of the most widely adopted object storage systems, even serving as a native object storage system for Kubernetes*. After profiling MinIO, we found several hot spots that consume excessive CPU. The biggest hot spot is the hash calculation. If this work can be offloaded to dedicated acceleration hardware, it can significantly reduce the drain on the CPU and thereby improve overall system performance. Hardware acceleration solutions are increasingly popular, some based on FPGAs and others on ASICs. This article introduces an ASIC solution using Intel® QuickAssist Technology (Intel® QAT), which is available with Intel® Xeon® Scalable processors.
Profiling data shows that MD5, the default hash algorithm in MinIO, consumes about 20% of the entire system's CPU resources, as shown in Figure 1 below:
Figure 1: MinIO hotspot captured by Linux* perf
Figure 2: VTune™ analyzer capturing a similar hot spot
Figure 3 below shows a typical MinIO topology. MinIO servers run on many machines, each with a roughly equal number of drives. The command to run a MinIO server is similar to the following:
./minio server http://<server IPs>/<paths to drive>
MinIO is smart enough to organize drives into zones. When an object is put to MinIO, it is split into parts and saved on different drives within a zone. MinIO supports the S3 protocol, and clients can access MinIO through language-specific SDKs, such as the Java* SDK (https://docs.min.io/docs/java-client-api-reference.html), Python* SDK (https://docs.min.io/docs/python-client-api-reference.html), and .NET SDK (https://docs.min.io/docs/dotnet-client-api-reference.html). MinIO clients can access the server directly or via a proxy.
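For illustration, a four-server cluster with four drives per server could be started as follows, using MinIO's standard ellipsis expansion syntax for distributed deployments. The IP addresses and mount paths here are hypothetical:

```shell
# Hypothetical example: four MinIO servers (10.0.0.1 through 10.0.0.4), each
# exporting four drives. The {1...4} ellipsis notation is MinIO's distributed
# deployment syntax; the same command is run on every server in the cluster.
./minio server http://10.0.0.{1...4}/mnt/drive{1...4}
```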
Figure 3: Typical MinIO topology
To fix the hot spot, let's look at the put-object workflow in MinIO. As shown in Figure 4 below, the put-object operation runs sequentially. The CPU carries out the MD5 calculation immediately every time it receives a block from a remote client via the NIC. It then performs the erasure-code calculation and writes the parts to drives in the zone. The hot spot is the box "MD5 calculation by CPU", which consumes approximately 20% of the CPU resources.
Figure 4: Put object workflow in current implementation
To offload the MD5 calculation to Intel QAT, we move this heavy workload to the last cycle of the receive / erasure-code / write-data loop, and optimize as follows:
- Let MinIO put the received data directly into the buffer provided by Intel QAT. This avoids copying data between MinIO and the Intel QAT hardware, and improves throughput by approximately 3% in our measurements. The Intel QAT wrapper library allocates contiguous physical memory when the MinIO server starts; it maintains these buffers in a pool and reuses them for subsequent objects.
- Trigger Intel QAT to do the MD5 calculation in a separate goroutine, running in parallel with the last erasure-code / write-data cycle. This way, the time Intel QAT consumes is almost completely hidden.
- Fall back to the CPU for the MD5 calculation if Intel QAT is almost out of resources.
- Allocate contiguous memory from the same NUMA socket where the Intel QAT device resides.
- Bind Intel QAT instances to at least six CPU cores in the same NUMA socket.
- Put Intel QAT in the same NUMA node as the network adapter.
Figure 5: Optimized workflow
The performance is measured in a cluster with four servers as shown in the figure below:
Figure 6: Test environment topology
Table 1 below lists the details of the configuration:
CPU: Intel® Xeon® Platinum 8260M CPU @ 2.40GHz × 2 sockets
Memory: DDR4 2666 MHz 32 GB × 8
NIC: Mellanox* Technologies MT27800 Family [ConnectX-5] 100 Gbps; Mellanox MT27700 [ConnectX-4] 40 Gbps
SSD: Intel P4510 1.5 TB × 4
Intel QAT version: Intel® C62x Chipset, Intel QAT Technology (rev 04)
Table 1: Test environment
With hyperthreading enabled, there are 96 logical CPU cores in each server. However, a typical storage server is equipped with a low-end CPU, e.g. the Intel® Xeon® Silver 4xxx series with 10 cores. Therefore, we limited the number of CPU cores in the server (10.0.0.6) to simulate a real-world scenario.
There are four SSDs in each server. Across the four servers, the 16 SSDs are split into four zones of four SSDs each.
Figure 7: Performance result with 16 CPU Cores
In the 16-core environment, a performance gain of roughly 11% is achieved when putting objects of 128 MB. If the object size is too small, e.g. 4 MB, there is no performance gain, due to the deep stack of the hardware offloading path.
Figure 8: Performance result with different CPU Cores
When increasing the CPU core count from 16 to 24 and then to 32, the performance gain shrinks and eventually turns into a loss. This is because:
1) When the number of CPU cores increases, the CPU can easily calculate the MD5 checksum inline every time the reader receives data from the client. The table below shows the time spent in each phase as the CPU core count increases from 16 to 32. In all measurements, CPU utilization stays at ~90%, but the time spent in each phase differs. Take "Erasure Code" for example: it takes 133.7 ms with 16 cores, but only 36.8 ms with 32 cores. The MD5 calculation is handled by Intel QAT, so its time is almost constant. The Intel QAT library uses polling mode, which is why "MD5 calculation" is still slightly affected by the CPU core count (65.5 ms vs. 59.0 ms).
Intel QAT is not faster than the CPU, so offloading is pointless when the CPU has resources to spare.
2) Intel QAT needs the whole object to be received before it can start the MD5 calculation, so the calculation itself does take time. When the CPU is busy, the erasure-code / write-data process slows down; the Intel QAT MD5 calculation then takes less time than this process and parallelizes well with it. But if the CPU has plenty of resources, this process is very fast, and MinIO even has to wait for Intel QAT to complete. A further optimization can be introduced here: don't offload when the CPU is not busy.
3) An Intel QAT offload may trigger a thread switch when MinIO is faster than Intel QAT and then has to wait on a pipe.
User scenarios and limitations
This optimization only benefits scenarios where the MinIO cluster runs on servers with low-end CPUs, or where the CPU is otherwise the bottleneck. In other cases, the performance gain is limited.
In this article, we have shown that MinIO performance on a typical storage server can be improved by using Intel QAT for the MD5 calculation. By offloading work to the hardware accelerator, the CPU can handle more disk requests, which means more disks can be mounted on the same server, reducing server expenditure per unit of disk capacity. With PCIe 5.0 coming soon, the performance improvement from accelerators will become increasingly obvious.
If you want to try this approach, download the code below and run it in your environment:
- MinIO: https://github.com/liangintel/minio
- Go wrapper for the Intel QAT library: https://github.com/liangintel/md5accel
Intel, VTune, and Xeon are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.