
Offloading Hash Computations in MinIO* to Intel® QuickAssist Technology

BY Liang Fang, Hualong Feng, Kien Dinh, Vivian Zhu ON Dec 07, 2020

Introduction

Today, AI and big data are flourishing, and object storage is becoming more and more popular. As one of the most popular object storage systems, MinIO* is widely adopted, even as the native object storage system of Kubernetes*. After analyzing MinIO, we found several hot spots that consume too much CPU. The biggest hot spot is the calculation of the hash value. If this computation can be offloaded to dedicated acceleration hardware, it can significantly reduce the drain on the CPU and thereby improve the overall performance of the system. Hardware acceleration solutions are becoming increasingly popular; some are based on FPGAs, others on ASICs. This article introduces an ASIC solution using Intel® QuickAssist Technology (Intel® QAT), which is available on Intel® Xeon® Scalable platforms.

The Analysis

Profiling data shows that MD5, the default hash algorithm in MinIO, consumes about 20% of the system's total CPU resources, as shown in Figure 1 below:

Figure 1: MinIO hotspot captured by Linux* perf


Figure 2: Similar hot spot captured by the VTune™ analyzer
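A similar profile can be captured on a running MinIO server with Linux perf; the exact invocation is not given in this article, but a typical one is:

sudo perf top -p $(pidof minio)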

Figure 3 below shows the typical topology of MinIO. MinIO servers run on many machines, each with roughly the same number of drives. The command to run the MinIO servers is similar to the following:

./minio server http://<server IPs>/<paths to drive>
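For example, in a four-node cluster like the one in Figure 3, the placeholders can be filled in with MinIO's ellipsis expansion syntax; the IP range and mount points below are illustrative only, not taken from this article:

./minio server http://10.0.0.{3...6}/mnt/drive{1...4}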

MinIO is smart enough to organize drives into zones. When an object is put to MinIO, it is split into parts and saved on different drives within a zone. MinIO supports the S3 protocol, and clients can access it through language-specific SDKs, such as the Java* SDK (https://docs.min.io/docs/java-client-api-reference.html), the Python* SDK (https://docs.min.io/docs/python-client-api-reference.html), and the .NET SDK (https://docs.min.io/docs/dotnet-client-api-reference.html). MinIO clients can access the server directly or via a proxy.
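As a quick illustration, here is a minimal put-object client written with the Go SDK (minio-go), the language MinIO itself is written in. The endpoint, credentials, bucket, and file path are placeholders, not values taken from this article, and the bucket is assumed to already exist:

package main

import (
    "context"
    "log"

    "github.com/minio/minio-go/v7"
    "github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
    // Connect to a MinIO server (endpoint and credentials are illustrative).
    client, err := minio.New("10.0.0.6:9000", &minio.Options{
        Creds:  credentials.NewStaticV4("minioadmin", "minioadmin", ""),
        Secure: false,
    })
    if err != nil {
        log.Fatalln(err)
    }

    // Upload a local file as an object. On the server side the object is
    // split into parts, erasure coded, and spread across the drives of a zone.
    _, err = client.FPutObject(context.Background(), "test-bucket", "object-128MB",
        "/tmp/object-128MB", minio.PutObjectOptions{})
    if err != nil {
        log.Fatalln(err)
    }
}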


Figure 3: Typical MinIO topology

To fix the hot spot issue, let's look at the put-object workflow in MinIO. As shown in Figure 4 below, the put-object operation is conducted in sequence: the CPU carries out the MD5 calculation immediately every time it receives a block from a remote client via the NIC, then performs the erasure code calculation and writes the parts to drives in the zone. The hot spot is the box “MD5 calculation by CPU”, which consumes approximately 20% of the CPU resources.

Figure 4: Put object workflow in current implementation

The solution

To offload the MD5 calculation to Intel QAT, we move this heavy workload to the last cycle of the receive - erasure code - write data loop and optimize as follows:

  1. Let MinIO put the received data directly into the buffer provided by Intel QAT. This way, there is no need to copy data between MinIO and the Intel QAT hardware, and throughput improves by approximately 3% in our measurements. The Intel QAT wrapper library allocates physically contiguous memory when the MinIO server starts, maintains these buffers in a pool, and reuses them for subsequent objects.
  2. Trigger Intel QAT to do the MD5 calculation in a separate goroutine that runs in parallel with the last erasure code - write data cycle (see the sketch after this list). This way, the time Intel QAT consumes can almost be hidden.
  3. Fall back to the CPU for the MD5 calculation if Intel QAT is almost out of resources.
  4. Allocate contiguous memory from the same NUMA socket where the Intel QAT device resides.
  5. Bind Intel QAT instances to at least six CPU cores in the same NUMA socket.
  6. Put Intel QAT in the same NUMA node as the network adapter.
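
The sketch below is a minimal Go illustration of steps 2 and 3 above; it is not MinIO's actual code, and the qatHasher interface only stands in for the Intel QAT wrapper library, whose real API is not shown in this article:

package qatoffload

import "crypto/md5"

// qatHasher is a hypothetical stand-in for the Intel QAT wrapper library.
type qatHasher interface {
    SumMD5(data []byte) [md5.Size]byte // MD5 computed on the accelerator
    Busy() bool                        // true when Intel QAT is almost out of resources
}

// putObject sketches the optimized flow: the MD5 of the whole object is
// computed in a separate goroutine (on Intel QAT when it has capacity, on the
// CPU otherwise) while the last erasure code - write data cycle runs.
func putObject(h qatHasher, data []byte, encodeAndWrite func([]byte) error) ([md5.Size]byte, error) {
    // Buffered channel so the hashing goroutine never blocks, even on error.
    sumCh := make(chan [md5.Size]byte, 1)
    if h != nil && !h.Busy() {
        go func() { sumCh <- h.SumMD5(data) }() // offload to Intel QAT
    } else {
        go func() { sumCh <- md5.Sum(data) }() // fall back to the CPU
    }
    if err := encodeAndWrite(data); err != nil { // erasure code + write the parts
        return [md5.Size]byte{}, err
    }
    return <-sumCh, nil // by now the hash is usually already finished
}

Here, encodeAndWrite stands in for MinIO's existing erasure coding and drive write path, and data would live in one of the pinned, physically contiguous buffers from the pool described in step 1.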

Figure 5: Optimized workflow

Performance results

The performance is measured in a cluster with four servers as shown in the figure below:

Figure 6: Test environment topology

Table 1 below lists the details of the configuration:

CPU: Intel® Xeon® Platinum 8260M CPU @ 2.40 GHz * 2 sockets
Memory: DDR4 2666 MHz, 32 GB * 8
NIC: Mellanox* Technologies MT27800 Family [ConnectX-5], 100 Gbps; Mellanox MT27700 [ConnectX-4], 40 Gbps
Storage: Intel P4510 1.5 TB * 4
OS: Ubuntu* 18.04
Intel QAT version: Intel® C62x Chipset, Intel QAT Technology (rev 04)

Table 1: Test environment

With hyperthreading enabled, there are 96 logical CPU cores in each server. However, a typical storage server is equipped with a lower-end CPU, e.g. an Intel® Xeon® Silver 4xxx series part with 10 cores. Therefore, we limited the number of CPU cores in the server (10.0.0.6) to simulate that real-world scenario.
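The article does not state how the core count was limited; one common way to do this (an assumption on our part) is to pin the MinIO server process to a subset of cores, for example to the first 16:

taskset -c 0-15 ./minio server http://<server IPs>/<paths to drive>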

There are four SSDs in each server; across the four servers, they are split into four zones with four SSDs each.


Figure 7: Performance result with 16 CPU Cores

In the 16-core environment, a performance gain of ~11% is achieved when putting objects of 128 MB. If the object size is small, e.g. 4 MB, there is no performance gain because of the deep software stack involved in hardware offloading.

Figure 8: Performance result with different CPU Cores


When the CPU core count increases from 16 to 24 and then to 32, the performance gain shrinks and eventually becomes a loss. This is because:

1) When the number of CPU cores increases, the CPU can easily calculate the MD5 checksum immediately every time the reader receives data from the client. The table below shows the time spent in each phase as the CPU core count increases from 16 to 32. In all measurements CPU utilization stays at ~90%, but the time spent in each phase differs. Take “Erasure Code” for example: it takes 133.7 ms with 16 cores, but only 36.8 ms with 32 cores. The MD5 calculation is handled by Intel QAT, so its time is almost the same. Polling mode is used in the Intel QAT library, which is why “MD5 calculation” is still slightly affected by the CPU core count – 65.5 ms vs. 59.0 ms.


Intel QAT is not faster than the CPU at MD5, so it is pointless to offload when the CPU has enough spare resources.

2) Intel QAT needs the whole object to be received before it can start the MD5 calculation, and the calculation itself does take time. When the CPU is busy, the “erasure code – write data” process slows down; the Intel QAT MD5 calculation then takes less time than that process, so the two overlap well. But when the CPU has enough resources, that process is so fast that MinIO ends up waiting for Intel QAT to complete. A further optimization can be introduced here: don't offload when the CPU is not busy (a possible heuristic is sketched after this list).

3) An Intel QAT offload may trigger a thread switch when MinIO finishes its work before Intel QAT does and then has to wait on a pipe.
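
The article only states the idea of not offloading when the CPU is idle; one possible heuristic (an assumption on our part, not MinIO or Intel QAT library code) is to compare the load average against the number of usable CPUs before deciding to offload:

package qatoffload

import (
    "os"
    "runtime"
    "strconv"
    "strings"
)

// shouldOffload implements the "don't offload when the CPU is not busy" idea
// with a simple heuristic: offload to Intel QAT only when the 1-minute load
// average is at least the number of usable CPUs.
func shouldOffload() bool {
    b, err := os.ReadFile("/proc/loadavg")
    if err != nil {
        return false // cannot tell, stay on the CPU path
    }
    fields := strings.Fields(string(b))
    if len(fields) == 0 {
        return false
    }
    load, err := strconv.ParseFloat(fields[0], 64)
    if err != nil {
        return false
    }
    return load >= float64(runtime.NumCPU())
}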

User scenario and limitation

This optimization only benefits scenarios where the MinIO cluster runs on servers with low-end CPUs, or where the CPU is otherwise the bottleneck. In other cases the performance gain is limited.

Conclusion

In this article, we have shown that MinIO performance on a typical storage server can be improved by using Intel QAT for the MD5 calculation. By offloading work to the hardware accelerator, the CPU can handle more disk requests, which means more disks can be mounted on the same server, reducing server expenditure per unit of disk capacity. PCIe 5.0 is coming soon, and the performance improvement from using accelerators will become more and more obvious in the future.

If you want to try this approach, download the code below and run it in your environment:

References

Disclaimers

Intel, VTune, and Xeon are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.