
Intel® Optane™ Technology-equipped Storage Solution to Accelerate the China Unicom Cloud

BY Liang Fang ON Jun 10, 2020

Introduction

Intensive IO reads and writes are commonplace on a cloud system. Have you run into a disk read/write bottleneck? If it is causing you trouble, this whitepaper is for you.

You may have experienced performance that was excellent during pre-testing but terrible after going live. Why does this happen? One of the root causes is the difference in IO latency between the pre-test system and the production system. Why does latency matter? Let’s look at the queue depth of some commonly used applications. The figure below shows that most applications have a read queue depth of less than three, which means they issue disk IOs almost synchronously, so IO latency significantly impacts their performance.

* Data from: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/shrout-research-cost-benefit-analysis-qc-flash-white-paper.pdf
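To see why latency dominates at such shallow queue depths, consider a back-of-the-envelope calculation (not part of the original study): with at most QD requests in flight, throughput is capped at roughly QD divided by the per-IO latency. The sketch below illustrates this bound for a few latency classes.

```python
# Rough illustration: at low queue depth, throughput is bounded by per-IO latency,
# so a synchronous workload cannot hide a slow disk behind concurrency.
def max_iops(latency_us: float, queue_depth: int) -> float:
    """Upper bound on IOPS when at most `queue_depth` requests are in flight
    and each request takes `latency_us` microseconds to complete."""
    return queue_depth * 1_000_000 / latency_us

for latency in (10, 100, 2000):       # Optane-class, NAND-class, remote-disk-class latency
    for qd in (1, 2, 3):              # typical read queue depths from the figure above
        print(f"latency={latency:>5}us  QD={qd}  max IOPS={max_iops(latency, qd):>9.0f}")
```

A 2ms remote disk serving a queue-depth-1 workload therefore cannot exceed roughly 500 IOPS, no matter how many nodes sit behind it.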

For a storage system, multiple nodes can be stacked into a distributed architecture to easily improve concurrency. High latency, however, is difficult to resolve: whether you remove processing modules from the IO path or optimize the speed of the bottleneck module, the solution is complex and the return on investment is questionable. In such cases, solving the problem at the underlying hardware level is the best approach. In this paper, we introduce a use case in which an Intel Optane technology-equipped storage solution accelerates the China Unicom Cloud.

The Intel Optane SSD is an Intel-branded product based on Intel® 3D XPoint™ technology. Intel Optane SSDs have extremely low latency and much better endurance than NAND SSDs. The figure below shows the latency advantages.

Data from: https://www.intel.com/content/www/us/en/architecture-and-technology/optane-technology/balancing-bandwidth-and-latency-article-brief.html

At present, Intel Optane SSDs are increasingly being used as a data cache. In 2019, the China Unicom WoCloud team and Intel conducted joint optimization research for WoCloud based on the Intel Optane SSD. Unicom WoCloud’s massive data retrieval platform collects business data from each province. The average daily collection volume of a single province is up to five million records, with peaks of up to 10 million. After the stored data is processed, it is made available to trusted third parties through pre-defined open interfaces. The platform uses Elasticsearch (ES) as its distributed search engine. Due to the limited IO read/write performance of the cloud hosts, large-scale queries and conditional queries respond slowly. Most ES scenarios are read-heavy rather than write-heavy. Therefore, while it is important to optimize the system's table partitioning and ES retrieval, there are more important optimization points, such as improving the cloud hosts' IO read/write performance and reducing the latency of read requests.

WoCloud is built on OpenStack*, and the backend storage and compute nodes are connected via the iSCSI protocol or RBD. The compatibility and stability of these connection methods are very good, but they have a fatal weakness: latency is too high, typically on the order of milliseconds. For example, in the following simulation test, the average random read latency reaches two milliseconds, and random write latency is roughly twice that of read.

IO Type          Avg Latency (us)     P99          P99.9         P99.99        IOPS
Rand Read        2006.2               2343         2835          9110          19142.84
Rand Write       4991.5               6390         8160          14484         2496.16
Rand R/W (7/3)   2030.72/5438.97      2278/7046    2835/11731    20055/17171   5340.96/2292.75
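Baseline numbers like those above are typically collected with FIO. The snippet below is a minimal sketch of how such a run could be scripted and summarized in Python, assuming fio 3.x with JSON output; the device path, job parameters, and exact JSON field names are placeholders to adapt and verify against your fio version.

```python
import json
import subprocess

# Sketch only: measure 4k random-read latency on the remote (Ceph-mapped) device.
# /dev/vdb is a placeholder path; adjust runtime/iodepth to match your test plan.
FIO_CMD = [
    "fio", "--name=randread", "--filename=/dev/vdb", "--rw=randread",
    "--bs=4k", "--iodepth=1", "--direct=1", "--ioengine=libaio",
    "--runtime=60", "--time_based", "--output-format=json",
]

result = subprocess.run(FIO_CMD, capture_output=True, text=True, check=True)
job = json.loads(result.stdout)["jobs"][0]["read"]     # JSON layout as of fio 3.x
pct = job["clat_ns"]["percentiles"]                    # completion-latency percentiles (ns)

print(f"Avg latency : {job['clat_ns']['mean'] / 1000:.2f} us")
print(f"P99/P99.9/P99.99 : {pct['99.000000'] / 1000:.0f} / "
      f"{pct['99.900000'] / 1000:.0f} / {pct['99.990000'] / 1000:.0f} us")
print(f"IOPS        : {job['iops']:.2f}")
```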

 

Directly migrating the application onto an Intel Optane SSD would give the best performance, but considering cost-effectiveness for the business, we use it as a cache instead, which requires a cache management layer: OpenCAS. OpenCAS is a project derived from Intel® Cache Acceleration Software (Intel® CAS). OpenCAS exposes a virtual block device in the system; the application is deployed on this block device, and all IO requests are received and processed by OpenCAS. The figure below shows the IO read/write path and the simplified IO stack.

Like mainstream caching software, OpenCAS supports multiple caching modes, including Write-Through (WT), Write-Around (WA), Write-Invalidate (WI), Write-Back (WB), Write-Only (WO), and Pass-Through (PT). In the final design, OpenCAS is installed on the compute node and configured to cache the disk that is mapped from the remote Ceph* cluster, as shown in the figure below.
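As a rough sketch of what this configuration could look like on a compute node, the snippet below wraps the casadm tool shipped with OpenCAS; the device paths and cache ID are placeholders, and the exact casadm options should be verified against the OpenCAS release you install.

```python
import subprocess

CACHE_DEV = "/dev/nvme0n1"   # Intel Optane SSD (placeholder path)
CORE_DEV = "/dev/vdb"        # disk mapped from the remote Ceph cluster (placeholder path)
CACHE_ID = "1"

def run(*args: str) -> None:
    """Print and execute a command, raising on failure."""
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# Start a cache instance on the Optane device in write-back mode
# (use "wt" for write-through instead).
run("casadm", "--start-cache", "--cache-device", CACHE_DEV,
    "--cache-mode", "wb", "--cache-id", CACHE_ID)

# Attach the Ceph-backed disk as a core device. OpenCAS then exposes a new
# virtual block device (e.g. /dev/cas1-1) that the application should use
# instead of the raw remote disk.
run("casadm", "--add-core", "--cache-id", CACHE_ID, "--core-device", CORE_DEV)

# Inspect configuration and runtime statistics (hit rate, dirty blocks, ...).
run("casadm", "--stats", "--cache-id", CACHE_ID)
```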

To avoid affecting the daily operation of China Unicom's massive data retrieval platform, our test was based on a lightweight laboratory environment. The test used a 750GB Intel Optane SSD DC P4800X with a nominal read/write latency of 10 microseconds.

Performance result summary

IO Type          Avg Latency (us)     P99          P99.9        P99.99       IOPS

Cache mode: Write-Through
Rand Read        52.04                67           114          1598         118046.38
Rand Write       4543.23              6063         8717         12125        3185.91
Rand R/W (7/3)   190.39/4744.42       247/6652     265/8717     347/12256    7371.49/3163.94

Cache mode: Write-Back
Rand Read        51.01                67           103          135          124237.81
Rand Write       118.22               194          273          314          53463.16
Rand R/W (7/3)   76.89/87.68          123/159      169/227      241/289      41032.94/17593.03

 

The data were grouped and compared by random read, random write, and 7:3 random read/write, and plotted as a histogram to make the test results more intuitive and easier to understand. In the chart, "Ceph" refers to the result of testing the remote Ceph disk directly, without a cache.

While testing random reads, we looped the FIO test tool 30 times. As the test progressed, the cache hit rate gradually increased from 1.3% in the first few seconds to a final 95%-97%, and performance improved accordingly. The data reported here are from the best-performing of the 30 runs. The following screenshot shows the cache hit statistics reported by OpenCAS.
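A warm-up loop of this kind could be scripted roughly as follows; the virtual device path, cache ID, and job parameters are placeholders, and the hit-rate lines are simply filtered out of the casadm statistics output.

```python
import subprocess

# Sketch of the warm-up loop described above: repeat a short fio random-read job
# against the OpenCAS virtual device and watch the hit ratio climb as the cache fills.
FIO_JOB = ["fio", "--name=warmup", "--filename=/dev/cas1-1", "--rw=randread",
           "--bs=4k", "--iodepth=1", "--direct=1", "--runtime=30",
           "--time_based", "--output-format=json"]

for i in range(30):
    subprocess.run(FIO_JOB, capture_output=True, check=True)
    # casadm prints per-cache statistics, including read hit counts/percentages.
    stats = subprocess.run(["casadm", "--stats", "--cache-id", "1"],
                           capture_output=True, text=True, check=True).stdout
    hit_lines = [line for line in stats.splitlines() if "hit" in line.lower()]
    print(f"--- run {i + 1} ---")
    print("\n".join(hit_lines))
```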

As the comparison charts above show, after adding the cache, whether in write-through or write-back mode, read latency improves greatly. The average latency drops from 2006.2us to 52.04us in write-through and 51.01us in write-back, roughly a 40x reduction. P99 latency drops from 2343us to 67us in both modes, a reduction of about 97%. At the same time, read IOPS increases from 19142.84 to 118046.38 in write-through and 124237.81 in write-back, an increase of about six times.

In the random write scenario, there is a huge performance gap between the cache modes. In write-through mode, the data shows no significant improvement in latency or throughput. This is expected, because in this mode every write is also committed to the remote disk, which is essentially the same as writing to the remote disk directly. In write-back mode, however, write performance improves greatly: average latency and P99 are reduced by more than 97%, and write IOPS increases by 21 times.

In the scenario with a 7:3 random read/write ratio, write-through mode reduces write latency by only 12.7% (5438.97 vs. 4744.42), while read performance is good: average read latency is reduced by about 90% (2030.72 vs. 190.39) and P99 by 89% (2278 vs. 247), effectively suppressing the latency jitter of read requests. Write-back mode shows excellent performance advantages, with read latency reduced by 96.2% (2030.72 vs. 76.89), write latency reduced by 98.3% (5438.97 vs. 87.68), and read and write IOPS each increased by about 7.6 times.
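For reference, the improvement factors quoted above can be reproduced directly from the two tables; the short calculation below compares the no-cache baseline against write-back mode for the pure read and pure write cases.

```python
# Quick arithmetic check of the quoted improvements, using values from the tables above.
baseline = {"read_lat": 2006.2, "write_lat": 4991.5,
            "read_iops": 19142.84, "write_iops": 2496.16}
write_back = {"read_lat": 51.01, "write_lat": 118.22,
              "read_iops": 124237.81, "write_iops": 53463.16}

print(f"read latency : {baseline['read_lat'] / write_back['read_lat']:.1f}x lower "
      f"({(1 - write_back['read_lat'] / baseline['read_lat']) * 100:.1f}% reduction)")
print(f"write latency: {baseline['write_lat'] / write_back['write_lat']:.1f}x lower "
      f"({(1 - write_back['write_lat'] / baseline['write_lat']) * 100:.1f}% reduction)")
print(f"read IOPS    : {write_back['read_iops'] / baseline['read_iops']:.1f}x higher")
print(f"write IOPS   : {write_back['write_iops'] / baseline['write_iops']:.1f}x higher")
```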

Conclusion

Using OpenCAS with an Intel Optane SSD as a cache greatly improves the read and write performance of an OpenStack system in certain use cases. If the use case requires strict data consistency, select write-through mode, which improves the performance of read-intensive applications. If the use case requires high write performance or good overall disk read/write performance, and data consistency requirements are relaxed or covered by other compensating measures, choose write-back mode, which greatly improves both read and write performance. The WoCloud massive data retrieval platform is mainly read-intensive and sensitive to read-request latency, so this experiment provides a valuable reference for its tuning decisions and platform upgrades.

If you want to try this approach, download the code below (under review as of 08 May 2020) and run it in your OpenStack environment:

· Cinder patch: https://review.opendev.org/#/c/700799/

· Os-brick patch: https://review.opendev.org/#/c/663549/

· Nova patch: https://review.opendev.org/#/c/663542/

References

· https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/shrout-research-cost-benefit-analysis-qc-flash-white-paper.pdf

· https://www.intel.com/content/www/us/en/architecture-and-technology/optane-technology/balancing-bandwidth-and-latency-article-brief.html

Disclaimers

Intel, 3D XPoint, and Intel Optane are trademarks of Intel Corporation or its subsidiaries.