
Intel PMEM-Equipped Storage Performance Evaluation of Container Platform

By Tingjie Chen and Vivian Zhu, June 23, 2021

Preface

Have you encountered any of these scenarios?

  • Your company has just purchased Intel® Optane™ technology products. You have designed a solution to accelerate your business and it works well, but you need a comprehensive performance evaluation to show the data-proven result.
  • You are responsible for maintaining a large storage cluster and would like to improve storage performance using the new cache solution. You need a set of evaluation tools to easily build up the benchmark and prove the performance boost on a typical workload.
  • You are an operations engineer. To improve efficiency, the R&D team has adopted Kubernetes* for elastic deployment. As the business transforms to cloud native, the benchmarking tools must also be containerized. How can the storage system integrate seamlessly with containerized business applications and benchmarking tools?

This whitepaper describes a storage performance evaluation method for the container platform, which aims to address issues like these.

Introduction

As a new type of infrastructure, cloud native technology represented by containers is an important supporting technology for the digital construction of enterprises. Cloud native has gradually emerged in cutting-edge technologies such as artificial intelligence, big data, edge computing, and 5G, and has become the driving force for digital infrastructure.

Ceph* is a distributed open source storage solution characterized by scalability, high performance, and no single point of failure. It can run on any hardware that meets its requirements. Maintaining a Ceph cluster, however, involves a certain degree of complexity, and Rook* was created to address this. Rook is a Kubernetes Operator that provides Ceph cluster management capabilities. Rook uses CRD (Custom Resource Definition) controllers to deploy and manage Ceph resources, which simplifies operations and maintenance for Ceph storage.

The test is based on the Alauda* ACP3.0 full-stack cloud native open platform. ACP3.0 is a new generation of full-stack hybrid cloud platform built with Kubernetes at its core. ACP includes five product lines and 16 sub-products, covering five platforms: cloud native infrastructure, cloud native application architecture, cloud native DevOps, cloud native data services, and cloud native digital intelligence. The whole platform adopts a native container architecture, with Kubernetes as the base and control plane. It supports one-click deployment, automatic operations and maintenance, and continuous upgrades, and is open, flexible, and scalable.

ACP3.0 can provision different types of storage resources through Kubernetes StorageClasses and fully supports Ceph, NFS, and other storage systems. The platform also provides a hyper-converged intra-cluster storage solution, called built-in storage. Built-in storage is a highly scalable distributed storage solution that supports block storage and file storage to accommodate small and medium-sized storage needs. The built-in storage adopts the open source Rook storage solution and is deeply customized to provide a distributed storage service that can be automatically managed, expanded, and repaired.

Typical Scenarios

Benchmarking

Micro-benchmark tools such as FIO and Vdbench are widely used in the storage industry. These tools are flexible to configure and simple in operation, and can quickly verify individual performance under a specific configuration, such as 4K random read and write, but they cannot reflect the real workload of a complex production environment. Simulated complex benchmarks fill this gap: they define a specific workload model, simulate various scenarios by operating on a database, and make these simulated behaviors reproducible.

In general, micro-benchmark testing and simulated complex benchmark testing systems can complement each other but their usage scenarios are different. Simulated complex benchmark tests are suitable for the following scenarios:

  • Comprehensive evaluation of storage systems. Evaluate the performance ceiling of the system and find bottlenecks and possible defects. This scenario is relatively complicated and is not discussed in this white paper.
  • Introduction of new solutions or important features. For example, replacing new hardware, major version upgrades of storage systems, introducing new solution architectures, or switching the original system to a new system. The whitepaper will focus on two of these scenarios.
  • Optimization of benchmarks for customer requirements. Customers have online business scenarios, and we need to simulate these scenarios for relevant optimization.

To address these scenarios, we introduced CockroachDB* and HammerDB, which are complex benchmarks representative of production environments, and deployed them in a Kubernetes cluster in a containerized manner. According to IDC data, 80% of enterprise-level data is unstructured, and 75% of it will be stored on object storage. So, we also introduced the CosBench benchmark to meet the rapidly increasing demand for object storage on cloud platforms.

Container environments, represented by Kubernetes, are increasingly deployed in production. This whitepaper presents typical scenarios in container environments, using new hardware (Intel Optane persistent memory) and a new solution (the Open CAS framework) to accelerate the distributed storage system Rook-Ceph and the business layer above it.

The following briefly introduces the typical benchmark tests that we selected.

 

CockroachDB – Built-in workloads

  • Feature: CockroachDB is a scalable, cross-region replicated database that supports transactions, high availability, and highly consistent distributed SQL. It comes with a built-in load generator for simulating different types of client workloads.
  • Workloads:
      Bank: Models a set of accounts using a currency balance table.
      KV: Reads and writes key-value pairs uniformly and randomly throughout the cluster.
      MovR: Simulates the workload of the MovR sample application. MovR is a fictional vehicle-sharing company designed to showcase CockroachDB functionality.
      TPCC: Simulates online transaction processing workloads using a rich schema of multiple tables.
      YCSB: Simulates large-scale read-, write-, or scan-based key-value workloads using custom functions.
  • Evaluation criteria: Op/s (operations processed per second) and average latency.

HammerDB – OLTP

  • Feature: HammerDB is a benchmark and load test suite that supports the currently popular databases Oracle*, SQL Server, IBM* DB2, MySQL*, MariaDB*, PostgreSQL*, and Redis*. It supports workloads based on TPC-C and TPC-H.
  • Workloads:
      TPC-C is recognized as the most authoritative and complex online transaction processing benchmark in the industry. It tests a wide range of database performance by simulating warehouse and order management systems.
      TPC-H is a decision support benchmark, also called an OLAP (Online Analytical Processing) benchmark, which consists of a set of business-oriented ad hoc queries and concurrent data modifications.
  • Evaluation criteria:
      TPC-C: transactions-per-minute-C (tpmC) and associated price-per-tpmC ($/tpmC).
      TPC-H: Composite Query-per-Hour Performance Metric (QphH@Size) and associated price-per-QphH ($/QphH@Size).

CosBench – Object Store

  • Feature: CosBench is a distributed benchmark tool used to measure the performance of cloud object storage services. It supports mainstream object storage interfaces such as Amazon* S3 and OpenStack* Swift.
  • Workload: The configuration file defines operations such as reading and writing objects.
  • Evaluation criteria: Throughput, bandwidth, and response time.


Scenario and Framework

 

We use Rook to deploy a containerized Ceph cluster. There are three deployment options for the OSD storage service:

  • Baseline: SSD used directly as the BlueStore partition.
  • Reference-1: SSD used as the BlueStore data partition, with metadata (RocksDB and WAL) deployed on Intel Optane persistent memory devices.
  • Reference-2: SSD and Intel Optane persistent memory devices combined into a new CAS device through Open-CAS software, which serves as the BlueStore data partition; metadata is deployed on Intel Optane persistent memory devices.

Intel Optane persistent memory can be manually configured as a block device, but this brings additional management and maintenance costs. In Kubernetes, Intel Optane persistent memory resources can instead be provisioned automatically by a CSI (Container Storage Interface) plugin. PMem-CSI (the CSI driver for Intel Optane persistent memory) provides container-oriented storage volumes, which can be used as the metadata partition of Ceph BlueStore. Ceph can in turn provide RBD block volumes to container applications through Ceph-CSI. The benchmarks, deployed as Kubernetes application services, are used to evaluate the storage performance improvement with Intel Optane persistent memory.
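As an illustration, a StorageClass along the following lines can expose PMem-CSI volumes to the Rook cluster. This is a minimal sketch rather than the exact configuration used in the test: the class name local-pmem mirrors the appendix example, the provisioner name is the PMem-CSI driver's default, and the remaining fields are assumptions that may differ per deployment.

# Hypothetical minimal StorageClass backed by the PMem-CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-pmem
provisioner: pmem-csi.intel.com          # default PMem-CSI driver name (assumption)
volumeBindingMode: WaitForFirstConsumer  # delay binding until the consuming pod is scheduled
reclaimPolicy: Delete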

Open-CAS is an open source cache acceleration solution developed by Intel. By loading a kernel module, it uses the high-speed device as a cache that is "fused" with the slow device into a single block device.

Performance Evaluation Environment

Software & Hardware Environment

For the typical scenarios, we built a compatible environment, including hardware and software configuration. On the server side, a Rook Ceph storage cluster provides block and object storage services. On the client side, a set of benchmarking programs is deployed to run workloads.

This environment was built and tested in Alauda Lab.

 

 

Hardware Configuration

  • Nodes: 3 servers + 1 client
  • CPU: 2 * Intel® Xeon® Gold 6252N CPU @ 2.30 GHz; 1 * Intel® Xeon® Platinum 8280L CPU @ 2.70 GHz
  • Memory: 16 * 32 GB
  • Intel Optane persistent memory: 4 * 128 GB
  • SSD: 3 * SSD 730 Series 800 GB
  • Network switch: 1 Gb/s

Software Configuration

  • Operating System: CentOS* Linux* release 7.8.2003
  • Linux Kernel: 3.10.0-1160.6.1.el7.x86_64
  • Docker*: 19.03.9
  • Kubernetes: 1.18.13
  • Ceph: Octopus 15.2.2
  • Ceph cluster: 3 server nodes and 1 client node (the client shares a physical node with a server)
  • Rook: 1.3.3
  • Open-CAS: 20.06.00.00000703
  • CockroachDB: 20.1.1
  • HammerDB: 3.3
  • MySQL: 8.0.19
  • CosBench: 0.4.2

Scenarios

  • Per node:
      OSD number: 6 (SSD:OSD = 1:2)
      RBD size: 200 GB
  • Baseline:
      BlueStore data partition: 400 GB
  • Reference 1:
      RocksDB + WAL partition: 6 * 80 GB Intel Optane persistent memory (manually created)
  • Reference 2:
      Cache device: 6 * 20 GB Intel Optane persistent memory (PMem-CSI)

Benchmarks

  • CockroachDB:
      Modes: Bank, MOVR, and YCSB (with drop option)
      Test duration: 300 seconds
  • HammerDB:
      Workload: TPC-C with VU=8
      Test duration: 300 seconds
  • CosBench:
      Workload: synthetic, 70% read and 30% write
      Containers: 100
      Objects: 500
      Workers: 18
      Object size: 4 KB
      Test duration: 1200 seconds

 

Rook-Ceph Storage Cluster

We deployed a three-node Kubernetes cluster, and the Rook operator deployed the containerized Ceph modules on each node.

Rook-Ceph also supports a dual-network deployment: the public network and the cluster network can be configured either through a ConfigMap override or in the CephCluster YAML definition. We adopted the first method, sketched below.
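As a hedged illustration of the first method, a ConfigMap named rook-config-override in the rook-ceph namespace injects ceph.conf settings into the cluster; the subnets below are placeholders, not the addresses used in our lab.

# Sketch of Rook's ceph.conf override; the subnets are placeholder values.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override   # name and namespace expected by the Rook operator
  namespace: rook-ceph
data:
  config: |
    [global]
    public network = 192.168.10.0/24
    cluster network = 192.168.20.0/24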

The benchmarks are deployed in the form of StatefulSet Pods:

  • For block storage, applications define a PVC (PersistentVolumeClaim) and request resources through Ceph-CSI (see the PVC sketch after this list).
  • For object storage, applications access objects from the Ceph Object Gateway through the S3/Swift API.
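For example, a block-storage benchmark Pod can claim an RBD volume with a PVC similar to the following sketch; the claim name is illustrative, and the size simply mirrors the 200 GB RBD size listed in the Scenarios section.

# Illustrative PVC for a benchmark StatefulSet; the claim name and size are examples.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bench-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block   # RBD volumes provisioned by Ceph-CSI
  resources:
    requests:
      storage: 200Gi                  # mirrors the 200 GB RBD size in the Scenarios section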

It should be noted that there are two ways to configure the OSDs of the Rook-Ceph cluster:

  • Directly define the device path on each node. This method is simple and suitable for lightweight deployment, but it has a scalability issue and does not support public cloud storage.
  • Using the PVC method: first define a PVC template, then specify the StorageClass required by each OSD. This approach is scalable and supports various local and remote storage types.

In the CephCluster.yaml configuration file, we defined the BlueStore partitions: data, block.db, and block.wal. Each partition corresponds to a StorageClass. In this case, we created multiple PVs (persistent volumes) through the built-in local-storage type, and each claim is bound to a PV of the appropriate size. Normal provisioning requires two conditions:

  • The CSI services, including plugins and drivers, are running normally.
  • The corresponding StorageClass has been created: rook-ceph-block, where the provisioner configuration item is rook-ceph.rbd.csi.ceph.com (a minimal sketch of this StorageClass follows).
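The following is a minimal sketch of such a StorageClass, based on the standard Rook example; the pool name and secret references are the Rook defaults and may differ in a customized deployment.

# Sketch of the rook-ceph-block StorageClass (Rook defaults; pool and secret names may differ).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph                  # namespace where the Rook cluster runs
  pool: replicapool                     # RBD pool backing the volumes (example name)
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete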

Performance Data and Comparative Analysis

Test Report

Workload | Unit | Baseline | Reference 1 | Reference 2
Cockroach Bank | Ops/sec | 504.4 | 648.8 | 1494.5
Cockroach Bank | Latency (ms) | 437.8 | 340.9 | 149.7
Cockroach MOVR | Ops/sec | 233.2 | 248.3 | 258.8
Cockroach MOVR | Latency (ms) | 4.3 | 4.0 | 3.9
Cockroach YCSB | Ops/sec | 50766.5 | 51287.7 | 57635.6
Cockroach YCSB | Latency (ms) | 4.4 | 4.4 | 3.9
HammerDB MySQL TPC-C (VU=8) | TPM | 7718 | 8535 | 10578
HammerDB MySQL TPC-C (VU=8) | NOPM | 2544 | 2844 | 3525
CosBench synthetic, 4 KB objects (70% read / 30% write) | Read response time (ms) | 4.27 | 4.36 | 4.35
CosBench synthetic, 4 KB objects (70% read / 30% write) | Read throughput (op/s) | 2589.33 | 2813.52 | 3198.58
CosBench synthetic, 4 KB objects (70% read / 30% write) | Write response time (ms) | 18.8 | 16.25 | 13.07
CosBench synthetic, 4 KB objects (70% read / 30% write) | Write throughput (op/s) | 1107.46 | 1207.04 | 1371.9

 

 

From the test report, we can see that Reference-2 performs best, Reference-1 is second, and the Baseline is the slowest, which is in line with our expectations. (The three configurations are described in the Scenario and Framework section.)

In terms of throughput, Reference-2 achieves up to 2.96 times that of the Baseline, and average latency can be reduced to as little as 34.2% of the Baseline.

The following two graphs show the raw data and the normalized results as histograms.

 

 

Conclusion

In this whitepaper, we built a Rook-Ceph storage cluster on a container platform, using Intel Optane persistent memory to accelerate the metadata of the Ceph storage nodes and the Open-CAS cache framework to accelerate the data partition. We introduced benchmark tools that simulate real production environments, including CockroachDB, HammerDB TPC-C, and CosBench, to evaluate the performance of the Ceph block storage and object storage services. In our evaluation, we found different levels of improvement under different workload models: maximum throughput increased to 2.96 times the baseline, and average latency was reduced by up to 65.8%.

Kubernetes is becoming the standard infrastructure for the cloud native application industry. Comprehensive benchmarking tools are an important part of simulating real environments and are an effective supplement to conventional micro-benchmarks when evaluating the performance of a cloud native environment. This whitepaper presents an effective and comprehensive performance evaluation for specific scenarios, especially online data analysis, order systems, and other real applications, and the approach can be widely applied.

References

Rook: https://rook.io/, https://github.com/rook/rook

CockroachDB workload: https://www.cockroachlabs.com/docs/stable/cockroach-workload.html

TPC: http://www.tpc.org/

HammerDB: https://hammerdb.com/

CosBench: https://github.com/intel-cloud/cosbench

Appendix

Benchmarks and Workloads

Storage Workload

For a storage system, there are several criteria for evaluating performance:

  • IOPS: The number of IO requests processed per second.
  • Throughput (MB/s): The amount of read and write data per second.
  • Response time/Waiting time: the time delay for processing a request, including the time from sending the request to receiving the response.

For storage media, operations include read, write, and some filesystem metadata operations. Reading and writing are divided into random access and sequential access, according to the access mode of address space. The logical address/physical address of random access is unpredictable.

The block size of a request also impacts performance. Generally speaking, small-block requests are evaluated mainly by IOPS or response time, while large-block requests are evaluated by throughput.

Another factor is queue depth, the number of I/O requests waiting in the queue. IOPS does not grow indefinitely with queue depth: it levels off or declines after reaching a peak, while the waiting time of I/O requests keeps increasing as the queue depth increases.
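As a rough rule of thumb (a standard queueing relation, not a result from this test), Little's Law ties these three quantities together:

\text{queue depth} \;\approx\; \text{IOPS} \times \text{average latency (in seconds)}

For example, sustaining 10,000 IOPS at an average latency of 1 ms implies roughly 10 outstanding requests; raising the queue depth beyond what the device can absorb mainly inflates latency rather than IOPS.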

According to the request block size and access mode, the storage workload can be divided into several typical types:

  • OLTP (online transaction processing): a transactional system consisting mainly of small transactions and small queries. The load is dominated by small random I/O (random access accounting for about 80%), with a mix of roughly 25% reads and 75% writes; response time is the main evaluation metric.
  • OLAP (online analytical processing): supports complex analysis operations, focuses on decision support, and provides intuitive, easy-to-understand query results. The load consists mainly of large-block reads, evaluated by throughput.
  • Video data collection: the load consists mainly of large sequential writes.
  • Virtual desktop infrastructure: the load consists mainly of small random reads and writes (50% reads and 50% writes).
  • Internet of things: the load consists mainly of sequential writes with mixed block sizes.

The real-world workload is much more complicated than these typical models. We choose complex benchmarks to provide a more realistic simulation of the workload in the production environment and to guide us in the evaluation and transformation of the system.

Comprehensive workload -- CockroachDB

CockroachDB is a distributed SQL database that is scalable, replicates across regions, and supports transactions, high availability, and high consistency. It comes with a built-in load generator for simulating different types of client workloads. A few common workloads are listed below:

  • Bank (Bank): Models a set of accounts using a currency balance table.
  • KV (Key-Value): Reads and writes key-value pairs uniformly and randomly across the cluster.
  • MovR: Simulates the workload of the MovR sample application, which is a fictional vehicle-sharing company designed to demonstrate CockroachDB functionality. The dataset for this workload contains six database tables and the simulated operations are scaled as shown in the following figure.

 

 

  • TPCC: Simulates online transaction processing workloads using a rich schema of multiple tables.
  • YCSB: Simulates large-scale read-, write-, or scan-based key-value workloads using custom functions.

The set of built-in CockroachDB workloads continues to grow; they are simple to operate, rich in patterns, and one of the best choices for quickly running comprehensive, complex benchmark validation.

OLTP and OLAP -- HammerDB

HammerDB is a benchmarking and load testing suite that supports the currently popular databases Oracle, SQL Server, IBM DB2, MySQL, MariaDB, PostgreSQL, and Redis. It supports TPC-C and TPC-H based workloads. The TPC-C/TPC-H implementations in HammerDB differ slightly from the standard TPC-C/TPC-H test datasets, making them more flexible and lightweight.

The TPC (Transaction Processing Performance Council) is responsible for defining transaction processing and database performance benchmarks, such as TPC-C, TPC-H, and TPC-W, and for publishing objective performance data based on these benchmarks.

TPC-C is the industry standard for measuring OLTP systems and is recognized as the most authoritative and complex online transaction processing benchmark. It tests a wide range of database performance by simulating warehouse and order management systems, including queries, updates, and queued small-batch operations. The TPC-C benchmark reports throughput as the number of simulated order-entry business transactions processed per minute (tpmC).

The TPC-C model is a large wholesale company with sales districts spread across multiple regions and managed through warehouses. As the business expands, the company adds new warehouses; each warehouse supplies 10 districts, each district serves 3,000 customers, and each warehouse maintains inventory records for 100,000 items.

It contains a mixture of five different types and complexities of concurrent transactions that can be executed online or queued for delayed execution. These five types of transactions include:

  • New Order: Customers enter a new order transaction, accounting for 45% of transactions.
  • Payments: Account balances are updated to reflect their payment status, accounting for 43%.
  • Delivery: Bulk transactions, accounting for 4%.
  • Order Status: Query the status of the customer's most recent transaction, accounting for 4%.
  • Inventory Level: Check the inventory status of the warehouse for timely replenishment, accounting for 4%.

TPC-H is a decision support benchmark, also called an OLAP benchmark, which evolved from and replaced TPC-D. TPC-H simulates database operations in decision support systems and tests the response time of complex queries, using the number of queries executed per hour (QphH@Size) as its metric.

The TPC-H model defines eight tables, 22 complex queries (SELECT), and two update operations (program segments with INSERT and DELETE). The data volume of the tested database has eight scale levels, from 1 GB to 10,000 GB, for users to choose from. During a test, the 22 queries are randomly composed into query streams and the two update operations into an update stream; query streams and update streams access the database concurrently, and the number of query streams increases with the data volume.

In general, there are two kinds of database applications: online transaction processing, represented by TPC-C, and data mining/online analytical processing, represented by TPC-H. TPC-C results are a useful reference for database systems; banking, securities, tax filing systems, e-commerce websites, and telecommunication businesses are typical online transaction processing applications. TPC-H results are aimed at decision analysis and have general commercial and practical significance; they are widely used in bank credit and credit card analysis, telecom operation analysis, tax analysis, and tobacco industry decision analysis.

Benchmark Test Profile

HammerDB

The HammerDB benchmark consists of two parts: the HammerDB test suite and the database under test (in this case, MySQL). A TCL script was defined to automate the configuration and execution of the corresponding workloads through the HammerDB command-line interface.

#!/usr/bin/tclsh

# Wait until all virtual users complete or the time limit (in seconds) expires.
proc runtimer { seconds } {
    set x 0
    set timerstop 0
    while {!$timerstop} {
        incr x
        after 1000
        if { ![ expr {$x % 60} ] } {
            set y [ expr $x / 60 ]
            puts "Timer: $y minutes elapsed"
        }
        update
        if { [ vucomplete ] || $x eq $seconds } { set timerstop 1 }
    }
    return
}

puts "SETTING CONFIGURATION"
# Select the MySQL TPC-C workload and the timed driver script.
dbset db mysql
dbset bm tpc-c
diset tpcc mysql_driver timed
diset tpcc mysql_rampup 0
diset tpcc mysql_duration 1
vuset logtotemp 1
loadscript

puts "SEQUENCE STARTED"
# Run the workload with an increasing number of virtual users (VU).
foreach z { 1 2 4 8 16 32 64 } {
    puts "$z VU TEST"
    vuset vu $z
    vucreate
    vurun
    runtimer 300
    vudestroy
    after 5000
}
puts "TEST SEQUENCE COMPLETE"

 

CosBench

The CosBench benchmark is based on the Rook-Ceph RADOS Gateway; for details on setting up the gateway under Rook, refer to: https://rook.io/docs/rook/v1.3/ceph-object.html
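For reference, the object store behind the gateway is declared through a CephObjectStore resource similar to the sketch below, adapted from the Rook v1.3 documentation; the store name, replication factor, and port are illustrative rather than the exact values used in the test.

# Sketch of a CephObjectStore that exposes the RADOS Gateway (RGW) consumed by CosBench.
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store        # example store name
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPool:
    replicated:
      size: 3
  gateway:
    type: s3            # S3-compatible endpoint
    port: 80
    instances: 1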

CosBench supports a web UI, and a workload configuration file in XML format can be uploaded directly to load and run automatically. We used CosBench configuration files to define 4 KB and 1 MB object sizes; the 4 KB configuration file is shown below:

<?xml version="1.0" encoding="UTF-8" ?>
<workload name="Synthetic-4K" description="4K Synthetic workload">
  <auth type="none" config="username=admin;password=nexenta1;auth_url=http://10.105.110.57"/>
  <storage type="s3" config="accesskey=XVOGCC5XL3M70HUBL3NR;secretkey=Mb7BFMZE5HEz7RNiO9WMtqwYZJPbEN1LH050SqQn;endpoint=http://10.105.110.57:80;timeout=600000"/>
  <workflow>
    <workstage name="init">
      <work type="init" workers="1" config="containers=r(1,100)" />
    </workstage>
    <workstage name="prepare">
      <work type="prepare" workers="10" config="containers=r(1,100);objects=r(1,500);sizes=c(4)KB" />
    </workstage>
    <workstage name="read/write">
      <work name="R/W" workers="32" runtime="1200">
        <operation type="read" ratio="70" config="containers=u(1,100);objects=u(1,250)" />
        <operation type="write" ratio="30" config="containers=u(1,100);objects=u(251,500);sizes=c(4)KB" />
      </work>
    </workstage>
    <workstage name="cleanup">
      <work type="cleanup" workers="18" config="containers=r(1,100);objects=r(1,500)" />
    </workstage>
    <workstage name="dispose">
      <work type="dispose" workers="18" config="containers=r(1,100)" />
    </workstage>
  </workflow>
</workload>

Rook-Ceph OSD Configuration with PVC

The sample volumeClaimTemplates configuration is shown below, followed by a sketch of a matching local PersistentVolume.

volumeClaimTemplates:
- metadata:
    name: data
  spec:
    resources:
      requests:
        storage: 64Gi
    # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, gp2)
    storageClassName: local-ssd
    volumeMode: Block
    accessModes:
      - ReadWriteOnce
- metadata:
    name: metadata
  spec:
    resources:
      requests:
        storage: 5Gi
    # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, io1)
    storageClassName: local-pmem
    volumeMode: Block
    accessModes:
      - ReadWriteOnce
- metadata:
    name: wal
  spec:
    resources:
      requests:
        storage: 5Gi
    # IMPORTANT: Change the storage class depending on your environment (e.g. local-storage, io1)
    storageClassName: local-pmem
    volumeMode: Block
    accessModes:
      - ReadWriteOnce
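When the local-ssd and local-pmem classes above are backed by the built-in local-storage mechanism rather than a dynamic provisioner, matching PersistentVolumes must be created ahead of time. The sketch below shows one such local PV; the device path, capacity, and node name are placeholders, not the lab values.

# Example local PV for one OSD data device; path, capacity, and hostname are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-ssd-osd0
spec:
  capacity:
    storage: 400Gi              # matches the BlueStore data partition size in the Scenarios section
  volumeMode: Block             # raw block device consumed by the OSD
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /dev/sdb              # placeholder device path
  nodeAffinity:                 # local volumes must be pinned to a node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - server-node-1 # placeholder node name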