
The Opportunities and Gaps of Data Plane Development Kit-Backed Crimson-Ceph*-OSD Solution

By Yingxin Cheng on Dec 07, 2020

Background

Crimson [1] is the code name of Crimson-Ceph-OSD, the next-generation Ceph-OSD. The project goal is to achieve optimal performance on modern computer hardware, including contemporary multi-core CPUs with NUMA and fast network and storage devices. With the increasing I/O capabilities of these devices, the CPU is becoming the new bottleneck. Traditional software built on multiple threads, shared data, and lock-based synchronization does not scale well on multi-core CPUs. This is also a thorny problem for Ceph-OSD, which has a history of more than 10 years and is built on the traditional multi-threaded software architecture.

To address this issue, the Ceph community is turning to a promising framework called Seastar [2]. Seastar provides an advanced new model for concurrent C++ applications, with a shared-nothing design, explicit cross-core message passing, and user-space task scheduling. It aims to make it easy for developers to build solutions that scale out with CPU cores. Seastar already has proven applications [3] such as Pedis, ScyllaDB*, Seastar-HTTPD, and Seastar-Memcached, which convinced the Ceph community to rewrite Ceph-OSD on the new framework.

The effort is huge; however, some components have already been refactored and are functional, and they can be used for early stand-alone evaluations. Crimson-messenger [4] has been successfully rewritten with Seastar. It already implements the Ceph Messenger V2 protocol and supports further development of Crimson-OSD. For simplicity, the current version is developed and tested entirely against Seastar's kernel-based interface, called the POSIX stack [5], which builds on the standard Linux* socket APIs. Seastar claims that its applications can also run on a user-space stack without any code change: the Native TCP/IP Stack [6], which is implemented on top of the Data Plane Development Kit (DPDK) with Seastar's own TCP/IP implementation.

This article presents an early evaluation of the Seastar native stack (with DPDK) from a performance point of view, and summarizes the major gaps and the efforts still required before Crimson-OSD can adopt it.

Test environment

Available servers:

  • sceph1: Ubuntu* 18.04 (kernel 4.15.0-55); 2x CPU (22 cores each); 2x NIC A (10GbE)
  • sceph2: Ubuntu 19.04 (kernel 5.4.0-rc5+); 2x CPU (22 cores each); 2x NIC A (10GbE); 4x NIC B (10GbE)
  • sceph3: Ubuntu 18.04 (kernel 4.15.0-66); 2x CPU (22 cores each); 2x NIC A (10GbE); 4x NIC B (10GbE)

 

Environment setup:

  • Configure the BIOS, kernel parameters/drivers, hugepages (4G), and NIC binding, following the official DPDK setup guide [7].
  • Build Ceph using the make target “perf_crimson_msgr” with the compile options “WITH_SEASTAR=ON, Seastar_DPDK=ON, WITH_TESTS=OFF” (a combined sketch of both steps follows this list).
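Below is a minimal sketch of these two steps, assuming 2 MB hugepages, the vfio-pci driver, and Ceph's do_cmake.sh wrapper; the PCI address 0000:3b:00.0 is a placeholder, and the exact commands may differ per distribution (the authoritative steps are in the DPDK guide [7]).

# Reserve ~4G of 2 MB hugepages and mount hugetlbfs
$ echo 2048 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
$ sudo mkdir -p /mnt/huge
$ sudo mount -t hugetlbfs nodev /mnt/huge
# Bind the test NIC to a DPDK-compatible driver (placeholder PCI address)
$ sudo modprobe vfio-pci
$ sudo dpdk-devbind.py --bind=vfio-pci 0000:3b:00.0
# Configure and build the perf_crimson_msgr target
$ ./do_cmake.sh -DWITH_SEASTAR=ON -DSeastar_DPDK=ON -DWITH_TESTS=OFF
$ cd build && make perf_crimson_msgr    # or ninja, depending on the generator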

 

OS, NIC, and driver combinations, with results:

  1. Ubuntu 18.04 + NIC A + vfio/uio (perf_crimson_msgr connected)
  2. Ubuntu 19.04 + NIC A + vfio/uio (perf_crimson_msgr cannot connect)
  3. Ubuntu 19.04 + NIC B + vfio/uio (perf_crimson_msgr cannot connect)
  4. Ubuntu 18.04 + NIC B + vfio/uio (perf_crimson_msgr cannot connect)

 

The test program perf_crimson_msgr failed with combinations 2-4 above at various steps, such as during NIC initialization or simply because the network was not reachable. We did not dig into the root causes and focused on the only working combination, 1, for the following performance tests with sceph1 and sceph3.
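For reference, the driver binding used in each combination can be inspected and switched roughly as follows (a hedged sketch; the PCI address is a placeholder):

# Show which driver each NIC is currently bound to
$ sudo dpdk-devbind.py --status
# Rebind a NIC to uio instead of vfio, if needed
$ sudo modprobe uio_pci_generic
$ sudo dpdk-devbind.py --bind=uio_pci_generic 0000:3b:00.0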

Native stack vs POSIX stack performance

Test scenarios:

  1. Local (multi-thread, sceph1): the process starts and the devices come up, but client and server cannot connect;
  2. Local (multi-process, sceph1): both client and server would be DPDK primary processes, which cannot coexist (error: "Cannot create lock on '/var/run/dpdk/rte/config'");
  3. Remote (sceph1 <-> sceph3): working, but with bugs (see the section Native stack caveats below).

 

The only working test scenario is 3, so we ran the following performance tests remotely, with the client running on sceph1 and the server running on sceph3.

 

Test commands (perf_crimson_msgr [8]):

  • Client (sceph1):

$ perf_crimson_msgr --dhcp=0 --network-stack=native --dpdk-pmd --poll-mode
   --host-ipv4-addr=192.168.122.3 --addr=v2:192.168.122.2:9010 --mode=1
   --depth=<depth> --jobs=<jobs> -c <jobs+1>

  • Server (sceph3):

$ perf_crimson_msgr --dhcp=0 --network-stack=native --dpdk-pmd --poll-mode
   --host-ipv4-addr=192.168.122.2 --addr=v2:192.168.122.2:9010 --mode=2
   --sbs=4096 -c 1

  • Command parameters:
      • Client send block size = 4KiB;
      • Server reply block size = 4KiB;
      • Timing: 5s warm-up, 15s runtime;
      • Seastar poll-mode is always enabled;
      • The number of client cores equals the number of client jobs;
      • The number of server cores is always 1;
      • Each job/messenger runs on a dedicated core/shard.
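As a concrete example, substituting <depth>=512 and <jobs>=1 into the client command above gives the invocation for the 1-job, total-depth-512 data point of Test case 1:

$ perf_crimson_msgr --dhcp=0 --network-stack=native --dpdk-pmd --poll-mode
   --host-ipv4-addr=192.168.122.3 --addr=v2:192.168.122.2:9010 --mode=1
   --depth=512 --jobs=1 -c 2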

 

 

Test case 1: performance by 1-8 client jobs; total depth is always 512

Client jobs             1-job         2-jobs        4-jobs        8-jobs
IOPS (native)           138382        165476        192810        207301
IOPS (posix)            134409        133641        130181        123784
Latency-ms (native)     3.66408       3.07243       2.64474       2.45295
Latency-ms (posix)      3.48638       3.63574       3.77127       4.05033
L=λW (native)           507.042719    508.413427    509.932319    508.498988
L=λW (posix)            468.600849    485.883929    490.9477      501.366049
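The L=λW rows are a Little's Law sanity check: the average number of in-flight requests L should equal the throughput λ multiplied by the latency W. For example, for the 1-job native run, L = 138382 ops/s × 0.00366408 s ≈ 507, close to the configured total depth of 512.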

 

 

Highlights:

  1. The native stack scales across cores: performance improves as more cores/jobs are added.
  2. The POSIX stack does not scale: performance decreases as more cores/jobs are added.

 

 

Test case 2: performance by 1-1024 depth; total client job is always 1

depth                   1          2          4          8          16         32         64         256        512        1024
IOPS (native)           2001.08    4001.63    35869.6    17106.1    94790.5    169963     191042     136808     138382     170527
IOPS (posix)            3874.88    7736.05    13748.7    29309.7    52116.7    80300.1    107483     123566     134409     131852
Latency-ms (native)     0.49865    0.49881    0.10955    0.45826    0.15910    0.17623    0.31433    1.8518     3.66408    5.99559
Latency-ms (posix)      0.25672    0.25723    0.28804    0.26254    0.28470    0.36030    0.49671    1.80443    3.48638    7.48321
L=λW (native)           0.99784    1.99608    3.92972    7.83919    15.0814    29.9534    60.0511    253.341    507.042    1022.40
L=λW (posix)            0.99477    1.98999    3.96025    7.69499    14.8380    28.9326    53.3880    222.966    468.600    986.676

 

Highlights:

  • The native stack generally performs better than the POSIX stack.
  • The native stack's performance is unexpectedly high at some depth settings (32, 64), and its per-second IOPS/throughput is not stable at runtime; see [9].
  • The POSIX stack shows no such spikes and no unstable IOPS under the same test configurations.

 

 

CPU metrics collected on the server side (1 core, server poll-mode enabled)
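The per-operation cycle counts, IPC, and branch-miss ratios below come from hardware performance counters sampled on the server process. A minimal sketch of how such counters can be collected with perf is shown here; the exact events, window, and post-processing we used may differ, and <server-pid> is a placeholder:

# Attach to the running perf_crimson_msgr server for the 15 s measurement window
$ sudo perf stat -e cycles,instructions,branches,branch-misses -p <server-pid> -- sleep 15
# IPC = instructions / cycles; branch-miss = branch-misses / branches
# cycles-per-op = cycles / operations completed (as reported by the client)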

  1. Test case 1: performance by 1-8 client jobs, total depth is always 512

client-jobs (depth=512)    1           2           4           8
cycles-per-op (Native)     21177.35    18641.06    16925.36    16816.08
cycles-per-op (POSIX)      24446.85    23149.13    22601.01    23277.84
IPC (Native)               1.83        1.78        1.69        1.65
IPC (POSIX)                1.18        1.22        1.24        1.22
branch-miss (Native)       0.35%       0.33%       0.38%       0.41%
branch-miss (POSIX)        0.75%       0.74%       0.70%       0.70%

 

 

  2. Test case 2: performance by 64-1024 depth, total client job is always 1

Depth (client-jobs=1)      64          256         512         1024
cycles-per-op (Native)     17634.49    23081.26    21177.35    19117.91
cycles-per-op (POSIX)      32774.89    26057.14    24446.85    23653.66
IPC (Native)               1.64        1.87        1.83        1.77
IPC (POSIX)                0.99        1.11        1.18        1.2
branch-miss (Native)       0.37%       0.25%       0.35%       0.35%
branch-miss (POSIX)        0.80%       0.75%       0.75%       0.75%

 

 

Highlights:

  • For each operation (a send/reply round trip), the native stack spends fewer CPU cycles on average than the POSIX stack, implying it may require less computation per operation.
  • IPC (instructions per cycle) is much higher with the native stack, indicating more efficient CPU usage.
  • The branch-miss rate is much lower with the native stack, resulting in better CPU pipelining and contributing to the better IPC results.

Native stack caveats

In this section, we present the gaps and issues identified during our experiments and against the requirements of Crimson-OSD.

 

Feature gaps

  1. For multiple OSD instances on the same host, Seastar cannot establish a loopback connection with the native stack.
  2. We cannot start multiple OSD instances in different processes, because the Seastar native stack currently doesn't support the DPDK secondary process mode [10]. We may need to implement packet forwarding across DPDK processes in the native stack or use the DPDK IPC API; alternatively, we could start multiple logical OSD instances in the same process.
  3. For multi-core OSD, the hack of moving sockets across cores in the current Crimson-Messenger no longer works with the native stack (see https://github.com/ceph/ceph/blob/4589fff6bff8dadd7347fccfc62ed4a49e2b101d/src/crimson/net/SocketMessenger.cc#L172-L173). The reason is that the Seastar native stack has its own lower-level (L2/L3/L4) network implementation, and socket placement is determined by hardware RSS (Receive Side Scaling) and TCB (Transmission Control Block) placement. This means we cannot simply move a native ConnectedSocket to another core; we can only introduce cross-core communication or adjust our connect/bind/accept strategy.
  4. For OSD multiple-network support (public network and cluster network), the Seastar native stack implementation can only support one NIC (see https://github.com/scylladb/seastar/blob/453d531b09f9f385719f6d87646294bea71f0ea2/src/net/native-stack.cc#L82-L84), while an OSD needs at least two NICs.
  5. For a zero-copy native stack, we need to enable HugetlbfsMemBackend for dpdk_qp. Currently, data is copied on both the RX and TX I/O paths: send (https://github.com/scylladb/seastar/blob/33406cfe146f19084c96b65c6fe3097e12ca3242/src/net/dpdk.cc#L1249-L1261) and receive (https://github.com/scylladb/seastar/blob/33406cfe146f19084c96b65c6fe3097e12ca3242/src/net/dpdk.cc#L1983-L2084).

 

Stability issues

  1. On the messenger side, perf_crimson_msgr frequently warns about “exceptional futures ignored”, which doesn't happen with the POSIX stack. There are some FIXMEs in the native stack implementation where a returned future is ignored, which could explain why exceptions are mistakenly dropped in our case.
  2. perf_crimson_msgr performance with the native stack is unstable (see the IOPS and MB/s columns in [9]).
  3. perf_crimson_msgr fails when run with too many jobs or too high an I/O depth, which doesn't happen with the POSIX stack.

 

 

References

[1] https://docs.ceph.com/docs/master/dev/crimson/

[2] http://seastar.io/

[3] http://seastar.io/seastar-applications/

[4] https://github.com/ceph/ceph/tree/master/src/crimson/net

[5] http://seastar.io/networking/

[6] https://github.com/scylladb/seastar/blob/master/doc/native-stack.md

[7] https://doc.dpdk.org/guides/linux_gsg/index.html

[8] https://github.com/ceph/ceph/blob/master/src/tools/crimson/perf_crimson_msgr.cc

[9]

<network=Native, jobs=1, depth=4>

    sec depth    IOPS     MB/s        lat(ms)

1.00003     4 50544.4 197.439  0.0774906

1.00004     4 32345.7 126.351  0.121795

1.00002     4 30840.4 120.47    0.127731

1.00003     4 49668.7 194.019  0.0786691

1.00002     4 23040.6 90.0022  0.17081

1.00002     4 40320.3 157.501  0.0972716

1.00003     4 26096.3 101.939  0.151315

1.00002     4 21366.6 83.4632  0.184859

1.00009     4 41436.4 161.861  0.0947345

1.00005     4 15438.2 60.3055  0.255225

-------------- summary --------------

10.0004     - 33109.7 129.335   0.118771

 

1.00005     4 36371.3 142.075  0.108006

1.00002     4 23791.5 92.9357  0.165738

1.00002     4 47706.1 186.352  0.0820904

1.00002     4 48898.1 191.008  0.0801385

1.00002     4 50180.1 196.016  0.0780173

-------------- summary --------------

15.0005     - 35869.6 140.116  0.109556

 

<network=POSIX, jobs=1, depth=4>

    sec depth    IOPS    MB/s lat(ms)

1.00008     4 14575.9  56.937    0.271158

1.00006     4 13738.2  53.665    0.289145

1.00003     4 14073.6  54.9748  0.28194

1.00004     4 13711.4  53.5603  0.287719

1.00007     4 13802     53.9142  0.286981

1.00002     4 13614.7  53.1824  0.291373

1.00003     4 13629.6  53.2405  0.290671

1.00002     4 13307.7  51.9833  0.298269

1.00004     4 13889.4  54.2556  0.285798

1.00004     4 13325.5  52.0527  0.296834

-------------- summary --------------

10.0004     -  13766.8  53.7765  0.287805

 

1.00003     4 12884.6  50.3303  0.307303

1.00003     4 13963.6  54.5453  0.283052

1.00003     4 13550.6  52.932    0.291797

1.00004     4 14305.5  55.8807  0.276985

1.00002     4 13856.7  54.1278  0.285309

-------------- summary --------------

15.0006     - 13748.7   53.7057  0.288046

[10] https://doc.dpdk.org/guides/prog_guide/multi_proc_support.html