
Linux Kernel Performance

Linux development evolves rapidly. The performance and scalability of the OS kernel have been a key part of its success. However, discussions on LKML (Linux Kernel Mailing List) have reported large performance regressions between kernel versions. These discussions underscore the need for a systematic and disciplined way to characterize, improve, and test Linux kernel performance. Our goal is to work with the Linux community so that kernel performance improves consistently across releases and degradations are avoided. This site gives community members better information about what 0-Day and LKP (Linux Kernel Performance) are doing to preserve the performance integrity of the kernel.

0-DAY CI LINUX KERNEL PERFORMANCE REPORT (V4.18)

BY Philip Li ON Sep 12, 2018
  1. Introduction

0-Day CI is an automated Linux kernel test service that provides comprehensive test coverage of the Linux kernel. It covers kernel build, static analysis, boot, functional, performance and power tests. This report presents recent observations of kernel performance on IA platforms, based on test results from the 0-Day CI service. It is structured as follows:

  • Section 2, merged regressions and improvements in v4.18 release candidates

  • Section 3, regressions and improvements captured by shift-left testing on developers’ and maintainers’ trees during the v4.18 release cycle

  • Section 4, test machine list

 

  2. Linux Kernel v4.18 Release

The v4.18 release of the Linux kernel was on Aug 12th. Some of the significant features in this release include unprivileged filesystem mounts, restartable sequences, a new zero-copy TCP receive API, support for active state management for power domains, the AF_XDP mechanism for high-performance networking, the core bpfilter packet filter implementation, and more. 0-Day CI monitored the release closely to track the performance status on IA platforms. 0-Day CI observed 7 regressions and 4 improvements during the feature development phase for v4.18. This section gives more detailed information, together with the correlated patches that led to the results. Note that the assessment is limited by the test coverage we have now. The full list is given in the Observation Summary section.

  1. will-it-scale.per_process_ops

will-it-scale runs a test case from 1 to n parallel copies to see if the test case will scale. It builds both a process-based and a thread-based version of each test in order to see any differences between the two.

  1. scenario: test page fault in process mode


 

Commit b12e358328 was reported to give a 57.6% improvement of will-it-scale.per_process_ops compared to v4.17-rc2. It was merged to mainline at v4.18-rc1. The patch implements an event rate-limit mechanism that limits the event generation frequency in cgroup.

Commit 230671533d was reported to give a +135.1% improvement of will-it-scale.per_process_ops compared to v4.17. It was merged to mainline at v4.18-rc1. The patch addresses an issue in the current memory.low semantics that makes memory.low hard to use in a hierarchy where some leaf memory cgroups are more valuable than others.

The two commits have restored the performance to the pre-v4.16 level.
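As a reading aid for the percentages quoted in this report, the sketch below assumes that a change is expressed as the relative difference between the tested kernel and the base kernel. That formula is an assumption about how the figures are presented, not code taken from LKP, and the operation counts are hypothetical values chosen only to reproduce +135.1%.

```python
def percent_change(base: float, new: float) -> float:
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


def gain_needed_to_recover(drop_percent: float) -> float:
    """Gain (in percent) that exactly undoes an earlier drop of `drop_percent` percent."""
    remaining = 1.0 - drop_percent / 100.0
    return (1.0 / remaining - 1.0) * 100.0


# Hypothetical per-process operation counts, not measured data.
base_ops = 100_000
new_ops = 235_100

print(round(percent_change(base_ops, new_ops), 1))  # 135.1
print(round(gain_needed_to_recover(50.0), 1))       # 100.0
```

The second function illustrates why restoring an earlier level can show up as a large percentage: a drop is measured against the higher old value, while the recovery is measured against the lower new one.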

 

Correlated commits

b12e358328

cgroup: Limit event generation frequency

branch

linus/master

report

[LKP] [lkp-robot] [cgroup] b12e358328: will-it-scale.per_process_ops 57.6% improvement

status

merged at v4.18-rc1

 

230671533d

mm: memory.low hierarchical behavior

branch

linus/master

report

[LKP] [lkp-robot] [mm] 230671533d: will-it-scale.per_process_ops +135.1% improvement

status

merged at v4.18-rc1
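The report subjects listed under "Correlated commits" follow a regular pattern: subsystem tag, abbreviated commit id, test indicator, percentage, and classification. The following is a minimal sketch of how such a subject could be split into fields. The regular expression and the `parse_subject` helper are illustrative assumptions based only on the subjects quoted in this report, not part of any LKP tool.

```python
import re

# Pattern inferred from the subjects quoted in this report (an assumption).
SUBJECT = re.compile(
    r"\[LKP\] \[lkp-robot\] \[(?P<subsystem>[^\]]+)\] "
    r"(?P<commit>[0-9a-f]+): (?P<metric>\S+) "
    r"(?P<change>[+-]?\d+(?:\.\d+)?)% (?P<kind>improvement|regression)"
)


def parse_subject(subject: str) -> dict:
    """Split a report subject into subsystem, commit, metric, change and kind."""
    match = SUBJECT.match(subject)
    if match is None:
        raise ValueError(f"unrecognized report subject: {subject!r}")
    fields = match.groupdict()
    fields["change"] = float(fields["change"])  # percentage as a number
    return fields


example = "[LKP] [lkp-robot] [cgroup] b12e358328: will-it-scale.per_process_ops 57.6% improvement"
print(parse_subject(example))
```

Trailing annotations such as "(FYI)" are ignored because the pattern is only matched from the start of the subject.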

 

  1. netperf.Throughput_total_tps

Netperf is a benchmark that can be used to measure the performance of many different types of networking. It provides tests for both unidirectional throughput, and end-to-end latency.
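The TCP CRR (connect/request/response) scenario used in this report counts transactions in which every transaction opens a new TCP connection, exchanges one request and one response, and closes the connection again. The sketch below imitates that transaction shape over the loopback interface to make the metric concrete. It is not netperf; the helper names, the 1-byte messages, and the default transaction count are all assumptions made for illustration.

```python
import socket
import threading
import time


def serve(listener: socket.socket, transactions: int) -> None:
    """Accept `transactions` connections and answer one 1-byte request on each."""
    for _ in range(transactions):
        conn, _addr = listener.accept()
        with conn:
            if conn.recv(1):
                conn.sendall(b"r")


def run_crr(transactions: int = 200) -> float:
    """Run connect/request/response transactions on localhost; return transactions per second."""
    listener = socket.create_server(("127.0.0.1", 0))  # port 0 picks a free port
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve, args=(listener, transactions), daemon=True)
    server.start()

    start = time.perf_counter()
    for _ in range(transactions):
        # A fresh connection per transaction is what distinguishes CRR from
        # a plain request/response test on one long-lived connection.
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(b"q")
            client.recv(1)
    elapsed = time.perf_counter() - start

    server.join()
    listener.close()
    return transactions / elapsed


print(f"{run_crr():.0f} transactions/s")
```

Because each transaction includes connection setup and teardown, a test of this shape spends much of its time in short kernel code paths, so the absolute number printed here is not comparable to netperf results.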


 

  1. scenario: ipv4 TCP CRR test in localhost

 

Commit 050e9baa9d was reported to cause a -5.6% regression (FYI) of netperf.Throughput_total_tps compared to v4.17. It was merged to mainline at v4.18-rc1.

The patch is from Linus Torvalds and renames the CC_STACKPROTECTOR[_STRONG] config variables. As Linus said, “the overhead of the strong stackprotector is a bit sad”. There appears to be no plan to fix the regression.

 

Correlated commits

050e9baa9d

Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables

branch

linus/master

report

[LKP] [lkp-robot] [Kbuild] 050e9baa9d: netperf.Throughput_total_tps -5.6% regression (FYI)

status

Merged at v4.18-rc1. There is no plan to fix the regression, which comes from the nasty code to load the stack canary from a cacheline that has absolutely nothing else in it.

 

  1. aim7.jobs-per-min

aim7 is a traditional UNIX system level benchmark suite that is used to test and measure the performance of a multiuser system.

  1. Scenario: btrfs-creat-clo

 

Commit f7e9e8fc79 was reported to give a 55.5% improvement of aim7.jobs-per-min compared to v4.17-rc7. It was merged to mainline at v4.18-rc1. The patch stops creating orphan items for truncate.

 

Correlated commits

f7e9e8fc79

Btrfs: stop creating orphan items for truncate

branch

linux-next/master

report

[LKP] [lkp-robot] [Btrfs] f7e9e8fc79: aim7.jobs-per-min 55.5% improvement

status

This is a fix patch; merged at v4.18-rc1

 

  1. Observation Summary

0-Day CI observed 7 regressions and 4 improvements during the feature development phase for v4.18, that is, in the time frame from v4.18-rc1 to the v4.18 release.

| Test Indicator | Report | Test Scenario | Development Base | Status |
|---|---|---|---|---|
| aim7.jobs-per-min | [Btrfs] f7e9e8fc79: 55.5% improvement | disk: 1BRD_48G; fs: btrfs; test: creat-clo; load: 4; cpufreq_governor: performance | v4.17-rc7 | merged at v4.18-rc1, fixed in v4.18-rc5 |
| aim7.jobs-per-min | [brd] 316ba5736c: -11.2% regression | disk: 1BRD_48G; fs: btrfs; test: disk_rw; load: 1500; cpufreq_governor: performance | v4.17-rc4 | merged at v4.18-rc1, the author is working on a fix |
| aim7.jobs-per-min | [MD] 5a409b4f56: -27.5% regression | disk: 4BRD_12G; md: RAID1; fs: xfs; test: sync_disk_rw; load: 600; cpufreq_governor: performance | v4.17-rc1 | merged at v4.18-rc1, the author is working on a fix |
| fio.latency_2ms% | [xfs] b027d4c97b: +7.1% regression | disk: 2pmem; fs: xfs; mount_option: dax; runtime: 200s; nr_task: 50%; time_based: tb; rw: randwrite; bs: 4k; ioengine: libaio; test_size: 200G; cpufreq_governor: performance | v4.17-rc4 | merged at v4.17-rc4, it’s not a regression |
| netperf.Throughput_total_tps | [Kbuild] 050e9baa9d: -5.6% regression (FYI) | ip: ipv4; runtime: 300s; cluster: cs-localhost; nr_threads: 1; cpufreq_governor: performance | v4.17 | merged at v4.18-rc1; the regression is not from the commit itself, and there is no plan to fix it because of the nasty code to load the stack canary from a cacheline that has absolutely nothing else in it |
| stress-ng.clone.ops_per_sec | [netns] a3498436b3: 77.5% improvement | nr_threads: 100%; testtime: 1s; class: scheduler; cpufreq_governor: performance | v4.17-rc2 | merged at v4.18-rc1 |
| stress-ng.pthread.ops_per_sec | [mm] f3c01d2f3a: -6.9% regression | nr_threads: 100%; testtime: 60s; class: scheduler; cpufreq_governor: performance | v4.17 | merged at v4.18-rc1, no response |
| will-it-scale.per_process_ops | [fs] 3deb642f0d: -8.8% regression | test: poll2; cpufreq_governor: performance | v4.17-rc3 | merged at v4.18-rc1, the author is working on a fix |
| will-it-scale.per_process_ops | [fs] 9965ed174e: -3.7% regression | test: poll2; cpufreq_governor: performance | v4.17-rc3 | merged at v4.18-rc1, no response |
| will-it-scale.per_process_ops | [cgroup] b12e358328: 57.6% improvement | nr_task: 100%; mode: process; test: page_fault3; cpufreq_governor: performance | v4.17-rc2 | merged at v4.18-rc1 |
| will-it-scale.per_process_ops | [mm] 230671533d: +135.1% improvement | nr_task: 100%; mode: process; test: page_fault3; cpufreq_governor: performance | v4.17 | merged at v4.18-rc1 |
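The counts stated for this summary can be re-derived from its rows. The sketch below copies the indicator, commit, percentage, and classification of each row into a list and tallies them. The tuple layout is an illustrative choice, not an LKP data format.

```python
from collections import Counter

# (test indicator, commit, reported change in %, classification), copied from the summary rows.
OBSERVATIONS = [
    ("aim7.jobs-per-min", "f7e9e8fc79", 55.5, "improvement"),
    ("aim7.jobs-per-min", "316ba5736c", -11.2, "regression"),
    ("aim7.jobs-per-min", "5a409b4f56", -27.5, "regression"),
    ("fio.latency_2ms%", "b027d4c97b", 7.1, "regression"),
    ("netperf.Throughput_total_tps", "050e9baa9d", -5.6, "regression"),
    ("stress-ng.clone.ops_per_sec", "a3498436b3", 77.5, "improvement"),
    ("stress-ng.pthread.ops_per_sec", "f3c01d2f3a", -6.9, "regression"),
    ("will-it-scale.per_process_ops", "3deb642f0d", -8.8, "regression"),
    ("will-it-scale.per_process_ops", "9965ed174e", -3.7, "regression"),
    ("will-it-scale.per_process_ops", "b12e358328", 57.6, "improvement"),
    ("will-it-scale.per_process_ops", "230671533d", 135.1, "improvement"),
]

counts = Counter(kind for _, _, _, kind in OBSERVATIONS)
print(counts["regression"], counts["improvement"])  # 7 4

# Rows whose sign alone would suggest the opposite classification.
mismatched = [
    commit
    for _, commit, change, kind in OBSERVATIONS
    if (change > 0) != (kind == "improvement")
]
print(mismatched)  # ['b027d4c97b']
```

The one mismatch is the fio latency indicator, where a higher value is worse, so the sign of a change has to be read together with the direction of the metric.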


 

  3. Shift-Left Testing

Beyond testing trees in the upstream kernel, 0-Day CI also tests developers’ and maintainers’ trees, which can catch issues earlier and reduce wider impact. We call it “shift-left” testing. During the v4.18 release cycle, 0-Day CI reported 8 major performance regressions and 8 major improvements through shift-left testing. For some of these, this section gives more detailed information together with the code changes that possibly led to the result, though the assessment is limited by the test coverage we have now. The whole list is given in the Report Summary section.

  1. aim7.jobs-per-min

aim7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of a multiuser system.

 

  1. scenario: sync_disk_rw test on btrfs

 

Commit a9d3e24b6e was reported to give a 19.9% improvement of aim7.jobs-per-min compared to v4.18-rc1.

 

Correlated commits

a9d3e24b6e

btrfs: always wait on ordered extents at fsync time

branch

linux-next/master

report

[LKP] [lkp-robot] [btrfs] a9d3e24b6e: aim7.jobs-per-min 19.9% improvement

status

Not merged yet

 

  1. netperf.Throughput_total_tps

Netperf is a benchmark that can be used to measure the performance of many different types of networking. It provides tests for both unidirectional throughput, and end-to-end latency.

  1. scenario: ipv4 TCP CRR test in localhost

 

Commit 8a2e54b8af was reported to cause a -26.1% regression of netperf.Throughput_total_tps compared to v4.18-rc1.

 

Correlated commits

8a2e54b8af

vfs: Implement a filesystem superblock creation/configuration context

branch

linux-next/master

report

[LKP] [lkp-robot] [vfs] 8a2e54b8af: netperf.Throughput_total_tps -26.1% regression

status

Not merged yet, no response

 

  1. stress-ng.mq.ops_per_sec

stress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces.

  1. scenario: 100%-1s-scheduler-performance

Commit 5cd6a50ace was reported to cause a -100% regression of stress-ng.mq.ops_per_sec compared to v4.18-rc1.

 

Correlated commits

5cd6a50ace

ipc: Convert mqueue fs to fs_context

branch

linux-next/master

report

[LKP] [lkp-robot] [ipc] 5cd6a50ace: stress-ng.mq.ops_per_sec -100% regression

status

Not merged yet, the author is fixing it.

 

  1. unixbench.score

UnixBench is a system benchmark to provide a basic indicator of the performance of a Unix-like system.

  1. scenario: syscall test

Commit 8fbedc19c9 was reported to cause a -10.2% regression of unixbench.score compared to v4.18-rc1.

 

Correlated commits

8fbedc19c9

fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

branch

hch-vfs/remove-get-poll-head

report

[LKP] [lkp-robot] [fs] 8fbedc19c9: unixbench.score -10.2% regression

status

Not merged yet, no response

 

  1. Report Summary

0-Day CI reported 8 performance regressions and 8 improvements through shift-left testing on developer and maintainer repos.

 

| Test Indicator | Mail | Test Scenario | Test Machine | Status |
|---|---|---|---|---|
| aim7.jobs-per-min | [btrfs] a9d3e24b6e: 19.9% improvement | disk: 4BRD_12G; md: RAID0; fs: btrfs; test: sync_disk_rw; load: 20; cpufreq_governor: performance | lkp-ivb-ep01 | Currently not merged |
| netperf.Throughput_total_tps | [vfs] 8a2e54b8af: -26.1% regression | ip: ipv4; runtime: 300s; nr_threads: 25%; cluster: cs-localhost; test: TCP_CRR; cpufreq_governor: performance | lkp-bdw-ep2 | Currently not merged, no response |
| netperf.Throughput_tps | [vfs] 4f3911e76e: -83.7% regression | ip: ipv4; runtime: 300s; nr_threads: 200%; cluster: cs-localhost; test: TCP_CRR; cpufreq_governor: performance | lkp-bdw-ex2 | Currently not merged, no response |
| netperf.Throughput_tps | [vfs] b844d4878b: -72.8% regression | ip: ipv4; runtime: 300s; nr_threads: 25%; cluster: cs-localhost; test: TCP_CRR; cpufreq_governor: performance | lkp-bdw-ex2 | Currently not merged, no response |
| perf-bench-numa-mem.GB_per_thread | [mm, numa] 755a36ce60: 28.9% improvement | nr_threads: 2t; mem_proc: 300M; cpufreq_governor: performance | lkp-bdw-ep6 | Currently not merged |
| stress-ng.mq.ops_per_sec | [ipc] 5cd6a50ace: -100% regression | nr_threads: 100%; testtime: 1s; class: scheduler; cpufreq_governor: performance | lkp-bdw-ep6 | Currently not merged, the author is fixing it |
| stress-ng.pthread.ops_per_sec | [mm] f3c01d2f3a: -6.9% regression | nr_threads: 100%; testtime: 60s; class: scheduler; cpufreq_governor: performance | lkp-bdw-ep6 | Merged at v4.18-rc1, no response |
| unixbench.score | [fs] 8fbedc19c9: -10.2% regression | runtime: 300s; nr_task: 100%; test: syscall; cpufreq_governor: performance | lkp-bdw-de1 | Currently not merged, no response |
| vm-scalability.throughput | [fs] 5c6de586e8: +12.4% improvement | runtime: 300s; test: small-allocs; cpufreq_governor: performance | lkp-hsw-ep5 | Currently not merged, possibly noise |
| vm-scalability.throughput | [xfs] bef10baeec: 5.9% improvement | runtime: 300s; test: lru-file-readonce; cpufreq_governor: performance | lkp-ivb-d02 | Currently not merged |
| vm-scalability.throughput | [page cache] 3e504b016b: +4.2% improvement | runtime: 300s; size: 1T; test: lru-shm; cpufreq_governor: performance | lkp-bdw-ep2 | Currently not merged |
| vm-scalability.throughput | [xfs] 0610cbb3c6: +8.2% improvement | runtime: 300s; size: 1T; test: msync-umt; cpufreq_governor: performance | lkp-skl-2sp2 | Currently not merged |
| vm-scalability.throughput | [shmem] 85766f621c: -15.5% regression | runtime: 300s; size: 16G; test: shm-xread-rand; cpufreq_governor: performance | lkp-ivb-d02 | Fixed at f63145ae97 |
| will-it-scale.per_process_ops | [mm] 230671533d: +135.1% improvement | nr_task: 100%; mode: process; test: page_fault3; cpufreq_governor: performance | lkp-skl-4sp1 | Merged at v4.18-rc1 |
| will-it-scale.per_process_ops | [kprobes/x86] 80006dbee6: +2.6% improvement | test: sched_yield; cpufreq_governor: performance | lkp-bdw-ex2 | Merged at v4.19-rc1 |
| will-it-scale.per_thread_ops | [kernel/kexec_file.c] b3ab44957a: -3.4% regression | nr_task: 50%; mode: thread; test: pread1; cpufreq_governor: performance | lkp-bdw-ep3d | Currently not merged, the author is fixing it |
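Each Test Scenario entry in this summary is a list of `key: value` parameters. The sketch below shows one way to turn such an entry into a dictionary, accepting parameters separated by line breaks or semicolons. The `parse_scenario` helper is hypothetical and is not part of LKP.

```python
import re


def parse_scenario(cell: str) -> dict[str, str]:
    """Parse 'key: value' parameters separated by newlines or semicolons."""
    params = {}
    for item in re.split(r"[;\n]", cell):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(":")
        params[key.strip()] = value.strip()
    return params


# Parameters of the aim7 and stress-ng.mq rows, copied from the summary.
aim7 = parse_scenario(
    "disk: 4BRD_12G\nmd: RAID0\nfs: btrfs\ntest: sync_disk_rw\nload: 20\n"
    "cpufreq_governor: performance"
)
mq = parse_scenario(
    "nr_threads: 100%; testtime: 1s; class: scheduler; cpufreq_governor: performance"
)

print(aim7["md"], aim7["load"])  # RAID0 20
print("-".join(mq.values()))     # 100%-1s-scheduler-performance
```

Joining the stress-ng.mq values with dashes happens to reproduce the scenario string "100%-1s-scheduler-performance" quoted in the stress-ng.mq section, which suggests how that shorthand maps onto the individual parameters.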

 

  4. Test Machines

  1. IVB Desktop

| model | brand | cpu number | memory |
|---|---|---|---|
| Ivy Bridge | Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz | 8 | 16G |
| Ivy Bridge | Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz | 4 | 8G |

  1. BDW EP

| model | brand | cpu number | memory |
|---|---|---|---|
| Broadwell-EP | Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz | 88 | 128G |

  1. HSW EP

| model | brand | cpu number | memory |
|---|---|---|---|
| Haswell-EP | Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz | 72 | 128G |

  1. IVB EP

| model | brand | cpu number | memory |
|---|---|---|---|
| Ivy Bridge-EP | Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz | 40 | 384G |
| Ivytown Ivy Bridge-EP | Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz | 48 | 64G |

  1. HSX EX

| model | brand | cpu number | memory |
|---|---|---|---|
| Brickland Haswell-EX | Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz | 144 | 512G |