
Improve network performance by setting per-queue interrupt moderation in Linux*

Published May 5, 2017

By: Kan Liang, Andi Kleen, and Jesse Brandenburg

Introduction

This article describes a new per-queue interrupt moderation solution to improve network performance with XL710- and X710-based 40G Intel® Ethernet network connections.

Background

High network throughput and low latency are key goals that many enterprises pursue. However, there is a trade-off between latency and throughput from the perspective of an Ethernet controller. To achieve lower latency, the controller usually minimizes the interval between interrupts to speed up small packet processing at the price of causing higher CPU usage and lower throughput. On the other hand, to improve throughput and minimize the overhead caused by frequent interrupts, larger interrupt intervals are desirable [1]. So an appropriate interrupt interval is critical to getting the best performance balance. The optimal interrupt interval is supposed to reduce CPU overhead when handling interrupts without adding significant packet latency or causing undue packet loss.

To balance CPU efficiency for bulk traffic with minimal latency, the Ethernet driver supports adaptive interrupt moderation, which dynamically changes the interval according to packet size and throughput. However, there are still big performance gaps compared to hand-tuned results. This article describes a new solution that sets interrupt moderation per-queue to get better network performance.

Interrupt moderation

Interrupt moderation is a driver feature that allows the user to manage the rate of interrupts to the CPU during packet transmission and reception. Without interrupt moderation, the system triggers an interrupt for every transmitted and received packet. Although this can minimize the latency of each packet, the extra CPU resources spent on interrupt-processing overhead can significantly reduce throughput. When interrupt moderation is enabled, multiple packets are handled for each interrupt, so overall interrupt-processing efficiency improves and CPU utilization decreases. However, latency also increases. As a result, determining the optimal interrupt moderation settings usually involves a trade-off between latency and throughput performance.

Interrupt moderation can be set through the ethtool coalesce settings. The rx-usecs parameter sets the minimum interval between interrupts for received packets, and tx-usecs sets the minimum interval for transmitted packets.
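
For example, the current coalesce settings can be inspected with ethtool -c and new values applied with ethtool -C. This is only a sketch, where $ETH_NAME stands for the interface name and the values are illustrative rather than recommended:

ethtool -c $ETH_NAME
ethtool -C $ETH_NAME rx-usecs 50 tx-usecs 50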

System setup

The following hardware and software are used in this document’s examples:

System 1 (code name Ivytown)
  • Processor: Intel® Xeon® Processor E5-2697, 2.70 GHz, 12 cores per processor
  • Platform: Total of 48 logical CPUs, with Intel® Hyper-Threading Technology enabled, Intel® Turbo Boost Technology disabled, Intel® SpeedStep disabled, CPU C-states disabled, PCI Express Active-State Power Management disabled, 64 GB of DDR3
  • Ethernet controller: XL710 for 40 GbE QSFP+ with i40e driver 1.4.25-k
  • Software: Fedora 23 updated with a Linux 4.6.0 kernel and ethtool
  • Benchmark: netserver from Netperf version 2.7.0

System 2 (code name Ivytown)
  • Processor: Intel® Xeon® Processor E5-2695, 2.40 GHz, 12 cores per processor
  • Platform: Total of 48 logical CPUs, with Intel® Hyper-Threading Technology enabled, Intel® Turbo Boost Technology disabled, Intel® SpeedStep disabled, CPU C-states disabled, PCI Express Active-State Power Management disabled, 32 GB of DDR3
  • Ethernet controller: XL710 for 40 GbE QSFP+ with i40e driver 1.4.25-k
  • Software: Fedora 23 updated with a Linux 4.6.0 kernel and a modified ethtool
  • Benchmark: netserver from Netperf version 2.7.0

Adaptive interrupt moderation solution

To balance latency and throughput, the current system relies on an adaptive interrupt moderation solution. For general cases, it achieves good network performance through the following simple procedures.

Adaptive interrupt moderation

Adaptive interrupt moderation [2] is a driver feature that is enabled by default. It dynamically adjusts the interrupt rate based on the packet size and the average throughput.

The adaptive interrupt moderation for the transmitter and the receiver side can be turned on/off by setting the adaptive-tx and adaptive-rx parameters using ethtool, as shown.

ethtool -C $ETH_NAME adaptive-tx on
ethtool -C $ETH_NAME adaptive-rx on
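
Conversely, the per-queue interrupt moderation solution described later requires adaptive moderation to be turned off first; for example:

ethtool -C $ETH_NAME adaptive-tx off adaptive-rx off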

Assigning applications and interrupts to cores

Network performance can also be improved by assigning interrupts and applications to the same specific core. This method reduces the overhead spent on cache synchronization between different cores [3].

The Intel Ethernet driver source code usually contains a script (set_irq_affinity.sh) that can set interrupt CPU affinities.

service irqbalance stop
set_irq_affinity.sh all DEVNAME
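
If the script is not available, interrupt affinities can also be set by hand through /proc. This is only a sketch; the IRQ number (30) and CPU (2) below are hypothetical:

grep $ETH_NAME /proc/interrupts
echo 2 > /proc/irq/30/smp_affinity_list

The first command helps locate the interrupt numbers assigned to the device's queues; the second pins IRQ 30 to CPU 2.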

Per-queue interrupt moderation solution

While adaptive interrupt moderation might be good enough for general cases, it does not provide as good performance as you can achieve by tuning each case individually. A new per-queue interrupt moderation solution resolves this issue and can achieve optimal performance.

There are multiple queues in a device, and each queue has its own interrupt moderation setting. The idea is to divide the queues into different groups and use a dedicated interrupt moderation policy for each group. Each application is assigned to a queue with a certain policy, depending on the nature of the application. The details of the policy assignment are described in the remainder of this section. There are also some hardware and platform BIOS configurations to consider when grouping the queues, which might also impact performance.

You can set the per-queue interrupt moderation for the transmitter and the receiver side by setting the per-queue coalesce parameters using ethtool:

ethtool --set-perqueue-command $ETH_NAME queue_mask $MASK --coalesce $OPTIONS
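
Here queue_mask is assumed to be a hexadecimal bit mask in which bit n selects queue n (for example, a mask of 0x3 would select queues 0 and 1), and $OPTIONS takes the same coalesce parameters as ethtool -C, such as rx-usecs and tx-usecs. This description assumes the modified ethtool referenced in the System setup section.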

Policies

For different application requirements and characteristics, we designed three different policies for queues:

  • LATENCY policy
  • BULK policy
  • CPU policy

LATENCY policy

The LATENCY policy is designed to achieve the lowest latency. Under this policy, the queues are assigned the highest interrupt rate so that the packets are processed immediately. Usually this policy applies to small packets. Small packets can be processed quickly without introducing a lot of CPU overhead.

According to the test results, the optimal interrupt moderation for the LATENCY policy is rx-usecs 5 and tx-usecs 10.

BULK policy

The BULK policy is designed to achieve the highest throughput. Under this policy, the queues are assigned an intermediate interrupt rate so that packets are processed efficiently in batches. The drawback is that CPU utilization can be higher because of the relatively large number of interrupts, which leaves fewer CPU resources for applications. Usually this policy applies to arbitrary packet sizes on a lightly loaded system.

According to the test results, the optimal interrupt moderation for BULK policy is rx-usecs 62 and tx-usecs 122.

CPU policy

The CPU policy is designed to achieve high throughput with reasonable CPU utilization. It tries to decrease the CPU utilization spent on network processing and leave more resources for applications. Under this policy, the queues are assigned a low interrupt rate to reduce the interrupt overhead while still maintaining the highest available throughput. Usually this policy applies to arbitrary packet sizes on a heavily loaded system.

According to the test results, the optimal interrupt moderation for CPU policy is rx-usecs 125 and tx-usecs 250.
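
A minimal sketch of applying these policies with the per-queue coalesce command shown earlier, assuming (hypothetically) that queues 0-15 serve latency-sensitive applications and queues 16-47 serve throughput-oriented applications:

ethtool --set-perqueue-command $ETH_NAME queue_mask 0xFFFF --coalesce rx-usecs 5 tx-usecs 10
ethtool --set-perqueue-command $ETH_NAME queue_mask 0xFFFFFFFF0000 --coalesce rx-usecs 62 tx-usecs 122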

Queue assignment considerations

It’s also essential to pin the queues and the interrupts to a specific core by setting their CPU affinities.

If the applications running on the system all require low latency, or all require high throughput, a single policy should be applied to all queues.

If there are mixed workloads (a mixture of applications that require high throughput and applications that require low latency) running on the system, we need to divide the queues into different policy groups. For example, we can divide the queues into latency queues (for applications requiring low latency) and bulk queues (for applications requiring high throughput). For latency queues, the LATENCY policy should be applied. For bulk queues, either the BULK policy or the CPU policy should be applied, depending on the load of the system.

The number of latency queues and the number of bulk queues can vary in different situations. Usually, applications that require high throughput utilize more CPU resources than applications that require low latency. This means that if there is equal interest in latency and throughput performance, it is better to reserve more bulk queues than latency queues.

HyperThreading

If HyperThreading is on, there are logical cores sharing the same physical cores. Since the applications that require high throughput usually utilize more CPU resources, try to avoid letting two applications that require high throughput occupy the same physical core.
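
Which logical CPUs share a physical core can be checked through sysfs. For example, the following command (CPU 0 is just an illustration) lists the hyperthread siblings of CPU 0:

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list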

Nonuniform memory access

If the system has a NUMA architecture, pinning the application and the queue to a CPU in the same NUMA node as the network device generally achieves better performance.
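
A minimal sketch of checking NUMA locality (the interface name and node number are placeholders, and your_application stands for the workload to pin):

cat /sys/class/net/$ETH_NAME/device/numa_node
numactl --cpunodebind=0 --membind=0 your_application

The first command reports the NUMA node of the network device; numactl then restricts the application to the CPUs and memory of that node (node 0 in this sketch).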

Test results

Single type workload

In this section, we design tests to verify the network performance for a single type of workload, meaning that the applications running on the system all require either low latency or high throughput, but not both. We compare the performance results of the adaptive interrupt moderation solution to the new per-queue interrupt moderation solution.

Benchmark

The test separately evaluates latency performance and throughput performance with the netperf benchmark. It was conducted on Linux using the two Intel® Xeon® processor-based platforms described in the System setup section.

Latency performance test

A netperf TCP_RR test is conducted for the latency performance test. The netperf command for the test is:

netperf -t TCP_RR -H server_IP -T cpu,cpu -c -C -l 60 -- -r message_size

Since only latency-first threads are running on the system, the LATENCY policy is applied to all queues.

An application that requires low latency usually uses small message sizes. So only small message sizes (from 64 bytes to 1 KB) are verified in the latency performance test.
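
For illustration, a 64-byte request/response test against a hypothetical server at 192.168.1.2, with both ends pinned to CPU 2, follows the form above:

netperf -t TCP_RR -H 192.168.1.2 -T 2,2 -c -C -l 60 -- -r 64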

Throughput performance test

A netperf TCP_RR test with burst-mode enabled is conducted for the bidirectional throughput performance test. The netperf command for the test is:

netperf -f m -t TCP_RR -H server_IP -T cpu,cpu -c -C -l 60 -- -r message_size -b burst_count -D

The CPU policy and the BULK policy are both designed for high throughput applications. So both policies are verified.

Message sizes from 64 bytes to 64 KB are verified in this test.

System configuration for the adaptive interrupt moderation solution

For the adaptive interrupt moderation solution, adaptive-rx and adaptive-tx are set to ON. There are 48 queues and 48 cores in total. A one-to-one mapping pins each queue to a specific core.

System configuration for the per-queue interrupt moderation solution

For the per-queue interrupt moderation solution, adaptive-rx and adaptive-tx are set to OFF. There are 48 queues and 48 cores in total. A one-to-one mapping pins each queue to a specific core. Also, the relevant policy is applied to each queue by ethtool.

Performance

Latency first workload

For the latency-first workload, we compare the performance results from the baseline, the current adaptive interrupt moderation solution, and the per-queue interrupt moderation solution for messages of sizes between 64 bytes and 1 KB (shown on the X-axis of Figures 1 through 3). Baseline means that no optimization options are applied to the system. The performance under different system loads is tested by increasing the total number of threads. During the test, test threads of the same message size are launched at the same time. The normalized latency performance results are shown on the Y-axis, where a higher number is more desirable. The solid lines in the figures represent CPU utilization.

The best latency performance can be observed with the LATENCY policy for any system load.

Figure 1. Light load system latency performance comparison between the baseline, the current adaptive interrupt moderation solution, and the per-queue interrupt moderation solution.

Figure 2. Medium load system latency performance comparison between the baseline, the current adaptive interrupt moderation solution, and the per-queue interrupt moderation solution.

Figure 3. Heavy load system latency performance comparison between the baseline, the current adaptive interrupt moderation solution, and the per-queue interrupt moderation solution.

Throughput first workload

For a bidirectional, throughput-first workload, we compare the performance results from the baseline, the current adaptive interrupt moderation solution, and the per-queue interrupt moderation solution with packet sizes between 64 bytes and 64 KB (shown on the X-axis of Figures 4 through 6). Baseline means that no optimization options are applied to the system. The performance under different system loads is measured by increasing the total number of threads. During the test, the test threads of the same packet size are launched at the same time. The normalized throughput performance results are shown on the Y-axis, where a higher number is more desirable. The solid lines in Figures 4 through 6 represent CPU utilization.

On the lightly loaded system, the best throughput performance can be observed with the BULK policy (Figure 4). On the medium-loaded system, the best throughput performance can be observed with the CPU policy (Figure 5). Furthermore, the CPU policy consumes fewer CPU resources, which leaves other applications with more resources (Figure 5). On the heavily loaded system, the best throughput performance can be observed with the CPU policy as well (Figure 6).

Figure 4. Light load: throughput of adaptive moderation vs. per-queue moderation.

Figure 5. Medium load: throughput of adaptive moderation vs. per-queue moderation.

Figure 6. Heavy load: throughput of adaptive moderation vs. per-queue moderation.

Mixed workload

In this section, we design some tests to verify the network performance for mixed workloads (a combination of applications requiring high throughput and applications requiring low latency). We compare the performance of the adaptive interrupt moderation solution with the new per-queue interrupt moderation solution.

Benchmark

The mixed workload performance tests are also conducted on Linux using the two Intel® Xeon® processor-based platforms described in the System setup section. The mixed workload consists of several latency-first threads and several throughput-first threads. For details of the latency-first and throughput-first threads, refer to the Latency performance test and Throughput performance test sections.

System configuration for the adaptive interrupt moderation solution

For the adaptive interrupt moderation solution, adaptive-rx and adaptive-tx are set to ON. There are 48 queues and 48 cores in total. A one-to-one mapping pins each queue to a specific core.

System configuration for the per-queue interrupt moderation solution

The test platform is an x86_64 SMP system with 48 logical CPUs, 2 NUMA nodes, and Intel® Hyper-Threading Technology enabled. The CPU topology is summarized as follows:

  • Node 0: logical CPUs 0-11 and 24-35 (physical cores 0-11)
  • Node 1: logical CPUs 12-23 and 36-47 (physical cores 12-23)

The number of queues equals the number of logical CPUs. A one-to-one mapping pins each queue to a specific CPU.

It is also assumed that the user has equal interest in latency and throughput performance.

Use the steps below to configure the system.

  1. Assign each queue to a specific CPU using the following commands:
    service irqbalance stop
    set_irq_affinity.sh DEVNAME
  2. Group queues for various policies.

    In accordance with the Queue assignment considerations section, 32 queues are reserved as throughput queues. The remaining 16 queues are latency queues. Half of the queues are pinned to the local node (where the NIC is connected), while the other half are pinned to the remote node.

    When Hyper-Threading is on, each latency queue shares a physical core with a throughput queue.

    As a result, the CPU topology for dedicated policy groups becomes as follows:

    • Throughput queues: 8-11, 20-47
    • Latency queues: 0-7, 12-19
  3. Set policies for queues in each group, as shown in the example commands after this list.
    • For latency queues, the LATENCY policy is applied.
    • For throughput queues, the BULK policy is applied.
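
A sketch of the corresponding ethtool commands, with the queue groups above expressed as bit masks (bits 0-7 and 12-19 for the latency queues, bits 8-11 and 20-47 for the throughput queues):

ethtool --set-perqueue-command $ETH_NAME queue_mask 0xFF0FF --coalesce rx-usecs 5 tx-usecs 10
ethtool --set-perqueue-command $ETH_NAME queue_mask 0xFFFFFFF00F00 --coalesce rx-usecs 62 tx-usecs 122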

Scoring system

This section provides a scoring system to evaluate mixed workload performance. The score is a single aggregated number, calculated as a weighted sum:

Score = normalized_latency_score * Weight + normalized_throughput_score * (1 - Weight).

The weight is a number between 0 and 1 that reflects the user's relative interest in latency versus throughput performance, with 0 meaning the user is only interested in throughput performance and 1 meaning the user is only interested in latency performance. If the user has an equal interest in latency and throughput performance, the weight is set to 0.5.

The latency score is calculated as the following:

normalized_latency_score = 1/ (latency in tuned system / latency in baseline system).

The throughput score is calculated as the following:

normalized_throughput_score = throughput in tuned system / throughput in baseline system.

The higher the score, the better the performance. Any score greater than 1 favors the tuned system.
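
As a hypothetical worked example with Weight = 0.5: if the tuned system reduces latency to 80 percent of the baseline (normalized_latency_score = 1/0.8 = 1.25) and increases throughput by 10 percent (normalized_throughput_score = 1.1), then Score = 0.5 * 1.25 + 0.5 * 1.1 = 1.175, which favors the tuned system.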

Performance

The mixed workload is a combination of bidirectional throughput-first and latency-first threads. Both thread types are simulated by netperf. For bidirectional, throughput-first workloads, packets of sizes between 64 bytes and 64 KB are tested (shown on the X-axis of Figures 7 through 11). For latency-first workloads, packets of sizes between 64 bytes and 1 KB are tested. During the test, the bidirectional throughput-first and latency-first threads are launched at the same time with the same test packet size. When the packet size is greater than 1 KB, the packet size for the latency-first threads is kept at 1 KB. We also test the performance under different system loads by increasing the total number of threads; refer to the figure legends for details. The score is shown on the Y-axis of Figures 7 through 11, where a higher score is more desirable. Any score greater than 1 favors the per-queue interrupt moderation approach.

Finally, three different types of combinations are tested (Figure 7 shows half latency-first threads and half throughput-first threads, Figure 8 shows two thirds latency-first threads and one third throughput-first threads, and Figure 9 shows one third latency-first threads and two thirds throughput-first threads). Two corner cases are also tested (Figure 10 shows all throughput-first threads, and Figure 11 shows all latency-first threads).

Figure 7. Mixed workload (half latency threads and half throughput-first threads) network performance comparison between the current adaptive interrupt moderation solution and the per-queue interrupt moderation solution. The new per-queue interrupt moderation solution provides a superior performance score.

Figure 8. Mixed workload (two thirds latency threads and one third throughput-first threads) network performance comparison between the current adaptive interrupt moderation solution and the per-queue interrupt moderation solution. The new per-queue interrupt moderation solution provides superior performance, as shown by the score.

Figure 9. Mixed workload (one third latency threads and two thirds throughput-first threads) network performance comparison between the current adaptive interrupt moderation solution and the per-queue interrupt moderation solution. The new per-queue interrupt moderation solution provides superior performance as shown by the score.

Figure 10. All throughput-first threads network performance comparison between the current adaptive interrupt moderation solution and the new per-queue interrupt moderation solution. The new per-queue interrupt moderation solution provides a superior performance score when the system loads are low. A few cases under heavy loads (96 threads) have some performance drop because some queues are reserved for latency queues. The user can get the performance back by reducing the number of latency queues.

Figure 11. All latency-first threads network performance comparison between the current adaptive interrupt moderation solution and the per-queue interrupt moderation solution. The new per-queue interrupt moderation solution provides a superior performance score. Only a few cases under heavy loads (96 threads) have a very limited performance drop because most queues are reserved for throughput queues. The user can get the performance back by reducing the number of throughput queues.

Conclusion

In summary, the per-queue interrupt moderation solution customizes the queues for specific threads. This solution improves the network performance for both mixed workloads and single type workloads. However, the new solution depends on the user understanding the workload type. If the user knows nothing about the workload, the old adaptive solution is still a good choice.

In this article, we assume that there is an equal interest in latency and throughput performance. So we set the latency and throughput queues to a fixed number. In practice, the requirements and environment could vary. It is important to change the number of dedicated queues accordingly. This adaptive assignment of queues is the next topic we would like to investigate.

References and footnotes

References

Footnotes