NFV-I Host Configuration For Low Latency
Additional contributions by: Yunhong Jiang, David Su, and Gary Wang
Network Function Virtualization (NFV) promises to liberate telecommunication providers from the constraints of legacy physical network functions and traditional business practices. However, handling network workloads in software, especially with small packet sizes, raises challenges of latency and jitter.
The Open Source Technology Center (OTC) in Intel's Software and Services Group (SSG) has been studying and addressing NFV latency issues for the past few years from a compute host infrastructure perspective, including the interaction of key software components such as KVM, QEMU, and Open vSwitch (OVS). We present in this document the best-known methods (BKMs) and best-known configurations (BKCs) for configuring hosts and for measuring and tuning NFV latency in Linux*-based x86_64 systems.
Many studies and reports have addressed latency in NFV. For example, publications from OpenStack.org and from Intel address latency from a broader system perspective. We take a complementary approach, involving a deeper look at the compute host setup to minimize both maximum packet latency (which matters for hard real-time applications) and average latency (which is relevant to soft real-time applications).
The focus of this white paper is on network-intensive, latency-sensitive applications. They can be Virtual Network Functions (VNFs) or regular applications such as a web server, a browser, or a Content Delivery Network (CDN) server. They can be running on a physical host, in a virtual machine (VM), or in a container. VNFs are often built with DPDK, but most of the recommendations and findings in this white paper will apply even if that is not the case. However, the focus will be on VNFs running in VMs or containers.
This paper is aimed at architects, administrators, and operators who have basic networking configuration knowledge, and who wish to find recommendations for optimizing an IA server for packet latencies. Systems and configurations vary widely, and the recommendations here are for the common case, not every possible case. Further, optimizing latency is not the only goal; one may instead wish to improve throughput or minimize CPU usage for packet processing. Those are not addressed in this paper but, where applicable, we discuss some of the tradeoffs. As always, the system administrator must balance the competing goals and needs.
We first describe the motivation behind the need to optimize latency, in Section 2. In Section 3 we describe the hardware environment needed to measure latencies deterministically, covering processor features, BIOS settings, and NIC features. Section 4 describes the software configuration, including the real-time kernel patches that we have developed. Some tools that are useful for latency measurement are described in Section 5. Section 6 describes the methodology for measuring latency, while Section 7 presents the data that we collected in various environments, including the latency improvements to be realized by using our real-time kernel patches. Finally, Section 8 concludes the paper.
2: Need to optimize latency
There are many metrics of network performance, including latency, throughput, and resource utilization. Mechanisms that improve one will not necessarily improve others. For instance, one can improve throughput by processing packets in parallel, but that will not necessarily improve the latency of any one packet. However, improving latency can potentially increase throughput. Specifically, processing a packet requires a series of tasks, and one of those tasks is the rate-limiting step, which is the task that consumes the most time and thereby limits network throughput. Reducing the time taken by the rate-limiting step decreases packet latency while also enhancing throughput (although some other task may then become the rate-limiting step).
This is one important reason why it pays to focus on packet latency. Another important reason to optimize latency is that it is an important factor in the network response time faced by end users, which is a critical aspect in many networking applications, from multimedia to gaming. A third reason to lower latency is that some workloads are very sensitive to the maximum packet latency. For example, in wireless networks, the control signals from the wireless base station to the Central Office must be processed within a stringent time limit. If this fails, the synchronization with the base station is lost, causing the base station to reset. This can result in dropped calls.
The latency of packets in a network flow can be measured either in terms of the minimum, average, or maximum. For improving the response time of end users, average packet latency is the relevant metric, but for time-sensitive applications, maximum latency is the key. In this document, we focus on all three measures.
3: Hardware environment
In this section, we will discuss how the host hardware configuration affects the latency introduced by a Virtual Network Function (VNF) running on a host. We divide the hardware configuration into three aspects: processor features, BIOS settings, and NIC configuration. We look at each in this section.
The device features described in this section are not mandatory, yet they go a long way toward making network latencies easier to measure and more deterministic. This is especially true for VMs, which matters because VNFs are often executed in or as VMs.
3.1: CPU and System Chipset
Newer Intel CPUs support several features that aid in precise timing or reducing packet processing overhead. Check ark.intel.com to see which CPU models support these features.
The following features help in precise timing or time measurement:
- The TSC (Time Stamp Counter) offers higher precision and requires lower overhead to access than other means of determining elapsed time. However, out-of-order instruction execution and other factors can affect the TSC’s precision. These issues are addressed in Intel’s white paper  on the usage of rdtsc instructions and associated issues.
- TSC-based Deadline Timer offers higher precision than a timer based on the external bus.
- Invariant TSC runs at a constant rate independent of ACPI (C- and P-) states.
The following features help in reducing packet latencies:
- Advanced Programmable Interrupt Controller (APIC) Virtualization and Posted Interrupts allow interrupts to Virtual Machines (VMs) to be handled in hardware rather than through software emulation. These require support from Intel® Xeon® processors as well as the Root Complex.
- Intel® Resource Director Technology includes Cache Allocation Technology (CAT), which allows controlled access to the cache shared by multiple cores, even when a neighboring core runs a noisy neighbor: an application that uses the cache heavily.
- Code and Data Prioritization (CDP), an extension to CAT, allows control over placement of code and data in the shared cache.
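Besides consulting ark.intel.com, support for the timing-related features above can be checked from user space on a running Linux host. The following is a minimal sketch that inspects the CPU flags in /proc/cpuinfo (constant_tsc plus nonstop_tsc together indicate an Invariant TSC):

```shell
# Report which timing-related CPU flags this host advertises.
has_flag() {
  # has_flag FLAG FLAGS_STRING: succeed if FLAG appears in the space-separated list
  echo "$2" | tr ' ' '\n' | grep -qx "$1"
}
cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2-)
for f in constant_tsc nonstop_tsc tsc_deadline_timer; do
  if has_flag "$f" "$cpu_flags"; then echo "$f: present"; else echo "$f: absent"; fi
done
```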
3.2: BIOS settings
BIOS settings can have a substantial impact on packet latencies. The following recommended settings will help reduce them.
Many BIOS implementations exist, and some may or may not allow control or modification of these settings. Consult your BIOS documentation for information on your specific implementation.
- Non-Uniform Memory Access (NUMA) awareness: Enable this feature.
- Power Management: Disable this feature to keep packet latencies uniform, at the expense of increased power consumption. We present some data later to show how processor C-states can affect maximum packet latencies. Make an informed decision.
- Hyper-Threading: Consider disabling this feature unless other considerations are paramount. Enabling Hyper-Threading can increase packet latencies because sibling hyper-threads share some processor resources. This can be mitigated by not scheduling any thread on the sibling logical core, but it is often easier and less error-prone to turn off Hyper-Threading.
- Legacy USB support and Port 60/64 Emulation: Disable this feature.
- Memory scrubbing: Consider disabling this ECC RAM feature unless the system is deployed in a very noisy environment or data integrity outweighs latency considerations. If enabled, this feature reads every word in RAM and corrects any single-bit errors found, which affects maximum latency. Packets are delayed if they arrive during a period when memory is being scrubbed.
- System Management Interrupts (SMIs): Although most BIOS implementations do not allow you to disable SMIs, you should disable them if possible. SMIs can cause unpredictable, large delays (on the order of 100+ milliseconds) that inflate maximum latency; since SMIs are rare, however, they are unlikely to affect average latency.
3.3: NIC configuration
Numerous Network Interface Cards (NICs) are available from many vendors, including Intel. This white paper does not advocate any one brand or model, but points out certain generic features of modern NICs that help in lowering or measuring packet latency. NIC-specific features for lowering latency, such as registers to control DMA I/O latency, are outside the scope of this document.
Check your NIC’s data sheet to see if these generic features are present.
- Multiple queues with flexible packet classification: This feature enables the NIC to place incoming packets into different hardware queues based on criteria programmed by host software. Each queue is associated with a CPU core, so that packets arriving in a queue are processed on the corresponding core, either because the queue's interrupt vector is bound to that core or because a driver running on that core polls the queue. This feature can improve throughput by using several cores in parallel to process packets. It also improves latency in two ways:
- The average wait time for a packet in the queue is reduced compared to a single-queue implementation.
- Since related packets are placed together in a queue, memory (cache and page) locality is improved.
- SR-IOV with Pass-through: SR-IOV is part of the PCI standard and is not specific to networking. It enables a device to appear to the host as multiple virtual PCI functions. Further, one or more of those virtual PCI functions can be passed through to a user space application, giving it access to PCI constructs, such as configuration/memory spaces, DMA, and interrupts. This allows the user space application to bypass the host kernel, which avoids context switch overhead and thus results in better latency.
- Time stamping with IEEE 1588 (Precision Time Protocol): This feature enables precise measurements rather than lower latency. The NIC hardware can record the timestamps for sent and received Ethernet frames, potentially at the Ethernet transceiver level. More information can be found in .
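Several of these NIC features are configurable from the host. The sketch below assumes an interface named eth0 and uses example queue counts, ports, and VF counts; consult your NIC's data sheet and driver documentation for the options it actually supports:

```shell
# Multi-queue: use 8 combined RX/TX queues and steer TCP port 5001 traffic to queue 3
ethtool -L eth0 combined 8
ethtool -K eth0 ntuple on
ethtool -N eth0 flow-type tcp4 dst-port 5001 action 3

# SR-IOV: create 4 virtual functions on the same (assumed SR-IOV-capable) device
echo 4 > /sys/class/net/eth0/device/sriov_numvfs
```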
4: Software environment
As you might expect, the software configuration of the host strongly influences its latency profile. Two of the main components of networking software in the host are the host kernel and the virtual switch. The third important component is DPDK, a software platform that processes packets in user space to help achieve optimal network performance. In this section, we discuss the best-known configuration of these three components, as well as the changes to the real-time kernel that we have implemented to reduce latency.
4.1: Kernel configuration
Mainline Linux distributions provide several features that can help meet low latency requirements, but these features are not targeted specifically at low latency. Such requirements are best met by the real-time Linux kernel provided by the Real-Time Linux Project, run by the Linux Foundation*; its source can be cloned from the project's Git repository.
For best results, configure the following features in the Real-Time Linux kernel. Remember that these configuration options optimize for low latency. Your application might have other needs that may be incompatible with some of these, or may need other kernel features. Use these features at your discretion.
- CPU/Core Isolation (isolcpus): Threads of latency-sensitive applications are best run on isolated processor cores that the kernel scheduler does not use for general tasks. This avoids interference from unrelated threads.
- Huge Pages and HugeTLBFS: The application must allocate memory from huge pages. This minimizes TLB usage and TLB misses. In addition, the allocated memory should ideally be from the same NUMA node where the application’s threads are running.
- IRQ Affinity: Bind device IRQs unrelated to the workload to a separate set of cores, which are dedicated to housekeeping duties, and away from isolated cores where the VNF, OVS-DPDK, or related performance-sensitive code is running. For example, interrupts from storage devices (SSD, hard disks), peripherals (keyboard, mouse, etc.), and other unrelated NICs must be bound to housekeeping cores. (When using DPDK interrupt mode, however, the NIC IRQs need to be bound to the isolated cores.) Some OS distributions ship an irqbalance daemon that dynamically balances IRQs among all cores; disable it as well.
- Tickless kernel: Frequent clock ticks cause latency. Configuring fewer clock interrupts on the CPU (currently, one tick per second) can reduce latency because, in a guest, each clock interrupt triggers a VM exit from guest to host, which affects performance and latency.
- Mark TSC as Reliable: A TSC clock source that seems to be unreliable causes the kernel to continuously enable the clock source watchdog to check if the TSC frequency is still correct. Marking TSC as reliable has been controversial in the past but, with recent advances such as Invariant TSC, this is best practice in our opinion.
- Idle: Configuring the CPU core to poll during idle can slightly improve the performance of waking up an idle CPU. To balance against the competing goal of saving power, the CPU core can be configured to enter one of several sleep states when idle. This results in increased latency for packets that arrive when the core is idle. We present data that show how deeper sleep states affect packet latency.
- RCU-NOCB: This option prevents RCU callbacks from running on the specified core, thus eliminating a source of packet latencies. The set of cores specified for this parameter is typically the same as the set of isolated CPUs.
- Disable Real Time (RT) Throttling: RT Throttling is a Linux kernel mechanism that briefly deschedules a real-time process or thread that uses 100% of a core, so that the Linux scheduler can still execute kernel/housekeeping tasks. RT Throttling increases latency and should be disabled.
- CONFIG_NUMA_BALANCING: This kernel configuration option adds support for automatic NUMA-aware memory/task placement in the scheduler. Automatic balancing migrates pages at run time, which can introduce latency spikes, so consider disabling it for latency-sensitive deployments.
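Some of the options above have run-time counterparts. The following sketch (requiring root; the core lists, device name, and page counts are examples, not recommendations) shows how several of them can be applied on a live system:

```shell
# Huge pages: reserve 4 x 1 GiB pages on NUMA node 0 and mount hugetlbfs
echo 4 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
mkdir -p /dev/hugepages-1G
mount -t hugetlbfs -o pagesize=1G none /dev/hugepages-1G

# IRQ affinity: stop irqbalance and pin an unrelated device's IRQs to housekeeping cores
systemctl stop irqbalance
for irq in $(grep nvme0 /proc/interrupts | cut -d: -f1); do
  echo 0-10,16-30 > /proc/irq/$irq/smp_affinity_list
done

# RT throttling: allow real-time tasks to consume 100% of a core
echo -1 > /proc/sys/kernel/sched_rt_runtime_us

# NUMA balancing: disable automatic page migration at run time
echo 0 > /proc/sys/kernel/numa_balancing
```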
The set of all kernel configuration parameters used for an OPNFV project is listed on this page.
An example of the recommended arguments for the host kernel command line is listed below (the values for these arguments may vary in your environment):
isolcpus=11-15,31-35 nohz_full=11-15,31-35 rcu_nocbs=11-15,31-35 iommu=pt intel_iommu=on default_hugepagesz=1G hugepagesz=1G mce=off idle=poll intel_pstate=disable processor.max_cstate=1 pcie_aspm=off tsc=reliable
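One way to apply such arguments persistently is through the boot loader configuration. The sketch below uses grubby, as found on CentOS-style distributions, with the example core lists from above:

```shell
# Append low-latency arguments to every installed kernel's command line; reboot afterward to apply
grubby --update-kernel=ALL \
  --args="isolcpus=11-15,31-35 nohz_full=11-15,31-35 rcu_nocbs=11-15,31-35 idle=poll tsc=reliable"
```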
An example of the recommended arguments for the guest kernel command line of a VM is shown below (the values for these arguments may vary in your environment):
isolcpus=1 nohz_full=1 rcu_nocbs=1 mce=off idle=poll default_hugepagesz=1G hugepagesz=1G tsc=reliable
4.2: Packet switching: SR-IOV vs. OVS-DPDK
The host infrastructure for packet switching among VMs or containers generally consists of a virtual switch, which is either implemented in software or is part of the hardware of a NIC with SR-IOV features. Traditionally, Linux systems offer a software bridge that provides basic functionality but is not optimized for performance. Open vSwitch (OVS) offers more features, such as VLAN support, and is designed to be programmable. However, when implemented in the kernel, it can handle only a limited number of flows and is not optimized for packet latencies. When OVS runs in user space with DPDK, it offers better performance. You can find more information in the Intel white paper on SR-IOV and OVS .
The main trade-off to understand is that, when SR-IOV is used to expose a PCI function to a VM, the VM must run the NIC driver to operate that specific PCI function. This introduces a host hardware dependency in the VM. With OVS DPDK, the latencies are not as low as with SR-IOV pass-through, but there is no need for a NIC driver in the VM. So, if latency is the paramount consideration, SR-IOV is preferable.
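As a sketch of the SR-IOV pass-through path, a virtual function can be bound to the vfio-pci driver so that it is usable from a VM or a user space DPDK application (the PCI address below is an example):

```shell
# Bind an example VF to vfio-pci; dpdk-devbind.py ships with DPDK
modprobe vfio-pci
dpdk-devbind.py --bind=vfio-pci 0000:03:10.0
```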
4.3: DPDK configuration and interrupt mode
The basics of installing and configuring DPDK are addressed in .
DPDK drivers are traditionally deployed in poll mode. In poll mode, the driver takes over the core where it runs, taking 100% of the core’s time and power budget to continually poll the NIC interface. No other computation can happen on that core when it is in poll mode. The advantage of this mode is that when a packet arrives it is processed immediately, minimizing latency.
We have enabled an Interrupt Mode in DPDK, wherein a driver enables NIC interrupts and processes packets in response to interrupts. This opens the possibility of either saving power by putting the processor to sleep when there is no traffic, or scheduling other tasks on the same core to run during idle periods. As expected, this increases latency for two reasons: the overhead of delivering interrupts to user space, and the cost of task switching or waking up the processor before processing the first packet that comes after an idle period. However, as shown in Section 7.3, this does not increase latency by very much. This enables a discerning administrator to decide whether to use the Interrupt Mode.
4.4: Real-Time kernel patches
After measuring the latencies using the default kernel that comes with CentOS*, we applied some changes to the Real-Time kernel to improve the latencies and to enable DPDK interrupt mode. These are available from the KVM4NFV Git repository . In addition, we backported some patches from the real-time kernel into this repository. The latency improvements that we present in Section 7 are based on this kernel and the configurations described in Section 4.1. We recommend using the kernel from this repository to optimize latency.
Intel contributed the following changes to the kernel in the KVM4NFV Git repository:
- Change the VFIO interrupt handling to be nonthreaded. A threaded ISR requires a context switch that adds to the latency.
- For the same reason, avoid soft IRQ for the timer interrupt.
- DPDK interrupt mode patch #1: Switch to nonthreaded interrupt handler for the igb_uio driver to avoid the cost of a potential context switch when an interrupt arrives.
- DPDK interrupt mode patch #2: Disable time-based interrupt throttling for the ixgbe driver.
The KVM4NFV Git repository also includes the following patches that were backported from the real-time kernel:
- APIC Virtualization improvements to reduce interrupt delivery time.
- Utilize the VMX preemption timer to reduce the VM exit times.
5: Latency measurement tools
Many tools are available to assist in measuring latencies accurately over millions of packets. In this paper, we recommend two:
- Cyclictest: Cyclictest results are the most frequently cited real-time Linux metric. We normally run Cyclictest in a VM and check the maximum, minimum, and average values to evaluate the system. Detailed information can be found in .
- Ftrace: This enables function tracing in the kernel. This is useful in observing VM exit handlers and non-DPDK code when it is sharing the core with the workload of interest.
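A typical Cyclictest invocation for a quick check might look like the following; the priority, interval, loop count, and core number are examples, and root privileges plus the rt-tests package are assumed:

```shell
# Run cyclictest pinned to core 1 at SCHED_FIFO priority 95,
# sampling every 1000 us for 100,000 loops; reports min/avg/max wakeup latency in microseconds
cyclictest --mlockall --priority=95 --interval=1000 --loops=100000 --affinity=1 --quiet
```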
6: Latency measurement methodology and best practices
The test setup to measure latencies requires three components:
- The System Under Test (SUT), which runs the workload, as a process, VM, or container.
- A packet generator.
- Network connectivity between the SUT and the packet generator.
The SUT is typically a Linux server-class computer with at least two NICs, one for management connectivity and one connected to the packet generator. The workload is generally an application built on DPDK.
The packet generator may be a hardware appliance, such as those from Ixia or Spirent, or a software application, such as the popular open source software-based traffic generators pkt-gen, moongen, or trex. For many purposes, the open source traffic generators can generate adequate load levels. We recommend using packet generators that can use IEEE 1588 (Precision Time Protocol, or PTP)—refer to Section 3.3. For example, moongen can use hardware timestamping in the NIC.
Ideally, the packet generator and the SUT should be connected back to back rather than through a switch, since switches can introduce large and variable latencies. See Figure 1 for the recommended testbed setup. This is essentially a round-trip measurement that includes wire latency and the SUT's stack latency, apart from the application's latency.
Figure 1: Latency Measurement Setup
Each test run should be at least two hours long, and ideally 48 hours. The longer duration ensures that occasional events that may increase the maximum observed packet latency are accounted for. Each run should be repeated at least three times to ensure that the results are reproducible and do not have abnormally high variation.
Some factors that affect latencies, such as System Management Interrupts (SMIs), cannot be mitigated but can be tracked. The count of SMIs on a specific core can be read with the command:
$ rdmsr -p core_id 0x34 -d
This count must be read for each core where the workload runs, before and after the test run, to check whether SMIs occurred during the run.
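For example, the SMI count can be snapshotted on the workload cores before and after a run as follows (the core list is an example; the msr kernel module, the msr-tools package, and root privileges are assumed):

```shell
# Read MSR 0x34 (MSR_SMI_COUNT) on each workload core
modprobe msr
for core in 11 12 13 14 15; do
  printf 'core %s: %s SMIs\n' "$core" "$(rdmsr -p "$core" 0x34 -d)"
done
```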
7: Latency measurement data
7.1: Goals and test configuration
The maximum latency of packet processing is a critical parameter in NFV, especially in wireless applications, as explained in Section 2. Our goal was therefore to limit the maximum latency to 100 microseconds. As the data below indicate, we have met this goal, while also making substantial improvements in average packet latency in a variety of deployment configurations.
We measured packet latencies on both the stock 3.10 kernel that comes with CentOS 7 and with our optimized KVM4NFV kernel. We ran the stock kernel with the default configuration, without isolated CPUs or best known configurations, and we ran the optimized KVM4NFV Real-Time kernel with all best-known configurations, including isolated CPUs.
We used two workloads to compare these kernels. The first workload was the L2Forward application based on DPDK, which passes every packet from the ingress port to egress port, changing only the source/destination MAC addresses. The second was a representative VNF, specifically a virtual firewall based on DPDK, which is available from the SampleVNF project in OPNFV. Our focus was not on firewall functionality per se, but rather on the observed latencies. Since our focus was comparing the unoptimized and optimized kernels, we did not try to optimize the virtual firewall code.
The test configuration and method that we used conform to the practices described in Section 6. Here are some salient features of the SUT:
- CPU: Intel® Xeon® E5-2699 v3 (HSW) @ 2.30 GHz
- OS: CentOS 7
- Kernel: Either the standard 3.10 kernel or the KVM4NFV real-time kernel with the patches mentioned in Section 4.4. The configuration options are those mentioned in Section 4.1.
- NIC: Intel® Ethernet Controller X540-AT2, 10 Gbps.
Salient features of the guest VM:
- vCPUs: Bound to isolated host CPU cores.
- OS: CentOS 7 (same as host).
- Kernel: Same as host, with the guest kernel configuration options mentioned in Section 4.1.
- NIC: Passed through to guest with the vfio-pci driver.
We used the moongen packet generator with the PTP timestamping feature. The packet rate is limited to avoid congestion and the resulting latency variations; therefore, the packet rate is much lower than line rate.
We present the data from the L2 Forward application and from the virtual firewall in the next two sections.
7.2: Latency improvements with Real-Time kernel
The following graphs depict the latencies measured with the L2 Forward application, using the stock kernel from CentOS 7 and the optimized KVM4NFV kernel.
Note: When comparing latencies in these charts, note the disparity in the vertical axes. Minimum and average latencies are identical or nearly identical in bare metal, VM, and container environments. Maximum latency for the optimized kernel in the VM is nearly twice as large as in the bare metal and container environments. Maximum latency for the stock kernel is 57 percent higher for the container than bare metal, and it is 1,825 percent higher for the VM than for bare metal.
Figure 2: Latencies with DPDK L2 Forward in a bare metal deployment
Figure 3: Latencies with DPDK L2 Forward in a VM-based deployment
Figure 4: Latencies with DPDK L2 Forward in a Container-based Deployment
As the data indicates, the maximum latency improves substantially with the optimized RT kernel (KVM4NFV kernel).
7.3: Latency improvements with real world VNFs
In this section, we present data on the latency improvements for a virtual firewall when using our optimized kernel.
Figure 5: Latencies with a virtual firewall in a bare metal deployment
Figure 6: Latencies with a virtual firewall in a VM-based deployment
Figure 7: Latencies with a Virtual Firewall in a container-based deployment
The data shows that maximum latencies are considerably improved when using the KVM4NFV kernel.
7.4: DPDK Interrupt Mode
As explained in the DPDK configuration and interrupt mode section, when a DPDK driver operates in Interrupt Mode, packet latency increases, though there are other benefits. In this section, we present data that quantifies the increase in different situations, based on the KVM4NFV kernel.
Figure 8: Latencies in Poll and Interrupt Modes, in Bare Metal
Figure 9: Latencies in Poll and Interrupt Mode, in VM
7.5: DPDK Interrupt Mode and processor sleep states
When DPDK is used in Interrupt Mode instead of Poll Mode, there is a possibility of saving electrical power by letting the processor sleep during periods when there is no traffic. However, when a packet arrives after an idle period, the processor needs to transition to the active state before processing the packet, which increases latency. A processor can enter a number of sleep states; deeper sleep states save more power and require more time to re-enter the active state.
To make an informed choice on whether to enable sleep and which sleep states to enable, we studied how much the maximum latency increases with different sleep states. Since packets arrive in bursts, only the first packet in a burst will face the increased latency. When there are a large number of packets, the average and minimum latencies are not affected much. So, we present the maximum latency alone.
The test configuration and methodology for this study were the same as in previous sections, except that the idle=poll kernel parameter was replaced with a setting that allows the core to enter the sleep state under test.
Figure 10: Increase of Maximum Latency with Sleep States
As can be seen from Figure 10, the maximum latency increases in a predictable way with sleep state.
8: Conclusion
Latency measurement and tuning can be complicated, yet is essential for NFV deployments. While there are many reports from different organizations about optimizing for latency, Intel is in a unique position to contribute as it is both a hardware vendor (offering Intel Xeon chips, system chipsets, NICs and accelerators like FPGAs) as well as a software developer (DPDK, KVM enhancements, etc.). This white paper has presented an in-depth characterization of the preparation and configuration of NFV-I hosts for low latency, from hardware and software perspectives. It has described an interrupt mode for DPDK, which can potentially save CPU power or increase CPU utilization at the expense of slightly higher maximum packet latency. We systematically compared the maximum/average/minimum latencies between the stock kernel in CentOS and the optimized KVM4NFV real-time kernel, with two workloads—DPDK L2 Forward (a microbenchmark) and a Virtual Firewall (a representative VNF application)—and in three configurations—bare metal, VM, and Docker container. We presented detailed data showing that the best practices described in this paper can reduce the maximum packet latency by up to 98% (L2 Forward in VM). By following the recommendations of this white paper, telecommunication carriers can realize the benefits of NFV while maximizing performance, as measured by latency.
References
- Achieve Low Latency NFV with OpenStack: A publication from OpenStack.org.
- A Low-Latency NFV Infrastructure for Performance-Critical Applications: An Intel white paper.
- How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures: An Intel white paper.
- Timestamping in Linux.
- SR-IOV for NFV Solutions: An Intel white paper comparing performance with SR-IOV and OVS.
- DPDK Getting Started Guide for Linux.
- KVM4NFV Git Repository. This link can be used with the git clone command but cannot be accessed with a browser.
- Cyclictest.