
OpenStack OVS-DPDK Tuning

By Igor DC and Matt Welch, June 23, 2020

Introduction

Data Plane Development Kit (DPDK) is a set of libraries and drivers that accelerate packet processing in software, without the need for custom hardware such as the ASICs found in routers and switches. Open vSwitch (OVS) is a virtual switch that supports standard management interfaces and a multitude of protocols. By default, it is used in OpenStack* to implement virtual networks. Open vSwitch can be paired with DPDK for enhanced data plane performance.

This document discusses system features and configurations that enable Open vSwitch to use DPDK and enhance network performance. The versions assumed below are Open vSwitch 2.11 and DPDK 18.11.

Enabling DPDK

Three steps are required to enable DPDK:

  1. Enable hugepages on compute nodes.
  2. Enable the networking-ovs-dpdk devstack plugin.
  3. Use nova flavor configuration to utilize hugepages.

Optionally, you can perform the following steps to enhance performance:

  1. Setup IOMMU for interface passthrough.
  2. Use 1GB hugepages for DPDK, 2MB hugepages for VMs.
  3. Choose the optimal driver for your environment.
  4. Enable CPU isolation to decrease resource contention.
  5. Set interface MTU.

Enable hugepages

To support the poll-mode driver (PMD) that DPDK uses, OpenStack compute nodes must enable larger memory pages, known as hugepages. Typical memory pages are 4096 bytes, whereas hugepages can be 2MB or 1GB, depending on the memory subsystem. DPDK can use either 2MB or 1GB hugepages, with 1GB recommended for higher performance in some scenarios. The following examples all set the system to have 8GB of available hugepages, but these quantities should be tuned to the amount of available memory on the host system and the memory requirements of DPDK-enabled virtual machines.

To temporarily enable hugepages, you can set the number of reserved hugepages by writing to the sysfs filesystem. Both examples below reserve 8GB of memory as hugepages and mount them (HugeTLBFS), each with a different page size.

Non-NUMA

4000 x 2MB hugepages:

echo 4000 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir /dev/hugepages2M
mount -t hugetlbfs -o pagesize=2M none /dev/hugepages2M


8 x 1GB hugepages:

echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
mkdir /dev/hugepages1G
mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G
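
Either way, the reservation can be checked afterwards. Note that the HugePages_* counters in /proc/meminfo reflect the default hugepage size only; per-size counters live under /sys/kernel/mm/hugepages/. A quick sanity check:

# Default-size hugepage counters (HugePages_Total, HugePages_Free, Hugepagesize)
grep Huge /proc/meminfo

# Per-size counters, covering both the 2MB and 1GB pools
cat /sys/kernel/mm/hugepages/hugepages-*/nr_hugepages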


To persistently enable hugepages through system reboots, add the following to the kernel command line:

hugepagesz=1G hugepages=8
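
On GRUB-based systems, one way to make these options persistent (a minimal sketch; the configuration file and regeneration command vary by distribution) is:

# 1. Add "hugepagesz=1G hugepages=8" to the GRUB_CMDLINE_LINUX line in /etc/default/grub
# 2. Regenerate the GRUB configuration and reboot:
sudo update-grub    # Debian/Ubuntu; on Fedora/RHEL: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot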

NUMA

Example for first NUMA node (node0):

4000 x 2MB hugepages:

echo 4000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
mkdir /dev/hugepages2M
mount -t hugetlbfs -o pagesize=2M none /dev/hugepages2M


8 x 1GB hugepages:

echo 8 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
mkdir /dev/hugepages1G
mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G
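
As a sketch, the same 1GB allocation can be repeated for every node of a two-socket system (node names node0 and node1 assumed here) and then verified per node:

# Reserve 8 x 1GB hugepages on each NUMA node (run as root)
for node in node0 node1; do
    echo 8 > /sys/devices/system/node/$node/hugepages/hugepages-1048576kB/nr_hugepages
done

# Show the per-node 1GB hugepage counts
cat /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/nr_hugepages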


Important: To use 1GB pages at runtime, the kernel command-line options above must also be added.

Also, hugepages should be reserved on each NUMA node where virtual or physical interfaces will be used with OVS-DPDK.

For more information on hugepages in DPDK and in OpenStack, refer to:

http://doc.dpdk.org/spp/setup/getting_started.html
https://docs.openstack.org/nova/pike/admin/huge-pages.html

Enable devstack plugin

To use OVS-DPDK with OpenStack, we must enable the networking-ovs-dpdk devstack plugin. This is accomplished by adding the following line to devstack/local.conf (after git-cloning and checking out devstack):

enable_plugin networking-ovs-dpdk https://opendev.org/x/networking-ovs-dpdk master
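
After stacking completes, a quick sanity check that OVS was built and initialized with DPDK is to query the Open_vSwitch table (a sketch; the dpdk_initialized and dpdk_version columns are assumed to be present, as they are in OVS 2.11):

# The version string should mention DPDK when OVS is built with DPDK support
sudo ovs-vswitchd --version

# Should report "true" once the datapath has initialized DPDK
sudo ovs-vsctl get Open_vSwitch . dpdk_initialized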

Use nova flavors with large memory pages

Flavors are used by OpenStack nova to describe the resources that must be available to provision a given VM. DPDK requires that OpenStack VMs use hugepages. In OpenStack terms, we must set their flavors to use "large" memory pages with the following command:

openstack flavor set <flavor_name> --property hw:mem_page_size=large

The flavor being configured must have a memory requirement that is an integer multiple of the memory page size. For example, the flavor 'm1.medium', usually set to use 4GB of memory, can use either 2MB or 1GB hugepages since its memory requirement is evenly divisible by both page sizes. Small-memory flavors like m1.nano, set up by devstack with only 64MB of memory, can only use 2MB hugepages. The memory page size for a flavor may also be set with one of the following in place of the 'large' keyword: small, 4KB, 2MB, 2048, 1GB. With small or 4KB, the VM will not request hugepages at all.
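
As a concrete (hedged) example using devstack's stock m1.medium flavor:

# Ask for hugepage-backed memory on the m1.medium flavor (4096 MB of RAM)
openstack flavor set m1.medium --property hw:mem_page_size=large

# Verify the property and confirm the RAM size is a multiple of the hugepage size
openstack flavor show m1.medium -c ram -c properties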

Failure to set the memory requirements properly leads to failures scheduling VMs onto compute hosts, with errors such as "No valid host was found".

For more info on nova flavors, go to https://docs.openstack.org/nova/latest/user/flavors.html

DPDK performance optimizations

Here are some recommendations to enable high-performance network connections with networking-ovs-dpdk:

  1. Enable the IOMMU for interface passthrough.
  2. DPDK requires hugepages for operation and can be configured with either 2MB or 1GB hugepages. For more information, refer to https://docs.openstack.org/openstack-ansible-os_neutron/latest/app-openvswitch-dpdk.html
  3. Use the recommended driver for DPDK, vfio-pci, when running on a bare metal host.
  4. Use CPU isolation to decrease resource contention.
  5. Increase the MTU for improved bandwidth (configured via OVS).

Setup IOMMU

To persistently enable the IOMMU in the Linux* kernel, add the following to the kernel command line on Intel CPUs:

intel_iommu=on iommu=pt
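
After rebooting, a hedged sanity check that the IOMMU is active (messages and group layout vary by kernel and platform):

# The kernel log should show DMAR/IOMMU initialization on Intel platforms
dmesg | grep -i -e DMAR -e IOMMU

# Non-empty IOMMU groups indicate the IOMMU is usable for device passthrough
ls /sys/kernel/iommu_groups/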

Setup hugepages

Although networking-ovs-dpdk operates correctly and performs similarly whether the host is configured with 2MB or 1GB hugepages, 1GB pages are recommended when allocating more than roughly 64GB of hugepages in total, because the initial allocation of that many 2MB pages takes noticeably longer. Additionally, in some scenarios performance does improve with the 1GB hugepage size.

For more information, go to https://doc.dpdk.org/guides/linux_gsg/sys_reqs.html#use-of-hugepages-in-the-linux-environment

It's also important to use a second pool of hugepages for the VMs, leaving the main pool dedicated to Open vSwitch. The recommendation stems from how DPDK restarts: it first allocates all available hugepages, then selects a contiguous block and frees the unneeded pages. A VM sharing the same hugepage pool may therefore fail to boot while the vswitch is restarting, so it's best to use separate pools.

The recommendation to specify a second pool of hugepages for the VMs applies to libvirt (the default virtualization API used by OpenStack/Nova). By default, libvirt will look at all available HugeTLBFS mount points and use any of them as the huge page allocation for VMs. After following the commands above on how to enable hugepages, the system should have at least one HugeTLBFS mount point, for instance:

$ mount | grep huge
hugetlbfs on /dev/hugepages1G type hugetlbfs (rw,relatime,seclabel)


At this point, a second mountpoint should be created manually, for example one using the 2M hugepage size:

mount -t hugetlbfs -o pagesize=2M none /dev/hugepages2M
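
To make this mount persist across reboots, one option is an /etc/fstab entry such as the following sketch (note that /dev is recreated at boot, so the target directory must exist before the mount runs, or a path outside /dev can be used instead):

# /etc/fstab entry for a persistent 2MB HugeTLBFS mount dedicated to VMs
none  /dev/hugepages2M  hugetlbfs  pagesize=2M  0  0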


Finally, the most important part of setting up this separate pool for VMs is to edit the file /etc/libvirt/qemu.conf and point the following setting at the new mount point:

hugetlbfs_mount = "/dev/hugepages2M"


This will tell libvirt to use the new mount point instead of the one originally created (1G).


Make sure to restart libvirtd after making the configuration change above:

systemctl restart libvirtd


The qemu.conf documentation can be found at https://github.com/libvirt/libvirt/blob/master/src/qemu/qemu.conf

Additional information can also be found on LWN at https://lwn.net/Articles/376606/, including how to use the hugeadm tool to manage multiple HugeTLBFS mount points.

Driver choice

Until recently, the default driver for DPDK has been igb_uio, but it is being deprecated in favor of the vfio-pci driver, which offers better performance and security.

Below is a list of the available drivers that support DPDK in OpenStack. For bare metal hosts that have an IOMMU and are capable of UEFI boot, the vfio-pci driver is recommended for the highest performance and security. The default option is uio_pci_generic, which simplifies setup and enables operation inside virtual machines.

  1. uio_pci_generic
    • Default option
    • Lowest performance
    • Simpler kernel configuration
    • Admins should disable IOMMU with intel_iommu=off
  2. igb_uio
    • Higher performance
    • Suitable when devices lack support for legacy interrupts
    • Simpler kernel configuration
  3. vfio-pci
    • Highest security and performance on a bare metal host
    • Useful when UEFI secure boot is enabled, which may disallow the use of UIO modules on the system
    • Requires an IOMMU with intel_iommu=on iommu=pt

To configure networking-ovs-dpdk for a particular driver, add the following to devstack/local.conf:

OVS_INTERFACE_DRIVER=<driver_name>
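
Outside of devstack, a physical NIC is typically bound to the chosen driver with DPDK's dpdk-devbind.py tool. A minimal sketch, assuming vfio-pci and a hypothetical PCI address of 0000:04:00.0:

# Load the vfio-pci kernel module
sudo modprobe vfio-pci

# List network devices and the drivers they are currently bound to
sudo dpdk-devbind.py --status

# Bind the example NIC to vfio-pci
sudo dpdk-devbind.py --bind=vfio-pci 0000:04:00.0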

For more information, refer to https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html

CPU isolation

For maximum performance with OVS-DPDK, use CPUs that have the fewest other processes contending for their resources. In Linux, you can do this by removing a set of CPUs from the pool available to the Linux scheduler with the following steps:

  1. Remove cores from the kernel
  2. Assign cores to OVS and PMD threads
  3. Assign cores to nova cpu_shared_set
  4. Assign cores to nova cpu_dedicated_set

Example: the system has 2 sockets with 14 cores each. The following is one possible allocation, and will be assumed for the rest of this section:

  • Cores 0,1,14,15 are available for the kernel scheduler.
  • Cores 2,3 are assigned to OVS threads.
  • Cores 4,5 are assigned to PMD threads.
  • Cores 6-13 are available for DPDK-enabled VMs.
  • Cores 16-27 are available for non-DPDK VMs.

For optimal DPDK network performance, the vCPUs used by OpenStack VMs should run on physical cores that have been isolated from the host system. This is accomplished with kernel command-line flags that tell the host kernel not to schedule processes on certain CPU cores. In addition to the hugepage flags described above, add the following to the kernel command line:

isolcpus=2-13

For the 14-core CPUs in this example, this flag removes pCPUs 2-13 from the scheduler, but those CPUs remain available to OpenStack VMs.
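
Putting together the kernel command-line settings used throughout this document, the combined additions for the example system would look like the following sketch (values should be tuned to your own hardware):

hugepagesz=1G hugepages=8 intel_iommu=on iommu=pt isolcpus=2-13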

One important thing to note here is that VMs on one NUMA node should have their interfaces associated with PMD threads on that same NUMA node; otherwise there will be a significant performance penalty.

OpenStack nova has two classifications describing how pCPUs should be allocated: shared and dedicated. Shared pCPUs may be time-sliced among multiple VMs, whereas dedicated pCPUs are assigned solely to a single VM. To enable specific CPUs as shared or dedicated, use the following in the nova configuration or in devstack's local.conf (again, using the 14-core CPUs from above):

  • cpu_shared_set = 16,17,18,19,20,21,22,23,24,25,26,27
  • cpu_dedicated_set = 6,7,8,9,10,11,12,13

The above configuration tells nova that any CPU in the shared set may be shared among multiple VMs, while each CPU in the dedicated set may only be assigned to a single VM. Note that the shared set additionally includes CPUs from the second NUMA node, but the dedicated set does not. This is because the DPDK-enabled interface in this example is on NUMA0, so any VM using DPDK should also be placed on that node to avoid passing data between sockets.
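
A minimal sketch of the corresponding nova configuration, assuming a nova release where these options live in the [compute] section (in devstack, this can be placed under a [[post-config|$NOVA_CONF]] block in local.conf):

[compute]
# NUMA1 pCPUs that may be time-sliced among non-DPDK VMs
cpu_shared_set = 16-27
# NUMA0 pCPUs reserved for pinned, DPDK-enabled VMs
cpu_dedicated_set = 6-13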

It's also important to note that the pCPUs intended for the OVS and DPDK PMD threads should be set in the devstack local.conf file:

Set affinity for OVS threads to cores 2 and 3 (0000 1100 = 0xC):

OVS_CORE_MASK=0xC

Set affinity for PMD threads to cores 4 and 5 (0011 0000 = 0x30):

OVS_PMD_CORE_MASK=0x30
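
These devstack variables roughly correspond to OVS's own other_config keys, which can also be set directly on a running OVS-DPDK instance (a sketch):

# Pin non-PMD OVS/DPDK lcore threads to cores 2-3
sudo ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0xC

# Pin the DPDK poll-mode driver (PMD) threads to cores 4-5
sudo ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x30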

For more information on setting OVS parameters in local.conf, refer to https://opendev.org/x/networking-ovs-dpdk/src/branch/master/doc/source/usage.rst

Set maximum MTU

With DPDK assisting network operations, network interfaces can achieve higher bandwidth, increasing the rate at which packets are put on the wire. A typical network configuration, however, limits the maximum throughput: in a typical Ethernet network, the maximum transmission unit (MTU) is usually set to 1500 bytes, so packets larger than the MTU are fragmented and sent as multiple packets, at the cost of aggregate throughput. To overcome this limitation, enable jumbo Ethernet frames by raising the MTU, up to 9000 bytes, on every interface belonging to each "bottlenecked" datapath, whether DPDK or Linux interfaces, or interfaces inside the VM itself. A larger MTU should increase bandwidth, but it must be adjusted for the local network environment, since switches along the path also need to allow jumbo frames or packets risk further fragmentation.

To set MTU for a non-DPDK interface, execute the following command:

sudo ip link set mtu 9000 dev enp4s0f0

To set MTU for an OVS-DPDK interface, execute the following command:

sudo ovs-vsctl -- set Interface "enp4s0f0" mtu_request=9000
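
To confirm that OVS applied the requested MTU, the read-only mtu column of the Interface table can be queried (a sketch; the column is available in recent OVS releases, including 2.11):

sudo ovs-vsctl get Interface "enp4s0f0" mtu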

For more information, refer to http://docs.openvswitch.org/en/latest/topics/dpdk/jumbo-frames/ and to https://software.intel.com/en-us/articles/jumbo-frames-in-open-vswitch-with-dpdk.

Conclusion

We hope these recommendations make your experience with OpenStack, Open vSwitch, and DPDK easier and improve your overall performance.