
Smarter CPU Pinning

Author: Stephen Finucane

Virtualization promotes high utilization of resources and equipment. It allows for the kind of fast provisioning that large-scale applications need, and it improves reliability immeasurably. Most people can agree that virtualization is great. Unfortunately, virtualization-based solutions tend toward lower performance than equivalent bare-metal solutions. Technologies such as Intel® Virtualization Technology for IA-32, Intel® 64, and Intel® Architecture (Intel® VT-x) and Intel® Virtualization Technology for Directed I/O (Intel® VT-d) can reduce this performance penalty, but bare metal will often outperform virtualization. Without careful deployment configuration, virtualization-based solutions also tend toward non-determinism: one can ask for something to be done and it will be done, but one cannot generally say when it will be done. For most workloads, the benefits that virtualization brings in improved scale and lowered cost significantly outweigh any performance impact, and any potential indeterminism is likely to go unnoticed. However, both can matter for use cases like Network Functions Virtualization (NFV), where virtualized network functions running on industry-standard, high-volume servers are expected to replace proprietary hardware. For such latency-sensitive applications, the recommendations outlined later should be considered.

Enhanced Platform Awareness

There are many ways to deliver high-performance, deterministic virtualized solutions. Many modern Intel® processor-based server platforms include hardware capabilities that improve the performance of virtualization and of specific tasks like cryptography and networking. Exposing these features in virtualization-based environments, such as cloud deployments, is an important first step in addressing performance and latency concerns. To this end, Intel and the community have been working to drive Enhanced Platform Awareness extensions in OpenStack* to facilitate higher-performing, more efficient workloads in virtualized deployments. These extensions work by exposing information about host functionality to hypervisors and virtualized guests alike. However, there is far more that can be done.

NUMA Awareness

Modern, multi-socket x86 systems use a Non-Uniform Memory Access (NUMA) architecture, in which the placement of main memory modules relative to processors determines how quickly that memory can be reached. In a NUMA-based system, each processor has its own local memory controller that it can access directly, with a distinct performance advantage. At the same time, it can also access memory belonging to any other processor over a shared bus (or some other type of interconnect), but with poorer performance characteristics. This topology information is generally available to host operating systems and other host applications, which can use it to take advantage of the performance improvements that NUMA alignment offers. However, until recently this information was not available to the controller and therefore could not be considered when scheduling OpenStack instances. As part of the Juno and Kilo releases, work was undertaken in Nova to allow the controller to “understand” NUMA and thus allocate resources more efficiently.
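
For illustration, the host's NUMA layout can be inspected with standard Linux tools, and the Juno/Kilo work lets a flavor request a guest NUMA topology through extra specs. A minimal sketch, assuming the numactl package is installed on the host and using a hypothetical flavor named m1.numa:

# Inspect the host's NUMA layout (run on the compute node)
$ numactl --hardware

# Ask Nova to spread the guest's vCPUs and memory across two NUMA nodes
$ nova flavor-key m1.numa set hw:numa_nodes=2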

CPU Pinning

Awareness of the underlying architecture is a necessity for implementing features such as CPU pinning. Support for the CPU Pinning functionality was also added to OpenStack Nova in the Juno release and was further refined in the Kilo release. In short, this feature allows a user to take the virtual CPUs (vCPUs) used by a guest and tie them to the real, physical CPUs (pCPUs) of the host. This allows the controller to direct the host to dedicate some of the many cores in the Symmetric Multi-Processing (SMP)-enabled host to a guest, preventing resource contention with other guest instances and host processes. As a result, CPU pinning can dramatically improve the performance of guests and the applications they run. However, while CPU pinning implicitly provides NUMA topology awareness to the guest, it does not provide any awareness of other technologies that can impact performance, like Simultaneous Multi-Threading (SMT), or Intel® Hyper-Threading Technology on Intel® platforms.

Figure 1: Without CPU Pinning. vCPUs are “floating” across host cores.

Figure 2: With CPU Pinning: vCPUs are tied to pCPUs.
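
As a host-side complement to the pinning shown in Figure 2, compute nodes used for pinned guests are commonly configured to set aside a subset of physical CPUs for instances. A hedged sketch of the relevant nova.conf setting, assuming an eight-CPU compute node where CPUs 0 and 1 are left to host processes (the exact IDs are illustrative):

# /etc/nova/nova.conf on the compute node
[DEFAULT]
# Guest vCPUs may only be placed on these host CPUs; 0-1 remain for host processes
vcpu_pin_set = 2-7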

CPU thread pinning

The thread pinning feature provides SMT awareness in addition to the existing NUMA topology awareness. SMT differs from SMP in that SMT cores share a number of components with their thread siblings (SMP cores share only buses and some memory). CPU thread pinning differs from CPU pinning in how pCPU-vCPU bindings are chosen. With plain CPU pinning, pCPUs are chosen randomly or linearly, and no consideration is given to the differences in performance between SMP cores and SMT thread siblings. Thread pinning adds this consideration and allows a user to request a host with or without SMT.

CPU thread pinning provides awareness of thread siblings to the scheduler. There are three possible configurations that cater to different requirements and workloads. As stated above, thread siblings share a number of components. In some cases this sharing might not matter and might even benefit performance, and therefore thread siblings are used (the require case). Other workloads might see performance impacted by this contention, so non-thread siblings or non-SMT hosts are used (the isolate case). Finally, some workloads might not care, in which case a best-effort placement can be used (the prefer case).

Figure 3: The 'require' case. Thread siblings are used.

Figure 4: The 'require' case (cont). Scheduling will fail if thread siblings are not available.

Figure 5: The 'isolate' case. Scheduling will fail if there are not enough completely free CPUs.

Figure 6: The 'prefer' case. Thread siblings are used where available.

Figure 7: The 'prefer' case (cont). Non-siblings will be used if siblings are not free.
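
Each of these cases maps onto a flavor extra spec, used alongside the dedicated CPU policy that enables pinning. A brief sketch using the same extra-spec keys as the walkthrough below, against a hypothetical flavor named example-flavor:

# Pack vCPUs onto thread siblings; scheduling fails if siblings are unavailable
$ nova flavor-key example-flavor set hw:cpu_policy=dedicated hw:cpu_threads_policy=require

# Avoid thread siblings, reserving the unused siblings so nothing else lands on them
$ nova flavor-key example-flavor set hw:cpu_policy=dedicated hw:cpu_threads_policy=isolate

# Use thread siblings where available, non-siblings otherwise
$ nova flavor-key example-flavor set hw:cpu_policy=dedicated hw:cpu_threads_policy=prefer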

This feature was part of the original CPU pinning blueprint but was not implemented there. As a result, a new blueprint was proposed specifically to fill this gap, and code was later submitted to implement it. The implementation combines modifications to the Nova scheduler filters with modifications to the existing libvirt scheduling code in Nova. The filter modifications provide per-host checks for SMT support (or the lack thereof - the require case requires SMT on the host), ensuring that unsuitable hosts are quickly ruled out. The modifications to the libvirt scheduling code form the bulk of the change, implementing fitting algorithms that map a requested guest topology onto a host topology (which comes replete with information on the various host cores and their siblings). This code was submitted and merged during the Mitaka release.

How to Use It?

To validate this feature, an SMT-capable node is required. The instructions below use a single, all-in-one node but can be adjusted for other configurations. SMT must be enabled on the node(s) to test the require, isolate, and prefer cases described above. (The separate and avoid policies from the original proposal are not part of the implementation described here.) SMT support can be confirmed by running lscpu on the node.

$ lscpu | grep -i -E "^CPU\(s\):|core|socket"
CPU(s):                 8
Thread(s) per core:     2
Core(s) per socket:     4
Socket(s):              1

In this example, the platform uses a quad-core (four cores per socket), single-socket, SMT-enabled (two threads per core) processor. Given that this requirement has been satisfied, the next step is to create a new flavor. This flavor should be used for all VMs that you want to be pinned. A single new flavor is created below, but it is possible to modify existing flavors or create multiple new flavors as desired.

$ openstack flavor create --ram 2048 --vcpus 4 pinned

This flavor should be modified to include the required metadata for both the CPU policy and CPU threads policy. The require case is demonstrated here, but it is possible to experiment with other cases.

$ nova flavor-key pinned set hw:cpu_policy=dedicated
$ nova flavor-key pinned set hw:cpu_threads_policy=require

Finally, an instance should be booted using this flavor. Adjust the image name as appropriate.

$ openstack server create --image cirros-0.3.2-x86_64-uec --flavor pinned test_pinned_vm_a --wait

Let this instance boot. Once booted, you can use virsh on the node to validate that the placement has occurred correctly.

$ virsh list
Id         Name                               State
----------------------------------------------------
 1         instance-00000001                  running

$ virsh dumpxml 1
    ...
    <cputune>
      <vcpupin cpuset="0" vcpu="0">
      <vcpupin cpuset="4" vcpu="1">
      <emulatorpin cpuset="0,4">
    </emulatorpin></vcpupin></vcpupin></cputune>
    ...

In this case, libvirt has not only pinned the vCPUs to pCPUs, but it has also placed those vCPUs on thread siblings (CPUs 0 and 4 share a core on this host). This is the expected behavior. Attempting to boot a second instance will result in similar output, but with different cores used.
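
Before doing so, the sibling relationship can be double-checked against the host's sysfs topology. This is a quick sanity check rather than part of the original walkthrough; CPU 0 is used for illustration, and on the example topology above it should report 0 and 4 as siblings.

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,4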

$ openstack server create --image cirros-0.3.2-x86_64-uec --flavor pinned test_pinned_vm_b --wait
$ virsh dumpxml 2
    ...
    <cputune>
      <vcpupin cpuset="1" vcpu="0">
      <vcpupin cpuset="5" vcpu="1">
      <emulatorpin cpuset="1,5">
    </emulatorpin></vcpupin></vcpupin></cputune>
    ...

There is no overprovisioning for pinned instances, so it is possible to boot two more instances on this particular platform before all available resources are consumed. Note that unpinned instances do not respect pinning and might utilize these CPUs. Host aggregates should be used to separate the high-performance hosts used for pinned instances from the general-purpose hosts used for unpinned instances.
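
One common way to achieve this separation, sketched here with illustrative aggregate, host, and metadata names (and assuming the AggregateInstanceExtraSpecsFilter is enabled in the Nova scheduler), is to group the pinned hosts into a host aggregate and tie the pinned flavor to it:

$ nova aggregate-create pinned-hosts
$ nova aggregate-add-host pinned-hosts compute-1
$ nova aggregate-set-metadata pinned-hosts pinned=true
$ nova flavor-key pinned set aggregate_instance_extra_specs:pinned=true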

Summary

The CPU thread pinning feature is available for anyone to check out and experiment with right now. Comments and other feedback are welcome. This feature should further help users deploy high-performance, highly scalable workloads on OpenStack-powered clouds. Stay tuned for more information.
