
Resource Allocation in Intel® Resource Director Technology

By Fenghua Yu, March 29, 2017

Abstract

The Intel® Resource Director Technology (Intel® RDT) feature set provides two sets of capabilities: allocation and monitoring. This article describes how to allocate Intel® RDT resources in Linux*. Linux kernel 4.10 introduces the first implementation of the Intel RDT allocation infrastructure under the resctrl file system, with support for L3 CAT (Cache Allocation Technology), L3 CDP (Code and Data Prioritization), and L2 CAT. Linux kernel 4.12 is on track to add support for MBA (Memory Bandwidth Allocation).

Introduction

Quality of Service (QoS) techniques [1][2][3][4] have been proposed to address shared resource contention between co-running applications or virtual machines in cloud/server environments. By reducing the resource contention from noisy neighbors, the QoS techniques provide better resource isolation and consistent performance, avoid periodic long latency to meet Service Level Agreements (SLA), and potentially improve overall throughput.

Intel RDT provides a set of allocation (resource control) capabilities including Cache Allocation Technology (CAT), Code and Data Prioritization (CDP), and Memory Bandwidth Allocation (MBA). The Intel® Xeon® processor E5-xxxx v4 family (and certain communication-focused processors from the Intel® Xeon® processor E5-xxxx v3 family) introduces capabilities to configure and make use of the CAT mechanisms on the L3 cache. Some Intel® platforms, in particular the Intel Atom® processor family, might also provide support for control over the L2 cache. Separately, the MBA feature provides approximate and indirect core-level control over the memory bandwidth available to CPU cores.

To enable resource allocation in Linux, we introduce resctrl as an interface between the kernel and user space. L3 CAT, L3 CDP, and L2 CAT, together with the extensible resctrl infrastructure, are enabled in Linux kernel 4.10. MBA support is under development [6][7][8] and is on track to land in Linux kernel 4.12; it leverages the existing Intel RDT and resctrl infrastructure.

Intel® Resource Director Technology (Intel® RDT) allocation architecture

The fundamental goal of Cache Allocation Technology is to enable resource allocation based on application priority or Class of Service (COS or CLOS). Applications or individual threads can be assigned to one of a set of Classes of Service that the processor exposes. Cache allocation for the respective applications or threads is then restricted based on the class with which they are associated. Each Class of Service can be configured using capacity bitmasks (CBMs), which represent capacity and indicate the degree of overlap and isolation between classes. For each logical processor there is a register exposed (referred to here as the IA32_PQR_ASSOC MSR or PQR) to allow the operating system (OS) or virtual machine manager (VMM) to specify a CLOS when an application, thread, or virtual machine (VM) is scheduled.

Figure 1 demonstrates the L3 CAT hardware work flow. The CLOSID field in the PQR MSR indexes into the IA32_L3_MASK_n MSR array to retrieve the CBM, which specifies the allocated portion of the L3 cache. The OS or VMM sets up the IA32_L3_MASK_n array and assigns a CLOSID to each task; the CLOSID is loaded into the PQR MSR at the task's context switch. We will discuss the software usage in later sections.

 

Figure 1: L3 CAT

The usage of Classes of Service (COS) is consistent across resources: a COS can have multiple resource control attributes attached, which reduces software overhead at context switch time. Rather than adding new types of COS tags per resource, the same COS is reused across resources, so the COS management overhead stays constant. Cache allocation for the indicated application, thread, or VM is then automatically controlled by the hardware based on the class and the bitmask associated with that class.

Figure 2 demonstrates the work flow for allocating two resources: L3 cache and memory bandwidth. When a task is scheduled on a CPU, its CLOSID is loaded into the PQR MSR. The same CLOSID then indexes into the IA32_L3_MASK_n MSR array to locate the CBM for L3 CAT, and into the IA32_L2_QoS_Ext_BW_Thrtl_n MSR array to locate the delay value for MBA. The CBMs and the delay values are set up by the OS/VMM before the task is created and can be changed dynamically during the task's runtime.

In the example in the figure, CLOSID 1 in the PQR MSR indexes into the IA32_L2_QoS_Ext_BW_Thrtl_n MSR array to fetch delay value 10 and into the IA32_L3_MASK_n MSR array to fetch CBM 0x00FF0. When a task runs on CPU 1, its memory accesses are therefore throttled by delay value 10 and it can only use the allocated L3 portion represented by CBM 0x00FF0.

 

Figure 2: Multi-Resource Allocation: L3 CAT and MBA
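
The QoS MSRs described above are architectural and can be inspected directly, for example with the msr-tools package. This is only an illustrative sketch (MSR addresses are from the Intel SDM [5]); the supported way to manage allocation is the resctrl interface described below:

# modprobe msr
# rdmsr -p 1 0xc8f   # IA32_PQR_ASSOC on CPU 1; the CLOSID is in bits 63:32
# rdmsr 0xc91        # IA32_L3_MASK_1, the L3 CBM used by CLOSID 1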

Enabling Intel RDT resource allocation in Linux

Intel RDT resource allocation infrastructure, including L3 CAT/CDP and L2 CAT, was first enabled in Linux* kernel 4.10. The hardware and software infrastructure is extensible to future features as well.

The infrastructure is based on the resctrl file system, which acts as the user interface. The user, usually a system administrator, allocates resources through the resctrl interface, which calls into the kernel to configure the QoS MSRs and the CLOSIDs for tasks or CPUs. When a CPU schedules a task, the task's CLOSID is loaded into the PQR MSR as part of the context switch. While the task runs, the allocated cache is specified by the CBM indexed by the CLOSID, and the allocated memory bandwidth by the delay value, likewise indexed by the CLOSID.

Resctrl file system

This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig option and indicated by the x86 /proc/cpuinfo feature flags rdt, cat_l3, and cdp_l3.
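
Before mounting the file system, you can confirm that the kernel and processor advertise these capabilities by searching for the flags listed above, for example:

# grep -ow -e rdt -e cat_l3 -e cdp_l3 /proc/cpuinfo | sort -u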

Use the following command to mount the file system and use the feature:

 # mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl

The mount option, cdp, enables code/data prioritization in L3 cache allocations.

Once mounted, the /sys/fs/resctrl directory has an info directory, a tasks file, a cpus file, and a schemata file.

Info directory

The info directory contains information about the enabled resources. Each resource has its own subdirectory. The subdirectory names reflect the resource names. Each subdirectory contains the following files:

  • num_closids: The number of CLOSIDs that are valid for this resource. The kernel uses the smallest number of CLOSIDs of all enabled resources as a limit.
  • cbm_mask: The bitmask that is valid for this resource. This mask is equivalent to 100%.
  • min_cbm_bits: The minimum number of consecutive bits that must be set when writing a mask.
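
For example, on a hypothetical system exposing 16 CLOSIDs and a 20-bit L3 mask, reading these files might look like this (the values shown are illustrative and vary by processor):

# cat /sys/fs/resctrl/info/L3/num_closids
16
# cat /sys/fs/resctrl/info/L3/cbm_mask
fffff
# cat /sys/fs/resctrl/info/L3/min_cbm_bits
1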

Resource groups

Resource groups are represented as directories in the resctrl file system. The default group is the root directory. The system administrator can create additional groups as desired using the mkdir(1) command and remove them using the rmdir(1) command.

Three files are associated with each group:

  • tasks: A list of tasks that belong to this group. Tasks can be added to a group by writing the task ID to the tasks file (which automatically removes them from the group they previously belonged to). New tasks created by fork(2) and clone(2) are added to the same group as their parent task. If a PID is not in any subpartition, it is in the root or default partition.
  • cpus: A bitmask of logical CPUs assigned to this group. Writing a new mask can add or remove CPUs from this group. Added CPUs are removed from their previous group. Removed CPUs are given to the default (root) group. You cannot remove CPUs from the default group.
  • schemata: A list of all the resources available to this group. Each resource has its own line and format. See below for details.
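
Putting these three files together, a minimal sketch of creating and populating a group (the group name, PID, and CPU mask below are placeholders):

# mkdir /sys/fs/resctrl/grp1
# echo 1234 > /sys/fs/resctrl/grp1/tasks   # move task 1234 into grp1
# echo 3 > /sys/fs/resctrl/grp1/cpus       # also assign CPUs 0 and 1 to grp1
# cat /sys/fs/resctrl/grp1/schemata        # inspect grp1's allocations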

The following rules define what resources are available to running tasks:

  1. If the task belongs to a non-default group, then the schemata for that group is used.
  2. If the task belongs to the default group but is running on a CPU that is assigned to some specific group, then the schemata for the CPU's group is used.
  3. Otherwise the schemata for the default group is used.

Schemata files general concepts

Each line in the file describes one resource. The line starts with the name of the resource, followed by specific values to be applied for each instance of that resource on the system.

Cache IDs

Current generation systems contain one L3 cache per socket, and L2 caches are generally shared only by the hyperthreads on a core, but this isn't an architectural requirement: there could be multiple separate L3 caches on a socket, or multiple cores could share an L2 cache. So instead of using the socket or core to define the set of logical CPUs sharing a resource, we use a cache ID. At a given cache level this is a unique number across the whole system (though it isn't guaranteed to be a contiguous sequence, and there might be gaps). To find the ID for each logical CPU, look in /sys/devices/system/cpu/cpu*/cache/index*/id.
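
For example, to list the L3 cache ID for each logical CPU (index3 is typically the L3 cache):

# grep . /sys/devices/system/cpu/cpu*/cache/index3/id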

Cache Bit Masks (CBM)

For cache resources, we describe the portion of the cache that is available for allocation using a bitmask. The maximum value of the mask is defined by each CPU model (and might be different for different cache levels). It is found using CPUID, but is also provided in the info directory of the resctrl file system in info/{resource}/cbm_mask. The RDT architecture requires that these masks have all the 1 bits in a contiguous block. So 0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9 and 0xA are not. On a system with a 20-bit mask, each bit represents 5% of the capacity of the cache. You could partition the cache into four equal parts with the following masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
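
As a quick sanity check, those four quarter masks are each contiguous, pairwise disjoint, and OR together to the full 20-bit mask:

# printf '%x\n' $(( 0x1f | 0x3e0 | 0x7c00 | 0xf8000 ))
fffff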

L3 details (code and data prioritization disabled)

With CDP disabled, the L3 schemata format is:

L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 details (CDP enabled via mount option to resctrl)

When CDP is enabled, L3 control is split into two separate resources so you can specify independent masks for code and data like this:

L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
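
For example, assuming a 20-bit mask on both cache IDs, the following single write gives a group the lower 16 bits of the mask for data while confining its code to the upper 4 bits:

# printf 'L3data:0=0ffff;1=0ffff\nL3code:0=f0000;1=f0000\n' > p0/schemata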

L2 details

L2 cache does not support code and data prioritization, so the schemata format is always:

L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
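
On a hypothetical system with an 8-bit L2 mask and two L2 cache IDs, giving a group the lower half of each L2 cache might look like:

# echo "L2:0=0f;1=0f" > p0/schemata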

Example 1

On a two-socket machine (one L3 cache per socket) with just four bits for cache bit masks:

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts of all caches (its schemata file reads L3:0=f;1=f).

Tasks that are under the control of the p0 group can only allocate from the "lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. Tasks in the p1 group use the "lower" 50% of cache on both sockets.

 

Figure 3: Three L3 partitions on a two-socket platform in Example 1

Example 2

Here we have two sockets again, but this time with a more realistic 20-bit mask.

Consider two real-time tasks, pid 1234 and pid 5678, each running on a dedicated processor on socket 0 of a two-socket, dual-core machine. To avoid noisy neighbors, each of the two real-time tasks exclusively occupies one quarter of the L3 cache on socket 0.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper" 50% of the L3 cache on socket 0 cannot be used by ordinary tasks:

# echo "L3:0=3ff;1=fffff" > schemata

Next we make a resource group for our first real time task and give it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real-time task into this resource group. We also use taskset(1) to ensure the task always runs on a dedicated CPU on socket 0. Most uses of resource groups will also constrain which processors tasks can run on.

# echo 1234 > p0/tasks
# taskset -cp 1 1234

The same applies for the second real-time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2 5678

 

Figure 4: L3 allocation for two processes on a two-socket platform in Example 2

Example 3

Consider a single-socket system with real-time tasks running on cores 4 through 7 and a non-real-time workload assigned to cores 0 through 3. The real-time tasks share text and data, so a per-task association is not required. Because the real-time tasks interact with the kernel, we want the kernel running on those cores to share the L3 allocation with the tasks.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper" 50% of the L3 cache on socket 0 cannot be used by ordinary tasks:

# echo "L3:0=3ff" > schemata

Next we make a resource group for our real-time cores and give it access to the "top" 50% of the cache on socket 0.

# mkdir p0
# echo "L3:0=ffc00;" > p0/schemata

Finally we move cores 4 through 7 over to the new group and make sure that the kernel and the tasks running there get 50% of the cache.

# echo f0 > p0/cpus

 

Figure 5: Dedicated L3 portions for CPU sets on a single-socket platform in Example 3

Locking between applications

Certain operations on the resctrl file system (for example, reads and writes spanning multiple files) must be atomic, so locking between applications is needed; this can be done with flock(2).
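
A minimal sketch using the flock(1) utility, taking the lock on the mount point (the group name and mask values are placeholders): hold an exclusive lock while modifying the directory structure, and a shared lock while reading it.

# flock -x /sys/fs/resctrl -c 'mkdir /sys/fs/resctrl/p2 && echo "L3:0=3;1=3" > /sys/fs/resctrl/p2/schemata'
# flock -s /sys/fs/resctrl -c 'cat /sys/fs/resctrl/p2/schemata'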

Performance improvements

CAT is used to reduce cache contention and improve overall performance in high performance computing (HPC) and cloud environments. Reference [9] demonstrates how CAT reduces cache contention and improves performance in benchmarks such as SPEC CPU2006, bzip2, and DPDK. Reference [10] shows a dramatic latency drop in the YouTube* transcoder when a dedicated cache partition is allocated to the workload through the resctrl interface to shield it from noisy neighbors.

One example from [9] uses the SPEC CPU2006 suite. Figure 6 illustrates the results for four applications running simultaneously: without Cache Allocation Technology, the slowdown for a candidate application can reach nearly four and a half times. This slowdown can be addressed with CAT, since the high priority application can run in a dedicated portion of L3 without being disturbed by noisy neighbors. In Figure 7, 6 MB of L3 is allocated to the high priority instance of the application and the remaining 2 MB of L3 is allocated to the three instances of noisy neighbor applications.

Take a closer look at the worst case: running only one instance of sphinx3 compared to running one high priority instance of sphinx3 along with three identical leslie3d instances as noisy neighbors. Without CAT (as shown in Figure 6), all instances compete for the shared L3, and the low priority instances can evict cache lines used by the high priority instance, slowing it down by nearly four and a half times. With CAT (as seen in Figure 7), one dedicated portion of L3 (CBM=0xFFF0, 6 MB) is allocated to the high priority instance and all other instances run in the remaining portion (CBM=0x000F, 2 MB). Although the high priority instance is not allocated the full L3 cache in Figure 7 compared to Figure 6 (6 MB is less than 8 MB), its slowdown is reduced to one and a half times because no noisy neighbors compete for its cache. Some slowdown remains due to other shared resources such as memory bandwidth, which can be reduced using Memory Bandwidth Allocation (MBA), a feature not discussed in detail in this article.

 

Figure 6: Four CPU2006 Applications Running on Haswell Client

Figure 7: CAT Significantly Reduces Contention for Four CPU2006 Applications Running on Haswell Client

References

  1. K. Aisopos, J. Moses, R. Illikkal, R. Iyer, and D. Newell, "PCASA: Probabilistic Control-Adjusted Selective Allocation for Shared Caches," Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2012.
  2. Enrique Castro-Leon and Robert Harmon, Cloud as a Service: Understanding the Service Innovation Ecosystem, 2016.
  3. Noisy Neighbors, Isolation and QoS in Cloud Infrastructures.
  4. Eddy Caron and Jonathan Rouzaud-Cornabas, "Improving Users' Isolation in IaaS: Virtual Machine Placement with Security Constraints," 2014 IEEE International Conference on Cloud Computing.
  5. Intel® 64 and IA-32 Architectures Software Developer's Manual.
  6. Fenghua Yu, "Resource Allocation: Intel Resource Director Technology (RDT)," LinuxCon + ContainerCon Japan, July 14, 2016.
  7. Fenghua Yu, "Resource Allocation: Intel Resource Director Technology (RDT)," LinuxCon + ContainerCon North America, August 23, 2016.
  8. Linux kernel source code.
  9. Andrew Herdrich, Edwin Verplanke, Priya Autee, Ramesh Illikkal, Chris Gianos, Ronak Singhal, and Ravi Iyer, "Cache QoS: From Concept to Reality in the Intel® Xeon® Processor E5-2600 v3 Product Family," IEEE, 2016.
  10. Rohit Jnagal and David Lo, "CAT @ Scale: Deploying Cache Isolation in a Mixed-workload Environment," LinuxCon + ContainerCon North America, August 22, 2016.