Recent Enhancements in Intel® Virtualization Technology for Directed I/O (Intel® VT-d)
Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d) architecture was first introduced in 2006 with direct memory access (DMA) remapping as one of the main features. Since then, many other features have been added, such as interrupt remapping and x2APIC support (2007), Shared Virtual Memory (SVM) (2013), Posted Interrupt Support (2014), Five-Level Paging support (2017), and most recently, updates to support scalable I/O virtualization.
These newest updates significantly improve the ability to scale device virtualization to modern data center usages, such as cloud and containerized workloads. In this article, we highlight some of these new I/O Memory Management Unit (IOMMU) features. For more details, please refer to the architecture specification.
This article is the first in a planned series. As new features are upstreamed and merged into the Linux* kernel, we will describe their implementations in follow-up articles.
Direct Device Assignment
With the advances in virtualization technologies, data centers have centered on server consolidation. Processor virtualization technologies, such as Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x), provide well-isolated execution environments. Before the introduction of the IOMMU, I/O devices were shared between guest operating systems through some form of software-based virtualization, such as the virtio network and disk device drivers. With an IOMMU providing DMA isolation, it became possible to directly assign devices to specific guest OSes. This allowed full isolation of hardware resources while providing a more direct path to the device, eliminating the overheads associated with software-based techniques. The following figure shows how devices can be used inside a Virtual Machine (VM) with minimal host Virtual Machine Monitor (VMM) intervention. This approach eliminates most software overheads and provides near-native performance within the VM.
Figure 1. Virtualization usage of DMA remapping
Single Root I/O Virtualization (SR-IOV) Limitations
DMA remapping enables hardware-based I/O virtualization technologies, such as PCIe* SR-IOV, where a single device physical function (PF) can be configured to support multiple virtual functions (VFs). Each VF looks and feels like a real PCI endpoint from a software point of view. Since each VF has a unique Requester ID (RID), that is, a Bus/Device/Function (BDF), each VF can be assigned to a separate guest. The DMA remapping functionality provides isolation per RID.
While SR-IOV enabled the ability to partition a device and provide direct access to VMs, it also imposed scalability limitations to modern cloud and containerized environments. For instance:
- Device BARs and headers must be duplicated for every VF.
- PCIe limits resources; for example, MSI-X is limited to a maximum of 2048 vectors.
- BIOS/firmware must reserve resources, such as MMIO ranges and bus ranges, to accommodate hot-plugging of devices of any capability.
Because of these resource requirements, SR-IOV implementations typically provide only a small number of VFs; typical SR-IOV devices support 64 or fewer VFs per physical device. Lightweight containerized usages in modern cloud environments expect to run thousands of containers, putting pressure on these potentially scarce resources. In such environments, SR-IOV does not scale.
Limitations of SR-IOV based implementations include:
- Scalability - unable to scale to hyperscale usages (1000+ VMs/containers) due to the cost of additional on-board memory and limitations on bus numbers in certain platforms.
- Flexible Resource Management - SR-IOV requires resources such as bus numbers and MMIO ranges for the newly created VFs. Typically, those resources are spread evenly across the VFs. Although variable resource assignment to different VFs is conceivable, it imposes hardware complexity that would increase hardware cost. For instance, creating one VF with 2 hardware queues and another VF on the same physical device with 4 queues is generally not implemented.
- Composability - the motivation of SR-IOV is to enable direct VF pass-through. The guest driver has full control over the assigned VF, and the host/hypervisor has no insight into the device. This makes it difficult to live migrate a VF or snapshot its device state.
Even with these limitations, SR-IOV has worked well in traditional VM usage. However, this approach no longer meets the scaling requirements for containerized environments.
Intel® Scalable I/O Virtualization (Intel® Scalable IOV)
Intel recently updated Intel® VT-d to allow fine-grained capacity allocation. More specifically, the update allows software to compose virtual devices with different capacities or capabilities. Unlike SR-IOV, there is no need to replicate device hardware for each function; Intel® Scalable IOV allows software to compose a virtual device on demand. For a virtual device provisioned via software, device accesses are separated into a slow path (configuration) and a fast path (I/O). Any activity that involves configuration and control is handled by software mediation, while fast-path I/O goes directly to hardware with no software intervention. This allows resources such as queues to be bundled on demand, a model that fits both full machine virtualization and native container usages.
Intel® Scalable IOV requires changes in the following areas:
- Device Support - A device should support Process Address Space ID (PASID). The PASID is a 20-bit value used in conjunction with the Requester ID. PASID-granular resource allocation and the corresponding isolation requirements are identified in the Intel® Scalable I/O Virtualization Technical Specification.
- Interrupt Message Store (IMS) - provides devices the flexibility to define how interrupts are stored, without limitations on how many interrupts are supported or where the message address and data are stored.
- Platform Support - DMA remapping hardware should support PASID granular DMA isolation capability.
- System Software - Support in the Operating System to provide abstractions that allow such devices to be provisioned for a Guest OS, or native process consumption.
Intel® Scalable IOV addresses the aforementioned limitations observed on PCIe* SR-IOV:
- Scalability - supports finer-grained device sharing. For example, on a NIC with 1024 TX/RX queue pairs, each queue pair can now be independently assigned.
- Resource management - software fully manages and maps backend resources to virtual devices. This provides great flexibility for heterogeneous configurations (different resources, different capabilities, and others.)
- Composability - mediation of the slow path allows the host/hypervisor to capture virtual device state, enabling live migration and snapshot usages. Also, state save/restore is required only for a small portion of device resources (a queue, a context, etc.), which is easier to implement on a device than making the entire VF state migratable.
Key Feature Differences in Intel® VT-d 3.0
The recent update to the Intel® VT-d specification introduces several new features. Some of the key features that provide the scalability and composability benefits are listed below.
Extended Capability Register (Section 10.4.3 in the Intel® VT-d Architecture Specification)
- Scalable Mode Translation Support (SMTS) - Indicates the ability to support PASID granular device isolation.
- Virtual Command Support (VCS) - virtual registers intended to help virtualize the IOMMU. Unlike an SR-IOV device, where an entire device is exposed to a guest, the new model creates device instances using PASIDs. This requires the PASID space to be flat and global, so guest and host PASIDs must be the same. Only virtual IOMMUs exposed to a guest enumerate this capability; it provides an interface for the host to control allocation of the PASIDs used by a guest OS.
- Second Level Accessed/Dirty Support (SLAD) - allows hardware implementations supporting this feature to track pages that were accessed/modified by the device in second-level paging structures. This helps VM live migration manage memory migration when devices are directly assigned.
- Scalable Mode Page Walk Coherency (SMPWC) - indicates that the IOMMU page walks are cache coherent. Isochronous devices can benefit from page table entries not being snooped, which reduces memory traffic; for such devices, the corresponding control in the PASID context needs to be cleared. We expect most devices to have this capability set.
- PASID-granular First Level Translation Support (FLTS) and Second Level Translation Support (SLTS) - one of the key capabilities that allows devices to be composed to different VMs with different PASIDs. The IOMMU can support PASID-granular second-level translation without first-level support, permitting the use of Intel® Scalable IOV without the need to support Shared Virtual Memory.
- Some hypervisor implementations might choose to support a virtual IOMMU (vIOMMU) only for SVM purposes, with no intention of exposing IOVA (second-level) translations to the guest OS. For such implementations, the VMM has the flexibility to expose only the presence of the first-level table by setting FLTS=1 and SLTS=0, indicating that the guest cannot create any second-level translations. This still allows the guest to enable SVM.
Scalable Mode Translation
Scalable mode (SM) translation refers to the ability to have PASID granular context in both First Level (FL) and Second Level (SL).
Here are the key differences between Scalable Mode (SM) and the legacy Extended Context Support (ECS):
- Shared Virtual Memory Support - An earlier version of the Intel® VT-d specification introduced Extended Context in order to support SVM-capable devices. With the introduction of Scalable Mode, legacy ECS support has been deprecated from the specification; no production devices supported it, and its limitations are addressed by Scalable Mode. Scalable Mode addresses two main issues to provide the flexibility to compose devices:
- A two-level Process Address Space ID (PASID) table. This eliminates the requirement for a large monolithic PASID table in the IOMMU, allows system software to grow PASID tables dynamically, and reduces memory usage.
- In addition, each PASID entry contains a second-level page table. This permits software to partition a device using the PASID for isolation rather than the PCI RID (Bus/Device/Function).
Figure 2. SVM support via ECS (deprecated)
In the previous figure, the First Level (FL) is indexed by the virtual address shared with the CPU. The Second Level (SL) tables are indexed by the Guest Physical Address when the device is directly assigned to a VM. When the device is used natively by the OS, the SL is indexed by the I/O Virtual Address (IOVA) constructed by the DMA APIs.
Scalable Mode translation is shown in the following figure.
Figure 3. SVM support via new scalable mode
The key differences between the deprecated ECS mode and Intel® Scalable IOV mode are:
- The new RID_PASID field in the context entry is used by the IOMMU when translating DMA requests without a PASID. Essentially, the IOMMU always performs translations with a PASID: either the device is natively PASID capable, or one is assigned by the IOMMU driver. For the new class of PCI devices designed to be assignable to a guest OS, software can create Assignable Device Interfaces (ADIs) at PASID granularity. A PASID is a 20-bit number managed by the kernel or IOMMU driver.
- The new Scalable Mode PASID table is a two-level table, as shown above. The first level is a PASID directory table (indexed by the top 14 bits of the PASID), and the second level is a PASID table (indexed by the low 6 bits of the PASID). Each PASID table entry contains pointers to both first-level (FL) and second-level (SL) page tables. The IOMMU's ability to distinguish translation requests based on the RID and the PASID, rather than the RID alone, is the key difference that enables the new usage models for hyperscale containerized usages in cloud environments.
In Figure 3, the IOMMU uses the virtual address in a hierarchical fashion, similar to how the CPU uses a virtual address to locate a physical page; refer to Section 3.5 in the Intel® VT-d specification. An example of how the IOMMU uses a 48-bit address to locate a 4-KByte page is shown below.
Figure 4. IOMMU virtual address usage
- Indexed by Host Virtual Address bits: Refers to the native host using SVM. The IOMMU uses CPU page-tables for performing I/O. FL contains the Host Virtual Address (HVA) to Host Physical Address (HPA) translation.
- Indexed by Guest Virtual Address bits: Refers to guest use of SVM. In this case, the tables are used in a nested mode: the FL translates Guest Virtual Address (GVA) to Guest Physical Address (GPA), and the SL translates GPA to HPA.
- Indexed by Guest Physical (or IOVA) Address bits: Refers to guest use of GPA (without a vIOMMU), guest use of IOVA (with a vIOMMU), or host use of IOVA.
To accommodate the PASID-based structures, IOMMU hardware support, such as the invalidation architecture, has been extended to include the PASID in invalidations when the IOMMU is configured in Scalable Mode.
With the introduction of scalable mode translation, Intel® VT-d provides a very scalable approach to enable new I/O virtualization techniques. Intel® Scalable IOV largely removes the scalability restrictions observed on PCIe* SR-IOV.
The first Linux* implementation supporting the native IOMMU scalable mode is available in the upstream kernel.
Watch this space for future technical articles on Intel® VT-d. Some of our planned topics include:
- Mechanics of managing paging requests implementing Shared Virtual Memory (SVM) with Intel® VT-d
- Using Mediated Device Framework in managing Intel® Scalable IOV
- Virtual IOMMU (vIOMMU) high level architecture
- Interrupt Message Store (IMS)
References
- SR-IOV Primer
- Process Address Space ID (PASID) ECN
- Intel® Virtualization Technology for Directed I/O Architecture Specification
- Intel® Scalable I/O Virtualization Technical Specification
- Intel® Scalable I/O Virtualization LinuxCon Presentation by Kevin Tian