Assignable Interfaces in Intel® Scalable I/O Virtualization in Linux*
Intel® Scalable I/O Virtualization (Intel® Scalable IOV) is a set of technologies that aims to provide a lightweight, scalable approach to hardware-assisted I/O virtualization. Recent enhancements to Intel® Virtualization Technology for Directed I/O (Intel® VT-d) hardware are summarized in Recent Enhancements in Virtualization Technology for Directed I/O. This article focuses on the creation and management of Scalable IOV Assignable Device Interfaces (ADIs) via the Linux Mediated Device Framework.
An Intel Scalable IOV-capable device can be configured to group its resources into multiple isolated Assignable Device Interfaces (ADIs). Direct Memory Access (DMA) transfers from and to each ADI are tagged with a unique Process Address Space Identifier (PASID). The device’s capability depends on a new mode in Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel VT-d), called Scalable Mode, which is defined in Intel VT-d 3.0. When Intel VT-d works in this mode, DMA requests are translated at PASID granularity, so requests tagged with different PASIDs can be isolated and protected from each other. This makes it possible to assign an ADI to a user-level application or Virtual Machine (VM) in the same way as a PCIe* device or a virtual function (VF) of a Single-Root I/O Virtualization (SR-IOV) capable device. An Intel Scalable IOV-capable device reports this capability through a PCIe* Designated Vendor-Specific Extended Capability (DVSEC) defined in the Intel Scalable IOV technical specification.
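The DVSEC-based discovery mentioned above can be sketched in user space as a walk over a snapshot of PCIe extended configuration space. This is a minimal sketch, not kernel code: the helper names (`cfg_read16`, `cfg_read32`, `cfg_put_ext_cap`, `siov_find_dvsec`) are hypothetical, and the DVSEC ID value of 5 for Intel Scalable IOV is an assumption that should be checked against the Intel Scalable IOV technical specification. The header layout (capability ID 0x23, vendor ID at offset 4, DVSEC ID at offset 8) follows the PCIe DVSEC definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PCI_CFG_SPACE_EXP_SIZE 4096   /* size of PCIe extended config space */
#define PCI_EXT_CAP_START      0x100  /* first extended capability header */
#define PCI_EXT_CAP_ID_DVSEC   0x23   /* Designated Vendor-Specific Ext. Capability */
#define VENDOR_ID_INTEL        0x8086
#define SIOV_DVSEC_ID          5      /* assumption: DVSEC ID of Intel Scalable IOV */

/* Little-endian reads from a raw configuration-space snapshot. */
static uint16_t cfg_read16(const uint8_t *cfg, size_t off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

static uint32_t cfg_read32(const uint8_t *cfg, size_t off)
{
    return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Illustration helper: write an extended capability header plus the two
 * DVSEC fields this sketch inspects (vendor ID and DVSEC ID). */
static void cfg_put_ext_cap(uint8_t *cfg, size_t pos, uint16_t cap_id,
                            uint16_t next, uint16_t vendor, uint16_t dvsec_id)
{
    uint32_t hdr = (uint32_t)cap_id | (1u << 16) | ((uint32_t)next << 20);

    for (int i = 0; i < 4; i++)
        cfg[pos + i] = (uint8_t)(hdr >> (8 * i));
    cfg[pos + 4] = (uint8_t)(vendor & 0xff);
    cfg[pos + 5] = (uint8_t)(vendor >> 8);
    cfg[pos + 8] = (uint8_t)(dvsec_id & 0xff);
    cfg[pos + 9] = (uint8_t)(dvsec_id >> 8);
}

/* Return the offset of the Scalable IOV DVSEC, or 0 if it is absent. */
static size_t siov_find_dvsec(const uint8_t *cfg)
{
    size_t pos = PCI_EXT_CAP_START;
    int budget = (PCI_CFG_SPACE_EXP_SIZE - PCI_EXT_CAP_START) / 8; /* loop guard */

    while (pos >= PCI_EXT_CAP_START &&
           pos + 10 <= PCI_CFG_SPACE_EXP_SIZE && budget-- > 0) {
        uint32_t hdr = cfg_read32(cfg, pos);

        if (hdr == 0 || hdr == 0xffffffffu)
            break;                          /* empty or unreadable list */
        if ((hdr & 0xffff) == PCI_EXT_CAP_ID_DVSEC &&
            cfg_read16(cfg, pos + 4) == VENDOR_ID_INTEL &&
            cfg_read16(cfg, pos + 8) == SIOV_DVSEC_ID)
            return pos;
        pos = (hdr >> 20) & 0xffc;          /* next capability offset */
    }
    return 0;
}
```

A caller would fill the buffer from the device's configuration space (for example, from a sysfs `config` file read with sufficient privileges) and treat a non-zero return value as an indication that the device advertises Scalable IOV.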
With the input–output memory management unit (IOMMU) aware mediated device patch set (https://lwn.net/Articles/783892/) merged into v5.2-rc1, Linux* becomes the first OS to support the Intel Scalable IOV technology. This article also describes how we enabled Linux to support the new Intel Scalable IOV technology.
As shown in the following diagram, an IOMMU group represents the smallest set of devices that are considered isolated from the perspective of the IOMMU. Each device belongs to a group with two address space domains: a default domain and a primary domain. Conceptually, a domain represents an isolated I/O virtual address (IOVA) space. The default domain is used on bare metal to strictly control DMA addresses for the associated device (only granted system memory pages can be accessed through DMA). Using the default domain also avoids bounce buffers and allows a contiguous DMA buffer to be created from multiple scattered buffers.
The device drivers use the default domain through the DMA APIs. The default domain is attached to the device during boot, before the corresponding device driver is loaded, so DMA requests can be translated by the IOMMU. After the device is assigned to the user level for direct access, the default domain is replaced by the primary domain, which provides isolation for the user space device driver or VM. Only one of the two domains, default or primary, is active at any given time. After the primary domain is attached, the assigned device can only access, through DMA, memory that belongs to the virtual machine or application to which the device is assigned.
Traditional relationship between device and domain
When it comes to Intel® Scalable IOV-enabled devices, the situation becomes more complex, as shown in the following diagram. The device itself belongs to an IOMMU group, just like other normal PCIe devices. The default domain of the IOMMU group is valid and DMA transactions are translated through it. The primary domain is not valid because we cannot assign the whole device to user space when Intel Scalable IOV is enabled.
Because each newly created ADI is isolated from every other ADI, and from the parent device, by a unique PASID, the ADI belongs to an IOMMU group of its own. To distinguish the isolation domain for an ADI from that of a normal PCI device, we introduced the concept of an auxiliary domain. There is no difference between a primary domain and an auxiliary domain from an isolation perspective.
Relationship between an Intel® Scalable IOV-capable device and domains
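The relationship between the parent device, its default domain, and the per-ADI auxiliary domains can be illustrated with a small user-space model. This is a conceptual sketch only: the types and functions (`struct siov_device`, `siov_attach_aux`, `siov_lookup`) are hypothetical and do not correspond to kernel data structures. It shows DMA without a PASID resolving to the default domain, DMA tagged with an ADI's PASID resolving to that ADI's auxiliary domain, and DMA with an unknown PASID resolving to nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ADI  4
#define NO_PASID 0xffffffffu   /* marker for a DMA request without a PASID */

struct domain { int id; };     /* stands in for an isolated IOVA space */

struct siov_device {
    struct domain *default_domain;        /* used by the host driver's own DMA */
    uint32_t       adi_pasid[MAX_ADI];    /* PASID assigned to each ADI */
    struct domain *aux_domain[MAX_ADI];   /* auxiliary domain of each ADI */
    size_t         nr_adi;
};

/* Bind an auxiliary domain to a PASID; reject duplicates and overflow. */
static int siov_attach_aux(struct siov_device *d, uint32_t pasid,
                           struct domain *dom)
{
    if (pasid == NO_PASID || d->nr_adi == MAX_ADI)
        return -1;
    for (size_t i = 0; i < d->nr_adi; i++)
        if (d->adi_pasid[i] == pasid)
            return -1;
    d->adi_pasid[d->nr_adi] = pasid;
    d->aux_domain[d->nr_adi] = dom;
    d->nr_adi++;
    return 0;
}

/* Pick the domain that would translate a DMA request from this device. */
static struct domain *siov_lookup(const struct siov_device *d, uint32_t pasid)
{
    if (pasid == NO_PASID)
        return d->default_domain;
    for (size_t i = 0; i < d->nr_adi; i++)
        if (d->adi_pasid[i] == pasid)
            return d->aux_domain[i];
    return NULL;   /* unknown PASID: no translation, the request is blocked */
}
```

The model keeps the default domain active alongside the auxiliary domains, which mirrors the point above that the parent device keeps using its default domain while individual ADIs are assigned.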
Mediated Device Framework in VFIO
In the Linux kernel, an ADI is normally represented by a pseudo device called a mediated device (mdev), which is implemented in the Virtual Function I/O (VFIO) component. The VFIO component is an IOMMU- and device-agnostic framework for exposing direct device access to user space. It supports three types of devices: PCI/PCIe* devices, platform devices, and mediated devices. Platform devices might be detected and probed, for example, through Advanced Configuration and Power Interface (ACPI) or Advanced Microcontroller Bus Architecture (AMBA). All device types are abstracted with a common user interface for life-cycle management, resource enumeration, and run-time emulation. There is also an IOMMU abstraction layer that talks to IOMMU vendor drivers for I/O page table management, as shown in the following diagram.
VFIO mediated device framework 
The mediated device framework is a subset of VFIO that allows software-defined devices to be exposed through VFIO while the host driver manages access to the interface. The VFIO mediated device framework supports software-mediated devices, that is, devices that rely on the Virtual Device Composition Module (VDCM) to enforce DMA isolation in a vendor-specific way. When the Intel Scalable IOV-capable device driver registers itself with the mediated device framework, an ADI can be represented by a mediated device. Because each ADI is isolated by the system IOMMU, the mdev must also work in an IOMMU-protected environment. To achieve this, the IOMMU abstraction layer in the VFIO driver was extended to support three types of domain attachments.
- Software mediated device. No IOMMU involved. Relies on the VDCM to enforce DMA isolation in a vendor-specific way.
- IOMMU isolated with a primary domain. IOMMU involved. Relies on IOMMU isolation at PCI source ID granularity.
- IOMMU isolated with an auxiliary domain. IOMMU involved. Relies on IOMMU isolation at PASID granularity.
The VFIO IOMMU abstraction layer uses the method shown in the following figure to distinguish different attachment types.
VFIO IOMMU abstraction layer supports all types of isolation
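Since the figure is not reproduced here, the following user-space sketch shows one plausible way to choose among the three attachment types listed above. The decision logic is an assumption inferred from the surrounding text (whether an IOMMU device was set for the mdev, and whether that device works in Intel Scalable IOV mode); the type and function names (`struct model_device`, `struct model_mdev`, `select_attach_type`) are hypothetical and are not the VFIO implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum attach_type {
    ATTACH_SOFTWARE_MEDIATED,  /* no IOMMU involved, VDCM enforces isolation */
    ATTACH_PRIMARY_DOMAIN,     /* IOMMU isolation at PCI source ID granularity */
    ATTACH_AUX_DOMAIN          /* IOMMU isolation at PASID granularity */
};

/* Hypothetical stand-in for the physical device behind a mediated device. */
struct model_device {
    bool siov_mode;            /* device works in Scalable IOV mode */
};

/* Hypothetical stand-in for a mediated device. */
struct model_mdev {
    const struct model_device *iommu_device;  /* NULL if the VDCM set none */
};

/* Assumed selection rule: no IOMMU device means software mediation; an
 * IOMMU device in Scalable IOV mode means an auxiliary domain; otherwise
 * the whole device is isolated with a primary domain. */
static enum attach_type select_attach_type(const struct model_mdev *mdev)
{
    if (mdev->iommu_device == NULL)
        return ATTACH_SOFTWARE_MEDIATED;
    if (mdev->iommu_device->siov_mode)
        return ATTACH_AUX_DOMAIN;
    return ATTACH_PRIMARY_DOMAIN;
}
```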
How Drivers Interact with the Framework
In Linux, the consumer of the Intel Scalable IOV framework is primarily the device driver. It is interesting and helpful to look at the whole picture from the device driver’s point of view, as shown below. The figure shows all interactions between an Intel Scalable IOV-capable device driver and the components of the Intel Scalable IOV framework.
Intel® Scalable IOV-capable device driver workflow diagram
- The device driver tells the IOMMU subsystem this device works in Intel Scalable IOV mode by calling the
- The VDCM sets the IOMMU device with mdev_set_iommu_device() when it registers into the mdev bus framework.
- The VFIO allocates an IOMMU domain with iommu_domain_alloc() for a VFIO group.
- The VFIO attaches the IOMMU domain to the mediated device with the iommu_aux_attach_device() function and saves the domain in mdev framework with
- The IOMMU driver allocates a PASID for the mediated device and sets up the IOMMU tables so all DMA transfers from the Intel Scalable IOV-capable device tagged with this PASID are translated through this domain.
- The IOMMU domain is saved in the mdev framework.
- Later, the device driver retrieves the attached domain from the mdev framework with
- The device driver retrieves the PASID associated with the domain from the IOMMU subsystem with the
- The device driver programs this PASID into the device register so the ADI uses this PASID for DMA.
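The order of the steps above can be walked through with a small user-space model. This is a sketch under stated assumptions, not kernel code: every `model_*` name is hypothetical and only mimics the role of the corresponding kernel call, PASIDs are handed out from a simple counter, and the ADI's PASID register is a plain struct field. Domain allocation is represented by the caller declaring a zeroed `struct model_domain`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NO_PASID 0u

struct model_domain { uint32_t pasid; };   /* aux domain and its PASID */

struct model_device {                       /* Scalable IOV-capable parent */
    bool     siov_mode;
    uint32_t next_pasid;
};

struct model_mdev {                         /* mediated device for one ADI */
    struct model_device *iommu_device;      /* parent that issues the DMA */
    struct model_domain *domain;            /* domain saved after attach */
    uint32_t             pasid_register;    /* stands in for the ADI register */
};

/* Driver tells the IOMMU layer that the device works in Scalable IOV mode. */
static void model_enable_siov(struct model_device *dev)
{
    dev->siov_mode = true;
    dev->next_pasid = 1;
}

/* VDCM records which physical device backs the mediated device. */
static void model_set_iommu_device(struct model_mdev *m,
                                   struct model_device *dev)
{
    m->iommu_device = dev;
}

/* Attach: allocate a PASID for the domain and save the domain in the mdev. */
static int model_aux_attach(struct model_domain *dom, struct model_mdev *m)
{
    struct model_device *dev = m->iommu_device;

    if (dev == NULL || !dev->siov_mode)
        return -1;               /* aux attach needs Scalable IOV mode */
    dom->pasid = dev->next_pasid++;
    m->domain = dom;
    return 0;
}

/* Driver retrieves the domain, looks up its PASID, and programs the ADI. */
static int model_program_adi(struct model_mdev *m)
{
    if (m->domain == NULL || m->domain->pasid == MODEL_NO_PASID)
        return -1;
    m->pasid_register = m->domain->pasid;
    return 0;
}
```

In this model, two mediated devices that share one parent end up with different PASIDs in their registers, which is the property that lets the IOMMU translate each ADI's DMA through its own domain.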
At this time, the mainline Linux kernel does not include the mdev_set/get_iommu_domain() API. We plan to submit it later with a real Intel Scalable IOV-capable device driver. The first device driver using the Intel® Scalable IOV framework is under discussion in the community (https://lkml.org/lkml/2019/4/24/495). It is a sample driver that demonstrates the usage of the APIs described in this article by wrapping a PCI device into a mediated device. Users will soon be able to try the framework with this driver on a mainline Linux kernel. Our team plans to develop more device drivers (for accelerators, graphics devices, network devices, and others) based on the Intel Scalable IOV framework in the future.
- Intel® Scalable I/O Virtualization LinuxCon Presentation by Kevin Tian
- Intel® Virtualization Technology for Directed I/O Architecture Specification
- Intel® Scalable I/O Virtualization Technical Specification
- Recent Enhancements in Intel® Virtualization Technology for Directed I/O post on 01.org by Ashok Raj
- IOMMU aware mediated device post on lwn.net by Baolu Lu