Daniel Vetter has written, on his blog, a very nifty i915/GEM Crashcourse. It is divided into 4 parts, all of which are transcribed here.
Part 1 talks about the different address spaces that a i915 GEM buffer object can reside in and where and how the respective page tables are set up. Then, it also covers different buffer layouts, as far as they're a concern for the kernel, namely how tiling, swizzling, and fencing work.
Part 2 covers all the different bits and pieces required to submit work to the gpu and keep track of the gpu's progress: command submission, relocation handling, command retiring, and synchronization are the topics.
Part 3 looks at some of the details of the memory management implemented in the i915.ko driver. Specifically, we look at how we handle running out of GTT space and what happens when we're generally short on memory.
Finally, part 4 discusses coherency and caches, and how to most efficiently transfer data between the gpu coherency domains and the cpu coherency domain under different circumstances.
This is the first part in a short tour of the Intel hardware and what the GEM (graphics execution manager) in the i915 does. See the overview
for links to the other parts.
GEM essentially deals with graphics buffer objects (which can contain textures, renderbuffers, shaders, or all kinds of other state objects and data used by the gpu) and how to run a given workload on the gpu, commonly called command submission (CS), which in the i915.ko driver is done with the execbuf ioctl (because the gpu commands themselves reside in a buffer object on Intel hardware).
Address Spaces and Pagetables
So, the first topic to look at is what kinds of different address spaces we have, which pieces of hardware can access them, and how we bind various pieces of memory into them (that is, where the corresponding pagetables are and what they look like). Contrary to discrete gpus, Intel gpus can access only system memory. So, the only way to make any memory available to the gpu is to bind a bunch of pages into one of these gpu pagetables, and we don't need to bother with different kinds of underlying memory as backing storage.
The gpu itself has its own virtual address space, commonly called GTT. On modern chips, it's 2GB big, and all gpu functions (display unit, render rings, and similar global resources, but also all the actual buffer objects used for rendering) access the data they need through it. On earlier generations, it's much smaller, down to a meager 32M for the i830M. On Sandybridge and newer platforms, we also have a second address space called the per-process GTT (PPGTT for short), which is of the same size. This address space can be accessed only by the gpu engines (and even there, we sometimes can't use it), so scanout buffers must be in the global GTT. The original aim of PPGTT was to insulate different gpu processes, but context switch times are high, and up to Sandybridge, the TLBs have errata when using different address spaces. The reason we use PPGTT now - it's been around as a hardware feature even on earlier generations - is that PPGTT PTEs can reside in cacheable memory, so lookups benefit from the big shared LLC cache, whereas GTT PTE lookups always hit main memory.
But before we look at where the pagetables reside, we also need to consider how the cpu can access graphics data. Because Intel has a unified memory architecture, we can access graphics memory by directly mapping the backing storage in main memory into the cpu address space. For a bunch of reasons that we'll discuss later on, it is useful if the cpu can also access graphics objects through the GTT. For that, the low part (usually the first 256MB) of the GTT can be accessed through the BAR 2 PCI mmio space of the igd. So, we can point the cpu PTEs at that mmio window and then point the relevant GTT PTEs at the actual backing storage in main memory. The chip block that forwards these cpu accesses is usually called the System Agent (SA). Note that there's no such window for the PPGTT; that address space can really only be used by the render part of the gpu (usually called GT). This GTT cpu window is called the mappable gtt part in our code (because we can map it from the cpu) and is accessed with a write-combining (wc) cpu mapping.
Now the pagetables are a bit tricky. In the end, they're all in system memory, but there are a few hoops to jump through to get at them. The GTT pagetable has just one level, so with a 4-byte entry size, we need 2MB of contiguous pagetable space. The firmware allocates that for us from stolen memory (that is, a part of system memory that is not listed in the e820 map, so it's not managed by the Linux kernel). But we write these PTEs through an alias in the register mmio bar! The reason for that is to allow the SA to invalidate TLBs. Note, though, that this only invalidates TLBs for cpu access. Any other access to the GTT (such as from the GT or the display block) has its own rules for TLB invalidation. Also, on recent generations, we need to (depending upon circumstances) manually invalidate the SA TLB by writing to a magic register. To speed up map/unmap operations, we map that GTT PTE aliasing region in the mmio with wc (if this is possible, which means the cpu needs to support PAT).
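The PTE write path can be sketched in plain C. The bit layout and function names below are purely illustrative (the real per-generation encodings differ), and a plain array stands in for the ioremap_wc()'d mmio alias:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative PTE layout, not the exact bits of any one generation:
 * the page address lives in bits 31:12 and bit 0 is the valid bit. */
uint32_t gtt_pte_encode(uint32_t page_addr, int valid)
{
    return (page_addr & 0xfffff000u) | (valid ? 1u : 0u);
}

/* In the real driver, gtt_alias is an ioremap_wc()'d slice of the
 * register mmio bar; here it is just an array of entries. */
void gtt_insert_pages(volatile uint32_t *gtt_alias, unsigned first_entry,
                      const uint32_t *page_addrs, unsigned count)
{
    unsigned i;

    for (i = 0; i < count; i++)
        gtt_alias[first_entry + i] = gtt_pte_encode(page_addrs[i], 1);

    /* Posting read: make sure the wc writes have landed before the
     * gpu (or the SA) walks these entries. */
    (void)gtt_alias[first_entry + count - 1];
}
```

The posting read at the end matters with a wc mapping, because write-combining buffers may otherwise hold the PTE writes back for a while.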
PPGTT has a two-level pagetable. The PTEs have the exact same bit layout as the PTEs in the global GTT, but they reside in normal system memory. The PTE pagetables are split up into pages, each of which contains 1024 entries. So, there's no need for a contiguous block of memory, and they are allocated with the Linux page allocator. The PDEs, on the other hand, are special. We need 512 of them to map the entire 2GB address space. For unknown reasons, the hardware requires that we steal these 512 entries from the GTT pagetable (shrinking the global GTT's usable range by 2MB) and write them through the same mmio alias that we write the global GTT PTEs through.
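These parameters (2GB space, 4KB pages, 1024 PTEs per pagetable page, 512 PDEs) fully determine how a PPGTT address splits up. A sketch, with made-up helper names:

```c
#include <stdint.h>
#include <assert.h>

/* A 31-bit PPGTT address (2GB space) decomposes as:
 *   bits 30:22 -> PDE index   (9 bits,  512 PDEs)
 *   bits 21:12 -> PTE index   (10 bits, 1024 PTEs per pagetable page)
 *   bits 11:0  -> page offset (12 bits, 4KB pages)
 * 512 * 1024 * 4096 = 2^31 = 2GB, so the split covers the space. */
unsigned ppgtt_pde_index(uint32_t addr)   { return (addr >> 22) & 0x1ff; }
unsigned ppgtt_pte_index(uint32_t addr)   { return (addr >> 12) & 0x3ff; }
unsigned ppgtt_page_offset(uint32_t addr) { return addr & 0xfff; }
```

The 2MB cost of stealing the PDEs from the GTT also falls out of these numbers: 512 GTT entries at 4KB of mapped range each.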
A slight complication is VT-d/DMAR support, which adds another layer to remap the bus addresses that we put into (PP)GTT PTEs to real memory addresses. We simply use the common Linux dma api through pci_map/unmap_sg (and variants), so DMAR support is rather transparent in our code. Well, except for the minor annoyance that the DMAR hardware interacts badly with the GTT PTE lookups on various platforms and so needs ridiculous amounts of horrible workarounds (down to disabling everything on GM45 because it simply doesn't work). Only on Ivybridge does DMAR seem to be truly transparent and no longer require strange contortions.
Tiling and Swizzling
Modern processors go to extreme lengths to paper over memory latency and lack of bandwidth with clever prefetchers and tons of cache. Now, graphics usually deals with 2D data, and that can result in some rather different access patterns. The pathological case is drawing a vertical line (if each horizontal line is contiguous in memory, followed by the next one): Drawing one pixel will just access 4 bytes, then the next pixel is a few thousand bytes away. This means we'll not only incur a cache miss, but also a TLB miss. So, a 1-pixel line crossing the entire screen means a few hundred to a few thousand cache and TLB misses, for just 1-2 KB of data. No amount of caches and TLBs can paper over that.
So, the solution is to rearrange the layout of 2D buffers into tiles of fixed width and height, so that each tile has a size of 4KB. So, as long as we stay within a tile, we won't incur a TLB miss. Intel hardware has two different tiling layouts:
- X-tiled, which is 512 bytes wide with 8 rows, where each 512-byte-long logical row is contiguous in memory. This is the compromise tiling layout, which is somewhat efficient for rendering, but can still be used for scanout, where we access the 2D buffer row-by-row. A too-short (or non-contiguous) row within a tile would drive power consumption for scanout through the roof (due to the more random memory access pattern), which hurts idle power consumption.
- Y-tiled, which is 128 bytes times 32 rows. For the common 32-bit pixel layouts, this means 32x32 pixels, so nicely symmetric in both the x and y direction. For even better performance, a row is split up into OWORDs (which are 16 bytes), and consecutive OWORDs in memory are mapped to consecutive rows (not columns, which would result in the contiguous rows of pixels that the X-tiled layout has). So, an aligned 4x4 pixel block matches up with a 64-byte cacheline. So, in a way, a cacheline within a Y-tile works like a smaller microtile.
There's also a special W-tile layout, which is only used by the separate stencil buffer. Except when hacking around in the stencil readback code, we can just ignore this (because accessing the stencil buffer is all internal to the render engine). Also, some really old generations have slightly different parameters for the X and Y layouts.
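The two main layouts can be written down as plain address arithmetic. This is a sketch following the parameters above (4KB tiles, 512x8 X-tiles, 128x32 Y-tiles with oword columns); the function names are made up, and per-generation quirks are not modeled:

```c
#include <stdint.h>
#include <assert.h>

/* Byte offset of byte (x, y) in an X-tiled buffer; stride is the
 * surface pitch in bytes (assumed to be a multiple of 512). */
uint32_t x_tiled_offset(uint32_t x, uint32_t y, uint32_t stride)
{
    uint32_t tile = (y / 8) * (stride / 512) + (x / 512);

    /* within a tile, each 512-byte row is contiguous */
    return tile * 4096 + (y % 8) * 512 + (x % 512);
}

/* Same for a Y-tiled buffer (stride a multiple of 128); within a
 * tile, consecutive owords in memory walk down the rows of a
 * 16-byte-wide column. */
uint32_t y_tiled_offset(uint32_t x, uint32_t y, uint32_t stride)
{
    uint32_t tile = (y / 32) * (stride / 128) + (x / 128);

    return tile * 4096 + ((x % 128) / 16) * 512 + (y % 32) * 16 + (x % 16);
}
```

Note how moving down one row costs 512 bytes in the X layout, but only 16 bytes within a Y-tile column, which is why Y tiling behaves so much better for vertical access.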
An additional complexity comes with dual-channel memory. For efficiency reasons, we want to read entire 64-byte blocks (which match the cacheline size) from a single channel. To load-balance between the two channels, we therefore load all even cachelines from the first channel and all odd cachelines from the second channel. There are some additional tricks to even things out, but this is the gist of what we need for the discussion below.
Unfortunately, this channel-interleave pattern, together with X tiling, again leads to a pathological access pattern: If we draw a vertical line in an X tile, we advance by 512 bytes for each pixel, which means we'll always hit the same memory channel (because 512 % 64 == 0). The same happens for Y tiles when drawing a horizontal line. Looking at the memory address, we see that bit 6 essentially selects which memory channel will be used, so we can even out the access pattern by XOR'ing additional, higher bits into bit 6 (which is called swizzling). XOR'ing bit 9 into bit 6 gives us a perfect checkerboard of memory channels for Y tiling. For X tiling, we can't get better than, at most, 2 consecutive cachelines in any direction that use the same memory channel, which the hardware achieves by XOR'ing bits 10 and 9 into bit 6.
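The swizzle logic itself is tiny. A sketch of the two XOR patterns just described (bit 6 being the channel-select bit):

```c
#include <stdint.h>
#include <assert.h>

/* Y tiling: XOR bit 9 into bit 6 (the channel-select bit). */
uint32_t swizzle_bit9(uint32_t addr)
{
    return addr ^ (((addr >> 9) & 1u) << 6);
}

/* X tiling: XOR bits 10 and 9 into bit 6. */
uint32_t swizzle_bit9_10(uint32_t addr)
{
    uint32_t bit = ((addr >> 9) ^ (addr >> 10)) & 1u;

    return addr ^ (bit << 6);
}
```

After swizzling, ((addr >> 6) & 1) gives the memory channel, which is exactly the value these functions are evening out.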
Addendum: On most older platforms, the firmware selects whether swizzling is enabled when setting up the memory controller, and this can't be changed afterwards. But on Sandybridge and later, the driver controls swizzling and enables it when it detects a symmetric DIMM configuration in the memory controller.
Now how can we tell the gpu which layout a given buffer object uses? For most gpu functions, we can directly specify the tiling layout in additional state bits (or, for the display unit, in special register bits). But some functions don't have any such bits, and on older platforms, that even includes the blitter engine and some other special units. For these, the hardware provides a limited set of so-called fences.
The first thing to note is that Intel hardware fences are something rather different from what fences normally mean around gpu drivers: Usually, a fence denotes a synchronization point inserted into the command stream that the gpu processes, so that the cpu knows exactly up to which point the gpu has completed commands (and so, also which buffers the gpu still needs).
Fences on Intel hardware, though, are special ranges in the global GTT that make a tiled area look linear. On modern platforms, we have 16 of those and can set them up rather freely, the start and end of a tiled region need only to align to page boundaries. Now most gpu GTT clients have their own tiling parameters and ignore these fences (with the exception of older platforms, where fence usage by the gpu is much more common), but fences are obeyed by the SA and so, by all CPU access that targets the mmio window into the GTT. Fences are, therefore, the first reason why cpu access to a buffer object through the GTT mmio window is useful: they detile a buffer automatically, and doing the tiling and swizzling manually is simply a big pain. One peculiarity, though, is that we can access only X and Y tiled regions, not W tiled regions, so stencil buffers need to be manually detiled and deswizzled with the cpu.
Managing the fences is rather straightforward. We keep them in an LRU, and if no fence is free, we make one available by making sure that nothing currently accessing the relevant buffer object still needs its fence. To make sure that the cpu can't access the object through the mmio window any more, we shoot down all cpu PTEs pointing at it. For older generations, where the gpu also needs fences, we keep track of the last command submission that required the fence (because not all of them do, and fences are a limited resource) and wait for that to complete.
Also, on some older platforms, fences have a minimal size and need to be a power of two in alignment and size. To avoid the need to overallocate a texture (and so waste memory), we allow tiled objects to be smaller than the fenced region they would require. When setting up a fence, we simply reserve the entire region that the fence detiles in the GTT memory manager, and so make sure that no other object ends up in the unused, but still detiled, area (where it would get corrupted).
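The size of such a fenced region is just a power-of-two round-up with a floor. A sketch, assuming a hypothetical 1MB minimum (the real minimum and the exact rules vary per generation):

```c
#include <stdint.h>
#include <assert.h>

/* Size of the region a fence would have to cover for an object of
 * obj_size bytes: round up to a power of two, with a minimum size.
 * The 1MB minimum here is illustrative only. */
uint32_t fence_region_size(uint32_t obj_size)
{
    uint32_t size = 1u << 20;

    while (size < obj_size)
        size <<= 1;
    return size;
}
```

The difference between fence_region_size(obj_size) and obj_size is exactly the "unused, but still detiled" tail that the GTT memory manager has to keep empty.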
Now that we've covered how to make memory accessible to the gpu and how to handle tiling, swizzling, and setting up fences, the next installment of this series will look at how to submit work to the gpu and some related issues.
After the previous installment, this part will cover command submission to the gpu. See the i915/GEM crashcourse overview for links to the other parts of this series.
Command Submission and Relocations
As I've alluded to already, gpu command submission on Intel hardware happens by sending a special buffer object with rendering commands to the kernel for execution on the gpu, the so-called batch buffer. The ioctl to do this is called execbuf. Now, this buffer contains tons of references to various other buffer objects, which contain textures, render buffers, depth & stencil buffers, vertices, all kinds of gpu-specific things like shaders, and also some state buffers which, for example, describe the configuration of specific (fixed-function) gpu units.
The problem now is that userspace doesn't control where all these buffers are - the kernel manages the entire GTT. And the kernel needs to manage the entire GTT, because otherwise multiple users of the same single gpu can't get along. So, the kernel needs to be able to move buffers around in the GTT when they don't all fit in at the same time. This means clearing the PTEs in the relevant pagetables for the old buffers that get kicked out and then filling them again with entries pointing at the new buffers, which the gpu now requires to execute the batch buffer. In short, userspace needs to fill the batchbuffer with tons of GTT addresses, but only the kernel really knows them at any given point.
This little problem is solved by supplying a big list of relocation entries, along with the batchbuffer and a list of all the buffers required to execute this batch. To optimize for the common case, where buffers don't move around, userspace prefills all references with the GTT offsets from the last command submission (the kernel is kind enough to tell userspace the updated offsets after successful submission of a batch). The kernel then goes through that relocation list and checks whether the offsets that userspace presumed are still correct. And if that's not the case, it updates the buffer references in the batch and so relocates the referenced data, hence the name.
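The core of relocation processing can be sketched like this. The struct is a heavy simplification of the real relocation entry (which also carries read/write domains), and the names are made up:

```c
#include <stdint.h>
#include <assert.h>

struct reloc {
    uint32_t batch_offset;   /* byte offset of the reference in the batch */
    uint32_t target;         /* index of the referenced buffer object */
    uint32_t presumed;       /* gtt offset userspace assumed */
};

/* actual[] holds where each buffer really ended up in the gtt. */
void apply_relocs(uint32_t *batch, struct reloc *relocs, unsigned count,
                  const uint32_t *actual)
{
    unsigned i;

    for (i = 0; i < count; i++) {
        uint32_t now = actual[relocs[i].target];

        if (now == relocs[i].presumed)
            continue;   /* fast path: the buffer didn't move */

        /* patch the reference in the batch and report the new
         * offset back so userspace can presume it next time */
        batch[relocs[i].batch_offset / 4] = now;
        relocs[i].presumed = now;
    }
}
```

The fast path is the whole point: in the steady state, nothing moves, so the kernel only reads the relocation list and writes nothing.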
A slight complication is that the gpu data structures can be several levels deep, like the batch points at the surface state, which then points at the texture/render buffers. So each buffer in the command submission list has a relocation list pointer, but for most buffers it's just NULL (because they just contain data and don't reference any other buffers).
Now, along with any information required to rewrite references for relocated buffers, userspace also supplies some other information (the read/write domains) about how it wants to use the buffer. Originally, this was to optimize cache coherency management (coherency will be covered in detail later on), but nowadays, that code is massively simplified because that clever optimized cache tracking is simply not worth it. We do, though, still use these domain values to implement a workaround on Sandybridge. Because we use PPGTT, all memory accesses from the batch are directed to go through the PPGTT, with the exception that pipe control writes (useful for a bunch of things, but mostly for OpenGL queries) always go through the global GTT (it's a bug in the hardware...). So we need to make sure that we not only bind to the PPGTT, but also set up a mapping in the global GTT. And we detect this situation by checking for a GEM_INSTRUCTION write domain - only pipe control writes have that set.
The other special relocation case is for older generations, where the gpu needs a fence set up to access tiled buffers, at least for some operations. The relocation entries have a flag to signal to the kernel that a fence is required. Another peculiarity is that fences can be set up only in the mappable part of the GTT, at least on those chips that require them for rendering. So, we also restrict the placement of any buffers that require a fence to the mappable part of the GTT.
So after rewriting any references to buffers that moved around, the kernel is ready to submit the batch to the gpu. Every gpu engine has a ringbuffer that the kernel can fill with its own commands. First, we emit a few preparatory commands to flush caches and set a few registers (which normal userspace batches can't write) to the values that userspace needs. Then, we start the batch by emitting a MI_BATCHBUFFER_START command.
Retiring and Synchronization
Now the gpu can happily process the commands and do the rendering, but that leaves the kernel with a problem: When is the gpu done? Userspace obviously needs to know this to avoid reading back incomplete results. But the kernel also needs to know this, to avoid unmapping buffers which are still in use by the gpu. For example, when a render operation requires a temporary buffer, userspace might free that buffer right away after the execbuf call completes. But the kernel needs to delay the unmapping and freeing of the backing storage until the gpu no longer needs that buffer.
Therefore, the kernel associates a sequence number with every batchbuffer and adds a write of that sequence number to the ringbuffer. Every engine has a hardware status page (HWS_PAGE), which we can use for such synchronization purposes. The upshot of that special status page is that gpu writes to it snoop the cpu caches, so a read from it is much faster than reading the gpu head pointer register of the ring buffer directly. We also add a MI_USER_IRQ command after the sequence number (seqno for short) write, so that we don't need to constantly poll while waiting for the gpu.
Two little optimizations apply to this gpu-to-cpu synchronization mechanism: If the cpu doesn't wait for the gpu, we mask the gpu engine interrupts to avoid flooding the cpu with thousands of interrupts (and potentially waking it up from deeper sleep states all the time). And the seqno read has a fastpath which might not be fully coherent, and a potentially much more expensive slowpath. This is because of some coherency issues on recent platforms, where the interrupt seemingly arrives before the seqno write has landed in the status page. Because we check that seqno rather often, it's good to have a lightweight check which might not give the most up-to-date value, but is good enough to avoid going through more costly slowpaths in the code that handles synchronization with the gpu.
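Checking a seqno against the value in the status page needs to be wraparound-safe, because a 32-bit counter eventually rolls over. A minimal sketch of the comparison, the same signed-difference trick the kernel's i915_seqno_passed() helper uses:

```c
#include <stdint.h>
#include <assert.h>

/* Has `current` (the last seqno the gpu wrote to the status page)
 * reached `target`? Interpreting the difference as signed makes the
 * comparison correct across 32-bit wraparound, as long as the two
 * values are less than 2^31 apart. */
int seqno_passed(uint32_t current, uint32_t target)
{
    return (int32_t)(current - target) >= 0;
}
```

A naive `current >= target` would break the first time the seqno counter wraps, which is why the subtraction-and-cast form is used instead.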
So, now we have a means to track the progress of the gpu, through the batches submitted to the engine's ringbuffer, but not yet a means to prevent the kernel from unmapping or freeing still in-use buffers. For that, the kernel keeps a per-engine list of all active buffers and marks each buffer with the seqno of the latest batch it has been used for. It also keeps a list of all still-outstanding seqnos in a per-engine request list. The clever trick now is that the kernel keeps an additional reference on each buffer object that resides on one of the active lists. That way, a buffer can never disappear while still in use by the gpu, even when userspace removes all its references. To batch up the active list processing and retiring of any newly completed requests, the kernel runs a regular task from a worker thread to clean things up.
To avoid polling in userspace, the kernel also provides interfaces for userspace to wait until rendering completes on a given buffer object: The wait_timeout ioctl simply waits until the gpu is done with an object (optionally with a timeout); the set_domain ioctl doesn't have a timeout, but additionally takes a flag to indicate whether userspace only wants to read or also wants to write. Furthermore, set_domain also makes sure that cpu caches are coherent, but we will look at that little issue later on. The set_domain ioctl doesn't wait for all usage by the gpu to complete if userspace only wants to read the buffer. In that case, it only waits for all outstanding gpu writes. The kernel knows this, thanks to the separate read/write domains in the relocation entries, and keeps track of both the last gpu write and read by remembering the seqnos of the respective batches.
The kernel also supports a busy ioctl to simply inquire whether a buffer is still being used by the gpu. This recently gained the ability to tell userspace which gpu engine an object is busy on. This is useful for compositors that get buffer objects from clients to decide which engine is the most suitable one, if a given operation can be done with more than one engine (pretty much all of them can be coaxed into copying data).
With that, we have gpu/cpu synchronization covered. But, as just mentioned above, the gpu itself also has multiple engines (at least on newer platforms), which can run in parallel. So, we need to have some means of synchronization between them. To do that, the kernel not only keeps track of the seqno of the last batch an object has been used for, but also of the engine (commonly just called ring in the kernel, because that's what the kernel really cares about).
If a batchbuffer then uses an object which is still busy on a different engine, the kernel inserts a synchronization point: Either by employing so-called hardware semaphores, which, similarly to when the kernel needs to wait for the gpu, simply wait for the correct seqno to appear, only using internal registers instead of the status page. Or, if that's disabled, simply by blocking in the kernel until rendering completes. To avoid inserting too many synchronization points, the kernel also keeps track of the last synchronization point for each ring. For ring/ring synchronization, we don't track read domains separately, at least not yet.
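Tracking the last synchronization point per ring pair lets redundant semaphores be skipped. A sketch of that bookkeeping (the names and the fixed ring count are illustrative):

```c
#include <stdint.h>
#include <assert.h>

#define NUM_RINGS 3

/* last_sync[to][from]: the highest seqno of ring `from` that ring
 * `to` has already synchronized against. */
uint32_t last_sync[NUM_RINGS][NUM_RINGS];

/* Returns 1 if a semaphore wait (or a blocking wait) must be
 * emitted before ring `to` may use an object whose last use on
 * ring `from` was obj_seqno, 0 if a previous sync already covers it. */
int need_sync(int to, int from, uint32_t obj_seqno)
{
    if ((int32_t)(last_sync[to][from] - obj_seqno) >= 0)
        return 0;   /* already synced past this point */

    last_sync[to][from] = obj_seqno;
    return 1;
}
```

Since seqnos on a ring only grow, syncing once against a given seqno implicitly covers every object with a smaller one.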
Note that a big difference of GEM, compared to a lot of other gpu execution management frameworks and kernel drivers, is that GEM does not expose explicit sync objects/gpu fences to userspace. A synchronization point is always only implicitly attached to a buffer object, which is the only thing userspace deals with. In practice, the difference is not big, because userspace can have equally fine-grained control over synchronization by holding on to all the batchbuffer objects - keeping them around until the gpu is done with them won't waste any memory anyway. But the big upside is that when sharing buffers across processes, for example with DRI2 on X or, generally, when using a compositor, there's no need to also share a sync object: Any required sync state comes attached to the buffer object, and the kernel simply Does The Right Thing.
This concludes the part about command submission, active object retiring, and synchronization. In the next installment, we will take a closer look at how the kernel manages the GTT, what happens when we run out of space in the GTT, and how the i915.ko currently handles out-of-memory conditions.
In previous installments of this series, we've looked at how the gpu can access memory and how to submit a workload to the gpu. Now, we will look at some of the corner cases in more detail. See the i915/GEM crashcourse overview for links to the other parts of this series.
GEM Memory Management Details
First, we will look at the details of managing the gtt address space. For that, the drm/i915 driver employs the drm_mm helper functions, which serve as a chunk allocator for an address space of fixed size. It allows constraining the alignment of an allocated block and also supports crazy gpu-specific guard-page constraints through an opaque per-allocation-block color attribute and a callback. Our driver uses that to make sure that so-called snoopable and uncached regions on older generations are well separated (by one page), because the hardware prefetcher of some chip functions will fall over if it encounters the wrong type of memory while prefetching into a subsequent page.
Now let's look at how GEM handles memory management corner cases. There are two important ways of running out of resources in GEM: There could be not enough space in the gtt for a given batch, which is the ENOSPC case in the code. Or, we could run out of system memory (ENOMEM). One of the nice things about i915 GEM is that we use the kernel's shmemfs implementation to allocate backing storage for gem objects. The big benefit is that we get swap handling for free. A bit of a downside is that we don't have much control over how and when swapping happens, and also not really any control over how the backing storage pages get allocated. These downsides regularly result in some noise on IRC channels about implementing a gemfs to fix these deficiencies, but no patches have been merged so far.
Caching GEM Buffer Objects
The first complication arises because the GEM driver stack has a lot of buffer object caches. The most important object cache is the userspace cache in libdrm. Because setting up a gem object is expensive (allocating the shmemfs backing storage, but also cache-flushing operations on some generations and setting up memory mappings on first use), it's advisable to keep objects around for reuse. And submitting workloads to the gpu tends to require tons of temporary buffers to upload data and assorted gpu-specific state like shaders, so those buffers get recycled quickly.
Now, if we run out of memory, it would be good if the kernel could just drop the backing storage for these buffers on the floor, instead of wasting time trying to swap their data out (or even failing to allocate memory, resulting in OOM-induced hilarity). So, gem has a gem_madvise ioctl, which allows a userspace cache to tell the kernel when a buffer is only kept around opportunistically, with the I915_MADV_DONTNEED flag. The kernel is then allowed to reap the backing storage of such marked objects anytime it pleases. It can't destroy the objects themselves, though, because userspace still has a handle to them. Once userspace wants to reuse such a buffer, it must tell the kernel by setting the I915_MADV_WILLNEED flag through the madvise ioctl. The kernel then confirms whether the object's backing storage is still around or whether it has been reaped. In the latter case, userspace frees the object handle to also release the storage occupied by it in the kernel. If the userspace cache encounters such a reaped object, it also walks all its cache buckets to free any other reaped objects.
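The madvise protocol between a userspace cache and the kernel can be sketched with a toy model; all names here are made up, and the real interface is the gem_madvise ioctl described above:

```c
#include <assert.h>

enum madv { WILLNEED, DONTNEED };

struct cached_bo {
    enum madv madv;
    int has_pages;   /* backing storage still allocated? */
};

/* userspace side: park a buffer in the cache */
void cache_put(struct cached_bo *bo)
{
    bo->madv = DONTNEED;   /* kernel may now reap the pages */
}

/* kernel side, under memory pressure: reap DONTNEED objects */
void shrink(struct cached_bo *bos, unsigned count)
{
    unsigned i;

    for (i = 0; i < count; i++)
        if (bos[i].madv == DONTNEED)
            bos[i].has_pages = 0;
}

/* userspace side: take a buffer back out of the cache; returns 0
 * if the kernel reaped it, so the handle must be freed instead */
int cache_get(struct cached_bo *bo)
{
    bo->madv = WILLNEED;
    return bo->has_pages;
}
```

The key property is that the object handle survives reaping, so userspace never races the kernel over ownership; it just learns after the fact whether the pages are gone.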
The kernel itself also tries to keep objects bound in the gtt for as long as possible. Any object not used by the gpu is on the inactive_list. This is the list the kernel scans in LRU order when it needs to kick objects out of the gtt to free space for a different set of objects (for the ENOSPC case, for example). In addition, the kernel also keeps a list of objects which are not bound, but for which the backing storage is still pinned (and so not allowed to be swapped out), on the unbound_list. This mostly serves to mitigate costly cache-flushing operations, but the pinning of the backing storage is itself not free. We will talk about why cache flushing is required in the coherency section in the next installment of this series.
Running out of gtt space requires us to select some victim buffer objects, wait for all access to the selected objects through the gtt to cease, and then unbind them from the gpu address space to make room for the new objects. So, let's take a closer look at how GEM handles buffer eviction from the gtt when it can't find a suitable free hole. Now, setting up and tearing down pagetable entries isn't free. Buffer objects vary a lot in size, and for some access paths through the gtt, we can only use the comparatively small mappable section of it. So, just evicting objects from the inactive list in LRU order would be really inefficient, because we could end up unbinding lots of unrelated small objects sitting all over the place in the gtt until we accidentally free a suitably large hole for a really big object. And if that big object needs to be in the contended mappable part, things are even worse.
To avoid thrashing the gtt so badly, GEM uses an eviction roster, which is implemented with a few helper functions from the drm_mm allocator library: In a first step, it scans through the inactive list and adds objects to the eviction roster until a big enough free hole can be assembled. Then it walks the list of just-scanned objects backwards to reconstruct the original allocator state and, while doing that, also marks any objects which fall into the selected hole as to-be-freed. In the last step, it unbinds all the objects marked as to-be-freed, creating a suitable hole with the minimal amount of object rebinding.
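The roster idea can be illustrated with a toy model. This is not the real drm_mm scan API, just a simulation on a page-granular array: candidates are added in LRU order until a hole exists, and then only the objects actually intersecting the chosen hole are unbound:

```c
#include <assert.h>

/* gtt[] holds one object id per page (0 == free); lru[] lists ids,
 * least recently used first. Returns the start page of a hole of
 * `need` contiguous pages, or -1 if even evicting everything fails. */
int evict_for_hole(int *gtt, unsigned pages, const int *lru,
                   unsigned nobjs, unsigned need)
{
    unsigned k, i, j, run, start = 0;

    /* grow the candidate set one LRU object at a time */
    for (k = 0; k <= nobjs; k++) {
        run = 0;
        for (i = 0; i < pages; i++) {
            int evictable = (gtt[i] == 0);

            for (j = 0; j < k && !evictable; j++)
                evictable = (gtt[i] == lru[j]);
            if (!evictable) {
                run = 0;
                continue;
            }
            if (run++ == 0)
                start = i;
            if (run == need)
                goto found;
        }
    }
    return -1;

found:
    /* unbind only the candidates that intersect the hole; whole
     * objects get unbound, not just their pages inside the hole */
    for (j = 0; j < k; j++) {
        int hit = 0;

        for (i = start; i < start + need; i++)
            hit |= (gtt[i] == lru[j]);
        if (!hit)
            continue;
        for (i = 0; i < pages; i++)
            if (gtt[i] == lru[j])
                gtt[i] = 0;
    }
    return (int)start;
}
```

The real drm_mm scan avoids the quadratic rescans by walking the allocator's hole list incrementally, but the outcome is the same: the minimal LRU prefix whose eviction produces a suitable hole.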
If no hole can be found by scanning the inactive list, the eviction logic will also scan into the list of objects still being used by the gpu engines. Obviously, it then needs to block for rendering to complete before it can unbind these objects. And if that doesn't work out, GEM returns -ENOSPC to userspace, which indicates a bug in either GEM or the userspace driver part: Userspace must not construct batchbuffers which cannot fit into the GTT (presuming it's the only user of the gpu), and GEM must make sure that it can move anything else out of the way to execute such a giant batchbuffer.
Now, this works if we only need to put a single object into the gtt, for example when serving a pagefault for a gtt memory mapping. But when preparing a batchbuffer for command submission, we need to make sure that all the buffer objects from an execbuf ioctl call are in the gtt. So, we need to make sure that when reserving space for later objects, we don't kick out earlier objects which are also required to run this batchbuffer. To solve this, we need to mark objects as reserved and not evict reserved objects. In i915 GEM, this is done by temporarily pinning objects, which every once in a while leads to interesting discussions, because the semantics of such temporarily pinned buffers get mixed up with the semantics of buffers which are pinned for an indeterminate time, like scanout buffers.
A funny exercise for the interested reader is to come up with optimal reservation schemes that allocate the gtt space for all buffers of an execbuf call at once (instead of buffer-by-buffer). This is mostly relevant on older generations, which sometimes have rather massive alignment constraints on buffers. And then to create clever approximations of such optimal algorithms, which drm/i915 hopefully implements already; otherwise, patches are highly welcome. Another good thing to ponder is the implications of a separate VRAM space (in combination with a GTT), which requires even more costly dma operations to move objects in, but, in return, also has much higher performance (that is, the usual discrete gpu setup). One of the big upsides of a UMA (unified memory architecture) design, as used on Intel integrated graphics, is that there's no headache-inducing need to balance VRAM vs. GTT usage, which would need to take into account dma transfer costs and delays.
Handling low memory situations is first and foremost done by the core vm, which tries to free up memory pages. In turn, the core vm can call down into GEM (through the shrinker interface) to make more space available.
Within the drm/i915 driver, out of memory handling has a few tricks of its own, all of which are due to the rather simplistic locking scheme employed by drm/i915 GEM. We essentially only have one lock to protect the all-important GEM state, the dev->struct_mutex. This means that we need to allocate tons of memory while holding that lock, in turn requiring that our memory shrinker callback only trylocks that all-encompassing mutex, because otherwise we might deadlock. We also need to tell the memory allocation functions through GFP flags that they shouldn't endlessly retry allocations or even launch the OOM killer, but instead fail the allocation if there's no memory and return immediately. Because most likely GEM, itself, is sitting on tons of memory, we then try to unpin a bunch of objects in such failure cases and retry the allocation ourselves. If that doesn't help, or if there's no buffer object that can be unpinned, we give up and return -ENOMEM to userspace. One side effect of the trylock in the shrinker is that if any thread other than the allocator is hogging our GEM mutex, we won't be able to unpin any gem objects, potentially causing an unnecessary OOM situation.
Despite that, this is a rather hackish approach and not really how memory management is supposed to work. It seems to hold up rather well, in reality, though. And recently (for kernel 3.7 and 3.8), Chris Wilson added some additional clever tricks to make it work for a bit longer when the kernel is tight on free memory. But in the longterm, I expect that we need to revamp our locking scheme to be able to reliably free memory in low-memory situations.
Another problem with the current low-memory handling is that the shrinker infrastructure is designed to handle caches of equally-sized (filesystem) objects. It doesn't have any notion of objects of massively different sizes, which we approximate by reporting object counts as an aggregate number of pages. And it also doesn't expect that when a shrinker runs, tons of memory suddenly shows up in the pagecache, which is what happens when we unpin the buffer object backing storage and give control over it back to shmemfs.
In the next installment, we will take a closer look at coherency issues and optimal ways to transfer data between the cpu and gpu.
In the previous installment, we took a closer look at some details of the gpu memory management. One of the last big topics left is all the various caches, both on the gpu (in the render and display blocks) and on the cpu, and what is required to keep the data coherent between them. One of the reasons gpus are so fast at processing vast amounts of data is that caches are managed through explicit instructions (cutting down massively on complexity and delays) and there are also a lot of special-purpose caches optimized for different use-cases. Because coherency management isn't automatic, we will also consider the different ways to move data between different coherency domains and what the respective up- and downsides are. See the i915/GEM crashcourse overview
for links to the other parts of this series.
Interactions with CPU Caches
The first thing to look at is how gpu-side caches work together with the cpu caches. Traditionally, the Intel gfx has been sitting right on top of the memory controller. For maximum efficiency, it therefore did not take part in the coherency protocol with the cpu caches, but had direct read/write access to the underlying memory. To make sure that data written by the cpu could be read by the gpu coherently, we need to flush the cpu caches and, furthermore, the chipset write queue on the separate memory controller. Only once the data is guaranteed to have reached physical memory can it be read by the gpu.
For readbacks, such as after the gpu has rendered a frame, we again need to invalidate all the cpu caches. The gpu writes don't snoop cpu caches, so the cpu cache might still contain stale data. Again, on Intel hardware, this is done by the clflush instruction, which (only on Intel cpus, though) guarantees that clean cachelines will simply be dropped and not written back to memory. This is important, because otherwise data in memory written by the gpu could be overwritten by stale cache contents from the cpu.
Now on newer platforms (Sandybridge and later), where the gpu is integrated with the cpu on the same die, the render portion of the gpu is sitting on top of the last level caches, together with all the cpu cores. Current architecture is that the cpu cores (including L2 caches), gpu (again including gpu specific caches), memory controller, and the big L3 cache are all connected with a coherent ring interconnect. The gpu can profit from the big L3 cache on the die, giving a decent speedup for gpu workloads. So, it is now beneficial for gpu reads/writes to be coherent with all the cpu caches. Explicitly flushing cachelines is cumbersome, and the last-level cache can easily hide the snooping latencies.
The odd thing out on this architecture is the display unit. To preserve the most power, in idle situations, we want to be able to turn off power for cpu cores, the gpu render block, and the entire L3 and interconnect (called uncore). This means the always-on display block can only read directly from memory and can't be coherent with caches. To allow this, we can specify the caching mode in the gtt pagetables, and so mark render targets, which will be used as scanout sources, as uncacheable. But that also implies that they're no longer coherent with cpu cache contents, and so we need to again engage in manual cacheline flushing, like on the platforms without a shared last level cache.
Transferring Data Between CPU and GPU Coherency Domains
The oldest way to move data between the gpu and cpu in the i915 GEM kernel driver was with the set_domain ioctl. If required, it flushed the cpu caches for all the cachelines of the entire memory range of the object. That's really expensive, so we need more efficient ways to move data from/to the gpu, especially on platforms without a coherent last level cache. Note that all objects start out in the cpu domain, because freshly-allocated pages aren't necessarily flushed. So it's important that userspace keeps around a cache of already flushed objects, to amortize the expensive initial cacheline flushing. Also note that the set_domain ioctl is also used to synchronize cpu access with outstanding gpu rendering, and so is still used in that role for coherent objects.
For this, the kernel and older hardware support a snooping mode, where the gpu asks the cpu to clear any dirty cachelines before reading/writing. The upside is that we don't blow through cpu time flushing cachelines, and the gpu seems to be more efficient at moving data out of and into cpu caches, anyway. The downside is, at least on older generations, that we can't use such snooped areas for rendering, but only to offload data transfers from/to the cpu to the gpu. But because there are no downsides for the cpu in accessing snooped gem buffer objects, they are very interesting for mixed software/hardware rendering. Another upside is that uploads can be streamed and done asynchronously, avoiding stalls. Downloads can obviously also be done asynchronously using the gpu, but additional unrelated queued rendering could introduce stalls, because the cpu often needs the data right away.
So the tradeoff in picking the gpu for copying data from/to snoopable buffers, compared with using the cpu to transfer data, is tricky: For uploads, it's usually only worth it to use the gpu to avoid stalls, because our cpu upload paths are fast. For downloads, on platforms without a shared last-level cache, gpu copies using snoopable buffers beat cpu readbacks hands-down.
The i915 GEM driver exposes a set of ioctls, called pwrite and pread, to allow userspace to read/write linear buffers; they try to pick the most efficient way to transfer the data from/to the gpu coherency domain. Later on, we'll look at some of the trade-offs and tricks employed, but for now we'll just discuss how reads/writes which only partially cover a cacheline need to be handled if we write through the cpu mappings. Reads are simple: Every cacheline we touch needs to be flushed before reading data written by the gpu. For writes, we only need to flush before writing if the cacheline is partially covered, since stale data from cpu caches in that cacheline could otherwise be written to memory. For writes covering entire cachelines, we can avoid this, so userspace tries to avoid such partial writes. Obviously, we also need to flush the cacheline after writing to it, to move the data out to the gpu.
All this flushing doesn't apply when the buffer is coherent, and generally on platforms with a last-level cache, writing through the cpu caches is the most efficient way to copy data with the cpu (the gpu can obviously also copy things around, because it doesn't pay a penalty for coherent access on such platforms). Going through the GTT to, for example, avoid manually tiling and swizzling a texture is much slower. The reason for that is that GTT writes land directly in main memory and don't go into the last-level cache (though they do issue a snoop to make sure everything stays coherent), so we're limited by main memory bandwidth and can't benefit from the fast caches.
On platforms without a shared last-level cache, it's a completely different story, though. We have to manually flush cpu caches, which is a rather expensive operation. GTT writes, on the other hand, bypass all caches and land directly in main memory. This means that on these platforms, GTT writes are generally 2-3x faster than writes through the cpu caches, a notch slower if the GTT area written to is tiled. GTT reads are, in all cases, extremely slow because the GTT I/O range is only write-combined, but uncached.
One of the clever tricks that the pwrite ioctl implements is that, on platforms where GTT writes are faster, it tries to first move the buffer object into the mappable part of the GTT (if the object isn't there yet and this can be done without blocking for the gpu). This way, the upload benefits from the faster GTT writes, which easily offsets any costs in unbinding a few other buffer objects and rewriting a bunch of PTEs. Especially because on recent kernels and most platforms, GTT PTEs are mapped with write-combining.
Summarizing the data transfer topic, we either have full coherency (where cpu cached memory mappings are the most efficient way to transfer data, with no explicit copying involved) or we can copy data with the gpu (using snooped buffer objects), or with the cpu (either through memory mappings of the GTT or through the pwrite/pread ioctls). The kernel also supports explicit cpu cache flushing through the set_domain ioctl, which allows userspace to use cpu-cached memory maps, even on platforms without coherency, but they're abysmally slow and so, not used in well-optimized code.
Up to now, we've only looked at cpu caches and presumed that the gpu is an opaque black box. But like already explained, gpus have tons of special-purpose caches and assorted tlbs, which need to be invalidated and flushed. First, we will go through the various gpu caches and look at how and when they're flushed. Then, we'll look at how gpu cache domains are implemented in the kernel.
Display caches are very simple: Data caches are always fifos, so never need to be flushed. And the assorted TLBs are in general invalidated on every vblank. The cpu mmio window into the GTT is also rather simple: No data caches need to be explicitly managed; only the tlb needs to be explicitly flushed when updating PTEs, by writing to a magic register afterwards.
Much more complicated are the various special-purpose caches of the render core. Those all get flushed either explicitly (by setting the right bit in the relevant flush instruction) or implicitly, as a side-effect of batch/ringbuffer commands. Important to note is that most of these caches are not fully coherent (save for the new L3 gpu cache on Ivybridge), so writes do not invalidate cachelines in cpu caches, even when the respective gtt PTE has the snoop/caching bits set. Writes only invalidate the cpu caches once a cacheline gets evicted from the gpu caches. So the command streamer instructions which flush caches are also coherency barriers.
One of the cornerstones of the original i915 GEM design was that the kernel explicitly keeps track of all the caches (render, texture, vertex, ...) an object is coherent in. It separately kept track of read and write domains, to optimize away flushes for read/write caches when no cacheline could be dirty. But the code ended up being complicated and fragile, and thanks to tons of workarounds, we need to flush most caches on many platforms way more often than necessary. Furthermore, userspace inserted its own flushes to support features such as rendering to textures, so the kernel wasn't even tracking properly which caches were clean. On top of that, the explicit tracking of dirty objects on the flushing_list, to coalesce cache flushing operations, added unnecessary delays, hurting workloads with tight gpu/cpu coupling.
In the end, this all ended up being a prime example of premature optimization. So, for most parts, the kernel now ignores the gpu domain values in relocation entries and, instead, unconditionally invalidates all caches before starting a new batch and then flushes all write caches after the batch completes, but before the seqno is written and signaled with the CS interrupt. The big exception is workarounds, such as on Sandybridge, where writes from the command streamer (used mostly for GL queries) need global GTT entries shadowing the PPGTT PTEs. But all the in-kernel complexity to track dirty objects and gpu domains is mostly ripped out. The last remaining thing is to collapse all the different GEM domains into a simple boolean, to track whether an object is cpu or gpu coherent.
This now concludes our short tour of the i915.ko GEM implementation for submitting commands to the gpu. Two bigger topics are not covered though: Resetting the gpu when it's stuck somewhere and how fairness is ensured by throttling command submission. Both are results of the cooperative nature of command submission on current Intel gpus: batchbuffers cannot be preempted. So, interactive responsiveness and fairness requires cooperation from userspace. And a stuck batchbuffer can only be stopped by resetting the entire gpu, due to lack of preemption support.
In both areas, new ideas and improvements are floating around, so I've figured it best to cover them when we get improvements to the gpu reset code or the way we schedule batchbuffers.