
Sharing CPU and GPU buffers on Linux*

By 01 Staff, May 12, 2016


On Intel® Architecture (IA), the CPU and GPU share physical memory through a sophisticated cache hierarchy, a key feature for using graphics textures efficiently. During the past year, the Intel Open Source Technology Center (OTC) has been leveraging this hardware feature on Chrome* OS using a technique called zero-copy texture upload. This resulted in performance improvements, reduced memory usage, and power savings when displaying web content, and it is now planned to be integrated into all IA-based Chrome OS devices.

In this article we introduce a new application programming interface (API) designed to share CPU and GPU graphics buffers, so the zero-copy technique can work more efficiently and also scale to other Linux variants. The API will be available in Linux 4.6.

System requirements

The following core requirements were analyzed when designing the new buffer-sharing system:

Simple and vendor-independent API: different hardware platforms present different challenges for graphics developers. Developers want to rely on as few user-space libraries as possible, both to ease development and to reduce the attack surface. In addition, cache coherency requires special attention when buffers are mapped by the CPU and GPU simultaneously. We addressed this requirement by exposing a single, vendor-independent API, so operating system (OS) applications can be written against the very same interface regardless of the underlying hardware.

Efficiency and robustness: sharing physical memory across different device domains and processes means that several copies of a given chunk of memory can be avoided. Done efficiently, this significantly reduces the overall resources used by the OS.

Security: we designed the buffer-sharing API with basic sandbox concepts in mind, so malicious applications have less chance of exploiting systems that use it. While buffer allocation may happen in a privileged, secure process, unprivileged processes have permission only to map, read, and write buffers. This suits systems where the privileged process interacts with the OS while the unprivileged process (e.g. a user application) runs with very restricted access inside the sandbox, limiting the severity of bugs.


To address the design requirements mentioned above, we modified the Linux dma-buf API. The most important changes in that kernel submodule are the following:

  • mmap(dma_buf_fd, ...): The ability to map a dma-buf file descriptor of a graphics buffer into userspace and, more importantly, to actually write through the mapped pointer (which was not possible before). It’s worth noting that the Direct Rendering Manager (DRM) and the hardware driver implementation are fundamental to safely exporting the graphics handle to be mapped.
  • ioctl(dma_buf_fd, DMA_BUF_IOCTL_SYNC, &args): Cache coherency management for cases where the CPU and GPU access the buffer through dma-buf at the same time. Coherency markers, which forward directly to the existing dma-buf device drivers' vfunc hooks, are exposed to userspace through the DMA_BUF_IOCTL_SYNC ioctl and have to be used before and after the mapped area is accessed. This is fundamentally important on hardware architectures where the graphics engine and the CPU cores don't share caches, but it also matters on hardware where the memory hierarchy is (most of the time) coherent. More details can be found in this patch set.


We use intel-gpu-tools to demonstrate the aforementioned API changes. You can see the full example here.

Initializing and exporting the buffer

The idea is to create a buffer object (BO), in this case the framebuffer's, in one process, then export its prime fd and pass it to another process, which in turn uses the fd to map and write.

In the example, the gpu pointer represents the privileged process. It has access to the system graphics routines, such as DRM, display management, and driver accesses. Before run_test is executed, gpu is properly initialized with framebuffer information.

static void run_test(gpu_process_t *gpu)
{
        struct drm_prime_handle args;
        int prime_fd;

        args.handle = gpu->fb.gem_handle;
        args.flags = DRM_CLOEXEC | DRM_RDWR;
        ioctl(gpu->drm_fd, DRM_IOCTL_PRIME_HANDLE_TO_FD, &args);
        prime_fd = args.fd;

        if (fork() == 0) {
                init_renderer(prime_fd, gpu->fb.size, gpu->fb.width,
                              gpu->fb.height);
                /* ... child continues as the unprivileged renderer ... */
        }
}

By using the framebuffer handle (gpu->fb.gem_handle), it’s possible to acquire its dma-buf file descriptor (prime_fd), which will later be shared with the other process, the renderer.

The renderer is basically a regular userspace client: an unprivileged process with limited system access. We initialize it by sharing only the dma-buf file descriptor and the framebuffer dimensions, and then we are almost ready to write to the framebuffer.

static void init_renderer(int prime_fd, int fb_size, int width, int height)
{
        render_process_t render;

        render.prime_fd = prime_fd;
        render.size = fb_size;
        render.width = width;
        render.height = height;
        /* ... */
}

Mapping dma-buf fd

Now we are able to map the exported buffer by simply calling the mmap function on its dma-buf file descriptor.

static void paint(render_process_t *render)
{
    void *frame;

    frame = mmap(NULL, render->size, PROT_READ | PROT_WRITE, MAP_SHARED,
                 render->prime_fd, 0);
    igt_assert(frame != MAP_FAILED);
    /* ... */

Cache control

At this point, we can perform the painting. Special care is needed when accessing the mapped pointer because of possible inconsistencies between the processing devices. Therefore we should always bracket accesses to the pointer with the sync ioctl calls, regardless of whether the hardware running them is cache coherent, as stated in the dma-buf API documentation; otherwise the system may render artifacts on the screen.

static void paint(render_process_t *render)
{
    void *frame;
    rect_t rect;
    int stride, bpp, color = 0xFF;
    int x, y, line_begin;
    struct dma_buf_sync sync_args;

    /* mmap of frame and rect/stride/bpp setup elided (shown above) */

    sync_args.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_RW;
    ioctl(render->prime_fd, DMA_BUF_IOCTL_SYNC, &sync_args);

    for (y = rect.y; y < rect.y + rect.h; y++) {
        line_begin = y * stride / (bpp / 8);
        for (x = rect.x; x < rect.x + rect.w; x++)
            set_pixel(frame, line_begin + x, color, bpp);
    }

    sync_args.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_RW;
    ioctl(render->prime_fd, DMA_BUF_IOCTL_SYNC, &sync_args);
    munmap(frame, render->size);
} // end of paint()

Open issues and conclusion

For example, how does the client know whether the BO was created X-tiled or Y-tiled, and how should it interpret the memory layout when mapping it back? The current dma-buf API has no way to convey these buffer semantics. This is definitely something we will need to address in future work.

In this article we have demonstrated that CPU texturing can be programmed fairly easily using IA. In Chrome OS, we successfully replaced the VGEM buffer sharing system, which was the source of lots of debate and confusion in the DRI and kernel communities, with the new dma-buf API. The implementation of this new API was introduced in the dma-buf submodule and will be available in Linux kernel version 4.6.

About the author

Tiago Vignatti is a programmer in the Open Source Technology Center and has been working with industry-wide open source graphics for almost ten years now. Tiago has influenced the development of Intel platforms for Linux and Mesa* graphics, the X Window and Wayland* systems, and Chrome.