Intel® Graphics for Linux*

Desktop-quality 3D graphics on mobile Linux* devices comes from the Intel Open Source Technology Center, which provides the first open source driver certified by the Khronos Group under the OpenGL ES 3.0 specification.

Linux Graphics Feed Aggregator


Ian Romanick - 20 Nov, 2015 - 11:39am

Yesterday I pushed an implementation of a new OpenGL extension GL_EXT_shader_samples_identical to Mesa. This extension will be in the Mesa 11.1 release in a few short weeks, and it will be enabled on various Intel platforms:

  • GEN7 (Ivy Bridge, Baytrail, Haswell): Only currently effective in the fragment shader. More details below.
  • GEN8 (Broadwell, Cherry Trail, Braswell): Only currently effective in the vertex shader and fragment shader. More details below.
  • GEN9 (Skylake): Only currently effective in the vertex shader and fragment shader. More details below.

The extension hasn't yet been published in the official OpenGL extension registry, but I will take care of that before Mesa 11.1 is released.


Multisample anti-aliasing (MSAA) is a well known technique for reducing aliasing effects ("jaggies") in rendered images. The core idea is that the expensive part of generating a single pixel color happens once. The cheaper part of determining where that color exists in the pixel happens multiple times. For 2x MSAA this happens twice, for 4x MSAA this happens four times, etc. The computation cost is not increased by much, but the storage and memory bandwidth costs are increased linearly.
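
As a rough illustration of that linear storage cost, the following C sketch (with example resolution and format numbers that are not taken from the post) computes the raw sample storage for an RGBA8 color buffer at a few sample counts; real surfaces additionally carry the MCS and alignment padding:

/* Back-of-the-envelope illustration: sample storage for a multisampled
 * RGBA8 color buffer grows linearly with the sample count, while the
 * per-pixel shading cost stays roughly constant. */
#include <stdio.h>

int main(void)
{
    const unsigned width = 1920, height = 1080;   /* example resolution */
    const unsigned bytes_per_sample = 4;          /* RGBA8 */

    for (unsigned samples = 1; samples <= 8; samples *= 2) {
        unsigned long long bytes =
            (unsigned long long)width * height * bytes_per_sample * samples;
        printf("%ux MSAA: %.1f MiB of sample storage\n",
               samples, bytes / (1024.0 * 1024.0));
    }
    return 0;
}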

Some time ago, a clever person noticed that in areas where the whole pixel is covered by the triangle, all of the samples have exactly the same value. Furthermore, since most triangles are (much) bigger than a pixel, this is really common. From there it is trivial to apply some sort of simple data compression to the sample data, and all modern GPUs do this in some form. In addition to the surface that stores the data, there is a multisample control surface (MCS) that describes the compression.

On Intel GPUs, sample data is stored in n separate planes. For 4x MSAA, there are four planes. The MCS has a small table for each pixel that maps a sample to a plane. If the entry for sample 2 in the MCS is 0, then the data for sample 2 is stored in plane 0. The GPU automatically uses this to reduce bandwidth usage. When writing a pixel on the interior of a polygon (where all the samples have the same value), the MCS gets all zeros written, and the sample value is written only to plane 0.
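
A tiny C sketch may make that mapping concrete. This is only a conceptual model of the MCS described above, not the hardware's actual bit-level encoding, and the structure and function names are invented for illustration:

/* Conceptual model: each pixel's MCS entry is a small table mapping a
 * sample index to the plane that holds that sample's color. */
#define NUM_SAMPLES 4

struct pixel_mcs {
    unsigned char plane_for_sample[NUM_SAMPLES];
};

/* Fully covered pixel: the color is written once to plane 0 and the MCS
 * is set to all zeros, so every sample maps to the same stored value. */
static void write_fully_covered(struct pixel_mcs *mcs)
{
    for (int s = 0; s < NUM_SAMPLES; s++)
        mcs->plane_for_sample[s] = 0;
}

/* Reading sample s first consults the MCS to find which plane to fetch. */
static unsigned lookup_plane(const struct pixel_mcs *mcs, int s)
{
    return mcs->plane_for_sample[s];
}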

This does add some complexity to the shader compiler. When a shader executes the texelFetch function, several things happen behind the scenes. First, an instruction is issued to read the MCS. Then a second instruction is executed to read the sample data. This second instruction uses both the sample index and the result of the MCS read as inputs.

A simple shader like

#version 150
uniform sampler2DMS tex;
uniform int samplePos;
in vec2 coord;
out vec4 frag_color;

void main()
{
    frag_color = texelFetch(tex, ivec2(coord), samplePos);
}

generates this assembly

pln(16)   g8<1>F    g7<0,1,0>F    g2<8,8,1>F   { align1 1H compacted };
pln(16)   g10<1>F   g7.4<0,1,0>F  g2<8,8,1>F   { align1 1H compacted };
mov(16)   g12<1>F   g6<0,1,0>F                 { align1 1H compacted };
mov(16)   g16<1>D   g8<8,8,1>F                 { align1 1H compacted };
mov(16)   g18<1>D   g10<8,8,1>F                { align1 1H compacted };
send(16)  g2<1>UW   g16<8,8,1>F
          sampler ld_mcs SIMD16 Surface = 1 Sampler = 0 mlen 4 rlen 8 { align1 1H };
mov(16)   g14<1>F   g2<8,8,1>F                 { align1 1H compacted };
send(16)  g120<1>UW g12<8,8,1>F
          sampler ld2dms SIMD16 Surface = 1 Sampler = 0 mlen 8 rlen 8 { align1 1H };
sendc(16) null<1>UW g120<8,8,1>F
          render RT write SIMD16 LastRT Surface = 0 mlen 8 rlen 0 { align1 1H EOT };

The ld_mcs instruction is the read from the MCS, and the ld2dms is the read from the multisample surface using the MCS data. If a shader reads multiple samples from the same location, the compiler will likely eliminate all but one of the ld_mcs instructions.

Modern GPUs also have an additional optimization. When an application clears a surface, some values are much more commonly used than others. Permutations of 0s and 1s are, by far, the most common. Bandwidth usage can be reduced further by taking advantage of this. With a single bit for each of red, green, blue, and alpha, only four bits are necessary to describe a clear color that contains only 0s and 1s. A special value can then be stored in the MCS for each sample that uses the fast-clear color, and a clear operation that uses a fast-clear compatible color only has to modify the MCS.
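
To make the fast-clear idea concrete, here is a small C sketch of the kind of check a driver might perform; the function names are invented and this is not the actual representation used by Mesa or the hardware:

#include <stdbool.h>

/* A clear color is "fast-clear compatible" if every channel is exactly
 * 0.0 or 1.0, so four bits (one per channel) are enough to describe it. */
static bool clear_color_is_fast_clear_compatible(const float rgba[4])
{
    for (int i = 0; i < 4; i++)
        if (rgba[i] != 0.0f && rgba[i] != 1.0f)
            return false;
    return true;
}

/* Pack the channels into a 4-bit mask: bit 0 = red, ..., bit 3 = alpha. */
static unsigned pack_fast_clear_bits(const float rgba[4])
{
    unsigned bits = 0;
    for (int i = 0; i < 4; i++)
        if (rgba[i] == 1.0f)
            bits |= 1u << i;
    return bits;
}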

All of this is well documented in the Programmer's Reference Manuals for Intel GPUs.

There's More

Information from the MCS can also help users of the multisample surface reduce memory bandwidth usage. Imagine a simple, straightforward shader that performs an MSAA resolve operation:

#version 150
uniform sampler2DMS tex;
#define NUM_SAMPLES 4 // generate a different shader for each sample count
in vec2 coord;
out vec4 frag_color;

void main()
{
    vec4 color = texelFetch(tex, ivec2(coord), 0);

    // sample 0 was already read above, so start the loop at 1
    for (int i = 1; i < NUM_SAMPLES; i++)
        color += texelFetch(tex, ivec2(coord), i);

    frag_color = color / float(NUM_SAMPLES);
}

The problem should be obvious. On most pixels all of the samples will have the same color, but the shader still reads every sample. It's tempting to think the compiler should be able to fix this. In a very simple case like this one, that may be possible, but such an optimization would be both challenging to implement and, most likely, easy to fool.

A better approach is to just make the data available to the shader, and that is where this extension comes in. A new function, textureSamplesIdenticalEXT, allows the shader to detect the common case where all the samples have the same value. The new, optimized shader would be:

#version 150
#extension GL_EXT_shader_samples_identical: enable
uniform sampler2DMS tex;
#define NUM_SAMPLES 4 // generate a different shader for each sample count

#if !defined GL_EXT_shader_samples_identical
#define textureSamplesIdenticalEXT(t, c) false
#endif

in vec2 coord;
out vec4 frag_color;

void main()
{
    vec4 color = texelFetch(tex, ivec2(coord), 0);

    if (!textureSamplesIdenticalEXT(tex, ivec2(coord))) {
        for (int i = 1; i < NUM_SAMPLES; i++)
            color += texelFetch(tex, ivec2(coord), i);

        color /= float(NUM_SAMPLES);
    }

    frag_color = color;
}

The intention is that this function be implemented by simply examining the MCS data. At least on Intel GPUs, if the MCS for a pixel is all 0s, then all the samples are the same. Since textureSamplesIdenticalEXT can reuse the MCS data read by the first texelFetch call, there are no extra reads from memory. There is just a single compare and conditional branch. These added instructions can be scheduled while waiting for the ld2dms instruction to read from memory (slow), so they are practically free.
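
Conceptually, the generated code boils down to a test on the MCS word that was already fetched. The following C sketch only illustrates that check (the helper name is invented); as discussed in the caveats below, the Mesa 11.1 implementation deliberately does not treat the fast-clear encodings as "identical":

#include <stdbool.h>
#include <stdint.h>

/* All-zero MCS data means every sample maps to plane 0, i.e. all samples
 * of this pixel hold the same color. Fast-clear encodings (0x000000ff for
 * 2x, 0xffffffff for 8x, ...) are intentionally not checked here. */
static bool samples_identical_from_mcs(uint32_t mcs_word)
{
    return mcs_word == 0;
}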

It is also tempting to use textureSamplesIdenticalEXT in conjunction with anyInvocationsARB (from GL_ARB_shader_group_vote). Such a shader might look like:

#version 430
#extension GL_EXT_shader_samples_identical: require
#extension GL_ARB_shader_group_vote: require
uniform sampler2DMS tex;
#define NUM_SAMPLES 4 // generate a different shader for each sample count
in vec2 coord;
out vec4 frag_color;

void main()
{
    vec4 color = texelFetch(tex, ivec2(coord), 0);

    if (anyInvocationsARB(!textureSamplesIdenticalEXT(tex, ivec2(coord)))) {
        for (int i = 1; i < NUM_SAMPLES; i++)
            color += texelFetch(tex, ivec2(coord), i);

        color /= float(NUM_SAMPLES);
    }

    frag_color = color;
}

Whether or not using anyInvocationsARB improves performance is likely to be dependent on both the shader and the underlying GPU hardware. Currently Mesa does not support GL_ARB_shader_group_vote, so I don't have any data one way or the other.


The implementation of this extension that will ship with Mesa 11.1 has three main caveats. Each of these will likely be resolved to some extent in future releases.

The extension is only effective on scalar shader units. This means that on GEN7 it is effective only in fragment shaders, and on GEN8 and GEN9 it is effective only in vertex and fragment shaders. It is supported in all shader stages, but in non-scalar stages textureSamplesIdenticalEXT always returns false. The implementation for the non-scalar stages is slightly different, and, on GEN9, the exact set of instructions depends on the number of samples. I didn't think it was likely that people would want to use this feature in a vertex shader or geometry shader, so I just didn't finish the implementation. This will almost certainly be resolved in Mesa 11.2.

The current implementation also returns a false negative for texels fully set to the fast-clear color. There are two problems with the fast-clear color. It uses a different MCS value than the "all plane 0" case, and the size of that value depends on the number of samples. For 2x MSAA, the MCS read returns 0x000000ff, but for 8x MSAA it returns 0xffffffff.

The first problem means the compiler would need to generate additional instructions to check for "all plane 0" or "all fast-clear color." This could hurt the performance of applications that either don't use a fast-clear color or, more likely, later draw non-clear data to the entire surface. The second problem means the compiler would need to do state-based recompiles when the number of samples changes.

In the end, we decided that "all plane 0" was by far the most common case, so we have ignored the "all fast-clear color" case for the time being. We are still collecting data from applications, and we're working on several uses of this functionality inside our driver. In future versions we may implement a heuristic to determine whether or not to check for the fast-clear color case.

As mentioned above, Mesa does not currently support GL_ARB_shader_group_vote. Applications that want to use textureSamplesIdenticalEXT on Mesa will need paths that do not use anyInvocationsARB for at least the time being.


As stated in issue #3, the extension still needs to gain SPIR-V support. This extension would be just as useful in Vulkan and OpenCL as it is in OpenGL.

At some point there is likely to be a follow-on extension that provides more MCS data to the shader in a more raw form. As stated in issue #2 and previously in this post, there are a few problems with providing raw MCS data. The biggest problem is how the data is returned. Each sample count needs a different amount of data. Current 8x MSAA MCS surfaces have 32 bits per pixel, and current 16x MSAA MCS surfaces have 64 bits per pixel. A future 32x MSAA mode, should that ever exist, would need 192 bits. Additionally, there would need to be a set of new texelFetch functions that take both a sample index and the MCS data. This, again, has problems with variable data size.

Applications would also want to query things about the MCS values. How many times is plane 0 used? Which samples use plane 2? What is the highest plane used? There could be other useful queries. I can imagine that a high quality, high performance multisample resolve filter could want all of this information. Since the data changes based on the sample count and could change on future hardware, the future extension really should not directly expose the encoding of the MCS data. How should it provide the data? I'm expecting to write some demo applications and experiment with a bunch of different things. Obviously, this is an open area of research.

Sarah Sharp - 24 Oct, 2015 - 06:49am

By now, most of the Linux world knows I’ve stopped working on the Linux kernel and I’m starting work on the Linux graphics stack, specifically on the userspace graphics project called Mesa.

What is Mesa?

Like the Linux kernel, Mesa falls into the “operating system plumbing” category. You don’t notice it until it breaks. Or clogs, slowing down the rest of the system. Or you need plumbing for your shiny new hot tub, only you’re missing some new essential feature, like hot water or tessellation shaders.

The most challenging and exciting part of working on systems “plumbing” projects is optimization, which requires a deep understanding of both the hardware limitations below you in the stack, and the needs of the software layers above you. So where does Mesa fit into the Linux graphics stack?

Mesa is the most basic part of the userspace graphics stack, sitting above the Linux kernel graphics driver that handles command submission to the hardware, and below 3D libraries like Qt or KDE and game engines like Unity or Unreal. A game engine creates a specialized 3D program, called a shader, that conforms to a particular revision of a graphics spec (such as OpenGL ES 3.0 or 3.1). Mesa takes that shader program and compiles it into graphics hardware commands specific to the system Mesa is running on.
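
For readers new to the stack, here is a minimal C sketch of that hand-off, assuming an OpenGL core-profile context has already been created and the GL entry points are loaded (via GLEW, glad, or GL_GLEXT_PROTOTYPES); the shader source and function name are just examples:

#include <GL/gl.h>
#include <stdio.h>

/* The application hands the driver a GLSL shader as a string;
 * glCompileShader is the point where Mesa's compiler turns it into
 * hardware-specific code. */
static const char *fragment_src =
    "#version 150\n"
    "out vec4 frag_color;\n"
    "void main() { frag_color = vec4(1.0, 0.0, 0.0, 1.0); }\n";

GLuint compile_fragment_shader(void)
{
    GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(shader, 1, &fragment_src, NULL);
    glCompileShader(shader);

    GLint ok = 0;
    glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
    if (!ok) {
        char log[1024];
        glGetShaderInfoLog(shader, sizeof(log), NULL, log);
        fprintf(stderr, "shader compile failed: %s\n", log);
    }
    return shader;
}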

What’s cool about Mesa?

The exciting thing for me is the potential for optimizing graphics around system hardware limitations. For example, you can optimize the compiler to generate fewer graphics commands. In theory, fewer commands means less for the hardware to execute, and thus better graphics performance. However, if each of those commands is really expensive to run on your graphics hardware, that optimization can actually end up hurting performance.

This kind of work is fun for me because I get to touch and learn about hardware without actually writing Verilog or VHDL. I get to make hardware do interesting things, like add a new feature that makes the latest Steam game actually render, or add an optimization that improves the performance of a new open source indie game.

Understand how much you don’t know

Without knowledge of both the hardware below and the software layers above, it's impossible to evaluate which optimizations are useful to pursue. For example, the first time I heard that modern Intel GPUs have a completely separate cache from the CPU, I asked two questions: "What uses that cache?" and "What happens when the cache is full?" The hardware Execution Units (EUs) that execute graphics commands use the cache to store metadata called URBs. My next question, on discovering that URB size could vary and that newer hardware had an ever-increasing maximum URB size, was, "How does the graphics stack pick which URB size to use?" Obviously, picking a smaller URB size means more URBs can fit in the cache, which makes it less likely that an EU will have to stall until there's room in the cache for the URB it needs for the set of graphics commands it's executing. Picking a URB size that is too small, however, would mean programs that use a lot of metadata wouldn't be able to run.

Mesa is used by a lot of different programs with different needs, and each may require different URB sizes. The hardware has to be programmed with the maximum URB size, and changing that requires stopping any executing commands in the graphics pipeline. So you can’t go changing the URB size on the fly every time a new 3D graphics program starts running, or you would risk stalling the whole graphics stack. Imagine your entire desktop system freezing every time you launched a new Steam game.

Asking questions helps you ramp faster

It's my experience with other operating system components that led me to ask questions of my more experienced graphics developer co-workers. I really appreciate working in a community where I know I can ask these kinds of basic questions without fear of backlash at a newcomer. I love working on a team that chats about tech (and non-tech!) over lunch. I love the easy access of the Intel graphics and DRI IRC channels, where people are willing to answer simple questions like "How do I build the documentation?" (Which turns out to be complex when you run into a bug in doxygen that causes an infinite loop and ever-increasing memory consumption until a less-than-capable build box starts to freeze under memory pressure. Imagine what would have happened if the experienced devs had assumed I was a n00b and ignored me.)

My experience with other complex systems makes me understand that the deep, interesting problems can’t be approached without a long ramp up period and lots of smaller patches to gain understanding of the system. I do chafe a bit as I write those patches, knowing the interesting problems are out there, but I know I have to walk before I start running. In the meantime, I find joy in making blog posts about what I’m learning about the graphics pipeline, and I hope we can walk together.


Ian Romanick - 14 Oct, 2015 - 09:18am

In the old days, there were not a lot of users of open-source 3D graphics drivers. Aside from a few id Software games, tuxracer, and glxgears, there were basically no applications. As a result, we got almost no bug reports. It was a blessing and a curse.

Nowadays, thanks to Valve Software and Steam, there are more applications than you can shake a stick at. As a result, we get a lot of bug reports. It is a blessing and a curse.

There's quite a wide chasm between a good bug report and a bad bug report. Since we have a lot to do, the quality of a bug report can make a big difference in how quickly the bug gets fixed. A poorly written bug report may be ignored. A poorly written bug report may require several rounds of asking for more information or clarification. A poorly written bug report may send a developer looking for the problem in the wrong place.

As someone who fields a lot of these reports for Mesa, here's what I look for...

  1. Submit against the right product and component. The component determines the person (or mailing list) that initially receives notification about the bug. If you have a bug in the Radeon driver and only people working on Nouveau get the bug e-mail, your bug probably won't get much attention.

    If you have experienced a bug in 3D rendering, you want to submit your bug against Mesa.

    Selecting the correct component can be tricky. It doesn't have to be 100% correct initially, but it should be close enough to get the attention of the right people. A good first approximation is to pick the driver component that matches the hardware you are running on.

    If you're not sure what GPU is in your system, the command glxinfo | grep "OpenGL renderer string" will help.

  2. Correctly describe the software on your system. There is a field in bugzilla to select the version where the bug was encountered. This information is useful, but it is more important to get specific, detailed information in the bug report.

    • Provide the output of glxinfo | grep "OpenGL version string". There may be two lines of output. That's fine, and you should include both.

    • If you are using the drivers supplied by your distro, include the version of the distro package. On Fedora, this can be determined by rpm -q mesa-dri-drivers. Other distros use other commands. I don't know what they are, but Googling for "how do i determine the version of an installed package on <your distro>" should help.

    • If you built the driver from source using a Mesa release tarball, include the full name of that tarball.

    • If you built the driver from source using the Mesa Git repository, include the SHA of the commit used.

    • Include the full version of the kernel using uname -r. This isn't always important, but it is sometimes helpful.

  3. Correctly describe the hardware in your system. There are a lot of variations of each GPU out there. Knowing exactly which variation exhibited bad behavior can help find the problem.

    • Provide the output of glxinfo | grep "OpenGL renderer string". As mentioned above, this is the hardware that Mesa thinks is in the system.

    • Provide the output of lspci -d ::0300 -nn. This provides the raw PCI information about the GPU. In some cases this may be more detailed (or just plain different) than the output from glxinfo.

    • Provide the output of grep "model name" /proc/cpuinfo | head -1. It is sometimes useful to know the CPU in the system, and this provides information about the CPU model.

  4. Thoroughly describe the problem you observed. There are two goals. First, you want the person reading the bug report to know both what did happen and what you expected to happen. Second, you want the person reading the bug report to know how to reproduce the problem you observed. The person fielding the bug report may not have access to the application.

    • If the application was run from the command line, provide the full command line used to run the application.

      If the application is a test that was run from piglit, provide the command line used to run the test, not the command line used to run piglit. This information is in the piglit results.json result file.

    • If the application is a game or other complex application, provide any information needed to reproduce the problem. Did it happen at a specific location in a level? Did it happen while performing a specific operation? Does it only happen when loading certain data files?

    • Provide output from the application. If the application is a test case, provide the text output from the test. If this output is, say, less than 5 lines, include it in the report. Otherwise provide it as an attachment. Output lines that just say "Fail" are not useful. It is usually possible to run tests in verbose mode that will provide helpful output.

      If the application is a game or other complex application, try to provide a screenshot. If possible, providing both a "good" and a "bad" screenshot is even more useful. Bug #91617 is a good example of this.

      If the bug was a GPU or system hang, provide output from dmesg as an attachment to the bug. If the bug is not a GPU or system hang, this output is unlikely to be useful.

  5. Add annotations to the bug summary. Annotations are parts of the bug summary at the beginning enclosed in square brackets. The two most important ones are regression and bisected (see the next item for more information about bisected). If the bug being reported is a new failure after updating system software, it is (most likely) a regression. Adding [regression] to the beginning of the bug summary will help get it noticed! In this case you also must, at the very least, tell us what software version worked... with all the same details mentioned above.

    The other annotations used by the Intel team are short names for hardware versions (for example, HSW for Haswell or SKL for Skylake). Unless you're part of Intel's driver development or QA team, you don't really need to worry about these.

  6. Bisect regressions. If you have encountered a regression and you are building a driver from source, please provide the results of git-bisect. We really just need the last part: the guilty commit. When this information is available, add bisected to the annotations, so the summary looks like [regression bisected].

    Note: If you're on a QA team somewhere, supplying the bisect is a practical requirement.

  7. Respond to the bug. There may be requests for more information. There may be requests to test patches. Either way, developers working on the bug need you to be responsive. I have closed a lot of bugs over the years because the original reporter didn't respond to a question for six months or more. Sometimes this is our fault because the question was asked months (or even years) after the bug was submitted.

    The general practice is to change the bug status to NEEDINFO when information or testing is requested. After providing the requested information, please change the status back to either NEW or ASSIGNED or whatever the previous state was.

    It is also general practice to leave bugs in the NEW state until a developer actually starts working on it. When someone starts work, they will change the "Assigned To" field and change the status to ASSIGNED. This is how we prevent multiple people from duplicating effort by working on the same bug.

    If the bug was encountered in a released version and the bug persists after updating to a new major version (e.g., from 10.6.2 to 11.0.1), it is probably worth adding a comment to the bug saying "Problem still occurs with <version>." That way we know you're alive and still care about the problem.

Here is a sample script that collects a lot of the requested information. It does not collect information about the driver version other than the glxinfo output; since I don't know whether you're using the distro driver or building one from source, there really isn't any way my script could do that. You'll want to customize it for your own needs.

echo "Software versions:" echo -n " " uname -r echo -n " " glxinfo 2>&1 | grep "OpenGL version string" echo echo "GPU hardware:" echo -n " " glxinfo 2>&1 | grep "OpenGL renderer string" echo -n " " lspci -d ::0300 -nn echo echo "CPU hardware:" echo -n " " uname -m echo -n " " grep "model name" /proc/cpuinfo | head -1 | sed 's/^[^:]*: //' echo

This generates nicely formatted output that you can copy-and-paste directly into a bug report. Here is the output from my system:

Software versions:
    4.0.4-202.fc21.x86_64
    OpenGL version string: 3.0 Mesa 11.1.0-devel

GPU hardware:
    OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 5500 (Broadwell GT2)
    00:02.0 VGA compatible controller [0300]: Intel Corporation Broadwell-U Integrated Graphics [8086:1616] (rev 09)

CPU hardware:
    x86_64
    Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz

Daniel Vetter - 18 Sep, 2015 - 08:27am
I gave a talk at XDC 2015 about atomic modesetting with a focus on driver writers. Most of the talk is an overview of how an atomic modeset looks and how to implement the different parts in a driver backend. Anyway, for all those who missed it, there's a video and slides.

Daniel Vetter - 07 Sep, 2015 - 02:40am
Kernel 4.2 is already released and the 4.3 merge window is in full swing, so it's time to look at what's in it for the Intel graphics driver.

Biggest thing for sure is that Skylake is finally out of preliminary support and enabled by default. The reason for the long hold-up was an ABI fumble: the hardware exposes the topmost plane both through the new universal plane registers and through the legacy cursor registers, and because we simply carried the legacy plane code around in the driver, we ended up exposing both. This wasn't a big thing to take care of, but somehow it dragged on forever.

The other big thing is that legacy modesets are now done with the new atomic modesetting code driver-internally. Atomic support in i915.ko isn't fully ready for prime time yet, but this is definitely a big step forward. Besides atomic, there are other cross-platform improvements in the modeset code: Ville fixed up 12bpc support for HDMI, which is now used by default if the screen supports it. Mika Kahola and Ville also implemented dynamic adjustment of the cdclk, which is the main clock source for the display engines on Intel graphics; there's a big difference in the clock speeds needed between, for example, a 4K screen and a 720p TV.

Continuing with power-saving features, Rodrigo again spent a lot of time fixing up PSR (panel self refresh), and Paulo did the same by writing patches to improve FBC (framebuffer compression). We have some really solid test cases by now; unfortunately, neither feature is ready to be enabled by default yet. PSR in particular is still plagued by screen freezes on some random systems. There have also been some fixes to DRRS (dynamic refresh rate switching) from Ramalingam. DRRS is already enabled by default where supported. And finally there are some improvements to make the frontbuffer rendering tracking more accurate, which is used by all three of these display power-saving features.

And of course there's also tons of improvements to platform code. Display PLL code for Skylake and Valleyview/Cherryview was tuned by Damien and Ville respectively. There's been tons of work on Broxton and DSI support by Imre, Gaurav and others.

Moving on to the rendering side, the big change is how tracking of rendering tasks is handled. In the past the driver just used raw sequence numbers emitted by the hardware, but for cross-driver synchronization and for reordering tasks with an eventual GPU scheduler, more abstraction is needed. A big step is converting over to the i915 request structure completely, done by John Harrison. The next step will be to switch the internal implementation of i915 requests to the cross-driver fences, but that's for future kernels. As a follow-up cleanup, John also removed the OLR, which stands for outstanding lazy request. It was a neat little trick implemented years ago to simplify error-recovery handling, but it caused tons of pain with subtle bugs. Making requests more explicit in the driver finally allowed us to remove this trick.

There's also been a pile of platform-related features: MOCS programming for Skylake/Broxton (which is used for caching control), resource streamer support from Abdiel (which is used to offload some of the buffer-object tracking for shaders from the CPU to the GPU), and an extension of the command parser on Haswell to support atomic instructions in shaders. And finally, for Skylake, Mika Kuoppala added code to avoid resetting the GPU in certain cases where the hardware would hard-hang the entire system trying to execute the reset. A dead GPU is still better than a dead system.

Daniel Vetter - 16 Aug, 2015 - 06:52am
After a few years of development, the atomic display update IOCTL for DRM drivers is finally ready for prime time with the 4.2 pull request from Dave Airlie. It's been a long road: a lot of drivers are already converted over to atomic, even more are in progress, and the atomic helper libraries and support code in the DRM subsystem are sufficiently polished. But what's really been missing is a design overview of what the overall atomic infrastructure looks like and why some decisions and details are implemented the way they are.

That's now done and published on LWN: Part 1 talks about the problem space, issues with the Android atomic display framework, and the basic atomic IOCTL interface. Part 2 goes into more detail about a few specific things like locking, helper library design, and the exact semantics of atomic modesetting updates. Happy reading!