# How modern OpenGL and GPUs work

BY Martina Kollarova ON Feb 07, 2017

This article is for you if you have been programming in OpenGL* and it still feels like magic, or if you previously used old-style legacy OpenGL and wonder why everything is suddenly so complicated in modern OpenGL.

People usually learn GPU programming either by a practical approach, where you learn a couple of commands to show some triangles, or via the math route of linear algebra. I found both approaches to be very confusing, because I couldn't imagine what actually happens in the hardware, until I learned a little about General-Purpose GPU (GPGPU) programming. Therefore, I will introduce a couple of basic concepts from OpenCL* (Open Computing Language), although I won’t go into too many details.

You might think of a GPU as something that is used to show triangles on the screen. That's what it originally was, but it's useful to first look at how GPUs can be used for general computation, and only then consider how GPUs specialize for graphics. Think of the GPU as an extra CPU that happens to be better suited for parallel computations and has some extra memory. Forget all about "graphics" stuff for now. You can do the same kinds of computations on a GPU as you would on a CPU; after all, it's Turing-complete. The first reason we don't usually do so is that the CPU is faster at some things (complicated control flow) while the GPU is faster at others (lots of data that all needs the same kind of computation done on it). Another reason is that you need to transfer your data from RAM to the GPU's memory, which often means that you will spend more time uploading data than actually computing anything.

Your main workflow on the GPU consists of these three steps:

1. Upload[1] data into the GPU memory.
2. Do some computation on the data.
3. Read the results (for example, print them or visualize them on the screen).

The first step is done in OpenGL by a function called glBufferData[2] (or the more specialized glTexImage2D[3], which is meant for textures). You will also have to tell OpenGL in which variable this data should be accessible by the shader and how the data should be interpreted (for example: pairs or triplets, floating point or integer). Most of the data in a typical GPU application only has to be uploaded once. For example, a game would upload the textures and models on startup, and then during gameplay upload only the matrices that represent the camera and the positions of the models. This is also one of the reasons why people have moved away from legacy OpenGL. Legacy OpenGL had to communicate with the GPU all the time, and GPU processing power has improved a lot faster than the bandwidth and latency between the CPU and GPU.
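To make "how the data should be interpreted" a little more concrete, here is a small CPU-side sketch in C. It is not the OpenGL API: `AttributeLayout` and `read_attribute` are hypothetical names that only model the idea of describing a flat buffer as pairs or triplets of floats. In real OpenGL, this kind of description is typically given with `glVertexAttribPointer` after the upload.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of how a flat float buffer is laid out. */
typedef struct {
    size_t components; /* values per element: 2 for pairs, 3 for triplets */
    size_t stride;     /* floats between the starts of consecutive elements */
    size_t offset;     /* floats to skip at the start of the buffer */
} AttributeLayout;

/* Copy element `index` from `buffer` into `out`.
   `out` must have room for layout.components floats. */
void read_attribute(const float *buffer, AttributeLayout layout,
                    size_t index, float *out)
{
    assert(buffer != NULL && out != NULL);

    const float *element = buffer + layout.offset + index * layout.stride;
    for (size_t c = 0; c < layout.components; c++) {
        out[c] = element[c];
    }
}
```

The same uploaded buffer can be read as tightly packed triplets (`{3, 3, 0}`) or, say, as pairs that skip the first float of every triplet (`{2, 3, 1}`). The bytes are uploaded once and only the description changes.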

The second step is done by executing a program on the GPU. This program is called a kernel when using the GPU directly through something like OpenCL, or a shader when we are dealing with OpenGL. The shader is just a specialized kernel where OpenGL defines some of the inputs and outputs by default (for example: gl_FragColor). This program does not necessarily have anything to do with triangles or pixels, although it’s usually used that way in OpenGL.

What happens in the third step depends on what you want to do with the data. If you want to print it on your console, you will have to download[4] it back to the main RAM. Showing the data visually is easier. Since the GPU evolved from the video card connected to the monitor, you usually don’t have to do anything and OpenGL will show the output of the fragment shader on the screen. In an alternative universe where the GPU would be some completely separate processing unit, you would have to copy the output data to the video card.

When programming GPUs, you first have to get used to thinking about parallel execution. The following examples show the conceptual difference between traditional and parallel programming.

Traditional loop:

```
for (int i = 0; i < len; i++) {
    result[i] = a[i] + b[i];
}
```

GPU kernel:

```
unsigned int i = get_global_id(0);
result[i] = a[i] + b[i];
```

You’re probably familiar with the code in the first example. The second one is written in OpenCL (with a couple of details removed). Both of them will calculate the sum of two arrays. You might notice that the second snippet doesn’t have any loop. This is because it is executed as a number of kernel-instances[5] equal to the size of the arrays. Each kernel-instance gets a unique ID using get_global_id, then sums its piece of the array.
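If the missing loop feels like cheating, it may help to see where it went. The following is a CPU emulation in plain C, not real OpenCL host code. The names `AddArgs`, `add_kernel`, and `launch_add` are made up for this sketch. The loop still exists, but it lives in the "launcher" rather than in the kernel, and a real GPU would run many of the iterations at the same time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the buffers a kernel would see in GPU memory. */
typedef struct {
    const float *a;
    const float *b;
    float *result;
} AddArgs;

/* Body of one kernel-instance: it handles only the element at index `id`. */
void add_kernel(size_t id, const AddArgs *args)
{
    args->result[id] = args->a[id] + args->b[id];
}

/* CPU emulation of a launch: run one kernel-instance per element.
   A GPU would run many of these at once, in no guaranteed order. */
void launch_add(size_t global_size, const AddArgs *args)
{
    assert(args != NULL);

    for (size_t id = 0; id < global_size; id++) {
        add_kernel(id, args);
    }
}
```

Because no instance reads what another instance writes, the order in which the launcher visits the IDs does not matter, which is what lets the hardware run them in parallel.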

Perhaps you can already imagine how multiplying a big matrix could be done very efficiently using this execution model. Each element can be computed independently from other elements, meaning that there is no need for communication. Multiplying small matrices (at most 4x4) actually has its own instructions, and 4x4 matrices are all that 3D graphics usually require.
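As a rough illustration of that independence, here is a CPU sketch in C where each kernel-instance computes exactly one element of the product. The function names and the row-major layout are choices made for this example only, and a tuned GPU implementation would look quite different.

```c
#include <assert.h>
#include <stddef.h>

/* One kernel-instance: compute a single element of C = A * B,
   where all matrices are n x n and stored row-major. */
void matmul_instance(size_t id, size_t n,
                     const float *a, const float *b, float *c)
{
    size_t row = id / n;
    size_t col = id % n;
    float sum = 0.0f;
    for (size_t k = 0; k < n; k++) {
        sum += a[row * n + k] * b[k * n + col];
    }
    c[id] = sum; /* no other instance writes this element */
}

/* CPU emulation of launching n * n independent instances. */
void matmul(size_t n, const float *a, const float *b, float *c)
{
    assert(a != c && b != c); /* the output must not overwrite an input */

    for (size_t id = 0; id < n * n; id++) {
        matmul_instance(id, n, a, b, c);
    }
}
```

Each instance only reads the inputs and writes its own output element, so no instance ever has to wait for another.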

The following code is a typical simple OpenGL vertex shader:

```
in vec4 vertex;
uniform mat4 matrix;

void main()
{
    gl_Position = matrix * vertex;
}
```

The inputs, `vertex` and `matrix`, are data that we uploaded to the GPU and named. The matrix is a uniform variable, meaning that it will be constant for all vertices. Usually, this would be some camera matrix that represents where the user is looking. We multiply the vertex by the matrix to determine what the actual position of the vertex should be, relative to where the camera is looking. The output gl_Position is specified by default in OpenGL and will be passed on into the rest of the pipeline.
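To see what `matrix * vertex` computes for a single vertex, here is the same multiplication written out in C. This is only a sketch: `mat4_mul_vec4` and `mat4_translation` are hypothetical helpers, and a plain translation stands in for the camera matrix, which in a real application would usually combine view and projection transforms. The matrix is stored column-major, which is how GLSL lays out a `mat4`.

```c
#include <assert.h>

/* out = m * v, with m stored column-major: m[col * 4 + row]. */
void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    assert(out != v); /* the result must not overwrite the input */

    for (int row = 0; row < 4; row++) {
        out[row] = 0.0f;
        for (int col = 0; col < 4; col++) {
            out[row] += m[col * 4 + row] * v[col];
        }
    }
}

/* Fill m with a matrix that moves points by (tx, ty, tz). */
void mat4_translation(float m[16], float tx, float ty, float tz)
{
    for (int i = 0; i < 16; i++) {
        m[i] = (i % 5 == 0) ? 1.0f : 0.0f; /* identity: ones on the diagonal */
    }
    m[12] = tx; /* the fourth column holds the translation */
    m[13] = ty;
    m[14] = tz;
}
```

With a translation of (10, 0, -2), the vertex (1, 2, 3, 1) ends up at (11, 2, 1, 1). The trailing 1 in the vertex is what makes the translation column take effect.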

An instance of this shader will execute for each vertex in parallel. This is true even for vertices that will not eventually be shown on the screen, because we only know their actual position after the vertex shader finishes. Since you are only accessing one vertex, you might have noticed some limitations: you can’t communicate between the vertices (for example, you can’t say that I want this vertex to be 10 units next to the previous vertex), and you can’t change the number of vertices. Sometimes you really need to have some sort of communication between the instances. For example, a fragment shader might want to implement a blur, for which it has to know the values of the surrounding pixels. To do this, you need to first run a fragment shader that saves the output into a texture, and then run a second fragment shader on that texture[6].
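The two-pass idea can be sketched on the CPU in C. Everything here is an assumption made for illustration: a one-dimensional "texture", a first pass that just doubles its input, and a three-texel box blur. The point is only that the second pass starts after the first pass has written every texel, so it can safely read its neighbours.

```c
#include <assert.h>
#include <stddef.h>

/* Pass 1: each instance writes one texel (here it simply doubles the input). */
void pass_shade(size_t id, const float *input, float *texture)
{
    texture[id] = 2.0f * input[id];
}

/* Pass 2: each instance averages its texel with both neighbours,
   reading from the finished texture and clamping at the edges. */
void pass_blur(size_t id, size_t width, const float *texture, float *output)
{
    size_t left = (id == 0) ? 0 : id - 1;
    size_t right = (id + 1 == width) ? id : id + 1;
    output[id] = (texture[left] + texture[id] + texture[right]) / 3.0f;
}

/* CPU emulation: pass 2 only begins once pass 1 has completed. */
void render_two_pass(size_t width, const float *input,
                     float *texture, float *output)
{
    assert(texture != output); /* pass 2 must not overwrite what it reads */

    for (size_t id = 0; id < width; id++) {
        pass_shade(id, input, texture);
    }
    for (size_t id = 0; id < width; id++) {
        pass_blur(id, width, texture, output);
    }
}
```

If the blur were folded into the first pass, an instance could read a neighbour that has not been written yet, because instances within one pass run in no guaranteed order.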

## SIMD

The GPU doesn’t have a control unit for each and every kernel-instance. In theory, you could create a GPU by putting together a couple hundred traditional cores, but it would be very expensive. Instead, the GPU makers try to simplify the circuitry by making several kernel-instances share a control unit. In other words, the kernel-instances share an instruction pointer in what we call a thread[7]. A thread executes all of its kernel-instances together by using the Single Instruction, Multiple Data (SIMD) architecture. Instead of multiplying two numbers, for example, SIMD multiplies sixteen pairs of them at once. How many kernel-instances get executed by a single thread depends strongly on the GPU and on the data size. It might be able to run 32 kernel-instances on a thread if you only use a small data type, like a byte, or four kernel-instances per thread if it’s operating on a 32-bit number.

All of this will be mostly invisible to you as you write the code, but can become important once you try to optimize the code. Consider what happens if you put an if-else condition into your kernel/shader.

This is why I said earlier that GPUs are not suited for problems requiring complicated control flow: the kernel-instances in a thread share one instruction pointer, so when they take different branches, the thread cannot execute both branches at once. This is called divergent branching, and it will cause the thread to first compute the data in the first branch until it’s all done, then compute the data in the second branch[8]. If you nest enough if-else conditions inside each other, your program might become unable to take advantage of SIMD, and it might even become slower than just doing it on the CPU.
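A toy model in C can make the cost visible. It assumes a thread of four lanes (`LANES` is an arbitrary choice) and a kernel containing `if (x < 0) y = -x; else y = 2 * x;`. The function `run_thread` is a hypothetical emulation, not how any particular GPU schedules branches. It returns how many branch bodies the whole thread had to step through.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* assumed number of kernel-instances sharing one thread */

/* Emulate one thread running the if-else for all of its lanes.
   Returns the number of branch bodies the thread stepped through. */
int run_thread(const int x[LANES], int y[LANES])
{
    assert(x != NULL && y != NULL);

    int take_if[LANES];
    int any_if = 0;
    int any_else = 0;
    for (int lane = 0; lane < LANES; lane++) {
        take_if[lane] = (x[lane] < 0);
        any_if |= take_if[lane];
        any_else |= !take_if[lane];
    }

    int passes = 0;
    if (any_if) { /* the whole thread walks the "if" body; other lanes are masked off */
        for (int lane = 0; lane < LANES; lane++) {
            if (take_if[lane]) y[lane] = -x[lane];
        }
        passes++;
    }
    if (any_else) { /* then the whole thread walks the "else" body */
        for (int lane = 0; lane < LANES; lane++) {
            if (!take_if[lane]) y[lane] = 2 * x[lane];
        }
        passes++;
    }
    return passes;
}
```

When all four inputs have the same sign, the thread walks through one branch body. As soon as a single lane disagrees, it walks through both, and the masked-off lanes sit idle for part of the time.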

## The pipeline

Originally, GPUs used fixed, special-purpose hardware to calculate things that were needed to get the positions of triangles and make them appear as pixels on the screen; this was called the fixed pipeline. Later GPUs added two programmable units, the vertex shader and the fragment shader, in the middle of this pipeline that could modify the usual data in whatever way the programmer wants. The specialized hardware still did most of it, but now you could create a couple of lighting and shading effects. The vertex shader allowed you to run a program on each vertex, and the fragment shader allowed you to run a program for each pixel. You can create an almost empty vertex and fragment shader, and your OpenGL application will still work.

OpenGL won’t really give you full control over the rendering process. If you'll pardon the simplification, it will run part of the fixed pipeline first, and only then run your vertex shader. Afterwards it runs its own code again, doing things like deciding which vertices are not going to be visible and rasterization (generating pixels out of vertices). Once that finishes, it allows you to change the colors of the pixels in the fragment shader. More recent versions of OpenGL also allow you to run geometry and tessellation shaders, but these are beyond the scope of this article. I recommend looking at the documentation of the OpenGL pipeline made by Khronos and these detailed pipeline schematics.

## Problems suited for the GPU

Shader programs are good in scenarios where you have a lot of data that you want to perform the same operation on. Ideally the pieces of data don't depend on each other.

As an illustration of this, imagine an army of 100,000 soldiers marching towards some goal. The marching band tells the soldiers what to do: go left, go right, stop, etc. Each soldier is independent and only needs to listen to the music to know where to go (let's say they are blindfolded and don't see what the others are doing). This kind of operation is perfect for the GPU.

However, it would be extremely bad to use the GPU to compute long-range combat. Each soldier might affect any other soldier, and the communication overhead would make the GPU useless. A middle-ground scenario would be melee (close-range) combat where the soldiers can be divided into a grid of square areas, and then you compute the combat only locally in that square. However, you still have to compute the edges of those squares where there is interaction between soldiers who are in different squares but are still close to each other. Whether it would be faster to run a simulation of this close-range combat on the CPU or the GPU depends greatly on how well you can section those squares, how many soldiers there are, and how similar the actions of the soldiers are.

A typical computation that suits GPUs is matrix multiplication or addition, because each element can be determined independently of the other results. Any calculation that you can express as a system of linear equations can be transformed into a problem involving matrices. Another example is fluid simulation, which is conceptually similar to our close-range combat analogy.

## Why shaders are compiled at run-time

When we talked about the basic three steps any GPU program has to go through, I omitted the part about compiling shaders. This was a question that bothered me as a beginner in GPU programming—why do we add shaders as text files and load them on the fly? Why can’t we compile them and ship them together with the rest of the binaries?

Well, remember how you have to compile different binaries for x86 and ARM? Imagine that instead you have to compile different versions for each GPU type (for example, x86-skylake and x86-haswell). You'd end up with hundreds of different binaries. There is no standard GPU architecture to which you can compile the code because there are hundreds of GPU architectures. This makes dealing with shaders more difficult, but allows GPUs to improve faster than CPUs because GPUs don’t have to deal with backward-compatibility as much.

Some systems partially solve this problem by precompiling the shaders into intermediate bytecode, notably DirectX and Vulkan, which makes it easier for the drivers to translate the shaders into their machine code. However, OpenGL did not support this when this article was written (SPIR-V shader input only became part of core OpenGL with version 4.6), so your shaders will typically have to get compiled during your application’s startup.
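In practice, "shipping shaders as text" mostly means that your application reads a file into a string at startup. Below is a minimal sketch of that step in C. `read_shader_source` is a hypothetical helper, not part of OpenGL. The string it returns is the kind of thing you would then pass to `glShaderSource` and compile with `glCompileShader`, which are left out here because they need a live OpenGL context.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a whole text file into a newly allocated, NUL-terminated string.
   Returns NULL on failure; the caller frees the result. */
char *read_shader_source(const char *path)
{
    assert(path != NULL);

    FILE *file = fopen(path, "rb");
    if (file == NULL) {
        return NULL;
    }

    char *source = NULL;
    long size = -1;
    if (fseek(file, 0, SEEK_END) == 0) {
        size = ftell(file);
    }
    if (size >= 0 && fseek(file, 0, SEEK_SET) == 0) {
        source = malloc((size_t)size + 1);
    }
    if (source != NULL) {
        size_t read = fread(source, 1, (size_t)size, file);
        source[read] = '\0'; /* terminate after the bytes actually read */
    }
    fclose(file);
    return source;
}
```

The driver on the user's machine then turns that text into machine code for whatever GPU happens to be installed.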

## Conclusion

This article is basically an explanation of things that confused me the most when I was learning OpenGL a while ago and kept feeling like I had no idea what I was doing. Since I can’t send it back in time to myself, I hope at least that you can benefit from this article. If you want to know more about the details of how GPUs work, I recommend reading the architecture documentation on Skylake GPUs, the Intel Programmer's Reference Manual (the older ones are a bit easier to digest), or for a more general overview, Section IV of GPU Gems 2.

## Footnotes

1. It’s also possible to map the memory between the main RAM and GPU memory for direct access.
2. You could also use the shared memory variants glMapBuffer or glMapBufferRange.
3. The shared memory alternative is, for example, EGLImageTargetTexture2DOES.
4. Or just read it directly, if you have mapped the memory between the RAM and GPU using, for example, glMapBuffer.
5. A kernel-instance is also sometimes called a work-item, or thread (which makes it a bit confusing, as you will see later).
6. You can look at the Postprocessing section in the Framebuffers tutorial for examples.
7. Terminology between GPU vendors can be a little confusing. When Intel says “thread”, they mean a group of data being computed together with a single instruction, so each thread has its own instruction pointer. Meanwhile, Nvidia calls this a “warp”, and the unit representing a single piece of data (for example, a vertex) is called a thread. A warp has its own instruction pointer, while a thread doesn’t. This is also why Intel calls its execution model SIMD (Single Instruction, Multiple Data), while Nvidia calls its own model SIMT (Single Instruction, Multiple Threads). SIMD and SIMT aren’t equal to each other.
8. The exact order of how branches get executed and how they are later joined together is actually rather complicated and depends on the GPU.