C for Metal Development Package

The Intel® C for Metal development package is a software development package for Intel® Graphics Technology. It includes the Intel® C for Metal Compiler, the Intel® C for Metal Runtime, Intel® Media Driver for VAAPI, and reference examples, which can be used to develop applications accelerated by Intel® Graphics Media Accelerator. A typical application contains two kinds of source code: kernel and host. The kernel is written in the Intel® C for Metal language, compiled to a GPU ISA binary by the Intel® C for Metal Compiler, and executed on the GPU. The host manages workloads through the Intel® C for Metal Runtime and the user-mode media driver.

Using Media Walker with Thread Dependence

BY Li Huang ON Jun 13, 2019

Tutorial 4. Using Media Walker with Thread Dependence

Media walker also incorporates a mechanism for launching threads in a certain partial order (a.k.a. setting a thread-dependency pattern).

CalcIntImage in the examples is a good illustration of the thread-dependence pattern provided with media walker. To compute an integral image efficiently, we start the computation at the upper-left corner; every pixel uses the results of its three neighbors: up, left, and up-left. See the example CalcIntImage for details. Here we show only the distinctive parts of the host and device code.
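As a point of comparison, the three-neighbor dependence can be sketched on the CPU with a plain scalar loop. This is only a minimal reference sketch, not the CalcIntImage kernel: the function name `integral_image_ref` and the row-major `std::vector` layout are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for an integral image (hypothetical helper, not part of
// the CM samples). Each output pixel is built from its left, up, and up-left
// neighbors:
//   S(x, y) = I(x, y) + S(x-1, y) + S(x, y-1) - S(x-1, y-1)
// Neighbors that fall outside the image contribute 0.
std::vector<uint32_t> integral_image_ref(const std::vector<uint8_t> &in,
                                         int width, int height) {
    std::vector<uint32_t> out(static_cast<std::size_t>(width) * height, 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            uint32_t left = (x > 0) ? out[y * width + (x - 1)] : 0;
            uint32_t up = (y > 0) ? out[(y - 1) * width + x] : 0;
            uint32_t upleft =
                (x > 0 && y > 0) ? out[(y - 1) * width + (x - 1)] : 0;
            out[y * width + x] = in[y * width + x] + left + up - upleft;
        }
    }
    return out;
}
```

For a 3x2 image with pixels 1..6 in row-major order, the last output value is 21, the sum of all six pixels. The row-major loop order guarantees that the three neighbors are ready before a pixel is computed, which is the same ordering constraint the thread-dependency pattern enforces between GPU threads.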

In this example, we create the thread space as before, then set the dependency pattern. Media walker supports two dependence patterns:

  • Wavefront: thread(x, y) depends on thread(x-1, y) and thread(x, y-1)
  • Wavefront26: thread(x, y) depends on thread(x+1, y-1) and thread(x-1, y)

In this case, we use the wavefront pattern.
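To see what launch order each pattern implies, the sketch below computes the earliest "step" at which every thread could start, taking the dependency offsets exactly as listed above. It is a host-side simulation under stated assumptions, not a CM runtime call: `Offset`, `kWavefront`, `kWavefront26`, and `earliest_step` are hypothetical names, and it assumes every thread takes one step.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// (dx, dy) offset from a thread to a thread it depends on.
using Offset = std::pair<int, int>;

// Offsets taken from the list above (illustrative names, not runtime enums).
const std::vector<Offset> kWavefront = {{-1, 0}, {0, -1}};
const std::vector<Offset> kWavefront26 = {{-1, 0}, {1, -1}};

// Returns, for each thread in a width x height space (row-major), the
// earliest step at which it can run if every thread takes one step and must
// wait for all of its in-range dependencies.
std::vector<int> earliest_step(int width, int height,
                               const std::vector<Offset> &deps) {
    std::vector<int> step(static_cast<std::size_t>(width) * height, 0);
    // Row-major traversal is valid here because every offset points either
    // to a previous row or to an earlier column in the same row.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int s = 0;
            for (const Offset &d : deps) {
                int nx = x + d.first;
                int ny = y + d.second;
                if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
                s = std::max(s, step[ny * width + nx] + 1);
            }
            step[y * width + x] = s;
        }
    }
    return step;
}
```

With these offsets, the wavefront pattern gives step x + y, so threads on the same 45-degree anti-diagonal can run together, while the wavefront26 offsets give step x + 2y, a shallower diagonal.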


// Creates a CmThreadSpace object.
// There are two usage models for the thread space. One is to define the
// dependency between threads to run in the GPU. The other is to define a
// thread space where each thread can get a pair of coordinates during
// kernel execution. For this example, we use both usage models.
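CmThreadSpace *thread_space = nullptr;
// device, thread_width, and thread_height are assumed to be set up earlier
// in the host code.
device->CreateThreadSpace(thread_width, thread_height, thread_space);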

// Selects thread dependency pattern.
// 45-degree wave-front dependency: every block needs
// results from its three neighbors: up, left, and up-left.
thread_space->SelectThreadDependencyPattern(CM_WAVEFRONT);


Pay attention to several CM primitives related to the implementation of thread dependency: cm_wait(), cm_fence(), and cm_signal().

// Calculate Integral Image:
// every output pixel is the summation of all pixels in the sub-image
// from (0, 0) to (x, y). Define the equation in the recursive form
//    S(x, y) = I(x, y) + S(x-1, y) + S(x, y-1) - S(x-1, y-1)
// This example does it in 16x16 block fashion in order to fully
// utilize the function units and registers of every GEN execution unit
extern "C" _GENX_MAIN_
void CalcIntImage(SurfaceIndex bufin,
                  SurfaceIndex bufout)
{
    uint rd_h_pos = get_thread_origin_x() * 16;
    uint wr_h_pos = get_thread_origin_x() * 64;//rd_h_pos * 4;
    uint rd_v_pos = get_thread_origin_y() * 16;
    uint wr_v_pos = get_thread_origin_y() * 16;//rd_v_pos;

    matrix<uint, 16, 1> v_16_vert;
    vector<uint, 16> v_16_hori;
    matrix<uchar, 16, 16> read_in_matrix;
    matrix<uint, 16, 16> calc_matrix;
    vector<uint, 1> upleft;

    // read the 16x16 pixel block from input image
    read(bufin, rd_h_pos, rd_v_pos, read_in_matrix);
    int wr_h_pos_minus_four = wr_h_pos - 4;
    int rd_v_pos_minus_one = rd_v_pos - 1;

    // wait for the up-thread, left thread, and the up-left thread to finish
    cm_wait();

    // get the sum from the left-neighbor
    if(rd_h_pos == 0)
        v_16_vert = 0;
    else
        read(bufout, wr_h_pos_minus_four, rd_v_pos, v_16_vert);

    // get the sum from the up-neighbor
    if(rd_v_pos == 0)
        v_16_hori = 0;
    else {
        read(bufout, wr_h_pos, rd_v_pos_minus_one,
             v_16_hori.select<8,1>(0));
        read(bufout, wr_h_pos+32, rd_v_pos_minus_one,
             v_16_hori.select<8,1>(8));
    }

    // get the sum from the up-left corner
    if(rd_v_pos != 0 && rd_h_pos != 0){
        read(bufout, wr_h_pos_minus_four, rd_v_pos_minus_one, upleft);
    }
    else
        upleft = 0;

    // compute the output for this 16x16 pixel block
    acc_matrix16x16 (read_in_matrix, calc_matrix, v_16_vert, v_16_hori, upleft);

    // write the output
    write(bufout, wr_h_pos, wr_v_pos, calc_matrix.select<8,1,8,1>(0,0));
    write(bufout, wr_h_pos+32, wr_v_pos, calc_matrix.select<8,1,8,1>(0,8));
    write(bufout, wr_h_pos, wr_v_pos+8, calc_matrix.select<8,1,8,1>(8,0));
    write(bufout, wr_h_pos+32, wr_v_pos+8, calc_matrix.select<8,1,8,1>(8,8));

    // cm_fence makes sure the writes are truly finished (in the memory)
    cm_fence();
    // inform dependents
    cm_signal();
}
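
The block-level data flow of the kernel can be mimicked on the CPU. The sketch below is an assumption-laden illustration, not the sample's acc_matrix16x16: `kBlock`, `process_block`, and `integral_image_blocked` are hypothetical names, the block size is 4 instead of 16 to keep it short, and the image dimensions are assumed to be multiples of the block size. Each block reads the finished left column, top row, and up-left value from the output, and blocks are visited one anti-diagonal at a time, which is a sequential stand-in for the wavefront launch order.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlock = 4;  // illustrative block size; the kernel uses 16

// Computes one block of the integral image. Border sums come from blocks
// that are assumed to be finished already (left, up, and up-left).
void process_block(const std::vector<uint8_t> &in, std::vector<uint32_t> &out,
                   int width, int bx, int by) {
    const int x0 = bx * kBlock;
    const int y0 = by * kBlock;
    uint32_t left[kBlock] = {0};  // S(x0-1, y0+j), 0 at the image edge
    uint32_t top[kBlock] = {0};   // S(x0+i, y0-1), 0 at the image edge
    uint32_t upleft = 0;          // S(x0-1, y0-1), 0 at the image edge
    if (bx > 0)
        for (int j = 0; j < kBlock; ++j)
            left[j] = out[(y0 + j) * width + x0 - 1];
    if (by > 0)
        for (int i = 0; i < kBlock; ++i)
            top[i] = out[(y0 - 1) * width + x0 + i];
    if (bx > 0 && by > 0) upleft = out[(y0 - 1) * width + x0 - 1];

    uint32_t local[kBlock][kBlock];  // integral image restricted to the block
    for (int j = 0; j < kBlock; ++j) {
        for (int i = 0; i < kBlock; ++i) {
            uint32_t l = (i > 0) ? local[j][i - 1] : 0;
            uint32_t u = (j > 0) ? local[j - 1][i] : 0;
            uint32_t ul = (i > 0 && j > 0) ? local[j - 1][i - 1] : 0;
            local[j][i] = in[(y0 + j) * width + x0 + i] + l + u - ul;
            // Add the sums left of and above the block; their overlap
            // (the up-left region) was counted twice, so subtract it once.
            out[(y0 + j) * width + x0 + i] =
                local[j][i] + left[j] + top[i] - upleft;
        }
    }
}

// Visits blocks one anti-diagonal (bx + by == wave) at a time, so every
// block's left, up, and up-left neighbors are complete before it runs.
std::vector<uint32_t> integral_image_blocked(const std::vector<uint8_t> &in,
                                             int width, int height) {
    std::vector<uint32_t> out(static_cast<std::size_t>(width) * height, 0);
    const int bw = width / kBlock;
    const int bh = height / kBlock;
    for (int wave = 0; wave <= bw + bh - 2; ++wave) {
        for (int by = 0; by < bh; ++by) {
            int bx = wave - by;
            if (bx < 0 || bx >= bw) continue;
            process_block(in, out, width, bx, by);
        }
    }
    return out;
}
```

For an 8x8 image of all ones, the result at (x, y) is (x+1)*(y+1), so the bottom-right value is 64. On the GPU, blocks within one anti-diagonal run concurrently, and cm_wait(), cm_fence(), and cm_signal() take the place of the sequential wave loop.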