C for Metal Development Package

The Intel® C for Metal development package is a software development package for Intel® Graphics Technology. It includes the Intel® C for Metal Compiler, the Intel® C for Metal Runtime, the Intel® Media Driver for VAAPI, and reference examples, which can be used to develop applications accelerated by Intel® Graphics Media Accelerator. A typical application contains two kinds of source code: kernel and host. The kernel is written in the Intel® C for Metal language, compiled to a GPU ISA binary by the Intel® C for Metal Compiler, and executed on the GPU. The host manages workloads through the Intel® C for Metal Runtime and the user-mode media driver.

Basic Host Programming

BY Li Huang ON Jun 13, 2019

Tutorial 1. Basic Host Programming

Most of the code in these tutorials is extracted from our open examples.

This tutorial shows sample host code that uses the CM runtime API directly. The sample code may look a little verbose; developers can build a higher-level utility library on top of the CM runtime to make their own code more concise.

Step 1. Create CM Device

  // Creates a CmDevice from scratch.
  // Param device: pointer to the CmDevice object.
  // Param version: CM API version supported by the runtime library.
  CmDevice *device = nullptr;
  unsigned int version = 0;
  cm_result_check(::CreateCmDevice(device, version));
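The snippets in this tutorial wrap every runtime call in cm_result_check. As a rough illustration of the kind of helper a higher-level utility layer might provide, here is a stand-alone sketch. It is not the helper shipped with the open examples: the names check_result, CHECK_RESULT, kSuccess, always_ok and always_fails are hypothetical, and a plain int stands in for the runtime's result type.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the runtime's success code. The real value is
// defined by the CM runtime headers.
constexpr int kSuccess = 0;

// Throws std::runtime_error when `result` is not kSuccess, recording the
// failing expression and the source location of the call.
inline void check_result(int result, const char *expr, const char *file,
                         int line) {
  if (result != kSuccess) {
    std::ostringstream msg;
    msg << file << ":" << line << ": " << expr << " failed with code "
        << result;
    throw std::runtime_error(msg.str());
  }
}

// Convenience macro that captures the call text and source location.
#define CHECK_RESULT(call) check_result((call), #call, __FILE__, __LINE__)

// Hypothetical functions used only to exercise the helper.
inline int always_ok() { return kSuccess; }
inline int always_fails() { return -1; }
```

With this sketch, CHECK_RESULT(always_ok()) returns normally, while CHECK_RESULT(always_fails()) throws an exception whose message names the failing call. Throwing rather than exiting is a design choice of the sketch; it lets the caller decide how to clean up.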

Step 2. Load Program

  • CM compilation happens in two stages
    • Offline: cmc compiles CM to virtual ISA
    • Just-In-Time: virtual ISA to target ISA
  • LoadProgram: loads virtual ISA into the runtime
    • JIT compilation happens during LoadProgram

  // The file linear_walker_genx.isa is generated when the kernels in the file
  // linear_walker_genx.cpp are compiled by the CM compiler.
  // Reads in the virtual ISA from "linear_walker_genx.isa" to the code
  // buffer.
  std::string isa_code = cm::util::isa::loadFile("linear_walker_genx.isa");
  if (isa_code.size() == 0) {
    std::cerr << "Error: empty ISA binary.\n";
    exit(1);
  }

  // Creates a CmProgram object consisting of the kernels loaded from the code
  // buffer.
  // Param isa_code.data(): Pointer to the code buffer containing the virtual
  // ISA.
  // Param isa_code.size(): Size in bytes of the code buffer containing the
  // virtual ISA.
  CmProgram *program = nullptr;
  cm_result_check(device->LoadProgram(const_cast<char *>(isa_code.data()),
                                      isa_code.size(),
                                      program));
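The step above reads the offline-compiled virtual ISA into a std::string before handing it to the runtime. The following is a minimal sketch of how such a loader could be written with the standard library alone. It is not the implementation of cm::util::isa::loadFile; the function name load_binary_file is hypothetical.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Reads the whole file at `path` into a string, byte for byte.
// Returns an empty string if the file cannot be opened, which matches the
// "empty ISA binary" check used in the tutorial.
inline std::string load_binary_file(const std::string &path) {
  // Binary mode avoids newline translation on platforms that perform it.
  std::ifstream in(path, std::ios::binary);
  if (!in) {
    return std::string();
  }
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}
```

Because a std::string can hold embedded zero bytes, the buffer's size() remains the correct byte count to pass along with data().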

Step 3. Create Kernel

Retrieve the target binary of a kernel from a loaded program.

  // Creates the linear kernel.
  // Param program: CM Program from which the kernel is created.
  // Param "linear": The kernel name which should be no more than 256 bytes
  // including the null terminator.
  CmKernel *kernel = nullptr;
  cm_result_check(device->CreateKernel(program, "linear", kernel));
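The comment above limits the kernel name to 256 bytes including the null terminator. A host-side utility could check this before creating the kernel. The sketch below is only an illustration of that limit; kernel_name_fits and kMaxKernelNameBytes are hypothetical names, not part of the CM runtime.

```cpp
#include <cstddef>
#include <string>

// Limit stated in the tutorial: 256 bytes including the null terminator.
constexpr std::size_t kMaxKernelNameBytes = 256;

// Returns true when `name` plus its null terminator fits within the limit.
inline bool kernel_name_fits(const std::string &name) {
  return name.size() + 1 <= kMaxKernelNameBytes;
}
```

Under this check, "linear" and any name of up to 255 characters pass, while a 256-character name does not.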

Step 4. Create Surfaces

  // Creates the input surface with the given width and height in pixels and
  // the given format.
  // Sets surface format as CM_SURFACE_FORMAT_A8R8G8B8. For this format, each
  // pixel occupies 32 bits.
  // The input image is RGB format with 24 bits per pixel, and the surface
  // format is A8R8G8B8 with 32 bits per pixel. Therefore, the surface width
  // is (width*3/4) in pixels.
  CmSurface2D *input_surface = nullptr;
  cm_result_check(device->CreateSurface2D(width*3/4,
                                          height,
                                          CM_SURFACE_FORMAT_A8R8G8B8,
                                          input_surface));

  // Copies system memory content to the input surface using the CPU. The
  // system memory content is the data of the input image. The size of data
  // copied is the size of data in the surface.
  cm_result_check(input_surface->WriteSurface(input_image.getData(), nullptr));

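  // Creates the output surface. The width, height and format are the same as
  // those of the input surface.
  CmSurface2D *output_surface = nullptr;
  cm_result_check(device->CreateSurface2D(width*3/4,
                                          height,
                                          CM_SURFACE_FORMAT_A8R8G8B8,
                                          output_surface));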
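The (width*3/4) surface width follows from matching row sizes in bytes: a row of 24-bit RGB pixels is stored in a surface whose pixels are 32 bits wide. The arithmetic can be sketched as below. The function and constant names are hypothetical, and the sketch assumes the image width is a multiple of 4 so that the division is exact.

```cpp
#include <cstddef>

// Bytes per pixel for the packed 24-bit RGB image and for the 32-bit
// A8R8G8B8 surface format.
constexpr std::size_t kRgbBytesPerPixel = 3;
constexpr std::size_t kArgbBytesPerPixel = 4;

// Width, in 32-bit surface pixels, needed to hold one row of a 24-bit RGB
// image. Assumes image_width is a multiple of 4; otherwise the integer
// division truncates and the row does not fit exactly.
constexpr std::size_t surface_width_in_pixels(std::size_t image_width) {
  return image_width * kRgbBytesPerPixel / kArgbBytesPerPixel;
}
```

For example, a 640-pixel-wide RGB image has 1920 bytes per row, which corresponds to a surface width of 480 pixels at 4 bytes each.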

Step 5. Create Thread Space

Creating a thread space sets up a hardware mechanism called the media walker for launching threads. The media walker generates thread identifiers and puts them into the thread payloads. A CM kernel can read its thread IDs using CM intrinsics.

The media walker is the preferred way of doing GEN media programming because it has lower driver overhead (less work in preparing the commands) and a faster enqueue time.

  // Each CmKernel can be executed by multiple concurrent threads.
  // Here, for "linear" kernel, each thread works on a block of 6x8 pixels.
  // The thread width is equal to input image width divided by 8.
  // The thread height is equal to input image height divided by 6.
  int thread_width = width/8;
  int thread_height = height/6;

  // Creates a CmThreadSpace object.
  // There are two usage models for the thread space. One is to define the
  // dependency between threads to run in the GPU. The other is to define a
  // thread space where each thread can get a pair of coordinates during
  // kernel execution. For this example, we use the latter usage model.
  CmThreadSpace *thread_space = nullptr;
  cm_result_check(device->CreateThreadSpace(thread_width,
                                            thread_height,
                                            thread_space));
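The thread-space dimensions above come from dividing the image into blocks of 8 pixels across and 6 pixels down, one block per thread. The sketch below restates that arithmetic in stand-alone form. ThreadSpaceDims, thread_space_for and ceil_div are hypothetical names. The rounding-up variant is an assumption about what one might do for images that are not a multiple of the block size; the tutorial itself uses plain integer division.

```cpp
// Thread-space size in threads.
struct ThreadSpaceDims {
  int width;
  int height;
};

constexpr int kBlockWidth = 8;   // pixels handled per thread, horizontally
constexpr int kBlockHeight = 6;  // pixels handled per thread, vertically

// Truncating division, as in the tutorial: any partial block at the right or
// bottom edge gets no thread.
inline ThreadSpaceDims thread_space_for(int image_width, int image_height) {
  return ThreadSpaceDims{image_width / kBlockWidth,
                         image_height / kBlockHeight};
}

// Hypothetical alternative: rounds up so that partial blocks also get a
// thread. A kernel used this way would need its own bounds handling.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }
```

For a 640x480 image this gives an 80x80 thread space. For a width of 100, truncating division yields 12 threads per row, while ceil_div(100, 8) yields 13.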

Step 6. Set Kernel Arguments

A kernel argument is a dynamic constant shared by all threads. Its value is captured at the time the argument is set. The total size of the kernel arguments must be less than 256 bytes. For the linear filter, we pass the surface indices as kernel arguments.

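  // When a surface is created by the CmDevice, a SurfaceIndex object is
  // created. This object contains a unique index value that is mapped to the
  // surface.
  // Gets the input surface index.
  SurfaceIndex *input_surface_idx = nullptr;
  cm_result_check(input_surface->GetIndex(input_surface_idx));

  // Sets a per-kernel argument.
  // Sets input surface index as the first argument of linear kernel.
  cm_result_check(kernel->SetKernelArg(0,
                                       sizeof(SurfaceIndex),
                                       input_surface_idx));

  // Gets the output surface index.
  SurfaceIndex *output_surface_idx = nullptr;
  cm_result_check(output_surface->GetIndex(output_surface_idx));

  // Sets output surface index as the second argument of linear kernel.
  cm_result_check(kernel->SetKernelArg(1,
                                       sizeof(SurfaceIndex),
                                       output_surface_idx));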
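The text above states that the total size of the kernel arguments has to be less than 256 bytes. A host-side utility could track that budget as arguments are added. The sketch below simply sums the argument sizes and ignores any alignment or padding the runtime may apply, which is an assumption. The names total_arg_bytes, args_within_limit and kKernelArgLimitBytes are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Limit stated in the tutorial: the total must be less than 256 bytes.
constexpr std::size_t kKernelArgLimitBytes = 256;

// Sums the sizes, in bytes, of all kernel arguments.
inline std::size_t total_arg_bytes(const std::vector<std::size_t> &arg_sizes) {
  std::size_t total = 0;
  for (std::size_t size : arg_sizes) {
    total += size;
  }
  return total;
}

// Returns true when the summed argument size is strictly below the limit.
inline bool args_within_limit(const std::vector<std::size_t> &arg_sizes) {
  return total_arg_bytes(arg_sizes) < kKernelArgLimitBytes;
}
```

Two small arguments of 4 bytes each are well within the budget, whereas two 128-byte arguments total exactly 256 bytes and fail the strict comparison.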

Step 7. Enqueue Kernels/Launch GPU Work

Note that the enqueue call creates a CM event, which is used to track the job status.

  // Creates a task queue.
  // The CmQueue is an in-order queue. Tasks get executed according to the
  // order they are enqueued. The next task does not start execution until the
  // current task finishes.
  CmQueue *cmd_queue = nullptr;
  cm_result_check(device->CreateQueue(cmd_queue));

  // Creates a CmTask object.
  // The CmTask object is a container for CmKernel pointers. It is used to
  // enqueue the kernels for execution.
  CmTask *task = nullptr;
  cm_result_check(device->CreateTask(task));

  // Adds a CmKernel pointer to CmTask.
  // This task has one kernel, "linear".
  cm_result_check(task->AddKernel(kernel));

  // Launches the task on the GPU. Enqueue is a non-blocking call, i.e. the
  // function returns immediately without waiting for the GPU to start or
  // finish execution of the task. The runtime will query the HW status. If
  // the hardware is not busy, the runtime will submit the task to the
  // driver/HW; otherwise, the runtime will submit the task to the driver/HW
  // at another time.
  // An event, "sync_event", is created to track the status of the task.
  CmEvent *sync_event = nullptr;
  cm_result_check(cmd_queue->Enqueue(task, sync_event, thread_space));

Step 8. Getting Results and Execution Time

Note that the CM event is used both when reading the output surface and when querying the execution time. The CmEvent must be destroyed explicitly by the user.

  // Destroys a CmTask object.
  // CmTask will be destroyed when CmDevice is destroyed.
  // Here, the application destroys the CmTask object by itself.
  cm_result_check(device->DestroyTask(task));

  // Destroys a CmThreadSpace object.
  // CmThreadSpace will be destroyed when CmDevice is destroyed.
  // Here, the application destroys the CmThreadSpace object by itself.
  cm_result_check(device->DestroyThreadSpace(thread_space));

  // Reads the output surface content to the system memory using the CPU.
  // The size of data copied is the size of data in the surface.
  // It is a blocking call. The function will not return until the copy
  // operation is completed.
  // The dependent event "sync_event" ensures that the reading of the surface
  // will not happen until its state becomes CM_STATUS_FINISHED.
  cm_result_check(output_surface->ReadSurface(output_image.getData(),
                                              sync_event));

  // Queries the execution time of a task in nanoseconds.
  // The execution time is measured from the time the task started execution
  // in the GPU to the time when the task finished execution.
  UINT64 execution_time = 0;
  cm_result_check(sync_event->GetExecutionTime(execution_time));
  std::cout << "Kernel linear execution time is " << execution_time
      << " nanoseconds" << std::endl;

  // Destroys the CmEvent.
  // CmEvent must be destroyed by the user explicitly.
  cm_result_check(cmd_queue->DestroyEvent(sync_event));

  // Destroys the CmDevice.
  // Also destroys surfaces, kernels, tasks, thread spaces, and queues that
  // were created using this device instance that have not explicitly been
  // destroyed by calling the respective destroy functions.
  cm_result_check(::DestroyCmDevice(device));

  // Saves the output image data into the file "linear_out.bmp".
  output_image.save("linear_out.bmp");
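The execution time above is reported in nanoseconds, which is awkward to read for kernels that run for milliseconds. A small conversion helper, sketched below with the hypothetical name nanoseconds_to_milliseconds, could be used when printing. It uses std::uint64_t in place of the runtime's UINT64 type, on the assumption that both are 64-bit unsigned integers.

```cpp
#include <cstdint>

// Converts a duration in nanoseconds to milliseconds as a floating-point
// value, so that sub-millisecond kernels do not round down to zero.
inline double nanoseconds_to_milliseconds(std::uint64_t nanoseconds) {
  return static_cast<double>(nanoseconds) / 1.0e6;
}
```

For instance, an execution time of 2500000 nanoseconds corresponds to 2.5 milliseconds.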