
C for Metal Development Package

The Intel® C for Metal development package is a software development package for Intel® Graphics Technology. It includes the Intel® C for Metal Compiler, the Intel® C for Metal Runtime, the Intel® Media Driver for VAAPI, and reference examples, which can be used to develop applications accelerated by Intel® Graphics Media Accelerator. A typical application contains two kinds of source code: kernel and host. The kernel is written in the Intel® C for Metal language, compiled to GPU ISA binary by the Intel® C for Metal Compiler, and executed on the GPU. The host manages workloads through the Intel® C for Metal Runtime and the user-mode media driver.

Enqueuing Multiple Kernels

By Li Huang on Jun 13, 2019

Tutorial 3. Enqueuing Multiple Kernels

You may have noticed that the Enqueue function takes a CmTask holding an array of kernels, so you can enqueue multiple kernels at once.

Enqueuing two independent kernels

The following code block is extracted from open_examples: multi_kernels.

In this example, two kernels are launched independently (with no specific execution order): the linear kernel processes the top half of the image, and the sepia kernel processes the bottom half.

First, create the linear kernel. Notice that the thread count and thread space cover only half of the image.

  // Creates the linear kernel.
  // Param program: CM Program from which the kernel is created.
  // Param "linear": The kernel name which should be no more than 256 bytes
  // including the null terminator.
  CmKernel *kernel_linear = nullptr;
  cm_result_check(device->CreateKernel(program, "linear", kernel_linear));

  // Each CmKernel can be executed by multiple concurrent threads.
  // Here, for "linear" kernel, each thread works on a block of 6x8 pixels.
  // The thread width is equal to input image width divided by 8.
  // The thread height is equal to input image height divided by 6. For this
  // kernel only half of the image is processed; therefore, the thread height
  // is divided by two.
  int thread_width  = width/8;
  int thread_height = (height/6)/2;

  // Creates a CmThreadSpace object.
  // There are two usage models for the thread space. One is to define the
  // dependency between threads to run in the GPU. The other is to define a
  // thread space where each thread can get a pair of coordinates during
  // kernel execution. For this example, we use the latter usage model.
  CmThreadSpace *thread_space_linear = nullptr;
  cm_result_check(device->CreateThreadSpace(thread_width,
                                            thread_height,
                                            thread_space_linear));

  // Associates a thread space to this kernel.
  cm_result_check(kernel_linear->AssociateThreadSpace(thread_space_linear));

Second, create the sepia kernel. Notice that its thread count and thread space also cover only half of the image, and that the image height is passed into the kernel: the sepia kernel is modified to process the bottom half of the image.

  // Creates the second kernel "sepia".
  CmKernel *kernel_sepia = nullptr;
  cm_result_check(device->CreateKernel(program, "sepia" , kernel_sepia));

  // For "sepia" kernel, each thread works on a block of 8x8 pixels.
  // The thread width is equal to input image width divided by 8.
  // The thread height is equal to input image height divided by 8. For this
  // kernel only half of the image is processed; therefore, the thread height
  // is divided by two.
  thread_width = width/8;
  thread_height = (height/8)/2;

  // Creates thread space for kernel "sepia".
  CmThreadSpace *thread_space_sepia = nullptr;
  cm_result_check(device->CreateThreadSpace(thread_width,
                                            thread_height,
                                            thread_space_sepia));

  // Associates the thread space to kernel "sepia".
  cm_result_check(kernel_sepia->AssociateThreadSpace(thread_space_sepia));

Finally, add both kernels to the task's kernel array and enqueue the task.

  // Creates a CmTask object.
  // The CmTask object is a container for CmKernel pointers. It is used to
  // enqueue the kernels for execution.
  CmTask *task = nullptr;
  cm_result_check(device->CreateTask(task));

  // Adds a CmKernel pointer to CmTask.
  // This task has two kernels, "linear" and "sepia".
  cm_result_check(task->AddKernel(kernel_linear));
  cm_result_check(task->AddKernel(kernel_sepia));

  // Creates a task queue.
  // The CmQueue is an in-order queue. Tasks get executed according to the
  // order they are enqueued. The next task does not start execution until the
  // current task finishes.
  CmQueue *queue = nullptr;
  cm_result_check(device->CreateQueue(queue));

  // Launches the task on the GPU. Enqueue is a non-blocking call, i.e. the
  // function returns immediately without waiting for the GPU to start or
  // finish execution of the task. The runtime will query the HW status. If
  // the hardware is not busy, the runtime will submit the task to the
  // driver/HW; otherwise, the runtime will submit the task to the driver/HW
  // at another time.
  // An event, "sync_event", is created to track the status of the task.
  CmEvent *sync_event = nullptr;
  cm_result_check(queue->Enqueue(task, sync_event));

Enqueuing multiple kernels with sync

The following code block is extracted from open_examples: BufferTest_EnqueueWithSync.

To force an execution order among the multiple kernels in the kernel array, you need to add synchronization between them.

    // Creates a CmTask object.
    // The CmTask object is a container for CmKernel pointers. It is used to
    // enqueue the kernels for execution.
    CmTask *task = nullptr;
    cm_result_check(device->CreateTask(task));

    for (int i = 0; i < KERNEL_NUM_PER_TASK; i++) {
        // Associates a thread space to this kernel.
        cm_result_check(kernel[i]->AssociateThreadSpace(thread_space));

        // When a CmBuffer is created by the CmDevice a SurfaceIndex object is
        // created. This object contains a unique index value that is mapped
        // to the CmBuffer.
        // Uses the output CmBuffer of previous kernel as the input CmBuffer of
        // this kernel.
        SurfaceIndex *input_surface_idx = nullptr;
        SurfaceIndex *output_surface_idx = nullptr;
        if (i == 0) {
            // Gets the input CmBuffer index.
            input_surface_idx = nullptr;
            buffer->GetIndex(input_surface_idx);
            // Gets the output CmBuffer index.
            output_surface_idx = nullptr;
            output_surface[i]->GetIndex(output_surface_idx);
        } else {
            // Gets the input CmBuffer index.
            input_surface_idx = nullptr;
            output_surface[i - 1]->GetIndex(input_surface_idx);
            // Gets the output CmBuffer index.
            output_surface_idx = nullptr;
            output_surface[i]->GetIndex(output_surface_idx);
        }

        // Sets a per kernel argument.
        // Sets the input CmBuffer index as the first argument of the kernel.
        // Sets the output CmBuffer index as the second argument of the kernel.
        cm_result_check(kernel[i]->SetKernelArg(0,
                                                sizeof(SurfaceIndex),
                                                input_surface_idx));
        cm_result_check(kernel[i]->SetKernelArg(1,
                                                sizeof(SurfaceIndex),
                                                output_surface_idx));

        // Adds a CmKernel pointer to CmTask.
        // This task has 16 kernels.
        cm_result_check(task->AddKernel(kernel[i]));

        // Inserts a synchronization point between two consecutive kernels
        // (no sync point is added after the last one).
        // Each kernel will be executed only after the previous kernel
        // finishes execution.
        if (i < (KERNEL_NUM_PER_TASK - 1)) {
            cm_result_check(task->AddSync());
        }
    }