
C for Metal Development Package

The Intel® C for Metal development package is a software development package for Intel® Graphics Technology. It includes the Intel® C for Metal Compiler, the Intel® C for Metal Runtime, the Intel® Media Driver for VAAPI, and reference examples, and can be used to develop applications accelerated by Intel® Graphics Media Accelerator. A typical application contains two kinds of source code: kernel and host. The kernel is written in the Intel® C for Metal language, compiled to GPU ISA binary by the Intel® C for Metal Compiler, and executed on the GPU. The host manages workloads through the Intel® C for Metal Runtime and the user-mode media driver.

Tutorial 6. Shared Local Memory and Thread Group

By Li Huang on Jun 13, 2019

CM also allows the use of shared local memory (SLM), which can be shared among a group of threads. On GEN, SLM is carved out of the level-3 cache and reconfigured to be 16-way banked. A group of threads that shares SLM is dispatched to the same half-slice. The maximum SLM size is 64KB.

SLM is useful when you want data sharing among a group of threads. Because it has more banks than L3 and is controlled by the user program, it can be more efficient than the L3 data cache for scattered reads and writes. The following are the typical steps for using SLM and thread grouping in CM.

The code below is extracted from the nbody_SLM_release example.

Host Program: CreateThreadGroupSpace

One important note: CreateThreadGroupSpace puts GPU thread dispatching into GPGPU mode, which is different from the media-walker mode. Therefore, thread dependency settings, which are associated with the media walker, are not available when thread groups are in use.
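For contrast, the sketch below shows the media-walker path that becomes unavailable once thread groups are used. It is an illustration only, not part of this example, and it assumes the CmThreadSpace dependency API of the CM runtime (CreateThreadSpace, SelectThreadDependencyPattern) together with the plain Enqueue call.

  // Media-walker path, for contrast only (not part of nbody_SLM_release).
  // Thread dependencies are set on a CmThreadSpace and the task is enqueued
  // with Enqueue(); none of this applies once CreateThreadGroupSpace is used.
  CmThreadSpace *thread_space = nullptr;
  cm_result_check(device->CreateThreadSpace(threads, 1, thread_space));

  // For example, a 26-degree wavefront dependency between neighboring threads.
  cm_result_check(thread_space->SelectThreadDependencyPattern(CM_WAVEFRONT26));

  cm_result_check(cmd_queue->Enqueue(task, sync_event, thread_space));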

  // Each CmKernel can be executed by multiple concurrent threads.
  // Calculates the number of threads to spawn on the GPU for this kernel.
  int threads = num_bodies / BODIES_CHUNK;

  // In this case, we want to maximize the group size to get the most
  // data-share, so we need to query the maximum group size that target
  // machine can support.
  size_t size = 4;
  int max_thread_count_per_thread_group = 0;
  cm_result_check(device->GetCaps(
      CAP_USER_DEFINED_THREAD_COUNT_PER_THREAD_GROUP,
      size,
      &max_thread_count_per_thread_group));
  int group_count = (threads + max_thread_count_per_thread_group - 1) /
      max_thread_count_per_thread_group;
  while (threads % group_count != 0) {
    group_count++;
  }

  // Creates a thread group space.
  // This function creates a thread group space specified by the height and
  // width dimensions of the group space, and the height and width dimensions
  // of the thread space within a group. In the GPGPU mode, the host program
  // needs to specify the group space and the thread space within each group.
  // This group and thread space information can be subsequently used to
  // execute a kernel in that space later.
  CmThreadGroupSpace *thread_group_space = nullptr;
  cm_result_check(device->CreateThreadGroupSpace(threads / group_count,
                                                 1,
                                                 group_count,
                                                 1,
                                                 thread_group_space));
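To make the arithmetic concrete, take the sizes used by the kernel below (16384 bodies, BODIES_CHUNK = 128, so threads = 128) and assume the capability query reports 64 threads per group: group_count starts at ceil(128 / 64) = 2, and since 128 % 2 == 0 the loop leaves it unchanged, so CreateThreadGroupSpace(64, 1, 2, 1) dispatches 2 groups of 64 threads each, matching the "2 Groups with 64 threads/Group" noted in the kernel comments. The 64-thread cap is only an assumed value for illustration; the real number comes from GetCaps.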

Host Program: EnqueueWithGroup

  // Launches the task on the GPU. Enqueue is a non-blocking call, i.e. the
  // function returns immediately without waiting for the GPU to start or
  // finish execution of the task. The runtime will query the HW status. If
  // the hardware is not busy, the runtime will submit the task to the
  // driver/HW; otherwise, the runtime will submit the task to the driver/HW
  // at another time.
  // An event, "sync_event", is created to track the status of the task.
  CmEvent *sync_event = nullptr;
  cm_result_check(cmd_queue->EnqueueWithGroup(task,
                                              sync_event,
                                              thread_group_space));
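Because the enqueue is non-blocking, the host usually waits on sync_event before it reads results back or reuses resources. The sketch below is not part of the nbody_SLM_release listing; it assumes the usual CmEvent/CmQueue calls (WaitForTaskFinished, GetExecutionTime, DestroyEvent).

  // Block until the GPU has finished executing the task.
  cm_result_check(sync_event->WaitForTaskFinished());

  // Optionally query how long the task ran on the GPU (in nanoseconds).
  UINT64 execution_time = 0;
  cm_result_check(sync_event->GetExecutionTime(execution_time));

  // Release the event once it is no longer needed.
  cm_result_check(cmd_queue->DestroyEvent(sync_event));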

Kernel Program

Several built-in functions worth attention in this program are cm_slm_init, cm_slm_alloc, and cm_slm_load.

extern "C" _GENX_MAIN_ void cmNBody(SurfaceIndex INPOS, SurfaceIndex INVEL,
                                    SurfaceIndex OUTPOS, SurfaceIndex OUTVEL,
                                    float deltaTime, float damping,
                                    float softeningSquared, int numBodies) {

    // Only 4K bodies fit in SLM
    // 1. Foreach 4K bodies - For a total of 16K bodies
    // 2.   LOAD 4K bodies to SLM: i.e. Read from Memory and Write to SLM
    // 3.   Foreach MB (32 bodies here) - For a total of 4 MBs
    // 4.     READ from Memory: Position of thisThreadBodies
    // 5.     Foreach set of 32 bodies in 4K SLM bodies
    // 6.       READ from SLM: Position of 32 bodies
    // 7.       Compute Interaction between thisThreadBodies and the 32
    //            bodies read from SLM; Compute and update force0, force1,
    //            force2 for forces in 3D
    // 8.     READ from Memory: Velocity of thisThreadBodies
    // 9.     Compute New Velocity and New Position of thisThreadBodies
    // 10.    WRITE to Memory: New Velocity of thisThreadBodies
    // 11.    WRITE to Memory: New Position of thisThreadBodies

    cm_slm_init(SLM_SIZE);
    uint bodiesInSLM = cm_slm_alloc(SLM_SIZE);

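    // Note: gThreadID, force0, force1, and force2 are declared in the full
    // example source; those declarations are not shown in this excerpt.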
    gThreadID = cm_linear_global_id();
    force0 = force1 = force2 = 0.0f;

    // 1. Foreach 4K bodies - For a total of 16K bodies
    for (int iSLM = 0; iSLM < 4; iSLM++) {

        // 2. LOAD 4K bodies to SLM: i.e. Read from Memory and Write to SLM

        cm_slm_load(
            bodiesInSLM,     // slmBuffer   : SLM buffer
            INPOS,           // memSurfIndex: Memory SurfaceIndex
            iSLM * SLM_SIZE, // memOffset   : Byte-Offset in Memory Surface
            SLM_SIZE         // loadSize    : Bytes to be Loaded from Memory
            );

        // Each thread needs to process 4 Macro-Blocks (MB):
        //            One MB = 32 Bodies; Total 2 Groups with 64 threads/Group
        //            => #Bodies/Thread = TotalNumBodies/TotalNumThreads
        //                              = 16384/128 = 128 = 4 MBs
        //   - Depending on the number of threads, the number of MBs per
        //     thread can be changed by just changing this loop-count
        //   - For optimization purposes, if there are enough GRFs we can
        //     process more MBs per iteration of this loop - in that case
        //     the loop-stride needs to change accordingly; if all MBs can
        //     be processed in the GRF, we can eliminate this loop

        for (int iMB = 0; iMB < 4; iMB++) {
            cmk_Nbody_ForEachMB_ForEachSLMBlock(
                INPOS, deltaTime, softeningSquared, BODIES_PER_SLM, bodiesInSLM,
                iMB);
        } // end foreach(MB)
    }     // end foreach(SLM block)

    for (int iMB = 0; iMB < 4; iMB++) {
        cmk_Nbody_OutputVelPos_ForEachMB(INPOS, INVEL, OUTPOS, OUTVEL,
                                         deltaTime, damping, iMB);
    } // end foreach(MB)
}
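Once the event signals completion, the host can read the updated data back from the output surfaces. The sketch below is only an illustration: it assumes OUTPOS and OUTVEL were created from CmBuffer objects on the host (out_pos_buffer and out_vel_buffer are placeholder names), and that each body stores four floats, which is consistent with 4K bodies fitting into the 64KB SLM.

  // Hypothetical host-side readback; the buffer names stand in for the
  // CmBuffer objects behind the OUTPOS and OUTVEL surface indices.
  std::vector<float> new_pos(num_bodies * 4);  // assumes #include <vector>
  std::vector<float> new_vel(num_bodies * 4);
  cm_result_check(out_pos_buffer->ReadSurface(
      reinterpret_cast<unsigned char *>(new_pos.data()), sync_event));
  cm_result_check(out_vel_buffer->ReadSurface(
      reinterpret_cast<unsigned char *>(new_vel.data()), sync_event));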