Transformed generic-mode kernel to SPMD-mode [OMP120]

This optimization remark indicates that the execution strategy for an OpenMP target offloading kernel was changed. Generic-mode kernels are executed by a single thread that schedules the parallel worker threads using a state machine. This transformation can move a kernel that was initially generated in generic-mode to SPMD-mode, where all threads are active at the same time and no state machine is needed. This execution strategy is closer to how the threads are actually executed on a GPU target. The transformation is only possible if the instructions previously executed by a single thread have no side effects or can be guarded. If the instructions have no side effects, they are simply recomputed by each thread.
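
As an illustrative sketch (this kernel and its name are assumptions, not taken from the remark's documentation), the code below starts in generic mode because the split target teams distribute and parallel directives leave the code between them to a single thread per team. Computing factor has no side effects, so the openmp-opt pass can let every thread recompute it and execute the kernel in SPMD-mode.

// Illustrative sketch, not part of the original documentation.
void scale(int N, int M, double *X) {
#pragma omp target teams distribute map(tofrom: X[0:N*M])
  for (int i = 0; i < N; i++) {
    // In generic mode this runs on a single thread per team. It has no side
    // effects, so in SPMD-mode every thread can simply recompute it.
    double factor = 1.0 / (i + 1);
#pragma omp parallel for firstprivate(factor)
    for (int j = 0; j < M; j++)
      X[i*M + j] *= factor;
  }
}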

Generic-mode is often considerably slower than SPMD-mode because of the extra overhead required to separately schedule the worker threads and pass data between them. This optimization allows users to write code with generic-mode semantics while achieving the performance of SPMD-mode. This can be helpful when defining shared memory between the threads, as described in OMP111.

Examples

Normally, any kernel that contains split OpenMP target and parallel regions will be executed in generic-mode. Sometimes it is easier to use generic-mode semantics to define shared memory or to more tightly control the distribution of the threads. The following example shows a naive matrix-matrix multiplication that contains code that will need to be guarded.

void matmul(int M, int N, int K, double *A, double *B, double *C) {
#pragma omp target teams distribute collapse(2) \
  map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
      double sum = 0.0;

#pragma omp parallel for reduction(+:sum) default(firstprivate)
      for (int k = 0; k < K; k++)
        sum += A[i*K + k] * B[k*N + j];

      C[i*N + j] = sum;
    }
  }
}

$ clang++ -fopenmp -fopenmp-targets=nvptx64 -fopenmp-version=51 -O2 -Rpass=openmp-opt omp120.cpp
omp120.cpp:6:14: remark: Replaced globalized variable with 8 bytes of shared memory. [OMP111]
     double sum = 0.0;
            ^
omp120.cpp:2:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
#pragma omp target teams distribute collapse(2) \
^

Executing this kernel in SPMD-mode requires guarding the store to the shared variable sum and the store to the matrix C. The transformation can be thought of as generating the code below.

#include <omp.h>

void matmul(int M, int N, int K, double *A, double *B, double *C) {
#pragma omp target teams distribute collapse(2) \
  map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
      double sum;
#pragma omp parallel default(firstprivate) shared(sum)
      {
        // Only a single thread initializes the team-shared variable; the
        // barriers keep the other threads from reading or writing it too early.
#pragma omp barrier
        if (omp_get_thread_num() == 0)
          sum = 0.0;
#pragma omp barrier

#pragma omp for reduction(+:sum)
        for (int k = 0; k < K; k++)
          sum += A[i*K + k] * B[k*N + j];

        // Only a single thread writes the result back to the matrix C.
#pragma omp barrier
        if (omp_get_thread_num() == 0)
          C[i*N + j] = sum;
#pragma omp barrier
      }
    }
  }
}
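
For comparison, when generic-mode semantics are not needed, the multiplication can be written with a combined construct so that the kernel is SPMD by construction and no guarding or transformation is required. The variant below is an illustrative sketch (the name matmul_spmd is an assumption, not part of the original documentation); it distributes the i and j iterations across all threads and lets each thread accumulate its own sum sequentially instead of using a nested reduction.

// Illustrative sketch: SPMD by construction, no guarding needed.
void matmul_spmd(int M, int N, int K, double *A, double *B, double *C) {
#pragma omp target teams distribute parallel for collapse(2) \
  map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
      double sum = 0.0;
      // Each thread computes one element of C on its own.
      for (int k = 0; k < K; k++)
        sum += A[i*K + k] * B[k*N + j];
      C[i*N + j] = sum;
    }
  }
}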

Diagnostic Scope

OpenMP target offloading optimization remark.