OpenMP-Aware Optimizations

LLVM, since version 11 (12 Oct 2020), supports an OpenMP-aware optimization pass. This pass attempts to optimize the module using OpenMP-specific domain knowledge. It is enabled by default at high optimization levels (O2 / O3) when compiling with OpenMP support enabled.

OpenMPOpt

OpenMPOpt contains several OpenMP-Aware optimizations. This pass is run early on the entire Module, and later on the entire call graph. Most optimizations done by OpenMPOpt support remarks. Optimization remarks can be enabled by compiling with the following flags.

$ clang -Rpass=openmp-opt -Rpass-missed=openmp-opt -Rpass-analysis=openmp-opt

OpenMP Runtime Call Deduplication

The OpenMP runtime library contains several functions used to implement features of the OpenMP standard. Several of these runtime calls return the same value throughout a parallel region. A common optimization is to replace repeated invariant code with a single computation whose result is reused, but here the compiler only sees an opaque call into the runtime library and cannot prove that the result is invariant. To get around this, OpenMPOpt maintains a list of OpenMP runtime functions that are known to be constant and deduplicates calls to them explicitly.
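As a rough illustration of the effect, the sketch below writes the same computation twice: once with a repeated runtime query, and once with the query hoisted by hand, which is approximately what deduplication achieves. It assumes that omp_get_num_threads is among the functions OpenMPOpt treats as constant within a region (the list itself is not given here). The names team_size_twice and team_size_once are hypothetical, and the serial fallback exists only so the sketch also builds without -fopenmp.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback (assumption for this sketch): one thread, no runtime. */
static int omp_get_num_threads(void) { return 1; }
#endif

/* As a user might write it: the team size is queried twice in one region. */
int team_size_twice(void) {
  int result = 0;
#pragma omp parallel
  {
    int a = omp_get_num_threads(); /* opaque call into the runtime */
    int b = omp_get_num_threads(); /* same value within this region */
#pragma omp single
    result = a + b;
  }
  return result;
}

/* Hand-written equivalent of the deduplicated form: one call, value reused. */
int team_size_once(void) {
  int result = 0;
#pragma omp parallel
  {
    int n = omp_get_num_threads();
#pragma omp single
    result = n + n;
  }
  return result;
}
```

Both functions return twice the team size of their parallel region; only the number of calls into the runtime differs.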

Globalization

The OpenMP standard requires that data can be shared between different threads. This requirement poses a unique challenge when offloading to GPU accelerators. By default, data cannot be shared between the threads of a GPU; to make it shareable, it must be placed in either global or shared memory. This needs to be done every time a variable may potentially be shared in order to create correct OpenMP programs. Unfortunately, this has significant performance implications and is not needed in the majority of cases. For example, when Clang generates code for the offloading region below, it will see that the variable x escapes and is potentially shared. This requires globalizing the variable, which means it cannot reside in registers on the device.

void use(void *ptr) { }

void foo() {
  int x;
  use(&x);
}

int main() {
#pragma omp target parallel
  foo();
}

In many cases, this transformation is not actually necessary but still carries a significant performance penalty. Because of this, OpenMPOpt can perform an inter-procedural optimization that scans each known use of the globalized variable and determines whether it is potentially captured and shared by another thread. If it is not actually captured, it can safely be moved back to fast register memory.
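The sketch below is a variant of the example above in which the analysis has a realistic chance of succeeding. The helper bump and the entry point run_offloaded are hypothetical names introduced for illustration. The callee has internal linkage, so all of its uses are visible, and it only dereferences the pointer without storing it anywhere another thread could read it. Whether x actually ends up in registers is not guaranteed; it depends on the compiler version and optimization level.

```c
/* Internal linkage: every caller of bump is visible in this translation unit. */
static void bump(int *p) {
  *p += 1; /* dereferences the pointer but never stores or publishes it */
}

static int foo(void) {
  int x = 41;
  bump(&x); /* the address of x leaves foo, yet no other thread can obtain it */
  return x;
}

/* Hypothetical entry point; without offloading support the region runs on the host. */
int run_offloaded(void) {
  int result = 0;
#pragma omp target map(from : result)
  result = foo();
  return result;
}
```

Compiling such code with the remark flags shown earlier reports whether any globalization remained.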

Another case is memory that is intentionally shared between the threads, but is shared from one thread to all the others. Such variables can be moved to shared memory when compiled without needing to go through the runtime library. This allows for users to confidently declare shared memory on the device without needing to use custom OpenMP allocators or rely on the runtime.

static void share(void *);

static void foo() {
  int x[64];
#pragma omp parallel
  share(x);
}

int main() {
  #pragma omp target
  foo();
}

These optimizations can have very large performance implications, and both rely heavily on inter-procedural analysis. Because of this, offloading applications should ideally be contained in a single translation unit, and functions should not be externally visible unless needed. When remarks are enabled, OpenMPOpt will inform the user if any globalization calls remain. Any such remark should be treated as a defect in the program.

Resources