GPU support#

This section describes the main ways to run BOUT++ work efficiently on GPUs or other accelerator-style backends.

There are now two complementary levels of optimization:

  1. Write ordinary field algebra and let BOUT++ keep many algebraic expressions lazy until assignment or reduction.

  2. Drop down to explicit RAJA loops and single-index operators when you want complete control over loop fusion and kernel structure.

The first approach is usually the best starting point. The second is for hot loops where you want to manually combine derivative operators, accessors, and run-time captures in one kernel.

Automatic fusion with field expressions#

Many algebraic operations on fields can now be represented as lazy expressions. This keeps user code close to the familiar field-based style while reducing temporary fields and extra passes over memory.

Typical examples are:

Field3D rhs = sqrt(SQ(n) + SQ(T));
ddt(n) = source * profile - sink * n;
BoutReal max_error = max(abs(lhs - rhs), true);

This is the highest-level route to better execution behavior, and it is usually the most maintainable. See Field Expressions for the details of what stays lazy and when evaluation happens.

Lazy expressions mainly help with algebraic fusion. If your hot path is dominated by differential operators and you need to fuse those operators into a single explicit loop, use the lower-level approach described below.

Manual fusion with RAJA loops#

To use the single-index operators and the BOUT_FOR_RAJA loop macro:

#include "bout/single_index_ops.hxx"
#include "bout/rajalib.hxx"

To run part of a physics-model RHS on a GPU, start by copying any class member variables needed inside the loop into local variables, or capture them explicitly:

auto _setting = setting;

Then create FieldAccessor objects to read and write field data inside the loop:

auto n_acc = FieldAccessor<>(n);
auto phi_acc = FieldAccessor<>(phi);

There are also Field2DAccessor objects for Field2D. If fields are staggered, the expected location can be supplied as a template parameter:

auto Jpar_acc = FieldAccessor<CELL_YLOW>(Jpar);

Finally the loop itself can be written as:

Field3D result;
auto result_acc = FieldAccessor<>(result);

BOUT_FOR_RAJA(i, region) {
  result_acc[i] = -bracket(phi_acc, n_acc, i) - 2.0 * DDZ(n_acc, i);
};

Note the semicolon after the closing brace, which is needed because this is the body of a lambda function. Inside the loop, operators such as bracket and DDZ act at a single index i. These are the single-index operators defined in bout/single_index_ops.hxx.

Any class member variables used inside the loop must be captured carefully. Otherwise the code may compile but fail at run time on the GPU. Instead of using this implicitly, either shadow members with local variables or add them to the capture list:

BOUT_FOR_RAJA(i, region, CAPTURE(setting)) {
  ddt(n_acc)[i] = -bracket(phi_acc, n_acc, i) - 2 * DDZ(n_acc, i);
  /* ... code that uses `setting` ... */
};

If RAJA is not available, the BOUT_FOR_RAJA macro will revert to BOUT_FOR. For testing, this can be forced by defining DISABLE_RAJA before including bout/rajalib.hxx.

Note: An important difference between BOUT_FOR and BOUT_FOR_RAJA (apart from the closing semicolon) is that the type of the index i is different inside the loop: BOUT_FOR uses SpecificInd types (typically Ind3D), but BOUT_FOR_RAJA uses int. SpecificInd can be explicitly cast to int so use static_cast<int>(i) to ensure that it’s an integer both with and without RAJA. This might (hopefully) change in future versions.

Choosing between the two approaches#

Use lazy field expressions when:

  • the code is mostly algebraic combinations of existing fields

  • readability matters more than extracting the last bit of performance

  • you want a clear default path that still maps well to accelerator backends

Use explicit RAJA loops and single-index operators when:

  • a hot loop is dominated by derivatives

  • you want to combine many operations into one kernel manually

  • you need direct control over captures, data access, or loop structure

Examples#

The blob2d-outerloop example is the simplest one which uses single index operators and (optionally) RAJA. It should solve the same set of equations, with the same inputs, as blob2d.

hasegawa-wakatani-3d is a 3D turbulence model, typically solved in a slab geometry.

elm-pb-outerloop is a much more complicated model, which should solve the same equations, and have the same inputs, as elm-pb. Note that there are some differences:

  • The numerical methods used in elm-pb can be selected at run-time, and typically include WENO schemes e.g W3. In elm-pb-outerloop the methods are fixed to C2 in all cases.

  • The equations solved by elm-pb can be changed by modifying input settings. To achieve higher performance, elm-pb-outerloop does this at compile time. There are checks to ensure that the code has been compiled with flags consistent with the input settings. See the README file for more details.

Notes:

  • When RAJA is used in a physics model, all members of the PhysicsModel should be public. If this is not done, then a compiler error like “error: The enclosing parent function (“rhs”) for an extended __device__ lambda cannot have private or protected access within its class” may be encountered.

  • Class member variables cannot in general be used inside a RAJA loop: The this pointer is captured by value in the lambda function, not the value of the member variable. When the member variable is used on the GPU, the this pointer is generally not valid (e.g. on NERSC Cori GPUs). Some architectures have Address Translation Services (ATS) which enable host pointers to be resolved on the GPU.

CMake configuration#

To compile BOUT++ components into GPU kernels, a few different pieces need to work together: RAJA, Umpire, and a CUDA-capable compiler.

The generated eager field-operator code also selects a loop backend at configure time. If RAJA is enabled it uses RAJA loops, otherwise it falls back to OpenMP or serial loops depending on the build.

Table 18 Useful CMake configuration settings#

Name

Description

Default

BOUT_ENABLE_RAJA

Include RAJA header, use RAJA loops

Off

BOUT_ENABLE_UMPIRE

Use Umpire to allocate Array memory

Off

BOUT_ENABLE_CUDA

Compile with nvcc compiler

Off

CUDA_ARCH

Set CUDA architecture to compile for

compute_70,code=sm_70

BOUT_ENABLE_WARNINGS

nvcc has incompatible warning flags

On (turn Off for CUDA)

Shifted metric on GPUs#

When BOUT++ is built with CUDA, the shifted-metric parallel transform has a CUDA implementation of its toroidal shiftZ work used while calculating parallel slices during communication.

This is most relevant when using:

[mesh:paralleltransform]
type = shifted
calcParallelSlices_on_communicate = true

The current implementation is specialized for supported power-of-two LocalNz values. If parallel slices are disabled on communicate, as in the aligned-transform workflow, this precomputed-slice path is not used.

Single index operators#

BOUT++ models are typically implemented using operations which take in fields, perform an operation, and return a new field. These are convenient, but the consequence is that an expression like Grad_par(phi) * B0 contains two loops over the domain, one for the gradient operator Grad_par, and another for the multiplication. Complex models can contain dozens of these loops. When using OpenMP or GPU threads this results in many small kernels being launched, and typically poor efficiency.

The “single index operators” in BOUT++ offer a way to manually combine loops over the domain into a smaller number of loops. It is perhaps less convenient than a template expression system might be, but considerably easier to debug and maintain.

Single index operators have the same name as field operations, but the interface has two key differences:

  1. The functions take an index as an additional final argument

  2. Rather than fields (e.g Field2D, Field3D types), these operate on field accessors (Field2DAccessor, FieldAccessor types). These offer faster, more direct, but less safe access to the underlying data arrays.

For example a simple field operation:

Field3D phi;
...
Field3D result = DDX(phi);

might be written as:

Field3D phi;
...
Field3D result;

// Create accessors for fast (unsafe) data access
auto phi_acc = FieldAccessor<>(phi);
auto result_acc = FieldAccessor<>(result);

BOUT_FOR_RAJA(i, result.region("RGN_NOBNDRY")) {
  result_acc[i] = DDX(phi_acc, i);
}

For a simple example like this the code is harder to read, and there is not much benefit because there is one loop over the domain in both cases. The benefit becomes more apparent when multiple operations are combined.

The operators are implemented in a header file, so that they can be inlined. These are in include/bout/single_index_ops.hxx. Table Table 19 lists the functions which have been implemented.

Table 19 Single index operator functions#

Function

Description

DDX(F3D, i)

Derivative in X with ddx:first=C2

DDY(F3D, i)

Derivative in Y with ddy:first=C2

DDZ(F3D, i)

Derivative in Z with ddz:first=C2

bracket(F2D, F3D, i)

bracket(F2D, F3D, BRACKET_ARAKAWA)

bracket(F3D, F3D, i)

bracket(F3D, F2D, BRACKET_ARAKAWA)

Delp2(F3D, i)

Delp2 with useFFT=false, C2 central diff.

Div_par_Grad_par(F3D, i)

2nd order central difference

b0xGrad_dot_Grad(F3D, F2D, i)

C2 central diff. for all derivatives

b0xGrad_dot_Grad(F2D, F3D, i)

C2 central diff. for all derivatives

D2DY2(F3D, i)

C2 2nd-order derivative ddy:second=C2

Grad_par(F3D, i)

C2 derivative, ddy:first=C2

Note that for efficiency the method used in the single index operators cannot be changed at runtime: The numerical method is fixed at compile time. The DDX single index operator, for example, always uses 2nd order central difference.

Unit tests which ensure that the single index operators produce the same result as the original field operations are in tests/unit/include/bout/test_single_index_ops.cxx. Note that there are limitations to these unit tests, in particular the geometry may not be fully exercised. Additional errors are possible when combining these methods, or porting code from field operations to single index operations. It is therefore essential to have integrated tests and benchmarks for each model implementation: Unit tests are necessary but not sufficient for correctness.

CoordinatesAccessor#

The differential operators used in physics models typically need access to grid spacing (eg. dx), non-uniform grid corrections (e.g. d1_dx), and multiple coordinate system fields (metric tensor components). When a FieldAccessor is created from a field, it uses the field’s coordinate system to create a CoordinateAccessor, which provides fast access to this extra data.

The coordinate system data is usually stored in separate arrays, so that many different pointers would need to be passed to the CUDA kernels to use this data directly. This was found to cause compilation errors with nvcc along the lines of “Formal parameter space overflowed”.

The CoordinatesAccessor reduces the number of parameters (data pointers) by packing all Coordinates data (grid spacing, metric tensor components) into a single array. The ordering of this data in the array has not been optimised, but is currently striped: Data for the same grid cell is close to each other in memory. Some guidance on memory layout can be found on the NVidia website and might be used to improve performance in future. It is likely that the results might be architecture dependent.

To minimise the number of times this data needs to be copied from individual fields into the single array, and then copied from CPU to GPU, CoordinatesAccessors are cached. A map (coords_store defined in coordinates_accessor.cxx) associates Array<BoutReal> objects (containing the array of data) to Coordinates pointers. If a CoordinatesAccessor is constructed with a Coordinates pointer which is in the cache, then the previously created Array data is used. Some care is therefore needed if the Coordinates data is modified, to ensure that a new CoordinatesAccessor data array is created by clearing the old data from the cache.

The easiest way to clear the cache is to call the static function CoordinatesAccessor::clear(), which will delete all arrays from the cache. To remove a single Coordinates key from the cache, pass the pointer to CoordinatesAccessor::clear(coordinates_ptr). In both cases the number of keys removed from the cache will be returned.

Memory allocation and Umpire#

Using GPUs effectively requires keeping track of even more levels of memory than usual. An extra complication is that trying to dereference a pointer to CPU memory while on the GPU device (or a device memory pointer while on the CPU) will result in a segfault on some architectures, while other architectures with Address Translation Services (ATS) will trap this access and transfer the required memory addresses, with a corresponding performance penalty for the time this transfer takes.

At a low level, CPU and GPU memory are allocated separately, with buffers being explicitly synchronised by data transfer. To do this allocation, and automatically move data from CPU to GPU or back when needed, BOUT++ uses Umpire . In order for this to work with data structures or multiple indirections, all steps in chain of pointers must be in the right place (CPU or device). Allocating everything with Umpire is the easiest way to ensure that this is the case.

The calculations done in BOUT++ typically involve using blocks of memory of the a few common sizes, and the same calculations are done every timestep on different data as the simulation state evolves. BOUT++ therefore uses an arena system to store arrays which have been released, so that they can be re-used rather than deleted and allocated. Allocator chaining is used: If the object pool runs out of arrays of the requested size, then a new one is allocated using Umpire or the native allocator (new).

This is a good talk by John Lakos [ACCU 2017] on memory allocators

Future work#

The GPU path is still evolving. The main long-term direction is to let more of ordinary field code map efficiently onto accelerator backends, so that manual kernel construction is only needed for the most performance-critical cases. due to the need to transform CPU data structures into a form which can be passed to and used on the GPU. In the bout/rajalib.hxx header there is code like:

auto indices = n.getRegion("RGN_NOBNDRY").getIndices();
 Array<int> _ob_i_ind(indices.size()); // Backing data is device safe
 // Copy indices into Array
 for(auto i = 0; i < indices.size(); i++) {
   _ob_i_ind[i] = indices[i].ind;
 }
 // Get the raw pointer to use on the device
 auto _ob_i_ind_raw = &_ob_i_ind[0];

which is creating a raw pointer (_ob_i_ind_raw) to an array of ints which are allocated using Umpire. The original indices are allocated using new and are inside a C++ std::vector. The RAJA loop then uses this array like this:

RAJA::forall<EXEC_POL>(RAJA::RangeSegment(0, indices.size()), [=] RAJA_DEVICE(int id) {
  int i = _ob_i_ind_raw[id]; // int passed to loop body

This code has several issues:

  1. It is inefficiently creating a new Array<int> and copying the indices into it every time. In almost every case the indices will not be changing.

  2. The indices lose their type information: Inside the loop an index into a 3D field has the same type as an index into a 2D field (both int). This is a possible source of bugs.

Possible fixes include:

  1. Changing Region to store indices inside an Array rather than std::vector. This would ensure that the SpecificInd objects were allocated with Umpire. Then the GPU-side code could use SpecificInd objects for index conversion and type safety. This would still leave the problem of extracting the pointer from the Array, and would send more information to the GPU (SpecificInd contains 3 ints).

  2. The indices could be stored in two forms, one the std::vector as now, and alongside it an Array<int>.

In either case it might be useful to have an ArrayAccessor type, which is just a range (begin/end pair, or pointer and length), and doesn’t take ownership of the array data.

Then the code might look something like:

auto indices_acc = ArrayAccessor<>(n.getRegion("RGN_NOBNDRY").getIndices());

RAJA::forall<EXEC_POL>(RAJA::RangeSegment(0, indices.size()), [=] RAJA_DEVICE(int id) {
  const Ind3D& i = indices_acc[id];