Performance profiling#
Analyzing code behaviour is vital for getting the best performance from BOUT++. This is done by profiling the code, that is, building and running the code using tools that report the amount of time each processor spends in functions, on communications, etc.
This section describes how to compile and run BOUT++ using the Scorep/Scalasca and Extrae/Paraver tool chains. Both are suitable for analyzing code parallelized with MPI and/or OpenMP. Scorep+Scalasca gives timings and call trees for each processor/thread, while Extrae/Paraver produces visualizations showing what each processor/thread is doing at a point in time.
Scorep/Scalasca profiling#
Instrumentation#
Scorep automatically reports the time spent in MPI communications and OpenMP loops. However, to obtain information on the time spent in specific functions, it is necessary to instrument the source code. The macros to do this are provided in scorepwrapper.hxx.
To include a function in Scorep’s timing, include the scorep wrapper in the source code
#include <bout/scorepwrapper.hxx>
and then write the macro SCOREP0() at the top of the function, e.g.
int Field::getNx() const {
  SCOREP0();
  return getMesh()->LocalNx;
}
Regions of a function can also be timed by enclosing the region in braces and using the BOUT_SCOREP_REGION macro. For example,
void Field2D::applyBoundary(BoutReal time) {
  SCOREP0();
  checkData(*this);
  {
    BOUT_SCOREP_REGION("display name");
    for (const auto& bndry : bndry_op) {
      bndry->apply(*this, time);
    }
  }
}
Here, the SCOREP0 macro ensures the whole applyBoundary function is timed. In addition, the for loop is also timed, and appears in the Scalasca profile as a region inside applyBoundary with the name “display name”. Any number of Scorep user regions can be used in a function; user regions can also be nested.
Caution: Instrumenting a function makes it execute more slowly. This can result in misleading profiling information, particularly if fast-but-frequently-called functions are instrumented. Try to instrument significant functions only.
The profiling overhead in sensibly-instrumented code should be only a few percent of runtime.
Configure and build#
Configure with -DBOUT_USE_SCOREP=ON to enable Scorep instrumentation. This option can be combined with other options, but it is usually desirable to profile the optimized code, for example configuring with -DCMAKE_CXX_FLAGS=-O3 -DCHECK=0 as in the example below. Build the code with make as normal.
With CMake:
$ SCOREP_WRAPPER=off cmake \
    -DBOUT_USE_SCOREP=ON \
    -DCMAKE_C_COMPILER=scorep-mpicc \
    -DCMAKE_CXX_COMPILER=scorep-mpicxx \
    -DCMAKE_CXX_FLAGS=-O3 -DCHECK=0 \
    <other CMake options>
This turns off the instrumentation during the configure step. Be aware that if you change CMakeLists.txt, CMake will try to reconfigure the build automatically, which the Score-P wrappers interfere with. In this case you will need to restart the configure step from scratch (i.e. remove the build directory and start again).
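A clean rebuild might look like the following minimal sketch, assuming an out-of-source build directory named build; the job count is illustrative:

# Remove the stale build tree, then reconfigure from scratch
$ rm -rf build && mkdir build && cd build
$ SCOREP_WRAPPER=off cmake .. <CMake options as above>
# Building runs the Score-P compiler wrappers, which instrument the code
$ make -j 4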
Run and analysis#
When running the code, prepend the run command with scalasca -analyze, e.g.
$ scalasca -analyze mpirun -np 2 elm_pb
The run then produces an “archive” containing profiling data in a directory called scorep_<exec_name>_<proc_info>_sum. To view the profiling information with the cube viewer, do
$ cube scorep_<exec_name>_<proc_info>_sum/profile.cubex
Note that Scorep does not run if doing so would produce an archive with the same name as an existing archive. Therefore to rerun an executable on the same number of processors, it is necessary to move or delete the first archive.
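For example, to keep the first archive and rerun (a sketch; the exact archive name follows the pattern above and depends on the executable and process count):

# Rename the old archive, then rerun under Scalasca
$ mv scorep_elm_pb_2_sum scorep_elm_pb_2_sum.old
$ scalasca -analyze mpirun -np 2 elm_pb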
Machine-specific installation#
These are some configurations which have been found to work on particular machines.
Archer#
As of 23rd January 2019, the following configuration should work:
$ module swap PrgEnv-cray PrgEnv-gnu
$ module load fftw
$ module load archer-netcdf/4.1.3
$ module load scalasca
Note that due to a bug in the CC compiler, it is necessary to modify make.config after configuration if profiling OpenMP-parallelized code (see the sketch below):

- add the flag -fopenmp to BOUT_FLAGS
- add the flag --thread=omp:ancestry as an argument to scorep in CXX
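After these edits, the relevant lines of make.config might look something like the following sketch (not verbatim output; the existing flags and compiler command depend on your configuration):

CXX        = scorep --thread=omp:ancestry CC
BOUT_FLAGS = <existing flags> -fopenmp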
Extrae/Paraver profiling#
Extrae is a powerful tool allowing visualization of communication and computation in parallel codes. It requires minimal instrumentation; however, the trace files produced can be extremely large.
Instrumentation, configure and build#
No changes to the code are necessary. On some systems, environment variables must be set before building (see the machine-specific notes below). Otherwise, compile and build as normal.
Run#
To run, add a trace script into the normal run command, so that for example
$ aprun -n 16 blob2d -d delta_1
becomes
$ aprun -n 16 ./trace.sh blob2d -d delta_1
where trace.sh is the script file

#!/bin/bash
# Tell Extrae which configuration file to read
export EXTRAE_CONFIG_FILE=./extrae.xml
# Preload the Extrae MPI tracing library into the executable
export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so
# Run the actual command, preserving quoted arguments
"$@"
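Make sure the script is executable:

$ chmod +x trace.sh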
The run directory must also contain the file extrae.xml, which configures which data Extrae collects. Example extrae.xml files may be found in
${EXTRAE_HOME}/share/example/*/extrae.xml
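A shipped example can be copied into the run directory as a starting point; for instance (the MPI subdirectory is illustrative, check what your installation provides):

$ cp ${EXTRAE_HOME}/share/example/MPI/extrae.xml .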
Running produces a file called TRACE.mpits. To generate the .prv trace file that can be read by Paraver, do
$ TRACE_NAME=bout.prv
$ ${EXTRAE_HOME}/bin/mpi2prv -f ${EXTRAE_WORK_DIR}/TRACE.mpits -o ${TRACE_NAME}
Analysis#
Open the trace file in Paraver with
$ wxparaver ${TRACE_NAME}
To view time traces, go to File -> Load Configuration. There are many configurations to choose from! Two useful configurations are:

- mpi/views/MPI_call.cfg to show when MPI calls are made
- General/views/useful_duration.cfg to show continuous bursts of computation
Reducing trace file size#
When trace files are very large, Paraver will prompt the user to filter or cut the file to reduce its size. Filtering removes some information from the trace, making it small enough to open and allow the user to select a region of interest. Cutting crops the trace to a region of interest. Both operations create new trace files, and never overwrite the original trace.
The following prescription should work for manipulating large trace files:

1. Open the large trace file in Paraver and click ‘Yes’ to filter it.
2. Click on the tick box ‘Filter’.
3. Filter the trace file:
   - select the box for Events
   - select the box for Communications
   - in ‘Keep States’, select the box for ‘Running’
   - in ‘Keep States’, select the box for ‘IO’
   - select a minimum duration of 1000
   - click ‘Apply’
4. View the ‘useful duration’ configuration and locate the region of interest.
5. Zoom into the region of interest, starting and ending the zoom on equivalent large sections of computation (blue/green).
6. Right click -> Run -> Cutter.
7. Change the ‘Input’ trace file from the filtered trace to the original one.
8. Click ‘Cut’.
This produces a trace file which has all the original profiling information, but is much smaller as it is limited in time to a region of interest.
Machine-specific installation#
These are some configurations which have been found to work on particular machines.
Archer#
As of 1st February 2019, the following configuration should work:
$ module swap PrgEnv-cray PrgEnv-gnu
$ module load fftw
$ module load archer-netcdf/4.1.3
$ module load papi
$ module load bsctools/extrae
$
$ export CRAYPE_LINK_TYPE=dynamic
Note that due to a bug in the CC compiler, it is necessary to modify make.config after configuration to add the flag -fopenmp to BOUT_FLAGS when profiling OpenMP-parallelized code.