Fortran

Compiler Optimisations (Release Builds)

Subcategory: Core

Modern compilers support many automatic low-level optimisations which can improve the performance of the compiled code. The actual performance gain depends on your specific code, but it is not uncommon to see 5–10x speedups, and in some cases, 50x or more, especially for compute-intensive tasks.

Compiler optimisations are not enabled by default because they make debugging more difficult. Optimised builds can:

Reorder or eliminate code, making it harder to set breakpoints or inspect variables
Inline functions, obscuring the call stack
Remove or coalesce variables that appear unused

As a result, developers typically use unoptimised builds (debug) during development and optimised builds (release) for production.

Enabling Optimisations

If you wish to enable compiler optimisations (or check whether they are enabled) the guidance differs slightly between compilers.

GCC / Clang

The main optimisation flags provided by GCC and Clang are:

-O0: No optimisation (default)
-O1, -O2, -O3: Increasing levels of optimisation
-Ofast: Aggressive optimisations, may break strict standards compliance
-Os: Optimize for size

You should typically ensure that -O2 is passed to the compiler (either on the command-line or within the makefile/project) when compiling for release, to take advantage of compiler optimisations. If you’re not sure whether they’re being used, look out for them in the compile log.

-O3 and -Ofast can give mixed results: they apply the most aggressive optimisations, which can have side effects such as greatly increasing the binary size for little or no further performance gain.
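
If the compile log is not to hand, a Fortran program can also report how it was built. The following is a minimal sketch, assuming a compiler that supports the Fortran 2008 inquiry functions compiler_version and compiler_options from iso_fortran_env (the program name is hypothetical). The exact text returned is compiler-specific, so treat the output as a hint rather than a guarantee.

```fortran
program show_build_flags
  ! compiler_version and compiler_options are Fortran 2008 inquiry functions
  use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options
  implicit none

  ! Name and version of the compiler that built this executable
  print '(a)', 'Compiler: ' // compiler_version()

  ! Options the compiler reports it was invoked with; look for -O2, -O3, etc.
  print '(a)', 'Options:  ' // compiler_options()
end program show_build_flags
```

With gfortran, the options string typically includes the optimisation level that was passed on the command line, which makes a quick check of release builds possible.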

CMake

CMake supports Debug and Release build configurations via the CMAKE_BUILD_TYPE argument at configure time.

cmake -DCMAKE_BUILD_TYPE=Release .
cmake --build .

If the CMake project is correctly structured, passing Release will enable compiler optimisations (it may be worth reviewing the compilation log to confirm that an optimisation flag is present). Which configuration is used when none is specified (Debug, Release or something else entirely) is up to the project maintainer.

Visual Studio (MSVC)

The default project structure within Visual Studio contains both a Debug and a Release build configuration. The latter has compiler optimisations enabled and should be used for production builds.

To enable compiler optimisations manually, navigate to Project Properties > C/C++ > Optimization and set this value to /O2 (Maximise speed) or /Ox (Full optimisation).

The Technical Detail

A compiler can perform many different optimisations. GCC documents most of its optimisations and allows them to be toggled individually. Passing -Q --help=optimizers together with an optimisation level (for example -O2) to gcc, g++, or gfortran lists exactly which optimisations that level enables.

Loop ordering

Subcategory: Core

In Fortran, MATLAB, and R, arrays are stored in column-major order, meaning that entries that differ only in the left-most index are stored contiguously in memory. This is often described as the left-most index varying quickest. Many other popular scientific programming languages, such as C, C++, and Python, use row-major ordering instead. The implication for writing loops is that the innermost loop should run over the index that varies quickest (the left-most index in Fortran), with the outermost loop over the index that varies slowest. In this way, the loop nest traverses contiguous memory, which can be done efficiently.

Example Code

Consider the following Fortran example involving a loop over a 3D array, which uses the recommended column-major ordering:

integer, parameter :: n = 500
real, dimension(n, n, n) :: a, b, c
integer :: i, j, k

! Initialisation of a and b omitted

! Hand-coded c = a + b using column-major
do k = 1, n
  do j = 1, n
    do i = 1, n
      c(i,j,k) = a(i,j,k) + b(i,j,k)
    end do
  end do
end do

Without compiler optimisations, this will likely run several times faster than the (not recommended) equivalent row-major version:

integer, parameter :: n = 500
real, dimension(n, n, n) :: a, b, c
integer :: i, j, k

! Initialisation of a and b omitted

! Hand-coded c = a + b using row-major
do i = 1, n
  do j = 1, n
    do k = 1, n
      c(i,j,k) = a(i,j,k) + b(i,j,k)
    end do
  end do
end do

With compiler optimisations enabled, however, simple loops such as these, where the result of the sum is never used, would likely be optimised out by the compiler. In practice you wouldn’t normally have a redundant loop in your code, so this shouldn’t be a concern.

A further comment on the simple loops above is that the same could be achieved with an array operation, which has a more compact notation:

integer, parameter :: n = 500
real, dimension(n, n, n) :: a, b, c

! Initialisation of a and b omitted

! Compute c = a + b with an array operation
c(:,:,:) = a + b

Benchmark

The examples above are extended below into a full benchmark to demonstrate the impact. A checksum is also computed so that the compiler cannot optimise away the code being benchmarked.

Benchmark Code

```fortran
program benchmark_loops
  implicit none
  integer, parameter :: n = 1000
  real, allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
  real :: t1, t2
  integer :: i
  real :: checksum

  allocate(a(n,n,n), b(n,n,n), c(n,n,n))

  ! Initialize data
  a = 1.0
  b = 2.0
  c = 0.0

  ! Warm-up runs (optional but recommended)
  call col_major_add(a, b, c, n)
  call row_major_add(a, b, c, n)

  ! Benchmark column-major loop order
  call cpu_time(t1)
  call col_major_add(a, b, c, n)
  call cpu_time(t2)
  print *, "Column-major time (s): ", t2 - t1
  checksum = sum(c)
  print *," checksum: ", checksum

  ! Reset result array to avoid reuse artifacts
  c = 0.0

  ! Benchmark row-major loop order
  call cpu_time(t1)
  call row_major_add(a, b, c, n)
  call cpu_time(t2)
  print *, "Row-major time (s): ", t2 - t1
  checksum = sum(c)
  print *," checksum: ", checksum

contains

  subroutine col_major_add(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in) :: a(n,n,n), b(n,n,n)
    real, intent(out) :: c(n,n,n)
    integer :: i, j, k

    do k = 1, n
      do j = 1, n
        do i = 1, n
          c(i,j,k) = a(i,j,k) + b(i,j,k)
        end do
      end do
    end do
  end subroutine col_major_add

  subroutine row_major_add(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in) :: a(n,n,n), b(n,n,n)
    real, intent(out) :: c(n,n,n)
    integer :: i, j, k

    do i = 1, n
      do j = 1, n
        do k = 1, n
          c(i,j,k) = a(i,j,k) + b(i,j,k)
        end do
      end do
    end do
  end subroutine row_major_add

end program benchmark_loops
```

Compiled with GFortran and compiler optimisations enabled, column-major accesses were 2.4x faster under Ubuntu!

$ gfortran -O3 benchmark.f90
$ ./a.out
 Column-major time (s):    3.98375320    
  checksum:    67108864.0    
 Row-major time (s):       9.72134018    
  checksum:    67108864.0

Under WSL on different hardware, a larger 7.1x speedup was seen. Results will therefore vary between systems, but column-major ordering should remain the faster of the two.

$ gfortran -O3 benchmark.f90
$ ./a.out
 Column-major time (s):   0.440809250
  checksum:    67108864.0
 Row-major time (s):       3.11280060
  checksum:    67108864.0

Technical Details

The underlying reason that loop order has such a big impact is how computer memory operates. When a processor loads a variable from RAM into its caches, it doesn’t load an individual variable (likely 4 bytes); it loads a full cache line (likely 64 bytes).

Therefore, variables stored consecutively in memory are loaded together if they fall within the same cache line. The processor then does not need to go back to RAM, which has orders of magnitude higher latency than cache, to access the neighbouring variables. Because of how Fortran lays out memory, column-major loop ordering achieves this.

In contrast, if you operate as though memory is row-major, consecutively accessed variables are stored a long way apart in memory, so each access needs to load a full cache line from RAM. With this high turnover, a cache line is likely to have been evicted in favour of a fresher load before a second variable from it is accessed.

If you’re working with 4-byte types you could be doing 16x the RAM accesses, with 8-byte types it’s still 8x!

Furthermore, vector instructions are most effective when they operate on values that sit next to each other in memory. If a loop does not access consecutive elements, it can be much harder for the compiler to detect that vectorisation is appropriate.
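
The distances involved can be made concrete with a small sketch. It fills a 3D array in memory order, so that each element holds its own position in memory, and then prints how far apart neighbouring elements are along each index. The array size n = 5 is chosen only for readability, and the 4-byte element and 64-byte cache line are the assumed sizes used in the discussion above.

```fortran
program column_major_strides
  implicit none
  integer, parameter :: n = 5            ! small size, for illustration only
  integer, parameter :: elem_bytes = 4   ! assumed size of one element
  integer, parameter :: line_bytes = 64  ! assumed cache-line size
  integer :: a(n, n, n)
  integer :: p

  ! reshape fills in array element (column-major) order,
  ! so the p-th element in memory holds the value p
  a = reshape([(p, p = 1, n**3)], shape(a))

  ! Stepping each index by one moves this many elements through memory
  print '(a, i0)', 'stride along i (elements): ', a(2,1,1) - a(1,1,1)
  print '(a, i0)', 'stride along j (elements): ', a(1,2,1) - a(1,1,1)
  print '(a, i0)', 'stride along k (elements): ', a(1,1,2) - a(1,1,1)

  ! Number of consecutive elements that share one cache line
  print '(a, i0)', 'elements per cache line:   ', line_bytes / elem_bytes
end program column_major_strides
```

The strides printed are 1, n and n*n elements (1, 5 and 25 here), and 16 elements fit in one cache line. For the n = 500 arrays used earlier, an innermost loop over k would therefore jump 250,000 elements between consecutive accesses, far beyond a single cache line.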

Reshape

Subcategory: Core

It’s often required to reshape arrays used in Fortran code. This can be achieved in several ways, the most naive of which is to use hand-coded do loops. This is not recommended as it is error-prone and verbose. A better approach is to use the intrinsic reshape function, which is concise and clear.

The most efficient way to reshape arrays is to use pointers. This is a more advanced approach and care must be taken to ensure that (a) the original array is not deallocated while the pointer is still in use and (b) you are aware that modifications to the “reshaped” array will modify the original array and vice versa because they share the same memory.

The pack and unpack intrinsics can also be used to reshape arrays but this is not their intended purpose and they are less efficient than reshape at achieving this.

It is also possible to do implicit reshaping of arrays by passing them to procedures with different interface shapes. This is generally not recommended as it can lead to confusing code and requires compiler flags such as -fallow-argument-mismatch in the case of gfortran.

Example Code

The 1D to 2D reshape coded by hand in

integer, parameter :: m = 3, n = 4
integer, parameter :: mn = m * n
real, dimension(mn) :: arr1d
real, dimension(m, n) :: arr2d
integer :: i, j

! Initialisation of arr1d omitted

! Hand-coded reshape: element (i, j) sits at column-major position (j - 1) * m + i
do j = 1, n
  do i = 1, m
    arr2d(i, j) = arr1d((j - 1) * m + i)
  end do
end do

can be implemented using reshape as simply

arr2d = reshape(arr1d, shape(arr2d))

To implement this using pointers, use the c_f_pointer subroutine and c_loc function from the iso_c_binding module (intrinsic since Fortran 2003). The c_f_pointer subroutine associates a C pointer with a Fortran pointer, which allows you to “reshape” the array simply by aliasing (without copying data).

use iso_c_binding, only: c_f_pointer, c_loc

integer, parameter :: m = 3, n = 4
integer, parameter :: mn = m * n
real, target, dimension(mn) :: arr1d
real, pointer :: arr2d(:,:)

! Initialisation of arr1d omitted

! Associate arr2d with arr1d, reshaping it to (m, n)
call c_f_pointer(c_loc(arr1d), arr2d, [m, n])

! Use arr2d as needed

! Don't forget to nullify the pointer when it's no longer needed
nullify(arr2d)
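
The difference between copying and aliasing can be checked with a short program. This is a minimal sketch under the same assumptions as the snippets above (a default real array that is a valid argument to c_loc); the names copy2d and alias2d are illustrative only. It modifies the 1D array after both "reshapes" and shows that only the pointer version sees the change.

```fortran
program reshape_copy_vs_alias
  use, intrinsic :: iso_c_binding, only: c_f_pointer, c_loc
  implicit none
  integer, parameter :: m = 3, n = 4
  real, target  :: arr1d(m * n)
  real          :: copy2d(m, n)     ! filled by reshape (a copy)
  real, pointer :: alias2d(:,:)     ! associated via c_f_pointer (no copy)
  integer :: p

  arr1d = [(real(p), p = 1, m * n)]

  ! reshape returns a new array, so copy2d has its own storage
  copy2d = reshape(arr1d, [m, n])

  ! c_f_pointer makes alias2d a 2D view of the storage of arr1d
  call c_f_pointer(c_loc(arr1d), alias2d, [m, n])

  ! Modify the original after both views have been set up
  arr1d(1) = -1.0

  print *, 'copy2d(1,1)  =', copy2d(1,1)    ! unchanged by the assignment
  print *, 'alias2d(1,1) =', alias2d(1,1)   ! reflects the assignment

  ! Column-major mapping: element (i, j) is at position (j - 1) * m + i
  print *, 'alias2d(2,3) == arr1d(8):', alias2d(2,3) == arr1d((3 - 1) * m + 2)

  nullify(alias2d)
end program reshape_copy_vs_alias
```

The copy costs memory and time proportional to the array size, whereas the alias costs neither; that is the trade-off against the extra care the pointer approach requires.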

Use BLAS/LAPACK Instead of Reimplementing Maths Functions

Subcategory: Maths

BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage) are widely used open-source libraries for doing maths, especially matrix operations. They are highly optimised, often for specific CPU architectures by the vendors (Intel, AMD, etc.) themselves, and can be found on all modern HPC systems. Therefore, they are significantly faster than anything we can write ourselves, especially for larger matrices, and should be used in most circumstances.

The BLAS and LAPACK interfaces were originally defined in Fortran, so using them from C/C++ is a little cumbersome:

The function signature for any BLAS/LAPACK function to be used must first be declared.
If using C/C++, the function name usually needs a trailing underscore, e.g. dgemm is called as dgemm_.
All arguments must be passed by reference.
Matrices must be contiguous arrays in column-major order (more on this below).

Other than that, using them is very straightforward — all we need to do is call a function in our code, and then link against BLAS when compiling our code, e.g.:

gcc example.c -o example -lopenblas

or in CMake:

find_package(BLAS REQUIRED)

add_executable(example example.c)
target_link_libraries(example ${BLAS_LIBRARIES})

IMPORTANT: When using BLAS/LAPACK, make sure you don’t link against the reference BLAS (the reference implementation, which has not been optimised). A good default is OpenBLAS, but depending on your CPU etc., other implementations may be faster. Check which implementations are available on your HPC system and ask your local friendly HPC administrators if unsure.

Example

This is a small benchmark comparing a hand-coded matrix multiplication with the BLAS implementation:

#include <cstdlib>
#include <cstdio>
#include <chrono>

/**
 * Crude random matrix generator
 */
double* generate_matrix(int m, int n) {
  double* a = (double*)malloc(m * n * sizeof(double));
  for (int i = 0; i < m * n; ++i)
    a[i] = (double)rand() / RAND_MAX;
  return a;
}

/**
 * Manually implemented matrix multiplication
 */
void multiply(const double *matrix1, int m1, int n1, const double *matrix2, int m2, int n2, double *result) {
  for (int i = 0; i < m1; i++) {
    for (int j = 0; j < n2; j++) {
      result[i + j * m1] = 0.0;

      for (int k = 0; k < n1; k++) {
        result[i + j * m1] += matrix1[i + k * m1] * matrix2[k + j * m2];
      }
    }
  }
}

/**
* BLAS's matrix multiplication prototype
*/
extern "C" {
void dgemm_(const char*, const char*, const int*, const int*, const int*, const double*, const double*, const int*, const double*, const int*, const double*, double*, const int*);
}

int main() {
  const int m = 1000, n = 1000;
  const int REPEATS = 10;

  double *matrix1 = generate_matrix(m, n);
  double *matrix2 = generate_matrix(m, n);
  double *result_m = (double*)malloc(m * n * sizeof(double));
  double *result_b = (double*)malloc(m * n * sizeof(double));

  // Benchmark manual version
  auto start_m = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < REPEATS; ++i)
    multiply(matrix1, m, n, matrix2, m, n, result_m);
  auto end_m = std::chrono::high_resolution_clock::now();
  
  std::chrono::duration<double> elapsed_m = end_m - start_m;
  printf("Manual took an average of %f seconds.\n", elapsed_m.count()/REPEATS);
  
  // Benchmark BLAS version
  const double one = 1.0, zero = 0.0;
  auto start_b = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < REPEATS; ++i)
    dgemm_("N", "N", &m, &n, &n, &one, matrix1, &m, matrix2, &m, &zero, result_b, &m);
  auto end_b = std::chrono::high_resolution_clock::now();
  
  std::chrono::duration<double> elapsed_b = end_b - start_b;
  printf("BLAS took an average of %f seconds.\n", elapsed_b.count()/REPEATS);

  free(matrix1); free(matrix2); free(result_m); free(result_b);
}

Tested on a local HPC system with 16 cores, using OpenBLAS:

module load OpenBLAS
g++ bench.cpp -O3 -o bench -lopenblas
./bench

Output:

Manual took an average of 1.183420 seconds.
BLAS took an average of 0.002624 seconds.

That’s a 450x speed-up!
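
From Fortran, the same routine can be called directly, without the trailing underscore or explicit pass-by-reference. The following is a minimal sketch rather than a benchmark: it assumes a BLAS library is available at link time (for example by adding -lopenblas to the gfortran command, as in the C example above) and compares the dgemm result with the intrinsic matmul as a sanity check. The matrix size and variable names are illustrative.

```fortran
program dgemm_from_fortran
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! double precision, as dgemm expects
  integer, parameter :: n = 200            ! illustrative matrix size
  real(dp), allocatable :: a(:,:), b(:,:), c_blas(:,:), c_ref(:,:)
  external :: dgemm                        ! resolved by the BLAS library at link time

  allocate(a(n,n), b(n,n), c_blas(n,n), c_ref(n,n))
  call random_number(a)
  call random_number(b)

  ! Computes c_blas = 1.0 * a * b + 0.0 * c_blas
  ! Arguments: transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc
  call dgemm('N', 'N', n, n, n, 1.0_dp, a, n, b, n, 0.0_dp, c_blas, n)

  ! Reference result from the Fortran intrinsic
  c_ref = matmul(a, b)

  print *, 'max abs difference vs matmul:', maxval(abs(c_blas - c_ref))
end program dgemm_from_fortran
```

The printed difference should be at the level of floating-point rounding. Fortran arrays are already contiguous and column-major, so no layout conversion is needed.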

CBLAS

The original BLAS interface expects column-major matrices. CBLAS can be used from C/C++ to process row-major matrices directly, which may even perform faster due to improved memory access patterns.

Include <cblas.h> and prepend cblas_ to the method name. Note that char arguments have been replaced with enums and matrix functions have the additional layout argument, so the exact parameters passed to the function are likely to require updating.

/**
 * CBLAS include
 */
#include <cblas.h>

int main() {
...
  // Benchmark CBLAS version
  auto start_cb = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < REPEATS; ++i)
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, n, 1.0, matrix1, n, matrix2, n, 0.0, result_cb, n);
  auto end_cb = std::chrono::high_resolution_clock::now();
  
  std::chrono::duration<double> elapsed_cb = end_cb - start_cb;
  printf("CBLAS took an average of %f seconds.\n", elapsed_cb.count()/REPEATS);
}

This reports a further reduction in run time of roughly 23%:

CBLAS took an average of 0.002022 seconds.

GPU

Even better, modern BLAS and LAPACK implementations are now available for GPUs, allowing us to take advantage of these accelerators with minimal effort (though unfortunately the interface is often different):

NVIDIA has cuBLAS (it does not include LAPACK routines; NVIDIA provides LAPACK-style routines separately in cuSOLVER)
AMD has rocBLAS/hipBLAS
There are also cross-platform implementations (working on both NVIDIA and AMD hardware) such as magmaBLAS

Technical Explanation

There is a surprising amount of low-level optimisation possible when writing linear algebra functions - writing a simple double or triple loop to perform the operation is only the beginning, and the compiler can only do so much. One of the basic optimisations is computing the operation in blocks, but the block size needs to be optimised to the cache sizes available on a specific CPU. However, the HPC implementations of these libraries go way further than that, even going as far as writing the innermost computations directly in assembly. Furthermore, many HPC implementations are also parallelised with OpenMP, so using them gives us some parallelism “for free”.

Move invariant conditional out of the loop to facilitate vectorisation

Subcategory: Core

Many Fortran compilers will attempt to automatically vectorise a loop if possible. There are several bad practices which can inhibit this automatic vectorisation. One such pattern is when a conditional evaluates to the same value for all loop iterations and can be moved outside the loop. In this scenario, not only are we redundantly evaluating the same conditional but we are also inhibiting automatic vectorisation.

Example Code

C/C++

The following code shows an example of an invariant conditional inside a loop in C/C++:

int example(int *A, int n) {
  int total = 0;

  for (int i = 0; i < n; ++i) {
    if (n < 10) {
      total++;
    }
    A[i] = total;
  }

  return total;
}

The loop invariant can be extracted out of the loop by duplicating the loop body and removing the condition. The resulting code is as follows:

int example(int *A, int n) {
  int total = 0;

  if (n < 10) {
    for (int i = 0; i < n; ++i) {
      A[i] = ++total;
    }
  } else {
    for (int i = 0; i < n; ++i) {
      A[i] = total;
    }
  }

  return total;
}

Fortran

The following code shows an example of an invariant conditional inside a loop in Fortran:

pure subroutine example(array)
  integer, intent(out) :: array(:)
  integer :: i, total

  total = 0

  do i = 1, size(array, 1)
    if (size(array, 1) < 10) then
      total = total + 1
    end if
    array(i) = total
  end do
end subroutine example

The loop invariant can be extracted out of the loop by duplicating the loop body and removing the condition. The resulting code is as follows:

pure subroutine example(array)
  integer, intent(out) :: array(:)
  integer :: i, total

  total = 0

  if (size(array, 1) < 10) then
    do i = 1, size(array, 1)
      total = total + 1
      array(i) = total
    end do
  else
    do i = 1, size(array, 1)
      array(i) = total
    end do
  end if
end subroutine example
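
When applying this transformation by hand, it is worth confirming that both versions produce identical results. The sketch below wraps the two Fortran versions above in a module under the hypothetical names cond_inside and cond_hoisted, and compares them for one array on each side of the size threshold.

```fortran
module invariant_demo
  implicit none
contains

  ! Conditional evaluated on every iteration
  pure subroutine cond_inside(array)
    integer, intent(out) :: array(:)
    integer :: i, total

    total = 0
    do i = 1, size(array, 1)
      if (size(array, 1) < 10) then
        total = total + 1
      end if
      array(i) = total
    end do
  end subroutine cond_inside

  ! Conditional hoisted out of the loop
  pure subroutine cond_hoisted(array)
    integer, intent(out) :: array(:)
    integer :: i, total

    total = 0
    if (size(array, 1) < 10) then
      do i = 1, size(array, 1)
        total = total + 1
        array(i) = total
      end do
    else
      do i = 1, size(array, 1)
        array(i) = total
      end do
    end if
  end subroutine cond_hoisted

end module invariant_demo

program check_equivalence
  use invariant_demo
  implicit none
  integer :: small_a(5), small_b(5)    ! size < 10: the branch is taken
  integer :: large_a(20), large_b(20)  ! size >= 10: the branch is not taken

  call cond_inside(small_a)
  call cond_hoisted(small_b)
  call cond_inside(large_a)
  call cond_hoisted(large_b)

  ! Stop with an error if the two versions ever disagree
  if (any(small_a /= small_b) .or. any(large_a /= large_b)) error stop 'versions disagree'

  print '(a, 5(1x, i0))', 'size 5 result:', small_a
  print '(a, i0)', 'size 20 maximum value: ', maxval(large_a)
end program check_equivalence
```

For the size-5 array both versions give 1 to 5, and for the size-20 array both give all zeros, so the program finishes without reaching the error stop.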

Benchmarks

Benchmarks written using the above example (compiled with GNU Fortran (Homebrew GCC 15.2.0) 15.2.0) showed a performance improvement of more than 50% without optimisation flags (-O0), but an improvement of less than 1% at higher optimisation levels (e.g. -O3), where the compiler can often hoist the invariant conditional itself.

References

Examples have been reproduced from the Codee Open Catalog (PWR022).