If you’ve profiled your code and reached a point of reasonable performance, you may want to consider whether it can be sped up further with parallelism. There are several different forms of parallelism that can be used; however, each has different trade-offs which may make it unsuitable for your code, or too challenging to utilise without advanced skills.

The following forms of parallelism are outlined below: vectorisation, multi-threading, GPU and distributed (HPC).

It’s recommended that you read all four sections before deciding which to pursue.

Vectorisation

Vectorisation is a form of “Single Instruction Multiple Data” (SIMD) parallelism, whereby special vector instructions perform the same operation on multiple numbers simultaneously. To achieve this, there needs to be enough data to fill the processor’s vector registers (e.g. Intel’s AVX/AVX2 registers are 256-bit, holding either 8 floats or 4 doubles), the data must be stored in contiguous buffers, and the compiled code needs to use the appropriate vector instructions.

As long as data is formatted correctly there’s no overhead to utilising vectorisation, so vectorising even a single instruction should improve performance. Likewise, the simplicity of vectorisation can lead to the parallel speed-up of the vectorised code being very close to linear. This kind of parallelism can greatly improve the performance of large numeric codes, where mathematical operations are applied across large buffers and data-tables.

Modern C/C++ compilers try to auto-vectorise appropriate code, but they’re far from fool-proof, frequently missing opportunities (different compilers may even find different auto-vectorisation opportunities). If you’re writing C/C++ it is possible to use vector instructions explicitly via assembly or std::simd, but this isn’t encouraged for typical users.

Instead, if you’re using OpenMP you can take advantage of #pragma omp simd. When applied to appropriate for loops, it provides a strong hint to auto-vectorise and may produce compiler warnings where vectorisation fails.
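
As a minimal sketch (assuming a compiler with OpenMP support, compiled with e.g. -fopenmp or -fopenmp-simd), the pragma is placed directly above the loop to be vectorised; the saxpy function and array sizes here are purely illustrative:

#include <stdio.h>

/* y[i] = a*x[i] + y[i] over contiguous arrays; #pragma omp simd asks the
   compiler to vectorise this loop. */
static void saxpy(float a, const float *x, float *y, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    float x[1024], y[1024];
    for (int i = 0; i < 1024; ++i) { x[i] = (float)i; y[i] = 1.0f; }
    saxpy(2.0f, x, y, 1024);
    printf("y[10] = %f\n", y[10]);  /* expect 21.0 */
    return 0;
}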

More broadly, if you’re using high-performance mathematical libraries (such as NumPy, LAPACK, BLAS, etc.) you’re probably already taking advantage of vectorisation. The developers of high-performance libraries do the hard work of ensuring vector instructions are used; you just need to check the relevant library’s documentation to ensure you’re following their best practices.
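
For example, a matrix multiplication delegated to a BLAS implementation will use whatever vector instructions that library was built with. A minimal sketch using the CBLAS interface (header names and link flags, e.g. -lopenblas, vary between implementations):

#include <stdio.h>
#include <cblas.h>   /* CBLAS interface, e.g. provided by OpenBLAS */

int main(void) {
    /* 2x2 matrices stored contiguously in row-major order */
    double A[4] = {1.0, 2.0, 3.0, 4.0};
    double B[4] = {5.0, 6.0, 7.0, 8.0};
    double C[4] = {0.0};

    /* C = 1.0 * A * B + 0.0 * C; the library's dgemm kernel is hand-tuned
       to use the CPU's vector instructions internally */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("%f %f\n%f %f\n", C[0], C[1], C[2], C[3]);  /* expect 19 22 / 43 50 */
    return 0;
}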

Multi-threading

Modern processors have multi-core architectures, in which each processor core is able to operate independently. Early multi-core processors were only dual-core, but processors now typically have 8 or 16 cores (and as many as 96 in some cases). However, not all cores are equal; their performance can differ substantially. A modern processor may combine performance cores (P-cores), efficiency cores (E-cores) and logical cores (hyper-threading/simultaneous multi-threading).

It is very difficult to achieve a linear parallel speed-up with CPU multi-threading, even if logical and E-cores are ignored. CPUs have limited shared hardware resources, such as caches, and it is rare for parallel workloads to be balanced such that these bottlenecks are not reached. Furthermore, increased activity leads to increased thermal output, which may cause the processor to thermally throttle.

Regardless, if you wish to utilise multi-threading, the rule for data storage is the opposite of that for vectorisation. Parallel threads should operate on variables that are stored in different cache lines. When processors load memory, they load a full 64-byte line from RAM into their cache, rather than individual 4-byte floats. That cache line is then copied into a processor core’s smaller local cache. If multiple cores are operating on the same cache line in parallel, they will keep invalidating each other’s copy, forcing otherwise redundant reloads. This can greatly impact performance. Vectorisation can be combined with multi-threading; cache-line and vector register lengths are typically compatible.
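
A minimal OpenMP sketch of this idea (assuming 64-byte cache lines and 4 threads, both illustrative): each thread updates only its own counter, and the counters are aligned so that no two share a cache line.

#include <stdio.h>
#include <omp.h>

#define N_THREADS 4

/* Each counter is aligned to its own 64-byte cache line; if the counters
   were packed next to each other, parallel updates would repeatedly
   invalidate the same line in every core's cache (false sharing). */
struct padded_counter {
    _Alignas(64) long value;
};

int main(void) {
    struct padded_counter counts[N_THREADS] = {{0}};

    #pragma omp parallel num_threads(N_THREADS)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < 10000000; ++i)
            counts[tid].value++;            /* touches only this thread's line */
    }

    long total = 0;
    for (int i = 0; i < N_THREADS; ++i)
        total += counts[i].value;
    printf("total = %ld\n", total);         /* expect 4 * 10000000 */
    return 0;
}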

The easiest way to take advantage of multi-threading is via packages such as OpenMP (C/C++/Fortran), multiprocessing (Python) and parallel (R). Furthermore, modern programming languages increasingly offer simple drop-in replacements to parallelise tasks such as for loops and other operations on arrays/data-tables (it can be worth checking documentation).

These abstractions reduce the need to thoroughly understand threads and other parallel concepts. Instead, they simply require you to decide whether your workload is data parallel (e.g. an operation applied to all elements of an array or data-table, often in the form of a for loop) or task parallel (e.g. fully independent jobs), and to identify race conditions (where two threads access the same data concurrently). Small amounts of restructuring or additional code can then introduce multi-threading.
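
For instance, a data-parallel loop in OpenMP (the summation here is a hypothetical workload, used only as a sketch): the reduction clause resolves what would otherwise be a race condition on the shared sum.

#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* Each iteration is independent apart from the sum, which OpenMP
       combines safely across threads via the reduction clause. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += (double)i * 0.5;

    printf("sum = %f\n", sum);
    return 0;
}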

In addition to the above, the performance of multi-threading via these methods is limited by the overhead of creating and destroying parallel environments and by how loads are balanced between threads. To maximise performance, the relevant package’s documentation should again be reviewed to ensure best practices are followed.
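
In OpenMP, for example, load balancing can be influenced via the schedule clause. The sketch below uses a deliberately uneven, illustrative workload; schedule(dynamic) lets idle threads pick up remaining iterations at run time, at the cost of a little scheduling overhead.

#include <stdio.h>

/* Work that grows with i, so an even static split would be unbalanced. */
static double uneven_work(int i) {
    double x = 0.0;
    for (int j = 0; j < i; ++j) x += 1.0 / (j + 1.0);
    return x;
}

int main(void) {
    const int n = 20000;
    double total = 0.0;

    /* schedule(dynamic, 16): threads grab 16 iterations at a time as they
       finish, balancing the load across uneven iterations. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += uneven_work(i);

    printf("total = %f\n", total);
    return 0;
}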

There are many common pitfalls, such as false sharing, which can lead to a disappointing speedup when OpenMP is added to a codebase. Make sure to review our OpenMP optimisation database to avoid them!

GPU

GPU parallelism requires distinct accelerator hardware, suitable for some highly parallel tasks. In contrast to a CPU’s up to ~100 parallel cores, GPUs require around three orders of magnitude more threads to be fully saturated (e.g. 100,000 threads).

GPUs have much in common with CPU vector instructions; however, GPUs operate at lower clock speeds than CPUs and must be driven by an application running on a CPU. With the exception of NVIDIA’s Grace Hopper superchip, GPUs do not share memory with the main CPU, which adds latency when copying data to or from the GPU. Despite these drawbacks, due to the massive parallel scale at which GPUs operate, they will often outperform CPUs on suitably large parallel workloads.

Most GPU parallel code to date has been written using NVIDIA’s proprietary CUDA framework. NVIDIA’s early entry into the field of GPGPU, with over a decade of limited competition, led to all substantial GPU parallel libraries being built using CUDA. This is now gradually changing: AMD and Intel have both entered the accelerated compute market, and increasingly code is being written using AMD’s ROCm and HIP, and SYCL (the spiritual successor to OpenCL). However, most GPU parallel code remains specific to a particular vendor’s GPUs, so it’s worth considering which GPUs you have access to before progressing further.

Writing efficient code to target GPUs directly is challenging. GPUs are a specialist architecture; without a thorough understanding of it, there are many pitfalls that will harm performance. Whilst code targeting GPUs is typically written in C++, much of the C++ standard library (such as the common data structures) is unavailable on the device. Combined with the need to manually transfer data to the GPU’s memory, this can mean that any non-trivial code requires significant rewrites to utilise GPU compute appropriately.

Similar to OpenMP for multi-threading, OpenACC and modern OpenMP’s offload functionality (OpenMP 4.0+) can be used to convert existing C/C++/Fortran code to utilise GPU compute. Due to the complexity of GPU compute, there are additional considerations to be managed when using these, such as data movement. If data is regularly copied back and forth between CPU and GPU with only a small amount of computation on the GPU, the data movement overhead will quickly exceed the time spent performing compute.
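
A minimal sketch of OpenMP target offload (assuming a compiler built with offload support; the required flags vary by compiler and GPU vendor, and without offload support the loop simply runs on the host). The map clauses control which arrays are copied to and from the device; keeping data resident across kernels avoids repeated transfers.

#include <stdio.h>

int main(void) {
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];

    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    /* map(to:) copies x to the device; map(tofrom:) copies y there and back */
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    return 0;
}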

Many libraries also provide abstracted access to GPU compute. NVIDIA provides cuPyNumeric and cuDF.pandas, drop-in replacements for the Python packages NumPy and Pandas. Other scientific libraries increasingly offer similar GPU-accelerated back-ends, so it’s worth checking whether one exists for the libraries you already use.

Distributed (HPC)

Distributed computing is where compute is distributed over multiple nodes (computers). This is typically available via HPC clusters, although historically it has also utilised volunteers’ spare compute, in the form of projects such as SETI@Home and Folding@Home.

There are two common approaches to utilising distributed compute: the Message Passing Interface (MPI), and job or task arrays.

MPI is an advanced API for C/C++/Fortran that allows programmers to distribute workloads across multiple processes, whether they are on the same computer or distributed over a network. You’re unlikely to want to write MPI code directly; however, if you’re using scientific compute packages, it’s possible they already have MPI support. HPC systems that support multi-node jobs should provide an MPI implementation, such as OpenMPI or MPICH, that can be used.
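
To give a flavour of the model, a minimal sketch (the summation is purely illustrative): every process runs the same program, uses its rank to pick its share of the work, and the results are combined with a collective operation. It would be launched with mpirun or srun.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    /* Each rank sums a strided share of 0..n-1 */
    const int n = 1000000;
    long long local = 0;
    for (int i = rank; i < n; i += size)
        local += i;

    /* Combine the partial sums on rank 0 */
    long long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %lld (from %d processes)\n", total, size);

    MPI_Finalize();
    return 0;
}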

In contrast, job arrays (SLURM terminology) are a straightforward technique for splitting up an HPC job that contains multiple independent units, so that they can be executed in parallel across a cluster. The maximum number of tasks a job can be split into varies depending on the HPC system; it could be anywhere from 50 to 1000. It’s recommended to ensure that each individual task is still a substantial workload, otherwise the scheduling cost will outweigh that of the job itself.

A SLURM task array is launched via --array=0-31, which would submit the job 32 times, with IDs 0-31 (inclusive). Within the job script it is necessary to use the task’s unique ID to configure its specific work; this could be as simple as using the ID as an index into a list of input configurations.

SLURM will provide the job script with six relevant environment variables at execution:

# The unique job ID of this individual task
SLURM_JOB_ID
# The common job ID shared by all tasks in the array
SLURM_ARRAY_JOB_ID
# The unique index of this task, according to what was passed via --array
SLURM_ARRAY_TASK_ID
# The total number of tasks in the array
SLURM_ARRAY_TASK_COUNT
# The maximum task index
SLURM_ARRAY_TASK_MAX
# The minimum task index
SLURM_ARRAY_TASK_MIN

The simplest example of a task array job script would be:

#!/bin/bash
#SBATCH ...           # Usual SLURM job configuration
#SBATCH --array=0-31  # Array range


# Run my program using SLURM_ARRAY_TASK_ID
./my_program -i inputs/${SLURM_ARRAY_TASK_ID}

This could be made more advanced, processing multiple jobs per task, by dynamically dividing a for loop within the job script.

Task arrays are a really easy way to speed up the execution of embarrassingly parallel workloads, where you have many independent jobs to execute. It’s worth reviewing the documentation for the HPC system that you will be using, as exact usage may vary slightly depending on how the scheduler is configured.