Python

Array Broadcasting (Vectorisation)

Subcategory: NumPy

The manner by which NumPy stores data in arrays enables its functions to utilise array broadcasting (more broadly known as vectorisation), whereby the processor executes one instruction across multiple variables simultaneously, for every mathematical operation between arrays. Array broadcasting can perform mathematical operations many times faster; however, it requires using supported functions.
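
As a minimal sketch of the idea (the array names and values are illustrative, not taken from the benchmark further down), the same element-wise calculation can be written as a Python-level loop or as a single array expression:

```python
import numpy as np

# Illustrative inputs (hypothetical values)
a = np.arange(5, dtype=np.float64)   # 0, 1, 2, 3, 4
b = np.full(5, 2.0)                  # 2, 2, 2, 2, 2

# Python-level loop: one interpreted multiply-add per element
loop_result = [x * y + 1.0 for x, y in zip(a, b)]

# Array expression: the loop over elements runs inside NumPy's compiled code,
# and the scalar 1.0 is broadcast to match the shape of the arrays
vec_result = a * b + 1.0

print(vec_result)
```

Both forms produce the same values; only the array expression gives NumPy the opportunity to process several elements per instruction.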

Auto Parallel NumPy

Additionally, NumPy functions which support array broadcasting can sometimes take advantage of auto parallelisation, particularly on HPC systems. These functions are typically backed by BLAS and LAPACK. It’s not well documented which functions support auto parallelisation, but they mostly correspond to linear algebra operations.

The auto-parallelisation of these functions is hardware dependent, so you won’t always automatically get the additional benefit of parallelisation. However, HPC systems should be primed to take advantage, so try increasing the number of cores you request when submitting your jobs and see if the performance improves.
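
The sketch below shows one way this might be explored. It assumes that the BLAS library behind your NumPy build reads the OMP_NUM_THREADS environment variable when it is loaded; this is common for builds such as OpenBLAS and MKL but is not guaranteed, and the matrix sizes are arbitrary.

```python
import os

# Assumption: the linked BLAS honours OMP_NUM_THREADS. It must be set before
# numpy is first imported; some builds ignore it entirely.
os.environ.setdefault("OMP_NUM_THREADS", "2")

import numpy as np

rng = np.random.default_rng(0)
a = rng.random((300, 300))
b = rng.random((300, 300))

# Matrix multiplication is typically dispatched to BLAS, which may use
# several threads without any change to the Python code
c = a @ b
print(c.shape)
```

Timing the multiplication with different thread counts, or different numbers of requested cores, is a quick way to see whether your system benefits.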

`vectorize()`

NumPy provides a vectorize() function for operating over its arrays.

This doesn’t actually make use of processor-level vectorisation, so won’t afford a speed up. From NumPy’s documentation:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
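
A short sketch (with an illustrative array, not the benchmark data) showing that vectorize() gives the same answer as a plain array expression, while still calling the Python function once per element:

```python
import numpy as np

ar = np.arange(10, dtype=np.int64)

# vectorize() wraps a Python function so that it accepts arrays,
# but the wrapped function is still invoked element by element
square = np.vectorize(lambda i: i * i)
via_vectorize = square(ar)

# The plain array expression is executed inside NumPy's compiled code
via_broadcast = ar * ar

print(np.array_equal(via_vectorize, via_broadcast))
```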

Example Code

The below example computes the dot product of an array of length 1 million with itself.

from timeit import timeit

N = 1000000  # Number of elements in list

gen_list = f"ls = list(range({N}))"
gen_array = f"import numpy;ar = numpy.arange({N}, dtype=numpy.int64)"

# List comprehension to compute
py_sum_ls = "sum([i*i for i in ls])"
# Vectorised multiplication
py_sum_ar = "sum(ar*ar)"
# Vectorised multiplication and sum
np_sum_ar = "numpy.sum(ar*ar)"
# Specialised vectorised dot product
np_dot_ar = "numpy.dot(ar, ar)"
# NumPy vectorise equivalent of py_sum_ar/py_sum_ls
np_vec_ar = "sum(numpy.vectorize(lambda i: i*i)(ar))"

repeats = 1000
print(f"python_sum_list: {timeit(py_sum_ls, setup=gen_list, number=repeats):.2f}ms")
print(f"python_sum_array: {timeit(py_sum_ar, setup=gen_array, number=repeats):.2f}ms")
print(f"numpy_sum_array: {timeit(np_sum_ar, setup=gen_array, number=repeats):.2f}ms")
print(f"numpy_dot_array: {timeit(np_dot_ar, setup=gen_array, number=repeats):.2f}ms")
print(f"numpy_vec_array: {timeit(np_vec_ar, setup=gen_array, number=repeats):.2f}ms")

python_sum_list uses list comprehension to perform the multiplication, followed by the Python core sum(). This comes out at 81.65ms.
python_sum_array instead directly multiplies the two arrays, taking advantage of NumPy’s vectorisation, but still uses the core Python sum(). This comes in slightly faster at 58.27ms.
numpy_sum_array again takes advantage of NumPy’s vectorisation for the multiplication, and additionally uses NumPy’s sum() implementation. These two rounds of vectorisation provide a much faster 2.68ms completion.
numpy_dot_array instead uses NumPy’s dot() to calculate the dot product in a single operation. This comes out the fastest at 0.46ms, about 177x faster than python_sum_list.
numpy_vec_array uses NumPy’s vectorize(). This was the slowest of the lot at 220.95ms, roughly 2.7x slower than python_sum_list and likely equivalent to using a Python for loop.

python_sum_list: 81.65ms
python_sum_array: 58.27ms
numpy_sum_array: 2.68ms
numpy_dot_array: 0.46ms
numpy_vec_array: 220.95ms

The Technical Detail

The vector instructions, which NumPy’s array broadcasting takes advantage of, enable a CPU to apply the same operation to multiple data elements in parallel with a single thread.

Modern CPUs use SIMD (Single Instruction, Multiple Data) instructions to operate on several values packed into a vector register. Since a typical CPU cache line is 64 bytes, and standard data types like 32-bit or 64-bit floats and integers are 4 or 8 bytes each, 8–16 values can fit within a single cache line and, on CPUs with sufficiently wide vector registers, be processed together by a single vector instruction.

However, to take advantage of this, the data must be aligned in memory, laid out such that it fits neatly into the expected cache line boundaries. Default memory allocations in Python don’t guarantee this kind of alignment, as doing so for all objects would waste memory.

NumPy arrays, in contrast, are explicitly designed for numerical performance. They allocate memory with alignment guarantees, ensuring that vector instructions can be used.
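
The arithmetic above can be checked directly from NumPy’s own metadata. This is a small sketch; the 64-byte cache line is the typical figure quoted above and is an assumption for any particular CPU.

```python
import numpy as np

CACHE_LINE_BYTES = 64  # typical value; an assumption for other hardware

# How many values of each type fit in one cache line
values_per_line = {}
for dtype in (np.float32, np.float64, np.int32, np.int64):
    itemsize = np.dtype(dtype).itemsize
    values_per_line[np.dtype(dtype).name] = CACHE_LINE_BYTES // itemsize
print(values_per_line)

# A NumPy array stores its values back to back in one block of memory
ar = np.arange(16, dtype=np.float32)
print(ar.nbytes, ar.flags["C_CONTIGUOUS"], ar.flags["ALIGNED"])
```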

List Comprehension

Subcategory: Core

List comprehension (e.g. [expression for item in iterable if condition == True]) is faster than constructing a list with a loop. It can’t be used for all list constructions, such as where items depend on one another, but it should be used whenever possible.

Example Code

Rather than constructing a list with append() inside a loop

squares = []
for x in range(100):
    squares.append(x**2)

Use a list comprehension

squares = [x**2 for x in range(100)]

Whilst being more concise and readable, it is also typically faster. The speed-up will depend on the length of the list and the complexity of its construction, but it is likely to be around twice as fast.

List comprehension syntax can be nested to create two-dimensional lists, however using map() and zip() may create more readable code in these scenarios.
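
As an illustrative sketch (the variable names and values are hypothetical), the snippet below shows a nested comprehension, a filtered comprehension, and the zip() and map() alternatives for combining two lists:

```python
# Nested comprehension building a 2 x 3 two-dimensional list
grid = [[row * 3 + col for col in range(3)] for row in range(2)]

# Comprehension with a condition: keep only the even squares
even_squares = [x**2 for x in range(10) if x % 2 == 0]

# Combining two lists element by element
xs = [1, 2, 3]
ys = [10, 20, 30]
sums_zip = [x + y for x, y in zip(xs, ys)]
sums_map = list(map(lambda x, y: x + y, xs, ys))

print(grid, even_squares, sums_zip, sums_map)
```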

Example Benchmark

The below code provides a simple benchmark of creating a list of 100,000 consecutive integers using three approaches.

from timeit import timeit

# Construct the list appending each item individually
def list_append():
    li = []
    for i in range(100000):
        li.append(i)

# Construct the list by preallocating it, before setting each item to the correct value
def list_preallocate():
    li = [0]*100000
    for i in range(100000):
        li[i] = i

# Construct the list using list comprehension
def list_comprehension():
    li = [i for i in range(100000)]

repeats = 1000
print(f"list_append: {timeit(list_append, number=repeats):.2f}ms")
print(f"list_preallocate: {timeit(list_preallocate, number=repeats):.2f}ms")
print(f"list_comprehension: {timeit(list_comprehension, number=repeats):.2f}ms")

In testing this produced the following results

list_append: 3.50ms
list_preallocate: 2.48ms
list_comprehension: 1.69ms

Using list comprehension was over 2x faster than using a loop and append().

The Technical Detail

List comprehension is likely faster than using loops and append, because the syntax is more constrained allowing greater optimisation within the Python back-end.

How Lists Are Implemented

Lists are implemented as a form of dynamic array found within many programming languages by different names (C++: std::vector, Java: ArrayList, R: vector, Julia: Vector).

They allow direct and sequential element access, with the convenience to append items.

This is achieved by internally storing items in a static array. This array however can be longer than the list, so the current length of the list is stored alongside the array. When an item is appended, the list checks whether it has enough spare space to add the item to the end. If it doesn’t, it will re-allocate a larger array, copy across the elements, and deallocate the old array. The item to be appended is then copied to the end and the counter which tracks the list’s length is incremented.

The amount the internal array grows by is dependent on the particular list implementation’s growth factor. CPython for example uses newsize + (newsize >> 3) + 6, which works out to an over-allocation of roughly 12.5%.

Figure: The relationship between the number of appends to an empty list, and the number of internal resizes in CPython. The plot is almost vertical for x < 100,000, before tapering off.

This has two implications:

If you are creating large static lists, they will use up to 12.5% excess memory.
If you are growing a list with append(), there will be large amounts of redundant allocations and copies as the list grows.

Both of which will result in slower code.
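
The over-allocation can be observed, at least in CPython, by watching the reported size of a list while appending. This is a rough sketch: sys.getsizeof() reports the size of the list object including its internal array, and the exact numbers are an implementation detail that varies between Python versions.

```python
import sys

li = []
resizes = 0
last_size = sys.getsizeof(li)

for i in range(1000):
    li.append(i)
    size = sys.getsizeof(li)
    # A change in reported size means the internal array was reallocated
    if size != last_size:
        resizes += 1
        last_size = size

print(f"{len(li)} appends triggered {resizes} resizes")
```

Most appends leave the reported size unchanged, because the item fits into spare capacity left by an earlier resize.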

Searching with a for-loop

Subcategory: Core

Searching for an element in a sequence (e.g. a list) with a for-loop and an equality check is very natural, but Python has a built-in in keyword which should be used whenever possible.
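
A minimal sketch of the two approaches, using hypothetical function names for illustration:

```python
def search_loop(seq, target):
    # Manual search: every comparison is executed by the interpreter
    for item in seq:
        if item == target:
            return True
    return False

def search_in(seq, target):
    # The in operator performs the same search inside the interpreter's
    # compiled code for built-in sequences such as lists
    return target in seq

values = list(range(1000))
print(search_loop(values, 999), search_in(values, 999))
```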

Set

Subcategory: Core

Similar to the mathematical concept of a set, Python (and most other languages) provides a data structure set which is an unordered collection of unique values. Using a set is the fastest way to detect unique items (e.g. if a in my_set) or remove duplicates (e.g. [x for x in set(my_list)]).

Example Code

Rather than using a list to build a unique collection of items

import random
# Create a list of 1000 unique random numbers in the range [0, 2000]
list_out = []
while len(list_out) < 1000:
    t = random.randint(0, 2000)
    if t not in list_out:
        list_out.append(t)

Use a set

import random
# Create a set of 1000 unique random numbers in the range [0, 2000]
set_out = set()  # note: {} would create an empty dict, not a set
while len(set_out) < 1000:
    t = random.randint(0, 2000)
    set_out.add(t)

Depending on the size of the collection, the order of items, and the proportion of duplicates, using a set could be thousands of times faster than using a list.

Example Benchmark

The below code provides a simple benchmark of removing duplicates from a list of 25000 random integers using a list or set.

import random
from timeit import timeit

# A simple method to generate us a consistent input list
def generateInputs(N = 25000):
    random.seed(12)  # Ensure every list is the same
    return [random.randint(0, N // 2) for i in range(N)]

# Pass the list directly to the set constructor
def uniqueSet():
    ls_in = generateInputs()
    set_out = set(ls_in)

# Iterate the list, adding each item to the set individually
def uniqueSetAdd():
    ls_in = generateInputs()
    set_out = set()
    for i in ls_in:
        set_out.add(i)

# Iterate the list, adding each unique item to the new list individually
def uniqueList():
    ls_in = generateInputs()
    ls_out = []
    for i in ls_in:
        if i not in ls_out:
            ls_out.append(i)

# Sort the input list, add each item if it does not match the last item of the new list
def uniqueListSort():
    ls_in = generateInputs()
    ls_in.sort()
    ls_out = [ls_in[0]]
    for i in ls_in:
        if ls_out[-1] != i:
            ls_out.append(i)
            
repeats = 1000
gen_time = timeit(generateInputs, number=repeats)
print(f"uniqueSet: {timeit(uniqueSet, number=repeats)-gen_time:.2f}ms")
print(f"uniqueSetAdd: {timeit(uniqueSetAdd, number=repeats)-gen_time:.2f}ms")
print(f"uniqueList: {timeit(uniqueList, number=repeats)-gen_time:.2f}ms")
print(f"uniqueListSort: {timeit(uniqueListSort, number=repeats)-gen_time:.2f}ms")

In testing this produced the following results

uniqueSet: 0.30ms
uniqueSetAdd: 0.81ms
uniqueList: 660.71ms
uniqueListSort: 2.67ms

Using the constructor for set was over 2000x faster than iterating the unsorted list.

The Technical Detail

Set data structures are similar to dictionaries, but without the values. Internally they are typically implemented with hashing or tree data-structures (CPython’s set uses a hash table). These are highly optimal for direct access to items, requiring fewer items to be checked to test for existence.

In contrast, performing a search through an unsorted list will require all items to be checked in the worst case whereby the item is not found. Approaches such as sorting the list and using a binary search can greatly improve performance, however in most cases using a set will be preferable.
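
The sorted-list alternative mentioned above can be sketched with the standard library’s bisect module. The helper name contains_sorted is hypothetical, and the set lookup is included for comparison:

```python
from bisect import bisect_left

def contains_sorted(sorted_list, target):
    # Binary search: bisect_left returns the position where target would be
    # inserted, so target is present only if that position already holds it
    idx = bisect_left(sorted_list, target)
    return idx < len(sorted_list) and sorted_list[idx] == target

items = sorted([42, 7, 19, 3, 88, 61])
item_set = set(items)

print(contains_sorted(items, 19), 19 in item_set)
print(contains_sorted(items, 20), 20 in item_set)
```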

numpy.ndarray.resize()

Subcategory: NumPy

NumPy’s arrays are static arrays which, unlike core Python’s lists, do not dynamically resize. If you wish to append to a NumPy array, you must call resize() first, which can lead to performance issues. If you treat array.resize like list.append, resizing for each individual append, you will be performing significantly more copies and memory allocations and hence make your code slower.

While resizing can be valid in some scenarios, it should generally be minimised. Ideally array.resize should be used to increase (or decrease) capacity by many elements at once (similar to how a list works internally).
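
One possible way to resize by many elements at once is sketched below. The function name build_array_chunked and the chunk size of 1024 are illustrative assumptions; refcheck=False is passed because the function knows no other references or views of the array exist while it is being resized.

```python
import numpy as np

def build_array_chunked(n, chunk=1024):
    ar = np.zeros(chunk)   # start with spare capacity
    length = 0             # number of elements actually in use
    for i in range(n):
        if length == ar.shape[0]:
            # Out of capacity: grow by a whole chunk rather than by one element
            ar.resize(ar.shape[0] + chunk, refcheck=False)
        ar[length] = i
        length += 1
    # Trim the unused capacity once, at the end
    ar.resize(length, refcheck=False)
    return ar

result = build_array_chunked(5000)
print(result.shape)
```

With these values the array is reallocated only a handful of times, rather than once per appended element.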

Example Code

The below example sees lists and arrays constructed from range(100000).

from timeit import timeit
import numpy

N = 100000  # Number of elements in list/array

def list_comprehension():
    ls = [i for i in range(N)]

def list_append():
    ls = []
    for i in range(N):
        ls.append(i)

def array_resize():
    ar = numpy.zeros(1)
    for i in range(1, N):
        ar.resize(i+1)
        ar[i] = i

repeats = 1000
print(f"list_comprehension: {timeit(list_comprehension, number=repeats):.2f}ms")
print(f"list_append: {timeit(list_append, number=repeats):.2f}ms")
print(f"array_resize: {timeit(array_resize, number=repeats):.2f}ms")

Resizing a NumPy array was 8x slower than a list, and 13x slower than list comprehension.

list_comprehension: 6.29ms
list_append: 9.68ms
array_resize: 82.66ms

The Technical Detail

Resizing an array allocates a new buffer in memory of the required size, copies the data across, and de-allocates the old storage.

In comparison a list, whilst backed by an array, performs greedy resizes. The length of the list visible to the user does not reflect the length of the internal array. This allows it to simply store new items and increase its length-counter when appending. Only occasionally does it need to perform an additional greedy resize. This greatly improves the append performance, at the cost of a slight memory overhead.

For these reasons, it is more computationally efficient to use a list rather than a NumPy array when the required length is unknown or changes frequently, since the latter would necessitate frequent and costly calls to resize().
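
A small sketch of that pattern, with an arbitrary filter standing in for whatever makes the final length unknown: collect values in a list, then convert to an array once.

```python
import numpy as np

# The number of matching values is not known in advance
values = []
for i in range(1000):
    if i % 3 == 0:
        values.append(i)   # cheap append, no per-item array reallocation

# A single allocation and copy once the final length is known
ar = np.array(values, dtype=np.int64)
print(ar.shape)
```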

Tuple

Subcategory: Core

In addition to lists, Python has the concept of tuples. These are immutable static arrays: they cannot be resized, nor can their elements be changed. Their potential use-cases are greatly reduced due to these two limitations, as they are only suitable for groups of immutable properties.

Tuples will typically allocate several times faster than lists with equal contents. They are therefore an ideal replacement if immutable lists are being created thousands of times (tuples are unlikely to make a huge difference to any individual list allocation).

Python caches a large number of short (1-20 element) tuples. This greatly reduces the cost of creating and destroying them during execution compared to lists.

A tuple is constructed with ( ), in contrast to [ ] used by a list. Note that a comprehension written inside ( ) creates a lazy generator rather than a tuple, so to build a tuple with comprehension mechanics the generator must be passed to tuple(). Tuples can still be joined with the + operator, similar to appending lists, however the result is always a newly allocated tuple.
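
The behaviour described above can be sketched as follows (the variable names are illustrative):

```python
point = (1.0, 2.0)

# Elements of a tuple cannot be reassigned
try:
    point[0] = 5.0
    mutated = True
except TypeError:
    mutated = False

# Joining with + allocates a new tuple and leaves the original untouched
extended = point + (3.0,)

# tuple() around a generator expression builds a tuple
squares = tuple(i * i for i in range(5))

# Bare parentheses around a comprehension create a generator, not a tuple
lazy = (i * i for i in range(5))

print(mutated, extended, squares, type(lazy).__name__)
```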

Example Code

This can be easily demonstrated with Python’s timeit module in your console.

# List of length 6
>python -m timeit "li = [0,1,2,3,4,5]"
5000000 loops, best of 5: 69.4 nsec per loop

# Tuple of length 6
>python -m timeit "tu = (0,1,2,3,4,5)"
20000000 loops, best of 5: 17.4 nsec per loop

# List of length 16
>python -m timeit "li = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]"
5000000 loops, best of 5: 90.4 nsec per loop

# Tuple of length 16
>python -m timeit "tu = (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)"
20000000 loops, best of 5: 17.6 nsec per loop

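It takes 4-5x as long to allocate a list as a tuple of equal length. This gap grows with the length: the tuple allocation cost remains roughly static, whereas the cost of allocating the list grows slightly. Note that CPython stores a tuple literal containing only constants as a single precomputed constant, which contributes to the flat tuple timings.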

# List length 2000 via comprehension
>python -m timeit "li = [i for i in range(2000)]"
5000 loops, best of 5: 66 usec per loop

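# Generator expression over range(2000) (parentheses do not build a tuple)
>python -m timeit "tu = (i for i in range(2000))"
500000 loops, best of 5: 728 nsec per loop

Note that this larger comparison is not like-for-like. A comprehension written in parentheses creates a lazy generator object rather than a tuple, so the 728 nsec measures only the creation of the generator and none of the 2000 elements are produced. The apparent 90x difference therefore does not reflect tuple construction. To build a tuple this way, pass the generator to tuple(), e.g. tuple(i for i in range(2000)), which must still produce every element.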

The Technical Detail

Tuples are a simpler data-structure than lists. This likely means that in addition to Python pre-caching their storage, there is less internal meta-data to set up during initialisation.
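
Two of these points can be probed from Python itself. The sketch below is specific to CPython and relies on implementation details: it compares the reported object sizes, and checks that a tuple literal of constants is stored whole in the compiled code’s constants.

```python
import sys

li = [0, 1, 2, 3, 4, 5]
tu = (0, 1, 2, 3, 4, 5)

# The tuple object carries less bookkeeping than the list object
print(sys.getsizeof(li), sys.getsizeof(tu))

# Compile the statement used in the timeit example and inspect its constants
code = compile("tu = (0,1,2,3,4,5)", "<example>", "exec")
tuple_is_constant = (0, 1, 2, 3, 4, 5) in code.co_consts
print(tuple_is_constant)
```

If the tuple appears in co_consts, the timed statement only loads an existing object rather than allocating a new tuple on each loop.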

If you have time to investigate this further by checking the CPython source, please let us know!

Use Numba to precompile and optimise Python functions

Subcategory: Numba

Numba is an open-source Just In Time (JIT) compiler which converts Python functions into optimised machine code at runtime. At its simplest, it can be invoked by simply adding the @njit decorator from the numba package before your function declarations.

Unless cached, the first time a function decorated with @njit is executed it will also incur a small compilation cost. Often this is still faster than the original Python code. Subsequent calls to the same function will run the compiled function from Numba’s cache and execute even faster.

Numba can’t compile all Python code, and it will sometimes lead to an error. However, it is quick to try and has the potential for high speedups, so if profiling identifies an expensive function or block of code, try wrapping it with Numba.
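
A minimal sketch of the decorator in use. The function sum_of_squares is a hypothetical example, and the try/except fallback is only there so that the snippet still runs (uncompiled) where Numba is not installed:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback for illustration only: leave the function as plain Python
    def njit(func):
        return func

@njit
def sum_of_squares(values):
    total = 0.0
    for v in values:
        total += v * v
    return total

data = np.arange(1000, dtype=np.float64)
first = sum_of_squares(data)    # first call includes the compilation cost
second = sum_of_squares(data)   # later calls reuse the compiled function
print(first, second)
```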

Parallel

It also supports more advanced optimisations, especially when combined with NumPy, such as parallel loops if your iterations are independent.

This can be enabled by extending the decorator to @njit(parallel=True).

Additionally loops that are independent should be updated to use prange, also imported from numba, rather than range. This Numba feature is used to tell Numba which loops can be parallelised safely.

⚠️ Caution when parallelising loops

If multiple threads access the same variable or array element, and at least one thread modifies it (a race condition), the result may be incorrect.

Numba will not warn you in this case. Ensure that loop iterations are independent, or you may need to manually rewrite your function using Numba’s parallel primitives to handle the race condition safely.
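
One pattern that Numba does handle safely is a simple reduction, where a variable is only updated from its previous value with an operator such as += inside the prange loop. The sketch below uses a hypothetical function name, and the fallback definitions exist only so that it runs serially where Numba is unavailable:

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback for illustration only: run as ordinary serial Python
    prange = range
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func

@njit(parallel=True)
def parallel_sum_of_squares(x):
    total = 0.0
    for i in prange(x.shape[0]):
        # total is only ever accumulated with +=, which Numba treats as a
        # reduction and combines across threads
        total += x[i] * x[i]
    return total

x = np.arange(1000, dtype=np.float64)
print(parallel_sum_of_squares(x))
```

Writing to a shared array element from several iterations, by contrast, is not a reduction of this kind and would need restructuring.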

Extra Packages

You may find that Numba requires additional packages. For example if compiling certain NumPy functions it will require scipy to be available to Python. If it’s not available, you will receive an error, e.g.

ImportError: scipy 0.16+ is required for linear algebra

Example Code

This can be easily demonstrated with 3 versions of the same function below, which includes a basic reduction on the variable dist.

import numpy as np
from timeit import timeit

# Import the njit decorator
from numba import njit, prange

# Number of points and dimensions
N = 50000
D = 100

# Random D-dimensional points
A = np.random.rand(N, D)
B = np.random.rand(N, D)

# Pure Python version
def python_pairwise(A, B):
    N, D = A.shape
    out = np.empty(N)
    for i in range(N):
        dist = 0.0
        for j in range(D):
            diff = A[i, j] - B[i, j]
            dist += diff ** 2
        out[i] = np.sqrt(dist)
    return out

# Serial Numba
@njit
def njit_pairwise(A, B):
    N, D = A.shape
    out = np.empty(N)
    for i in range(N):
        dist = 0.0
        for j in range(D):
            diff = A[i, j] - B[i, j]
            dist += diff ** 2
        out[i] = np.sqrt(dist)
    return out

# Parallel Numba
@njit(parallel=True)
def njit_pairwise_parallel(A, B):
    N, D = A.shape
    out = np.empty(N)
    for i in prange(N):  # <-- independent iterations
        dist = 0.0
        for j in range(D):
            diff = A[i, j] - B[i, j]
            dist += diff ** 2
        out[i] = np.sqrt(dist)
    return out

repeats=10
# Benchmark the original Python version
print(f"python: {timeit(lambda: python_pairwise(A, B), number=repeats):.2f}s")

# Benchmark the serial Numba version
print(f"njit_first: {timeit(lambda: njit_pairwise(A, B), number=1):.2f}s")
print(f"njit: {timeit(lambda: njit_pairwise(A, B), number=repeats):.2f}s")

# Benchmark the parallel Numba version
print(f"njit_parallel_first: {timeit(lambda: njit_pairwise_parallel(A, B), number=1):.2f}s")
print(f"njit_parallel: {timeit(lambda: njit_pairwise_parallel(A, B), number=repeats):.2f}s")

The njit benchmarks are split into a first call, followed by the actual benchmark. This allows the cost of the first-run’s compilation to be identified.

python took 21.70s to complete the 10 repeats.
njit_first took 0.71s for a single call, almost all of which was compilation time.
njit subsequently took 0.05s for its 10 repeats, which is over 400x faster than the original Python.
njit_parallel_first had a similar compile time, taking 0.63s.
njit_parallel took 0.02s, over 1000x faster!

python: 21.70s
njit_first: 0.71s
njit: 0.05s
njit_parallel_first: 0.63s
njit_parallel: 0.02s

The Technical Detail

Python, as implemented by CPython, is an interpreted language. When you run a Python script it is first compiled to Python bytecode (for imported modules this is cached in __pycache__). The CPython interpreter, written in C, then executes the bytecode.

In contrast, compiled languages like C, which many high performance Python libraries are written in, compile to machine code which is able to run directly on the processor. This incurs less overhead, as the processor executes the code directly, rather than executing an interpreter which in turn runs the code.
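
The bytecode that the interpreter works through can be inspected with the standard library’s dis module. This is a small sketch with a hypothetical function; the exact instruction names differ between Python versions.

```python
import dis

def add_one(x):
    return x + 1

# Each entry is one bytecode instruction that the interpreter must
# decode and dispatch at runtime
opnames = [instruction.opname for instruction in dis.get_instructions(add_one)]
print(opnames)
```

Even this one-line function expands to several instructions, each of which passes through the interpreter loop.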

Numba’s njit attempts to compile decorated Python functions to machine code (using the LLVM compiler library), so that they can take advantage of this speedup too!