In Fortran, MATLAB, and R, arrays are stored in column-major order, meaning that
entries which differ only in the left-most index are stored contiguously in
memory. This is often described as the left-most index varying quickest. This
differs from many other popular scientific programming languages, such as C,
C++, and Python, which use row-major ordering. The implication for writing
loops is that they should be nested so that the quickest-varying index belongs
to the innermost loop. That way, the loop nest traverses contiguous memory,
which the hardware can do efficiently.
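As a quick illustration of what column-major storage means in practice, the following sketch (program name and values are purely illustrative) uses `reshape`, which fills an array in array-element order, i.e. column by column:

```fortran
program layout_demo
  implicit none
  integer :: m(2,3)

  ! reshape fills in array-element (column-major) order:
  ! the source values run down each column before moving to the next
  m = reshape([1, 2, 3, 4, 5, 6], [2, 3])

  print *, m(1,1), m(2,1)   ! first column: 1 and 2, adjacent in memory
  print *, m(1,2), m(2,2)   ! second column: 3 and 4
  print *, m(1,3), m(2,3)   ! third column: 5 and 6
end program layout_demo
```

Note that `m(1,1)` and `m(2,1)` (same column, consecutive row index) sit next to each other in memory, whereas `m(1,1)` and `m(1,2)` are a whole column apart.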
Example Code
Consider the following Fortran example involving a loop over a 3D array, which
uses the recommended column-major ordering:
```fortran
integer, parameter :: n = 500
real, dimension(n, n, n) :: a, b, c
integer :: i, j, k

! Initialisation of a and b omitted

! Hand-coded c = a + b using column-major ordering
do k = 1, n
  do j = 1, n
    do i = 1, n
      c(i,j,k) = a(i,j,k) + b(i,j,k)
    end do
  end do
end do
```
Without compiler optimisations, this will likely run several times faster than
the (not recommended) equivalent with row-major loop ordering:
```fortran
integer, parameter :: n = 500
real, dimension(n, n, n) :: a, b, c
integer :: i, j, k

! Initialisation of a and b omitted

! Hand-coded c = a + b using row-major ordering
do i = 1, n
  do j = 1, n
    do k = 1, n
      c(i,j,k) = a(i,j,k) + b(i,j,k)
    end do
  end do
end do
```
With compiler optimisations enabled, however, simple loops like these, whose
results are never used, would likely be optimised away entirely by the
compiler. In practice your code would not normally contain a redundant loop, so
this should not be a concern.
A further comment on the loops above is that the same result can be achieved
with an array operation, which has a more compact notation:
```fortran
integer, parameter :: n = 500
real, dimension(n, n, n) :: a, b, c

! Initialisation of a and b omitted

! Compute c = a + b with an array operation
c(:,:,:) = a + b
```
Benchmark
Below, these examples have been extended into a full benchmark to demonstrate the impact. A checksum over the result is printed to ensure that the code being benchmarked is not optimised away by the compiler.
Benchmark Code
```fortran
program benchmark_loops
  implicit none

  integer, parameter :: n = 1000
  real, allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
  real :: t1, t2
  real :: checksum

  allocate(a(n,n,n), b(n,n,n), c(n,n,n))

  ! Initialise data
  a = 1.0
  b = 2.0
  c = 0.0

  ! Warm-up runs (optional but recommended)
  call col_major_add(a, b, c, n)
  call row_major_add(a, b, c, n)

  ! Benchmark column-major loop order
  call cpu_time(t1)
  call col_major_add(a, b, c, n)
  call cpu_time(t2)
  print *, "Column-major time (s): ", t2 - t1

  checksum = sum(c)
  print *, " checksum: ", checksum

  ! Reset result array to avoid reuse artifacts
  c = 0.0

  ! Benchmark row-major loop order
  call cpu_time(t1)
  call row_major_add(a, b, c, n)
  call cpu_time(t2)
  print *, "Row-major time (s): ", t2 - t1

  checksum = sum(c)
  print *, " checksum: ", checksum

contains

  subroutine col_major_add(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in) :: a(n,n,n), b(n,n,n)
    real, intent(out) :: c(n,n,n)
    integer :: i, j, k

    do k = 1, n
      do j = 1, n
        do i = 1, n
          c(i,j,k) = a(i,j,k) + b(i,j,k)
        end do
      end do
    end do
  end subroutine col_major_add

  subroutine row_major_add(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in) :: a(n,n,n), b(n,n,n)
    real, intent(out) :: c(n,n,n)
    integer :: i, j, k

    do i = 1, n
      do j = 1, n
        do k = 1, n
          c(i,j,k) = a(i,j,k) + b(i,j,k)
        end do
      end do
    end do
  end subroutine row_major_add

end program benchmark_loops
```
Compiled with GFortran with optimisations enabled, the column-major version ran 2.4x faster on an Ubuntu machine:
```
$ gfortran -O3 benchmark.f90
$ ./a.out
Column-major time (s): 3.98375320
 checksum: 67108864.0
Row-major time (s): 9.72134018
 checksum: 67108864.0
```
Under WSL on different hardware, a larger 7.1x speedup was seen. Results will vary between systems, but column-major ordering should remain the faster of the two.
```
$ gfortran -O3 benchmark.f90
$ ./a.out
Column-major time (s): 0.440809250
 checksum: 67108864.0
Row-major time (s): 3.11280060
 checksum: 67108864.0
```
Technical Details
The underlying reason that loop order has such a big impact is how computer memory operates. When a processor loads a variable from RAM into its caches, it does not load that variable alone (likely 4 bytes); it loads a full cache line (typically 64 bytes).
Therefore, variables stored consecutively in memory are loaded together when they fall within the same cache line. Subsequent accesses to those variables are served from cache, so the processor does not need to go back to RAM, which has orders of magnitude higher latency. Because of how Fortran lays out memory, column-major loop ordering achieves this.
In contrast, if you iterate as though memory were row-major, consecutively accessed variables are stored far apart in memory, so each individual access must load a full cache line from RAM. Due to this high turnover, by the time a second variable from any given cache line would be accessed, that line has likely already been evicted to make room for a fresher load.
If you're working with 4-byte types you could be performing 16x the RAM accesses; with 8-byte types it's still 8x!
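The arithmetic above can be checked with a short sketch. The 64-byte cache line size is an assumption typical of current x86 hardware (it is not exposed by the language), and `storage_size` is the Fortran 2008 intrinsic reporting a type's size in bits:

```fortran
program cache_line_demo
  implicit none
  integer, parameter :: line_bytes = 64   ! assumed cache line size
  integer, parameter :: n = 500
  integer :: real4_bytes, real8_bytes

  real4_bytes = storage_size(1.0) / 8     ! default real, typically 4 bytes
  real8_bytes = storage_size(1.0d0) / 8   ! double precision, typically 8 bytes

  ! How many elements one cache line holds
  print *, "4-byte reals per cache line:", line_bytes / real4_bytes   ! typically 16
  print *, "8-byte reals per cache line:", line_bytes / real8_bytes   ! typically 8

  ! Distance in memory between c(i,j,k) and c(i,j,k+1) for an n x n x n array:
  ! a full n x n plane, so a k-innermost loop jumps this far on every iteration
  print *, "inner k-loop stride (bytes):", n * n * real4_bytes
end program cache_line_demo
```

With `n = 500`, consecutive iterations of a k-innermost loop touch elements a megabyte apart, so every single access lands on a different cache line.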
Furthermore, for the compiler to perform vectorisation, it needs to apply vector instructions to contiguous runs of elements. If you are not operating on consecutive variables in memory, it can be much harder for the compiler to determine that vectorisation is appropriate.