Dont Grow Dataframes

When you grow a data.frame by one, for example using rbind() or cbind() to add a row or column, R will allocate a new data.frame one larger and copy across all the old data. If you’re doing this regularly, that’s a lot of redundant expensive memory movement. Either pre-allocate your data.frame to the correct size at the start, or use a data.table instead which provides faster growth (and many other performance improvements).

Prior to R 3.4, vectors suffered from the same overhead during growth. However, the newer implementation ensures this overhead of redundant copies is not incurred most of the time. It may still occur when there are multiple references to a growing vector, so beware.

Example Code

It might feel natural to grow a data.frame row-by-row as you collect data.

# Allocate a data.frame with 3 columns
df <- data.frame(x = numeric(), y = numeric(), z = numeric())
# Append 10,000 rows
for (i in 1:10000) {
  new_row <- data.frame(x = i, y = i * 2, z = i * 3)
  df <- rbind(df, new_row)
}

This took 4.42 seconds (141.91 seconds for 100,000 rows!).

But if we initialise the data-frame with the correct number of rows, and instead update the elements in-place.

# Allocate a data.frame with 3 columns and preallocate 10,000 rows.
df <- data.frame(x = numeric(10000), y = numeric(10000), z = numeric(10000))
# Append 10,000 rows
for (i in 1:10000) {
  df$x[i] <- i
  df$y[i] <- i * 2
  df$z[i] <- i * 3
}

This now executes in 0.81 seconds (26.56 seconds for 100,000 rows).

Using lapply followed by do.call to rbind the rows is slightly faster:

list_of_rows <- lapply(1:100000, function(i){
    data.frame(x = i, y = i * 2, z = i * 3)
  })
df <- do.call(rbind, list_of_rows)

This executes in ~0.75 seconds (20.78 seconds for 100,000 rows).

We can however improve this even further, by taking advantage of vectorisation.

This can be done using the data.table library:

# Import the data.table library
library(data.table)
df <- data.table(x = numeric(10000), y = numeric(10000), z = numeric(10000))
rows <- 1:10000
df[rows, `:=`(
  x = rows,
  y = rows * 2,
  z = rows * 3
)]

This now executes in 0.05 seconds (and doesn’t get slower for 100,000 elements).

If you need to convert it back to a data.frame, you can call as.data.frame() passing your data.table as a parameter.

The dplyr library is even faster in this case:

# Import the dplyr package
library(dplyr)
df <- data.frame(x = 1:10000)
dplyr::mutate(df, y = x * 2, z = x * 3)

This executes in 0.001 seconds (and doesn’t get slower for 100,000 elements).

The Technical Detail

R data-frames are column major, and many operations force the columns to be re-allocated. In particular, if you add a row each column of the data-table will be re-allocated.

data.table is almost always equal if not faster in performance to the core data.frame, so there’s little reason not to use it.