Efficient Alternatives to Merge for Larger Data.Frames R

Here's the obligatory data.table example:

library(data.table)

## Fix up your example data.frame so that the columns aren't all factors
## (not necessary, but shows that data.table can now use numeric columns as keys)
cols <- c(1:5, 7:10)
test[cols] <- lapply(cols, FUN=function(X) as.numeric(as.character(test[[X]])))
test[11] <- as.logical(test[[11]])

## Create two data.tables with which to demonstrate a data.table merge
dt <- data.table(test, key=names(test))
dt2 <- copy(dt)
## Add to each one a unique non-keyed column
dt$X <- seq_len(nrow(dt))
dt2$Y <- rev(seq_len(nrow(dt)))

## Merge them based on the keyed columns (in both cases, all but the last) to ...
## (1) create a new data.table
dt3 <- dt[dt2]
## (2) or (possibly minimizing memory usage), just add column Y from dt2 to dt
dt[dt2, Y := Y]
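
As a hedged aside, current data.table versions let the same in-place update be written without relying on keys, using an explicit on= clause and the i. prefix to refer to dt2's columns:

## equivalent to dt[dt2, Y := Y] above, with the join columns spelled out
dt[dt2, Y := i.Y, on = names(test)]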

What's the fastest way to merge/join data.frames in R?

The match approach works when there is a unique key in the second data frame for each key value in the first. If there are duplicates in the second data frame then the match and merge approaches are not the same. Match is, of course, faster since it is not doing as much. In particular it never looks for duplicate keys. (continued after code)

DF1 = data.frame(a = c(1, 1, 2, 2), b = 1:4)
DF2 = data.frame(b = c(1, 2, 3, 3, 4), c = letters[1:5])
merge(DF1, DF2)
  b a c
1 1 1 a
2 2 1 b
3 3 2 c
4 3 2 d
5 4 2 e
DF1$c = DF2$c[match(DF1$b, DF2$b)]
DF1$c
[1] a b c e
Levels: a b c d e

> DF1
  a b c
1 1 1 a
2 1 2 b
3 2 3 c
4 2 4 e

In the sqldf code that was posted in the question, it might appear that indexes were used on the two tables, but in fact they are placed on tables which are overwritten before the SQL select ever runs, and that, in part, accounts for why it's so slow. The idea of sqldf is that the data frames in your R session constitute the database, not the tables in sqlite. Thus, each time the code refers to an unqualified table name, sqldf looks for it in your R workspace -- not in sqlite's main database. So the select statement that was shown reads d1 and d2 from the workspace into sqlite's main database, clobbering the ones that were there with the indexes. As a result it does a join with no indexes. If you wanted to make use of the versions of d1 and d2 that were already in sqlite's main database, you would have to refer to them as main.d1 and main.d2, not as d1 and d2.

Also, if you are trying to make the query run as fast as possible, note that a simple join can't make use of indexes on both tables, so you can save the time of creating one of the indexes. The code below illustrates these points.

It's worth noting that the precise computation can make a huge difference in which package is fastest. For example, we do a merge and an aggregate below, and the rankings are nearly reversed between the two. In the first example, from fastest to slowest, we get: data.table, plyr, merge and sqldf, whereas in the second example we get: sqldf, aggregate, data.table and plyr -- nearly the reverse of the first. In the first example sqldf is 3x slower than data.table, and in the second it's 200x faster than plyr and 100x faster than data.table. Below we show the input code, the output timings for the merge and the output timings for the aggregate. It's also worth noting that sqldf is based on a database and can therefore handle objects larger than R can hold in memory (if you use the dbname argument of sqldf), while the other approaches are limited to processing in main memory. Also, we have illustrated sqldf with sqlite, but it supports the H2 and PostgreSQL databases as well.
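
As a hedged illustration of the dbname point above (the temporary file name is just an example, not part of the original benchmark): passing dbname makes sqldf use an on-disk sqlite database rather than an in-memory one, so the working data does not have to fit in RAM.

library(sqldf)
## same statements as in the benchmark below, but backed by a file-based database
out <- sqldf(c("create index ix1 on d1(x)",
               "select * from main.d1 join d2 using(x)"),
             dbname = tempfile())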

library(plyr)
library(data.table)
library(sqldf)

set.seed(123)
N <- 1e5
d1 <- data.frame(x = sample(N, N), y1 = rnorm(N))
d2 <- data.frame(x = sample(N, N), y2 = rnorm(N))

g1 <- sample(1:1000, N, replace = TRUE)
g2 <- sample(1:1000, N, replace = TRUE)
d <- data.frame(d1, g1, g2)

library(rbenchmark)

benchmark(replications = 1, order = "elapsed",
  merge = merge(d1, d2),
  plyr = join(d1, d2),
  data.table = {
    dt1 <- data.table(d1, key = "x")
    dt2 <- data.table(d2, key = "x")
    data.frame(dt1[dt2, list(x, y1, y2 = dt2$y2)])
  },
  sqldf = sqldf(c("create index ix1 on d1(x)",
                  "select * from main.d1 join d2 using(x)"))
)

set.seed(123)
N <- 1e5
g1 <- sample(1:1000, N, replace = TRUE)
g2 <- sample(1:1000, N, replace = TRUE)
d <- data.frame(x = sample(N, N), y = rnorm(N), g1, g2)

benchmark(replications = 1, order = "elapsed",
  aggregate = aggregate(d[c("x", "y")], d[c("g1", "g2")], mean),
  data.table = {
    dt <- data.table(d, key = "g1,g2")
    dt[, colMeans(cbind(x, y)), by = "g1,g2"]
  },
  plyr = ddply(d, .(g1, g2), summarise, avx = mean(x), avy = mean(y)),
  sqldf = sqldf(c("create index ix on d(g1, g2)",
                  "select g1, g2, avg(x), avg(y) from main.d group by g1, g2"))
)

The output from the benchmark call comparing the merge calculations is:

Joining by: x
        test replications elapsed relative user.self sys.self user.child sys.child
3 data.table            1    0.34 1.000000      0.31     0.01         NA        NA
2       plyr            1    0.44 1.294118      0.39     0.02         NA        NA
1      merge            1    1.17 3.441176      1.10     0.04         NA        NA
4      sqldf            1    3.34 9.823529      3.24     0.04         NA        NA

The output from the benchmark call comparing the aggregate calculations is:

        test replications elapsed  relative user.self sys.self user.child sys.child
4      sqldf            1    2.81  1.000000      2.73     0.02         NA        NA
1  aggregate            1   14.89  5.298932     14.89     0.00         NA        NA
2 data.table            1  132.46 47.138790    131.70     0.08         NA        NA
3       plyr            1  212.69 75.690391    211.57     0.56         NA        NA
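
A hedged note on the data.table timing above: building colMeans(cbind(x, y)) inside j forces a per-group copy, which likely accounts for much of the 132 seconds. A more idiomatic j for the same aggregation (a sketch using the same d as in the setup code) lets data.table optimise the grouped means:

dt <- data.table(d, key = "g1,g2")
dt[, lapply(.SD, mean), by = .(g1, g2), .SDcols = c("x", "y")]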

A faster way to combine a large number of datasets in R?

So, I've done some tinkering and come to a first solution. Batch combining the datasets with rbindlist indeed proved to be a great efficiency boost.

I believe this does have to do mostly with the pre-allocation of size that rbindlist does automatically. I was struggling a bit with finding an elegant way to pre-allocate the dataset that did not involve hand-naming all 88 columns and wildly guessing the final size of the dataset (due to the removal of duplicates in the process). So I didn't benchmark the rbindlist solution against an rbind solution with pre-allocated dataset size.

So here's a solution for batching the files and combining them via rbindlist. Combining all 858 datasets onto an initial dataset of ~236,000 tweets from an initial search clocked in at 909.68 seconds elapsed (per system.time()) and produced a dataset of ~2.5 million rows.

filenames <- list.files(pattern = "dataset_*")

full_data <- as.data.table(data_init)

for (i in seq(1, length(filenames), 13)) {     # sequence into batches (here: 13)
  files <- filenames[i:(i + 12)]               # the batch of files to load
  for (j in 1:length(files)) {
    load(files[j])
    print(paste((i + j - 1), "/", length(filenames), ":", files[j]))  # progress: current dataset out of total
  }
  full_data <- rbindlist(mget(c(ls(pattern = "dataset"), ls(pattern = "full_data"))))  # add the new datasets
  print("- batch combined -")
  rm(list = ls(pattern = "dataset"))                                  # remove the loaded data
  full_data <- unique(full_data, fromLast = TRUE, by = "status_id")   # remove duplicate tweets
}

I've divided them into batches of 13, since that divides the 858 datasets evenly. Some testing suggested that batches of ~8 datasets might be more efficient. I'm still not sure what an elegant way would be to handle a number of files that doesn't divide evenly into batches, though; one option is sketched below.
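
One hedged sketch for uneven batch sizes (batch_size and the helper environment are assumptions layered on the code above, not part of the original): split() the file names into groups of at most batch_size, so the final, shorter batch needs no special-casing.

library(data.table)

batch_size <- 13
batches <- split(filenames, ceiling(seq_along(filenames) / batch_size))

full_data <- as.data.table(data_init)
for (batch in batches) {
  env <- new.env()                                  # load each batch into its own environment
  for (f in batch) load(f, envir = env)
  loaded <- mget(ls(env, pattern = "dataset"), envir = env)
  full_data <- rbindlist(c(list(full_data), loaded))                  # add the new datasets
  full_data <- unique(full_data, fromLast = TRUE, by = "status_id")   # drop duplicate tweets
}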

Efficient Combination and Operating on Large Data Frames

It's not entirely clear to me what you intend to do with the rowSum and your 3) element, but if you want an efficient and RAM-friendly way to combine two ff vectors into all their combinations, you can use expand.ffgrid from ffbase.
The following will generate your ffdf with dimensions of 160 million rows x 2 columns in a few seconds.

require(ffbase)
x <- expand.ffgrid(myDF1$key, myDF2$key)

Efficiently merging large data.tables

Keyed assignment should save memory.

dt1[dt2, on = "id", x5 := x5]

Should we use a DB library to get this done?

That's probably a good idea. If setting up and using a database is painful for you, try the RSQLite package. It's pretty simple.
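
For illustration, a rough sketch of that route with DBI + RSQLite (the file name, table names, and the particular left join are assumptions for the sake of the example, not taken from the question):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "merge_demo.sqlite")   # on-disk database, so the data need not fit in RAM
dbWriteTable(con, "dt1", dt1)
dbWriteTable(con, "dt2", dt2)
dbExecute(con, "CREATE INDEX idx_dt2_id ON dt2(id)")       # index the lookup table's join column
res <- dbGetQuery(con, "SELECT dt1.*, dt2.x5 FROM dt1 LEFT JOIN dt2 USING (id)")
dbDisconnect(con)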


My experiment

tl;dr: 55% less memory used by keyed assignment compared to merge-and-replace, for a toy example.

I wrote two scripts that each sourced a setup script, dt-setup.R to create dt1 and dt2. The first script, dt-merge.R, updated dt1 through the "merge" method. The second, dt-keyed-assign.R, used keyed assignment. Both scripts recorded memory allocations using the Rprofmem() function.

To avoid torturing my laptop, dt1 has 500,000 rows and dt2 has 3,000 rows.

Scripts:

# dt-setup.R
library(data.table)

set.seed(9474)
id_space <- seq_len(3000)
dt1 <- data.table(
  id = sample(id_space, 500000, replace = TRUE),
  x1 = runif(500000),
  x2 = runif(500000),
  x3 = runif(500000),
  x4 = runif(500000)
)
dt2 <- data.table(
  id = id_space,
  x5 = 11 * id_space
)
setkey(dt1, id)
setkey(dt2, id)

# dt-merge.R
source("dt-setup.R")
Rprofmem(filename = "dt-merge.out")
dt1 <- dt2[dt1, on = "id"]
Rprofmem(NULL)

# dt-keyed-assign.R
source("dt-setup.R")
Rprofmem(filename = "dt-keyed-assign.out")
dt1[dt2, on = "id", x5 := x5]
Rprofmem(NULL)

With all three scripts in my working directory, I ran each of the joining scripts in a separate R process.

system2("Rscript", "dt-merge.R")
system2("Rscript", "dt-keyed-assign.R")

I think the lines in the output files generally follow the pattern "<bytes> :<call stack>". I haven't found good documentation for this. However, the numbers at the front were never below 128, which is the default threshold below which R serves vector allocations from its own pages instead of calling malloc.

Note that not all of these allocations add to the total memory used by R. R might reuse some memory it already has after a garbage collection. So it's not a good way to measure how much memory is used at any specific time. However, if we assume garbage collection behavior is independent, it does work as a comparison between scripts.

Some sample lines of the memory report:

cat(readLines("dt-merge.out", 5), sep = "\n")
# 90208 :"get" "["
# 528448 :"get" "["
# 528448 :"get" "["
# 1072 :"get" "["
# 20608 :"get" "["

There are also lines like new page:"get" "[" for page allocations.

Luckily, these are simple to parse.

parse_memory_report <- function(path) {
  report <- readLines(path)
  new_pages <- startsWith(report, "new page:")
  allocations <- as.numeric(gsub(":.*", "", report[!new_pages]))
  message(
    "Summary of ", path, ":\n",
    sum(new_pages), " new pages allocated\n",
    sum(allocations), " bytes malloced"
  )
}

parse_memory_report("dt-merge.out")
# Summary of dt-merge.out:
# 12 new pages allocated
# 32098912 bytes malloced

parse_memory_report("dt-keyed-assign.out")
# Summary of dt-keyed-assign.out:
# 13 new pages allocated
# 14284272 bytes malloced

I got exactly the same results when repeating the experiment.

So keyed assignment has one more page allocation. The default byte size for a page is 2000. I'm not sure how malloc works, and 2000 is tiny relative to all the allocations, so I'll ignore this difference. Please chastise me if this is dumb.

So, ignoring pages, keyed assignment allocated about 55% less memory than the merge (14,284,272 bytes versus 32,098,912 bytes, a ratio of roughly 0.445).

Subsetting very large data frames in R efficiently

You could try

library(dplyr)
all_df %>%
  group_by(A) %>%
  summarise_each(funs(mean, sd), B:G)

Or another option is data.table

library(data.table)
setDT(all_df)[, lapply(.SD, function(x) c(mean(x), sd(x))), by = A,
              .SDcols = LETTERS[2:6]][, var := c('mean', 'sd')][]

NOTE: The results from the first approach are in wide format, while in the second we get 'mean' and 'sd' as alternating rows.
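
If a wide result (one row per group with both statistics) is preferred from data.table as well, one hedged variant is to name the summary columns explicitly in j; this assumes the same all_df as created in the data block below.

library(data.table)
setDT(all_df)[, c(setNames(lapply(.SD, mean), paste0(LETTERS[2:6], "_mean")),
                  setNames(lapply(.SD, sd),   paste0(LETTERS[2:6], "_sd"))),
              by = A, .SDcols = LETTERS[2:6]]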

Benchmarks

all_df1 <- all_df[rep(1:nrow(all_df), 1e5), ]
system.time(all_df1 %>% group_by(A) %>% summarise_each(funs(mean, sd), B:G))
#   user  system elapsed
#  0.189   0.000   0.189

DT1 <- as.data.table(all_df1)
system.time(DT1[, lapply(.SD, function(x) c(mean(x), sd(x))),
                A, .SDcols = LETTERS[2:6]][, var := c('mean', 'sd')][])
#   user  system elapsed
#  0.232   0.002   0.235

data

set.seed(25)
m1 <- matrix(sample(1:20, 15*20, replace=TRUE), ncol=15)
set.seed(353)
all_df <- data.frame(sample(letters[1:3], 20, replace=TRUE), m1)
colnames(all_df) <- LETTERS[1:ncol(all_df)]

Efficiently merging two data frames on a non-trivial criteria

A data.table solution: a rolling join to fulfill the first inequality, followed by a vector scan to satisfy the second. The join on the first inequality will have more rows than the final result (and therefore may run into memory issues), but it will still be smaller than a straight-up merge of the two tables.

require(data.table)

genes_start <- as.data.table(genes)
## create the start bound as a separate column to join to
genes_start[, `:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)

markers <- as.data.table(markers)
setkey(markers, chromosome, position)

new <- genes_start[
  ## join genes to markers
  markers,
  ## rolling the last key column of genes_start (start_bound) forward
  ## to match the last key column of markers (position)
  roll = Inf,
  ## inner join
  nomatch = 0
  ## the rolling join leaves the positions column from markers
  ## with the column name from genes_start (start_bound);
  ## now vector scan to fulfill the other criterion
][start_bound <= end + 10]

## change names and column order to match the desired result in the question
setnames(new, "start_bound", "position")
setcolorder(new, c("chromosome", "gene", "start", "end", "marker", "position"))
#    chromosome gene start end marker position
# 1:          1    b   100 200      1      105
# 2:          1    b   100 200      9      120
# 3:          1    b   100 200      5      150
# 4:          2    a   100 200      3       96
# 5:          2    a   100 200      4      206
# 6:          3    e   321 567      6      400

One could do a double join, but as it involves re-keying the data table before the second join, I don't think that it will be faster than the vector scan solution above.

## makes a copy of the genes object and keys it by end
genes_end <- as.data.table(genes)
genes_end[, `:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)

## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
  setkey(
    genes_start[
      markers,
      roll = Inf,
      nomatch = 0
    ], chromosome, gene, start_bound),
  roll = -Inf,
  nomatch = 0
]
setnames(new_2, "end_bound", "position")
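
As a hedged alternative (assuming data.table 1.9.8 or later and the same genes/markers data), a single non-equi join can express both inequalities at once. The bound columns still have to be created first, and the x.position prefix recovers the marker's actual position, since the join columns in the result otherwise carry the bound values.

genes_dt <- as.data.table(genes)
genes_dt[, `:=`(lb = start - 10, ub = end + 10)]
markers_dt <- as.data.table(markers)

res <- markers_dt[genes_dt,
                  on = .(chromosome, position >= lb, position <= ub),
                  nomatch = 0L,
                  .(chromosome, gene, start, end, marker, position = x.position)]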

How to join (merge) data frames (inner, outer, left, right)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
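
For concreteness, a small made-up pair of data frames (not the data from the original question) on which to try the calls above:

df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), "Ohio"))

merge(df1, df2, by = "CustomerId")                # inner join: CustomerIds 2, 4, 6
merge(df1, df2, by = "CustomerId", all = TRUE)    # full outer join: all six ids, NA State where unmatched
merge(df1, df2, by = "CustomerId", all.x = TRUE)  # left outer: every row of df1
merge(df1, df2, by = "CustomerId", all.y = TRUE)  # right outer: every row of df2
merge(df1, df2, by = NULL)                        # cross join: 6 x 3 = 18 rows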

Merge two data frames by column name (merge() doesn't work)

Just use tolower in merge if the capitalization in df1 is the problem.

merge(transform(df1, Username=tolower(Username)), df2, all=TRUE)
#          Username           twitterID TwitterID                Name
# 1         achim_p          1117749912        NA          Achim Post
# 2         achim_p          1117749912        NA          Achim Post
# 3         achim_p          1117749912        NA Achim Post (Minden)
# 4    achimkessler  849567328899616768        NA       Achim Kessler
# 5   agnieszka_mdb           172269309        NA                <NA>
# 6       bdobrindt 1178640571725955073        NA                <NA>
# 7 stegemannalbert 1127961248493129728        NA                <NA>
# 8            <NA>           186552155        NA      Adis Ahmetovic
# 9            <NA>           186552155        NA        Agnes Alpers

Or:

merge(transform(df1, Username=tolower(Username)), df2[-2], all.y=TRUE) |>
  (\(x) {x[, 'Name'] <- gsub('\\s+\\(.*', '', x[, 'Name']); x})() |>
  unique()
#       Username          twitterID           Name
# 1      achim_p         1117749912     Achim Post
# 4 achimkessler 849567328899616768  Achim Kessler
# 5         <NA>          186552155 Adis Ahmetovic
# 6         <NA>          186552155   Agnes Alpers

Otherwise, if I were you, I would take a look at ?agrep for approximate string matching.
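
As a toy illustration of agrep() (the strings are invented for this sketch), it returns the indices of elements that approximately contain the pattern within the allowed edit-distance budget:

agrep("achim post", c("Achim Post (Minden)", "Agnes Alpers"),
      max.distance = 0.3, ignore.case = TRUE)
# expected: 1 -- only the first element matches approximately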

> R.version.string
[1] "R version 4.1.2 (2021-11-01)"

Data:

df1 <- structure(list(twitterID = c("849567328899616768", "1117749912", 
"186552155", "172269309", "1127961248493129728", "1178640571725955073"
), Username = c("AchimKessler", "Achim_P", NA, "agnieszka_mdb",
"StegemannAlbert", "BDobrindt")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

df2 <- structure(list(Username = c("achimkessler", "achim_p", "achim_p",
"achim_p", NA, NA), TwitterID = c(NA, NA, NA, NA, NA, NA), Name = c("Achim Kessler",
"Achim Post", "Achim Post", "Achim Post (Minden)", "Adis Ahmetovic",
"Agnes Alpers")), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6"))

