Memory Limits in Data Table: Negative Length Vectors Are Not Allowed

Merge error: negative length vectors are not allowed

You are getting this error because the data.frame / data.table created by the join has more than 2^31 - 1 rows (2,147,483,647).

Due to the way vectors are constructed internally by R, the maximum length of any vector is 2^31 - 1 elements (see: https://stackoverflow.com/a/5234293/2341679). Since a data.frame / data.table is really a list() of vectors, this limit also applies to the number of rows.
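
As a quick check in your own session, this limit is simply R's largest 32-bit integer:

.Machine$integer.max
# [1] 2147483647
2^31 - 1
# [1] 2147483647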

As other people have commented and answered, unfortunately you won't be able to construct this data.table, and it's likely there are that many rows because of duplicate matches between your two data.tables (which may or may not be intentional on your part).
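
Before giving up, it can help to confirm where the blow-up comes from. Here is a rough sketch that predicts the join size without materialising it; the table and key names (dt_left, dt_right, GVKEY, YEAR) are taken from the example further below, so substitute your own:

library(data.table)

# Key multiplicities on each side
cnt_left  <- dt_left[,  .N, by = .(GVKEY, YEAR)]
cnt_right <- dt_right[, .N, by = .(GVKEY, YEAR)]

# Each matching key contributes (left count * right count) rows to the join;
# as.numeric() avoids integer overflow in the running total
cnt_left[cnt_right, on = .(GVKEY, YEAR),
         j = sum(as.numeric(N) * i.N, na.rm = TRUE)]
# If this exceeds 2^31 - 1 (2,147,483,647), the joined table cannot be built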

The good news is that if the duplicate matches are not errors and you still want to perform the join, there is a way around the limit: do whatever computation you intended for the resulting data.table in the same call as the join, using the data.table [] operator, e.g.:

dt_left[dt_right, on = .(GVKEY, YEAR),
        j = .(sum(firm_related_wealth), mean(fracdirafterindep)),
        by = .EACHI]

If you're not familiar with data.table syntax: calculations on columns are performed via the j argument, as shown above. When performing a join with this syntax, the computation in j is carried out on the data.table created by the join.

The key here is the by = .EACHI argument. This breaks the join (and subsequent computation in j) down into smaller components: one data.table for each row in dt_right and its matches in dt_left, avoiding the problem of creating a data.table with > 2^31 - 1 rows.
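
Here is a minimal sketch of what by = .EACHI does, on toy versions of the tables from the example above (the values are placeholders):

library(data.table)

dt_left <- data.table(GVKEY = c(1, 1, 2), YEAR = c(2000, 2000, 2001),
                      firm_related_wealth = c(10, 20, 30),
                      fracdirafterindep   = c(0.1, 0.2, 0.3))
dt_right <- data.table(GVKEY = c(1, 2), YEAR = c(2000, 2001))

# One aggregation per row of dt_right; the full join is never materialised
dt_left[dt_right, on = .(GVKEY, YEAR),
        j = .(wealth = sum(firm_related_wealth),
              frac   = mean(fracdirafterindep)),
        by = .EACHI]
#    GVKEY YEAR wealth frac
# 1:     1 2000     30 0.15
# 2:     2 2001     30 0.30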

Rcpp R vector size limit (negative length vectors are not allowed)

The problem is multiplication overflow. When you do

size * (size - 1) / 2

order of operations bites you, because

size * (size - 1)

can overflow even if the overall expression doesn't.
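
To see the numbers involved, here is the arithmetic evaluated in R, where doubles hold these values exactly:

size <- 47000

size * (size - 1)       # 2208953000 -- larger than 2^31 - 1 = 2147483647
size * (size - 1) / 2   # 1104476500 -- the intended length, which would fit

# In 32-bit C++ arithmetic the intermediate product wraps around modulo 2^32:
size * (size - 1) - 2^32        # -2086014296
(size * (size - 1) - 2^32) / 2  # -1043007148, the value printed below
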
We can see this by adding a printing statement:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector test(int size) {
  int veclen = size * (size - 1) / 2;  // the product overflows a 32-bit int
  Rcpp::Rcout << veclen << std::endl;
  IntegerVector vec(veclen);
  return vec;
}

vec <- test(47000)
# -1043007148

So we can fix it by reordering the operation so the intermediate value stays in range:

// [[Rcpp::export]]
IntegerVector test(int size) {
  int veclen = (size / 2) * (size - 1);  // divide first to avoid overflow
  Rcpp::Rcout << veclen << std::endl;
  IntegerVector vec(veclen);
  return vec;
}

which runs without issue:

vec <- test(47000)
# 1104476500
str(vec)
# int [1:1104476500] 0 0 0 0 0 0 0 0 0 0 ...

Update: The problem with odd numbers

Eli Korvigo brings up an excellent point in the comments about integer division behavior with odd numbers. To illustrate, consider calling the function with the even number 4 and the odd number 5:

even <- 4
odd <- 5

even * (even - 1) / 2
# [1] 6
odd * (odd - 1) / 2
# [1] 10

It should create vectors of length 6 and 10, respectively.
But what actually happens?

test(4)
# 6
# [1] 0 0 0 0 0 0
test(5)
# 8
# [1] 0 0 0 0 0 0 0 0

Oh no!
In integer division 5 / 2 is 2, not 2.5, so this does not quite do what we want in the odd case.
Luckily, we can easily address this with a simple conditional:

// [[Rcpp::export]]
IntegerVector test2(int size) {
  int veclen;
  if ( size % 2 == 0 ) {
    // size is even, so size / 2 is exact
    veclen = (size / 2) * (size - 1);
  } else {
    // size is odd, so size - 1 is even and (size - 1) / 2 is exact
    veclen = size * ((size - 1) / 2);
  }
  Rcpp::Rcout << veclen << std::endl;
  IntegerVector vec(veclen);
  return vec;
}

We can see this handles both the odd and even cases just fine:

test2(4)
# 6
# [1] 0 0 0 0 0 0
test2(5)
# 10
# [1] 0 0 0 0 0 0 0 0 0 0
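
As an extra sanity check, here is a quick sketch in plain R mirroring test2's branching; it agrees with choose(size, 2) for both parities:

veclen <- function(size) {
  if (size %% 2 == 0) (size / 2) * (size - 1) else size * ((size - 1) / 2)
}
all(sapply(2:10, veclen) == choose(2:10, 2))
# [1] TRUE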

R - joining more than 2^31 rows with data.table

Update

  • If you just want to query the common neighbors, I don't suggest you build up a huge look-up table. Instead, you can use the following code to get the result for your query:
library(igraph)

find_common_neighbors <- function(g, Vs) {
  # vertices at distance 1 from every vertex in Vs
  which(colSums(distances(g, Vs) == 1) == length(Vs))
}

such that

> find_common_neighbors(g, c(4, 8))
integer(0)

> find_common_neighbors(g, c(4, 5))
[1] 8
  • If you need a look-up table, an alternative is to use Neighbours as the key to search its associated nodes (a toy illustration follows below), e.g.,
res <- transform(
  data.frame(Neighbours = which(degree(g) >= 2)),
  Nodes = sapply(
    Neighbours,
    function(x) toString(neighbors(g, x))
  )
)
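
As a toy illustration of this look-up table (the real g from the question isn't shown, so this assumes a small star graph where node 8 is connected to 4, 5 and 10):

library(igraph)

# Toy stand-in for the question's graph
g_toy <- make_graph(c(8, 4, 8, 5, 8, 10), directed = FALSE)

res <- transform(
  data.frame(Neighbours = which(degree(g_toy) >= 2)),
  Nodes = sapply(Neighbours, function(x) toString(neighbors(g_toy, x)))
)
res
#   Neighbours    Nodes
# 1          8 4, 5, 10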


Previous Answer

I think you can use ego over g directly to generate res: ego(g, 1) returns, for each vertex, the vertex itself followed by its neighbours, so x[1] is the hub and x[-1] are its neighbours, e.g.,

setNames(
  data.frame(
    t(do.call(
      cbind,
      lapply(
        Filter(function(x) length(x) > 2, ego(g, 1)),
        function(x) {
          rbind(combn(x[-1], 2), x[1])
        }
      )
    ))
  ),
  c("V1", "V2", "Neighbours")
)

which gives

  V1 V2 Neighbours
1 4 5 8
2 4 10 8
3 5 10 8

Not enough memory to row bind two large datasets

If the combined dataset can fit into memory on its own, but row-binding the two tables in one step exhausts memory, you could try combining the tables on disk in a CSV via fwrite with append = TRUE.

library(data.table)

# Write B's rows first: reading one row of A and row-binding with fill = TRUE
# aligns B's columns with A's (A's columns first, any extra B columns after);
# [-1] then drops that placeholder row before writing
fwrite(
  rbindlist(
    list(
      fread("A.csv", nrows = 1L),
      fread("B.csv")
    ),
    fill = TRUE
  )[-1],
  "AB.csv"
)

# Append A's rows to the same file on disk, so A and B are never held
# in memory at the same time
fwrite(
  fread("A.csv"),
  "AB.csv",
  append = TRUE
)

# maybe restart R here to free memory before reading the combined file
AB <- fread("AB.csv", fill = TRUE)

Iterate through 2 big dataframes with different length (if, else)

We can do an update join with data.table to efficiently create the column 'Location':

library(data.table)
setDT(df1)[df2, Location := i.Location, on = .(ID)]
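
A minimal sketch with toy data (the column names ID and Location follow the answer above; the real data frames are just much larger):

library(data.table)

df1 <- data.frame(ID = c(1L, 2L, 3L, 4L))
df2 <- data.frame(ID = c(2L, 4L), Location = c("North", "South"))

# Update join: rows of df1 with a matching ID get df2's Location, others get NA
setDT(df1)[df2, Location := i.Location, on = .(ID)]
df1
#    ID Location
# 1:  1     <NA>
# 2:  2    North
# 3:  3     <NA>
# 4:  4    South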

