More Efficient Strategy for Which() or Match()

The solutions offered so far all imply creating a logical vector of length(vec) and doing a full or partial scan of it. As you note, the vector is sorted. We can exploit this by doing a binary search. I started out thinking I'd be super-clever and implement it in C for even greater speed, but had trouble debugging the indexing of the algorithm (which is the tricky part!). So I wrote it in R:

f3 <- function(x) {
    ## binary search on a sorted vector: on exit, imax is the index of
    ## the last negative value, i.e. the count of values < 0
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= 0)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax
}

For comparison with the other suggestions

f0 <- function(v) length(which(v < 0))   # find the indices, then count them
f1 <- function(v) sum(v < 0)             # sum the logical vector directly
f2 <- function(v) which.min(v < 0) - 1L  # first non-negative position, minus 1

and for fun

library(compiler)
f3.c <- cmpfun(f3)

Leading to

> vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
> identical(f0(vec), f1(vec))
[1] TRUE
> identical(f0(vec), f2(vec))
[1] TRUE
> identical(f0(vec), f3(vec))
[1] TRUE
> identical(f0(vec), f3.c(vec))
[1] TRUE
> microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec))
Unit: microseconds
      expr       min        lq     median         uq       max neval
   f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903   100
   f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293   100
   f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889   100
   f3(vec)    51.715    56.050    75.4495    78.5295   100.730   100
 f3.c(vec)    11.612    17.147    28.5570    31.3160    49.781   100

Probably there are some tricky edge cases that I've got wrong! Moving to C, I did

library(inline)
f4 <- cfunction(c(x = "numeric"), "
    int imin = 0, imax = Rf_length(x) - 1, imid;
    while (imax >= imin) {
        imid = imin + (imax - imin) / 2;
        if (REAL(x)[imid] >= 0)
            imax = imid - 1;
        else
            imin = imid + 1;
    }
    return ScalarInteger(imax + 1);
")

with

> identical(f3(vec), f4(vec))
[1] TRUE
> microbenchmark(f3(vec), f3.c(vec), f4(vec))
Unit: nanoseconds
      expr   min      lq  median      uq   max neval
   f3(vec) 52096 53192.0 54918.5 55539.0 69491   100
 f3.c(vec) 10924 12233.5 12869.0 13410.0 20038   100
   f4(vec)   553   796.0   893.5  1004.5  2908   100

findInterval came up when a similar question was asked on the R-help list. It is slow but safe, checking that vec is actually sorted and dealing with NA values. If one wants to live on the edge (arguably no worse than implementing f3 or f4), then

f5.i <- function(v)
    .Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE))

is nearly as fast as the C implementation, but likely more robust and vectorized (i.e., it can look up a vector of values in its second argument, for easy range-like calculations).
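For reference, a sketch of the safe, exported call that the .Internal shortcut bypasses (the name f5 is mine); the exported findInterval is what checks sortedness and handles NAs:

f5 <- function(v) findInterval(0 - .Machine$double.neg.eps, v)
f5(vec)   # number of strictly negative elements; slower than f5.i
# [1] 1000000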

Is there a more efficient version of match for searching repeated permutations of numbers?

I think the major bottleneck is that you are growing the vector inside the loop. Preallocate it before the loop and assign values into it by index.

list_vec <- numeric(length = n)

for (k in 1:n) {
    list_vec[k] <- match(a, data[, k])
}

Or using sapply

sapply(data, function(x) match(a, x))
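A self-contained sketch of the preallocation idea (the data frame, the search value a, and n below are made up for illustration):

set.seed(1)
data <- as.data.frame(matrix(sample(1:10, 50, replace = TRUE), nrow = 10))
a <- 7L
n <- ncol(data)

list_vec <- integer(n)    # preallocated once; match() returns integers
for (k in 1:n) list_vec[k] <- match(a, data[, k])

# the loop and the sapply() version agree
identical(list_vec, unname(sapply(data, function(x) match(a, x))))
# [1] TRUE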

Why is the R match function so slow?

The reason is that R is not actually doing a linear search; it sets up a hash table. That is effective if you are searching for several items, but not very effective if you are searching for one number only. Here is the profiling of the function:

[profiling screenshot omitted]

A "better" implementation could use a linear search, if you are searching for a single integer in an array. I guess that would be faster.

Is there an efficient strategy for doing a fuzzy join on customer data to identify a single customer ID in R?

Before getting to the process of finding duplicates it is important to get/gather good data to begin with.

You have mentioned first name, last name, email and phone number. First names are good, since they usually don't change, unlike email addresses and phone numbers. Last names can change through marriage/divorce. Therefore, it is always good to have other time-invariant variables such as "date of birth" or "place of birth".

Even with good data, there will always be a challenge matching first, last names and date of birth in a large customer database.

As you point out in your comments, a string distance matrix of 100,000 plus customers takes time and causes memory problems.

One workaround here is to sort the data and break it into pieces. Create a string distance matrix on each small piece, get some likely matches, and piece everything back together. There are different approaches to doing that; I will just show how it works in principle, and maybe you can expand on this.

I downloaded some fake data of 1,000 records. Unfortunately, it does not contain duplicates, but for showing the basic workflow it does not need real duplicates.

The approach takes the following steps:

  1. Create a name field based on last and first name.
  2. Arrange it in ascending order (A-Z).
  3. Break it down into groups of 50 customers (this is for my example data with 1,000 rows; in practice, running groups of 500 should be no problem in terms of speed and memory).
  4. Create a nested tibble to work on with purrr::map.
  5. Apply a customized stringdistmatrix function that works in the dplyr pipe and gives likely matches between names of customers as output.
  6. Unnest the single results to get a complete list of potential matches.

The idea behind breaking down the data is that you do not need a string distance matrix of all 100,000 customers. Most of the names are so different that you do not even need to calculate a string distance. Sorting the names and working on small subsets is like narrowing down the search.

Of course this is just one way to break down the data. It is incomplete, since it misses, for example, all customers with a typo in the first letter of the last name. However, you can replicate this approach for other variables such as date of birth, number of characters in a name, etc. Ideally you would do several different breakdowns and piece everything together in the end.

I downloaded some fake data via www.mockaroo.com. I tried to put it here with dput, but it was too long. So I just show you the head() of my data, and you can create your own fake data or use real customer data.

One note regarding my customized version of stringdistmatrix, which I named str_dist_mtx: when working with real data, you should adjust the group size (in the example it is rather small, n = 50) and the string distance string_dist up to which two differing names count as potential matches. I took 6 to at least get some results, but I am not working with data that has real duplicates. In a real application I would choose 1 or 2 to cover the most basic typos.

# the head() of my data
test_data <- structure(list(first_name = c("Gabriel", "Roscoe", "Will", "Francyne",
"Giorgi", "Dulcinea"), last_name = c("Jeandeau", "Chmiel", "Tuckwell",
"Vaggers", "Fairnie", "Tommis"), date_of_birth = structure(c(9161,
4150, 2557, 9437, -884, -4489), class = "Date")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))

Below is the code I used.

library(dplyr)
library(tidyr)
library(ggplot2)    # for cut_width()
library(purrr)
library(stringdist)

# customized stringdistmatrix function: for one group, list the pairs of
# names whose string distance is positive but below `string_dist`
str_dist_mtx <- function(df, x, string_dist, n) {

    temp_mtx = stringdistmatrix(df[[x]], df[[x]])

    temp_tbl = tibble(name1 = rep(df[[x]], each = n),
                      name2 = rep(df[[x]], times = n),
                      str_dist = as.vector(temp_mtx)) %>%
        filter(str_dist > 0 & str_dist < string_dist)

    # drop mirrored duplicates (name1/name2 vs name2/name1)
    temp_tbl[!duplicated(data.frame(t(apply(temp_tbl, 1, sort)))), ]

}

# dplyr pipe doing the job
test_data2 <- test_data %>%
    mutate(name = paste0(last_name, first_name)) %>%
    arrange(name) %>%
    mutate(slice_id = row_number(),
           slice_id = cut_width(slice_id, 50, center = 25)) %>%
    nest(-slice_id) %>%
    mutate(str_mtx = map(data,
                         ~ str_dist_mtx(., "name", string_dist = 6, n = 50))) %>%
    select(str_mtx) %>%
    unnest()

Assigning numeric value on the basis of positive or negative entry and sum up

Assuming that we want to change the values in all the columns except the first two, get the sign of those columns (df1[-(1:2)]) and multiply by 2. If we need to create a new column ('newCol'), then use rowSums on the changed column values.

df1[-(1:2)] <- sign(df1[-(1:2)]) * 2
df1$newCol <- rowSums(df1[-(1:2)])
head(df1, 3)
#     V1    V2 V3 V4 V5 newCol
# 1 1376 PSEN1  2  2  2      6
# 2 1377 PSMA2 -2  2  2      2
# 3 1378 PSMA3  2 -2  2      2

NOTE: I replaced the values with the new values based on the expected output. You can make a copy of the original dataset if required.

Or use Reduce

df1$newCol <- Reduce(`+`, df1[-(1:2)])

Update

If 0 belongs to the positive category,

rowSums(`dim<-`(c(-2, 2)[(df1[-(1:2)]>=0)+1L], dim(df1[-(1:2)])))
#[1] 6 2 2 -2 6 -2 6 6 6
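The trick here is "logical plus one" indexing: (x >= 0) + 1L maps each negative cell to 1 and each non-negative cell to 2, which then indexes the lookup vector c(-2, 2); `dim<-` restores the matrix shape so rowSums can work on it. A tiny sketch with made-up values:

vals <- c(-1.5, 0, 3)
c(-2, 2)[(vals >= 0) + 1L]   # 0 lands in the positive category
#[1] -2  2  2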

Or based on the original solution,

rowSums((sign(df1[-(1:2)]) + !df1[-(1:2)])*2)
#[1] 6 2 2 -2 6 -2 6 6 6

data

df1 <- structure(list(
    V1 = 1376:1384,
    V2 = c("PSEN1", "PSMA2", "PSMA3", "PSMA4", "PSMA6", "PSMB2",
           "PSMB3", "PSMB5", "PSMB7"),
    V3 = c(1.4057115, -1.6285915, 0.754753, -0.9455922, 0.8299074,
           -2.0481873, 1.5311188, 1.5259254, 0.653801),
    V4 = c(0.1254332, 1.2343333, -3, 0, 32.0065, 0, 2.4958,
           5.0000456, 5.0654087),
    V5 = c(0.1254332, 1.2343333, 0.8299074, -1.6285915, 1.5311143,
           -2.4958, 4.56e-05, 3.653801, 2.4934745)),
    .Names = c("V1", "V2", "V3", "V4", "V5"),
    class = "data.frame", row.names = c(NA, -9L))

Get indices of two values that bracket zero in R

good scenario

If you are in the classic case, where

  1. your vector is sorted in increasing order,
  2. it does not include 0,
  3. it has no tied values,

you can simply do the following:

findInterval(0, x, TRUE) + 0:1
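For instance, with a small made-up x meeting all three conditions:

x <- c(-3, -1.2, 0.5, 2, 7)
findInterval(0, x, TRUE) + 0:1
#[1] 2 3
x[findInterval(0, x, TRUE) + 0:1]   # the two values bracketing zero
#[1] -1.2  0.5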

If condition 1 does not hold, but conditions 2 and 3 still hold, you can do

sig <- order(x)
sig[findInterval(0, x[sig], TRUE) + 0:1]
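And the unsorted variant, shuffling the same made-up values:

x <- c(2, -1.2, 7, -3, 0.5)
sig <- order(x)
sig[findInterval(0, x[sig], TRUE) + 0:1]   # original positions of the bracket
#[1] 2 5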

akrun's answer is fundamentally the same.

bad scenario

Things become tricky once your vector x contains 0 or tied / repeated values, because:

  • repeated values challenge sorting-based methods, as sorting algorithms like quicksort are not stable (see What is stability in sorting algorithms and why is it important? if you don't know what a stable sort is);
  • findInterval will locate 0 exactly if 0 is present in the vector.

In this situation, you have to adapt Ronak Shah's answer, which allows you to exclude 0. But be aware that which may give you multiple indices if there are repeated values; a sketch follows.
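A minimal sketch of such a which()-based approach (my own adaptation; the helper name bracket_zero is made up): among the strictly negative values take the largest, and among the strictly positive values take the smallest.

bracket_zero <- function(x) {
    neg <- which(x < 0)            # zeros are excluded on both sides
    pos <- which(x > 0)
    c(neg[which.max(x[neg])],      # index of the closest value below 0
      pos[which.min(x[pos])])      # index of the closest value above 0
}

bracket_zero(c(2, -1.5, 0, -1.5, 0.5))
#[1] 2 5

Note that which.max and which.min return the first hit, so ties resolve to the earliest index.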

Most efficient way to repeatedly pair up a group of users for a quick game

You need a matchmaking system that prioritizes players who have been waiting the longest.

You only need one queue, and you need to keep track of users' history using a table. The table can be either temporary session data or a database table if you want permanent data across multiple sessions or in case the matchmaking server crashes. The table should have the playerID and an array of the playerIDs that player previously played against. It is best to limit the size of the array and evict the oldest entries first (FIFO), since you only want to store each player's most recent match-ups rather than their full match history; otherwise a player could run out of opponents once they have already played against everyone else online. The table should look like this:

  • playerID (integer)
  • previousPlayerIDs (array of integers)

When a match starts you can update the previousPlayerIDs for all the players in the match. You need to listen for an event when a player joins the queue; let's call it onPlayerJoin(). If the queue has more than one player, take the longest-queuing player and compare their playerID against the previousPlayerIDs of each other player until you find a pair with no match-up history.

const historyLimit = 10;

function onPlayerJoin(newPlayer) {
    playerQueue.push(newPlayer);
    if (playerQueue.length > 1) {
        for (let a = 0; a < playerQueue.length - 1; a++) {
            const player = playerQueue[a];
            for (let i = a + 1; i < playerQueue.length; i++) {
                const otherPlayer = playerQueue[i];

                // if the players have not played each other before
                if (otherPlayer.previousPlayerIDs.indexOf(player.id) === -1) {

                    // save the match-up on both sides
                    player.previousPlayerIDs.push(otherPlayer.id);
                    otherPlayer.previousPlayerIDs.push(player.id);

                    // cap the match-up history, dropping the oldest entry
                    if (player.previousPlayerIDs.length > historyLimit) {
                        player.previousPlayerIDs.shift();
                    }
                    if (otherPlayer.previousPlayerIDs.length > historyLimit) {
                        otherPlayer.previousPlayerIDs.shift();
                    }

                    // create lobby and remove both players from the queue;
                    // splice the higher index first so the lower stays valid
                    createLobby(player, otherPlayer);
                    playerQueue.splice(i, 1);
                    playerQueue.splice(a, 1);
                    return;
                }
            }
        }
    }
}

It is possible for a player to have played everyone else and to be waiting for someone to come online that they haven't played against before. You will need a recurring event to check whether the longest-waiting player has been waiting too long. If that is the case, just ignore the previousPlayerIDs matching and create a lobby pairing that player with another potentially long-waiting player.

If you want, you could add more columns to the table, such as a timestamp for when they joined the queue and their matchmaking rank (Elo). But if you just want to prioritize the longest-waiting player, you don't need these other columns.

Also, this solution might not scale up very well if you have massive numbers of concurrent users, but it should be fine with fewer than 1,000-10,000.

Extract the closest adjacent indices to an index in a vector

Here are a few approaches:

1) Using order

sort(head(order(abs(1:N - ind)), 11L))

2) Sort the absolute difference between indices and ind and then take the first 11 positions

sort((1:N)[as.integer(names(head(sort(abs(setNames(1:N, 1:N) - ind)), 11L)))])

3) Use radix sort:

sort(head(sort(abs(1:N - ind), index.return=TRUE)$ix, 11L))
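A quick worked check, with made-up values N = 20 and ind = 5 (all three approaches should agree):

N <- 20L
ind <- 5L
sort(head(order(abs(1:N - ind)), 11L))
# [1]  1  2  3  4  5  6  7  8  9 10 11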

