Creating an Adjacency Matrix in R
We can assign an id column for Case, reshape the data to wide format, and then use count to tally how many times each combination of Plaintiff and Defendant occurs.
library(dplyr)
library(tidyr)
df %>%
  mutate(Case = dense_rank(Case)) %>%
  pivot_wider(names_from = Status, values_from = Person) %>%
  count(Plaintiff, Defendant)
# Plaintiff Defendant n
# <int> <int> <int>
#1 11111 44444 1
#2 22222 99999 1
#3 33333 44444 1
#4 44444 99999 1
#5 55555 66666 1
#6 77777 22222 1
#7 99999 11111 1
#8 99999 88888 1
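To go from those counted pairs to an actual adjacency matrix, xtabs() can cross-tabulate the n column. This is a sketch assuming a hypothetical long-format df with Case, Status, and Person columns like the one implied above:

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format input: one row per person per case
df <- data.frame(
  Case   = c("A", "A", "B", "B"),
  Status = c("Plaintiff", "Defendant", "Plaintiff", "Defendant"),
  Person = c(11111, 44444, 99999, 11111)
)

edges <- df %>%
  mutate(Case = dense_rank(Case)) %>%
  pivot_wider(names_from = Status, values_from = Person) %>%
  count(Plaintiff, Defendant)

# Cross-tabulate the pair counts into a matrix: rows are plaintiffs,
# columns are defendants, cells are the number of shared cases.
xtabs(n ~ Plaintiff + Defendant, data = edges)
```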
Create adjacency matrix from a path given as a vector of nodes in base R
You can use the powerful but little-known trick of matrix-based indexing:
index_mat <- rbind(
c(1, 2),
c(2, 3),
c(3, 1)
)
mat <- matrix(FALSE, 3, 3)
mat[index_mat] <- TRUE
mat
[,1] [,2] [,3]
[1,] FALSE TRUE FALSE
[2,] FALSE FALSE TRUE
[3,] TRUE FALSE FALSE
So do this:
path_to_D <- function(path, p) {
  indices <- cbind(path[-length(path)], path[-1])
  D <- matrix(0, p, p)
  D[indices] <- 1
  D
}
D <- path_to_D(hypothesis_path, 25)
which(D == 1, arr.ind=TRUE)
row col
[1,] 6 1
[2,] 7 6
[3,] 17 7
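Since hypothesis_path itself is not shown, here is a self-contained sketch of the same function on a small made-up path, so the consecutive-pairs construction is visible end to end:

```r
path_to_D <- function(path, p) {
  # Each consecutive pair (path[i], path[i + 1]) becomes one edge
  indices <- cbind(path[-length(path)], path[-1])
  D <- matrix(0, p, p)
  D[indices] <- 1
  D
}

# Hypothetical path 1 -> 3 -> 2 -> 4 on a 4-node graph
D <- path_to_D(c(1, 3, 2, 4), 4)
which(D == 1, arr.ind = TRUE)
# the (row, col) pairs correspond to edges 1->3, 3->2, 2->4
```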
Convert data frame into adjacency matrix format in R
This sounds like a textbook case for using a pivot. There are many ways to do this in R. Here's a base R solution:
#Your example data:
original <- data.frame(Sample_A = c("Rabit", "Mouse", "Tiger", "Mouse", "Lion"),
Sample_B = c("Mouse", "Tiger", "Lion", "Rabit", "Rabit"),
Value = c(1.2, 7.89, -0.9, 1.2, 98.13))
#Pivot:
with(original, tapply(Value, list(Sample_A, Sample_B), FUN = identity))
Output:
Lion Mouse Rabit Tiger
Lion NA NA 98.13 NA
Mouse NA NA 1.20 7.89
Rabit NA 1.2 NA NA
Tiger -0.9 NA NA NA
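The result above happens to be square because the same four animals appear in both columns. If the two columns could contain different names, forcing a common set of factor levels keeps the matrix square; a sketch on the same example data:

```r
original <- data.frame(Sample_A = c("Rabit", "Mouse", "Tiger", "Mouse", "Lion"),
                       Sample_B = c("Mouse", "Tiger", "Lion", "Rabit", "Rabit"),
                       Value = c(1.2, 7.89, -0.9, 1.2, 98.13))

# Pool and sort all names so rows and columns share one level set
lev <- sort(unique(c(original$Sample_A, original$Sample_B)))

with(original,
     tapply(Value, list(factor(Sample_A, lev), factor(Sample_B, lev)),
            FUN = identity))
```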
Adjacency matrix in R
Here are two different possibilities:
Method 1
Using igraph::get.adjacency
# Convert data.frame to igraph
library(igraph)
ig <- graph_from_data_frame(df)
# Get adjacency matrix
get.adjacency(ig)
#6 x 6 sparse Matrix of class "dgCMatrix"
# Kanye West Beyonce Kid Cudi Jay-Z Nicki Minaj
#Kanye West . . . 1 .
#Beyonce . . . . 1
#Kid Cudi . . . . .
#Jay-Z . . . . .
#Nicki Minaj . . . . .
#Chance The Rapper . . . . .
# Chance The Rapper
#Kanye West .
#Beyonce .
#Kid Cudi 1
#Jay-Z .
#Nicki Minaj .
#Chance The Rapper .
Method 2
Using reshape2::dcast
library(reshape2)
df$from <- factor(df$from, levels = unique(as.character(unlist(df))))
df$to <- factor(df$to, levels = unique(as.character(unlist(df))))
dcast(df, from ~ to, fun.aggregate = length, drop = FALSE)
# from Kanye West Beyonce Kid Cudi Jay-Z Nicki Minaj
#1 Kanye West 0 0 0 1 0
#2 Beyonce 0 0 0 0 1
#3 Kid Cudi 0 0 0 0 0
#4 Jay-Z 0 0 0 0 0
#5 Nicki Minaj 0 0 0 0 0
#6 Chance The Rapper 0 0 0 0 0
# Chance The Rapper
#1 0
#2 0
#3 1
#4 0
#5 0
#6 0
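A third option, for completeness, is base R's table(), which produces the same 0/1 cross-tabulation without any extra packages. A sketch, repeating the sample data so the block stands alone:

```r
df <- read.table(text =
"from to
'Kanye West' 'Jay-Z'
'Beyonce' 'Nicki Minaj'
'Kid Cudi' 'Chance The Rapper'", header = TRUE)

# Shared level set so the cross-tabulation is square
lev <- unique(c(as.character(df$from), as.character(df$to)))
table(factor(df$from, lev), factor(df$to, lev))
```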
Sample data
# Sample data
df <- read.table(text =
"from to
'Kanye West' 'Jay-Z'
'Beyonce' 'Nicki Minaj'
'Kid Cudi' 'Chance The Rapper'", header = TRUE)
How to filter an adjacency matrix by values above x
Thanks to the speed of data.table, we can avoid the conversions to and from an adjacency matrix altogether. Given a SubsMAIN dataset like the one reproduced here
structure(list(MEMBER_ID = c(199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 301950, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781),
RATING = c(5, 5, 5, 3, 5, 5, 4, 5, 3, 4, 5, 5, 5, 3, 4, 4, 2, 5,
5, 5, 4, 5, 5, 5, 5, 4, 5, 3, 5, 4, 5, 4, 4, 3, 3, 2,
5, 3, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 4,
5, 5, 5, 3, 4, 4, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5,
5, 5, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5, 4, 5, 3, 5, 3, 5,
5, 5, 2, 3, 5, 5, 3, 5, 4, 3),
AUTHOR_ID = c(258195, 201494, 409591, 1964674948, 284187, 641414,
686042, 531975, 1892323204, 362579, 199781,
2988937092, 205270, 353623, 657993, 2418118532,
590804, 222936, 216022, 2320404356, 199862, 538993,
290046, 234885, 417532, 1705021316, 216430,
1320783748, 301950, 2012450692, 3267006340, 321415,
213839, 1967230852, 519301, 1880919940, 409850,
617204, 262004, 200165, 3267006340, 345500,
1711443844, 290046, 238184, 241451, 452301, 301950,
205491, 212098, 241578, 2367524740, 2366410628,
225252, 2988937092, 1789300612, 1965068164, 432146,
2151190404, 1772130180, 290046, 203622, 210929,
243427, 205705, 301950, 2551549828, 2250674052,
1378848644, 298157, 1873186692, 526355, 231243,
2988937092, 241578, 547653, 1301319556, 1956417412,
292382, 2571341700, 421709, 2309066628, 256232,
214201, 447962, 278848, 2533396356, 328874,
1955106692, 262822, 1568706436, 458913, 217003,
583640, 307259, 199780, 1836027780, 235786,
2366279556, 358714),
STATUS = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L),
CREATION = c("2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10"),
LAST_MODIFIED = c("2001/03/24", "2001/08/25", "2002/12/02",
"2001/03/29", "2002/03/22", "2002/04/22",
"2001/01/22", "2001/11/15", "2001/04/10",
"2001/03/24", "2001/04/03", "2001/10/11",
"2001/05/08", "2001/03/07", "2002/01/26",
"2002/03/10", "2001/03/24", "2001/03/25",
"2001/01/28", "2001/09/06", "2001/05/22",
"2001/05/03", "2001/01/18", "2001/10/26",
"2002/01/09", "2001/08/21", "2001/02/09",
"2001/03/14", "2002/03/22", "2001/03/19",
"2001/02/10", "2001/01/19", "2001/02/09",
"2001/09/28", "2001/01/19", "2001/01/31",
"2001/03/19", "2001/01/31", "2001/02/09",
"2001/03/07", "2001/08/10", "2001/09/29",
"2001/07/31", "2001/06/20", "2001/07/03",
"2001/09/12", "2001/03/30", "2002/05/07",
"2002/08/10", "2002/02/23", "2001/09/06",
"2001/03/19", "2001/10/30", "2001/01/29",
"2001/04/28", "2001/11/17", "2002/02/23",
"2001/03/15", "2001/10/28", "2001/01/31",
"2001/06/12", "2003/08/06", "2002/01/09",
"2001/08/30", "2001/12/22", "2001/08/21",
"2001/04/16", "2001/11/15", "2002/05/03",
"2001/03/15", "2001/08/29", "2001/09/12",
"2001/11/17", "2001/10/04", "2001/08/20",
"2001/08/21", "2001/11/17", "2003/08/06",
"2001/04/03", "2001/07/22", "2001/02/11",
"2001/09/12", "2001/07/03", "2001/05/11",
"2002/01/09", "2001/03/05", "2001/07/10",
"2003/06/25", "2001/02/18", "2001/03/27",
"2001/06/06", "2002/08/11", "2001/04/27",
"2001/02/18", "2001/08/22", "2002/02/23",
"2001/10/30", "2001/07/03", "2001/06/04",
"2003/04/28")),
row.names = c(NA, 100L),
class = "data.frame")
the following data.table solution
library(data.table)
# ...
# Code to generate your dataset 'SubsMAIN'.
# ...
# Set your cutoff for the minimum number of transactions.
x <- 3
# Filter 'SubsMAIN' to only those transactions for pairings that meet the cutoff.
results <- as.data.table(SubsMAIN)[
# Mark each transaction with a new ID for its pairing of 'MEMBER_ID' with
# 'AUTHOR_ID'.
, Pair_ID := .GRP,
# To make the relationship symmetric, pair by the MAX and MIN of the two
# original IDs, rather than by their column order.
by = .(pmax(MEMBER_ID, AUTHOR_ID), pmin(MEMBER_ID, AUTHOR_ID))][
# Mark each transaction with the tally of all transactions for its pair.
, Tally := .N, by = Pair_ID][
# Include only those transactions whose tallies meet the cutoff.
Tally >= x,
# Exclude the 'Tally' column, so the header is exactly like 'SubsMAIN'.
-c("Tally")]
# View results.
results
should yield results like this:
MEMBER_ID RATING AUTHOR_ID STATUS CREATION LAST_MODIFIED Pair_ID
1: 301950 5 199781 0 2001/01/10 2001/04/03 11
2: 199781 5 2988937092 1 2001/01/10 2001/10/11 12
3: 199781 5 290046 0 2001/01/10 2001/01/18 23
4: 199781 5 301950 0 2001/01/10 2002/03/22 11
5: 199781 5 290046 0 2001/01/10 2001/06/20 23
6: 199781 5 301950 0 2001/01/10 2002/05/07 11
7: 199781 5 2988937092 1 2001/01/10 2001/04/28 12
8: 199781 5 290046 0 2001/01/10 2001/06/12 23
9: 199781 5 301950 0 2001/01/10 2001/08/21 11
10: 199781 5 2988937092 0 2001/01/10 2001/10/04 12
where every transaction from SubsMAIN is preserved, so long as it belongs to a pairing (Pair_ID) of MEMBER_ID and AUTHOR_ID having at least x transactions.
Note
For reference, here are the tallies that would be in the Tally column:
MEMBER_ID RATING AUTHOR_ID STATUS CREATION LAST_MODIFIED Pair_ID # Tally
1: 301950 5 199781 0 2001/01/10 2001/04/03 11 # 4
2: 199781 5 2988937092 1 2001/01/10 2001/10/11 12 # 3
3: 199781 5 290046 0 2001/01/10 2001/01/18 23 # 3
4: 199781 5 301950 0 2001/01/10 2002/03/22 11 # 4
5: 199781 5 290046 0 2001/01/10 2001/06/20 23 # 3
6: 199781 5 301950 0 2001/01/10 2002/05/07 11 # 4
7: 199781 5 2988937092 1 2001/01/10 2001/04/28 12 # 3
8: 199781 5 290046 0 2001/01/10 2001/06/12 23 # 3
9: 199781 5 301950 0 2001/01/10 2001/08/21 11 # 4
10: 199781 5 2988937092 0 2001/01/10 2001/10/04 12 # 3
See how the 1st row, with a MEMBER_ID of 301950 and an AUTHOR_ID of 199781, is tallied together with the 4th, 6th, and 9th rows, which each have the reverse: a MEMBER_ID of 199781 and an AUTHOR_ID of 301950. That is, our Pair_ID (here 11) has preserved the symmetry requested here.
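The pmin()/pmax() pairing trick can be seen in isolation on a toy table. This is a hypothetical three-row sketch, not part of the original data:

```r
library(data.table)

# Hypothetical edges: rows 1 and 2 are the same pair in opposite order
toy <- data.table(MEMBER_ID = c(1, 2, 2),
                  AUTHOR_ID = c(2, 1, 3))

# Grouping by (max, min) of the two IDs ignores column order,
# so (1, 2) and (2, 1) receive the same Pair_ID.
toy[, Pair_ID := .GRP,
    by = .(pmax(MEMBER_ID, AUTHOR_ID), pmin(MEMBER_ID, AUTHOR_ID))]
toy
```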
Now, because the sample SubsMAIN had no pairings that tallied 5 (or more) transactions, I lowered the cutoff to x <- 3 so that at least some transactions could make the cut and there would be output to display. For your full dataset, feel free to change the cutoff to x <- 5, or whatever value you want.