Creating an Adjacency Matrix in R
We can assign an id column for Case, reshape the data to wide format, and then use count to tally how many times each combination of Plaintiff and Defendant occurs.
library(dplyr)
library(tidyr)
df %>%
  mutate(Case = dense_rank(Case)) %>%
  pivot_wider(names_from = Status, values_from = Person) %>%
  count(Plaintiff, Defendant)
# Plaintiff Defendant n
# <int> <int> <int>
#1 11111 44444 1
#2 22222 99999 1
#3 33333 44444 1
#4 44444 99999 1
#5 55555 66666 1
#6 77777 22222 1
#7 99999 11111 1
#8 99999 88888 1
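To go from those counted pairs to an actual adjacency matrix, xtabs() can cross-tabulate the n column. This is a sketch assuming a hypothetical long-format df with Case, Status, and Person columns like the one implied above:

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format input: one row per person per case
df <- data.frame(
  Case   = c("A", "A", "B", "B"),
  Status = c("Plaintiff", "Defendant", "Plaintiff", "Defendant"),
  Person = c(11111, 44444, 99999, 11111)
)

edges <- df %>%
  mutate(Case = dense_rank(Case)) %>%
  pivot_wider(names_from = Status, values_from = Person) %>%
  count(Plaintiff, Defendant)

# Cross-tabulate the pair counts into a matrix: rows are plaintiffs,
# columns are defendants, cells are the number of shared cases.
xtabs(n ~ Plaintiff + Defendant, data = edges)
```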
Create adjacency matrix from a path given as a vector of nodes in base R
You can use the powerful but little-known trick of matrix-based indexing:
index_mat <- rbind(
c(1, 2),
c(2, 3),
c(3, 1)
)
mat <- matrix(FALSE, 3, 3)
mat[index_mat] <- TRUE
mat
[,1] [,2] [,3]
[1,] FALSE TRUE FALSE
[2,] FALSE FALSE TRUE
[3,] TRUE FALSE FALSE
So do this:
path_to_D <- function(path, p) {
  indices <- cbind(path[-length(path)], path[-1])
  D <- matrix(0, p, p)
  D[indices] <- 1
  D
}
D <- path_to_D(hypothesis_path, 25)
which(D == 1, arr.ind=TRUE)
row col
[1,] 6 1
[2,] 7 6
[3,] 17 7
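Since hypothesis_path itself is not shown, here is a self-contained sketch of the same function on a small made-up path, so the consecutive-pairs construction is visible end to end:

```r
path_to_D <- function(path, p) {
  # Each consecutive pair (path[i], path[i + 1]) becomes one edge
  indices <- cbind(path[-length(path)], path[-1])
  D <- matrix(0, p, p)
  D[indices] <- 1
  D
}

# Hypothetical path 1 -> 3 -> 2 -> 4 on a 4-node graph
D <- path_to_D(c(1, 3, 2, 4), 4)
which(D == 1, arr.ind = TRUE)
# the (row, col) pairs correspond to edges 1->3, 3->2, 2->4
```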
Convert data frame into adjacency matrix format in R
This sounds like a textbook case for using a pivot. There are many ways to do this in R. Here's a base R solution:
#Your example data:
original <- data.frame(Sample_A = c("Rabit", "Mouse", "Tiger", "Mouse", "Lion"),
Sample_B = c("Mouse", "Tiger", "Lion", "Rabit", "Rabit"),
Value = c(1.2, 7.89, -0.9, 1.2, 98.13))
#Pivot:
with(original, tapply(Value, list(Sample_A, Sample_B), FUN = identity))
Output:
Lion Mouse Rabit Tiger
Lion NA NA 98.13 NA
Mouse NA NA 1.20 7.89
Rabit NA 1.2 NA NA
Tiger -0.9 NA NA NA
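The result above happens to be square because the same four animals appear in both columns. If the two columns could contain different names, forcing a common set of factor levels keeps the matrix square; a sketch on the same example data:

```r
original <- data.frame(Sample_A = c("Rabit", "Mouse", "Tiger", "Mouse", "Lion"),
                       Sample_B = c("Mouse", "Tiger", "Lion", "Rabit", "Rabit"),
                       Value = c(1.2, 7.89, -0.9, 1.2, 98.13))

# Pool and sort all names so rows and columns share one level set
lev <- sort(unique(c(original$Sample_A, original$Sample_B)))

with(original,
     tapply(Value, list(factor(Sample_A, lev), factor(Sample_B, lev)),
            FUN = identity))
```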
Adjacency matrix in R
Here are two different possibilities:
Method 1
Using igraph::get.adjacency
# Convert data.frame to igraph
library(igraph)
ig <- graph_from_data_frame(df)
# Get adjacency matrix
get.adjacency(ig)
#6 x 6 sparse Matrix of class "dgCMatrix"
# Kanye West Beyonce Kid Cudi Jay-Z Nicki Minaj
#Kanye West . . . 1 .
#Beyonce . . . . 1
#Kid Cudi . . . . .
#Jay-Z . . . . .
#Nicki Minaj . . . . .
#Chance The Rapper . . . . .
# Chance The Rapper
#Kanye West .
#Beyonce .
#Kid Cudi 1
#Jay-Z .
#Nicki Minaj .
#Chance The Rapper .
Method 2
Using reshape2::dcast
library(reshape2)
df$from <- factor(df$from, levels = unique(as.character(unlist(df))))
df$to <- factor(df$to, levels = unique(as.character(unlist(df))))
dcast(df, from ~ to, fun.aggregate = length, drop = FALSE)
# from Kanye West Beyonce Kid Cudi Jay-Z Nicki Minaj
#1 Kanye West 0 0 0 1 0
#2 Beyonce 0 0 0 0 1
#3 Kid Cudi 0 0 0 0 0
#4 Jay-Z 0 0 0 0 0
#5 Nicki Minaj 0 0 0 0 0
#6 Chance The Rapper 0 0 0 0 0
# Chance The Rapper
#1 0
#2 0
#3 1
#4 0
#5 0
#6 0
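A third option, for completeness, is base R's table(), which produces the same 0/1 cross-tabulation without any extra packages. A sketch, repeating the sample data so the block stands alone:

```r
df <- read.table(text =
"from to
'Kanye West' 'Jay-Z'
'Beyonce' 'Nicki Minaj'
'Kid Cudi' 'Chance The Rapper'", header = TRUE)

# Shared level set so the cross-tabulation is square
lev <- unique(c(as.character(df$from), as.character(df$to)))
table(factor(df$from, lev), factor(df$to, lev))
```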
Sample data
# Sample data
df <- read.table(text =
"from to
'Kanye West' 'Jay-Z'
'Beyonce' 'Nicki Minaj'
'Kid Cudi' 'Chance The Rapper'", header = TRUE)
How to filter an adjacency matrix by values above x
Thanks to the speed of data.table, we can avoid the conversions to and from an adjacency matrix altogether. Given a SubsMAIN dataset like the one reproduced here
structure(list(MEMBER_ID = c(199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 301950, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781, 199781, 199781,
199781, 199781, 199781, 199781),
RATING = c(5, 5, 5, 3, 5, 5, 4, 5, 3, 4, 5, 5, 5, 3, 4, 4, 2, 5,
5, 5, 4, 5, 5, 5, 5, 4, 5, 3, 5, 4, 5, 4, 4, 3, 3, 2,
5, 3, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 4,
5, 5, 5, 3, 4, 4, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5,
5, 5, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5, 4, 5, 3, 5, 3, 5,
5, 5, 2, 3, 5, 5, 3, 5, 4, 3),
AUTHOR_ID = c(258195, 201494, 409591, 1964674948, 284187, 641414,
686042, 531975, 1892323204, 362579, 199781,
2988937092, 205270, 353623, 657993, 2418118532,
590804, 222936, 216022, 2320404356, 199862, 538993,
290046, 234885, 417532, 1705021316, 216430,
1320783748, 301950, 2012450692, 3267006340, 321415,
213839, 1967230852, 519301, 1880919940, 409850,
617204, 262004, 200165, 3267006340, 345500,
1711443844, 290046, 238184, 241451, 452301, 301950,
205491, 212098, 241578, 2367524740, 2366410628,
225252, 2988937092, 1789300612, 1965068164, 432146,
2151190404, 1772130180, 290046, 203622, 210929,
243427, 205705, 301950, 2551549828, 2250674052,
1378848644, 298157, 1873186692, 526355, 231243,
2988937092, 241578, 547653, 1301319556, 1956417412,
292382, 2571341700, 421709, 2309066628, 256232,
214201, 447962, 278848, 2533396356, 328874,
1955106692, 262822, 1568706436, 458913, 217003,
583640, 307259, 199780, 1836027780, 235786,
2366279556, 358714),
STATUS = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L),
CREATION = c("2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10", "2001/01/10", "2001/01/10",
"2001/01/10"),
LAST_MODIFIED = c("2001/03/24", "2001/08/25", "2002/12/02",
"2001/03/29", "2002/03/22", "2002/04/22",
"2001/01/22", "2001/11/15", "2001/04/10",
"2001/03/24", "2001/04/03", "2001/10/11",
"2001/05/08", "2001/03/07", "2002/01/26",
"2002/03/10", "2001/03/24", "2001/03/25",
"2001/01/28", "2001/09/06", "2001/05/22",
"2001/05/03", "2001/01/18", "2001/10/26",
"2002/01/09", "2001/08/21", "2001/02/09",
"2001/03/14", "2002/03/22", "2001/03/19",
"2001/02/10", "2001/01/19", "2001/02/09",
"2001/09/28", "2001/01/19", "2001/01/31",
"2001/03/19", "2001/01/31", "2001/02/09",
"2001/03/07", "2001/08/10", "2001/09/29",
"2001/07/31", "2001/06/20", "2001/07/03",
"2001/09/12", "2001/03/30", "2002/05/07",
"2002/08/10", "2002/02/23", "2001/09/06",
"2001/03/19", "2001/10/30", "2001/01/29",
"2001/04/28", "2001/11/17", "2002/02/23",
"2001/03/15", "2001/10/28", "2001/01/31",
"2001/06/12", "2003/08/06", "2002/01/09",
"2001/08/30", "2001/12/22", "2001/08/21",
"2001/04/16", "2001/11/15", "2002/05/03",
"2001/03/15", "2001/08/29", "2001/09/12",
"2001/11/17", "2001/10/04", "2001/08/20",
"2001/08/21", "2001/11/17", "2003/08/06",
"2001/04/03", "2001/07/22", "2001/02/11",
"2001/09/12", "2001/07/03", "2001/05/11",
"2002/01/09", "2001/03/05", "2001/07/10",
"2003/06/25", "2001/02/18", "2001/03/27",
"2001/06/06", "2002/08/11", "2001/04/27",
"2001/02/18", "2001/08/22", "2002/02/23",
"2001/10/30", "2001/07/03", "2001/06/04",
"2003/04/28")),
row.names = c(NA, 100L),
class = "data.frame")
the following data.table solution
library(data.table)
# ...
# Code to generate your dataset 'SubsMAIN'.
# ...
# Set your cutoff for the minimum number of transactions.
x <- 3
# Filter 'SubsMAIN' to only those transactions for pairings that meet the cutoff.
results <- as.data.table(SubsMAIN)[
# Mark each transaction with a new ID for its pairing of 'MEMBER_ID' with
# 'AUTHOR_ID'.
, Pair_ID := .GRP,
# To make the relationship symmetric, pair by the MAX and MIN of the two
# original IDs, rather than by their column order.
by = .(pmax(MEMBER_ID, AUTHOR_ID), pmin(MEMBER_ID, AUTHOR_ID))][
# Mark each transaction with the tally of all transactions for its pair.
, Tally := .N, by = Pair_ID][
# Include only those transactions whose tallies meet the cutoff.
Tally >= x,
# Exclude the 'Tally' column, so the header is exactly like 'SubsMAIN'.
-c("Tally")]
# View results.
results
should yield results like this:
MEMBER_ID RATING AUTHOR_ID STATUS CREATION LAST_MODIFIED Pair_ID
1: 301950 5 199781 0 2001/01/10 2001/04/03 11
2: 199781 5 2988937092 1 2001/01/10 2001/10/11 12
3: 199781 5 290046 0 2001/01/10 2001/01/18 23
4: 199781 5 301950 0 2001/01/10 2002/03/22 11
5: 199781 5 290046 0 2001/01/10 2001/06/20 23
6: 199781 5 301950 0 2001/01/10 2002/05/07 11
7: 199781 5 2988937092 1 2001/01/10 2001/04/28 12
8: 199781 5 290046 0 2001/01/10 2001/06/12 23
9: 199781 5 301950 0 2001/01/10 2001/08/21 11
10: 199781 5 2988937092 0 2001/01/10 2001/10/04 12
where every transaction from SubsMAIN is preserved, so long as it belongs to a pairing (Pair_ID) of MEMBER_ID and AUTHOR_ID having at least x transactions.
Note
For reference, here are the tallies that would be in the Tally column:
MEMBER_ID RATING AUTHOR_ID STATUS CREATION LAST_MODIFIED Pair_ID # Tally
1: 301950 5 199781 0 2001/01/10 2001/04/03 11 # 4
2: 199781 5 2988937092 1 2001/01/10 2001/10/11 12 # 3
3: 199781 5 290046 0 2001/01/10 2001/01/18 23 # 3
4: 199781 5 301950 0 2001/01/10 2002/03/22 11 # 4
5: 199781 5 290046 0 2001/01/10 2001/06/20 23 # 3
6: 199781 5 301950 0 2001/01/10 2002/05/07 11 # 4
7: 199781 5 2988937092 1 2001/01/10 2001/04/28 12 # 3
8: 199781 5 290046 0 2001/01/10 2001/06/12 23 # 3
9: 199781 5 301950 0 2001/01/10 2001/08/21 11 # 4
10: 199781 5 2988937092 0 2001/01/10 2001/10/04 12 # 3
See how the 1st row, with a MEMBER_ID of 301950 and an AUTHOR_ID of 199781, is tallied together with the 4th, 6th, and 9th rows, which each have the reverse: a MEMBER_ID of 199781 and an AUTHOR_ID of 301950. That is, our Pair_ID (here 11) has preserved the symmetry requested here.
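The pmin()/pmax() pairing trick can be seen in isolation on a toy table. This is a hypothetical three-row sketch, not part of the original data:

```r
library(data.table)

# Hypothetical edges: rows 1 and 2 are the same pair in opposite order
toy <- data.table(MEMBER_ID = c(1, 2, 2),
                  AUTHOR_ID = c(2, 1, 3))

# Grouping by (max, min) of the two IDs ignores column order,
# so (1, 2) and (2, 1) receive the same Pair_ID.
toy[, Pair_ID := .GRP,
    by = .(pmax(MEMBER_ID, AUTHOR_ID), pmin(MEMBER_ID, AUTHOR_ID))]
toy
```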
Now, because the sample SubsMAIN had no pairings that tallied 5 (or more) transactions, I lowered the cutoff to x <- 3 so that at least some transactions could make the cut and there would be output to display. For your full dataset, feel free to change the cutoff to x <- 5, or whatever value you want.