# Creating Co-Occurrence Matrix

## Creating co-occurrence matrix

I'd use a combination of the reshape2 package and matrix algebra:

```r
# read in your data
dat <- read.table(text="TrxID Items Quant
Trx1 A 3
Trx1 B 1
Trx1 C 1
Trx2 E 3
Trx2 B 1
Trx3 B 1
Trx3 C 4
Trx4 D 1
Trx4 E 1
Trx4 A 1
Trx5 F 5
Trx5 B 3
Trx5 C 2
Trx5 D 1", header=TRUE)

# making the boolean matrix
library(reshape2)
dat2 <- melt(dat)
w <- dcast(dat2, Items~TrxID)
x <- as.matrix(w[,-1])
x[is.na(x)] <- 0
x <- apply(x, 2, function(x) as.numeric(x > 0))  # recode as 0/1
v <- x %*% t(x)                                  # the magic matrix
diag(v) <- 0                                     # replace diagonal
dimnames(v) <- list(w[, 1], w[, 1])              # name the dimensions
v
```

For the graphing, maybe something like this (using igraph):

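```r
library(igraph)

# graph_from_adjacency_matrix() was called graph.adjacency() in older igraph releases
g <- graph_from_adjacency_matrix(v, weighted=TRUE, mode='undirected')
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
plot(g)
```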
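The matrix step of the R answer (a 0/1 item-by-transaction table multiplied by its transpose, then a zeroed diagonal) can also be sketched in Python. This is only an illustration using the same transactions; the variable names and the use of `pd.crosstab` are my own choices here, not part of the original answer.

```python
import pandas as pd

# Same (transaction, item) pairs as in the R example; quantities are not needed
rows = [
    ("Trx1", "A"), ("Trx1", "B"), ("Trx1", "C"),
    ("Trx2", "E"), ("Trx2", "B"),
    ("Trx3", "B"), ("Trx3", "C"),
    ("Trx4", "D"), ("Trx4", "E"), ("Trx4", "A"),
    ("Trx5", "F"), ("Trx5", "B"), ("Trx5", "C"), ("Trx5", "D"),
]
dat = pd.DataFrame(rows, columns=["TrxID", "Items"])

# Items x transactions incidence table, recoded as 0/1
x = (pd.crosstab(dat["Items"], dat["TrxID"]) > 0).astype(int)

# Co-occurrence counts: the equivalent of x %*% t(x) in R
v = x.dot(x.T)

# Zero the diagonal (an item trivially co-occurs with itself)
for item in v.index:
    v.loc[item, item] = 0

print(v)
```

Each off-diagonal cell counts the transactions that contain both items, so `v.loc["B", "C"]` is the number of transactions with both B and C.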

## Constructing a co-occurrence matrix in python pandas

It's simple linear algebra: you multiply the matrix by its transpose (your example contains strings, so don't forget to convert them to integers):

```python
>>> df_asint = df.astype(int)
>>> coocc = df_asint.T.dot(df_asint)
>>> coocc
       Dop  Snack  Trans
Dop      4      2      3
Snack    2      3      2
Trans    3      2      4
```

If, as in the R answer, you want to reset the diagonal, you can use numpy's `fill_diagonal`:

```python
>>> import numpy as np
>>> np.fill_diagonal(coocc.values, 0)
>>> coocc
       Dop  Snack  Trans
Dop      0      2      3
Snack    2      0      2
Trans    3      2      0
```
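Writing through `coocc.values` relies on pandas handing back a writable view of the frame's data, which may not hold in every pandas version or configuration (for instance with copy-on-write enabled). A more defensive sketch is below. The input `df` is hypothetical: the question's data is not shown here, so these rows are made up to give the same counts as the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical input (made up for illustration): string "0"/"1" flags per row
df = pd.DataFrame(
    {
        "Dop":   ["1", "1", "1", "1", "0", "0"],
        "Snack": ["1", "1", "0", "0", "1", "0"],
        "Trans": ["1", "1", "1", "0", "0", "1"],
    }
)

df_asint = df.astype(int)
coocc = df_asint.T.dot(df_asint)

# Modify an explicit copy of the values, then rebuild the labelled frame
values = coocc.to_numpy(copy=True)
np.fill_diagonal(values, 0)
coocc_no_diag = pd.DataFrame(values, index=coocc.index, columns=coocc.columns)

print(coocc_no_diag)
```

This keeps the original `coocc` (with the per-column totals on the diagonal) available alongside the zero-diagonal version.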

## How to calculate a (co-)occurrence matrix from a data frame with several columns using R?

There may be better ways to do this, but try:

```r
library(tidyverse)

df1 <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  xtabs(~ID + Country, data = ., sparse = FALSE) %>%
  crossprod(., .)

df_diag <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  mutate(Country2 = Country) %>%
  xtabs(~Country + Country2, data = ., sparse = FALSE) %>%
  diag()

diag(df1) <- df_diag

df1
# Country   China England Greece USA
#   China       2       2      2   0
#   England     2       6      1   1
#   Greece      2       1      3   1
#   USA         0       1      1   1
```

## How to create a co-occurrence matrix calculated from combinations by ID/row in R?

DATA

I modified your data so that it better represents your actual situation.

```r
#   ID    CTR1    CTR2    CTR3  CTR4    CTR5    CTR6
#1:  1 England England England China     USA England
#2:  2 England   China   China   USA England   China
#3:  3 England   China   China   USA     USA     USA
#4:  4   China England England China     USA England
#5:  5  Sweden    <NA>    <NA>  <NA>            <NA>

df <- structure(list(ID = c(1, 2, 3, 4, 5),
                     CTR1 = c("England", "England", "England", "China", "Sweden"),
                     CTR2 = c("England", "China", "China", "England", NA),
                     CTR3 = c("England", "China", "China", "England", NA),
                     CTR4 = c("China", "USA", "USA", "China", NA),
                     CTR5 = c("USA", "England", "USA", "USA", ""),
                     CTR6 = c("England", "China", "USA", "England", NA)),
                class = c("data.table", "data.frame"),
                row.names = c(NA, -5L))
```

UPDATE

After seeing the OP's previous question, I got a clear picture in my mind. I think this is what you want, Seb.

```r
library(data.table)

# Transform the data to long format. Remove rows that hold an empty string (i.e., "") or NA.
melt(setDT(df), id.vars = "ID", measure = patterns("^CTR"))[nchar(value) > 0 & complete.cases(value)] -> foo

# Get distinct values (countries) in each ID group (each row)
unique(foo, by = c("ID", "value")) -> foo2

# https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix
# Seeing this question, you want to create a matrix with crossprod().
crossprod(table(foo2[, c(1,3)])) -> mymat

# Finally, you need to change diagonal values. If a value is less than or equal to one,
# change it to zero. Otherwise, keep the original value.
# (The fallback must be diag(mymat), not mymat, so that only diagonal entries are used.)
diag(mymat) <- ifelse(diag(mymat) <= 1, 0, diag(mymat))

#       value
#value     China England Sweden USA
#China       4       4      0   4
#England     4       4      0   4
#Sweden      0       0      0   0
#USA         4       4      0   4
```
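For comparison, here is a rough pandas sketch of the same pipeline (melt to long format, drop blanks and NA, keep one row per ID and country, cross-multiply the incidence table, then adjust the diagonal). It is not part of the original answer; the names `long_df`, `tab` and `result` are placeholders chosen for this example.

```python
import numpy as np
import pandas as pd

# Same data as the R example above
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "CTR1": ["England", "England", "England", "China", "Sweden"],
    "CTR2": ["England", "China", "China", "England", None],
    "CTR3": ["England", "China", "China", "England", None],
    "CTR4": ["China", "USA", "USA", "China", None],
    "CTR5": ["USA", "England", "USA", "USA", ""],
    "CTR6": ["England", "China", "USA", "England", None],
})

# Long format; drop missing values and empty strings
long_df = df.melt(id_vars="ID", value_name="value").dropna(subset=["value"])
long_df = long_df[long_df["value"].str.len() > 0]

# Keep one row per (ID, country) pair
long_df = long_df.drop_duplicates(subset=["ID", "value"])

# ID x country incidence table and its cross-product (crossprod() in R)
tab = pd.crosstab(long_df["ID"], long_df["value"])
mymat = tab.T.dot(tab)

# Diagonal entries <= 1 become 0; larger ones are kept
vals = mymat.to_numpy(copy=True)
d = np.diag(vals)
np.fill_diagonal(vals, np.where(d <= 1, 0, d))
result = pd.DataFrame(vals, index=mymat.index, columns=mymat.columns)

print(result)
```

Sweden appears in only one row and never alongside another country, so its whole row and column end up as zeros, as in the R output.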

## Creating a co-occurrence matrix

You can do this in a straightforward way using `OneHotEncoder()` and a dot product:

1. Turn each element in the dataframe into a string.
2. Use a one-hot encoder to convert the dataframe into one-hot vectors over the unique vocabulary of the categorical elements.
3. Take the dot product of the transposed one-hot matrix with the one-hot matrix to get the co-occurrence counts.
4. Recreate a dataframe from the co-occurrence matrix and the feature names reported by the one-hot encoder.
```
#assuming this is your dataset
                 0               1                2             3
0  (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  (2.0, 3.997]
1  (-1.774, 1.145]  (-3.21, 0.533]   (2.007, 3.993]  (2.0, 3.997]
```
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = df.astype(str)  # turn each element into a string

# get the one-hot representation of the dataframe (a SciPy sparse matrix)
l = OneHotEncoder()
data = l.fit_transform(df.values)

# get the co-occurrence matrix using a dot product
# (the sparse .dot() method is used because np.dot() is not reliable on sparse inputs)
co_occurrence = data.T.dot(data)

# get vocab (columns and index) for the co-occurrence matrix
# get_feature_names_out() (get_feature_names() in older scikit-learn) adds an
# "x0_"-style prefix, which is removed here for better readability
vocab = [i.split("_", 1)[1] for i in l.get_feature_names_out()]

# create the co-occurrence dataframe
ddf = pd.DataFrame(co_occurrence.toarray(), columns=vocab, index=vocab)
print(ddf)
```
```
                 (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  \
(-1.774, 1.145]              2.0             2.0              1.0   
(-3.21, 0.533]               2.0             2.0              1.0   
(0.0166, 2.007]              1.0             1.0              1.0   
(2.007, 3.993]               1.0             1.0              0.0   
(2.0, 3.997]                 2.0             2.0              1.0   

                 (2.007, 3.993]  (2.0, 3.997]  
(-1.774, 1.145]             1.0           2.0  
(-3.21, 0.533]              1.0           2.0  
(0.0166, 2.007]             0.0           1.0  
(2.007, 3.993]              1.0           1.0  
(2.0, 3.997]                1.0           2.0  
```

As you can verify from the output above, it's exactly what the co-occurrence matrix should be.

The advantages of this approach are that you can scale it using the `transform` method of the one-hot encoder object, and that most of the processing happens in sparse matrices until the final step of creating the dataframe, so it's memory efficient.