How to Calculate Co-Occurrence in a Table

How do I calculate the co-occurrence counts in this table?

Judging by your desired output, you are looking for the crossprod function:

crossprod(table(test1))
#         product
# product p1 p2 p3 p4
#      p1  4  1  1  0
#      p2  1  3  1  0
#      p3  1  1  4  1
#      p4  0  0  1  2

This is the same as crossprod(table(test1$user, test1$product)) (reflecting Dennis's comment).
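
For readers coming from Python, the same trick is available in pandas: pd.crosstab builds the contingency table and tab.T.dot(tab) is the cross product (t(X) %*% X). The test1 frame below is a made-up stand-in, since the original data isn't shown in the question:

import pandas as pd

# Hypothetical stand-in for the OP's test1 frame (user/product pairs).
test1 = pd.DataFrame({
    "user":    ["u1", "u1", "u2", "u2", "u3"],
    "product": ["p1", "p2", "p1", "p3", "p1"],
})

tab = pd.crosstab(test1["user"], test1["product"])  # user x product counts
cooc = tab.T.dot(tab)                               # product x product co-occurrence
print(cooc)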

How to calculate a (co-)occurrence matrix from a data frame with several columns using R?

There may be better ways to do this, but try:

library(tidyverse)

df1 <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  xtabs(~ ID + Country, data = ., sparse = FALSE) %>%
  crossprod(., .)

df_diag <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  mutate(Country2 = Country) %>%
  xtabs(~ Country + Country2, data = ., sparse = FALSE) %>%
  diag()

diag(df1) <- df_diag

df1

Country   China England Greece USA
  China       2       2      2   0
  England     2       6      1   1
  Greece      2       1      3   1
  USA         0       1      1   1
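
The same pipeline translates to pandas if you need it outside R; here is a sketch with an assumed df (the question's original frame isn't reproduced here). The last step mirrors the df_diag fix-up above, replacing the diagonal with plain occurrence counts:

import numpy as np
import pandas as pd

# Assumed miniature stand-in for the question's data.
df = pd.DataFrame({
    "ID":   [1, 2, 3],
    "CTR1": ["England", "England", "Greece"],
    "CTR2": ["England", "USA", "China"],
})

long = df.melt(id_vars="ID", var_name="Category", value_name="Country")
tab = pd.crosstab(long["ID"], long["Country"])   # ID x Country counts
mat = tab.T.to_numpy() @ tab.to_numpy()          # Country x Country co-occurrence

# overwrite the diagonal with total occurrence counts, as df_diag does
counts = long["Country"].value_counts().reindex(tab.columns).to_numpy()
np.fill_diagonal(mat, counts)
cooc = pd.DataFrame(mat, index=tab.columns, columns=tab.columns)
print(cooc)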

How to create a co-occurrence matrix calculated from combinations by ID/row in R?

DATA

I modified your data so that it better represents your actual situation.

#    ID    CTR1    CTR2    CTR3  CTR4    CTR5    CTR6
#1:   1 England England England China     USA England
#2:   2 England   China   China   USA England   China
#3:   3 England   China   China   USA     USA     USA
#4:   4   China England England China     USA England
#5:   5  Sweden    <NA>    <NA>  <NA>            <NA>


df <- structure(list(ID = c(1, 2, 3, 4, 5), CTR1 = c("England", "England",
"England", "China", "Sweden"), CTR2 = c("England", "China", "China",
"England", NA), CTR3 = c("England", "China", "China", "England",
NA), CTR4 = c("China", "USA", "USA", "China", NA), CTR5 = c("USA",
"England", "USA", "USA", ""), CTR6 = c("England", "China", "USA",
"England", NA)), class = c("data.table", "data.frame"), row.names = c(NA,
-5L))

UPDATE

After seeing the OP's previous question, I got a clear picture in my mind. I think this is what you want, Seb.

# Transform the data to long format. Remove rows that contain a zero-length string (i.e., "") or NA.

melt(setDT(df), id.vars = "ID", measure = patterns("^CTR"))[nchar(value) > 0 & complete.cases(value)] -> foo

# Get distinct value (country) in each ID group (each row)
unique(foo, by = c("ID", "value")) -> foo2

# https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix
# Seeing this question, you want to create a matrix with crossprod().

crossprod(table(foo2[, c(1,3)])) -> mymat

# Finally, you need to change diagonal values. If a value is equal to one,
# change it to zero. Otherwise, keep the original value.

diag(mymat) <- ifelse(diag(mymat) <= 1, 0, diag(mymat))

#        value
# value   China England Sweden USA
# China       4       4      0   4
# England     4       4      0   4
# Sweden      0       0      0   0
# USA         4       4      0   4
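
The same four steps translate almost line for line to pandas, in case you need this outside R. A sketch (the frame below mirrors the modified data above):

import numpy as np
import pandas as pd

# pandas re-creation of the df built above
df = pd.DataFrame({
    "ID":   [1, 2, 3, 4, 5],
    "CTR1": ["England", "England", "England", "China", "Sweden"],
    "CTR2": ["England", "China", "China", "England", None],
    "CTR3": ["England", "China", "China", "England", None],
    "CTR4": ["China", "USA", "USA", "China", None],
    "CTR5": ["USA", "England", "USA", "USA", ""],
    "CTR6": ["England", "China", "USA", "England", None],
})

# long format, dropping "" and NA values
foo = df.melt(id_vars="ID").dropna(subset=["value"])
foo = foo[foo["value"].str.len() > 0]

# one row per (ID, country), then incidence table and cross product
foo2 = foo.drop_duplicates(subset=["ID", "value"])
tab = pd.crosstab(foo2["ID"], foo2["value"])
mymat = tab.T.to_numpy() @ tab.to_numpy()

# countries that occur in a single row get a zero on the diagonal
d = mymat.diagonal().copy()
np.fill_diagonal(mymat, np.where(d <= 1, 0, d))
print(pd.DataFrame(mymat, index=tab.columns, columns=tab.columns))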

Pandas/SQL co-occurrence count

Only because the question is well written and it seemed like a nice puzzle, here's some magic.

Potentially you'll have to store a lot of data, so you need to compress the frame as much as possible and make several passes over the data. If the dataframe contains non-primitive objects, convert them to integers; if you use multiprocessing, the dataframe will be copied into subprocesses, so keeping its contents small helps.
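
As a minimal illustration of that compression step (the store_cats helper further down does the same thing with pandas categoricals):

import numpy as np
import pandas as pd

# factorize object labels into integer codes, then downcast to the
# smallest integer dtype that can hold them
s = pd.Series(["store_a", "store_b", "store_a"])
codes, uniques = pd.factorize(s)          # codes -> array([0, 1, 0])
dtype = np.min_scalar_type(codes.max())   # uint8 for this tiny example
compact = pd.Series(codes, dtype=dtype)   # ~8x smaller than object/int64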

The runtime depends on the length of the dataframe, but also on the number of unique stores, the number of unique products, and the size of each chunk of pairs to count. Spreading the work across many subprocesses can speed things up, but there is a constant cost to all the functions, which accumulates. For example, pandas' own methods run faster on a single ten-thousand-row dataframe than on a dozen thousand-row frames, and when you're running nested calls on sub-dataframes of unpredictable size, things get complicated. You'll probably have to experiment a bit to find a chunk size with the optimal speed/memory trade-off.

Test runtimes with smaller numbers first, including fewer shops and products. That being said, this is not a quick task: on a high-end machine it completes in about ten minutes.

import pandas as pd, numpy as np

df = pd.DataFrame({
    'store': np.random.randint(0, int(2e4), int(5e6)),
    'product': np.random.randint(0, int(5e4), int(5e6))
}).sort_values('store')

products = df['product'].unique()
N, chunksize, Ntop = len(products), int(1e4), 200
dtype = np.min_scalar_type(max(products.max(), N))
df = df.astype(dtype)

def store_cats(df):
    # encode every column as integer category codes; keep the original
    # categories so the labels can be restored later
    df = df.astype('category')
    cats = [df[x].cat.categories for x in df.columns]
    for col in df.columns:
        df[col] = df[col].cat.codes
    return df, cats

def restore_cats(summary, cats):
    # map product codes back to their labels; cats[1] holds the 'product'
    # categories (cats is ordered like df.columns: store, product)
    for col in ['product_x', 'product_y']:
        summary[col] = pd.Categorical.from_codes(summary[col], cats[1])

def subsets(n=chunksize):
    # split the unique products into frozensets of n items each
    n = int(n)
    res = [frozenset(products[i:i + n]) for i in range(0, N, n)]
    info = "In total there will be {:.1E} pairs, per pass {:.1E} will be checked, that's up to around {} MB per pass, {} passes"
    print(info.format((N ** 2), (n * N), (n * N * 3 * 8 / 1e6), len(res)))
    return res

def count(df, subset):
    # pair products sold in the same store; keep each unordered pair once
    # and only pairs whose first product falls in the current subset
    res = df.merge(df, on='store')\
            .query('(product_x < product_y) and product_x in @subset')\
            .groupby(['product_x', 'product_y'])\
            .count()\
            .astype(dtype)\
            .reset_index()
    return res

def one_pass(gr, subset):
    per_group = gr.apply(count, subset)
    total_counts = per_group.sort_values(['product_x', 'product_y'])\
                            .groupby(['product_x', 'product_y'])\
                            .agg('sum')\
                            .sort_values('store', ascending=False)[:Ntop]\
                            .copy().reset_index()
    return total_counts

def merge_passes(dfs):
    res = pd.concat(dfs, ignore_index=True)
    # mirror each pair in both orientations before taking the top Ntop
    res = pd.concat([res, res.rename(columns={'product_x': 'product_y',
                                              'product_y': 'product_x'})],
                    ignore_index=True)
    res = res.sort_values('store', ascending=False)[:Ntop]
    return res

from concurrent.futures import as_completed, ProcessPoolExecutor as Pool

gr = df.groupby('store', as_index=False)

def worker(subset):
    return one_pass(gr, subset)

def run_progress(max_workers=2, chunksize=chunksize):
    from tqdm.auto import tqdm
    with Pool(max_workers=max_workers) as p:
        futures = [p.submit(worker, subset) for subset in subsets(chunksize)]
        summaries = [x.result() for x in tqdm(as_completed(futures), total=len(futures))]
    return merge_passes(summaries)
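
The answer defines run_progress but never calls it; a hypothetical driver might look like this (the __main__ guard matters because ProcessPoolExecutor re-imports the module on spawn-based platforms such as Windows and macOS):

if __name__ == "__main__":
    # top 200 co-occurring product pairs across all stores
    top_pairs = run_progress(max_workers=4, chunksize=int(1e4))
    print(top_pairs.head())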

Compute co-occurences in pandas dataframe for column values grouped by another column values

Optimized solution

In the end, I managed to compute co-occurrences in a memory-friendly way, using scipy sparse matrices for the intermediate computations:

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

def df_compute_cooccurrences(df: pd.DataFrame, column1: str, column2: str) -> pd.DataFrame:

    # pd.factorize encodes the object as an enumerated type or categorical
    # variable, returning:
    # - `codes` (ndarray): an integer ndarray that's an indexer into `uniques`.
    # - `uniques` (ndarray, Index, or Categorical): the unique valid values
    # see more at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html

    i, rows = pd.factorize(df[column1])
    # i -> array([ 0, 0, 0, ..., 449054, 0, 1])
    # rows -> Index(['column1_label1', 'column1_label2', ...])

    j, cols = pd.factorize(df[column2])
    # j -> array([ 0, 1, 2, ..., 28544, -1, -1])
    # cols -> Float64Index([column2_label1, column2_label2, ...])

    ij, tups = pd.factorize(list(zip(i, j)))
    # ij -> array([ 0, 1, 2, ..., 2878026, 2878027, 2878028])
    # tups -> array([(0, 0), (0, 1), (0, 2), ..., (449054, 28544), (0, -1), (1, -1)])

    # Then we can finally compute the crosstabulation matrix
    crosstab = csr_matrix((np.bincount(ij), tuple(zip(*tups))))
    # Converting this directly into a DataFrame with
    # pd.DataFrame.sparse.from_spmatrix(crosstab, rows, cols)
    # gives the same result as using
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
    # but we obtained it in a memory-friendly way (allowing big-data processing)

    # To obtain the co-occurrence matrix for column1,
    # we multiply the crosstab matrix by its transpose
    coocc = crosstab.dot(crosstab.transpose())

    # Then we can finally return the co-occurrence matrix in DataFrame form
    return pd.DataFrame.sparse.from_spmatrix(coocc, rows, rows)

Here is a small example:

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

def df_compute_cooccurrences(df: pd.DataFrame, column1: str, column2: str) -> pd.DataFrame:
    i, rows = pd.factorize(df[column1])
    j, cols = pd.factorize(df[column2])
    ij, tups = pd.factorize(list(zip(i, j)))
    crosstab = csr_matrix((np.bincount(ij), tuple(zip(*tups))))
    coocc = crosstab.dot(crosstab.transpose())
    return pd.DataFrame.sparse.from_spmatrix(coocc, rows, rows)

df = pd.DataFrame(zip([1,1,1,2,2,3,4],["a","a","a","a","a","b","b"]), columns=list('xy'))
"""
df:
+-----+-----+
¦ x ¦ y ¦
+-----+-----+
| 1 | a |
| 1 | a |
| 1 | a |
| 2 | a |
| 2 | a |
| 3 | b |
| 4 | b |
+-----+-----+
"""
cooc_df = df_compute_cooccurrences(df, "x", "y")
"""
cooc_df:
+---+---+---+---+
¦ 1 | 2 | 3 | 4 |
+---+---+---+---+---+
¦ 1 ¦ 9 | 6 | 0 | 0 |
¦ 2 ¦ 6 | 4 | 0 | 0 |
¦ 3 ¦ 0 | 0 | 1 | 1 |
¦ 4 ¦ 0 | 0 | 1 | 1 |
+---+---+---+---+---+
"""
cooc_df2 = df_compute_cooccurrences(df, "y", "x")
"""
cooc_df2:
+----+----+
¦ a ¦ b ¦
+---+----+----+
¦ a ¦ 13 | 0 |
¦ b ¦ 0 | 2 |
+---+----+----+
"""

Sql Query: co-occurrence of column values

select t1.col2 as item_a
      ,t2.col2 as item_b
      ,count(*) as cnt
from mytable t1
join mytable t2
  on t1.col1 = t2.col1
where t1.col2 < t2.col2
group by t1.col2
        ,t2.col2

+--------+--------+-----+
| item_a | item_b | cnt |
+--------+--------+-----+
| item1  | item2  |   1 |
| item1  | item3  |   1 |
| item1  | item4  |   1 |
| item2  | item3  |   2 |
+--------+--------+-----+
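
For completeness, the same self-join expressed in pandas. The mytable rows below are made up, chosen to reproduce the counts shown above:

import pandas as pd

# Hypothetical input consistent with the output above:
# col1 groups rows (e.g. a basket id), col2 holds the items.
mytable = pd.DataFrame({
    "col1": ["o1", "o1", "o1", "o2", "o2", "o3", "o3"],
    "col2": ["item1", "item2", "item3", "item2", "item3", "item1", "item4"],
})

joined = mytable.merge(mytable, on="col1")            # self-join on col1
joined = joined[joined["col2_x"] < joined["col2_y"]]  # keep each pair once
pairs = (joined.groupby(["col2_x", "col2_y"])
               .size()
               .reset_index(name="cnt"))
print(pairs)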

