How do I calculate the co-occurrence in the table?
Looking at your desired output, you are looking for the crossprod
function:
crossprod(table(test1))
# product
# product p1 p2 p3 p4
# p1 4 1 1 0
# p2 1 3 1 0
# p3 1 1 4 1
# p4 0 0 1 2
This is the same as crossprod(table(test1$user, test1$product))
(reflecting Dennis's comment).
How to calculate a (co-)occurrence matrix from a data frame with several columns using R?
There may be better ways to do this, but try:
library(tidyverse)
df1 <- df %>%
pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
xtabs(~ID + Country, data = ., sparse = FALSE) %>%
crossprod(., .)
df_diag <- df %>%
pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
mutate(Country2 = Country) %>%
xtabs(~Country + Country2, data = ., sparse = FALSE) %>%
diag()
diag(df1) <- df_diag
df1
Country China England Greece USA
China 2 2 2 0
England 2 6 1 1
Greece 2 1 3 1
USA 0 1 1 1
How to create a co-occurrence matrix calculated from combinations by ID/row in R?
DATA
I modified your data so that data can represent your actual situation.
# ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
#1: 1 England England England China USA England
#2: 2 England China China USA England China
#3: 3 England China China USA USA USA
#4: 4 China England England China USA England
#5: 5 Sweden <NA> <NA> <NA> <NA>
df <- structure(list(ID = c(1, 2, 3, 4, 5), CTR1 = c("England", "England",
"England", "China", "Sweden"), CTR2 = c("England", "China", "China",
"England", NA), CTR3 = c("England", "China", "China", "England",
NA), CTR4 = c("China", "USA", "USA", "China", NA), CTR5 = c("USA",
"England", "USA", "USA", ""), CTR6 = c("England", "China", "USA",
"England", NA)), class = c("data.table", "data.frame"), row.names = c(NA,
-5L))
UPDATE
After seeing the OP's previous question, I got a clear picture in my mind. I think this is what you want, Seb.
# Transform the data to long-format data. Remove rows that have zero character (i.e, "") or NA.
melt(setDT(df), id.vars = "ID", measure = patterns("^CTR"))[nchar(value) > 0 & complete.cases(value)] -> foo
# Get distinct value (country) in each ID group (each row)
unique(foo, by = c("ID", "value")) -> foo2
# https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix
# Seeing this question, you want to create a matrix with crossprod().
crossprod(table(foo2[, c(1,3)])) -> mymat
# Finally, you need to change diagonal values. If a value is equal to one,
# change it to zero. Otherwise, keep the original value.
diag(mymat) <- ifelse(diag(mymat) <= 1, 0, mymat)
#value
#value China England Sweden USA
#China 4 4 0 4
#England 4 4 0 4
#Sweden 0 0 0 0
#USA 4 4 0 4
Pandas/SQL co-occurrence count
Only because the question is well written and it seemed like a nice puzzle, here's some magic.
Potentially you'll have to store a lot of data, so you need to compress the frame as much as possible and do several passes through the base. If the database contains not primitive objects, convert those into integers, if you do multiprocessing, the dataframe will be copied into subprocesses, so keeping it contents small helps.
The runtime depends on the length of the dataframe but also on the number of unique stores, unique products and the size of a chunk of pairs to count. Spreading the work to many subprocesses can speed up things but there is constant cost to all the functions which will accumulate. For example, pandas' own methods will run faster on a single ten thousand rows dataframe than on a dozen of thousand row frames. And when you're running nested calls on sub dataframes of unpredictable size things get complicated. You'll probably have to experiment a bit to find a chunksize with optimal speed\memory usage.
Test runtimes with smaller numbers first. Including less shops and products. That being said, this is not a quick task. On high end machine it completes in about ten minutes.
import pandas as pd, numpy as np
df = pd.DataFrame({
'store':np.random.randint(0,int(2e4),int(5e6)),
'product':np.random.randint(0,int(5e4),int(5e6))
}).sort_values('store')
products = df['product'].unique()
N, chunksize, Ntop = len(products), int(1e4), 200
dtype = np.min_scalar_type(max(products.max(),N))
df = df.astype(dtype)
def store_cats(df):
df = df.astype('category')
cats = [df[x].cat.categories for x in df.columns]
for col in df.columns:
df[col] = df[col].cat.codes
return df, cats
def restore_cats(summary,cats):
for col in ['product_x','product_y']:
summary[col] = pandas.Categorical.from_codes(summary[col], cats)
def subsets(n = chunksize):
n = int(n)
res = [frozenset(products[i:i+n]) for i in range(0,N,n)]
info = 'In total there will be {:.1E} pairs, per pass {:.1E} will be checked, thats up to around {} mb per pass, {} passes'
print(info.format((N**2),(n*N),(n*N*3*8/1e6),len(res)))
return res
def count(df,subset):
res = df.merge(df,on = 'store')\
.query('(product_x < product_y) and product_x in @subset')\
.groupby(['product_x','product_y'])\
.count()\
.astype(dtype)\
.reset_index()
return res
def one_pass(gr,subset):
per_group = gr.apply(count,subset)
total_counts = per_group.sort_values(['product_x','product_y'])\
.groupby(['product_x','product_y'])\
.agg('sum')\
.sort_values('store',ascending=False)[:Ntop]\
.copy().reset_index()
return total_counts
def merge_passes(dfs):
res = pd.concat(dfs,ignore_index=True)
res = res.append(res.rename(columns={'product_x':'product_y','product_y':'product_x'}),ignore_index=True)
res = res.sort_values('store',ascending=False)[:Ntop]
return res
from concurrent.futures import as_completed, ProcessPoolExecutor as Pool
gr = df.groupby('store',as_index = False)
def worker(subset):
return one_pass(gr,subset)
def run_progress(max_workers=2,chunksize=chunksize):
from tqdm.auto import tqdm
with Pool(max_workers = max_workers) as p:
futures = [p.submit(worker,subset) for subset in subsets(chunksize)]
summaries = [x.result() for x in tqdm(as_completed(futures),total=len(futures))]
return merge_passes(summaries)
Compute co-occurences in pandas dataframe for column values grouped by another column values
Optimized solution
In the end, I managed to compute cross occurrences in a memory friendly way using scipy sparse matrices for the intermediate computations:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
def df_compute_cooccurrences(df: pd.DataFrame, column1: str, column2: str) -> pd.DataFrame:
# pd.factorize encode the object as an enumerated type or categorical variable, returning:
# - `codes` (ndarray): an integer ndarray that’s an indexer into `uniques`.
# - `uniques` (ndarray, Index, or Categorical): the unique valid values
# see more at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html
i, rows = pd.factorize(df[column1])
# i -> array([ 0, 0, 0, ..., 449054, 0, 1])
# rows -> Index(['column1_label1', 'column1_label2', ...])
j, cols = pd.factorize(df[column2])
# j -> array([ 0, 1, 2, ..., 28544, -1, -1])
# cols -> Float64Index([column2_label1, column2_label2, ...])
ij, tups = pd.factorize(list(zip(i, j)))
# ij -> array([ 0, 1, 2, ..., 2878026, 2878027, 2878028])
# tups -> array([(0, 0), (0, 1), (0, 2), ..., (449054, 28544), (0, -1), (1, -1)]
# Then we can finally compute the crosstabulation matrix
crosstab = csr_matrix((np.bincount(ij), tuple(zip(*tups))))
# If we convert directly this into a Dataframe with
# pd.DataFrame.sparse.from_spmatrix(crosstab, rows, cols)
# we have the same result as using
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
# but we obtained it in a memory-friendly way (allowing big data processing)
# In order to obtain the co-occurrences matrix for column 1,
# we have to multiply the crosstab matrix for its transposed
coocc = crosstab.dot(crosstab.transpose())
# Then we can finally return the co-occurence matrix in in a DataFrame form
return pd.DataFrame.sparse.from_spmatrix(coocc, rows, rows)
Here a small example is provided:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
def df_compute_cooccurrences(df: pd.DataFrame, column1: str, column2: str) -> pd.DataFrame:
i, rows = pd.factorize(df[column1])
j, cols = pd.factorize(df[column2])
ij, tups = pd.factorize(list(zip(i, j)))
crosstab = csr_matrix((np.bincount(ij), tuple(zip(*tups))))
coocc = crosstab.dot(crosstab.transpose())
return pd.DataFrame.sparse.from_spmatrix(coocc, rows, rows)
df = pd.DataFrame(zip([1,1,1,2,2,3,4],["a","a","a","a","a","b","b"]), columns=list('xy'))
"""
df:
+-----+-----+
¦ x ¦ y ¦
+-----+-----+
| 1 | a |
| 1 | a |
| 1 | a |
| 2 | a |
| 2 | a |
| 3 | b |
| 4 | b |
+-----+-----+
"""
cooc_df = df_compute_cooccurrences(df, "x", "y")
"""
cooc_df:
+---+---+---+---+
¦ 1 | 2 | 3 | 4 |
+---+---+---+---+---+
¦ 1 ¦ 9 | 6 | 0 | 0 |
¦ 2 ¦ 6 | 4 | 0 | 0 |
¦ 3 ¦ 0 | 0 | 1 | 1 |
¦ 4 ¦ 0 | 0 | 1 | 1 |
+---+---+---+---+---+
"""
cooc_df2 = df_compute_cooccurrences(df, "y", "x")
"""
cooc_df2:
+----+----+
¦ a ¦ b ¦
+---+----+----+
¦ a ¦ 13 | 0 |
¦ b ¦ 0 | 2 |
+---+----+----+
"""
Sql Query: co-occurrence of column values
select t1.col2 as item_A
,t2.col2 as item_B
,count(*) as cnt
from mytable t1
join mytable t2
on t1.col1 = t2.col1
where t1.col2 < t2.col2
group by t1.col2
,t2.col2
+--------+--------+-----+
| item_a | item_b | cnt |
+--------+--------+-----+
| item1 | item2 | 1 |
| item1 | item3 | 1 |
| item1 | item4 | 1 |
| item2 | item3 | 2 |
+--------+--------+-----+
Related Topics
Extracting the Last N Characters from a String in R
How to See the Source Code of R .Internal or .Primitive Function
Overlay Normal Curve to Histogram in R
How to Use an Image as a Point in Ggplot
How to Replace Na With Mean by Group/Subset
How to Tell What Is in One Vector and Not Another
How to Merge 2 Vectors Alternating Indexes
Unique Rows, Considering Two Columns, in R, Without Order
Selecting Only Numeric Columns from a Data Frame
Put Stars on Ggplot Barplots and Boxplots - to Indicate the Level of Significance (P-Value)
Conditionally Change Panel Background With Facet_Grid
Ggplot, Facet, Piechart: Placing Text in the Middle of Pie Chart Slices
Table of Interactions - Case With Pets and Houses
Tools For Making Latex Tables in R
Change the Blank Cells to "Na"