Performant cartesian product (CROSS JOIN) with pandas
Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:
pandas <= 1.1.X
def cartesian_product_basic(left, right):
return (
left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))
cartesian_product_basic(left, right)
pandas >= 1.2
left.merge(right, how="cross") # implements the technique above
col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge
then performs a many-to-many JOIN on "key".
While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.
A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.
def cartesian_product(*arrays):
la = len(arrays)
dtype = np.result_type(*arrays)
arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
for i, a in enumerate(np.ix_(*arrays)):
arr[...,i] = a
return arr.reshape(-1, la)
Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames
Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!
This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product
, use this to reindex the DataFrames, and
def cartesian_product_generalized(left, right):
la, lb = len(left), len(right)
idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
return pd.DataFrame(
np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))
cartesian_product_generalized(left, right)
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
np.array_equal(cartesian_product_generalized(left, right),
cartesian_product_basic(left, right))
True
And, along similar lines,
left2 = left.copy()
left2.index = ['s1', 's2', 's1']
right2 = right.copy()
right2.index = ['x', 'y', 'y']
left2
col1 col2
s1 A 1
s2 B 2
s1 C 3
right2
col1 col2
x X 20
y Y 30
y Z 50
np.array_equal(cartesian_product_generalized(left, right),
cartesian_product_basic(left2, right2))
True
This solution can generalise to multiple DataFrames. For example,
def cartesian_product_multi(*dfs):
idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
return pd.DataFrame(
np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))
cartesian_product_multi(*[left, right, left]).head()
0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 X 20 D 4
4 A 1 Y 30 A 1
Further Simplification
A simpler solution not involving @senderle's cartesian_product
is possible when dealing with just two DataFrames. Using np.broadcast_arrays
, we can achieve almost the same level of performance.
def cartesian_product_simplified(left, right):
la, lb = len(left), len(right)
ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])
return pd.DataFrame(
np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
np.array_equal(cartesian_product_simplified(left, right),
cartesian_product_basic(left2, right2))
True
Performance Comparison
Benchmarking these solutions on some contrived DataFrames with unique indices, we have
Do note that timings may vary based on your setup, data, and choice of cartesian_product
helper function as applicable.
Performance Benchmarking Code
This is the timing script. All functions called here are defined above.
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt
res = pd.DataFrame(
index=['cartesian_product_basic', 'cartesian_product_generalized',
'cartesian_product_multi', 'cartesian_product_simplified'],
columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
dtype=float
)
for f in res.index:
for c in res.columns:
# print(f,c)
left2 = pd.concat([left] * c, ignore_index=True)
right2 = pd.concat([right] * c, ignore_index=True)
stmt = '{}(left2, right2)'.format(f)
setp = 'from __main__ import left2, right2, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=5)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");
plt.show()
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames
Cross join *
* you are here
cartesian product in pandas
In recent versions of Pandas (>= 1.2) this is built into merge
so you can do:
from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})
df1.merge(df2, how='cross')
This is equivalent to the previous pandas < 1.2 answer but is easier to read.
For pandas < 1.2:
If you have a key that is repeated for each row, then you can produce a cartesian product using merge (like you would in SQL).
from pandas import DataFrame, merge
df1 = DataFrame({'key':[1,1], 'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'key':[1,1], 'col3':[5,6]})
merge(df1, df2,on='key')[['col1', 'col2', 'col3']]
Output:
col1 col2 col3
0 1 3 5
1 1 3 6
2 2 4 5
3 2 4 6
See here for the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html
Optimizing cartesian product between two Pandas Dataframe
Of all the alternatives tested, the one that gave me the best results was the following:
An iteration product was made with
itertools.product().All the iterations on both iterrows were performed on a Pool of
parallel processes (using a map function).
To give it a little more performance, the function compute_row_cython
was compiled with Cython as it is advised in this section of the Pandas documentation:
In the cython_modules.pyx
file:
from scipy.stats import pearsonr
import numpy as np
def compute_row_cython(row):
(df1_key, df1_values), (df2_key, df2_values) = row
cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]
Then I set up the setup.py
:
from distutils.core import setup
from Cython.Build import cythonize
setup(name='Compiled Pearson',
ext_modules=cythonize("cython_modules.pyx")
Finally I compiled it with: python setup.py build_ext --inplace
The final code was left, then:
import itertools
import multiprocessing
from cython_modules import compute_row_cython
NUM_CORES = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(NUM_CORES)
# Calls to Cython function defined in cython_modules.pyx
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows()))
pool.close()
end_values = list(res)
pool.join()
Neither Dask, nor the merge
function with the apply
used gave me better results. Not even optimizing the apply with Cython. In fact, this alternative with those two methods gave me memory error, when implementing the solution with Dask I had to generate several partitions, which degraded the performance as it had to perform many I/O operations.
The solution with Dask can be found in my other question.
Join two dataframes to get cartesian product
You can use a dummy column to merge on:
df1.assign(dummy=1).merge(df2.assign(dummy=1), on='dummy', how='outer').drop('dummy', axis=1)
Output:
values_x values_y
0 4 7
1 4 8
2 4 9
3 5 7
4 5 8
5 5 9
6 6 7
7 6 8
8 6 9
Pandas Equivalent of SQL CROSS JOIN (Cartesian Product)
A standard idiom is using the merge
on a dummy column.
df1.assign(foo=1).merge(df2.assign(foo=1)).drop('foo', 1)
col1 col2 col3 col4
0 0 1 5 6
1 0 1 7 8
2 2 3 5 6
3 2 3 7 8
pandas two dataframe cross join
For the cross product, see this question.
Essentially, you have to do a normal merge but give every row the same key to join on, so that every row is joined to each other across the frames.
You can then add a column to the new frame by applying your function:
new_df = pd.merge(df1, df2, on=key)
new_df.new_col = new_df.apply(lambda row: myfunc(row['A_x'], row['A_y']), axis=1)
axis=1
forces .apply
to work across the rows. 'A_x' and 'A_y' will be the default column names in the resulting frame if the merged frames share a column like in your example above.
Cross join of three dataframes
You can merge in two steps. For example for March:
tmp = pd.merge(january_df, february_df, on='ID')
final_df = pd.merge(tmp, march_df, on='ID', how='right')[['January', 'February', 'March', 'Product_no', 'Label', 'ID']].fillna(0)
print(final_df)
Prints:
January February March Product_no Label ID
0 1.0 2.0 5 T1 Towel 1005
1 0.0 0.0 1 E1 Earring 1006
2 3.0 4.0 1 S1 Shoe 1002
3 1.0 1.0 1 B3 Bag 1004
4 0.0 0.0 1 L1 Lotion 1007
5 4.0 3.0 1 B1 Ball 1000
pandas cross join no columns in common
IIUC you need merge
with temporary columns tmp
of both DataFrames
:
import pandas as pd
df1 = pd.DataFrame({'fld1': ['x', 'y'],
'fld2': ['a', 'b1']})
df2 = pd.DataFrame({'fld3': ['y', 'x', 'y'],
'fld4': ['a', 'b1', 'c2']})
print df1
fld1 fld2
0 x a
1 y b1
print df2
fld3 fld4
0 y a
1 x b1
2 y c2
df1['tmp'] = 1
df2['tmp'] = 1
df = pd.merge(df1, df2, on=['tmp'])
df = df.drop('tmp', axis=1)
print df
fld1 fld2 fld3 fld4
0 x a y a
1 x a x b1
2 x a y c2
3 y b1 y a
4 y b1 x b1
5 y b1 y c2
Cartesian product of a DataFrame and list
Check with Performant cartesian product (CROSS JOIN) with pandas
newdf=df.assign(key=1).merge(pd.DataFrame({'key':[1]*len(b),'v':b})).drop('key',1)
Pandas cross join dataframe and series
You can add common column to both for cross join with Series.to_frame
for convert Series
to one column DataFrame
:
df = s.to_frame().assign(a=1).merge(df.assign(a=1), on='a').drop('a', axis=1)
print (df)
buzz foo bar
0 3 1 A
1 3 2 B
2 3 3 C
3 4 1 A
4 4 2 B
5 4 3 C
6 5 1 A
7 5 2 B
8 5 3 C
9 6 1 A
10 6 2 B
11 6 3 C
Or:
df = df.assign(a=1).merge(s.to_frame().assign(a=1), on='a').drop('a', axis=1)
print (df)
foo bar buzz
0 1 A 3
1 1 A 4
2 1 A 5
3 1 A 6
4 2 B 3
5 2 B 4
6 2 B 5
7 2 B 6
8 3 C 3
9 3 C 4
10 3 C 5
11 3 C 6
Another idea is use numpy.tile
and numpy.repeat
:
df = (df.loc[np.tile(df.index, len(s))]
.reset_index(drop=True)
.assign(buzz = np.repeat(s.to_numpy(), len(df))))
print (df)
foo bar buzz
0 1 A 3
1 2 B 3
2 3 C 3
3 1 A 4
4 2 B 4
5 3 C 4
6 1 A 5
7 2 B 5
8 3 C 5
9 1 A 6
10 2 B 6
11 3 C 6
EDIT: If all columns are same types (not like here) is possible use @Ch3ster solution, thank you:
Using np.tile
and np.repeat
df_arr = np.tile(df.to_numpy(), (len(s), 1))
s_arr = np.repeat(s.to_numpy(), len(df))
df = pd.DataFrame(
np.column_stack([df_arr, s_arr]), columns=[*df.columns, s.name]
)
Related Topics
Can't Send Input to Running Program in Sublime Text
Should I Put #! (Shebang) in Python Scripts, and What Form Should It Take
Why Does Integer Division Yield a Float Instead of Another Integer
How to Change the Order of Dataframe Columns
How to "Log In" to a Website Using Python'S Requests Module
How to Hide Output of Subprocess
Why Does Substring Slicing With Index Out of Range Work
In Python, How to Determine If an Object Is Iterable
Bare Asterisk in Function Arguments
How to Find the Cumulative Sum of Numbers in a List
What Does the "At" (@) Symbol Do in Python
Pip Install from Git Repo Branch
Python Function Global Variables
How to Add New Keys to a Dictionary