Performant Cartesian Product (Cross Join) With Pandas

Performant cartesian product (CROSS JOIN) with pandas

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

pandas <= 1.1.X

def cartesian_product_basic(left, right):
return (
left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

cartesian_product_basic(left, right)

pandas >= 1.2

left.merge(right, how="cross") # implements the technique above
  col1_x  col2_x col1_y  col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.

def cartesian_product(*arrays):
la = len(arrays)
dtype = np.result_type(*arrays)
arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
for i, a in enumerate(np.ix_(*arrays)):
arr[...,i] = a
return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer

These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!

This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and

def cartesian_product_generalized(left, right):
la, lb = len(left), len(right)
idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
return pd.DataFrame(
np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

cartesian_product_generalized(left, right)

0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50

np.array_equal(cartesian_product_generalized(left, right),
cartesian_product_basic(left, right))
True

And, along similar lines,

left2 = left.copy()
left2.index = ['s1', 's2', 's1']

right2 = right.copy()
right2.index = ['x', 'y', 'y']


left2
col1 col2
s1 A 1
s2 B 2
s1 C 3

right2
col1 col2
x X 20
y Y 30
y Z 50

np.array_equal(cartesian_product_generalized(left, right),
cartesian_product_basic(left2, right2))
True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):
idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
return pd.DataFrame(
np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()

0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 X 20 D 4
4 A 1 Y 30 A 1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):
la, lb = len(left), len(right)
ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

return pd.DataFrame(
np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

np.array_equal(cartesian_product_simplified(left, right),
cartesian_product_basic(left2, right2))
True


Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

Sample Image

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Performance Benchmarking Code

This is the timing script. All functions called here are defined above.

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
index=['cartesian_product_basic', 'cartesian_product_generalized',
'cartesian_product_multi', 'cartesian_product_simplified'],
columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
dtype=float
)

for f in res.index:
for c in res.columns:
# print(f,c)
left2 = pd.concat([left] * c, ignore_index=True)
right2 = pd.concat([right] * c, ignore_index=True)
stmt = '{}(left2, right2)'.format(f)
setp = 'from __main__ import left2, right2, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");

plt.show()




Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

  • Merging basics - basic types of joins

  • Index-based joins

  • Generalizing to multiple DataFrames

  • Cross join *

* you are here

cartesian product in pandas

In recent versions of Pandas (>= 1.2) this is built into merge so you can do:

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})

df1.merge(df2, how='cross')

This is equivalent to the previous pandas < 1.2 answer but is easier to read.


For pandas < 1.2:

If you have a key that is repeated for each row, then you can produce a cartesian product using merge (like you would in SQL).

from pandas import DataFrame, merge
df1 = DataFrame({'key':[1,1], 'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'key':[1,1], 'col3':[5,6]})

merge(df1, df2,on='key')[['col1', 'col2', 'col3']]

Output:

   col1  col2  col3
0 1 3 5
1 1 3 6
2 2 4 5
3 2 4 6

See here for the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html

Optimizing cartesian product between two Pandas Dataframe

Of all the alternatives tested, the one that gave me the best results was the following:

  1. An iteration product was made with
    itertools.product().

  2. All the iterations on both iterrows were performed on a Pool of
    parallel processes (using a map function).

To give it a little more performance, the function compute_row_cython was compiled with Cython as it is advised in this section of the Pandas documentation:

In the cython_modules.pyx file:

from scipy.stats import pearsonr
import numpy as np

def compute_row_cython(row):
(df1_key, df1_values), (df2_key, df2_values) = row
cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]

Then I set up the setup.py:

from distutils.core import setup
from Cython.Build import cythonize

setup(name='Compiled Pearson',
ext_modules=cythonize("cython_modules.pyx")

Finally I compiled it with: python setup.py build_ext --inplace

The final code was left, then:

import itertools
import multiprocessing
from cython_modules import compute_row_cython

NUM_CORES = multiprocessing.cpu_count() - 1

pool = multiprocessing.Pool(NUM_CORES)
# Calls to Cython function defined in cython_modules.pyx
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows()))
pool.close()
end_values = list(res)
pool.join()

Neither Dask, nor the merge function with the apply used gave me better results. Not even optimizing the apply with Cython. In fact, this alternative with those two methods gave me memory error, when implementing the solution with Dask I had to generate several partitions, which degraded the performance as it had to perform many I/O operations.

The solution with Dask can be found in my other question.

Join two dataframes to get cartesian product

You can use a dummy column to merge on:

df1.assign(dummy=1).merge(df2.assign(dummy=1), on='dummy', how='outer').drop('dummy', axis=1)

Output:

   values_x  values_y
0 4 7
1 4 8
2 4 9
3 5 7
4 5 8
5 5 9
6 6 7
7 6 8
8 6 9

Pandas Equivalent of SQL CROSS JOIN (Cartesian Product)

A standard idiom is using the merge on a dummy column.

df1.assign(foo=1).merge(df2.assign(foo=1)).drop('foo', 1)

col1 col2 col3 col4
0 0 1 5 6
1 0 1 7 8
2 2 3 5 6
3 2 3 7 8

pandas two dataframe cross join

For the cross product, see this question.

Essentially, you have to do a normal merge but give every row the same key to join on, so that every row is joined to each other across the frames.

You can then add a column to the new frame by applying your function:

new_df = pd.merge(df1, df2, on=key)
new_df.new_col = new_df.apply(lambda row: myfunc(row['A_x'], row['A_y']), axis=1)

axis=1 forces .apply to work across the rows. 'A_x' and 'A_y' will be the default column names in the resulting frame if the merged frames share a column like in your example above.

Cross join of three dataframes

You can merge in two steps. For example for March:

tmp = pd.merge(january_df, february_df, on='ID')
final_df = pd.merge(tmp, march_df, on='ID', how='right')[['January', 'February', 'March', 'Product_no', 'Label', 'ID']].fillna(0)

print(final_df)

Prints:

   January  February  March Product_no    Label    ID
0 1.0 2.0 5 T1 Towel 1005
1 0.0 0.0 1 E1 Earring 1006
2 3.0 4.0 1 S1 Shoe 1002
3 1.0 1.0 1 B3 Bag 1004
4 0.0 0.0 1 L1 Lotion 1007
5 4.0 3.0 1 B1 Ball 1000

pandas cross join no columns in common

IIUC you need merge with temporary columns tmp of both DataFrames:

import pandas as pd

df1 = pd.DataFrame({'fld1': ['x', 'y'],
'fld2': ['a', 'b1']})


df2 = pd.DataFrame({'fld3': ['y', 'x', 'y'],
'fld4': ['a', 'b1', 'c2']})

print df1
fld1 fld2
0 x a
1 y b1

print df2
fld3 fld4
0 y a
1 x b1
2 y c2

df1['tmp'] = 1
df2['tmp'] = 1

df = pd.merge(df1, df2, on=['tmp'])
df = df.drop('tmp', axis=1)
print df
fld1 fld2 fld3 fld4
0 x a y a
1 x a x b1
2 x a y c2
3 y b1 y a
4 y b1 x b1
5 y b1 y c2

Cartesian product of a DataFrame and list

Check with Performant cartesian product (CROSS JOIN) with pandas

newdf=df.assign(key=1).merge(pd.DataFrame({'key':[1]*len(b),'v':b})).drop('key',1)

Pandas cross join dataframe and series

You can add common column to both for cross join with Series.to_frame for convert Series to one column DataFrame:

df = s.to_frame().assign(a=1).merge(df.assign(a=1), on='a').drop('a', axis=1)
print (df)
buzz foo bar
0 3 1 A
1 3 2 B
2 3 3 C
3 4 1 A
4 4 2 B
5 4 3 C
6 5 1 A
7 5 2 B
8 5 3 C
9 6 1 A
10 6 2 B
11 6 3 C

Or:

df = df.assign(a=1).merge(s.to_frame().assign(a=1), on='a').drop('a', axis=1)
print (df)
foo bar buzz
0 1 A 3
1 1 A 4
2 1 A 5
3 1 A 6
4 2 B 3
5 2 B 4
6 2 B 5
7 2 B 6
8 3 C 3
9 3 C 4
10 3 C 5
11 3 C 6

Another idea is use numpy.tile and numpy.repeat:

df = (df.loc[np.tile(df.index, len(s))]
.reset_index(drop=True)
.assign(buzz = np.repeat(s.to_numpy(), len(df))))
print (df)
foo bar buzz
0 1 A 3
1 2 B 3
2 3 C 3
3 1 A 4
4 2 B 4
5 3 C 4
6 1 A 5
7 2 B 5
8 3 C 5
9 1 A 6
10 2 B 6
11 3 C 6

EDIT: If all columns are same types (not like here) is possible use @Ch3ster solution, thank you:

Using np.tile and np.repeat

df_arr = np.tile(df.to_numpy(), (len(s), 1))
s_arr = np.repeat(s.to_numpy(), len(df))
df = pd.DataFrame(
np.column_stack([df_arr, s_arr]), columns=[*df.columns, s.name]
)


Related Topics



Leave a reply



Submit