Python Pandas Dataframe Get All Combinations of Column Values

Python pandas dataframe get all combinations of column values?

Here is a timeit comparison of a few alternatives.

| method             | ms per loop |
|--------------------+-------------|
| alt2               |        2.36 |
| using_concat       |        3.26 |
| using_double_merge |        22.4 |
| orig               |        22.6 |
| alt                |        45.8 |

The timeit results were generated using IPython:

In [138]: df = make_df(20)

In [143]: %timeit alt2(df)
100 loops, best of 3: 2.36 ms per loop
In [140]: %timeit orig(df)
10 loops, best of 3: 22.6 ms per loop
In [142]: %timeit alt(df)
10 loops, best of 3: 45.8 ms per loop
In [169]: %timeit using_double_merge(df)
10 loops, best of 3: 22.4 ms per loop
In [170]: %timeit using_concat(df)
100 loops, best of 3: 3.26 ms per loop

import itertools

import numpy as np
import pandas as pd


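def alt(df):
    # Cross join via a constant key; assign() returns a copy, so the caller's frame is not modified.
    df = df.assign(const=1)
    result = pd.merge(df, df, on='const', how='outer')
    # Drop the pairs in which a colour is combined with itself.
    result = result.loc[result['colour_x'] != result['colour_y']].copy()

    result['color'] = result['colour_x'] + '_' + result['colour_y']
    result['points'] = result['points_x'] - result['points_y']
    result = result[['color', 'points']]
    return result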

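def alt2(df):
    # Work on a NumPy array: recent pandas versions reject ufunc.outer on a Series.
    p = df['points'].to_numpy()
    # points[i, j] = p[i] - p[j]
    points = np.add.outer(p, -p)
    color = pd.MultiIndex.from_product([df['colour'], df['colour']])
    # MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24.
    mask = color.codes[0] != color.codes[1]
    color = color.map('_'.join)
    result = pd.DataFrame({'points': points.ravel(), 'color': color})
    result = result.loc[mask]
    return result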

def orig(df):
    combos = []
    points = []

    for i1 in range(len(df)):
        for i2 in range(len(df)):
            colour_main = df['colour'].iloc[i1]
            colour_secondary = df['colour'].iloc[i2]
            if colour_main != colour_secondary:
                combo = colour_main + "_" + colour_secondary

                point1 = df['points'].values[i1]
                point2 = df['points'].values[i2]
                new_points = point1 - point2

                combos.append(combo)
                points.append(new_points)

    return pd.DataFrame({'color': combos, 'points': points})

def using_concat(df):
    """https://stackoverflow.com/a/51641085/190597 (RafaelC)"""
    d = df.set_index('colour').to_dict()['points']
    s = pd.Series(list(itertools.combinations(df.colour, 2)))
    s = pd.concat([s, s.transform(lambda k: k[::-1])])
    v = s.map(lambda k: d[k[0]] - d[k[1]])
    df2 = pd.DataFrame({'comb': s.str.get(0) + '_' + s.str.get(1), 'values': v})
    return df2

def using_double_merge(df):
    """https://stackoverflow.com/a/51641007/190597 (sacul)"""
    # Explicit level names so that reset_index() yields 'level_0' and 'level_1'.
    pairs = pd.MultiIndex.from_product([df.colour, df.colour],
                                       names=['level_0', 'level_1'])
    new = (df.reindex(pairs)
           .reset_index()
           .drop(['colour', 'points'], axis=1)
           .merge(df.set_index('colour'), left_on='level_0', right_index=True)
           .merge(df.set_index('colour'), left_on='level_1', right_index=True))

    new['points_y'] *= -1
    # Sum only the numeric columns; the level columns hold strings.
    new['sum'] = new['points_x'] + new['points_y']
    new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], axis=1)
    new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), axis=1)
    return new[['colours', 'sum']]

def make_df(N):
    df = pd.DataFrame({'colour': np.arange(N),
                       'points': np.random.randint(10, size=N)})
    df['colour'] = df['colour'].astype(str)
    return df

The main idea in alt2 is to use np.add.outer to construct a table of pairwise differences
out of df['points']:

In [149]: points = np.add.outer(df['points'], -df['points'])
In [151]: points
Out[151]:
array([[  0,  -9,   4],
       [  9,   0,  13],
       [ -4, -13,   0]])

ravel is used to make the array 1-dimensional:

In [152]: points.ravel()
Out[152]: array([  0,  -9,   4,   9,   0,  13,  -4, -13,   0])

and the color combinations are generated with pd.MultiIndex.from_product:

In [153]: color = pd.MultiIndex.from_product([df['colour'], df['colour']])
In [155]: color = color.map('_'.join)
In [156]: color
Out[156]:
Index(['red_red', 'red_yellow', 'red_black', 'yellow_red', 'yellow_yellow',
       'yellow_black', 'black_red', 'black_yellow', 'black_black'],
      dtype='object')

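A mask is generated to drop the pairs in which a colour is combined with itself. It has to be computed from the MultiIndex, before color is mapped to the joined strings (MultiIndex.labels is called MultiIndex.codes since pandas 0.24):

mask = color.codes[0] != color.codes[1]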

and then the result is generated from these parts:

result = pd.DataFrame({'points': points.ravel(), 'color': color})
result = result.loc[mask]
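
As a self-contained check of the walkthrough above, here is a minimal sketch that rebuilds the same result with plain NumPy indexing. The three-row frame and its point values (10, 19, 6) are assumptions, picked only so that the differences match the table printed earlier; they are not taken from the original question. The sketch uses np.subtract.outer on NumPy arrays, which is equivalent to np.add.outer(p, -p).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the point values are illustrative assumptions.
df = pd.DataFrame({'colour': ['red', 'yellow', 'black'],
                   'points': [10, 19, 6]})

p = df['points'].to_numpy()
c = df['colour'].to_numpy()

# diff[i, j] = p[i] - p[j]
diff = np.subtract.outer(p, p)

# Row and column positions of every off-diagonal cell (i != j), in row-major order.
i, j = np.nonzero(~np.eye(len(df), dtype=bool))

result = pd.DataFrame({'color': [f'{a}_{b}' for a, b in zip(c[i], c[j])],
                       'points': diff[i, j]})
print(result)
```

Selecting the off-diagonal positions up front plays the same role as the mask: pairs of a colour with itself never enter the result.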

The idea behind alt is explained in my original answer, here.

How to create all combination of values from a single pandas column?

I think you are looking for permutations, not combinations. In this case we can generate those and transpose the data. After the transpose we can rename the columns.

import pandas as pd
from itertools import permutations
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Data': [1, 2, 1, 1.5]})

# One column per permutation; transpose so that each Name becomes a row.
df = pd.DataFrame(list(permutations(df.Data.values, 4)), columns=df.Name.values).T
df.columns = [f'Data{x+1}' for x in df.columns]

df.reset_index(inplace=True)
df.rename(columns={'index':'Name'}, inplace=True)

Or:

pd.DataFrame(list(permutations(df.Data.values,4)), columns=df.Name.values).T.add_prefix('Data').rename_axis('Name').reset_index()

Output

  Name  Data1  Data2  Data3  Data4  Data5  Data6  Data7  Data8  Data9  ...  \
0    A    1.0    1.0    1.0    1.0    1.0    1.0    2.0    2.0    2.0  ...
1    B    2.0    2.0    1.0    1.0    1.5    1.5    1.0    1.0    1.0  ...
2    C    1.0    1.5    2.0    1.5    2.0    1.0    1.0    1.5    1.0  ...
3    D    1.5    1.0    1.5    2.0    1.0    2.0    1.5    1.0    1.5  ...
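
The difference between permutations and combinations that this answer relies on can be seen in a small sketch using only itertools. The three labels are hypothetical and chosen just for illustration.

```python
from itertools import combinations, permutations

names = ['A', 'B', 'C']  # hypothetical labels, for illustration only

# Order matters: ('A', 'B') and ('B', 'A') are both produced.
perms = list(permutations(names, 2))

# Order does not matter: only ('A', 'B') is produced.
combs = list(combinations(names, 2))

print(len(perms), len(combs))
```

With three labels taken two at a time there are six permutations but only three combinations, which is why the choice between the two changes the shape of the result.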

How to get all combinations of three columns in a dataframe in pandas python?

Use itertools.product:

from itertools import product

# all columns
df = pd.DataFrame(list(product(*df.values.T)))
# if you need to specify columns
# df = pd.DataFrame(list(product(*[df.a, df.b, df.c])))
print(df)
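
A runnable sketch of the same idea, assuming a small hypothetical frame with columns a, b and c (the names and values are made up for illustration). It passes the columns to product explicitly and keeps the original column names on the result.

```python
from itertools import product

import pandas as pd

# Hypothetical input; column names and values are illustrative assumptions.
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [True, False]})

# Cartesian product of the three columns: 2 * 2 * 2 = 8 rows.
combos = pd.DataFrame(list(product(df['a'], df['b'], df['c'])),
                      columns=list(df.columns))
print(combos)
```

If a column contains repeated values, the product will contain repeated rows; calling .unique() on each column first is one way to avoid that.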

Is there an optimal way to get all combinations of values in a grouped pandas dataframe?

I think you can do a self merge and query:

df.merge(df, on='ID', suffixes=('1', '2')).query('color1 != color2')

Or similar, merge then filter:

(df.merge(df, on='ID', suffixes=('1', '2'))
   .loc[lambda x: x['color1'] != x['color2']]
)

Output:

   ID color1 color2
1   a    red   blue
2   a    red  green
3   a   blue    red
5   a   blue  green
6   a  green    red
7   a  green   blue
10  b    red   blue
11  b   blue    red
14  c    red  green
15  c  green    red
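
For reference, a self-contained sketch of the self merge. The input frame is an assumption reconstructed from the output above (the question's original data is not shown here), with columns ID and color.

```python
import pandas as pd

# Assumed input, reconstructed from the output shown above.
df = pd.DataFrame({'ID': list('aaabbcc'),
                   'color': ['red', 'blue', 'green', 'red', 'blue', 'red', 'green']})

# Pair every row with every row of the same ID, then drop the self-pairs.
pairs = (df.merge(df, on='ID', suffixes=('1', '2'))
           .query('color1 != color2'))
print(pairs)
```

Because the merge is restricted to matching ID values, colours are only paired within a group, never across groups.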

Generate all combinations in pandas

Updated for multi values:

df = pd.DataFrame([['A', 3, 4, 'fox'], ['A', 3, 4, 'cat'], ['A', 3, 4, 'dog'],
                   ['B', 2, 3, 'rabbit'], ['B', 2, 3, 'dog'], ['B', 2, 3, 'eel'],
                   ['C', 6, 7, 'fox'], ['C', 6, 7, 'elephant']],
                  columns=['group', 'val1', 'val2', 'animal'])


dfi = df.set_index(['group', 'animal']).assign(occurred=1)
indx = pd.MultiIndex.from_product(dfi.index.levels)
dfi = dfi.reindex(indx, fill_value=0)
dfi[['val1', 'val2']] = dfi.groupby(level=0)[['val1','val2']].transform('max')
print(dfi.reset_index().sort_values(['group', 'occurred'], ascending=[True, False]))

Output:

   group    animal  val1  val2  occurred
0      A       cat     3     4         1
1      A       dog     3     4         1
4      A       fox     3     4         1
2      A       eel     3     4         0
3      A  elephant     3     4         0
5      A    rabbit     3     4         0
7      B       dog     2     3         1
8      B       eel     2     3         1
11     B    rabbit     2     3         1
6      B       cat     2     3         0
9      B  elephant     2     3         0
10     B       fox     2     3         0
15     C  elephant     6     7         1
16     C       fox     6     7         1
12     C       cat     6     7         0
13     C       dog     6     7         0
14     C       eel     6     7         0
17     C    rabbit     6     7         0

IIUC, you can do it like this: use set_index and assign an 'occurred' flag, reindex against the full MultiIndex product, then use groupby to restore the values in the newly added rows.

dfi = df.set_index(['group', 'animal']).assign(occurred=1)
indx = pd.MultiIndex.from_product(dfi.index.levels)
dfi = dfi.reindex(indx, fill_value=0)
dfi['val'] = dfi.groupby(level=0)['val'].transform('max')
dfi.reset_index().sort_values(['group', 'occurred'], ascending=[True, False])

Output:

   group    animal  val  occurred
0      A       fox    3         1
1      A       cat    3         1
2      A       dog    3         1
3      A    rabbit    3         0
4      A       eel    3         0
5      A  elephant    3         0
8      B       dog    2         1
9      B    rabbit    2         1
10     B       eel    2         1
6      B       fox    2         0
7      B       cat    2         0
11     B  elephant    2         0
12     C       fox    6         1
17     C  elephant    6         1
13     C       cat    6         0
14     C       dog    6         0
15     C    rabbit    6         0
16     C       eel    6         0
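
The reindex step can be tried on a smaller frame. The sketch below is an illustration under assumptions: the three input rows are made up, and the level names are passed explicitly so the columns come back as group and animal after reset_index.

```python
import pandas as pd

# Hypothetical input: two groups, three distinct animals overall.
df = pd.DataFrame([['A', 3, 'fox'], ['A', 3, 'cat'], ['B', 2, 'dog']],
                  columns=['group', 'val', 'animal'])

dfi = df.set_index(['group', 'animal']).assign(occurred=1)

# Every (group, animal) pair, whether or not it was observed.
full = pd.MultiIndex.from_product(dfi.index.levels, names=list(dfi.index.names))

# Rows missing from the data are created with 0 in every column.
dfi = dfi.reindex(full, fill_value=0)

# Restore val in the new rows from the group's observed value.
dfi['val'] = dfi.groupby(level=0)['val'].transform('max')

out = dfi.reset_index()
print(out)
```

Note that taking the group maximum only restores val correctly when val is constant within a group and larger than the fill value 0; for other data a different fill strategy would be needed.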

All combinations of a row in a dataframe

You can groupby by No and create the lists you want quite easily.

import itertools as it

def combinations(group):
    # All unordered pairs of the distinct colours in this group.
    return pd.Series(list(it.combinations(group['Color'].unique(), 2)))

df.groupby('No').apply(combinations).explode()

Without the explode call, the result keeps each color combination together as a tuple.
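
A variant of the same idea, sketched under assumptions: the frame below is hypothetical, and instead of returning a Series from the grouped function, each group returns a plain list of pairs, so that explode produces exactly one pair per row.

```python
import itertools as it

import pandas as pd

# Hypothetical input with the column names used in the answer above.
df = pd.DataFrame({'No': [1, 1, 1, 2, 2],
                   'Color': ['red', 'blue', 'green', 'red', 'blue']})

# One list of colour pairs per group, then one row per pair.
pairs = (df.groupby('No')['Color']
           .apply(lambda s: list(it.combinations(s.unique(), 2)))
           .explode())
print(pairs)
```

The result is indexed by No, so all pairs that belong to one group can be selected with pairs.loc[...].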

Counting all combinations of values in multiple columns

This seems like a nice problem for pd.get_dummies:

new_df = (
    pd.concat([df, pd.get_dummies(df['star'])], axis=1)
    .groupby(['month', 'item'], as_index=False)
    [df['star'].unique()]
    .sum()
)

Output:

>>> new_df
   month  item  1  2  3
0      1    10  2  1  1
1      1    20  0  0  1
2      2    20  0  2  1

Renaming, too:

u = df['star'].unique()
new_df = (
    pd.concat([df, pd.get_dummies(df['star'])], axis=1)
    .groupby(['month', 'item'], as_index=False)
    [u]
    .sum()
    .rename({k: f'star_{k}_cnt' for k in u}, axis=1)
)

Output:

>>> new_df
   month  item  star_1_cnt  star_2_cnt  star_3_cnt
0      1    10           2           1           1
1      1    20           0           0           1
2      2    20           0           2           1

Obligatory one- (or two-) liners:

# Renames the columns
u = df['star'].unique()
new_df = pd.concat([df, pd.get_dummies(df['star'])], axis=1).groupby(['month', 'item'], as_index=False)[u].sum().rename({k: f'star_{k}_cnt' for k in u}, axis=1)
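
The same counts can also be produced with pd.crosstab, which is a different technique from the get_dummies approach above and is shown here only as an alternative sketch. The input frame is an assumption reconstructed from the output shown earlier, since the question's data is not reproduced in this section.

```python
import pandas as pd

# Assumed input, reconstructed so that the counts match the output above.
df = pd.DataFrame({
    'month': [1, 1, 1, 1, 1, 2, 2, 2],
    'item':  [10, 10, 10, 10, 20, 20, 20, 20],
    'star':  [1, 1, 2, 3, 3, 2, 2, 3],
})

# Rows: (month, item) pairs; columns: star values; cells: occurrence counts.
counts = (pd.crosstab([df['month'], df['item']], df['star'])
            .add_prefix('star_')
            .add_suffix('_cnt')
            .reset_index())
print(counts)
```

crosstab fills combinations that never occur with 0, so no separate fill step is needed.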

pyspark create all possible combinations of column values of a dataframe

You can use the crossJoin method and then keep only the rows with id1 < id2, which drops both the self-pairs and the mirrored duplicates.

df = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
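
The effect of the filter can be checked without Spark. The sketch below uses plain Python on a hypothetical list of ids: a full cross product filtered with a < b yields the same pairs as itertools.combinations on the sorted ids.

```python
from itertools import combinations, product

ids = [3, 1, 4, 2]  # hypothetical id values, for illustration only

# Cross product of the ids with themselves, keeping only pairs with a < b.
cross = sorted((a, b) for a, b in product(ids, ids) if a < b)

# The same set of unordered pairs, generated directly.
direct = list(combinations(sorted(ids), 2))

print(cross == direct)
```

Four ids give sixteen rows in the cross product, of which six remain after the filter.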

