Python pandas dataframe get all combinations of column values?
Here is a timeit comparison of a few alternatives.
| method | ms per loop |
|--------------------+-------------|
| alt2 | 2.36 |
| using_concat | 3.26 |
| using_double_merge | 22.4 |
| orig | 22.6 |
| alt | 45.8 |
The timeit results were generated using IPython:
In [138]: df = make_df(20)
In [143]: %timeit alt2(df)
100 loops, best of 3: 2.36 ms per loop
In [140]: %timeit orig(df)
10 loops, best of 3: 22.6 ms per loop
In [142]: %timeit alt(df)
10 loops, best of 3: 45.8 ms per loop
In [169]: %timeit using_double_merge(df)
10 loops, best of 3: 22.4 ms per loop
In [170]: %timeit using_concat(df)
100 loops, best of 3: 3.26 ms per loop
import itertools

import numpy as np
import pandas as pd
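
def alt(df):
    # Cross join df with itself through a constant key
    # (note: this adds a 'const' column to the caller's df).
    df['const'] = 1
    result = pd.merge(df, df, on='const', how='outer')
    # Drop the pairs in which a colour is matched with itself.
    result = result.loc[(result['colour_x'] != result['colour_y'])]
    result['color'] = result['colour_x'] + '_' + result['colour_y']
    result['points'] = result['points_x'] - result['points_y']
    result = result[['color', 'points']]
    return result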
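
def alt2(df):
    # Convert to a NumPy array first: recent pandas versions do not
    # support ufunc.outer on Series.
    values = df['points'].to_numpy()
    points = np.add.outer(values, -values)
    color = pd.MultiIndex.from_product([df['colour'], df['colour']])
    # MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24.
    mask = color.codes[0] != color.codes[1]
    color = color.map('_'.join)
    result = pd.DataFrame({'points': points.ravel(), 'color': color})
    result = result.loc[mask]
    return result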

def orig(df):
    combos = []
    points = []
    for i1 in range(len(df)):
        for i2 in range(len(df)):
            colour_main = df['colour'].iloc[i1]
            colour_secondary = df['colour'].iloc[i2]
            if colour_main != colour_secondary:
                combo = colour_main + "_" + colour_secondary
                point1 = df['points'].values[i1]
                point2 = df['points'].values[i2]
                new_points = point1 - point2
                combos.append(combo)
                points.append(new_points)
    return pd.DataFrame({'color': combos, 'points': points})

def using_concat(df):
    """https://stackoverflow.com/a/51641085/190597 (RafaelC)"""
    # Look-up table: colour -> points.
    d = df.set_index('colour').to_dict()['points']
    # Each unordered pair once, then the same pairs reversed.
    s = pd.Series(list(itertools.combinations(df.colour, 2)))
    s = pd.concat([s, s.transform(lambda k: k[::-1])])
    v = s.map(lambda k: d[k[0]] - d[k[1]])
    df2 = pd.DataFrame({'comb': s.str.get(0) + '_' + s.str.get(1), 'values': v})
    return df2

def using_double_merge(df):
    """https://stackoverflow.com/a/51641007/190597 (sacul)"""
    new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
           .reset_index()
           .drop(['colour', 'points'], axis=1)
           .merge(df.set_index('colour'), left_on='level_0', right_index=True)
           .merge(df.set_index('colour'), left_on='level_1', right_index=True))
    new['points_y'] *= -1
    # Sum only the numeric columns; the level_* columns hold strings.
    new['sum'] = new[['points_x', 'points_y']].sum(axis=1)
    new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], axis=1)
    new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), axis=1)
    return new[['colours', 'sum']]

def make_df(N):
    df = pd.DataFrame({'colour': np.arange(N),
                       'points': np.random.randint(10, size=N)})
    df['colour'] = df['colour'].astype(str)
    return df

The main idea in alt2 is to use np.add.outer to build a table of pairwise differences from df['points'] (each value is added to the negation of every value):
In [149]: points = np.add.outer(df['points'], -df['points'])
In [151]: points
Out[151]:
array([[ 0, -9, 4],
[ 9, 0, 13],
[ -4, -13, 0]])
ravel is used to make the array 1-dimensional:
In [152]: points.ravel()
Out[152]: array([ 0, -9, 4, 9, 0, 13, -4, -13, 0])
and the color combinations are generated with pd.MultiIndex.from_product:
In [153]: color = pd.MultiIndex.from_product([df['colour'], df['colour']])
In [155]: color = color.map('_'.join)
In [156]: color
Out[156]:
Index(['red_red', 'red_yellow', 'red_black', 'yellow_red', 'yellow_yellow',
'yellow_black', 'black_red', 'black_yellow', 'black_black'],
dtype='object')
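A mask is generated to remove the pairs in which a colour is combined with itself (MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24):
mask = color.codes[0] != color.codes[1]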
and then the result is generated from these parts:
result = pd.DataFrame({'points':points.ravel(), 'color':color})
result = result.loc[mask]
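The pieces of alt2 above can be put together on a tiny frame. This is only a sketch: the three point values (4, 13, 0) are an assumption, chosen so that the difference table matches the array shown earlier, and MultiIndex.codes is used in place of labels.

```python
import numpy as np
import pandas as pd

# Assumed sample data; only the colour names come from the output above.
df = pd.DataFrame({'colour': ['red', 'yellow', 'black'],
                   'points': [4, 13, 0]})

# Pairwise differences: entry [i, j] is points[i] - points[j].
values = df['points'].to_numpy()
points = np.add.outer(values, -values)

# All ordered colour pairs, in the same row-major order as points.ravel().
color = pd.MultiIndex.from_product([df['colour'], df['colour']])

# True wherever the two colours of a pair differ.
mask = color.codes[0] != color.codes[1]

result = pd.DataFrame({'points': points.ravel(),
                       'color': color.map('_'.join)}).loc[mask]
print(result)
```

Because from_product and ravel both walk the pairs in row-major order, the labels and the differences line up without any explicit join.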
The idea behind alt is explained in my original answer.
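In short, alt pairs every row with every other row by merging df with itself on a constant key. A minimal sketch of the same idea, assuming pandas 1.2 or later (where merge accepts how='cross') and using made-up sample values, might look like this:

```python
import pandas as pd

# Assumed sample data, for illustration only.
df = pd.DataFrame({'colour': ['red', 'yellow', 'black'],
                   'points': [4, 13, 0]})

# how='cross' pairs every row with every row, so no helper column is needed.
pairs = df.merge(df, how='cross', suffixes=('_x', '_y'))

# Drop the pairs in which a colour is matched with itself.
pairs = pairs[pairs['colour_x'] != pairs['colour_y']]

result = pd.DataFrame({
    'color': pairs['colour_x'] + '_' + pairs['colour_y'],
    'points': pairs['points_x'] - pairs['points_y'],
}).reset_index(drop=True)
print(result)
```

Unlike alt, this variant does not add a 'const' column to the input frame.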
How to create all combination of values from a single pandas column?
I think you are looking for permutations, not combinations. In this case we can generate those and transpose the data. After the transpose we can rename the columns.
import pandas as pd
from itertools import permutations
df = pd.DataFrame({'Name':['A','B','C','D'],
'Data':[1,2,1,1.5]})
df = pd.DataFrame(list(permutations(df.Data.values,4)), columns=df.Name.values).T
df.columns = [f'Data{x+1}' for x in df.columns]
df.reset_index(inplace=True)
df.rename(columns={'index':'Name'}, inplace=True)
Or:
pd.DataFrame(list(permutations(df.Data.values,4)), columns=df.Name.values).T.add_prefix('Data').rename_axis('Name').reset_index()
Output
Name Data1 Data2 Data3 Data4 Data5 Data6 Data7 Data8 Data9 ... \
0 A 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 ...
1 B 2.0 2.0 1.0 1.0 1.5 1.5 1.0 1.0 1.0 ...
2 C 1.0 1.5 2.0 1.5 2.0 1.0 1.0 1.5 1.0 ...
3 D 1.5 1.0 1.5 2.0 1.0 2.0 1.5 1.0 1.5 ...
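To make the permutations/combinations distinction concrete, here is a small sketch using only itertools and the four names from the example above. Permutations treat order as significant, combinations do not.

```python
from itertools import combinations, permutations

names = ['A', 'B', 'C', 'D']

# Order matters: ('A', 'B') and ('B', 'A') are different permutations.
ordered_pairs = list(permutations(names, 2))

# Order is ignored: only ('A', 'B') appears among the combinations.
unordered_pairs = list(combinations(names, 2))

# Full-length permutations give one column each in the transposed frame.
full_orderings = list(permutations(names, 4))

print(len(ordered_pairs), len(unordered_pairs), len(full_orderings))
```

The 24 full-length orderings are why the transposed frame above has columns Data1 through Data24.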
How to get all combinations of three columns in a dataframe in pandas python?
Use itertools.product:
from itertools import product
#all columns
df = pd.DataFrame(list(product(*df.values.T)))
#if you need to specify columns
#df = pd.DataFrame(list(product(*[df.a, df.b, df.c])))
print (df)
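A self-contained sketch of the same approach follows. The column names a, b, c are taken from the comment above, while the values are invented for illustration.

```python
from itertools import product

import pandas as pd

# Assumed sample frame with three columns of two values each.
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [True, False]})

# One row per element of the Cartesian product a x b x c.
combos = pd.DataFrame(list(product(df['a'], df['b'], df['c'])),
                      columns=df.columns)
print(combos)
```

Passing the columns explicitly keeps each column's own values intact, and columns=df.columns restores the original names that a bare DataFrame(list(...)) would lose.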
Is there an optimal way to get all combinations of values in a grouped pandas dataframe?
I think you can do a self merge and query:
df.merge(df, on='ID', suffixes=[1,2]).query('color1 != color2')
Or similar, merge then filter:
(df.merge(df, on='ID', suffixes=[1,2])
.loc[lambda x: x['color1'] != x['color2']]
)
Output:
ID color1 color2
1 a red blue
2 a red green
3 a blue red
5 a blue green
6 a green red
7 a green blue
10 b red blue
11 b blue red
14 c red green
15 c green red
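For a runnable version, the input frame below is an assumption reconstructed from the output above (three colours for ID a, two each for b and c). String suffixes are used instead of integers. This is a sketch of the self-merge, not the original poster's data.

```python
import pandas as pd

# Assumed input, reconstructed from the output shown above.
df = pd.DataFrame({
    'ID': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'color': ['red', 'blue', 'green', 'red', 'blue', 'red', 'green'],
})

# Pair every colour with every colour inside the same ID,
# then drop the rows that pair a colour with itself.
pairs = (df.merge(df, on='ID', suffixes=('1', '2'))
           .query('color1 != color2'))
print(pairs)
```

Merging on ID keeps the pairing within each group, so the intermediate result grows with the square of each group's size rather than with the square of the whole frame.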
Generate all combinations in pandas
Updated for multi values:
df = pd.DataFrame([['A', 3, 4, 'fox'], ['A', 3, 4, 'cat'], ['A', 3, 4,'dog'],
['B', 2, 3, 'rabbit'], ['B', 2, 3, 'dog'], ['B', 2, 3,'eel'],
['C', 6, 7, 'fox'], ['C', 6, 7, 'elephant']],
columns=['group', 'val1', 'val2', 'animal'])
dfi = df.set_index(['group', 'animal']).assign(occurred=1)
indx = pd.MultiIndex.from_product(dfi.index.levels)
dfi = dfi.reindex(indx, fill_value=0)
dfi[['val1', 'val2']] = dfi.groupby(level=0)[['val1','val2']].transform('max')
print(dfi.reset_index().sort_values(['group', 'occurred'], ascending=[True, False]))
Output:
group animal val1 val2 occurred
0 A cat 3 4 1
1 A dog 3 4 1
4 A fox 3 4 1
2 A eel 3 4 0
3 A elephant 3 4 0
5 A rabbit 3 4 0
7 B dog 2 3 1
8 B eel 2 3 1
11 B rabbit 2 3 1
6 B cat 2 3 0
9 B elephant 2 3 0
10 B fox 2 3 0
15 C elephant 6 7 1
16 C fox 6 7 1
12 C cat 6 7 0
13 C dog 6 7 0
14 C eel 6 7 0
17 C rabbit 6 7 0
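IIUC, you can do it like this: assign an 'occurred' flag, use set_index to build a MultiIndex, reindex against the full product of its levels, then use groupby to fill in the values for the rows that were added.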
dfi = df.set_index(['group', 'animal']).assign(occurred=1)
indx = pd.MultiIndex.from_product(dfi.index.levels)
dfi = dfi.reindex(indx, fill_value=0)
dfi['val'] = dfi.groupby(level=0)['val'].transform('max')
dfi.reset_index().sort_values(['group', 'occurred'], ascending=[True, False])
Output:
group animal val occurred
0 A fox 3 1
1 A cat 3 1
2 A dog 3 1
3 A rabbit 3 0
4 A eel 3 0
5 A elephant 3 0
8 B dog 2 1
9 B rabbit 2 1
10 B eel 2 1
6 B fox 2 0
7 B cat 2 0
11 B elephant 2 0
12 C fox 6 1
17 C elephant 6 1
13 C cat 6 0
14 C dog 6 0
15 C rabbit 6 0
16 C eel 6 0
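The single-value version can be tried on a smaller frame. The three input rows below are made up for illustration, and names= is passed explicitly so the sketch does not depend on from_product inferring the level names.

```python
import pandas as pd

# Assumed sample data: group A saw a fox and a cat, group B saw a dog.
df = pd.DataFrame({'group': ['A', 'A', 'B'],
                   'val': [3, 3, 2],
                   'animal': ['fox', 'cat', 'dog']})

# Flag the observed (group, animal) pairs.
dfi = df.set_index(['group', 'animal']).assign(occurred=1)

# Every group paired with every animal.
full = pd.MultiIndex.from_product(dfi.index.levels, names=dfi.index.names)

# Rows missing from the data are added with 0 in every column ...
dfi = dfi.reindex(full, fill_value=0)

# ... so restore val from the observed rows of the same group.
dfi['val'] = dfi.groupby(level=0)['val'].transform('max')

out = dfi.reset_index().sort_values(['group', 'occurred'],
                                    ascending=[True, False])
print(out)
```

Taking the group-wise max works here because val is constant within a group and the filler value 0 is assumed to be smaller than any real value.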
All combinations of a row in a dataframe
You can group by No and build the lists you want quite easily.
import itertools as it

def combinations(group):
    return pd.Series(list(it.combinations(group['Color'].unique(), 2)))

df.groupby('No').apply(combinations).explode()
If you leave out explode, it returns a list of color combinations.
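A variant of this idea, sketched on made-up data with the same column names No and Color: it builds one list of pairs per group and then explodes that list, so each row of the result holds one colour pair. This is a swapped-in formulation, not the code above.

```python
import itertools as it

import pandas as pd

# Assumed sample data.
df = pd.DataFrame({'No': [1, 1, 1, 2, 2],
                   'Color': ['red', 'blue', 'green', 'red', 'blue']})

# One list of unordered colour pairs per value of No.
per_group = df.groupby('No')['Color'].apply(
    lambda s: list(it.combinations(s.unique(), 2)))

# One row per pair; the index still carries the No value.
pairs = per_group.explode()
print(pairs)
```

Here per_group is the un-exploded form: a Series with one list of tuples per group.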
Counting all combinations of values in multiple columns
This seems like a nice problem for pd.get_dummies:
new_df = (
pd.concat([df, pd.get_dummies(df['star'])], axis=1)
.groupby(['month', 'item'], as_index=False)
[df['star'].unique()]
.sum()
)
Output:
>>> new_df
month item 1 2 3
0 1 10 2 1 1
1 1 20 0 0 1
2 2 20 0 2 1
Renaming, too:
u = df['star'].unique()
new_df = (
pd.concat([df, pd.get_dummies(df['star'])], axis=1)
.groupby(['month', 'item'], as_index=False)
[u]
.sum()
.rename({k: f'star_{k}_cnt' for k in df['star'].unique()}, axis=1)
)
Output:
>>> new_df
month item star_1_cnt star_2_cnt star_3_cnt
0 1 10 2 1 1
1 1 20 0 0 1
2 2 20 0 2 1
Obligatory one- (or two-) liners:
# Renames the columns
u = df['star'].unique()
new_df = pd.concat([df, pd.get_dummies(df['star'])], axis=1).groupby(['month', 'item'], as_index=False)[u].sum().rename({k: f'star_{k}_cnt' for k in df['star'].unique()}, axis=1)
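As an alternative to the get_dummies approach, the same counts can be sketched with pd.crosstab. The input rows below are an assumption, chosen so that the counts match the output shown above.

```python
import pandas as pd

# Assumed input data, reconstructed to reproduce the counts above.
df = pd.DataFrame({
    'month': [1, 1, 1, 1, 1, 2, 2, 2],
    'item':  [10, 10, 10, 10, 20, 20, 20, 20],
    'star':  [1, 1, 2, 3, 3, 2, 2, 3],
})

# Rows: (month, item) pairs; columns: star values; cells: occurrence counts.
counts = pd.crosstab([df['month'], df['item']], df['star'])

# Rename the star columns and turn the row index back into columns.
counts.columns = [f'star_{k}_cnt' for k in counts.columns]
new_df = counts.reset_index()
print(new_df)
```

crosstab fills star values that never occur for a (month, item) pair with 0, which is the same result the summed dummy columns give.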
pyspark create all possible combinations of column values of a dataframe
You can use the crossJoin method, and then keep only the rows with id1 < id2, which drops both the self-pairs and the mirrored duplicates.
df = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
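Spark is not needed to see why the id1 < id2 filter works. The following plain-Python sketch (the ids list is made up for illustration) builds the same cross product with itertools.product and checks that the filter leaves exactly the unordered pairs.

```python
from itertools import combinations, product

# Assumed sample ids; any values with a total order would do.
ids = [1, 2, 3, 4]

# Cross product of the ids with themselves, like crossJoin.
cross = list(product(ids, ids))

# Keep each unordered pair once, like filter('id1 < id2').
kept = [(id1, id2) for id1, id2 in cross if id1 < id2]

# The surviving rows are exactly the 2-element combinations.
same_as_combinations = kept == list(combinations(sorted(ids), 2))
print(len(cross), len(kept), same_as_combinations)
```

Of the 16 rows in the cross product, 4 are self-pairs and 6 are mirror images, which leaves 6 rows.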