Finding the most frequent combination in DataFrame
You can group by the two columns together and count the number of occurrences of each pair, then sort the pairs by this count.
The following code does the job:
df.groupby(["From", "To"]).size().sort_values(ascending=False)
and, for the example of the question, it returns:
From To
-----------------------
Home Office 3
Restaurant Office 1
Airport Home 1
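If only the single most frequent pair is needed, a minimal sketch along the same lines could look like the following. The sample trips are made up here to reproduce the counts shown above; they are not the question's actual data.

```python
import pandas as pd

# Hypothetical trips, chosen only to reproduce the counts shown above.
df = pd.DataFrame({
    "From": ["Home", "Home", "Restaurant", "Airport", "Home"],
    "To":   ["Office", "Office", "Office", "Home", "Office"],
})

# One count per (From, To) pair, indexed by a two-level MultiIndex.
pair_counts = df.groupby(["From", "To"]).size()

# idxmax returns the index label of the largest count, here a (From, To) tuple.
top_pair = pair_counts.idxmax()
top_count = pair_counts.max()

print(top_pair, top_count)
```

Note that idxmax reports only the first pair when several pairs tie for the highest count; keep the sorted Series if ties matter.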
Python: How to find most frequent combination of elements?
Use the custom function all_subsets, then flatten the values with Series.explode, and finally count them with Series.value_counts:
from itertools import chain, combinations
#https://stackoverflow.com/a/5898031
#only converted to list and removed empty tuples by range(1,...
def all_subsets(ss):
    return list(chain(*map(lambda x: combinations(ss, x), range(1, len(ss)+1))))
s = df.groupby('id')['code'].apply(all_subsets).explode().value_counts()
print (s)
(2,) 3
(2, 5) 3
(5,) 3
(1, 2) 2
(3, 6) 2
..
(1, 5, 8) 1
(9,) 1
(1, 3, 4, 6) 1
(5, 8, 9) 1
(4, 6) 1
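To see what the counting step does without pandas, here is a rough standard-library sketch of the same idea. The codes_by_id mapping is hypothetical sample data, and the codes are sorted first (an assumption of this sketch) so that the same subset always produces the same tuple.

```python
from collections import Counter
from itertools import chain, combinations

def all_subsets(items):
    """Return every non-empty combination of items, as tuples."""
    return list(chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1)
    ))

# Hypothetical data: each id maps to the codes recorded for it.
codes_by_id = {
    "a": [1, 2, 5],
    "b": [2, 5],
    "c": [2, 3],
}

# Count how many ids contain each subset of codes.
subset_counts = Counter(
    subset
    for codes in codes_by_id.values()
    for subset in all_subsets(sorted(codes))
)

print(subset_counts.most_common(3))
```

A group of n codes yields 2**n - 1 non-empty subsets, so this approach is only practical when the groups are small.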
How to list the most frequent combination of column that contain data
I think this is what you want. Instead of returning a single list of columns, this returns a list of column combinations (together with the number of complete rows they share), to account for instances where there is a tie for the 'best' number of non-NA rows.
import pandas as pd
from itertools import combinations
from math import nan
def best_combinations(df, n_cols):
    best_cols = []
    best_length = 0
    for cols in combinations(df.columns, n_cols):
        subdf = df.loc[:, list(cols)].dropna()
        if len(subdf) > best_length:
            best_length = len(subdf)
            best_cols = [cols]
        elif (len(subdf) == best_length) and (best_length > 0):
            best_cols.append(cols)
    return best_cols, best_length
On your dataframe:
df = pd.DataFrame({
'A': {0: '2', 1: '4', 2: '3', 3: '4', 4: '6', 5: nan, 6: nan},
'B': {0: '6', 1: '5', 2: '4', 3: '5', 4: '7', 5: nan, 6: nan},
'C': {0: '3', 1: '6', 2: nan, 3: nan, 4: nan, 5: nan, 6: nan},
'D': {0: '7', 1: '7', 2: nan, 3: nan, 4: nan, 5: '5', 6: '7'},
'E': {0: '7', 1: '5', 2: nan, 3: nan, 4: nan, 5: '6', 6: '5'},
'F': {0: '3', 1: '4', 2: nan, 3: nan, 4: nan, 5: '7', 6: '8'}}
)
best_combinations(df, 2)
# returns: ([('A', 'B')], 5)
best_combinations(df, 3)
# returns: ([('D', 'E', 'F')], 4)
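If you would rather see every combination ranked, not only the winners, one possible variant counts the complete rows with notna() instead of building a sub-frame each time. The function name complete_row_counts and the small frame below are made up for illustration.

```python
from itertools import combinations
import pandas as pd

def complete_row_counts(df, n_cols):
    """Rank every n_cols-column combination by its number of fully non-NA rows."""
    counts = {
        cols: int(df[list(cols)].notna().all(axis=1).sum())
        for cols in combinations(df.columns, n_cols)
    }
    # Highest count first; ties keep the order produced by combinations().
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Hypothetical frame, with None standing in for missing values.
small = pd.DataFrame({
    "A": [1, 2, None],
    "B": [1, None, None],
    "C": [1, 2, 3],
})

ranked = complete_row_counts(small, 2)
print(ranked)
```

As with the function above, the number of combinations grows quickly with the number of columns, so this is a sketch for modest column counts.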
Find most frequent combination of values in a data.frame
Here's an approach with data.table:
library(data.table)
dt <- data.table(dat)
setkeyv(dt, names(dt))
dt[, .N, by = key(dt)]
dt[, .N, by = key(dt)][N == max(N)]
# age sex bmi N
# 1: 55 1 25 2
And an approach with base R:
x <- data.frame(table(dat))
x[x$Freq == max(x$Freq), ]
# age sex bmi Freq
# 11 55 1 25 2
I don't know how well either of these scale though, particularly if the number of combinations is going to be large. So, test back and report!
Replace x$Freq == max(x$Freq) with which.max(x$Freq) and N == max(N) with which.max(N) if you are really just interested in one row of results.
Pull most frequent combination of 2 columns from panda dataframe by count
Your error is being caused by the [0] at the end of your line where you do the groupby. You didn't post the full error message, but I would bet you have a KeyError: 0. That is due to you no longer having a 0 in your index. If you take a look at the DataFrame created after the groupby you'll see that you now have a hierarchical index, created from the unique value combinations of column1 and column2.
The quick solution? Replace [0] with .iloc[0] to get the row in the zero-index-location.
output = df.groupby(['column1','column2']).count().sort_values(by=['column1','column2'], axis = 0).iloc[0]
Or use .head(1), to get the top row of the DataFrame.
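To make the index change concrete, here is a small sketch with a made-up frame (the value column and its contents are assumptions, not the asker's data). It shows the hierarchical index produced by the groupby and the difference between label-based and position-based access.

```python
import pandas as pd

# Hypothetical data with the two grouping columns from the question.
df = pd.DataFrame({
    "column1": ["a", "a", "b"],
    "column2": ["x", "x", "y"],
    "value":   [1, 2, 3],
})

counts = df.groupby(["column1", "column2"]).count()

# The group keys are now the row labels, as a two-level MultiIndex.
print(counts.index.names)

# counts[0] looks for a label 0, which does not exist in this frame.
try:
    counts[0]
except KeyError:
    print("KeyError: there is no label 0")

first_row = counts.iloc[0]          # first row by position
by_label = counts.loc[("a", "x")]   # same row, selected by its group key
```

Position-based access with .iloc works regardless of what the labels are, which is why it sidesteps the error.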
Counting most common combination of values in dataframe column
Use itertools.combinations, explode and value_counts:
import itertools
(df.groupby('ID').Product.agg(lambda x: list(itertools.combinations(x,2)))
.explode().str.join('-').value_counts())
Out[611]:
A-B 2
C-D 1
A-D 1
A-C 1
Name: Product, dtype: int64
Or:
import itertools
(df.groupby('ID').Product.agg(lambda x: list(map('-'.join, itertools.combinations(x,2))))
.explode().value_counts())
Out[597]:
A-B 2
C-D 1
A-D 1
A-C 1
Name: Product, dtype: int64
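One thing to watch: itertools.combinations keeps the input order, so a group listing B before A would yield "B-A" rather than "A-B". A minimal standard-library sketch that normalises the order first is shown below; the products_by_id mapping is hypothetical sample data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: ID -> products, deliberately in mixed order.
products_by_id = {
    1: ["B", "A"],
    2: ["A", "B", "C"],
    3: ["D", "C", "A"],
}

# Sort (and de-duplicate) each group so every pair has one canonical spelling.
pair_counts = Counter(
    "-".join(pair)
    for products in products_by_id.values()
    for pair in combinations(sorted(set(products)), 2)
)

print(pair_counts.most_common())
```

The same normalisation can be applied inside the lambda of the pandas versions above by passing sorted(set(x)) to combinations instead of x.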
Python - pandas - find most frequent combination with tie-resolution - performance
Let us try groupby with transform to get the count of each combination, then sort_values with drop_duplicates to keep the most common one per id:
df['help'] = df.groupby(['id','string_col_A','string_col_B'])['string_col_A'].transform('count')
out = df.sort_values(['help','creation_date'],na_position='first').drop_duplicates('id',keep='last').drop(columns=['help','creation_date'])
out
Out[122]:
id string_col_A string_col_B
3 x21ab STR_X4 STR_Y4
5 x11aa STR_X3 STR_Y3
0 x12ga STR_X1 STR_Y1
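The same steps on a small, made-up frame (the ids, strings and dates below are assumptions chosen for illustration) show how the tie is resolved: after sorting by count and then by creation_date, keep='last' picks the most frequent combination and, among equally frequent ones, the one created last.

```python
import pandas as pd

# Hypothetical data: u1 has a clear winner, u2 has a tie between two combinations.
df = pd.DataFrame({
    "id": ["u1", "u1", "u1", "u2", "u2"],
    "string_col_A": ["X1", "X1", "X2", "X3", "X4"],
    "string_col_B": ["Y1", "Y1", "Y2", "Y3", "Y4"],
    "creation_date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-05"]
    ),
})

# Attach to every row the number of times its (id, A, B) combination occurs.
df["help"] = (
    df.groupby(["id", "string_col_A", "string_col_B"])["string_col_A"]
      .transform("count")
)

# Sort so the best candidate per id comes last, then keep that last row.
out = (
    df.sort_values(["help", "creation_date"], na_position="first")
      .drop_duplicates("id", keep="last")
      .drop(columns=["help", "creation_date"])
)

print(out)
```

Here u1 resolves to the X1/Y1 combination because it occurs twice, while u2 resolves to X4/Y4 because both of its combinations occur once and X4/Y4 has the later date.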
Pandas get most frequent values used together in the same column
You can use groupby.apply(set) and then count the values with .value_counts:
df.groupby('user_id')['Channel'].apply(set).value_counts()\
.rename_axis('Channels_together')\
.reset_index(name='n')
Output
Channels_together n
0 {a, b} 2
1 {a, c, b} 1
If you want your values in str format we can write a lambda function to sort our set and convert it to string:
df.groupby('user_id')['Channel'].apply(lambda x: ', '.join(sorted(set(x)))).value_counts()\
.rename_axis('Channels_together')\
.reset_index(name='n')
Output
Channels_together n
0 a, b 2
1 a, b, c 1
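For comparison, the underlying idea (collect each user's channels into an order-independent set, then count identical sets) can be sketched with the standard library alone. The rows list is hypothetical sample data; frozenset is used because it is hashable and can therefore serve as a Counter key.

```python
from collections import Counter, defaultdict

# Hypothetical (user_id, Channel) rows.
rows = [
    (1, "a"), (1, "b"),
    (2, "b"), (2, "a"),
    (3, "a"), (3, "b"), (3, "c"),
]

# Collect the distinct channels seen for each user.
channels_by_user = defaultdict(set)
for user_id, channel in rows:
    channels_by_user[user_id].add(channel)

# Count how many users share exactly the same set of channels.
together_counts = Counter(
    frozenset(channels) for channels in channels_by_user.values()
)

for channels, n in together_counts.most_common():
    print(", ".join(sorted(channels)), n)
```

Users 1 and 2 end up in the same bucket even though their channels appear in a different order, which is the property the set-based grouping relies on.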