Pandas Unique Values Multiple Columns


pd.unique returns the unique values from an input array, Series, or Index.

The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then view the values in a flattened NumPy array. The whole operation looks like this:

>>> pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)

Note that ravel() is an array method that returns a view (if possible) of a multidimensional array. The argument 'K' tells the method to flatten the array in the order the elements are stored in memory (pandas typically stores the underlying arrays in Fortran-contiguous order, i.e. columns before rows). This can be significantly faster than using the method's default 'C' order.
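For reference, a DataFrame consistent with the output shown above could be built as follows (the column contents are an assumption, reconstructed from the result array):

```python
import pandas as pd

# Hypothetical DataFrame matching the unique values shown above
df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

# Flatten both columns into one 1-D array, then deduplicate with pd.unique
unique_names = pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
print(unique_names)
```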


An alternative way is to select the columns and pass them to np.unique:

>>> np.unique(df[['Col1', 'Col2']].values)
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)

There is no need to use ravel() here as the method handles multidimensional arrays. Even so, this is likely to be slower than pd.unique as it uses a sort-based algorithm rather than a hashtable to identify unique values.
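As a sketch (using the same assumed DataFrame as above), note that np.unique returns the values sorted, whereas pd.unique preserves the order of appearance:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

# np.unique flattens multidimensional input itself and returns a sorted array
sorted_unique = np.unique(df[['Col1', 'Col2']].values)
print(sorted_unique)  # alphabetical order
```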

The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values):

>>> df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loop, best of 3: 1.12 s per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
10 loops, best of 3: 38.9 ms per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel()) # ravel using C order
10 loops, best of 3: 49.9 ms per loop

How to select distinct across multiple data frame columns in pandas?

You can use the drop_duplicates method to get the unique rows in a DataFrame:

In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
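For example, restricting uniqueness to column 'a' in the DataFrame above keeps only the first row for each distinct value of 'a':

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [3, 4, 3, 5]})

# Uniqueness is determined by column 'a' only; the first occurrence is kept
result = df.drop_duplicates(subset=['a'])
print(result)
```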

Unique values of two columns for pandas dataframe

You need groupby + size + Series.reset_index:

df = df.groupby(['Col1', 'Col2']).size().reset_index(name='Freq')
print(df)
   Col1  Col2  Freq
0     1     1     1
1     1     2     3
2     3     4     2
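A runnable sketch, with an input reconstructed to reproduce the frequency table above (the exact row order is an assumption):

```python
import pandas as pd

# Hypothetical input: pair (1, 1) appears once, (1, 2) three times, (3, 4) twice
df = pd.DataFrame({'Col1': [1, 1, 1, 1, 3, 3],
                   'Col2': [1, 2, 2, 2, 4, 4]})

# Count occurrences of each unique (Col1, Col2) pair
freq = df.groupby(['Col1', 'Col2']).size().reset_index(name='Freq')
print(freq)
```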

Python pandas create multiple columns based on unique values of one column

Here is one way to do it, with DataFrame.pivot:

df.pivot(index=['account_id','conversions'], columns='campaign_objective', values='campaign_spend')

campaign_objective      brand  sales
account_id conversions
1          25              50    100
2          12              60     80

with reset_index

df.pivot(index=['account_id','conversions'], columns='campaign_objective', values='campaign_spend').reset_index()
campaign_objective  account_id  conversions  brand  sales
0                            1           25     50    100
1                            2           12     60     80
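For completeness, here is a hypothetical long-format input that produces the output above (the values are reconstructed from the pivoted result):

```python
import pandas as pd

# Assumed long-format data: one row per (account, objective) combination
df = pd.DataFrame({
    'account_id': [1, 1, 2, 2],
    'conversions': [25, 25, 12, 12],
    'campaign_objective': ['brand', 'sales', 'brand', 'sales'],
    'campaign_spend': [50, 100, 60, 80],
})

# Passing a list of columns as index requires pandas >= 1.1
wide = df.pivot(index=['account_id', 'conversions'],
                columns='campaign_objective',
                values='campaign_spend').reset_index()
print(wide)
```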

Replace unique values of multiple columns with a reference

Since the specific mapping does not matter, we can use np.unique to get the unique values from multiple columns, zip them with AlternativeNames to create a mapper, and then use DataFrame.replace to apply the mapping:

AlternativeNames = ["Batman", "Superman", "Spiderman", "Batman's butler"]
mapper = dict(zip(np.unique(df[['col1', 'col2']]), AlternativeNames))
df = df.replace(mapper)

df:

              col1       col2
0         Superman     Batman
1        Spiderman   Superman
2  Batman's butler  Spiderman

mapper:

{'Alfred Pennyworth': 'Batman',
 'Bruce Wayne': 'Superman',
 'Clark Kent': 'Spiderman',
 'Peter Parker': "Batman's butler"}

DataFrame and imports:

import numpy as np
import pandas as pd

data = {'col1': ["Bruce Wayne", "Clark Kent", "Peter Parker"],
        'col2': ["Alfred Pennyworth", "Bruce Wayne", "Clark Kent"]}
df = pd.DataFrame(data=data)

How to get unique values from multiple columns in a pandas groupby

You can do it with apply:

import numpy as np
g = df.groupby('c')[['l1', 'l2']].apply(lambda x: list(np.unique(x)))

(Note the double brackets: selecting multiple columns from a groupby with a bare tuple is deprecated in newer pandas.)
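A minimal, self-contained sketch (the column names match the snippet above, but the data values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: group by 'c', collect uniques across 'l1' and 'l2'
df = pd.DataFrame({'c': ['a', 'a', 'b'],
                   'l1': [1, 2, 2],
                   'l2': [2, 3, 4]})

# For each group, gather the sorted unique values from both columns
g = df.groupby('c')[['l1', 'l2']].apply(lambda x: list(np.unique(x)))
print(g)
```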

Generate id of unique values from two columns in pandas

Stack the columns to reshape, then factorize to encode the values as numbers, and finally unstack and join back to the original DataFrame:

s = df[['orig', 'dest']].stack()
s[:] = s.factorize()[0] + 1

s.unstack(1).add_suffix('_id').join(df)


  orig_id dest_id  orig  dest  count
0       1       2  INOA  AFXR    100
1       2       1  AFXR  INOA     50
2       3       1  GUTR  INOA      1
3       4       5  AREB  GAPR      5
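Putting it together as a runnable sketch, with the input reconstructed from the output shown above:

```python
import pandas as pd

# Input reconstructed from the orig/dest/count columns of the output
df = pd.DataFrame({'orig': ['INOA', 'AFXR', 'GUTR', 'AREB'],
                   'dest': ['AFXR', 'INOA', 'INOA', 'GAPR'],
                   'count': [100, 50, 1, 5]})

# Stack both columns into one Series so shared values get the same code
s = df[['orig', 'dest']].stack()
s[:] = s.factorize()[0] + 1  # 1-based ids

# Reshape the ids back to two columns and join with the original frame
result = s.unstack(1).add_suffix('_id').join(df)
print(result)
```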

pandas dataframe group by multiple columns and count distinct values

Combine value_counts with apply to count the values in each column:

df.apply(pd.value_counts)

(In pandas 2.x the top-level pd.value_counts is deprecated; use df.apply(pd.Series.value_counts) instead.)
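A short sketch (the data is an assumption); each cell counts how often that index value occurs in that column, with NaN filled as zero for values absent from a column:

```python
import pandas as pd

# Hypothetical data: 'x' appears twice in column a, 'y' three times in column b
df = pd.DataFrame({'a': ['x', 'x', 'y'],
                   'b': ['y', 'y', 'y']})

# Count occurrences of each value, column by column
counts = df.apply(pd.Series.value_counts).fillna(0)
print(counts)
```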

