pandas unique values multiple columns
pd.unique
returns the unique values from an input array, or DataFrame column or index.
The input to this function needs to be one-dimensional, so values from multiple columns must first be combined. The simplest way is to select the columns you want and pass their values to pd.unique as a flattened NumPy array. The whole operation looks like this:
>>> pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)
Note that ravel()
is an array method that returns a view (if possible) of a multidimensional array. The argument 'K'
tells the method to flatten the array in the order the elements are stored in memory (pandas typically stores underlying arrays in Fortran-contiguous order: columns before rows). This can be significantly faster than using the method's default 'C' order.
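To see the difference between the flattening orders, here is a small self-contained sketch; the DataFrame below is assumed for illustration (the original data is not shown):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame for illustration
df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill'],
                   'Col2': ['Mary', 'Steve', 'Bob']})

arr = df[['Col1', 'Col2']].values

# 'C' always flattens row by row; 'K' follows memory order,
# which for a single pandas block is typically column by column
print(arr.ravel('C'))
print(arr.ravel('K'))
print(pd.unique(arr.ravel('K')))
```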
An alternative way is to select the columns and pass them to np.unique:
>>> np.unique(df[['Col1', 'Col2']].values)
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)
There is no need to use ravel()
here as the method handles multidimensional arrays. Even so, this is likely to be slower than pd.unique
as it uses a sort-based algorithm rather than a hash table to identify unique values.
The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values):
>>> df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loop, best of 3: 1.12 s per loop
>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
10 loops, best of 3: 38.9 ms per loop
>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel()) # ravel using C order
10 loops, best of 3: 49.9 ms per loop
How to select distinct across multiple data frame columns in pandas?
You can use the drop_duplicates
method to get the unique rows in a DataFrame:
In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})
In [30]: df
Out[30]:
a b
0 1 3
1 2 4
2 1 3
3 2 5
In [32]: df.drop_duplicates()
Out[32]:
a b
0 1 3
1 2 4
3 2 5
You can also provide the subset
keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
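For example, using the same DataFrame as above, uniqueness can be judged on column 'a' alone (the first row for each value of 'a' is kept):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [3, 4, 3, 5]})

# Duplicates are determined by column 'a' only; row 3 (a=2, b=5)
# is dropped because a=2 was already seen in row 1
unique_a = df.drop_duplicates(subset=['a'])
print(unique_a)
```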
Unique values of two columns for pandas dataframe
You need groupby
+ size
+ Series.reset_index:
df = df.groupby(['Col1', 'Col2']).size().reset_index(name='Freq')
print (df)
Col1 Col2 Freq
0 1 1 1
1 1 2 3
2 3 4 2
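The input DataFrame isn't shown above; here is a minimal, assumed reconstruction that reproduces the same output:

```python
import pandas as pd

# Assumed input data chosen to reproduce the output above
df = pd.DataFrame({'Col1': [1, 1, 1, 1, 3, 3],
                   'Col2': [1, 2, 2, 2, 4, 4]})

# size() counts rows per (Col1, Col2) pair; reset_index turns the
# resulting Series back into a DataFrame with the count named 'Freq'
out = df.groupby(['Col1', 'Col2']).size().reset_index(name='Freq')
print(out)
```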
Python pandas create multiple columns based on unique values of one column
Here is one way to do it:
df.pivot(index=['account_id','conversions'], columns='campaign_objective', values='campaign_spend')
campaign_objective brand sales
account_id conversions
1 25 50 100
2 12 60 80
With reset_index:
df.pivot(index=['account_id','conversions'], columns='campaign_objective', values='campaign_spend').reset_index()
campaign_objective account_id conversions brand sales
0 1 25 50 100
1 2 12 60 80
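The input DataFrame isn't given; a runnable sketch with assumed data matching the output above (note that passing a list of columns to the index argument of pivot requires pandas 1.1 or later):

```python
import pandas as pd

# Assumed input data matching the output above
df = pd.DataFrame({'account_id': [1, 1, 2, 2],
                   'conversions': [25, 25, 12, 12],
                   'campaign_objective': ['brand', 'sales', 'brand', 'sales'],
                   'campaign_spend': [50, 100, 60, 80]})

# Each unique campaign_objective becomes a column holding the spend
out = (df.pivot(index=['account_id', 'conversions'],
                columns='campaign_objective',
                values='campaign_spend')
         .reset_index())
print(out)
```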
Replace unique values of multiple columns with a reference
Since the specific mapping does not matter, we can use np.unique
to get the unique values from multiple columns, zip
them with AlternativeNames
to create a mapper, and then use DataFrame.replace
to apply the mapping:
AlternativeNames = ["Batman", "Superman", "Spiderman", "Batman's butler"]
mapper = dict(zip(np.unique(df[['col1', 'col2']]), AlternativeNames))
df = df.replace(mapper)
df:
col1 col2
0 Superman Batman
1 Spiderman Superman
2 Batman's butler Spiderman
mapper:
{
'Alfred Pennyworth': 'Batman',
'Bruce Wayne': 'Superman',
'Clark Kent': 'Spiderman',
'Peter Parker': "Batman's butler"
}
DataFrame and imports:
import numpy as np
import pandas as pd
data = {'col1': ["Bruce Wayne", "Clark Kent", "Peter Parker"],
'col2': ["Alfred Pennyworth", "Bruce Wayne", "Clark Kent"]}
df = pd.DataFrame(data=data)
How to get unique values from multiple columns in a pandas groupby
You can do it with apply
(note the double brackets: selecting the columns with a bare tuple, as in ['l1','l2'], is no longer supported in pandas 2.0+):
import numpy as np
g = df.groupby('c')[['l1', 'l2']].apply(lambda x: list(np.unique(x)))
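The source data isn't shown; a minimal sketch with assumed data to illustrate the result, one sorted list of distinct values across both columns per group:

```python
import numpy as np
import pandas as pd

# Assumed input data for illustration
df = pd.DataFrame({'c': [1, 1, 2],
                   'l1': ['a', 'b', 'a'],
                   'l2': ['b', 'c', 'd']})

# np.unique flattens the two-column group and returns sorted
# distinct values; the result is a Series of lists indexed by 'c'
g = df.groupby('c')[['l1', 'l2']].apply(lambda x: list(np.unique(x)))
print(g)
```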
Generate id of unique values from two columns in pandas
Stack
the columns to reshape, then factorize
to encode the categorical values as numbers, and finally unstack
and join
with the original DataFrame:
s = df[['orig', 'dest']].stack()
s[:] = s.factorize()[0] + 1
s.unstack(1).add_suffix('_id').join(df)
orig_id dest_id orig dest count
0 1 2 INOA AFXR 100
1 2 1 AFXR INOA 50
2 3 1 GUTR INOA 1
3 4 5 AREB GAPR 5
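Putting the steps together as a runnable sketch, with the input DataFrame assumed from the output above:

```python
import pandas as pd

# Assumed input data matching the output above
df = pd.DataFrame({'orig': ['INOA', 'AFXR', 'GUTR', 'AREB'],
                   'dest': ['AFXR', 'INOA', 'INOA', 'GAPR'],
                   'count': [100, 50, 1, 5]})

s = df[['orig', 'dest']].stack()   # one long Series of all codes
s[:] = s.factorize()[0] + 1        # same code -> same integer id, starting at 1
out = s.unstack(1).add_suffix('_id').join(df)
print(out)
```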
pandas dataframe group by multiple columns and count distinct values
Combine value_counts
with apply
to do it per column:
df.apply(pd.value_counts)
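A small illustration with hypothetical data; since pd.value_counts is deprecated in recent pandas versions, the equivalent bound-method form is used here:

```python
import pandas as pd

# Hypothetical data: count occurrences of each value in each column
df = pd.DataFrame({'a': ['x', 'x', 'y'],
                   'b': ['y', 'y', 'y']})

# Equivalent to df.apply(pd.value_counts); values absent from a
# column show up as NaN in that column's counts
counts = df.apply(lambda col: col.value_counts())
print(counts)
```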