Python pandas remove duplicate columns
Here's a one-line solution to remove columns based on duplicate column names:
df = df.loc[:,~df.columns.duplicated()].copy()
How it works:
Suppose the columns of the data frame are ['alpha','beta','alpha'].
df.columns.duplicated() returns a boolean array: a True or False for each column name. False means the name is unique up to that point; True means the name was duplicated earlier. For the given example, the returned value would be [False, False, True].
Pandas allows one to index using boolean arrays, whereby it selects only the True positions. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (i.e. ~[False, False, True] gives [True, True, False]).
Finally, df.loc[:, [True, True, False]] selects only the non-duplicated columns using the aforementioned boolean indexing. The final .copy() copies the dataframe to (mostly) avoid getting warnings about modifying a view of an existing dataframe later down the line.
Note: the above only checks columns names, not column values.
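A minimal runnable sketch of the one-liner above, using the ['alpha','beta','alpha'] example (the cell values are made up for illustration):

```python
import pandas as pd

# Frame with a duplicated column name 'alpha'
df = pd.DataFrame([[1, 2, 3]], columns=['alpha', 'beta', 'alpha'])
print(df.columns.duplicated())        # [False False  True]

# Keep only the first occurrence of each column name
df = df.loc[:, ~df.columns.duplicated()].copy()
print(list(df.columns))               # ['alpha', 'beta']
```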
To remove duplicated indexes
Since it is similar enough, do the same thing on the index:
df = df.loc[~df.index.duplicated(),:].copy()
To remove duplicates by checking values without transposing
df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()
This avoids the issue of transposing. Is it fast? No. Does it work? Yes. Here, try it on this:
import numpy as np
import pandas as pd
# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0, 100, size=(736334, 1312)))
# to see size in gigs
# ldf.memory_usage().sum()/1e9  # it's about 3 gigs
# duplicate a column
ldf.loc[:, 'dup'] = ldf.loc[:, 101]
# take out duplicated columns by values
ldf = ldf.loc[:, ~ldf.apply(lambda x: x.duplicated(), axis=1).all()].copy()
Find all duplicate columns in a collection of data frames
pd.Series.duplicated
Since you are using Pandas, you can use pd.Series.duplicated after concatenating the column names:
# concatenate column labels
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)])
# keep all duplicates only, then extract unique names
res = s[s.duplicated(keep=False)].unique()
print(res)
array(['b', 'e'], dtype=object)
pd.Series.value_counts
Alternatively, you can extract a series of counts and identify rows which have a count greater than 1:
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)]).value_counts()
res = s[s > 1].index
print(res)
Index(['e', 'b'], dtype='object')
collections.Counter
The classic Python solution is to use collections.Counter followed by a list comprehension. Recall that list(df) returns the columns of a dataframe, so we can use this with map and itertools.chain to produce an iterable to feed Counter.
from itertools import chain
from collections import Counter
c = Counter(chain.from_iterable(map(list, (df1, df2, df3))))
res = [k for k, v in c.items() if v > 1]
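The approaches above assume some frames df1, df2, df3 that the scrape did not include; a hypothetical setup where 'b' and 'e' are the shared column names, showing two of the approaches:

```python
import pandas as pd
from itertools import chain
from collections import Counter

# Hypothetical frames; only the column labels matter here
df1 = pd.DataFrame(columns=['a', 'b'])
df2 = pd.DataFrame(columns=['b', 'e'])
df3 = pd.DataFrame(columns=['e', 'c'])

# pd.Series.duplicated on concatenated labels
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)])
print(s[s.duplicated(keep=False)].unique())   # ['b' 'e']

# collections.Counter over all column lists
c = Counter(chain.from_iterable(map(list, (df1, df2, df3))))
print([k for k, v in c.items() if v > 1])     # ['b', 'e']
```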
Identifying columns that have duplicate values under different column names in Python
Do you mean something like this:
s = df.T.duplicated().reset_index()
vals = s.loc[s[0], 'index'].tolist()
colk = df.columns.drop(vals)
print(vals)
print(colk)
print(df.drop(vals, axis=1))
Output:
['name_dup', 'age_dup']
['id', 'name', 'age']
id name age
0 1 A 1
1 2 B 2
2 2 B 2
3 3 C 3
4 3 D 3
Check for duplicate values in Pandas dataframe column
Main question
Is there a duplicate value in a column, True/False?
╔═════════╦═══════════════╗
║ Student ║ Date ║
╠═════════╬═══════════════╣
║ Joe ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob ║ April 2018 ║
╠═════════╬═══════════════╣
║ Joe ║ December 2018 ║
╚═════════╩═══════════════╝
Assuming the above dataframe (df), we could do a quick check for duplicates in the Student column by:
boolean = not df["Student"].is_unique # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True
Further reading and references
Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:
- drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
- duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.
These methods can be applied on the DataFrame as a whole, and not just a Series (column) as above. The equivalent would be:
boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.
However, if we are interested in the whole frame we could go ahead and do:
boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise
# ie. Joe Dec 2017, Joe Dec 2018
And a final useful tip: by using the keep parameter we can normally skip a few rows, directly accessing what we need:
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
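A quick illustration of what each keep value returns, on a toy Student series (hypothetical values):

```python
import pandas as pd

s = pd.Series(['Joe', 'Bob', 'Joe'], name='Student')
print(s.drop_duplicates(keep='first').tolist())  # ['Joe', 'Bob']
print(s.drop_duplicates(keep='last').tolist())   # ['Bob', 'Joe']
print(s.drop_duplicates(keep=False).tolist())    # ['Bob']
```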
Example to play around with
import pandas as pd
import io
data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''
df = pd.read_csv(io.StringIO(data), sep=',')
# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True
# Approach 2: First store boolean array, check, then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')
# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)
Returns
True
Student Date
0 Joe December 2017
1 Bob April 2018
Student Date
0 Joe December 2017
1 Bob April 2018
How to identify that dataframe has duplicate column names in pandas?
If all of the columns with an additional .1 suffix are not meant to have it, you could try:
print(len(df.columns) != len(df.columns.str.replace(r'\.1$', '', regex=True).drop_duplicates()))
Output:
True
It gives True for dataframes with duplicate columns and False for dataframes without duplicate columns.
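For example, on a hypothetical frame whose second column named 'a' was auto-renamed to 'a.1' on read (the regex is escaped so the dot is literal, and regex=True is needed in recent pandas):

```python
import pandas as pd

# 'a.1' stands in for the suffix pandas adds to a duplicate column on read
df = pd.DataFrame([[1, 2, 3]], columns=['a', 'a.1', 'b'])
stripped = df.columns.str.replace(r'\.1$', '', regex=True)
print(len(df.columns) != len(stripped.drop_duplicates()))  # True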
How do you filter duplicate columns in a dataframe based on a value in another column
IIUC, you want to keep all rows where Code is not equal to 10, but otherwise drop the first occurrence of each duplicate, right? Then you could add that condition into the boolean mask:
cols = ['NID', 'Lact', 'Code']
out = df[~df.duplicated(cols, keep=False) | df.duplicated(cols) | df['Code'].ne(10)]
Output:
NID Lact Code
2 1 1 0
3 1 1 10
4 1 2 0
5 2 2 0
6 2 2 10
7 1 1 0
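A guess at input data that reproduces the output above (the hypothetical first two rows are the first occurrences of duplicate groups with Code equal to 10, so they are the only ones dropped):

```python
import pandas as pd

# Hypothetical input reconstructed to match the shown output
df = pd.DataFrame({
    'NID':  [1, 2, 1, 1, 1, 2, 2, 1],
    'Lact': [1, 2, 1, 1, 2, 2, 2, 1],
    'Code': [10, 10, 0, 10, 0, 0, 10, 0],
})
cols = ['NID', 'Lact', 'Code']
# Keep a row if it is not duplicated at all, is a non-first duplicate,
# or has Code != 10
out = df[~df.duplicated(cols, keep=False) | df.duplicated(cols) | df['Code'].ne(10)]
print(out.index.tolist())  # [2, 3, 4, 5, 6, 7]
```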
Identifying duplicate columns in a dataframe
You can do this with lapply:
testframe[!duplicated(lapply(testframe, summary))]
summary summarizes the distribution while ignoring the order.
It is not 100% reliable, but I would use digest if the data is huge:
library(digest)
testframe[!duplicated(lapply(testframe, digest))]
Find all duplicate columns in a pandas dataframe and then group them by key
You could group by using all columns but the first (because it corresponds to the original columns names) and then build the expected result using a dictionary comprehension and extended iterable unpacking:
import pandas as pd
df = pd.DataFrame({'col1': [0, 1, 2, 3, 4], 'col2': [1, 0, 0, 0, 1], 'col3': [1, 0, 0, 0, 1], 'col4': [1, 0, 1, 0, 1],
'col5': [1, 0, 1, 0, 1], 'col6': [1, 1, 1, 0, 1], 'col7': [1, 0, 0, 0, 1]})
transpose = df.T
# build the list of value columns (all but the first, which holds the original column names)
columns = list(range(len(df)))
# build result iterating over groups
result = {head: tail for _, (head, *tail) in transpose.reset_index().groupby(columns).index}
print(result)
Output
{'col1': [], 'col4': ['col5'], 'col6': [], 'col2': ['col3', 'col7']}
How do I get a list of all the duplicate items using pandas in python?
Method #1: print all rows where the ID is one of the IDs in duplicated:
>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
but I couldn't think of a nice way to prevent repeating the ids so many times. I prefer method #2: groupby on the ID.
>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
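A shorter equivalent of method #2 uses duplicated with keep=False directly on the ID column (sketched on hypothetical stand-in data, since dup.csv is not shown):

```python
import pandas as pd

# Hypothetical stand-in for dup.csv
df = pd.DataFrame({
    'ID': [11795, 8096, 11795, 'A036'],
    'ENROLLMENT_DATE': ['27-Feb-12', '19-Dec-11', '3-Jul-12', '30-Nov-11'],
})
# keep=False marks every occurrence of a repeated ID, not just later ones
dups = df[df.duplicated('ID', keep=False)].sort_values('ID')
print(dups['ID'].tolist())  # [11795, 11795]
```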