Identifying Duplicate Columns in a Dataframe

python pandas remove duplicate columns

Here's a one line solution to remove columns based on duplicate column names:

df = df.loc[:,~df.columns.duplicated()].copy()

How it works:

Suppose the columns of the data frame are ['alpha','beta','alpha']

df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False then the column name is unique up to that point, if it is True then the column name is duplicated earlier. For example, using the given example, the returned value would be [False,False,True].

Pandas allows one to index using boolean values whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (ie [True, True, False] = ~[False,False,True])

Finally, df.loc[:,[True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.

The final .copy() is there to copy the dataframe to (mostly) avoid getting errors about trying to modify an existing dataframe later down the line.

Note: the above only checks columns names, not column values.

To remove duplicated indexes

Since it is similar enough, do the same thing on the index:

df = df.loc[~df.index.duplicated(),:].copy()

To remove duplicates by checking values without transposing

df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()

This avoids the issue of transposing. Is it fast? No. Does it work? Yeah. Here, try it on this:

# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0,100,size= (736334,1312))) 


#to see size in gigs
#ldf.memory_usage().sum()/1e9 #it's about 3 gigs

# duplicate a column
ldf.loc[:,'dup'] = ldf.loc[:,101]

# take out duplicated columns by values
ldf = ldf.loc[:,~ldf.apply(lambda x: x.duplicated(),axis=1).all()].copy()

Find all duplicate columns in a collection of data frames

pd.Series.duplicated

Since you are using Pandas, you can use pd.Series.duplicated after concatenating column names:

# concatenate column labels
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)])

# keep all duplicates only, then extract unique names
res = s[s.duplicated(keep=False)].unique()

print(res)
array(['b', 'e'], dtype=object)

pd.Series.value_counts

Alternatively, you can extract a series of counts and identify rows which have a count greater than 1:

s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)]).value_counts()

res = s[s > 1].index

print(res)
Index(['e', 'b'], dtype='object')

collections.Counter

The classic Python solution is to use collections.Counter followed by a list comprehension. Recall that list(df) returns the columns in a dataframe, so we can use this map and itertools.chain to produce an iterable to feed Counter.

from itertools import chain
from collections import Counter

c = Counter(chain.from_iterable(map(list, (df1, df2, df3))))

res = [k for k, v in c.items() if v > 1]

Identifying the columns having duplicate column value with Different column name in python

Do you mean by:

s = df.T.duplicated().reset_index()
vals = s.loc[s[0], 'index'].tolist()
colk = df.columns.drop(vals)
print(vals)
print(colk)
print(df.drop(vals, axis=1))

Output:

['name_dup', 'age_dup']
['id', 'name', 'age']
  id name  age
0  1    A    1
1  2    B    2
2  2    B    2
3  3    C    3
4  3    D    3

Check for duplicate values in Pandas dataframe column

Main question

Is there a duplicate value in a column, True/False?

╔═════════╦═══════════════╗
║ Student ║ Date          ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob     ║ April 2018    ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2018 ║
╚═════════╩═══════════════╝

Assuming above dataframe (df), we could do a quick check if duplicated in the Student col by:

boolean = not df["Student"].is_unique      # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True

Example to play around with

import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True

# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)

Returns

True

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

How to identify that dataframe has duplicate column names in pandas?

If all of the columns with an additional .1 are not meant to be with .1, you could try:

print(len(df.columns) != len(df.columns.str.replace('.1$', '').drop_duplicates()))

Output:

True

With dataframes where the columns are not duplicated, it would give False.

Note: It gives True for dataframes with duplicate columns and gives False for dataframes without duplicate columns.

How do you filter duplicate columns in a dataframe based on a value in another column

IIUC, you want to keep all rows if Code is not equal to 10 but drop the first of duplicates otherwise, right? Then you could add that into the boolean mask:

cols = ['NID', 'Lact', 'Code']
out = df[~df.duplicated(cols, keep=False) | df.duplicated(cols) | df['Code'].ne(10)]

Output:

  NID  Lact  Code
2   1     1     0
3   1     1    10
4   1     2     0
5   2     2     0
6   2     2    10
7   1     1     0