Python Pandas Remove Duplicate Columns

Here's a one-line solution to remove columns based on duplicate column names:

df = df.loc[:,~df.columns.duplicated()].copy()

How it works:

Suppose the columns of the data frame are ['alpha','beta','alpha']

df.columns.duplicated() returns a boolean array: a True or False for each column. False means the column name is unique up to that point; True means it duplicates an earlier one. For the example above, the returned value would be [False, False, True].

Pandas allows indexing with boolean arrays, selecting only the positions that are True. Since we want to keep the unduplicated columns, the boolean array above needs to be flipped (i.e. [True, True, False] = ~[False, False, True]).

Finally, df.loc[:,[True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.

The final .copy() is there to copy the dataframe and (mostly) avoid errors about trying to modify a view of an existing dataframe later down the line.

Note: the above only checks column names, not column values.
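
A minimal runnable sketch of the above (the column names mirror the example; the data is made up):

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['alpha', 'beta', 'alpha'])

print(df.columns.duplicated())   # [False False  True]

# flip the mask and keep only the first occurrence of each name
df = df.loc[:, ~df.columns.duplicated()].copy()
print(df.columns.tolist())       # ['alpha', 'beta']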

To remove duplicated indexes

Since it is similar enough, do the same thing on the index:

df = df.loc[~df.index.duplicated(),:].copy()
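
For instance (a tiny made-up frame with a repeated index label):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'a'])
df = df.loc[~df.index.duplicated(), :].copy()

print(df.index.tolist())   # ['a', 'b'] -- keeps the first occurrence of each label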

To remove duplicates by checking values without transposing

df = df.loc[:, ~df.apply(lambda x: x.duplicated(), axis=1).all()].copy()

This avoids the issue of transposing. Is it fast? No. Does it work? Yeah. Here, try it on this:

import numpy as np
import pandas as pd

# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0, 100, size=(736334, 1312)))

# to see the size in GB
# ldf.memory_usage().sum() / 1e9  # it's about 3 GB

# duplicate a column
ldf.loc[:, 'dup'] = ldf.loc[:, 101]

# take out duplicated columns by values
ldf = ldf.loc[:, ~ldf.apply(lambda x: x.duplicated(), axis=1).all()].copy()

Fast method for removing duplicate columns in pandas.DataFrame

You may use np.unique to get the indices of the first occurrence of each unique column name, and then use .iloc:

>>> df
   A  A   B   B
0  5  5  10  10
1  6  6  19  19
>>> _, i = np.unique(df.columns, return_index=True)
>>> df.iloc[:, i]
   A   B
0  5  10
1  6  19
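
One caveat worth knowing (my note, not part of the original answer): np.unique returns labels in sorted order, so the surviving columns come back alphabetically rather than in their original positions. Sorting the recovered indices restores the original order:

import numpy as np
import pandas as pd

df = pd.DataFrame([[10, 10, 5]], columns=['B', 'B', 'A'])

_, i = np.unique(df.columns, return_index=True)
print(df.iloc[:, i].columns.tolist())           # ['A', 'B']  (alphabetical)
print(df.iloc[:, np.sort(i)].columns.tolist())  # ['B', 'A']  (original order)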

How to remove duplicate columns as rows for pandas df

IIUC, you can use a manual reshaping with a MultiIndex:

cols = ['a', 'b']

out = (df
       .set_index(cols)
       .pipe(lambda d: d.set_axis(d.columns.str.split('_dup', expand=True), axis=1))
       .stack()
       .droplevel(-1)
       .reset_index()
)

output:

       a    b  c  d
0  hello  bye  1  5
1  hello  bye  2  6
2  hello  bye  3  7
3  hello  bye  4  8

used input:

       a    b  c  c_dup1  c_dup2  c_dup3  d  d_dup1  d_dup2  d_dup3
0  hello  bye  1       2       3       4  5       6       7       8

For a programmatic way of getting a/b as the only columns that do not have a '_dup' equivalent, you can use:

import re
target = df.columns.str.extract('(.*)_dup', expand=False).dropna().unique()
# Index(['c', 'd'], dtype='object')

regex = fr"^({'|'.join(map(re.escape, target))})"
# ^(c|d)

cols = list(df.columns[~df.columns.str.contains(regex)])
# ['a', 'b']

NB. there might be limitations if there are overlapping prefixes (e.g. ABC/ABCD)
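
A possible workaround for that edge case (my sketch, not part of the original answer) is to match the whole label with str.fullmatch instead of only its prefix, so that e.g. ABC no longer swallows ABCD:

import re

# match the bare name or the name plus a _dupN suffix, and nothing else
regex = fr"({'|'.join(map(re.escape, target))})(_dup\d+)?"
cols = list(df.columns[~df.columns.str.fullmatch(regex)])
# ['a', 'b']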

removing duplicate column values from pandas dataframe

Create a mask checking where each column differs from itself shifted down a row, keep those values, and blank out the rest:

cols = [x for x in df.columns if x.startswith('col')]

# @AndyL. points out this equivalent mask is far simpler
m = df[cols].ne(df[cols].shift())

df[cols] = df[cols].astype('O').where(m).fillna('')


                       date  field1  field2 col1 col2 col3 col5
0  20200508062904.8340+0530      11      22    2    3    3    4
1  20200508062904.8340+0530      12      23
2  20200508062904.8340+0530      13      22
3  20200508062904.8340+0530      14      24
4  20200508051804.8340+0530      14      24    5
5  20200508051804.8340+0530      14      24         4    4
6  20200508051804.8340+0530      14      24              3

This answer previously used an unnecessarily complicated mask:

m = ~df[cols].ne(df[cols].shift()).cumsum().apply(pd.Series.duplicated)
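
Here is a minimal self-contained sketch of the simpler mask in action, on made-up data (the column names are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({'field1': [11, 12, 13],
                   'col1':   [2, 2, 5],
                   'col2':   [3, 3, 3]})

cols = [x for x in df.columns if x.startswith('col')]

m = df[cols].ne(df[cols].shift())   # True where the value changed
df[cols] = df[cols].astype('O').where(m).fillna('')

print(df)
#    field1 col1 col2
# 0      11    2    3
# 1      12
# 2      13    5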

Remove duplicates when values are swapped in columns and give a count

IIUC, you could use a frozenset as grouper:

group = df[['Col1', 'Col2']].agg(frozenset, axis=1)

(df
 .groupby(group, as_index=False)   # you can also group by [group, 'Score']
 .agg(**{c: (c, 'first') for c in df},
      Duplicates=('Score', 'count'))
)

output:

  Col1 Col2  Score  Duplicates
0    A    B    0.6           3
1    A    C    0.8           2
2    D    E    0.9           1
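
The question's input is not reproduced here, so the following is a hedged reconstruction with invented rows that yields the output above; the point is that frozenset makes (A, B) and (B, A) hash to the same group key:

import pandas as pd

# invented input, chosen to match the output shown above
df = pd.DataFrame({
    'Col1':  ['A', 'B', 'A', 'A', 'C', 'D'],
    'Col2':  ['B', 'A', 'B', 'C', 'A', 'E'],
    'Score': [0.6, 0.1, 0.2, 0.8, 0.5, 0.9],
})

group = df[['Col1', 'Col2']].agg(frozenset, axis=1)

out = (df
       .groupby(group, as_index=False)
       .agg(**{c: (c, 'first') for c in df},
            Duplicates=('Score', 'count'))
)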

Removing columns containing duplicated data from a pandas dataframe?

You can do that with DataFrame.duplicated on the transpose; use keep to choose whether the first or last of the duplicated columns survives:

df.loc[:,~df.T.duplicated(keep='first')]

   Column A  Column B  Column D  Column E
0       1.0         7        13        13
1       2.0         8        14        13
2       3.0         9        15        13
3       4.0        10        16        13
4       NaN        11        17        13
5       6.0        12         1        13
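
The input is not shown above; here is a hedged reconstruction (assuming Column C was an exact copy of Column B, which is why it is absent from the output):

import numpy as np
import pandas as pd

# assumed input: Column C duplicates Column B, so df.T.duplicated() flags it
df = pd.DataFrame({
    'Column A': [1.0, 2.0, 3.0, 4.0, np.nan, 6.0],
    'Column B': [7, 8, 9, 10, 11, 12],
    'Column C': [7, 8, 9, 10, 11, 12],
    'Column D': [13, 14, 15, 16, 17, 1],
    'Column E': [13, 13, 13, 13, 13, 13],
})

out = df.loc[:, ~df.T.duplicated(keep='first')]

Note that df.T materializes a full transpose (and mixes dtypes into object), so this can be slow on very wide frames.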

remove duplicate columns from pandas read excel dataframe

IIUC, you can first strip the .1, .2, ... suffixes that pandas appends to repeated header names, and then keep only the last duplicates:

df.loc[:, ~df.columns.str.replace(r'\.\d+', '', regex=True).duplicated(keep='last')]
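
For context, a runnable sketch of why this works (my construction, not from the original answer): when a header name repeats, read_excel/read_csv mangle the repeats to name.1, name.2, and so on, so stripping the suffix recovers the duplication:

import pandas as pd

# simulate what read_excel produces for a header row of: a, b, a, a
df = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'b', 'a.1', 'a.2'])

base = df.columns.str.replace(r'\.\d+', '', regex=True)   # ['a', 'b', 'a', 'a']
out = df.loc[:, ~base.duplicated(keep='last')]
print(out.columns.tolist())   # ['b', 'a.2']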

