python pandas remove duplicate columns
Here's a one-line solution to remove columns based on duplicate column names:
df = df.loc[:,~df.columns.duplicated()].copy()
How it works:
Suppose the columns of the data frame are ['alpha', 'beta', 'alpha'].
df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False, the column name is unique up to that point; if it is True, the column name is duplicated earlier. For example, using the given example, the returned value would be [False, False, True].
Pandas allows one to index using boolean values, whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (i.e. [True, True, False] = ~[False, False, True]).
Finally, df.loc[:, [True, True, False]] selects only the non-duplicated columns using the aforementioned indexing capability.
The final .copy() is there to copy the dataframe to (mostly) avoid getting errors about trying to modify an existing dataframe later down the line.
Note: the above only checks column names, not column values.
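For instance, a quick demonstration (a minimal sketch with made-up data):
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['alpha', 'beta', 'alpha'])
print(df.columns.duplicated())  # [False False True]
print(df.loc[:, ~df.columns.duplicated()].copy())
#    alpha  beta
# 0      1     2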
To remove duplicated indexes
Since it is similar enough, do the same thing on the index:
df = df.loc[~df.index.duplicated(),:].copy()
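For example (a small sketch with a deliberately duplicated index):
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'a'])
print(df.loc[~df.index.duplicated(), :].copy())
#    x
# a  1
# b  2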
To remove duplicates by checking values without transposing
df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()
This avoids the issue of transposing. Is it fast? No. Does it work? Yeah. Here, try it on this:
import numpy as np
import pandas as pd

# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0, 100, size=(736334, 1312)))

# to see size in gigs
# ldf.memory_usage().sum()/1e9  # it's about 3 gigs

# duplicate a column
ldf.loc[:, 'dup'] = ldf.loc[:, 101]

# take out duplicated columns by values
ldf = ldf.loc[:, ~ldf.apply(lambda x: x.duplicated(), axis=1).all()].copy()
Fast method for removing duplicate columns in pandas.DataFrame
You may use np.unique to get the indices of unique columns, and then use .iloc:
>>> df
A A B B
0 5 5 10 10
1 6 6 19 19
>>> _, i = np.unique(df.columns, return_index=True)
>>> df.iloc[:, i]
A B
0 5 10
1 6 19
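One caveat: np.unique returns its results sorted, so the surviving columns come back ordered by name rather than by position. If you want to keep the original column order, sorting the indices first should do it (a minimal sketch):
import numpy as np

_, i = np.unique(df.columns, return_index=True)
df = df.iloc[:, np.sort(i)]  # first occurrences, in their original order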
How to remove duplicate columns as rows for pandas df
IIUC, you can use a manual reshaping with a MultiIndex:
cols = ['a', 'b']
out = (df
.set_index(cols)
.pipe(lambda d: d.set_axis(d.columns.str.split('_dup', expand=True), axis=1))
.stack()
.droplevel(-1).reset_index()
)
output:
a b c d
0 hello bye 1 5
1 hello bye 2 6
2 hello bye 3 7
3 hello bye 4 8
used input:
a b c c_dup1 c_dup2 c_dup3 d d_dup1 d_dup2 d_dup3
0 hello bye 1 2 3 4 5 6 7 8
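For reference, that input can be rebuilt directly from the table above:
import pandas as pd

df = pd.DataFrame(
    [['hello', 'bye', 1, 2, 3, 4, 5, 6, 7, 8]],
    columns=['a', 'b', 'c', 'c_dup1', 'c_dup2', 'c_dup3',
             'd', 'd_dup1', 'd_dup2', 'd_dup3'],
)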
For a programmatic way of getting a/b as the only columns that do not have an equivalent with '_dup', you can use:
import re
target = df.columns.str.extract('(.*)_dup', expand=False).dropna().unique()
# Index(['c', 'd'], dtype='object')
regex = fr"^({'|'.join(map(re.escape, target))})"
# ^(c|d)
cols = list(df.columns[~df.columns.str.contains(regex)])
# ['a', 'b']
NB. there might be limitations if there are overlapping prefixes (e.g. ABC/ABCD)
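If that edge case matters, anchoring the pattern on both sides should be safer; here is a sketch using fullmatch on the target list from above, assuming the duplicate suffixes always look like _dupN:
regex = fr"({'|'.join(map(re.escape, target))})(_dup\d+)?"
cols = list(df.columns[~df.columns.str.fullmatch(regex)])
# a column like 'ABCD' no longer matches a target of 'ABC'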
removing duplicate column values from pandas dataframe
Create a mask checking if the column is equal to itself shifted, then fill the missing values:
cols = [x for x in df.columns if x.startswith('col')]
#@AndyL. points out this equivalent mask is far simpler
m = df[cols].ne(df[cols].shift())
df[cols] = df[cols].astype('O').where(m).fillna('')
date field1 field2 col1 col2 col3 col5
0 20200508062904.8340+0530 11 22 2 3 3 4
1 20200508062904.8340+0530 12 23
2 20200508062904.8340+0530 13 22
3 20200508062904.8340+0530 14 24
4 20200508051804.8340+0530 14 24 5
5 20200508051804.8340+0530 14 24 4 4
6 20200508051804.8340+0530 14 24 3
Previously used the unnecessarily complicated mask:
m = ~df[cols].ne(df[cols].shift()).cumsum().apply(pd.Series.duplicated)
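To see what the mask does, here is a toy single-column version with made-up values:
import pandas as pd

s = pd.Series([2, 2, 2, 5, 4, 4])
m = s.ne(s.shift())  # True where the value differs from the row above
print(s.astype('O').where(m).fillna(''))  # 2, '', '', 5, 4, ''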
Remove duplicates when values are swapped in columns and give a count
IIUC, you could use a frozenset as grouper:
group = df[['Col1', 'Col2']].agg(frozenset, axis=1)
(df
.groupby(group, as_index=False) # you can also group by [group, 'Score']
.agg(**{c: (c, 'first') for c in df},
Duplicates=('Score', 'count'),
)
)
output:
Col1 Col2 Score Duplicates
0 A B 0.6 3
1 A C 0.8 2
2 D E 0.9 1
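For context, here is a hypothetical input that reproduces this output (the question's actual data isn't shown here); rows 0-2 all describe the pair {A, B}, so they collapse into one row with Duplicates=3:
import pandas as pd

df = pd.DataFrame({
    'Col1':  ['A', 'B', 'A', 'A', 'C', 'D'],
    'Col2':  ['B', 'A', 'B', 'C', 'A', 'E'],
    'Score': [0.6, 0.6, 0.6, 0.8, 0.8, 0.9],
})
Note that the row order of the grouped result may vary, since frozensets have no natural sort order.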
Removing columns containing duplicated data from a pandas dataframe?
You can do that with DataFrame.duplicated; use keep in order to keep the first or last of the duplicated columns:
df.loc[:,~df.T.duplicated(keep='first')]
Column A Column B Column D Column E
0 1.0 7 13 13
1 2.0 8 14 13
2 3.0 9 15 13
3 4.0 10 16 13
4 NaN 11 17 13
5 6.0 12 1 13
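As a rough reconstruction of the input (assuming the dropped Column C duplicated Column B's values):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1.0, 2.0, 3.0, 4.0, np.nan, 6.0],
    'Column B': [7, 8, 9, 10, 11, 12],
    'Column C': [7, 8, 9, 10, 11, 12],  # assumed duplicate of Column B
    'Column D': [13, 14, 15, 16, 17, 1],
    'Column E': [13, 13, 13, 13, 13, 13],
})
print(df.loc[:, ~df.T.duplicated(keep='first')])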
remove duplicate columns from pandas read excel dataframe
IIUC, you can first remove the numbers after the dot and then keep only the last duplicates:
df.loc[:, ~df.columns.str.replace(r'\.\d+', '', regex=True).duplicated(keep='last')]
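As a quick illustration, simulating how pandas mangles duplicate headers on read (e.g. 'A' and 'A.1'):
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['A', 'A.1', 'B'])
out = df.loc[:, ~df.columns.str.replace(r'\.\d+', '', regex=True).duplicated(keep='last')]
print(list(out.columns))  # ['A.1', 'B'] -- the last duplicate is kept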