How can I remove all duplicates so that NONE are left in a data frame?
This will extract the rows which appear only once (assuming your data frame is named df
):
df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]
How it works: The function duplicated
tests whether a line appears at least for the second time starting at line one. If the argument fromLast = TRUE
is used, the function starts at the last line.
Boths boolean results are combined with |
(logical 'or') into a new vector which indicates all lines appearing more than once. The result of this is negated using !
thereby creating a boolean vector indicating lines appearing only once.
Delete all duplicated rows in R
You can use table()
to get a frequency table of your column, then use the result to subset:
singletons <- names(which(table(test$a) == 1))
test[test$a %in% singletons, ]
a b c
1 1 2 a
2 4 5 b
how do I remove rows with duplicate values of columns in pandas data frame?
Using drop_duplicates
with subset
with list of columns to check for duplicates on and keep='first'
to keep first of duplicates.
If dataframe
is:
df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
'Column2': ["'bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Then:
result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
Removing duplicates from Pandas rows, replace them with NaNs, shift NaNs to end of rows
You can stack
and then drop_duplicates
that way. Then we need to pivot with the help of a cumcount
level. The stack
preserves the order the values appear in along the rows and the cumcount
ensures that the NaN
will appear in the end.
df1 = df.stack().reset_index().drop(columns='level_1').drop_duplicates()
df1['col'] = df1.groupby('level_0').cumcount()
df1 = (df1.pivot(index='level_0', columns='col', values=0)
.rename_axis(index=None, columns=None))
0 1 2 3
0 A B C D
1 A D C NaN
2 C B NaN NaN
3 B A NaN NaN
Timings
Assuming 4 columns, let's see how a bunch of these methods compare as the number of rows grow. The map
and apply
solutions have a good advantage when things are small, but they become a bit slower than the more involved stack
+ drop_duplicates
+ pivot
solution as the DataFrame gets longer. Regardless, they all start to take a while for a large DataFrame.
import perfplot
import pandas as pd
import numpy as np
def stack(df):
df1 = df.stack().reset_index().drop(columns='level_1').drop_duplicates()
df1['col'] = df1.groupby('level_0').cumcount()
df1 = (df1.pivot(index='level_0', columns='col', values=0)
.rename_axis(index=None, columns=None))
return df1
def apply_drop_dup(df):
return pd.DataFrame.from_dict(df.apply(lambda x: x.drop_duplicates().tolist(),
axis=1).to_dict(), orient='index')
def apply_unique(df):
return pd.DataFrame(df.apply(pd.Series.unique, axis=1).tolist())
def list_map(df):
return pd.DataFrame(list(map(pd.unique, df.values)))
perfplot.show(
setup=lambda n: pd.DataFrame(np.random.choice(list('ABCD'), (n, 4)),
columns=list('abcd')),
kernels=[
lambda df: stack(df),
lambda df: apply_drop_dup(df),
lambda df: apply_unique(df),
lambda df: list_map(df),
],
labels=['stack', 'apply_drop_dup', 'apply_unique', 'list_map'],
n_range=[2 ** k for k in range(18)],
equality_check=lambda x,y: x.compare(y).empty,
xlabel='~len(df)'
)
Finally, if preserving the order in which the values originally appeared within each row is unimportant, you can use numpy
. To de-duplicate you sort then check for differences. Then create an output array that shifts values to the right. Because this method will always return 4 columns, we require a dropna
to match the other output in the case that every row has fewer than 4 unique values.
def with_numpy(df):
arr = np.sort(df.to_numpy(), axis=1)
r = np.roll(arr, 1, axis=1)
r[:, 0] = np.NaN
arr = np.where((arr != r), arr, np.NaN)
# Move all NaN to the right. Credit @Divakar
mask = pd.notnull(arr)
justified_mask = np.flip(np.sort(mask, axis=1), 1)
out = np.full(arr.shape, np.NaN, dtype=object)
out[justified_mask] = arr[mask]
return pd.DataFrame(out, index=df.index).dropna(how='all', axis='columns')
with_numpy(df)
# 0 1 2 3
#0 A B C D
#1 A C D NaN
#2 B C NaN NaN # B/c this method sorts, B before C
#3 A B NaN NaN
perfplot.show(
setup=lambda n: pd.DataFrame(np.random.choice(list('ABCD'), (n, 4)),
columns=list('abcd')),
kernels=[
lambda df: stack(df),
lambda df: with_numpy(df),
],
labels=['stack', 'with_numpy'],
n_range=[2 ** k for k in range(3, 22)],
# Lazy check to deal with string/NaN and irrespective of sort order.
equality_check=lambda x, y: (np.sort(x.fillna('ZZ').to_numpy(), 1)
== np.sort(y.fillna('ZZ').to_numpy(), 1)).all(),
xlabel='len(df)'
)
Remove duplicated rows
just isolate your data frame to the columns you need, then use the unique function :D
# in the above example, you only need the first three columns
deduped.data <- unique( yourdata[ , 1:3 ] )
# the fourth column no longer 'distinguishes' them,
# so they're duplicates and thrown out.
Looking to remove both rows if duplicated in a column using dplyr
Here's one way using dplyr
-
df %>%
group_by(id) %>%
filter(n() == 1) %>%
ungroup()
# A tibble: 5 x 2
id award_amount
<chr> <dbl>
1 1-2 3000
2 1-4 5881515
3 1-5 155555
4 1-9 750000
5 1-22 3500000
Remove all rows containing the same elements from dataframe, even though elements are in different order in R
I added a little data to show that the matrix format remains
mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 10, 12, 13, 14, 5, 9, 6, 6, 4, 1, 4, 11, 13, 12, 15), .Dim = c(11L, 2L))
dup <- duplicated(rbind(mymat, mymat[, c(2, 1)]))
dup_fromLast <- duplicated(rbind(mymat, mymat[, c(2, 1)]), fromLast = TRUE)
mymat_duprm <- mymat[!(dup_fromLast | dup)[1:(length(dup) / 2)], ]
mymat_duprm
Related Topics
Deleting Rows in R Based on Values Over Multiple Columns
Regex to Replace Comma to Dot Separator
Splitting a Dataframe into Several Dataframes
How to Get to the Next Line in the R Command Prompt Without Executing
Concatenate String Columns and Order in Alphabetical Order
How to Show Code But Hide Output in Rmarkdown
How to Change the Default Colors in Plotly Chart
Removing All Empty Columns and Rows in Data.Frame When Rows Don't Go Away
Add Column Values Based on Other Columns in Data Frame Using for and If
Error in Confusionmatrix the Data and Reference Factors Must Have the Same Number of Levels
Find Duplicated Elements With Dplyr
Mapping Columns/Rows from One Dataframe to Another Based on Row Number
Using Ggplot2, How to Insert a Break in the Axis
Ggplot2 Stacked Bar Chart - Each Bar Being 100% and With Percenage Labels Inside Each Bar
Multiplying All Columns in Dataframe by Single Column
Duplicating Rows in R Merge Function
Split an Audio File into Pieces of an Arbitrary Size
How to Convert a List Consisting of Vector of Different Lengths to a Usable Data Frame in R