Removing duplicate rows from data frame in R
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1), group by pmin(A, B) and pmax(A, B), and if a group has more than one row, take its first row; otherwise return the group's rows unchanged.
library(data.table)
setDT(df1)[, if(.N >1) head(.SD, 1) else .SD ,.(A=pmin(A, B), B= pmax(A, B))]
# A B prob
#1: 1 2 0.1
#2: 1 3 0.2
#3: 1 4 0.3
#4: 2 3 0.1
#5: 2 4 0.4
Or we can just use duplicated on the pmax, pmin output to return a logical index and subset the data based on that.
setDT(df1)[!duplicated(cbind(pmax(A, B), pmin(A, B)))]
# A B prob
#1: 1 2 0.1
#2: 1 3 0.2
#3: 1 4 0.3
#4: 2 3 0.1
#5: 2 4 0.4
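For readers working in pandas rather than data.table, the same pmin/pmax idea can be sketched with np.minimum and np.maximum. The sample values below are made up for illustration (the original df1 is not shown in full), so treat this as a sketch under that assumption rather than a reproduction of the answer's data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: (2, 1) is the same unordered pair as (1, 2).
df1 = pd.DataFrame({
    "A": [1, 2, 1, 1, 2, 2],
    "B": [2, 1, 3, 4, 3, 4],
    "prob": [0.1, 0.1, 0.2, 0.3, 0.1, 0.4],
})

# Element-wise min/max give an order-independent key for each row.
key = pd.DataFrame({
    "lo": np.minimum(df1["A"], df1["B"]),
    "hi": np.maximum(df1["A"], df1["B"]),
})

# Keep the first row of every unordered (A, B) pair.
deduped = df1[~key.duplicated()]
print(deduped)
```

The key frame shares its index with df1, so the boolean mask lines up row for row.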
How do I remove rows with duplicate values of columns in a pandas data frame?
Use drop_duplicates with subset set to the list of columns to check for duplicates, and keep='first' to keep the first row of each group of duplicates.
If the dataframe is:
import pandas as pd

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Then:
result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
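drop_duplicates also accepts keep='last' and keep=False. A small sketch on the same sample data (with the inner quotes dropped from the strings for brevity) shows how that choice changes which rows survive:

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": ["cat", "toy", "cat"],
    "Column2": ["bat", "flower", "bat"],
    "Column3": ["xyz", "abc", "lmn"],
})
cols = ["Column1", "Column2"]

# keep='last' retains the final occurrence of each duplicate group.
last_df = df.drop_duplicates(subset=cols, keep="last")

# keep=False drops every row that has a duplicate at all.
none_df = df.drop_duplicates(subset=cols, keep=False)

print(last_df)
print(none_df)
```

With keep='last' the 'lmn' row is kept instead of the 'xyz' row; with keep=False only the 'toy' row remains.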
How to remove duplicate rows in both using a condition in R
You can use this code:
dff[!duplicated(t(apply(cbind(paste(dff$RES1,dff$VAL1),paste(dff$RES2,dff$VAL2)),1,sort))),]
Equivalent unrolled code:
v1 <- paste(dff$RES1,dff$VAL1)
v2 <- paste(dff$RES2,dff$VAL2)
mx <- cbind(v1,v2)
mxSorted <- t(apply(mx,1,sort))
duped <- duplicated(mxSorted)
dff[!duped,]
Explanation:
1) we create two character vectors v1, v2 by concatenating columns RES1-VAL1 and RES2-VAL2 (note that paste uses a space as the default separator; to be safer you could use another character or string, e.g. |, @, ; etc.)
Result:
> v1
[1] "A 3" "B 5" "A 3" "A 6" "B 8"
> v2
[1] "B 5" "A 3" "A 7" "B 2" "A 7"
2) we bind these two vectors to form a matrix using cbind;
Result:
[,1] [,2]
[1,] "A 3" "B 5"
[2,] "B 5" "A 3"
[3,] "A 3" "A 7"
[4,] "A 6" "B 2"
[5,] "B 8" "A 7"
3) we sort the values of each row of the matrix using t(apply(mx,1,sort)); by sorting within each row, rows that hold the same values in swapped order become identical (note that the final transpose is necessary because apply returns the result for each row as a column).
Result:
[,1] [,2]
[1,] "A 3" "B 5"
[2,] "A 3" "B 5"
[3,] "A 3" "A 7"
[4,] "A 6" "B 2"
[5,] "A 7" "B 8"
4) calling duplicated on a matrix returns a logical vector of length nrow(matrix) that is TRUE where a row duplicates a previous row, so in our case, we get:
[1] FALSE TRUE FALSE FALSE FALSE
# i.e. the second row is a duplicate
5) finally we use this vector to filter the rows of the data.frame, getting the final result:
RES1 VAL1 RES2 VAL2
1 A 3 B 5
3 A 3 A 7
4 A 6 B 2
5 B 8 A 7
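The five steps above can also be sketched in Python with pandas. The data is reconstructed from the v1/v2 vectors shown in the explanation; using tuples instead of pasted strings is a substitution made here to sidestep the separator concern from step 1, not part of the original answer.

```python
import pandas as pd

dff = pd.DataFrame({
    "RES1": ["A", "B", "A", "A", "B"],
    "VAL1": [3, 5, 3, 6, 8],
    "RES2": ["B", "A", "A", "B", "A"],
    "VAL2": [5, 3, 7, 2, 7],
})

seen = set()
keep = []
for r1, v1, r2, v2 in dff.itertuples(index=False, name=None):
    # Sort the two (RES, VAL) pairs so swapped rows share one key.
    key = tuple(sorted([(r1, v1), (r2, v2)]))
    keep.append(key not in seen)  # False marks a repeat of an earlier row
    seen.add(key)

result = dff.loc[keep]
print(result)
```

The keep list plays the role of !duped in the R code: it is False only for the second row (index 1 here, since pandas counts from 0).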
Remove rows by duplicate column(s) values
Use np.unique on the sliced array with the return_index parameter and axis=0. That gives us the indices of the unique rows, treating each row as one entity. These indices can then be used to row-index into the original array for the desired output.
So, with a as the input array, it would be -
a[np.unique(a[:,[0,1,2,5]],return_index=True,axis=0)[1]]
Sample run to break down the steps and hopefully make things clear -
In [29]: a
Out[29]:
array([[ -4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ -4, 5, 9, 98, -21, 80],
[ 5, -9, 0, 32, 18, 0]])
In [30]: a_slice = a[:,[0,1,2,5]]
In [31]: _, unq_row_indices = np.unique(a_slice,return_index=True,axis=0)
In [32]: final_output = a[unq_row_indices]
In [33]: final_output
Out[33]:
array([[-4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ 5, -9, 0, 32, 18, 0]])
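One detail worth keeping in mind: np.unique returns the unique rows in sorted order, so the indices it hands back follow that order rather than the order of first appearance. In the sample above the two happen to coincide. A hedged sketch of how the original row order could be preserved in general is to sort the returned indices before indexing:

```python
import numpy as np

a = np.array([[-4,  5, 9, 30,  50, 80],
              [ 2, -6, 9, 34,  12,  7],
              [-4,  5, 9, 98, -21, 80],
              [ 5, -9, 0, 32,  18,  0]])

# Indices of the first occurrence of each unique row (in sorted-row order).
_, idx = np.unique(a[:, [0, 1, 2, 5]], axis=0, return_index=True)

# Sorting the indices restores the order in which rows appear in `a`.
out = a[np.sort(idx)]
print(out)
```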
Remove duplicated rows
Just isolate your data frame to the columns you need, then use the unique function :D
# in the above example, you only need the first three columns
deduped.data <- unique( yourdata[ , 1:3 ] )
# the fourth column no longer 'distinguishes' them,
# so they're duplicates and thrown out.
What can I do to find and remove semi-duplicate rows in a matrix?
You will need to set a threshold, but you can just compute the distance between each pair of rows using dist and find the points that are sufficiently close together. Of course, each point is near itself, so you need to ignore the diagonal of the distance matrix.
DM = as.matrix(dist(x))
diag(DM) = 1 ## ignore diagonal
which(DM < 0.025, arr.ind=TRUE)
row col
8 8 5
5 5 8
16 16 11
11 11 16
48 48 20
20 20 48
168 168 71
91 91 73
73 73 91
71 71 168
This finds the "close" points that you created and a few others that got generated at random.
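The same thresholding idea can be sketched in NumPy. The three points below are hypothetical (the original x is not shown), and looking only above the diagonal is a choice made here so that each close pair is reported once instead of twice.

```python
import numpy as np

# Hypothetical data: row 2 is a near-copy of row 0.
x = np.array([[0.10, 0.20],
              [0.80, 0.50],
              [0.11, 0.21]])
threshold = 0.025  # same cut-off as in the R snippet

# Pairwise Euclidean distances via broadcasting, shape (n, n).
diff = x[:, None, :] - x[None, :, :]
dm = np.sqrt((diff ** 2).sum(axis=-1))

# Keep only entries above the diagonal: no self-matches, no mirrored pairs.
rows, cols = np.where(np.triu(dm < threshold, k=1))
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)

# Drop the later row of each close pair to remove the semi-duplicates.
deduped = np.delete(x, np.unique(cols), axis=0)
print(deduped)
```

Building the full n-by-n distance matrix costs memory quadratic in the number of rows, so this sketch suits small to moderate inputs.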
Remove duplicate rows ignoring column order? in R
df <- data.frame(
var1 = c("a", "b", "a", "c", "b", "c"),
var2 = c("b", "a", "c", "a", "c", "b"),
value = c(0.576, 0.576, 0.987, 0.987, 0.034, 0.034)
)
A one-liner base-r solution:
df_unique <- df[!duplicated(apply(df[,1:2], 1, function(row) paste(sort(row), collapse=""))),]
df_unique
var1 var2 value
1 a b 0.576
3 a c 0.987
5 b c 0.034
What it does: works across the first 2 columns row-wise (apply with MARGIN = 1), sorts the content alphabetically (sort), pastes it into a single string (paste), and removes every row whose string has already occurred before (!duplicated).
Another (probably better) approach, stepping back, is to take your original matrix and blank out the bottom half (and the diagonal) using lower.tri. This way only half of the combinations will have non-NA values:
mat <- matrix(c(0, 0.576, 0.987, 0.576, 0, 0.034, 0.987, 0.034, 0),
nrow=3, dimnames=list(letters[1:3], letters[1:3]))
mat[lower.tri(mat, diag = TRUE)] <- NA
mat
a b c
a NA 0.576 0.987
b NA NA 0.034
c NA NA NA
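For comparison, a NumPy sketch of the same upper-triangle idea is shown below. It uses np.triu_indices to list each unordered pair once instead of masking the lower half with NA; the variable names are illustrative only.

```python
import numpy as np

labels = ["a", "b", "c"]
mat = np.array([[0.0,   0.576, 0.987],
                [0.576, 0.0,   0.034],
                [0.987, 0.034, 0.0]])

# Indices strictly above the diagonal: each unordered pair appears once.
i, j = np.triu_indices(len(labels), k=1)
pairs = [(labels[r], labels[c], float(mat[r, c])) for r, c in zip(i, j)]
print(pairs)
```

The resulting triples correspond to the rows of df_unique from the one-liner above.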
R: how to remove duplicate rows by column
df[!duplicated(df[ , c("id","gender")]),]
# id gender variant
# 1 1 Female a
# 3 1 Male c
# 4 2 Female d
# 5 2 Male e
Another way of doing this is with subset, as below:
subset(df, !duplicated(subset(df, select=c(id, gender))))
# id gender variant
# 1 1 Female a
# 3 1 Male c
# 4 2 Female d
# 5 2 Male e