Remove Duplicate Rows of a Matrix or Dataframe

Removing duplicate rows from data frame in R

We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1) and group by pmin(A, B) and pmax(A, B). If a group has more than one row, we keep only its first row; otherwise we return the group unchanged.

 library(data.table)
 setDT(df1)[, if(.N >1) head(.SD, 1) else .SD ,.(A=pmin(A, B), B= pmax(A, B))]
 #   A B prob
 #1: 1 2  0.1
 #2: 1 3  0.2
 #3: 1 4  0.3
 #4: 2 3  0.1
 #5: 2 4  0.4

Or we can just use duplicated on the pmax, pmin output to get a logical index and subset the data based on that.

 setDT(df1)[!duplicated(cbind(pmax(A, B), pmin(A, B)))]
 #   A B prob
 #1: 1 2  0.1
 #2: 1 3  0.2
 #3: 1 4  0.3
 #4: 2 3  0.1
 #5: 2 4  0.4

How do I remove rows with duplicate values of columns in a pandas data frame?

Use drop_duplicates with subset set to the list of columns to check for duplicates, and keep='first' to keep the first row of each set of duplicates.

If dataframe is:

import pandas as pd

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Then:

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
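
The keep argument also accepts 'last' and False. A small sketch on the same frame, assuming pandas is installed (the names cols, last_df and none_df are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

cols = ['Column1', 'Column2']

# keep='last' keeps the last row of each duplicate group (index 2 instead of 0)
last_df = df.drop_duplicates(subset=cols, keep='last')

# keep=False drops every row that has a duplicate, leaving only index 1
none_df = df.drop_duplicates(subset=cols, keep=False)

print(last_df.index.tolist())  # [1, 2]
print(none_df.index.tolist())  # [1]
```

Which option fits depends on whether the first, the last, or none of the duplicated rows carries the Column3 value you want to keep.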

How to remove duplicate rows in both using a condition in R

You can use this code:

dff[!duplicated(t(apply(cbind(paste(dff$RES1,dff$VAL1),paste(dff$RES2,dff$VAL2)),1,sort))),]

Equivalent unrolled code:

v1 <- paste(dff$RES1,dff$VAL1)
v2 <- paste(dff$RES2,dff$VAL2)
mx <- cbind(v1,v2)
mxSorted <- t(apply(mx,1,sort))
duped <- duplicated(mxSorted)
dff[!duped,]

Explanation:

1) we create two character vectors v1, v2 by concatenating columns RES1-VAL1 and RES2-VAL2 (note that paste uses a space as the default separator; if your values can contain spaces, another character or string such as |, @ or ; would be safer)

Result:

> v1
[1] "A 3" "B 5" "A 3" "A 6" "B 8"
> v2
[1] "B 5" "A 3" "A 7" "B 2" "A 7"

2) we bind these two vectors to form a matrix using cbind;

Result:

     [,1]  [,2] 
[1,] "A 3" "B 5"
[2,] "B 5" "A 3"
[3,] "A 3" "A 7"
[4,] "A 6" "B 2"
[5,] "B 8" "A 7"

3) we sort the values of each row of the matrix using t(apply(mx,1,sort));

By sorting each row, we make rows that hold the same values in swapped order identical (note that the final transpose is necessary because apply returns each row's result as a column).

Result:

     [,1]  [,2] 
[1,] "A 3" "B 5"
[2,] "A 3" "B 5"
[3,] "A 3" "A 7"
[4,] "A 6" "B 2"
[5,] "A 7" "B 8"

4) calling duplicated on a matrix, we get a logical vector of length = nrow(matrix), being TRUE where a row is a duplicate of a previous row, so in our case, we get:

[1] FALSE  TRUE FALSE FALSE FALSE
# i.e. the second row is a duplicate

5) finally we use this vector to filter the rows of the data.frame, getting the final result:

  RES1 VAL1 RES2 VAL2
1    A    3    B    5
3    A    3    A    7
4    A    6    B    2
5    B    8    A    7
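
For readers who find the five steps easier to follow in a loop, here is a rough Python sketch of the same idea. The rows are rebuilt from the v1/v2 output shown above; the function name drop_swapped_duplicates is made up for this illustration and is not part of the original answer.

```python
# Rows rebuilt from the v1/v2 output shown above: (RES1, VAL1, RES2, VAL2)
rows = [
    ("A", 3, "B", 5),
    ("B", 5, "A", 3),
    ("A", 3, "A", 7),
    ("A", 6, "B", 2),
    ("B", 8, "A", 7),
]

def drop_swapped_duplicates(rows):
    """Keep the first row of each set of rows whose two (RES, VAL) pairs match in any order."""
    seen = set()
    kept = []
    for res1, val1, res2, val2 in rows:
        # Steps 1-3: build the two pairs and sort them so swapped rows get the same key
        key = tuple(sorted([(res1, val1), (res2, val2)]))
        # Steps 4-5: skip the row if its key was already seen
        if key not in seen:
            seen.add(key)
            kept.append((res1, val1, res2, val2))
    return kept

result = drop_swapped_duplicates(rows)
for row in result:
    print(row)
```

Because the key is a tuple of pairs rather than a pasted string, the separator concern from step 1 does not arise in this version.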

Remove rows by duplicate column(s) values

Use np.unique on the sliced array with return_index=True and axis=0. This gives the index of the first occurrence of each unique row, treating each row as one entity. These indices can then be used to row-index the original array for the desired output.

So, with a as the input array, it would be -

a[np.unique(a[:,[0,1,2,5]],return_index=True,axis=0)[1]]

Sample run to break down the steps and hopefully make things clear -

In [29]: a
Out[29]: 
array([[ -4,   5,   9,  30,  50,  80],
       [  2,  -6,   9,  34,  12,   7],
       [ -4,   5,   9,  98, -21,  80],
       [  5,  -9,   0,  32,  18,   0]])

In [30]: a_slice = a[:,[0,1,2,5]]

In [31]: _, unq_row_indices = np.unique(a_slice,return_index=True,axis=0)

In [32]: final_output = a[unq_row_indices]

In [33]: final_output
Out[33]: 
array([[-4,  5,  9, 30, 50, 80],
       [ 2, -6,  9, 34, 12,  7],
       [ 5, -9,  0, 32, 18,  0]])
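
One thing to keep in mind: np.unique orders its output by the sorted unique rows, so the returned indices follow that order rather than the order of the input. In the sample run the two happen to coincide. A minimal sketch, assuming you want the surviving rows in their original order, is to sort the indices first (first_idx is just an illustrative name):

```python
import numpy as np

a = np.array([[-4,  5, 9, 30,  50, 80],
              [ 2, -6, 9, 34,  12,  7],
              [-4,  5, 9, 98, -21, 80],
              [ 5, -9, 0, 32,  18,  0]])

# Index of the first occurrence of each unique row, judged on columns 0, 1, 2 and 5
_, first_idx = np.unique(a[:, [0, 1, 2, 5]], return_index=True, axis=0)

# Sorting the indices puts the surviving rows back in their original order
final_output = a[np.sort(first_idx)]
print(final_output)
```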

Remove duplicated rows

Just subset your data frame to the columns you need, then use the unique function :D

# in the above example, you only need the first three columns
deduped.data <- unique( yourdata[ , 1:3 ] )
# the fourth column no longer 'distinguishes' them, 
# so they're duplicates and thrown out.

What can I do to find and remove semi-duplicate rows in a matrix?

You will need to set a threshold, but you can just compute the distance between each pair of rows using dist and find the points that are sufficiently close together. Of course, each point is at distance 0 from itself, so you need to ignore the diagonal of the distance matrix.

DM = as.matrix(dist(x))
diag(DM) = 1            ## ignore diagonal
which(DM < 0.025, arr.ind=TRUE)
    row col
8     8   5
5     5   8
16   16  11
11   11  16
48   48  20
20   20  48
168 168  71
91   91  73
73   73  91
71   71 168

This finds the "close" points that you created and a few others that got generated at random.
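
The same thresholding idea can be sketched in plain Python. The points below are hypothetical, as are the names close_pairs, drop and deduped. Dropping the later member of each close pair is only one possible way to do the removal step; it is not something the answer above prescribes.

```python
import math

# Hypothetical 2-D points; rows 0 and 2 are nearly identical
x = [(0.10, 0.20), (0.50, 0.90), (0.11, 0.21), (0.80, 0.30)]
threshold = 0.025  # same cutoff as above; tune it for your data

def close_pairs(points, threshold):
    """Return (i, j) index pairs with i < j whose Euclidean distance is below threshold."""
    pairs = []
    for i in range(len(points)):
        # Start at i + 1 so a point is never compared with itself
        # and each pair is reported once
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < threshold:
                pairs.append((i, j))
    return pairs

pairs = close_pairs(x, threshold)
# Drop the later member of every close pair to remove the semi-duplicates
drop = {j for _, j in pairs}
deduped = [p for k, p in enumerate(x) if k not in drop]
print(pairs)
print(deduped)
```

Looping over j > i plays the role of ignoring the diagonal and also avoids reporting each pair twice, as happens in the which() output above.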

Remove duplicate rows ignoring column order in R

df <- data.frame(
    var1 = c("a", "b", "a", "c", "b", "c"), 
    var2 = c("b", "a", "c", "a", "c", "b"), 
    value = c(0.576, 0.576, 0.987, 0.987, 0.034, 0.034)
)

A one-liner base-r solution:

df_unique <- df[!duplicated(apply(df[,1:2], 1, function(row) paste(sort(row), collapse=""))),]

df_unique
  var1 var2 value
1    a    b 0.576
3    a    c 0.987
5    b    c 0.034

What it does: work across the first 2 columns row-wise (apply with MARGIN = 1), sort (alphabetically) the content, paste into a single string, remove all indices where the string has already occurred before (!duplicated).
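
A caveat on pasting with an empty separator: two different pairs can collapse to the same string (for example "ab","c" and "a","bc" both give "abc"). That cannot happen with the single-letter values here, but it can with longer ones. The sketch below shows the same order-independent dedup in Python using a sorted tuple as the key, which avoids the issue. The data mirrors df above; the names seen and unique_rows are illustrative.

```python
# Same data as df above: (var1, var2, value)
rows = [
    ("a", "b", 0.576), ("b", "a", 0.576),
    ("a", "c", 0.987), ("c", "a", 0.987),
    ("b", "c", 0.034), ("c", "b", 0.034),
]

seen = set()
unique_rows = []
for var1, var2, value in rows:
    key = tuple(sorted((var1, var2)))  # order-independent key, no string pasting
    if key not in seen:
        seen.add(key)
        unique_rows.append((var1, var2, value))

print(unique_rows)

# With an empty separator, two different pairs can collapse to the same text,
# while their tuple keys stay distinct
pasted_1 = "".join(sorted(("ab", "c")))
pasted_2 = "".join(sorted(("a", "bc")))
print(pasted_1 == pasted_2)  # True: both are "abc"
```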

Another (probably better) approach, stepping back, is to take your original matrix and blank out the lower triangle (and the diagonal) using lower.tri. This way each combination keeps a single non-NA value:

mat <- matrix(c(0, 0.576, 0.987, 0.576, 0, 0.034, 0.987, 0.034, 0), 
              nrow=3, dimnames=list(letters[1:3], letters[1:3]))

mat[lower.tri(mat, diag = TRUE)] <- NA
mat
   a     b     c
a NA 0.576 0.987
b NA    NA 0.034
c NA    NA    NA

R: how to remove duplicate rows by column

df[!duplicated(df[ , c("id","gender")]),]

#     id  gender  variant
#  1   1  Female     a
#  3   1   Male      c
#  4   2  Female     d
#  5   2   Male      e

Another way of doing this is to use subset, as below:

subset(df, !duplicated(subset(df, select=c(id, gender))))

#   id  gender variant
# 1  1  Female     a
# 3  1    Male     c
# 4  2  Female     d
# 5  2    Male     e