# Removing Duplicate Combinations (Irrespective of Order)

## Removing duplicate combinations (irrespective of order)

Sort within the rows first, then use `duplicated`; see below:

```r
# example data
dat = matrix(scan('data.txt'), ncol = 3, byrow = TRUE)
# Read 90 items
dat[!duplicated(apply(dat, 1, sort), MARGIN = 2), ]
#       [,1] [,2] [,3]
#  [1,]    1    2    3
#  [2,]    1    2    4
#  [3,]    1    2    5
#  [4,]    1    3    4
#  [5,]    1    3    5
#  [6,]    1    4    5
#  [7,]    2    3    4
#  [8,]    2    3    5
#  [9,]    2    4    5
# [10,]    3    4    5
```
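The sort-then-deduplicate idea is language-agnostic: the sorted tuple of a row is an order-independent key. As an illustrative sketch outside R (data and names here are made up), plain Python can express the same thing:

```python
# Deduplicate rows that contain the same values in any order:
# the sorted tuple of each row is its canonical, order-independent key.
rows = [(1, 2, 3), (3, 2, 1), (1, 2, 4), (2, 4, 1), (3, 4, 5)]

seen = set()
unique_rows = []
for row in rows:
    key = tuple(sorted(row))   # canonical form of the row
    if key not in seen:        # keep only the first occurrence
        seen.add(key)
        unique_rows.append(row)

print(unique_rows)  # [(1, 2, 3), (1, 2, 4), (3, 4, 5)]
```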

## Removing duplicate all-way-combinations while retaining all columns

Here's a base solution, using the `complete.cases` function, and also creating a sorted `feedID` column:

```r
# remove any rows with NA values
test <- test[complete.cases(test[, c('ID', 'feedID', 'feedID2')]), ]
# remove any rows with feedID == feedID2
test <- test[!(test$feedID == test$feedID2), ]
# add new feedID3 column
test$feedID3 <- apply(test, 1, function(x) paste(sort(c(x[2], x[3])), collapse = '-'))
# remove any duplicates, and remove last column
test[!duplicated(test[, c('feedID3', 'ID')]), -4]
#    ID feedID feedID2
# 2 49V     A1      G2
# 6 52V     B1      D1
# 7 52V     D1      D2
```

### data

Note that we have converted `"NA"` to `NA`, and we have also set `stringsAsFactors = FALSE`:

```r
test <- data.frame(ID = c("49V", "49V", "49V", "49V", "49V", "52V", "52V", "52V"),
                   feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1", "D2"),
                   feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2", NA),
                   stringsAsFactors = FALSE)
```

## Remove duplicate combinations in R

```r
df[!duplicated(t(apply(df[c("a", "b")], 1, sort))), ]
#   a b c
# 1 1 4 A
# 2 2 3 B
# 3 1 5 C
```

Where:

```r
df <- data.frame(
  a = c(1L, 2L, 1L, 4L, 5L, 3L, 3L),
  b = c(4L, 3L, 5L, 1L, 1L, 2L, 2L),
  c = c("A", "B", "C", "A", "C", "B", "E")
)
```

## How to find duplicated combinations where order does not matter in Excel

For exactly 4 columns and up to 1000 rows:

```
{=IF(SUM(IF(MMULT({1,1,1,1},TRANSPOSE(COUNTIF($A1:$D1,$A$1:$D$1000)))=4,1))>1,"duplicate","unique")}
```

This is an array formula. Input it into `E1` without the curly brackets. Then press [Ctrl]+[Shift]+[Enter] to confirm.

Copy downwards as needed.

If it does not work, check the language version of your Excel and the locale of your Windows: the array constant `{1,1,1,1}` may need to be written as `{1\1\1\1}` or `{1.1.1.1}` if the comma conflicts with your decimal separator or list delimiter.

## Remove duplicates across columns

We can `sort` the elements in each row with `apply`, transpose the output with `t`, apply `duplicated` to get a logical vector, and use that to subset the rows:

```r
df[!duplicated(t(apply(df[, 1:2], 1, sort))), ]
#     [,1] [,2]
#[1,] "a"  "b"
#[2,] "a"  "c"
#[3,] "a"  "d"
#[4,] "b"  "c"
#[5,] "b"  "d"
#[6,] "c"  "d"
```

or another option is `pmin/pmax`

```r
df[!duplicated(cbind(pmin(df[,1], df[,2]), pmax(df[,1], df[,2]))), ]
```

### data

```r
df <- structure(c("a", "a", "a", "b", "b", "b", "c", "c", "c",
                  "b", "c", "d", "a", "c", "d", "a", "b", "d"),
                .Dim = c(9L, 2L))
```

## SQL Remove duplicate combination

If you have other columns and the pairs only appear once (in either direction):

```sql
select t.*
from t
where t.x1 <= t.x2
union all
select t.*
from t
where t.x1 > t.x2 and
      not exists (select 1 from t t2 where t2.x1 = t.x2 and t2.x2 = t.x1);
```
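In procedural terms, the query keeps a row when its pair is already in order (`x1 <= x2`), or when it is out of order but the reversed pair does not exist elsewhere. An illustrative sketch in plain Python (the sample rows are made up):

```python
# Mirror the two UNION ALL branches: keep (x1, x2, ...) rows where
# x1 <= x2, or where x1 > x2 but the reversed pair is absent.
rows = [(1, 2, 'a'), (2, 1, 'b'), (3, 1, 'c')]

pairs = {(r[0], r[1]) for r in rows}
result = [r for r in rows
          if r[0] <= r[1] or (r[1], r[0]) not in pairs]

print(result)  # [(1, 2, 'a'), (3, 1, 'c')]
```

Note that, like the SQL, this keeps the ordered copy of a pair when both directions are present, and the out-of-order copy when it is the only one.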

## Delete duplicated rows with same values but in different column in R

One option would be to use a least/greatest trick, and then remove duplicates:

```r
library(SparkR)
df <- unique(cbind(least(df$A, df$B), greatest(df$A, df$B)))
```

Here is a base R version of the above:

```r
df <- unique(cbind(ifelse(df$A < df$B, df$A, df$B),
                   ifelse(df$A >= df$B, df$A, df$B)))
```
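The least/greatest trick translates directly to other languages: `(min, max)` of a two-column pair is a canonical key regardless of order. An illustrative sketch in plain Python (sample pairs are made up):

```python
# Canonicalise each (A, B) pair as (min, max) so that (2, 5) and (5, 2)
# collapse to the same key, then keep the first occurrence of each key.
pairs = [(2, 5), (5, 2), (1, 3), (3, 1), (4, 4)]

seen = set()
deduped = []
for a, b in pairs:
    key = (min(a, b), max(a, b))
    if key not in seen:
        seen.add(key)
        deduped.append((a, b))

print(deduped)  # [(2, 5), (1, 3), (4, 4)]
```

Unlike the `unique(cbind(...))` version above, this sketch keeps each pair's original orientation rather than replacing it with the sorted one.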

## Unique case of finding duplicate values flexibly across columns in R

### tidyverse

```r
df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"))

library(tidyverse)
df %>%
  rowwise() %>%
  mutate(duplicates = str_c(sort(c_across(c(1, 3))), collapse = "")) %>%
  group_by(duplicates) %>%
  mutate(duplicates = n() > 1) %>%
  ungroup()
#> # A tibble: 4 x 4
#>   animal_1 predation_type animal_2 duplicates
#>   <chr>    <chr>          <chr>    <lgl>
#> 1 cat      eats           mouse    TRUE
#> 2 dog      eats           squirrel FALSE
#> 3 mouse    eaten by       cat      TRUE
#> 4 squirrel eats           nuts     FALSE
```

Created on 2022-01-17 by the reprex package (v2.0.1)

### removing duplicates

```r
library(tidyverse)
df %>%
  filter(!duplicated(map2(animal_1, animal_2, ~str_c(sort(c(.x, .y)), collapse = ""))))
#>   animal_1 predation_type animal_2
#> 1      cat           eats    mouse
#> 2      dog           eats squirrel
#> 3 squirrel           eats     nuts
```


## Remove Duplicates Based on Combined Sets

One idea is to treat each long/lat pair as a string with `toString(...)`, sort the two resulting strings within each row, and then use the sorted 2-element string vector to check for duplicates:

```r
ans <- C[!duplicated(lapply(1:nrow(C), function(i)
  sort(c(toString(C[i, 1:2]), toString(C[i, 3:4]))))), ]
#   A_Latitude A_Longitude B_Latitude B_Longitude
# 1    48.4459      9.9890    49.0275      8.7539
# 2    48.7000      8.1500    48.4734      9.2270
# 4    49.0275      8.7539    48.9602      9.2058
```

Here's a breakdown for row 1:

```r
toString(C[1, 1:2])
# [1] "48.4459, 9.989"
toString(C[1, 3:4])
# [1] "49.0275, 8.7539"
sort(c(toString(C[1, 1:2]), toString(C[1, 3:4])))
# [1] "48.4459, 9.989"  "49.0275, 8.7539"
```
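The same grouping idea works outside R without the string detour: treat each coordinate pair as one unit, sort the two units within a row, and use that as the duplicate key. An illustrative sketch in plain Python (the coordinates below reuse values from the example above, but the rows are made up):

```python
# Each row holds two (lat, lon) pairs; a row is a duplicate of another
# if it has the same two pairs in either order. Sorting the pair-of-pairs
# within the row yields an order-independent key.
rows = [
    ((48.4459, 9.9890), (49.0275, 8.7539)),
    ((48.7000, 8.1500), (48.4734, 9.2270)),
    ((49.0275, 8.7539), (48.4459, 9.9890)),  # row 1 with the pairs swapped
]

seen = set()
kept = []
for a, b in rows:
    key = tuple(sorted([a, b]))  # tuples sort lexicographically
    if key not in seen:
        seen.add(key)
        kept.append((a, b))

print(len(kept))  # 2
```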

## Finding unique combinations irrespective of position

Maybe something like this:

```r
indx <- !duplicated(t(apply(df, 1, sort))) # finds non-duplicates among sorted rows
df[indx, ] # selects only the non-duplicates according to that index
#   a b c
# 1 1 2 3
# 3 3 1 4
```