Create Unique Identifier from the Interchangeable Combination of Two Variables


You could do:

# sort each pair row-wise so that (x, y) and (y, x) produce the same key;
# apply() with MARGIN = 1 returns a matrix with one column per original row
labels <- apply(df[, c("col1", "col2")], 1, sort)
# paste each sorted pair into a single key and take its factor code as the id
df$id <- as.numeric(factor(apply(labels, 2, function(x) paste(x, collapse=""))))
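
For a quick check, here is how that behaves on the eight-row df used in the next answer below (reconstructed here so the snippet is self-contained); using a separator in paste() is a small tweak that avoids keys like "ab" + "c" colliding with "a" + "bc":

df <- data.frame(col1 = c("a", "c", "g", "d", "e", "b", "f", "h"),
                 col2 = c("b", "d", "h", "c", "f", "a", "e", "g"))

labels <- apply(df[, c("col1", "col2")], 1, sort)
df$id <- as.numeric(factor(apply(labels, 2, paste, collapse = "_")))
df$id
# [1] 1 2 4 2 3 1 3 4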

Get Unique List of Combinations of Strings, regardless of Order

# pmin/pmax give the row-wise smaller and larger value, so (x, y) and (y, x)
# map to the same (min, max) pair before interaction() builds the label
df$grp <- interaction(do.call(pmin, df[1:2]), do.call(pmax, df[1:2]))

df
#   col1 col2 grp
# 1    a    b a.b
# 2    c    d c.d
# 3    g    h g.h
# 4    d    c c.d
# 5    e    f e.f
# 6    b    a a.b
# 7    f    e e.f
# 8    h    g g.h

If you want numbers, you can then do

df$grp <- as.integer(df$grp)

df
#   col1 col2 grp
# 1    a    b   1
# 2    c    d   6
# 3    g    h  16
# 4    d    c   6
# 5    e    f  11
# 6    b    a   1
# 7    f    e  11
# 8    h    g  16
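
The codes are not consecutive because interaction() builds a level for every possible pmin/pmax combination, not just the ones that occur. If consecutive group numbers matter, one small follow-up (not part of the original answer) is to renumber by order of first appearance:

df$grp <- match(df$grp, unique(df$grp))
df$grp
# [1] 1 2 3 2 4 1 4 3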

R - make a unique list of two columns interchangeably

Here is a base R way.

# sort each row, then keep only rows whose sorted pair has not been seen before
inx <- !duplicated(t(apply(df, 1, sort)))
df[inx, ]

One-liner:

df[!duplicated(t(apply(df, 1, sort))), ]
#   col1 col2
# 1    a    1
# 3  bar  foo
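
If you would rather keep every row and just attach a shared identifier, the same sorted-row trick can feed match(); this is a sketch, not part of the original answer:

key <- apply(df, 1, function(x) paste(sort(x), collapse = "_"))
df$grp <- match(key, unique(key))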

SQL to select unique combination of two columns having interchangeable values

Something like this usually works:

SELECT DISTINCT
       CASE WHEN dep < arr THEN dep ELSE arr END AS col1,
       CASE WHEN dep < arr THEN arr ELSE dep END AS col2
FROM flights;
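
On databases that provide LEAST and GREATEST (PostgreSQL, MySQL, Oracle, SQL Server 2022+), the CASE expressions can be collapsed into an equivalent form:

SELECT DISTINCT LEAST(dep, arr) AS col1,
                GREATEST(dep, arr) AS col2
FROM flights;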

is there a way to group by two variables which interchange in R

We can create two new variables based on pmin/pmax and use them to get the group indices with group_indices():

library(dplyr)
df %>%
  mutate(ID_new = pmin(ID, ID2), ID2_new = pmax(ID, ID2)) %>%
  mutate(group = group_indices(., ID_new, ID2_new)) %>%
  select(-ends_with('new'))
#    ID ID2 group
# 1 102 167     1
# 2 102 167     1
# 3 167 102     1
# 4 143 148     2
# 5 143 148     2
# 6 148 143     2
# 7 148 143     2

In newer versions of dplyr (1.0.0 and later), we can use cur_group_id() after creating a grouping variable:

library(stringr)
df %>%
  group_by(grp = str_c(pmin(ID, ID2), pmax(ID, ID2))) %>%
  mutate(group = cur_group_id()) %>%
  ungroup() %>%
  select(-grp)
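
One caveat: str_c(pmin(ID, ID2), pmax(ID, ID2)) without a separator can collide for some IDs (e.g. 1 & 23 versus 12 & 3), so passing sep = "_" is safer. For completeness, here is a base R sketch of the same pmin/pmax idea, assuming the same numeric ID/ID2 columns:

key <- paste(pmin(df$ID, df$ID2), pmax(df$ID, df$ID2))
df$group <- match(key, unique(key))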

Postgresql enforce unique two-way combination of columns

A variation on Neil's solution which doesn't need an extension is:

create table friendz (
  from_id int,
  to_id int
);

create unique index ifriendz on friendz(greatest(from_id,to_id), least(from_id,to_id));

Neil's solution lets you use an arbitrary number of columns though.

We're both relying on expression indexes here, which are documented at
https://www.postgresql.org/docs/current/indexes-expressional.html
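
With that index in place, the same pair is rejected regardless of the order it arrives in (a quick sanity check, not from the original answer):

insert into friendz values (1, 2);  -- accepted
insert into friendz values (2, 1);  -- rejected: same pair in the other order violates ifriendz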

Add unique constraint to combination of two columns

Once you have removed your duplicate(s):

ALTER TABLE dbo.yourtablename
ADD CONSTRAINT uq_yourtablename UNIQUE(column1, column2);

or

CREATE UNIQUE INDEX uq_yourtablename
ON dbo.yourtablename(column1, column2);

Of course, it can often be better to check for this violation first, before just letting SQL Server try to insert the row and return an exception (exceptions are expensive); see the links below and the sketch that follows them.

  • Performance impact of different error handling techniques

  • Checking for potential constraint violations before entering TRY/CATCH
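
A minimal sketch of the check-first pattern (hypothetical variable names; keep the unique constraint as a backstop, since two concurrent sessions can both pass the check):

DECLARE @c1 INT = 1, @c2 INT = 2;  -- hypothetical values

IF NOT EXISTS (SELECT 1 FROM dbo.yourtablename
               WHERE column1 = @c1 AND column2 = @c2)
BEGIN
  INSERT dbo.yourtablename(column1, column2) VALUES(@c1, @c2);
END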

If you want to prevent exceptions from bubbling up to the application, without making changes to the application, you can use an INSTEAD OF trigger:

CREATE TRIGGER dbo.BlockDuplicatesYourTable
ON dbo.YourTable
INSTEAD OF INSERT
AS
BEGIN
  SET NOCOUNT ON;

  IF NOT EXISTS (SELECT 1 FROM inserted AS i
                 INNER JOIN dbo.YourTable AS t
                 ON i.column1 = t.column1
                 AND i.column2 = t.column2
  )
  BEGIN
    INSERT dbo.YourTable(column1, column2, ...)
    SELECT column1, column2, ... FROM inserted;
  END
  ELSE
  BEGIN
    PRINT 'Did nothing.';
  END
END
GO

But if you don't tell the user they didn't perform the insert, they're going to wonder why the data isn't there and no exception was reported.


EDIT: here is an example that does exactly what you're asking for, even using the same names as your question, and proves it. You should try it out before assuming the above ideas only treat one column or the other, as opposed to the combination...

USE tempdb;
GO

CREATE TABLE dbo.Person
(
  ID INT IDENTITY(1,1) PRIMARY KEY,
  Name NVARCHAR(32),
  Active BIT,
  PersonNumber INT
);
GO

ALTER TABLE dbo.Person
ADD CONSTRAINT uq_Person UNIQUE(PersonNumber, Active);
GO

-- succeeds:
INSERT dbo.Person(Name, Active, PersonNumber)
VALUES(N'foo', 1, 22);
GO

-- succeeds:
INSERT dbo.Person(Name, Active, PersonNumber)
VALUES(N'foo', 0, 22);
GO

-- fails:
INSERT dbo.Person(Name, Active, PersonNumber)
VALUES(N'foo', 1, 22);
GO

Data in the table after all of this:

ID   Name   Active PersonNumber
---- ------ ------ ------------
1    foo    1      22
2    foo    0      22

Error message on the last insert:

Msg 2627, Level 14, State 1, Line 3
Violation of UNIQUE KEY constraint 'uq_Person'. Cannot insert duplicate key in object 'dbo.Person'.
The statement has been terminated.

Also, I blogged more recently about a solution for applying a unique constraint to two columns in either order (a rough sketch of the idea follows the link below):

  • Enforce a Unique Constraint Where Order Does Not Matter
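
One common shape for that solution in SQL Server (a sketch with assumed table and column names, not necessarily the exact code from the post) is a pair of persisted computed columns plus a unique index:

CREATE TABLE dbo.Pairs (column1 INT, column2 INT);

ALTER TABLE dbo.Pairs ADD
  col_lo AS CASE WHEN column1 < column2 THEN column1 ELSE column2 END PERSISTED,
  col_hi AS CASE WHEN column1 < column2 THEN column2 ELSE column1 END PERSISTED;

CREATE UNIQUE INDEX uq_Pairs_EitherOrder ON dbo.Pairs(col_lo, col_hi);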

permutations with unique values

class unique_element:
    def __init__(self, value, occurrences):
        self.value = value              # the element itself
        self.occurrences = occurrences  # how many copies are still unused

def perm_unique(elements):
    eset = set(elements)
    listunique = [unique_element(i, elements.count(i)) for i in eset]
    u = len(elements)
    return perm_unique_helper(listunique, [0] * u, u - 1)

def perm_unique_helper(listunique, result_list, d):
    if d < 0:
        yield tuple(result_list)
    else:
        for i in listunique:
            if i.occurrences > 0:
                result_list[d] = i.value
                i.occurrences -= 1
                for g in perm_unique_helper(listunique, result_list, d - 1):
                    yield g
                i.occurrences += 1

a = list(perm_unique([1, 1, 2]))
print(a)

result:

[(2, 1, 1), (1, 2, 1), (1, 1, 2)]

EDIT (how this works):

I rewrote the above program to be longer but more readable.

I usually have a hard time explaining how something works, but let me try.
In order to understand how this works, you have to understand a similar but simpler program that would yield all permutations with repetitions.

def permutations_with_replacement(elements, n):
    return permutations_helper(elements, [0] * n, n - 1)  # this is a generator

def permutations_helper(elements, result_list, d):
    if d < 0:
        yield tuple(result_list)
    else:
        for i in elements:
            result_list[d] = i
            all_permutations = permutations_helper(elements, result_list, d - 1)  # this is a generator
            for g in all_permutations:
                yield g

This program is obviously much simpler:
d stands for depth in permutations_helper and serves two purposes. One is the stopping condition of our recursive algorithm, and the other marks the position in result_list that is currently being filled.

Instead of returning each result, we yield it. If there were no yield keyword, we would have to push each result onto some queue at the point of the stopping condition. Instead, once the stopping condition is met, the result is propagated through all the call stacks up to the caller. That is the purpose of

for g in perm_unique_helper(listunique, result_list, d - 1):
    yield g

so each result is propagated up to the caller.
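
As a quick check of the simpler generator (easy to verify by running it):

print(list(permutations_with_replacement([1, 2], 2)))
# [(1, 1), (2, 1), (1, 2), (2, 2)]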

Back to the original program:
we have a list of unique elements. Before we can use each element, we have to check how many copies of it are still available to push onto result_list. Working with this program is very similar to permutations_with_replacement; the difference is that each element cannot be repeated more times than it occurs in the input, which perm_unique_helper enforces with the occurrences counter.


