Assign Unique Id Based on Two Columns

Assign unique ID based on two columns

We can do this in base R without doing any group by operation

df$ID <- cumsum(!duplicated(df[1:2]))
df
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3

NOTE: Assuming that 'School' and 'Student' are ordered


Or using tidyverse

library(dplyr)
df %>%
mutate(ID = group_indices_(df, .dots=c("School", "Student")))
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3

As @radek mentioned, in the recent version (dplyr_0.8.0), we get the notification that group_indices_ is deprecated, instead use group_indices

df %>% 
mutate(ID = group_indices(., School, Student))

How to create a unique identifier based on multiple columns?

You could try with pd.Series.factorize:

df.set_index(['brand','description']).index.factorize()[0]+1

Output:

0    1
1 2
2 3
3 4
4 5
5 1

So you could try this, to assign it to be the first column:

df.insert(loc=0, column='product_key', value=df.set_index(['brand','description']).index.factorize()[0]+1)

Output:

df
product_key brand description former_price discounted_price
0 1 A icecream 1099.0 855.0
1 2 A cheese 469.0 375.0
2 3 B catfood 179.0 119.0
3 4 C NaN 699.0 399.0
4 5 NaN icecream 769.0 549.0
5 1 A icecream 769.0 669.0

Assign unique ID to combination of two columns in pandas dataframe independently on their order

Try np.sort:

a = np.sort(df, axis=1)
df['id'] = df.groupby([a[:,0],a[:,1]]).ngroup() + 1

Output:

   col1  col2  id
0 1 2 1
1 2 1 1
2 2 3 2
3 3 2 2
4 3 4 3
5 4 3 3

Assign an ID based on two columns R

You could try:

library(dplyr)
df %>%
group_by(email) %>%
mutate(ID = row_number())

Which gives:

#Source: local data frame [13 x 4]
#Groups: email
#
# row_num email wk_id ID
#1 1 aaaa 1/4/15 1
#2 2 aaaa 1/11/15 2
#3 3 aaaa 1/25/15 3
#4 4 bbbb 6/29/14 1
#5 5 bbbb 9/7/14 2
#6 6 cccc 11/16/14 1
#7 7 cccc 11/30/14 2
#8 8 cccc 12/7/14 3
#9 9 cccc 12/14/14 4
#10 10 cccc 12/21/14 5
#11 11 cccc 12/28/14 6
#12 12 cccc 1/4/15 7
#13 13 cccc 1/25/15 8

Or using data.table

library(data.table)
setDT(df)[, ID:= 1:.N, email]

Or ave from base R

df$ID <- with(df, ave(row_num, email, FUN=seq_along))

Assign unique id each time two fields/columns are an exact match

Using DENSE_RANK() should do the trick:

SELECT *, 
DENSE_RANK() OVER(ORDER by person_id, datetime) as unique_combination_id
FROM tbl;

DEMO

Assign unique ID based on match between two columns in PySpark Dataframe

I put my explanation in the code. Just be careful about using Window without partition your data, this operation will move all of your data to a single node.

from pyspark.sql.window import Window
import pyspark.sql.functions as f

# [...] Your dataframe initialization

# Creating an index to retrieve original dataframe at the end
df = df.withColumn('index', f.monotonically_increasing_id())

w = Window.orderBy('least')

# Creating a column with least value from `Column1` and `Column2`. This will be used to "group" the values that must have the same ID
df = df.withColumn('least', f.least(f.col('Column1'), f.col('Column2')))

# Check if the current or previous `flag` is false to increase the id
df = df.withColumn('increase', ((~f.col('flag')) | (~f.lag('flag', default=True).over(w))).cast('int'))

# Generating incremental id
df = df.withColumn('ID', f.lit(101) + f.sum('increase').over(w))

(df
.select('ID', 'Column1')
.drop_duplicates()
.sort('index')
.show(truncate=False))

Output

+---+-------+
|ID |Column1|
+---+-------+
|101|1 |
|101|2 |
|101|3 |
|102|4 |
|103|5 |
|104|6 |
|104|7 |
|101|9 |
|105|8 |
+---+-------+

Assign unique ID based on values in EITHER of two columns

We can use duplicated in base R

df1$group_id <- cumsum(!Reduce(`|`, lapply(df1, duplicated)))

-output

df1
# A tibble: 5 x 3
# color shape group_id
# <chr> <chr> <int>
#1 blue triangle 1
#2 blue square 1
#3 red circle 2
#4 green hexagon 3
#5 purple hexagon 3

Or using tidyverse

library(dplyr)
library(purrr)
df1 %>%
mutate(group_id = map(., duplicated) %>%
reduce(`|`) %>%
`!` %>%
cumsum)

data

df1 <- structure(list(color = c("blue", "blue", "red", "green", "purple"
), shape = c("triangle", "square", "circle", "hexagon", "hexagon"
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

How to assign unique value based on two columns combinations in Python?

If you simply want a unique identifier for each combination of GENUS and SPECIES you can do the following:

Note: In have assumed that either GENUS or SPECIES can contain a None value, which complicates the process slightly.

So Given a DF of the form:

    GENUS   SPECIES
0 Murina Coelodonta
1 Murina Microtherium
2 Microtherium Murina
3 Bachitherium Microtherium
4 Coelodonta None
5 Coelodonta Coelodonta
6 Microtherium Coelodonta
7 Microtherium Murina
8 Microtherium Bachitherium
9 Murina Microtherium

Add a column which uniquely identifies each combination of GENUS and SPECIES. We call this Column 'ID'.

Define a function to create a hash of entries, taking into account the possibility of a None entry.

def hashValues(g, s):
if g == None:
g = "None"
if s == None:
s = 'None'
return hash(g + s)

To add the column use the following:

df['ID'] = [hashValues(df['GENUS'].to_list()[i], df['SPECIES'].to_list()[i]) for i in range(df.shape[0])]  

which yields:

    GENUS           SPECIES         ID
0 Murina Coelodonta -6583287505830614713
1 Murina Microtherium 6019734726691011903
2 Microtherium Murina -2318069015748475190
3 Bachitherium Microtherium 5795352218934423262
4 Coelodonta None 4851538573581845777
5 Coelodonta Coelodonta -5115794138222494493
6 Microtherium Coelodonta 2603682196287415014
7 Microtherium Murina -2318069015748475190
8 Microtherium Bachitherium -2746445536675711990
9 Murina Microtherium 6019734726691011903

Create unique id based on the relation between two columns

For each row create a new variable (maybe a tuple) that have the members of that team.

Id  TeamId  UserId  NewVar
43 504 722 (722, 727)
44 504 727 (722, 727)
45 601 300 (300)
46 602 722 (722, 727)
47 602 727 (722, 727)
48 605 300 (300)
49 777 300 (300, 301)
50 777 301 (300, 301)
51 788 400 (400)
52 789 400 (400)
53 100 727 (727)

after this step compare the NewVar and assign the id
Ps: don't forget to order the NewVar



Related Topics



Leave a reply



Submit