Assign Unique Id Based on Two Columns

Assign unique ID based on two columns

We can do this in base R without doing any group by operation

df$ID <- cumsum(!duplicated(df[1:2]))
df
#   School Student Year ID
#1      A      10 1999  1
#2      A      10 2000  1
#3      A      20 1999  2
#4      A      20 2000  2
#5      A      20 2001  2
#6      B      10 1999  3
#7      B      10 2000  3

NOTE: Assuming that 'School' and 'Student' are ordered

Or using tidyverse

library(dplyr)
df %>% 
    mutate(ID = group_indices_(df, .dots=c("School", "Student"))) 
#  School Student Year ID
#1      A      10 1999  1
#2      A      10 2000  1
#3      A      20 1999  2
#4      A      20 2000  2
#5      A      20 2001  2
#6      B      10 1999  3
#7      B      10 2000  3

As @radek mentioned, in the recent version (dplyr_0.8.0), we get the notification that group_indices_ is deprecated, instead use group_indices

df %>% 
   mutate(ID = group_indices(., School, Student))

How to create a unique identifier based on multiple columns?

You could try with pd.Series.factorize:

df.set_index(['brand','description']).index.factorize()[0]+1

Output:

So you could try this, to assign it to be the first column:

df.insert(loc=0, column='product_key', value=df.set_index(['brand','description']).index.factorize()[0]+1)

Output:

df
   product_key brand description  former_price  discounted_price
0            1     A    icecream        1099.0             855.0
1            2     A      cheese         469.0             375.0
2            3     B     catfood         179.0             119.0
3            4     C         NaN         699.0             399.0
4            5   NaN    icecream         769.0             549.0
5            1     A    icecream         769.0             669.0

Assign unique ID to combination of two columns in pandas dataframe independently on their order

Try np.sort:

a = np.sort(df, axis=1)
df['id'] = df.groupby([a[:,0],a[:,1]]).ngroup() + 1

Output:

   col1  col2  id
0     1     2   1
1     2     1   1
2     2     3   2
3     3     2   2
4     3     4   3
5     4     3   3

Assign an ID based on two columns R

You could try:

library(dplyr)
df %>%
   group_by(email) %>% 
   mutate(ID = row_number())

Which gives:

#Source: local data frame [13 x 4]
#Groups: email
#
#   row_num email    wk_id ID
#1        1  aaaa   1/4/15  1
#2        2  aaaa  1/11/15  2
#3        3  aaaa  1/25/15  3
#4        4  bbbb  6/29/14  1
#5        5  bbbb   9/7/14  2
#6        6  cccc 11/16/14  1
#7        7  cccc 11/30/14  2
#8        8  cccc  12/7/14  3
#9        9  cccc 12/14/14  4
#10      10  cccc 12/21/14  5
#11      11  cccc 12/28/14  6
#12      12  cccc   1/4/15  7
#13      13  cccc  1/25/15  8

Or using data.table

library(data.table)
setDT(df)[, ID:= 1:.N, email]

Or ave from base R

df$ID <- with(df, ave(row_num, email, FUN=seq_along))

Assign unique id each time two fields/columns are an exact match

Using DENSE_RANK() should do the trick:

SELECT *, 
   DENSE_RANK() OVER(ORDER by person_id, datetime) as unique_combination_id
FROM tbl;

DEMO

Assign unique ID based on match between two columns in PySpark Dataframe

I put my explanation in the code. Just be careful about using Window without partition your data, this operation will move all of your data to a single node.

from pyspark.sql.window import Window
import pyspark.sql.functions as f

# [...] Your dataframe initialization

# Creating an index to retrieve original dataframe at the end
df = df.withColumn('index', f.monotonically_increasing_id())

w = Window.orderBy('least')

# Creating a column with least value from `Column1` and `Column2`. This will be used to "group" the values that must have the same ID
df = df.withColumn('least', f.least(f.col('Column1'), f.col('Column2')))

# Check if the current or previous `flag` is false to increase the id
df = df.withColumn('increase', ((~f.col('flag')) | (~f.lag('flag', default=True).over(w))).cast('int'))

# Generating incremental id
df = df.withColumn('ID', f.lit(101) + f.sum('increase').over(w))

(df
 .select('ID', 'Column1')
 .drop_duplicates()
 .sort('index')
 .show(truncate=False))

Output

+---+-------+
|ID |Column1|
+---+-------+
|101|1      |
|101|2      |
|101|3      |
|102|4      |
|103|5      |
|104|6      |
|104|7      |
|101|9      |
|105|8      |
+---+-------+

Assign unique ID based on values in EITHER of two columns

We can use duplicated in base R

df1$group_id <- cumsum(!Reduce(`|`, lapply(df1, duplicated)))

-output

df1
# A tibble: 5 x 3
#  color  shape    group_id
#  <chr>  <chr>       <int>
#1 blue   triangle        1
#2 blue   square          1
#3 red    circle          2
#4 green  hexagon         3
#5 purple hexagon         3

Or using tidyverse

library(dplyr)
library(purrr)
df1 %>%
    mutate(group_id = map(.,  duplicated) %>%
                         reduce(`|`) %>%
                         `!` %>% 
                       cumsum)

data

df1 <- structure(list(color = c("blue", "blue", "red", "green", "purple"
), shape = c("triangle", "square", "circle", "hexagon", "hexagon"
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

How to assign unique value based on two columns combinations in Python?

If you simply want a unique identifier for each combination of GENUS and SPECIES you can do the following:

Note: In have assumed that either GENUS or SPECIES can contain a None value, which complicates the process slightly.

So Given a DF of the form:

    GENUS   SPECIES
0   Murina  Coelodonta
1   Murina  Microtherium
2   Microtherium    Murina
3   Bachitherium    Microtherium
4   Coelodonta  None
5   Coelodonta  Coelodonta
6   Microtherium    Coelodonta
7   Microtherium    Murina
8   Microtherium    Bachitherium
9   Murina  Microtherium

Add a column which uniquely identifies each combination of GENUS and SPECIES. We call this Column 'ID'.

Define a function to create a hash of entries, taking into account the possibility of a None entry.

def hashValues(g, s):
    if g == None:
        g = "None"
    if s == None:
        s = 'None'
    return hash(g + s)

To add the column use the following:

df['ID'] = [hashValues(df['GENUS'].to_list()[i], df['SPECIES'].to_list()[i]) for i in range(df.shape[0])]

which yields:

    GENUS           SPECIES         ID
0   Murina          Coelodonta      -6583287505830614713
1   Murina          Microtherium    6019734726691011903
2   Microtherium    Murina          -2318069015748475190
3   Bachitherium    Microtherium    5795352218934423262
4   Coelodonta      None            4851538573581845777
5   Coelodonta      Coelodonta      -5115794138222494493
6   Microtherium    Coelodonta      2603682196287415014
7   Microtherium    Murina          -2318069015748475190
8   Microtherium    Bachitherium    -2746445536675711990
9   Murina          Microtherium    6019734726691011903

Create unique id based on the relation between two columns

For each row create a new variable (maybe a tuple) that have the members of that team.

Id  TeamId  UserId  NewVar
43  504     722     (722, 727)
44  504     727     (722, 727)
45  601     300     (300)
46  602     722     (722, 727)
47  602     727     (722, 727)
48  605     300     (300)
49  777     300     (300, 301)
50  777     301     (300, 301)
51  788     400     (400)
52  789     400     (400)
53  100     727     (727)

after this step compare the NewVar and assign the id
Ps: don't forget to order the NewVar

Assign Unique Id Based on Two Columns