Assign unique ID based on two columns
We can do this in base R
without doing any group by operation
df$ID <- cumsum(!duplicated(df[1:2]))
df
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
NOTE: Assuming that 'School' and 'Student' are ordered
Or using tidyverse
library(dplyr)
df %>%
mutate(ID = group_indices_(df, .dots=c("School", "Student")))
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
As @radek mentioned, in the recent version (dplyr_0.8.0
), we get the notification that group_indices_
is deprecated, instead use group_indices
df %>%
mutate(ID = group_indices(., School, Student))
How to create a unique identifier based on multiple columns?
You could try with pd.Series.factorize
:
df.set_index(['brand','description']).index.factorize()[0]+1
Output:
0 1
1 2
2 3
3 4
4 5
5 1
So you could try this, to assign it to be the first column:
df.insert(loc=0, column='product_key', value=df.set_index(['brand','description']).index.factorize()[0]+1)
Output:
df
product_key brand description former_price discounted_price
0 1 A icecream 1099.0 855.0
1 2 A cheese 469.0 375.0
2 3 B catfood 179.0 119.0
3 4 C NaN 699.0 399.0
4 5 NaN icecream 769.0 549.0
5 1 A icecream 769.0 669.0
Assign unique ID to combination of two columns in pandas dataframe independently on their order
Try np.sort
:
a = np.sort(df, axis=1)
df['id'] = df.groupby([a[:,0],a[:,1]]).ngroup() + 1
Output:
col1 col2 id
0 1 2 1
1 2 1 1
2 2 3 2
3 3 2 2
4 3 4 3
5 4 3 3
Assign an ID based on two columns R
You could try:
library(dplyr)
df %>%
group_by(email) %>%
mutate(ID = row_number())
Which gives:
#Source: local data frame [13 x 4]
#Groups: email
#
# row_num email wk_id ID
#1 1 aaaa 1/4/15 1
#2 2 aaaa 1/11/15 2
#3 3 aaaa 1/25/15 3
#4 4 bbbb 6/29/14 1
#5 5 bbbb 9/7/14 2
#6 6 cccc 11/16/14 1
#7 7 cccc 11/30/14 2
#8 8 cccc 12/7/14 3
#9 9 cccc 12/14/14 4
#10 10 cccc 12/21/14 5
#11 11 cccc 12/28/14 6
#12 12 cccc 1/4/15 7
#13 13 cccc 1/25/15 8
Or using data.table
library(data.table)
setDT(df)[, ID:= 1:.N, email]
Or ave
from base R
df$ID <- with(df, ave(row_num, email, FUN=seq_along))
Assign unique id each time two fields/columns are an exact match
Using DENSE_RANK()
should do the trick:
SELECT *,
DENSE_RANK() OVER(ORDER by person_id, datetime) as unique_combination_id
FROM tbl;
DEMO
Assign unique ID based on match between two columns in PySpark Dataframe
I put my explanation in the code. Just be careful about using Window
without partition your data, this operation will move all of your data to a single node.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
# [...] Your dataframe initialization
# Creating an index to retrieve original dataframe at the end
df = df.withColumn('index', f.monotonically_increasing_id())
w = Window.orderBy('least')
# Creating a column with least value from `Column1` and `Column2`. This will be used to "group" the values that must have the same ID
df = df.withColumn('least', f.least(f.col('Column1'), f.col('Column2')))
# Check if the current or previous `flag` is false to increase the id
df = df.withColumn('increase', ((~f.col('flag')) | (~f.lag('flag', default=True).over(w))).cast('int'))
# Generating incremental id
df = df.withColumn('ID', f.lit(101) + f.sum('increase').over(w))
(df
.select('ID', 'Column1')
.drop_duplicates()
.sort('index')
.show(truncate=False))
Output
+---+-------+
|ID |Column1|
+---+-------+
|101|1 |
|101|2 |
|101|3 |
|102|4 |
|103|5 |
|104|6 |
|104|7 |
|101|9 |
|105|8 |
+---+-------+
Assign unique ID based on values in EITHER of two columns
We can use duplicated
in base R
df1$group_id <- cumsum(!Reduce(`|`, lapply(df1, duplicated)))
-output
df1
# A tibble: 5 x 3
# color shape group_id
# <chr> <chr> <int>
#1 blue triangle 1
#2 blue square 1
#3 red circle 2
#4 green hexagon 3
#5 purple hexagon 3
Or using tidyverse
library(dplyr)
library(purrr)
df1 %>%
mutate(group_id = map(., duplicated) %>%
reduce(`|`) %>%
`!` %>%
cumsum)
data
df1 <- structure(list(color = c("blue", "blue", "red", "green", "purple"
), shape = c("triangle", "square", "circle", "hexagon", "hexagon"
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
How to assign unique value based on two columns combinations in Python?
If you simply want a unique identifier for each combination of GENUS and SPECIES you can do the following:
Note: In have assumed that either GENUS or SPECIES can contain a None
value, which complicates the process slightly.
So Given a DF of the form:
GENUS SPECIES
0 Murina Coelodonta
1 Murina Microtherium
2 Microtherium Murina
3 Bachitherium Microtherium
4 Coelodonta None
5 Coelodonta Coelodonta
6 Microtherium Coelodonta
7 Microtherium Murina
8 Microtherium Bachitherium
9 Murina Microtherium
Add a column which uniquely identifies each combination of GENUS and SPECIES. We call this Column 'ID'.
Define a function to create a hash of entries, taking into account the possibility of a None
entry.
def hashValues(g, s):
if g == None:
g = "None"
if s == None:
s = 'None'
return hash(g + s)
To add the column use the following:
df['ID'] = [hashValues(df['GENUS'].to_list()[i], df['SPECIES'].to_list()[i]) for i in range(df.shape[0])]
which yields:
GENUS SPECIES ID
0 Murina Coelodonta -6583287505830614713
1 Murina Microtherium 6019734726691011903
2 Microtherium Murina -2318069015748475190
3 Bachitherium Microtherium 5795352218934423262
4 Coelodonta None 4851538573581845777
5 Coelodonta Coelodonta -5115794138222494493
6 Microtherium Coelodonta 2603682196287415014
7 Microtherium Murina -2318069015748475190
8 Microtherium Bachitherium -2746445536675711990
9 Murina Microtherium 6019734726691011903
Create unique id based on the relation between two columns
For each row create a new variable (maybe a tuple) that have the members of that team.
Id TeamId UserId NewVar
43 504 722 (722, 727)
44 504 727 (722, 727)
45 601 300 (300)
46 602 722 (722, 727)
47 602 727 (722, 727)
48 605 300 (300)
49 777 300 (300, 301)
50 777 301 (300, 301)
51 788 400 (400)
52 789 400 (400)
53 100 727 (727)
after this step compare the NewVar and assign the id
Ps: don't forget to order the NewVar
Related Topics
Moving Average of Previous Three Values in R
R - Use Rbind on Multiple Variables with Similar Names
Why Is Using Update on a Lm Inside a Grouped Data.Table Losing Its Model Data
How to Use the Strsplit Function with a Period
Select Values from Different Columns Based on a Variable Containing Column Names
How to Draw Stacked Bars in Ggplot2 That Show Percentages Based on Group
Unicode Characters in Ggplot2 PDF Output
Rolling Join on Data.Table with Duplicate Keys
Using Cut and Quartile to Generate Breaks in R Function
Assign Unique Id Based on Two Columns
How to Escape a Backslash in R
How to Generate a Matrix of Combinations
Different Size Facets Proportional of X Axis on Ggplot 2 R
How to Get Name of Variable in R (Substitute)
Efficient Way to Filter One Data Frame by Ranges in Another
How to Add a Cumulative Column to an R Dataframe Using Dplyr