How to Create a Rank Variable Under Certain Conditions

How to create a rank variable under certain conditions?

This is a tricky one. The solution below uses non-equi joins to aggregate by 36 hours periods, dcast() to reshape from long to wide format, and a second join with the original dat. There can be an arbitrary number of brands.

library(data.table)
library(lubridate)

setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, shoptime) # not required, just for convenience of observers
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
nomatch = 0L, by = .EACHI,
.SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
, dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")][
dat, on = "shoptime"]
               shoptime brand1 brand2 brand3 brand5 brand6  brand
1: 2013-09-01 08:35:00 NA NA NA NA NA brand1
2: 2013-09-01 08:54:00 1 NA NA NA NA brand1
3: 2013-09-01 09:07:00 1 NA NA NA NA brand2
4: 2013-09-01 09:08:00 1 2 NA NA NA brand3
5: 2013-09-01 09:11:00 1 2 2 NA NA brand5
6: 2013-09-01 09:14:00 1 2 2 2 NA brand2
7: 2013-09-01 09:26:00 1 1 2 2 NA brand6
8: 2013-09-01 09:26:00 1 1 2 2 NA brand2
9: 2013-09-01 09:29:00 2 1 3 3 3 brand2
10: 2013-09-01 09:32:00 2 1 3 3 3 brand4

Explanation

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
nomatch = 0L, by = .EACHI,
.SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]]

returns the aggregated results per 36 hours periods:

               shoptime            shoptime  brand N rank
1: 2013-08-30 20:54:00 2013-09-01 08:54:00 brand1 1 1
2: 2013-08-30 21:07:00 2013-09-01 09:07:00 brand1 2 1
3: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand1 2 1
4: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand2 1 2
5: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand1 2 1
6: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand2 1 2
7: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand3 1 2
8: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand1 2 1
9: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand2 1 2
10: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand3 1 2
11: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand5 1 2
12: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2 1
13: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2 1
14: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1 2
15: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1 2
16: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2 1
17: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2 1
18: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1 2
19: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1 2
20: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand1 2 2
21: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand2 3 1
22: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand3 1 3
23: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand5 1 3
24: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand6 1 3
25: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand1 2 2
26: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand2 4 1
27: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand3 1 3
28: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand5 1 3
29: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand6 1 3
shoptime shoptime brand N rank

Then, this intermediate result is reshaped from long to wide format:

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
nomatch = 0L, by = .EACHI,
.SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
, dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")]
              shoptime brand1 brand2 brand3 brand5 brand6
1: 2013-09-01 08:54:00 1 NA NA NA NA
2: 2013-09-01 09:07:00 1 NA NA NA NA
3: 2013-09-01 09:08:00 1 2 NA NA NA
4: 2013-09-01 09:11:00 1 2 2 NA NA
5: 2013-09-01 09:14:00 1 2 2 2 NA
6: 2013-09-01 09:26:00 1 1 2 2 NA
7: 2013-09-01 09:29:00 2 1 3 3 3
8: 2013-09-01 09:32:00 2 1 3 3 3

The final right join with the original dat data frame completes the missing rows and columns (see code and result above).

Data

dat <- data.frame(
shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
"2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
brand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
stringsAsFactors = FALSE)

How to create a rank-based column based on multiple column conditions in r dataframe

After grouping by 'user_id', create the 'new_rank1' by taking the lag of the cumsum of the logical vector

library(dplyr)
df %>%
group_by(user_id) %>%
mutate(new_rank1 = lag(cumsum(matric_1 > 15 & matric_2 > 20) + 1, default = 1))
# A tibble: 16 x 5
# Groups: user_id [4]
# user_id matric_1 matric_2 new_rank new_rank1
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 a 10.0 10.0 1.00 1.00
# 2 a 23.0 25.0 1.00 1.00
# 3 a 4.00 10.0 2.00 2.00
# 4 a 5.00 13.0 2.00 2.00
# 5 b 17.0 21.0 1.00 1.00
# 6 b 5.00 10.0 2.00 2.00
# 7 b 40.0 7.00 2.00 2.00
# 8 c 1.00 3.00 1.00 1.00
# 9 c 2.00 4.00 1.00 1.00
#10 c 18.0 22.0 1.00 1.00
#11 c 19.0 21.0 2.00 2.00
#12 c 5.00 4.00 3.00 3.00
#13 d 18.0 23.0 1.00 1.00
#14 d 2.00 4.00 2.00 2.00
#15 d 19.0 21.0 2.00 2.00
#16 d 2.00 4.00 3.00 3.00

Ranking variables with conditions

Super easy to do with data.table:

library(data.table)
dt = data.table(df)

# do the ordering you like (note, could also use setkey to do this faster)
dt = dt[order(-successRate, -sales)]

dt[sales >= 50, rank := .I]
dt
# store sales successRate rank
#1: D 66 0.92 1
#2: E 23 0.85 NA
#3: A 9 0.80 NA
#4: G 89 0.54 2
#5: C 54 0.54 3
#6: H 70 0.46 4
#7: F 132 0.35 5
#8: B 128 0.25 6

If you must do it in data.frame, then after your preferred order, run:

df$rank <- NA
df$rank[df$sales >= 50] <- seq_len(sum(df$sales >= 50))

How can I rank based on condition in Pandas

Use:

#convert columns to numeric
df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)

Remove row with CPI by condition - test rows if no only CPI per Cluster:

m = df['Group'].eq('CPI')
m1 = ~df['Cluster'].isin(df.loc[m, 'Cluster']) | m
df['RankRatio'] = df[m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)

print (df)
Cluster Variable Group Ratio Value RankRatio RankValue
0 1 GDP_M3 GDP 20.0 70.0 1.0 2.0
1 1 HPI_M6 HPI 40.0 80.0 3.0 1.0
2 1 GDP_lg2 GDP 35.0 50.0 2.0 3.0
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN
4 2 HPI_lg6 HPI 15.0 65.0 1.0 1.0
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0

How it working:

For mask2 are filter all Cluster values if match mask1 and filtered original column Cluster, then invert mask by ~. Last chain both conditions by | for bitwise OR for all rows without CPI if exist with another values per Cluster:

print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[m, 'Cluster']), both = m1))
Cluster Variable Group Ratio Value mask1 mask2 both
0 1 GDP_M3 GDP 20.0 70.0 False True True
1 1 HPI_M6 HPI 40.0 80.0 False True True
2 1 GDP_lg2 GDP 35.0 50.0 False True True
3 2 CPI_M9 CPI 10.0 50.0 True False True
4 2 HPI_lg6 HPI 15.0 65.0 False False False
5 3 CPI_lg12 CPI 15.0 90.0 True False True
6 3 CPI_lg1 CPI 20.0 95.0 True False True

EDIT:

df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)

m = df['Group'].isin(['CPI','HPI'])
m2 = df.groupby('Cluster')['Group'].transform('nunique').ne(1)
m1 = (~df['Cluster'].isin(df.loc[~m, 'Cluster']) | m) & m2
df['RankRatio'] = df[~m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[~m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)
print (df)
Cluster Variable Group Ratio Value RankRatio RankValue
0 1 GDP_M3 GDP 20.0 70.0 1.0 1.0
1 1 HPI_M6 HPI 40.0 80.0 NaN NaN
2 1 GDP_lg2 GDP 35.0 50.0 2.0 2.0
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN
4 2 HPI_lg6 HPI 15.0 65.0 NaN NaN
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0


print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[~m, 'Cluster']), m2=m2, all = ~m1))
Cluster Variable Group Ratio Value RankRatio RankValue mask1 mask2 \
0 1 GDP_M3 GDP 20.0 70.0 1.0 1.0 False False
1 1 HPI_M6 HPI 40.0 80.0 NaN NaN True False
2 1 GDP_lg2 GDP 35.0 50.0 2.0 2.0 False False
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN True True
4 2 HPI_lg6 HPI 15.0 65.0 NaN NaN True True
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0 True True
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0 True True

m2 all
0 True True
1 True False
2 True True
3 True False
4 True False
5 False True
6 False True

Creating a new ranking column based on conditions in R

first group_by the "per company, profile_id" variables and then apply rank() :

library(dplyr)
df %>% group_by(company, profile_id) %>% mutate(rank = rank(price))

library(data.table)
df[,rank:=rank(price),by = .(company, profile_id)]

# id profile_id company product price rank
#1 1 1 A book 10.42 2
#2 2 1 A shirt 23.91 3
#3 3 1 A cup 5.95 1
#4 4 2 B book 7.99 2
#5 5 2 B shirt 5.95 1
#6 6 2 B cup 11.76 3

Create a ranking variable with dplyr?

It sounds like you're looking for dense_rank from "dplyr" -- but applied in a reverse order than what rank normally does.

Try this:

df %>% mutate(rank = dense_rank(desc(score)))
# name score rank
# 1 A 10 1
# 2 B 10 1
# 3 C 9 2
# 4 D 8 3

R-ranking values of a column by grouping, conditional to another variable

You could try

library(dplyr)
df %>%
group_by(group, col1) %>%
mutate(rank=replace(min_rank(col2), col1=='b',0) )
# group col1 col2 rank
#1 x a 22 1
#2 x a 23 2
#3 x b 16 0
#4 x b 18 0
#5 y a 11 1
#6 y a 12 2
#7 y a 16 3
#8 y a 45 4
#9 y b 24 0

If you don't want gaps between ranks when there are ties, replace min_rank with dense_rank

Or, instead of replace

 res <- df %>% 
group_by(group, col1) %>%
mutate(rank=(col1!='b')*min_rank(col2))

as.data.frame(res) #would be `data.frame`
# group col1 col2 rank
#1 x a 22 1
#2 x a 23 2
#3 x b 16 0
#4 x b 18 0
#5 y a 11 1
#6 y a 12 2
#7 y a 16 3
#8 y a 45 4
#9 y b 24 0


Related Topics



Leave a reply



Submit