How to Create a Rank Variable Under Certain Conditions

How to create a rank variable under certain conditions?

This is a tricky one. The solution below uses non-equi joins to aggregate by 36 hours periods, dcast() to reshape from long to wide format, and a second join with the original dat. There can be an arbitrary number of brands.

library(data.table)
library(lubridate)

setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, shoptime) # not required, just for convenience of observers
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
      , dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")][
        dat, on = "shoptime"]

               shoptime brand1 brand2 brand3 brand5 brand6  brand
 1: 2013-09-01 08:35:00     NA     NA     NA     NA     NA brand1
 2: 2013-09-01 08:54:00      1     NA     NA     NA     NA brand1
 3: 2013-09-01 09:07:00      1     NA     NA     NA     NA brand2
 4: 2013-09-01 09:08:00      1      2     NA     NA     NA brand3
 5: 2013-09-01 09:11:00      1      2      2     NA     NA brand5
 6: 2013-09-01 09:14:00      1      2      2      2     NA brand2
 7: 2013-09-01 09:26:00      1      1      2      2     NA brand6
 8: 2013-09-01 09:26:00      1      1      2      2     NA brand2
 9: 2013-09-01 09:29:00      2      1      3      3      3 brand2
10: 2013-09-01 09:32:00      2      1      3      3      3 brand4

Explanation

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]]

returns the aggregated results per 36 hours periods:

               shoptime            shoptime  brand N rank
 1: 2013-08-30 20:54:00 2013-09-01 08:54:00 brand1 1    1
 2: 2013-08-30 21:07:00 2013-09-01 09:07:00 brand1 2    1
 3: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand1 2    1
 4: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand2 1    2
 5: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand1 2    1
 6: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand2 1    2
 7: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand3 1    2
 8: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand1 2    1
 9: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand2 1    2
10: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand3 1    2
11: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand5 1    2
12: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2    1
13: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2    1
14: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1    2
15: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1    2
16: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2    1
17: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2    1
18: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1    2
19: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1    2
20: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand1 2    2
21: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand2 3    1
22: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand3 1    3
23: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand5 1    3
24: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand6 1    3
25: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand1 2    2
26: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand2 4    1
27: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand3 1    3
28: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand5 1    3
29: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand6 1    3
               shoptime            shoptime  brand N rank

Then, this intermediate result is reshaped from long to wide format:

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
      , dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")]

              shoptime brand1 brand2 brand3 brand5 brand6
1: 2013-09-01 08:54:00      1     NA     NA     NA     NA
2: 2013-09-01 09:07:00      1     NA     NA     NA     NA
3: 2013-09-01 09:08:00      1      2     NA     NA     NA
4: 2013-09-01 09:11:00      1      2      2     NA     NA
5: 2013-09-01 09:14:00      1      2      2      2     NA
6: 2013-09-01 09:26:00      1      1      2      2     NA
7: 2013-09-01 09:29:00      2      1      3      3      3
8: 2013-09-01 09:32:00      2      1      3      3      3

The final right join with the original dat data frame completes the missing rows and columns (see code and result above).

Data

dat <- data.frame(
  shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
               "2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
  brand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2"  ,  "brand2"  ,   "brand4"   ),
  stringsAsFactors = FALSE)

How to create a rank-based column based on multiple column conditions in r dataframe

After grouping by 'user_id', create the 'new_rank1' by taking the lag of the cumsum of the logical vector

library(dplyr)
df %>% 
   group_by(user_id) %>% 
   mutate(new_rank1 = lag(cumsum(matric_1 > 15 & matric_2 > 20) + 1, default = 1))
# A tibble: 16 x 5
# Groups: user_id [4]
#   user_id matric_1 matric_2 new_rank new_rank1
#   <fctr>     <dbl>    <dbl>    <dbl>     <dbl>
# 1 a          10.0     10.0      1.00      1.00
# 2 a          23.0     25.0      1.00      1.00
# 3 a           4.00    10.0      2.00      2.00
# 4 a           5.00    13.0      2.00      2.00
# 5 b          17.0     21.0      1.00      1.00
# 6 b           5.00    10.0      2.00      2.00
# 7 b          40.0      7.00     2.00      2.00
# 8 c           1.00     3.00     1.00      1.00
# 9 c           2.00     4.00     1.00      1.00
#10 c          18.0     22.0      1.00      1.00
#11 c          19.0     21.0      2.00      2.00
#12 c           5.00     4.00     3.00      3.00
#13 d          18.0     23.0      1.00      1.00
#14 d           2.00     4.00     2.00      2.00
#15 d          19.0     21.0      2.00      2.00
#16 d           2.00     4.00     3.00      3.00

Ranking variables with conditions

Super easy to do with data.table:

library(data.table)
dt = data.table(df)

# do the ordering you like (note, could also use setkey to do this faster)
dt = dt[order(-successRate, -sales)]

dt[sales >= 50, rank := .I]
dt
#   store sales successRate rank
#1:     D    66        0.92    1
#2:     E    23        0.85   NA
#3:     A     9        0.80   NA
#4:     G    89        0.54    2
#5:     C    54        0.54    3
#6:     H    70        0.46    4
#7:     F   132        0.35    5
#8:     B   128        0.25    6

If you must do it in data.frame, then after your preferred order, run:

df$rank <- NA
df$rank[df$sales >= 50] <- seq_len(sum(df$sales >= 50))

How can I rank based on condition in Pandas

Use:

#convert columns to numeric
df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)

Remove row with CPI by condition - test rows if no only CPI per Cluster:

m = df['Group'].eq('CPI')
m1 = ~df['Cluster'].isin(df.loc[m, 'Cluster']) | m
df['RankRatio'] = df[m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)

print (df)
   Cluster  Variable Group  Ratio  Value  RankRatio  RankValue
0        1    GDP_M3   GDP   20.0   70.0        1.0        2.0
1        1    HPI_M6   HPI   40.0   80.0        3.0        1.0
2        1   GDP_lg2   GDP   35.0   50.0        2.0        3.0
3        2    CPI_M9   CPI   10.0   50.0        NaN        NaN
4        2   HPI_lg6   HPI   15.0   65.0        1.0        1.0
5        3  CPI_lg12   CPI   15.0   90.0        1.0        2.0
6        3   CPI_lg1   CPI   20.0   95.0        2.0        1.0

How it working:

For mask2 are filter all Cluster values if match mask1 and filtered original column Cluster, then invert mask by ~. Last chain both conditions by | for bitwise OR for all rows without CPI if exist with another values per Cluster:

print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[m, 'Cluster']), both = m1))
   Cluster  Variable Group  Ratio  Value  mask1  mask2   both
0        1    GDP_M3   GDP   20.0   70.0  False   True   True
1        1    HPI_M6   HPI   40.0   80.0  False   True   True
2        1   GDP_lg2   GDP   35.0   50.0  False   True   True
3        2    CPI_M9   CPI   10.0   50.0   True  False   True
4        2   HPI_lg6   HPI   15.0   65.0  False  False  False
5        3  CPI_lg12   CPI   15.0   90.0   True  False   True
6        3   CPI_lg1   CPI   20.0   95.0   True  False   True

EDIT:

df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)

m = df['Group'].isin(['CPI','HPI'])
m2 = df.groupby('Cluster')['Group'].transform('nunique').ne(1)
m1 = (~df['Cluster'].isin(df.loc[~m, 'Cluster']) | m) & m2
df['RankRatio'] = df[~m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[~m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)
print (df)
   Cluster  Variable Group  Ratio  Value  RankRatio  RankValue
0        1    GDP_M3   GDP   20.0   70.0        1.0        1.0
1        1    HPI_M6   HPI   40.0   80.0        NaN        NaN
2        1   GDP_lg2   GDP   35.0   50.0        2.0        2.0
3        2    CPI_M9   CPI   10.0   50.0        NaN        NaN
4        2   HPI_lg6   HPI   15.0   65.0        NaN        NaN
5        3  CPI_lg12   CPI   15.0   90.0        1.0        2.0
6        3   CPI_lg1   CPI   20.0   95.0        2.0        1.0

print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[~m, 'Cluster']), m2=m2, all = ~m1))
   Cluster  Variable Group  Ratio  Value  RankRatio  RankValue  mask1  mask2  \
0        1    GDP_M3   GDP   20.0   70.0        1.0        1.0  False  False   
1        1    HPI_M6   HPI   40.0   80.0        NaN        NaN   True  False   
2        1   GDP_lg2   GDP   35.0   50.0        2.0        2.0  False  False   
3        2    CPI_M9   CPI   10.0   50.0        NaN        NaN   True   True   
4        2   HPI_lg6   HPI   15.0   65.0        NaN        NaN   True   True   
5        3  CPI_lg12   CPI   15.0   90.0        1.0        2.0   True   True   
6        3   CPI_lg1   CPI   20.0   95.0        2.0        1.0   True   True   

      m2    all  
0   True   True  
1   True  False  
2   True   True  
3   True  False  
4   True  False  
5  False   True  
6  False   True

Creating a new ranking column based on conditions in R

first group_by the "per company, profile_id" variables and then apply rank() :

library(dplyr)
df %>% group_by(company, profile_id) %>% mutate(rank = rank(price))

library(data.table)
df[,rank:=rank(price),by = .(company, profile_id)]

#     id profile_id company product price  rank
#1     1          1       A    book 10.42     2
#2     2          1       A   shirt 23.91     3
#3     3          1       A     cup  5.95     1
#4     4          2       B    book  7.99     2
#5     5          2       B   shirt  5.95     1
#6     6          2       B     cup 11.76     3

Create a ranking variable with dplyr?

It sounds like you're looking for dense_rank from "dplyr" -- but applied in a reverse order than what rank normally does.

Try this:

df %>% mutate(rank = dense_rank(desc(score)))
#   name score rank
# 1    A    10    1
# 2    B    10    1
# 3    C     9    2
# 4    D     8    3

R-ranking values of a column by grouping, conditional to another variable

You could try

library(dplyr)
 df %>%
    group_by(group, col1) %>% 
    mutate(rank=replace(min_rank(col2), col1=='b',0) )
#    group col1 col2 rank
#1     x    a   22    1
#2     x    a   23    2
#3     x    b   16    0
#4     x    b   18    0
#5     y    a   11    1
#6     y    a   12    2
#7     y    a   16    3
#8     y    a   45    4
#9     y    b   24    0

If you don't want gaps between ranks when there are ties, replace min_rank with dense_rank

Or, instead of replace

 res <- df %>% 
          group_by(group, col1) %>% 
          mutate(rank=(col1!='b')*min_rank(col2))

 as.data.frame(res) #would be `data.frame`
 #    group col1 col2 rank
 #1     x    a   22    1
 #2     x    a   23    2
 #3     x    b   16    0
 #4     x    b   18    0
 #5     y    a   11    1
 #6     y    a   12    2
 #7     y    a   16    3
 #8     y    a   45    4
 #9     y    b   24    0

How to Create a Rank Variable Under Certain Conditions