Flag First By-Group in R Data Frame


> df$first_ind <- as.numeric(!duplicated(df$id))
> df
  id score first_ind
1  1    15         1
2  1    18         0
3  1    16         0
4  2    10         1
5  2     9         0
6  3     8         1
7  3    47         0
8  3    21         0

How can I mark (flag) first unique record as 1 and the rest similar records as 0 in data frame in R

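Use duplicated(). It returns TRUE for every record that repeats an earlier one and FALSE otherwise, so negate it with the ! (bang) operator to mark first occurrences. The unary + in front of the logical vector converts TRUE/FALSE to 1/0.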

df %>% mutate(drive = +!duplicated(paste(date, adress)))

    date adress drive
1  28.03    bla     1
2  28.03    xyz     1
3  17.03    abc     1
4  30.03    yxz     1
5  24.03    bla     1
6  17.03    abc     0
7  23.03    abc     1
8  28.03    bla     0
9  24.03    bla     0
10 24.03    bla     0
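As a self-contained sketch of the same idea, the snippet below rebuilds the ten rows shown above and passes the two key columns to duplicated() directly instead of pasting them together. Dropping paste() is my own variation, not part of the original answer; it avoids the (unlikely) case where two different date/adress pairs paste to the same string.

```r
# Data copied from the output above (the column name "adress" is kept as-is).
df <- data.frame(
  date   = c("28.03", "28.03", "17.03", "30.03", "24.03",
             "17.03", "23.03", "28.03", "24.03", "24.03"),
  adress = c("bla", "xyz", "abc", "yxz", "bla",
             "abc", "abc", "bla", "bla", "bla")
)

# duplicated() on a data frame compares whole rows of the selected columns.
dup <- duplicated(df[c("date", "adress")])

# !dup is TRUE for the first occurrence of each pair; unary + turns it into 1/0.
df$drive <- +!dup
df
```

The drive column should match the one produced by the paste() version for this data.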

R - Identify -n rows before and after a flag by group

a[zoo::rollapply(a$x, 5, function(z) "1" %in% z, partial = TRUE),]
#    x  y z
# 2     6 a
# 3     4 a
# 4  1  4 a
# 5     7 a
# 6     9 a
# 10    4 b
# 11    5 b
# 12 1  8 b
# 13    6 b
# 14    1 b
# 15    2 c
# 16    4 c
# 17 1  6 c
# 18   16 c

zoo::rollapply operates on "windows" of values at a time. Here the width is five, which means it looks at five values and returns a single value; it then shifts the window by one position (so four of the five values are shared with the previous window) and returns another single value; and so on.

Because I specified partial=TRUE (necessary when you need the output length to be the same as the input length), the length of values looked at might not be the same as the kernel width (5).

The point is that if any of the five values in a window is a "1", then the row at the center of that window is within 2 rows of a "1" and should be retained.

An important property of the window is alignment, where the default is center. It defines where in the window the results go.

In this case, the windows look like:

#  [1] ""  ""  ""  "1" ""  ""  ""  ""  ""  ""  ""  "1" ""  ""  ""  ""  "1" "" 
1:     nn-------' (partial match)
2:     ----yy--------' (partial)
3:     `-------yy-------'  there is a "1" in this set of five, so a true ("yy")
4:         `-------yy-------'
5:             `-------yy-------'
6:                 `-------yy-------'
7:                     `-------nn-------' no "1", so a false
... etc
#  [1] ""  ""  ""  "1" ""  ""  ""  ""  ""  ""  ""  "1" ""  ""  ""  ""  "1" ""

You can see in the first seven windows that the first is discarded (there is not a "1" close enough), we have five true ("yy" in my nomenclature), and then we get a false ("nn") since it does not see a "1".
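To make the window logic concrete without needing zoo, here is a rough base-R sketch of a centered window of width 5 with partial windows at the edges. It is not how rollapply is implemented; the vector x is copied from the diagram above, and the names half and near_flag are just illustrative.

```r
# The flag vector from the diagram above: "1" marks a flagged row.
x <- c("", "", "", "1", "", "", "", "", "", "",
       "", "1", "", "", "", "", "1", "")

half <- 2L  # (width - 1) / 2 for a centered window of width 5

# For each position, look at up to 2 values on either side (fewer at the
# edges, which mimics partial = TRUE) and test whether a "1" is present.
near_flag <- vapply(seq_along(x), function(i) {
  lo <- max(1L, i - half)
  hi <- min(length(x), i + half)
  "1" %in% x[lo:hi]
}, logical(1))

which(near_flag)
```

The positions returned should be the same row numbers that were retained in the output above (2 to 6 and 10 to 18).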

Identifying the first rows in a data frame grouped by an ID and date

For each ID, assign 1 to first on every row whose date equals the first date for that ID, and 0 otherwise. With dplyr this can be written as:

library(dplyr)

dt %>%
  group_by(ID) %>%
  mutate(first = as.integer(as.Date(date) == first(as.Date(date)))) %>%
  ungroup

and in data.table :

library(data.table)
setDT(dt)[, first := as.integer(as.Date(date) == first(as.Date(date))), ID]
dt

#    ID       date type first
# 1:  1 2012-03-21    A     1
# 2:  1 2012-03-21    C     1
# 3:  1 2013-02-11    B     0
# 4:  1 2013-11-19    A     0
# 5:  1 2013-11-19    B     0
# 6:  2 2012-01-03    C     1
# 7:  2 2012-01-14    C     0
# 8:  2 2013-04-07    A     0
# 9:  2 2013-04-07    B     0
#10:  2 2013-04-07    C     0
#11:  2 2014-05-17    C     0
#12:  3 2015-09-23    A     1
#13:  3 2015-12-14    A     0
#14:  4 2014-07-09    C     1
#15:  5 2012-02-28    C     1
#16:  5 2012-02-28    C     1
#17:  6 2013-04-19    C     1
#18:  6 2013-04-19    B     1
#19:  6 2014-10-05    B     0
#20:  6 2014-10-05    A     0
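If neither package is available, roughly the same result can be had in base R with ave(). This is a sketch on a small made-up subset of the data (first five rows of IDs 1 and 2), not the original answer's code; like first(), it takes the first date in row order rather than the minimum.

```r
# Small illustrative subset of the data shown above.
dt <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  date = c("2012-03-21", "2012-03-21", "2013-02-11",
           "2012-01-03", "2012-01-14")
)

d <- as.Date(dt$date)

# ave() applies FUN within each ID and returns a vector as long as the input,
# so every row receives the (numeric) first date of its own group.
first_date <- ave(as.numeric(d), dt$ID, FUN = function(v) v[1])

dt$first <- as.integer(as.numeric(d) == first_date)
dt
```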

How to flag the first occurrence of a character string in a data frame and all rows after?

Yes, this can be done with dplyr and character strings. Here's one way:

library(dplyr)
library(tidyr)  # for fill()

df %>%
  mutate(across(.cols = c("STATE_1", "STATE_2"), ~ na_if(., "NULL"))) %>%
  group_by(ID) %>%
  fill(STATE_1, STATE_2, .direction = "down") %>%
  mutate(flag = if_else(is.na(lag(coalesce(STATE_1, STATE_2))),
                        coalesce(STATE_1, STATE_2),
                        NA_character_)) %>%
  fill(flag, .direction = "down")

   ID STATE_1 STATE_2 flag
1   1    <NA>    <NA> <NA>
2   1     FRY    <NA>  FRY
3   1     FRY     CRY  FRY
4   1     FRY     CRY  FRY
5   1     FRY     CRY  FRY
6   1     FRY     CRY  FRY
7   1     FRY     CRY  FRY
8   1     FRY     CRY  FRY
9   5    <NA>    <NA> <NA>
10  5    <NA>     CRY  CRY
11  5     FRY     CRY  CRY
12  5     FRY     CRY  CRY

Step by step, here's what this solution does:

(Convert the character "NULL" values to NA.)
Group by ID so that we operate within each ID separately.
Use fill() to carry each non-null value in STATE_1 and STATE_2 down into the null values below it, until we reach the next non-null value. (You can also fill "up" instead of down; down is the default direction, but I've included it explicitly for the sake of clarity.)
Create the flag field. If both states in the immediately preceding row have null values (as identified by lag()), then we want to trigger a new flag; use coalesce() to prioritize STATE_1 over STATE_2. Otherwise, we don't want a new flag; populate with NA for now.
Use fill() again to populate flags downward: each new flag populates the rows below it until we get to a new flag.
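The steps can be tried end to end on a small input. The original df is not shown in this answer, so the data below is made up for illustration (only the column names and the "NULL" placeholder come from the code above); treat it as a sketch rather than a reproduction.

```r
library(dplyr)
library(tidyr)

# Hypothetical input: "NULL" is stored as a character string, not a real NA.
df_demo <- tibble(
  ID      = c(1, 1, 1, 1, 5, 5, 5),
  STATE_1 = c("NULL", "FRY", "NULL", "NULL", "NULL", "NULL", "FRY"),
  STATE_2 = c("NULL", "NULL", "CRY", "NULL", "NULL", "CRY", "NULL")
)

res <- df_demo %>%
  # Step 1: turn the "NULL" strings into real missing values.
  mutate(across(c(STATE_1, STATE_2), ~ na_if(., "NULL"))) %>%
  # Step 2: work within each ID.
  group_by(ID) %>%
  # Step 3: carry each state down into the missing rows below it.
  fill(STATE_1, STATE_2, .direction = "down") %>%
  # Step 4: start a flag only where the previous row had no state at all.
  mutate(flag = if_else(is.na(lag(coalesce(STATE_1, STATE_2))),
                        coalesce(STATE_1, STATE_2),
                        NA_character_)) %>%
  # Step 5: carry each flag down to the rows below it.
  fill(flag, .direction = "down") %>%
  ungroup()

res
```

With this input, ID 1 is flagged FRY from its second row onward, and ID 5 is flagged CRY because STATE_2 appears there before STATE_1.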

This procedure also works for the scenario you described where row 10 has "FRY" for STATE_1:

df2 = df
df2$STATE_1[10] = "FRY"
df2 %>%
  mutate(across(.cols = -c("ID"), ~ na_if(., "NULL"))) %>%
  group_by(ID) %>%
  fill(STATE_1, STATE_2, .direction = "down") %>%
  mutate(flag = if_else(is.na(lag(coalesce(STATE_1, STATE_2))),
                        coalesce(STATE_1, STATE_2),
                        NA_character_)) %>%
  fill(flag, .direction = "down")

   ID STATE_1 STATE_2 flag
1   1    <NA>    <NA> <NA>
2   1     FRY    <NA>  FRY
3   1     FRY     CRY  FRY
4   1     FRY     CRY  FRY
5   1     FRY     CRY  FRY
6   1     FRY     CRY  FRY
7   1     FRY     CRY  FRY
8   1     FRY     CRY  FRY
9   5    <NA>    <NA> <NA>
10  5     FRY     CRY  FRY
11  5     FRY     CRY  FRY
12  5     FRY     CRY  FRY

Using data.table to flag the first (or last) record in a group

Here are a couple of solutions using data.table:

## Option 1 (cleaner solution, added 2016-11-29)
uDT <- unique(DT)
DT[, c("first","last"):=0L]
DT[uDT, first:=1L, mult="first"]
DT[uDT, last:=1L, mult="last"]

## Option 2 (original answer, retained for posterity)
DT <- cbind(DT, first=0L, last=0L)
DT[DT[unique(DT),,mult="first", which=TRUE], first:=1L]
DT[DT[unique(DT),,mult="last", which=TRUE], last:=1L]

head(DT)
#      x y first last
# [1,] a A     1    1
# [2,] a B     1    1
# [3,] a C     1    0
# [4,] a C     0    1
# [5,] b A     1    1
# [6,] b B     1    1

There's obviously a lot packed into each of those lines. The key construct, though, is the following, which returns the row index of the first record in each group:

DT[unique(DT),,mult="first", which=TRUE]
# [1]  1  2  3  5  6  7 11 13 15
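For comparison, the same first/last flags can be sketched in base R with duplicated(), scanning from the top for "first" and from the bottom (fromLast = TRUE) for "last". This is a different technique from the keyed join above, shown on a plain data frame; the five rows are taken from the head(DT) output, and the name DF is just illustrative.

```r
# First five rows of the example shown in head(DT) above.
DF <- data.frame(
  x = c("a", "a", "a", "a", "b"),
  y = c("A", "B", "C", "C", "A")
)

keys <- DF[c("x", "y")]

# A row is "first" if no earlier row has the same (x, y) pair ...
DF$first <- as.integer(!duplicated(keys))
# ... and "last" if no later row has the same pair.
DF$last <- as.integer(!duplicated(keys, fromLast = TRUE))
DF
```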

How to flag first change in a variable value between years, per group?

A possible solution using dplyr. I'm not sure it's the cleanest way, though.

sample %>% 
  group_by(id) %>% 
  #find first year per group where code exists
  mutate(first_year = min(year[code != ""])) %>% 
  #gather all codes from first year (does not assume code is constant within year)
  mutate(first_codes = list(code[year==first_year])) %>% 
  #if year is not first year & code not in first year codes & code not blank
  mutate(flag = as.numeric(year != first_year & !(code %in% unlist(first_codes)) & code != "")) %>% 
  #drop created columns
  select(-first_year, -first_codes) %>% 
  ungroup()

output

# A tibble: 18 × 5
      id  year  type  code  flag
   <int> <int> <int> <chr> <dbl>
1      1  2010     1   abc     0
2      1  2010     2   abc     0
3      1  2011     1           0
4      1  2011     2           0
5      1  2012     1   xyz     1
6      1  2012     2   xyz     1
7      2  2010     1           0
8      2  2010     2           0
9      2  2011     1   lmn     0
10     2  2011     2           0
11     2  2012     1   efg     1
12     2  2012     2   efg     1
13     3  2010     1   def     0
14     3  2010     2   def     0
15     3  2011     1   klm     1
16     3  2011     2   klm     1
17     3  2012     1   nop     1
18     3  2012     2   nop     1

How to flag the last row of a data frame group?

You can group_by ID and replace the last row for each ID with 0.

library(dplyr)

df %>% 
  mutate(Calculate = Period * Value) %>%
  group_by(ID) %>%
  mutate(Calculate = replace(Calculate, n(), 0)) %>%
  ungroup

#     ID Period Value Calculate
#  <dbl>  <dbl> <dbl>     <dbl>
#1     1      1    10        10
#2     1      2    12        24
#3     1      3    11         0
#4     5      1     4         4
#5     5      2     6         0
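The key piece is replace(Calculate, n(), 0): within a group, n() is the group size, so it is also the index of the group's last row. A base-R sketch of the same idea, using ave() and length() in place of group_by() and n(), with the data reconstructed from the output above:

```r
# Input reconstructed from the printed result above.
df <- data.frame(
  ID     = c(1, 1, 1, 5, 5),
  Period = c(1, 2, 3, 1, 2),
  Value  = c(10, 12, 11, 4, 6)
)

df$Calculate <- df$Period * df$Value

# Within each ID, overwrite the element at position length(v), the last one.
df$Calculate <- ave(df$Calculate, df$ID,
                    FUN = function(v) replace(v, length(v), 0))
df
```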

Flagging the first time a record based on parameter appears in R dataframe

We can group by 'C_ID', 'Name', and create the 'MainRecord' with case_when

library(dplyr)
df1 %>%
  group_by(C_ID, Name) %>%
  mutate(MainRecord = case_when(row_number()==1 ~ "X", TRUE ~ ""))
# A tibble: 6 x 3
# Groups:   C_ID, Name [3]
#   C_ID Name  MainRecord
#  <int> <chr> <chr>     
#1     1 JM    X         
#2     1 JM    ""        
#3     1 JM    ""        
#4     2 DM    X         
#5     3 TY    X         
#6     3 TY    ""

Or another option is ifelse

df1 %>%
   group_by(C_ID, Name) %>% 
   mutate(MainRecord = ifelse(row_number()==1, "X", ""))

Or use indexing

df1 %>% 
   group_by(C_ID, Name) %>% 
   mutate(MainRecord = c("", "X")[(row_number()==1) + 1])

Or with data.table, get the row index with .I and assign (:=) the 'X' value that corresponds to the rows

library(data.table)
i1 <- setDT(df1)[, .I[seq_len(.N) == 1], .(C_ID, Name)]$V1
df1[i1, MainRecord := "X"]

Or with base R

i1 <- with(df1, ave(seq_along(C_ID), C_ID, Name, FUN = seq_along)==1)
df1$MainRecord[i1] <- "X"
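One thing to watch with the indexed assignment: if MainRecord does not exist yet, the rows that are not assigned "X" end up as NA rather than "". A small base-R sketch that fills both values in one step, using duplicated() on the grouping columns (the data is reconstructed from the tibble output above; is_first is an illustrative name):

```r
# Input reconstructed from the printed tibble above.
df1 <- data.frame(
  C_ID = c(1L, 1L, 1L, 2L, 3L, 3L),
  Name = c("JM", "JM", "JM", "DM", "TY", "TY")
)

# TRUE on the first row of each (C_ID, Name) combination.
is_first <- !duplicated(df1[c("C_ID", "Name")])

# ifelse() supplies a value for both cases, so no NA is left behind.
df1$MainRecord <- ifelse(is_first, "X", "")
df1
```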
