Flag first by-group in R data frame
> df$first_ind <- as.numeric(!duplicated(df$id))
> df
  id score first_ind
1  1    15         1
2  1    18         0
3  1    16         0
4  2    10         1
5  2     9         0
6  3     8         1
7  3    47         0
8  3    21         0
How can I mark (flag) the first unique record as 1 and the remaining similar records as 0 in a data frame in R?
Use duplicated(). It returns TRUE for a record that repeats an earlier one and FALSE otherwise, so negate it with the ! (bang) operator to mark first occurrences. A unary + before a logical converts it to numeric.
df %>% mutate(drive = +!duplicated(paste(date, adress)))
    date adress drive
1  28.03    bla     1
2  28.03    xyz     1
3  17.03    abc     1
4  30.03    yxz     1
5  24.03    bla     1
6  17.03    abc     0
7  23.03    abc     1
8  28.03    bla     0
9  24.03    bla     0
10 24.03    bla     0
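To see why the one-liner works, it may help to look at each intermediate step. Here is a minimal base R sketch using the date and adress values from the output above; the variable names key, dup, first and drive are only for illustration.

```r
# Values copied from the example output above
date   <- c("28.03", "28.03", "17.03", "30.03", "24.03",
            "17.03", "23.03", "28.03", "24.03", "24.03")
adress <- c("bla", "xyz", "abc", "yxz", "bla",
            "abc", "abc", "bla", "bla", "bla")

key   <- paste(date, adress)  # one string per date/adress combination
dup   <- duplicated(key)      # TRUE when the same key appeared in an earlier row
first <- !dup                 # TRUE only for the first occurrence of each key
drive <- +first               # unary plus turns TRUE/FALSE into 1/0

print(data.frame(key, dup, first, drive))
```

The drive column printed here should line up with the drive column in the output above.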
R - Identify -n rows before and after a flag *by group*
a[zoo::rollapply(a$x, 5, function(z) "1" %in% z, partial = TRUE),]
#    x  y z
# 2     6 a
# 3     4 a
# 4  1  4 a
# 5     7 a
# 6     9 a
# 10    4 b
# 11    5 b
# 12 1  8 b
# 13    6 b
# 14    1 b
# 15    2 c
# 16    4 c
# 17 1  6 c
# 18   16 c
zoo::rollapply operates on "windows" of numbers at a time. Here, the width is five, which means it looks at five values and returns a single value; then it shifts by one (four of the same values, plus one more) and returns a single value; and so on.
Because I specified partial=TRUE (necessary when you need the output length to be the same as the input length), the number of values looked at might not be the same as the kernel width (5).
The point is that if I'm looking at five at a time and one of them is a "1", then we're within 2 rows of a "1", and the row should be retained.
An important property of the window is alignment, where the default is center. It defines where in the window the result goes.
In this case, the windows look like:
# [1] "" "" "" "1" "" "" "" "" "" "" "" "1" "" "" "" "" "1" ""
1: rows 1-3 (partial window), no "1", so a false ("nn")
2: rows 1-4 (partial window), contains the "1" in row 4, so a true ("yy")
3: rows 1-5, there is a "1" in this set of five, so a true ("yy")
4: rows 2-6, "yy"
5: rows 3-7, "yy"
6: rows 4-8, "yy"
7: rows 5-9, no "1", so a false ("nn")
... etc
You can see in the first seven windows that the first is discarded (there is not a "1" close enough), we have five true ("yy" in my nomenclature), and then we get a false ("nn") since it does not see a "1".
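The same windowing idea can be sketched in base R without zoo. This is only an illustration of the logic, not what rollapply does internally: the helper name near_flag and its half argument are made up for this example, and the input vector is the x column shown in the diagram above.

```r
# x column from the diagram above: "1" marks a flagged row
x <- c("", "", "", "1", "", "", "", "", "", "", "",
       "1", "", "", "", "", "1", "")

# near_flag() is a hypothetical helper: for each position i it looks at the
# centered window i-half .. i+half, clipped at the ends of the vector
# (which mimics partial = TRUE), and reports whether the window holds a "1".
near_flag <- function(x, half = 2) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    window <- x[max(1, i - half):min(n, i + half)]
    "1" %in% window
  }, logical(1))
}

keep <- near_flag(x)
which(keep)  # row numbers that sit within 2 rows of a "1"
```

The result has one TRUE/FALSE per input row, so it can be used directly as a row index, just like the rollapply call.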
Identifying the first rows in a data frame grouped by an ID and date
For each ID, assigning 1 to first where the date is the same as the first date can be written as:
library(dplyr)
dt %>%
group_by(ID) %>%
mutate(first = as.integer(as.Date(date) == first(as.Date(date)))) %>%
ungroup
and in data.table:
library(data.table)
setDT(dt)[, first := as.integer(as.Date(date) == first(as.Date(date))), ID]
dt
# ID date type first
# 1: 1 2012-03-21 A 1
# 2: 1 2012-03-21 C 1
# 3: 1 2013-02-11 B 0
# 4: 1 2013-11-19 A 0
# 5: 1 2013-11-19 B 0
# 6: 2 2012-01-03 C 1
# 7: 2 2012-01-14 C 0
# 8: 2 2013-04-07 A 0
# 9: 2 2013-04-07 B 0
#10: 2 2013-04-07 C 0
#11: 2 2014-05-17 C 0
#12: 3 2015-09-23 A 1
#13: 3 2015-12-14 A 0
#14: 4 2014-07-09 C 1
#15: 5 2012-02-28 C 1
#16: 5 2012-02-28 C 1
#17: 6 2013-04-19 C 1
#18: 6 2013-04-19 B 1
#19: 6 2014-10-05 B 0
#20: 6 2014-10-05 A 0
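For comparison, here is a base R sketch of the same idea using ave(). The data frame toy is assumed for illustration; it reuses five ID/date rows from the output above rather than the full dt.

```r
# A few ID/date rows taken from the output above (toy is an illustrative name)
toy <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  date = c("2012-03-21", "2012-03-21", "2013-02-11",
           "2012-01-03", "2012-01-14")
)

d <- as.numeric(as.Date(toy$date))                     # dates as day counts
first_date <- ave(d, toy$ID, FUN = function(v) v[1])   # date of the first row within each ID
toy$first <- as.integer(d == first_date)               # 1 where the date equals that first date
toy
```

As in the dplyr and data.table versions, "first date" here means the date on the first row of each ID, so the data are assumed to be sorted by date within ID.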
How to flag the first occurrence of a character string in a data frame and all rows after?
Yes, this can be done with dplyr and character strings. Here's one way:
library(dplyr)
library(tidyr)  # provides fill()
df %>%
mutate(across(.cols = c("STATE_1", "STATE_2"), ~ na_if(., "NULL"))) %>%
group_by(ID) %>%
fill(STATE_1, STATE_2, .direction = "down") %>%
mutate(flag = if_else(is.na(lag(coalesce(STATE_1, STATE_2))),
coalesce(STATE_1, STATE_2),
NA_character_)) %>%
fill(flag, .direction = "down")
ID STATE_1 STATE_2 flag
1 1 <NA> <NA> <NA>
2 1 FRY <NA> FRY
3 1 FRY CRY FRY
4 1 FRY CRY FRY
5 1 FRY CRY FRY
6 1 FRY CRY FRY
7 1 FRY CRY FRY
8 1 FRY CRY FRY
9 5 <NA> <NA> <NA>
10 5 <NA> CRY CRY
11 5 FRY CRY CRY
12 5 FRY CRY CRY
Step by step, here's what this solution does:
- (Convert the character "NULL" values to NA.)
- Group by ID so that we operate within each ID separately.
- Use fill() to carry each non-null value in STATE_1 and STATE_2 down into the null values below it, until we reach the next non-null value. (You can also fill "up" instead of down; down is the default direction, but I've included it explicitly for the sake of clarity.)
- Create the flag field. If both states in the immediately preceding row have null values (as identified by lag()), then we want to trigger a new flag; use coalesce() to prioritize STATE_1 over STATE_2. Otherwise, we don't want a new flag; populate with NA for now.
- Use fill() again to populate flags downward: each new flag populates the rows below it until we get to a new flag.
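As a side illustration, the steps above can be traced by hand for a single ID in base R. This is only a sketch under stated assumptions: the STATE_1/STATE_2 values are hypothetical, and fill_down() is a made-up stand-in for tidyr's fill(), not its actual implementation.

```r
# Hypothetical values for a single ID, after "NULL" has been converted to NA
STATE_1 <- c(NA, "FRY", NA, NA)
STATE_2 <- c(NA, NA, "CRY", NA)

# fill_down() is an illustrative stand-in for fill(.direction = "down"):
# each NA takes the most recent non-NA value above it.
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values have been seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # slot 1 is the NA used before any value appears
}

s1    <- fill_down(STATE_1)                # fill STATE_1 downward
s2    <- fill_down(STATE_2)                # fill STATE_2 downward
both  <- ifelse(is.na(s1), s2, s1)         # like coalesce(): prefer STATE_1
prev  <- c(NA, head(both, -1))             # like lag(): value from the previous row
start <- ifelse(is.na(prev), both, NA)     # new flag only when the previous row was empty
flag  <- fill_down(start)                  # carry the flag down
flag
```

Printing s1, s2 and start along the way shows where the flag is first triggered and how the second fill spreads it to the rows below.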
This procedure also works for the scenario you described where row 10 has "FRY" for STATE_1:
df2 = df
df2$STATE_1[10] = "FRY"
df2 %>%
mutate(across(.cols = -c("ID"), ~ na_if(., "NULL"))) %>%
group_by(ID) %>%
fill(STATE_1, STATE_2, .direction = "down") %>%
mutate(flag = if_else(is.na(lag(coalesce(STATE_1, STATE_2))),
coalesce(STATE_1, STATE_2),
NA_character_)) %>%
fill(flag, .direction = "down")
ID STATE_1 STATE_2 flag
1 1 <NA> <NA> <NA>
2 1 FRY <NA> FRY
3 1 FRY CRY FRY
4 1 FRY CRY FRY
5 1 FRY CRY FRY
6 1 FRY CRY FRY
7 1 FRY CRY FRY
8 1 FRY CRY FRY
9 5 <NA> <NA> <NA>
10 5 FRY CRY FRY
11 5 FRY CRY FRY
12 5 FRY CRY FRY
Using data.table to flag the first (or last) record in a group
Here are a couple of solutions using data.table:
## Option 1 (cleaner solution, added 2016-11-29)
uDT <- unique(DT)
DT[, c("first","last"):=0L]
DT[uDT, first:=1L, mult="first"]
DT[uDT, last:=1L, mult="last"]
## Option 2 (original answer, retained for posterity)
DT <- cbind(DT, first=0L, last=0L)
DT[DT[unique(DT),,mult="first", which=TRUE], first:=1L]
DT[DT[unique(DT),,mult="last", which=TRUE], last:=1L]
head(DT)
# x y first last
# [1,] a A 1 1
# [2,] a B 1 1
# [3,] a C 1 0
# [4,] a C 0 1
# [5,] b A 1 1
# [6,] b B 1 1
There's obviously a lot packed into each of those lines. The key construct, though, is the following, which returns the row index of the first record in each group:
DT[unique(DT),,mult="first", which=TRUE]
# [1] 1 2 3 5 6 7 11 13 15
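Outside data.table, the same first/last flags can be sketched with base R's duplicated(), which accepts a data frame and has a fromLast argument. The data frame d below is assumed for illustration and reuses only the six x/y rows printed by head(DT) above.

```r
# The six rows shown by head(DT) above, as a plain data frame
d <- data.frame(x = c("a", "a", "a", "a", "b", "b"),
                y = c("A", "B", "C", "C", "A", "B"))

keys <- d[c("x", "y")]                                       # columns that define a group
d$first <- as.integer(!duplicated(keys))                     # 1 on the first row of each x/y group
d$last  <- as.integer(!duplicated(keys, fromLast = TRUE))    # 1 on the last row of each x/y group
d
```

Unlike the keyed data.table approach, this does not need the rows of a group to be adjacent; it simply marks the first and last occurrence of each x/y combination.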
How to flag first change in a variable value between years, per group?
A possible solution using dplyr; I'm not sure it's the cleanest way, though:
sample %>%
group_by(id) %>%
#find first year per group where code exists
mutate(first_year = min(year[code != ""])) %>%
#gather all codes from first year (does not assume code is constant within year)
mutate(first_codes = list(code[year==first_year])) %>%
#if year is not first year & code not in first year codes & code not blank
mutate(flag = as.numeric(year != first_year & !(code %in% unlist(first_codes)) & code != "")) %>%
#drop created columns
select(-first_year, -first_codes) %>%
ungroup()
output
# A tibble: 18 × 5
      id  year  type code   flag
   <int> <int> <int> <chr> <dbl>
 1     1  2010     1 abc       0
 2     1  2010     2 abc       0
 3     1  2011     1           0
 4     1  2011     2           0
 5     1  2012     1 xyz       1
 6     1  2012     2 xyz       1
 7     2  2010     1           0
 8     2  2010     2           0
 9     2  2011     1 lmn       0
10     2  2011     2           0
11     2  2012     1 efg       1
12     2  2012     2 efg       1
13     3  2010     1 def       0
14     3  2010     2 def       0
15     3  2011     1 klm       1
16     3  2011     2 klm       1
17     3  2012     1 nop       1
18     3  2012     2 nop       1
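If it helps to see the per-group logic without the pipe, here is a rough base R sketch of the same three steps. It assumes a data frame named sample_df (a made-up name, chosen to avoid masking base::sample) holding the id, year and code values of the first two ids from the output above.

```r
# id, year and code for ids 1 and 2, copied from the output above
sample_df <- data.frame(
  id   = rep(1:2, each = 6),
  year = rep(rep(2010:2012, each = 2), times = 2),
  code = c("abc", "abc", "", "", "xyz", "xyz",
           "", "", "lmn", "", "efg", "efg")
)

sample_df$flag <- 0
for (g in unique(sample_df$id)) {
  rows <- which(sample_df$id == g)
  yr <- sample_df$year[rows]
  cd <- sample_df$code[rows]
  first_year  <- min(yr[cd != ""])      # first year in which the group has a non-blank code
  first_codes <- cd[yr == first_year]   # every code seen in that year
  # flag rows from a later year whose non-blank code was not seen in the first year
  sample_df$flag[rows] <- as.numeric(yr != first_year & !(cd %in% first_codes) & cd != "")
}
sample_df
```

Like the dplyr version, this assumes every id has at least one non-blank code; otherwise min() is called on an empty vector and warns.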
How to flag the last row of a data frame group?
You can group_by ID and replace the Calculate value in the last row for each ID with 0.
library(dplyr)
df %>%
mutate(Calculate = Period * Value) %>%
group_by(ID) %>%
mutate(Calculate = replace(Calculate, n(), 0)) %>%
ungroup
# ID Period Value Calculate
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 10
#2 1 2 12 24
#3 1 3 11 0
#4 5 1 4 4
#5 5 2 6 0
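A base R alternative, sketched under the assumption that the last occurrence of each ID is the row to zero out, uses duplicated() with fromLast = TRUE. The data frame name df_last is made up for this example; its values are the ones shown in the output above.

```r
# Values from the output above
df_last <- data.frame(ID     = c(1, 1, 1, 5, 5),
                      Period = c(1, 2, 3, 1, 2),
                      Value  = c(10, 12, 11, 4, 6))

is_last <- !duplicated(df_last$ID, fromLast = TRUE)   # TRUE on the last row of each ID
df_last$Calculate <- df_last$Period * df_last$Value
df_last$Calculate[is_last] <- 0                       # zero out the last row per ID
df_last
```

The is_last vector can also be kept as its own 0/1 column with as.integer(is_last) if an explicit flag is wanted.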
Flagging the first time a record based on a parameter appears in an R data frame
We can group by 'C_ID' and 'Name', and create 'MainRecord' with case_when:
library(dplyr)
df1 %>%
group_by(C_ID, Name) %>%
mutate(MainRecord = case_when(row_number()==1 ~ "X", TRUE ~ ""))
# A tibble: 6 x 3
# Groups: C_ID, Name [3]
# C_ID Name MainRecord
# <int> <chr> <chr>
#1 1 JM X
#2 1 JM ""
#3 1 JM ""
#4 2 DM X
#5 3 TY X
#6 3 TY ""
Or another option is ifelse
df1 %>%
group_by(C_ID, Name) %>%
mutate(MainRecord = ifelse(row_number()==1, "X", ""))
Or use indexing
df1 %>%
group_by(C_ID, Name) %>%
mutate(MainRecord = c("", "X")[(row_number()==1) + 1])
Or with data.table, get the row index with .I and assign (:=) the 'X' value to the corresponding rows:
library(data.table)
i1 <- setDT(df1)[, .I[seq_len(.N) == 1], .(C_ID, Name)]$V1
df1[i1, MainRecord := "X"]
Or with base R
i1 <- with(df1, ave(seq_along(C_ID), C_ID, Name, FUN = seq_along)==1)
df1$MainRecord[i1] <- "X"