Replace a Subset of a Data Frame with Dplyr Join Operations

Replace a subset of a data frame with dplyr join operations

What you describe is a join operation in which you update some values in the original dataset. This is easy and fast with data.table, thanks to its fast joins and its update-by-reference operator (:=).

Here's an example for your toy data:

library(data.table)
setDT(df) # convert to data.table without copy
setDT(sub_df) # convert to data.table without copy

# join and update "df" by reference, i.e. without copy
df[sub_df, on = c("id", "animal"), weight := i.weight]

The data is now updated:

#   id animal weight
#1: 1 dog 23.0
#2: 2 cat 2.2
#3: 3 duck 1.2
#4: 4 fairy 0.2
#5: 5 snake 1.3

You can use setDF() to switch back to an ordinary data.frame.
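If you would rather stay in dplyr, a similar update can be sketched with rows_update() (available in dplyr 1.0.0 and later). The toy data below is hypothetical, since the original df and sub_df are not shown here; only the column names id, animal and weight are taken from the example above.

```r
library(dplyr)

# Hypothetical toy data (assumed for illustration, not the poster's data)
df <- data.frame(id = 1:3,
                 animal = c("dog", "cat", "duck"),
                 weight = c(20, 2.2, 1.2))
sub_df <- data.frame(id = 1L, animal = "dog", weight = 23)

# Overwrite the non-key columns of df for rows whose keys match sub_df
updated <- rows_update(df, sub_df, by = c("id", "animal"))
updated
```

Unlike the data.table version, this returns a modified copy and leaves df untouched.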

dplyr mutate/replace several columns on a subset of rows

These solutions (1) maintain the pipeline, (2) do not overwrite the input and (3) only require that the condition be specified once:

1a) mutate_cond. Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition:

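library(dplyr)

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  # evaluate the condition using the columns of .data
  condition <- eval(substitute(condition), .data, envir)
  # apply mutate() only to the rows where the condition holds
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}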

DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)

1b) mutate_last. This is an alternative function for data frames or data tables which again is like mutate but is only used within group_by (as in the example below) and only operates on the last group rather than every group. Note that TRUE > FALSE, so if group_by specifies a condition then mutate_last will only operate on rows satisfying that condition.

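mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  # dplyr >= 0.8: group_rows() gives the 1-based row indices of each group
  # (older versions used attr(.data, "indices")[[n]] + 1 instead)
  indices <- group_rows(.data)[[n]]
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}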

DF %>%
  group_by(is.exit = measure == 'exit') %>%
  mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
  ungroup() %>%
  select(-is.exit)

2) factor out condition. Factor out the condition by making it an extra column which is later removed. Then use ifelse, replace or arithmetic with logicals as illustrated. This also works for data tables.

library(dplyr)

DF %>%
  mutate(is.exit = measure == 'exit',
         qty.exit = ifelse(is.exit, qty, qty.exit),
         cf = (!is.exit) * cf,
         delta.watts = replace(delta.watts, is.exit, 13)) %>%
  select(-is.exit)

3) sqldf. We could use SQL update via the sqldf package in the pipeline for data frames (but not for data tables unless we convert them; this may represent a bug in dplyr, see dplyr issue 1579). It may seem that this code undesirably modifies the input because of the update, but the update actually acts on a copy of the input in the temporarily generated database, not on the input itself.

library(sqldf)

DF %>%
do(sqldf(c("update '.'
set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13
where measure = 'exit'",
"select * from '.'")))

4) row_case_when. Also check out row_case_when, defined in "Returning a tibble: how to vectorize with case_when?". It uses a syntax similar to case_when but applies to rows.

library(dplyr)

DF %>%
  row_case_when(
    measure == "exit" ~ data.frame(qty.exit = qty, cf = 0, delta.watts = 13),
    TRUE ~ data.frame(qty.exit, cf, delta.watts)
  )

Note 1: We used this as DF

set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace = TRUE),
                 space = sample(1:4, 50, replace = TRUE),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                  replace = TRUE),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace = TRUE),
                 cf = runif(50))

Note 2: The problem of how to easily specify updating a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573 with 631 being the main thread and 1573 being a review of the answers here.

Partially replace grouped data in one dataframe with data from other dataframe

One option is to add a row number to both data frames and match on File plus that row number with rows_update():

df1 %>%
  group_by(File) %>%
  mutate(rn = row_number()) %>%
  rows_update(df2 %>% mutate(rn = row_number()), by = c("File", "rn")) %>%
  select(-rn)
# A tibble: 6 × 5
# Groups: File [4]
Utt File x y z
<chr> <chr> <int> <chr> <dbl>
1 xyzxyz F01 1 A 0.451
2 hi there F02 2 B 1.66
3 how are you? F02 3 C 0.505
4 xxxxx F03 4 D 0.757
5 yyzyzyz F03 5 E 1.28
6 hybsfc F12 6 F 0.226
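To see how the helper row number works as a key, here is a small self-contained sketch. The data is hypothetical (only the column names File and Utt come from the output above), and this variant numbers rows within each File in both data frames, which is an assumption about what is wanted when the replacement data spans several files.

```r
library(dplyr)

# Hypothetical data, assumed for illustration
df1 <- tibble(File = c("F01", "F02", "F02", "F03"),
              Utt  = c("a", "b", "c", "d"))
df2 <- tibble(File = c("F02", "F02"),
              Utt  = c("hi there", "how are you?"))

# Number the rows within each File so that (File, rn) identifies one row
add_rn <- function(d) {
  d %>% group_by(File) %>% mutate(rn = row_number()) %>% ungroup()
}

result <- add_rn(df1) %>%
  rows_update(add_rn(df2), by = c("File", "rn")) %>%
  select(-rn)
result
```

Only the two F02 rows are replaced; rows for files that do not appear in df2 are left as they were.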

dplyr - use join as vlookup to capture some value from different data frame

If I understand correctly, you want to replace price and sale_price in df_b with PRECIO_LISTA and PRECIO_INDEXADO in df_a, joining only by id = SKU.

You can first discard the current price and sale_price from df_b using select() and then join with df_a, keeping only the variables you are interested in. You can also use distinct() to avoid duplicates.

Using the pipes:

library(dplyr)
desired_result <-
  df_b %>%
  select(-price, -sale_price) %>%
  left_join(
    df_a %>%
      transmute(id = SKU, price = PRECIO_LISTA, sale_price = PRECIO_INDEXADO) %>%
      distinct(),
    by = "id"  # state the key explicitly; it is the only shared column
  )
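A small sketch of why distinct() matters here, using hypothetical values (only the column names come from the question): if df_a repeats a SKU, the join would otherwise duplicate the matching rows of df_b.

```r
library(dplyr)

# Hypothetical data, assumed for illustration; SKU "A1" appears twice in df_a
df_a <- data.frame(SKU = c("A1", "A1", "B2"),
                   PRECIO_LISTA = c(100, 100, 250),
                   PRECIO_INDEXADO = c(90, 90, 200))
df_b <- data.frame(id = c("A1", "B2", "C3"),
                   price = c(1, 2, 3),
                   sale_price = c(1, 2, 3))

# Build a lookup table with one row per id
lookup <- df_a %>%
  transmute(id = SKU, price = PRECIO_LISTA, sale_price = PRECIO_INDEXADO) %>%
  distinct()

result <- df_b %>%
  select(-price, -sale_price) %>%
  left_join(lookup, by = "id")
result
```

The result keeps one row per id in df_b; ids without a match in df_a (here "C3") get NA prices.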

How to join (merge) data frames (inner, outer, left, right)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
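As a quick illustration of how the join types differ, the following sketch counts the rows each call returns. The two data frames are hypothetical stand-ins for df1 and df2; only the CustomerId column name is taken from the text above.

```r
# Hypothetical data, assumed for illustration
df1 <- data.frame(CustomerId = 1:4,
                  Product = c("Toaster", "Radio", "Kettle", "Lamp"))
df2 <- data.frame(CustomerId = c(2L, 4L, 6L),
                  State = c("Alabama", "Ohio", "Texas"))

inner <- merge(df1, df2, by = "CustomerId")                # ids in both
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # all ids of df1
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # all ids of df2
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # ids in either
cross <- merge(df1, df2, by = NULL)                        # every combination

counts <- sapply(list(inner = inner, left = left, right = right,
                      outer = outer, cross = cross), nrow)
counts
# inner = 2, left = 4, right = 3, outer = 5, cross = 12
```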

Replacing multiple observations from one column with values from another column in R

You can left_join the two data frames and then use an if_else statement to get the right locx and locy. Try out:

library(dplyr)
df %>%
  left_join(df2, by = "squirrel_id") %>%
  # as suggested by @echasnovski, we can also use
  # locx = coalesce(new_locx, locx) and locy = coalesce(new_locy, locy)
  mutate(locx = if_else(is.na(new_locx), locx, new_locx),
         locy = if_else(is.na(new_locy), locy, new_locy)) %>%
  select(-new_locx, -new_locy)
# output
squirrel_id locx locy dist
1 6391 12.5 15.5 50.0
2 6391 12.5 15.5 20.0
3 6391 12.5 15.5 15.5
4 8443 2.5 80.0 800.0
5 6025 -55.0 0.0 0.0
6 5000 18.5 12.5 10.0 # squirrel_id 5000 was created as an example of an id
# present in df but not in df2

Data

df <- structure(list(squirrel_id = c(6391L, 6391L, 6391L, 8443L, 6025L, 
5000L), locx = c(17.5, 17.5, 17.5, 20.5, -5, 18.5), locy = c(10,
10, 10, 1, -0.5, 12.5), dist = c(50, 20, 15.5, 800, 0, 10)), class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(squirrel_id = c(6391L, 8443L, 6025L), new_locx = c(12.5,
2.5, -55), new_locy = c(15.5, 80, 0)), class = "data.frame", row.names = c(NA,
-3L))
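The coalesce() variant mentioned in the code comments can be sketched as follows. This uses a cut-down, hypothetical version of the data above so that the block runs on its own.

```r
library(dplyr)

# Cut-down data, assumed for illustration
df <- data.frame(squirrel_id = c(6391L, 6391L, 5000L),
                 locx = c(17.5, 17.5, 18.5),
                 locy = c(10, 10, 12.5))
df2 <- data.frame(squirrel_id = 6391L, new_locx = 12.5, new_locy = 15.5)

result <- df %>%
  left_join(df2, by = "squirrel_id") %>%
  # coalesce() takes the first non-missing value, so the new location wins
  # when there is one and the old location is kept otherwise
  mutate(locx = coalesce(new_locx, locx),
         locy = coalesce(new_locy, locy)) %>%
  select(-new_locx, -new_locy)
result
```

The row for squirrel_id 5000 has no match in df2, so it keeps its original locx and locy.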

Update/Replace Values in Dataframe with Tidyverse Join

Picking up Alistaire's and Nettle's suggestions and turning them into a working solution:

df1 %>% 
left_join(lookup_df, by = "state_abbrev") %>%
mutate(state_name = coalesce(state_name.x, state_name.y)) %>%
select(-state_name.x, -state_name.y)
# A tibble: 10 x 3
state_abbrev value state_name
<chr> <int> <chr>
1 AL 671 Alabama
2 AK 501 Alaska
3 AZ 1030 Arizona
4 AR 694 Arkansas
5 CA 881 California
6 CO 821 Colorado
7 CT 742 Connecticut
8 DE 665 Delaware
9 FL 948 Florida
10 GA 790 Georgia

The OP stated a preference for a "tidyverse" solution. However, update joins are already available in the data.table package:

library(data.table)
setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
df1
    state_abbrev  state_name value
1: AL Alabama 1103
2: AK Alaska 1036
3: AZ Arizona 811
4: AR Arkansas 604
5: CA California 868
6: CO Colorado 1129
7: CT Connecticut 819
8: DE Delaware 1194
9: FL Florida 888
10: GA Georgia 501

Benchmark

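library(bench)
library(data.table)  # data.table(), copy(), setDT()
library(dplyr)       # as_tibble(), left_join(), mutate(), coalesce()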
bm <- press(
na_share = c(0.1, 0.5, 0.9),
n_row = length(state.abb) * 2 * c(1, 100, 10000),
{
n_na <- na_share * length(state.abb)
set.seed(1)
na_idx <- sample(length(state.abb), n_na)
tmp <- data.table(state_abbrev = state.abb, state_name = state.name)
lookup_df <-tmp[na_idx]
tmp[na_idx, state_name := NA]
df0 <- as_tibble(tmp[sample(length(state.abb), n_row, TRUE)])
mark(
dplyr = {
df1 <- copy(df0)
df1 <- df1 %>%
left_join(lookup_df, by = "state_abbrev") %>%
mutate(state_name = coalesce(state_name.x, state_name.y)) %>%
select(-state_name.x, -state_name.y)
df1
},
upd_join = {
df1 <- copy(df0)
setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
df1
}
)
}
)
ggplot2::autoplot(bm)

[Figure: benchmark plot produced by ggplot2::autoplot(bm)]

data.table's update join is always faster (note the log time scale).

As the update join modifies the data object, a fresh copy is used for each benchmark run.

Conditional Left Join in dplyr using subset

As far as I could tell there were three issues with your code: you used subset() twice, referenced new_df in your dplyr pipe which is unnecessary and will cause issues, and your dates were not in R's default format.

The code below should work for you and hopefully clarify where your errors were.

new_df <- left_join(df2, df1, by = "ID") %>%
  # column names containing spaces must be wrapped in backticks
  subset(as.Date(`Start Date`, "%m-%d-%Y") > as.Date(START_DT, "%m-%d-%Y") &
           as.Date(`End Date`, "%m-%d-%Y") <= as.Date(END_DT, "%m-%d-%Y"))

Another alternative is to use filter():

new_df <- left_join(df2, df1, by = "ID") %>%
  filter(as.Date(`Start Date`, "%m-%d-%Y") > as.Date(START_DT, "%m-%d-%Y") &
           as.Date(`End Date`, "%m-%d-%Y") <= as.Date(END_DT, "%m-%d-%Y"))
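For a runnable version, here is a minimal sketch with hypothetical data (the column names are taken from the code above, the values are invented). The small to_date() helper is not part of the original answer; it only avoids repeating the date format.

```r
library(dplyr)

# Hypothetical data, assumed for illustration; dates are month-day-year strings
df1 <- data.frame(ID = c(1, 2),
                  START_DT = c("01-01-2020", "06-01-2020"),
                  END_DT = c("12-31-2020", "12-31-2020"))
df2 <- data.frame(ID = c(1, 2),
                  `Start Date` = c("03-15-2020", "02-01-2020"),
                  `End Date` = c("04-15-2020", "03-01-2020"),
                  check.names = FALSE)  # keep the spaces in the column names

# Hypothetical helper: parse month-day-year strings into Date objects
to_date <- function(x) as.Date(x, format = "%m-%d-%Y")

new_df <- left_join(df2, df1, by = "ID") %>%
  filter(to_date(`Start Date`) > to_date(START_DT),
         to_date(`End Date`) <= to_date(END_DT))
new_df
```

Only ID 1 survives the filter, because the start date for ID 2 falls before its START_DT.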

I hope this helps!

Is there a simple (dplyr-friendly) method to mask a subset of subject IDs in a data frame?

You can do this in fewer steps, but I find something like this clearer and easier to follow.

The key step is that I would split the ID into two fields: STUDY and ID. You can then search for the STUDY numbers that you want to mask and replace their value.

Then just paste the fields back together (or don't), ungroup, and select the columns you want.

library(dplyr)
library(tidyr)

studies_to_mask <- c("STUDY2")

df %>%
  separate(ID, c("STUDY", "ID"), sep = "_") %>%
  group_by(STUDY) %>%
  mutate(ID_MASK = if_else(STUDY %in% studies_to_mask, paste0("MASK", 1:n()), ID),
         ID = paste(STUDY, ID_MASK, sep = "_")) %>%
  ungroup() %>%
  select(ID, DV)

Results in:

# A tibble: 9 x 2
ID DV
<chr> <dbl>
1 STUDY1_ID1001 10
2 STUDY1_ID1002 15
3 STUDY1_ID1003 20
4 STUDY2_MASK1 25
5 STUDY2_MASK2 30
6 STUDY2_MASK3 35
7 STUDY3_ID3002 40
8 STUDY3_ID3003 45
9 STUDY3_ID3007 50

