Replace a Subset of a Data Frame with Dplyr Join Operations

Replace a subset of a data frame with dplyr join operations

What you describe is a join operation in which you update some values in the original dataset. This is very easy to do with great performance using data.table because of its fast joins and update-by-reference concept (:=).

Here's an example for your toy data:

library(data.table)
setDT(df)             # convert to data.table without copy
setDT(sub_df)         # convert to data.table without copy

# join and update "df" by reference, i.e. without copy 
df[sub_df, on = c("id", "animal"), weight := i.weight]

The data is now updated:

#   id animal weight
#1:  1    dog   23.0
#2:  2    cat    2.2
#3:  3   duck    1.2
#4:  4  fairy    0.2
#5:  5  snake    1.3

You can use setDF to switch back to ordinary data.frame.

dplyr mutate/replace several columns on a subset of rows

These solutions (1) maintain the pipeline, (2) do not overwrite the input and (3) only require that the condition be specified once:

1a) mutate_cond Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition:

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)

1b) mutate_last This is an alternative function for data frames or data tables which again is like mutate but is only used within group_by (as in the example below) and only operates on the last group rather than every group. Note that TRUE > FALSE so if group_by specifies a condition then mutate_last will only operate on rows satisfying that condition.

mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  indices <- attr(.data, "indices")[[n]] + 1
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}

DF %>% 
   group_by(is.exit = measure == 'exit') %>%
   mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
   ungroup() %>%
   select(-is.exit)

2) factor out condition Factor out the condition by making it an extra column which is later removed. Then use ifelse, replace or arithmetic with logicals as illustrated. This also works for data tables.

library(dplyr)

DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
       select(-is.exit)

3) sqldf We could use SQL update via the sqldf package in the pipeline for data frames (but not data tables unless we convert them -- this may represent a bug in dplyr. See dplyr issue 1579). It may seem that we are undesirably modifying the input in this code due to the existence of the update but in fact the update is acting on a copy of the input in the temporarily generated database and not on the actual input.

library(sqldf)

DF %>% 
   do(sqldf(c("update '.' 
                 set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13 
                 where measure = 'exit'", 
              "select * from '.'")))

4) row_case_when Also check out row_case_when defined in
Returning a tibble: how to vectorize with case_when? . It uses a syntax similar to case_when but applies to rows.

library(dplyr)

DF %>%
  row_case_when(
    measure == "exit" ~ data.frame(qty.exit = qty, cf = 0, delta.watts = 13),
    TRUE ~ data.frame(qty.exit, cf, delta.watts)
  )

Note 1: We used this as DF

set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

Note 2: The problem of how to easily specify updating a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573 with 631 being the main thread and 1573 being a review of the answers here.

Partially replace grouped data in one dataframe with data from other dataframe

What do you think of this method?:

df1 %>% 
  group_by(File) %>% 
  mutate(rn = row_number()) %>% 
  rows_update(df2 %>% mutate(rn = row_number()), by = c("File", "rn")) %>% 
  select(-rn)
# A tibble: 6 × 5
# Groups:   File [4]
  Utt          File      x y         z
  <chr>        <chr> <int> <chr> <dbl>
1 xyzxyz       F01       1 A     0.451
2 hi there     F02       2 B     1.66 
3 how are you? F02       3 C     0.505
4 xxxxx        F03       4 D     0.757
5 yyzyzyz      F03       5 E     1.28 
6 hybsfc       F12       6 F     0.226

dplyr - use join as vlookup to capture some value from different data frame

If I understand correctly, you want to replace price and sale_price in df_b with PRECIO_LISTA and PRECIO_INDEXADO in df_a, joining only by id = SKU.

You can first discard the current price and sale_price from df_b using select() and then join with df_a keeping only the variables you are interested into. Moreover you can use distinct() to avoid duplicates.

Using the pipes:

library(dplyr)
desired_result = 
  df_b %>% select(-price,-sale_price) %>% 
  left_join(
    df_a %>%
      transmute(id = SKU, price = PRECIO_LISTA, sale_price = PRECIO_INDEXADO) %>%
      distinct()
  )

How to join (merge) data frames (inner, outer, left, right)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

~~Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable.~~ I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)

Replacing multiple observations from one column with values from another column in R

You can left_join the two data frames and then use an if_else statement to get the right locx and locy. Try out:

library(dplyr)
df %>% left_join(df2, by = "squirrel_id") %>%
        mutate(locx = if_else(is.na(new_locx), locx, new_locx), # as suggested by @echasnovski, we can also use locx = coalesce(new_locx, locx)
               locy = if_else(is.na(new_locy), locy, new_locy)) %>% # or locy = coalesce(new_locy, locy)
        select(-new_locx, -new_locy)
# output
  squirrel_id  locx locy  dist
1        6391  12.5 15.5  50.0
2        6391  12.5 15.5  20.0
3        6391  12.5 15.5  15.5
4        8443   2.5 80.0 800.0
5        6025 -55.0  0.0   0.0
6        5000  18.5 18.5  10.0 # squirrel_id 5000 was created for an example of id 
# present if df but not in df2

Data

df <- structure(list(squirrel_id = c(6391L, 6391L, 6391L, 8443L, 6025L, 
5000L), locx = c(17.5, 17.5, 17.5, 20.5, -5, 18.5), locy = c(10, 
10, 10, 1, -0.5, 12.5), dist = c(50, 20, 15.5, 800, 0, 10)), class = "data.frame", row.names = c(NA, 
-6L))
df2 <- structure(list(squirrel_id = c(6391L, 8443L, 6025L), new_locx = c(12.5, 
2.5, -55), new_locy = c(15.5, 80, 0)), class = "data.frame", row.names = c(NA, 
-3L))

Update/Replace Values in Dataframe with Tidyverse Join

Picking up Alistaire's and Nettle's suggestions and transforming into a working solution

df1 %>% 
  left_join(lookup_df, by = "state_abbrev") %>% 
  mutate(state_name = coalesce(state_name.x, state_name.y)) %>% 
  select(-state_name.x, -state_name.y)

# A tibble: 10 x 3
   state_abbrev value state_name 
   <chr>        <int> <chr>      
 1 AL             671 Alabama    
 2 AK             501 Alaska     
 3 AZ            1030 Arizona    
 4 AR             694 Arkansas   
 5 CA             881 California 
 6 CO             821 Colorado   
 7 CT             742 Connecticut
 8 DE             665 Delaware   
 9 FL             948 Florida    
10 GA             790 Georgia

The OP has stated to prefer a "tidyverse" solution. However, update joins are already available with the data.table package:

library(data.table)
setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
df1

    state_abbrev  state_name value
 1:           AL     Alabama  1103
 2:           AK      Alaska  1036
 3:           AZ     Arizona   811
 4:           AR    Arkansas   604
 5:           CA  California   868
 6:           CO    Colorado  1129
 7:           CT Connecticut   819
 8:           DE    Delaware  1194
 9:           FL     Florida   888
10:           GA     Georgia   501

Benchmark

library(bench)
bm <- press(
  na_share = c(0.1, 0.5, 0.9),
  n_row = length(state.abb) * 2 * c(1, 100, 10000),
  {
    n_na <- na_share * length(state.abb)
    set.seed(1)
    na_idx <- sample(length(state.abb), n_na)
    tmp <- data.table(state_abbrev = state.abb, state_name = state.name)
    lookup_df <-tmp[na_idx] 
    tmp[na_idx, state_name := NA]
    df0 <- as_tibble(tmp[sample(length(state.abb), n_row, TRUE)])
    mark(
      dplyr = {
        df1 <- copy(df0)
        df1 <- df1 %>% 
          left_join(lookup_df, by = "state_abbrev") %>% 
          mutate(state_name = coalesce(state_name.x, state_name.y)) %>% 
          select(-state_name.x, -state_name.y)
        df1
      },
      upd_join = {
        df1 <- copy(df0)
        setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
        df1
      }
    )
  }
)
ggplot2::autoplot(bm)

Sample Image

data.table's upate join is always faster (note the log time scale).

As the update join modifies the data object, a fresh copy is used for each benchmark run.

Conditional Left Join in dplyr using subset

As far as I could tell there were three issues with your code: you used subset() twice, referenced new_df in your dplyr pipe which is unnecessary and will cause issues, and your dates were not in R's default format.

The code below should work for you and hopefully clarify where your errors were.

new_df <- left_join(df2, df1, by = "ID") %>%
  subset(as.Date(Start Date, "%m-%d-%Y") > as.Date(START_DT, "%m-%d-%Y") & as.Date(End Date, "%m-%d-%Y") <= as.Date(END_DT, "%m-%d-%Y"))

Another alternative is to use filter():

new_df <- left_join(df2, df1, by = "ID") %>%
  filter(as.Date(Start Date, "%m-%d-%Y") > as.Date(START_DT, "%m-%d-%Y") & as.Date(End Date, "%m-%d-%Y") <= as.Date(END_DT, "%m-%d-%Y"))

I hope this helps!

Is there a simple (dplyr-friendly) method to mask a subset of subject IDs in a data frame?

You can do this in fewer steps, but I find something like this more clear and easy to follow.

The key step is that I would split the ID into two fields: STUDY and ID. You can then search for the STUDY numbers that you want to mask and replace their value.

Then just paste the fields back together (or don't), ungroup, and select the columns you want.

library(dplyr)
library(tidyr)

studies_to_mask <- c("STUDY2")

df %>% 
  separate(ID, c("STUDY", "ID"), sep = "_") %>% 
  group_by(STUDY) %>% 
  mutate(ID_MASK = if_else(STUDY %in% studies_to_mask, paste0("MASK", 1:n()), ID),
         ID = paste(STUDY, ID_MASK, sep = "_")) %>% 
  ungroup() %>% 
  select(ID, DV)

Results in:

# A tibble: 9 x 2
  ID               DV
  <chr>         <dbl>
1 STUDY1_ID1001    10
2 STUDY1_ID1002    15
3 STUDY1_ID1003    20
4 STUDY2_MASK1     25
5 STUDY2_MASK2     30
6 STUDY2_MASK3     35
7 STUDY3_ID3002    40
8 STUDY3_ID3003    45
9 STUDY3_ID3007    50

Replace a Subset of a Data Frame with Dplyr Join Operations