Merge Multiple Variables in R

How do I combine two data-frames based on two columns?

See the documentation on ?merge, which states:

By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x and by.y.

This means merge can join data frames on more than one column. From the final example given in the documentation:

x <- data.frame(k1=c(NA,NA,3,4,5), k2=c(1,NA,NA,4,5), data=1:5)
y <- data.frame(k1=c(NA,2,NA,4,5), k2=c(NA,NA,3,4,5), data=1:5)
merge(x, y, by=c("k1","k2")) # NA's match

This example is meant to demonstrate the use of incomparables, but it also illustrates merging on multiple columns. When the key columns are named differently in x and y, specify them separately with by.x and by.y.
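As a minimal sketch of the by.x/by.y case, assume two small data frames whose key columns have different names. The data frames left and right and all their column names are made up for illustration.

```r
# Hypothetical data: the same two keys are named id/yr on one side
# and key/year on the other.
left  <- data.frame(id  = c(1, 2, 3),
                    yr  = c(2020, 2020, 2021),
                    score = c(10, 20, 30))
right <- data.frame(key  = c(1, 2, 4),
                    year = c(2020, 2020, 2021),
                    label = c("a", "b", "c"))

# by.x and by.y are matched position by position: id with key, yr with year.
# The default is an inner join, so only rows present in both inputs are kept.
merged <- merge(left, right,
                by.x = c("id", "yr"),
                by.y = c("key", "year"))
print(merged)
```

The result keeps the key names from the first data frame (id and yr). Add all.x = TRUE or all.y = TRUE if unmatched rows should be kept.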

Combining multiple columns/variables into a single column

That is what dplyr::coalesce was made for. It returns the first non-missing value at each position:

library(dplyr)
df$v4 <- coalesce(!!!df)

# Also works:
df %>% 
  mutate(v4 = coalesce(v1, v2, v3))

Output:

   v1 v2 v3 v4
1   1 NA NA  1
2   3 NA NA  3
3   6 NA NA  6
4  NA  5 NA  5
5  NA  1 NA  1
6  NA  3 NA  3
7  NA NA  4  4
8  NA NA  2  2
9  NA NA  1  1
10 NA NA NA NA
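For a reproducible version, the sketch below rebuilds the input from the printed output above. The construction of df is an assumption, since the original input was not shown.

```r
library(dplyr)

# Assumed input, reconstructed from the v1..v3 columns of the printed output.
df <- data.frame(
  v1 = c(1, 3, 6, NA, NA, NA, NA, NA, NA, NA),
  v2 = c(NA, NA, NA, 5, 1, 3, NA, NA, NA, NA),
  v3 = c(NA, NA, NA, NA, NA, NA, 4, 2, 1, NA)
)

# coalesce() walks its arguments left to right and keeps, for each row,
# the first value that is not NA. Rows that are NA everywhere stay NA.
df$v4 <- coalesce(df$v1, df$v2, df$v3)
print(df)
```

Naming the columns explicitly, as here, avoids accidentally splicing an existing v4 column back in when the line is rerun.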

How to merge multiple variables and create a new data set?

You'll have to join those four tables, not combine them using c.

The join type is left_join so that all batsmen are included in the output. Those who didn't face any balls or hit any boundaries will have NA, which you can easily replace with 0.

I've omitted the by argument since dplyr will assume you want c("id", "inning", "batsman"), the only 3 columns common to all four data sets.

batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
  left_join(fours_hit) %>%
  left_join(sixes_hit) %>%
  select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
  replace(is.na(.), 0)

# A tibble: 11,335 x 7
      id inning batsman       total_batsman_runs deliveries_played fours_hit sixes_hit
   <int>  <int> <fct>                      <int>             <dbl>     <dbl>     <dbl>
 1     1      1 DA Warner                     14                 8         2         1
 2     1      1 S Dhawan                      40                31         5         0
 3     1      1 MC Henriques                  52                37         3         2
 4     1      1 Yuvraj Singh                  62                27         7         3
 5     1      1 DJ Hooda                      16                12         0         1
 6     1      1 BCJ Cutting                   16                 6         0         2
 7     1      2 CH Gayle                      32                21         2         3
 8     1      2 Mandeep Singh                 24                16         5         0
 9     1      2 TM Head                       30                22         3         0
10     1      2 KM Jadhav                     31                16         4         1
# ... with 11,325 more rows

There are also 2 batsmen who didn't face any delivery:

batsman_aggregate %>% filter(deliveries_played==0)
# A tibble: 2 x 7
     id inning batsman        total_batsman_runs deliveries_played fours_hit sixes_hit
  <int>  <int> <fct>                       <int>             <dbl>     <dbl>     <dbl>
1   482      2 MK Pandey                       0                 0         0         0
2  7907      1 MJ McClenaghan                  2                 0         0         0

One of whom apparently scored 2 runs! So I think the batsman_runs column has some errors. The game is here and clearly says that on the second-last delivery of the first innings, 2 wides were scored, not runs to the batsman.
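To make the join keys explicit rather than relying on dplyr's default, a minimal sketch on toy data could look like this. The data frames runs and balls and all their values are hypothetical stand-ins for the real tables.

```r
library(dplyr)

# Hypothetical miniature versions of two of the tables.
runs <- data.frame(id = c(1, 1, 2),
                   inning = c(1, 1, 1),
                   batsman = c("A", "B", "C"),
                   total_batsman_runs = c(14, 40, 0))
balls <- data.frame(id = c(1, 1),
                    inning = c(1, 1),
                    batsman = c("A", "B"),
                    deliveries_played = c(8, 31))

# Spell out the key columns instead of letting dplyr infer them.
keys <- c("id", "inning", "batsman")

toy_aggregate <- left_join(runs, balls, by = keys) %>%
  # Batsman "C" has no row in balls, so the join yields NA; turn it into 0.
  mutate(deliveries_played = coalesce(deliveries_played, 0))
print(toy_aggregate)
```

Passing by explicitly also silences the message dplyr prints when it picks the join columns itself.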

Merging data by 2 variables in R

You need two separate merges: one to attach the winning team's rank and one to attach the losing team's rank. Wrap the list of by variables in c(), and since the variables have different names in the two data frames, use by.x and by.y. Afterward you can rename the rank variables.

I'll call your data winlose and teamrank, respectively. Then you'd need:

first_merge <- merge(winlose, teamrank, by.x = c('Year', 'Winning_Tm'), by.y = c('Year', 'Team'))
second_merge <- merge(first_merge, teamrank, by.x = c('Year', 'Losing_Tm'), by.y = c('Year', 'Team'))

Renaming the variables:

names(second_merge)[names(second_merge) == 'Rank.x'] <- 'Winning_Tm_rank'
names(second_merge)[names(second_merge) == 'Rank.y'] <- 'Losing_Tm_rank'
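The whole sequence can be tried on toy data. The sketch below assumes winlose has the columns Year, Winning_Tm and Losing_Tm and teamrank has Year, Team and Rank, as the merge calls above imply. The team names and ranks are invented.

```r
# Hypothetical inputs with the column layout the answer assumes.
winlose <- data.frame(Year = c(2019, 2020),
                      Winning_Tm = c("Ants", "Bees"),
                      Losing_Tm = c("Bees", "Ants"))
teamrank <- data.frame(Year = c(2019, 2019, 2020, 2020),
                       Team = c("Ants", "Bees", "Ants", "Bees"),
                       Rank = c(1, 2, 4, 3))

# First merge attaches the winner's rank, second merge the loser's rank.
first_merge <- merge(winlose, teamrank,
                     by.x = c("Year", "Winning_Tm"), by.y = c("Year", "Team"))
second_merge <- merge(first_merge, teamrank,
                      by.x = c("Year", "Losing_Tm"), by.y = c("Year", "Team"))

# Both inputs to the second merge carry a Rank column, so merge() suffixes
# them: Rank.x comes from first_merge (winner), Rank.y from teamrank (loser).
names(second_merge)[names(second_merge) == "Rank.x"] <- "Winning_Tm_rank"
names(second_merge)[names(second_merge) == "Rank.y"] <- "Losing_Tm_rank"
print(second_merge)
```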

How can I combine several columns into one variable, tacking each onto the end of the other and grouping by values in an ID variable?

Set the cols and values_to arguments of pivot_longer() correctly. cols = ... selects the columns whose values you want to stack. values_to = ... names the new column that receives those values. I think you were on the right track. It's just that pivot_longer always returns the names of the source columns as well, unless you do something trickier, so drop that column afterwards with select().

library(tidyverse)

df <- data.frame(
  a = c("string1", "string2"),
  b = c("string11", "string12"),
  c = c("string21", "string22"),
  ID = c("1111", "2222")
)

df %>%
  pivot_longer(cols = names(df)[1:3],
               values_to = "newvar") %>%
  select(newvar, ID)

Output:

# A tibble: 6 x 2
  newvar   ID   
  <chr>    <chr>
1 string1  1111 
2 string11 1111 
3 string21 1111 
4 string2  2222 
5 string12 2222 
6 string22 2222
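As a possible alternative to dropping the name column with select(), recent tidyr versions document that names_to = NULL creates no name column at all. The sketch below assumes a tidyr release that supports this and reuses the example data. The result name long is made up.

```r
library(tidyr)

df <- data.frame(
  a = c("string1", "string2"),
  b = c("string11", "string12"),
  c = c("string21", "string22"),
  ID = c("1111", "2222")
)

# names_to = NULL asks pivot_longer() not to keep the source column names,
# so only ID and the stacked values remain.
long <- pivot_longer(df,
                     cols = c(a, b, c),
                     names_to = NULL,
                     values_to = "newvar")
print(long)
```

Here ID comes first in the result. Add select(newvar, ID) if the column order matters.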