Merge Multiple Variables in R

How do I combine two data frames based on two columns?

See the documentation on ?merge, which states:

By default the data frames are merged on the columns with names they both have, 
but separate specifications of the columns can be given by by.x and by.y.

This implies that merge can join data frames on more than one column. The final example in the documentation does exactly that:

x <- data.frame(k1=c(NA,NA,3,4,5), k2=c(1,NA,NA,4,5), data=1:5)
y <- data.frame(k1=c(NA,2,NA,4,5), k2=c(NA,NA,3,4,5), data=1:5)
merge(x, y, by=c("k1","k2")) # NA's match

This example was meant to demonstrate the use of incomparables, but it illustrates merging using multiple columns as well. You can also specify separate columns in each of x and y using by.x and by.y.
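As a minimal sketch of the by.x / by.y form mentioned above, the following uses two made-up data frames (orders and targets, with hypothetical column names) whose key columns are named differently. The columns are paired by position, so "cust_id" is matched against "customer" and "yr" against "year".

```r
# Hypothetical data: the same two keys, named differently in each frame.
orders <- data.frame(cust_id = c(1, 1, 2),
                     yr      = c(2020, 2021, 2020),
                     amount  = c(10, 20, 30))
targets <- data.frame(customer = c(1, 2, 2),
                      year     = c(2021, 2020, 2021),
                      target   = c(15, 25, 35))

# by.x lists the key columns of the first frame, by.y those of the second.
# The result keeps the key names from the first frame.
merged <- merge(orders, targets,
                by.x = c("cust_id", "yr"),
                by.y = c("customer", "year"))
merged
```

Only rows whose key pair appears in both frames are kept, because merge performs an inner join unless all, all.x or all.y is set.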

Combining multiple columns/variables into a single column

That is what dplyr::coalesce was made for:

library(dplyr)
df$v4 <- coalesce(!!!df)

# Also works:
df %>%
  mutate(v4 = coalesce(v1, v2, v3))

Output:

   v1 v2 v3 v4
1   1 NA NA  1
2   3 NA NA  3
3   6 NA NA  6
4  NA  5 NA  5
5  NA  1 NA  1
6  NA  3 NA  3
7  NA NA  4  4
8  NA NA  2  2
9  NA NA  1  1
10 NA NA NA NA
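The answer does not show how df was built, so here is a small reproducible sketch. The input values are reconstructed from the printed output above, which is an assumption about the original question's data.

```r
library(dplyr)

# Input reconstructed from the printed output (assumed, not from the
# original question).
df <- data.frame(
  v1 = c(1, 3, 6, NA, NA, NA, NA, NA, NA, NA),
  v2 = c(NA, NA, NA, 5, 1, 3, NA, NA, NA, NA),
  v3 = c(NA, NA, NA, NA, NA, NA, 4, 2, 1, NA)
)

# coalesce() takes, element by element, the first non-missing value
# among its arguments; a row that is NA everywhere stays NA.
df$v4 <- coalesce(df$v1, df$v2, df$v3)
df
```

Spelling out the columns, rather than splicing the whole data frame with !!!, keeps the call safe if df later gains columns that should not take part.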

How to merge multiple variables and create a new data set?

You'll have to join those four tables rather than combine them with c().

The join type is left_join so that all batsmen are included in the output. Those who didn't face any balls or hit any boundaries will have NA, but you can easily replace these with 0.

I've omitted the by argument since dplyr will default to c("id", "inning", "batsman"), the only three columns common to all four data sets.

batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
  left_join(fours_hit) %>%
  left_join(sixes_hit) %>%
  select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
  replace(is.na(.), 0)

# A tibble: 11,335 x 7
      id inning batsman       total_batsman_runs deliveries_played fours_hit sixes_hit
   <int>  <int> <fct>                      <int>             <dbl>     <dbl>     <dbl>
 1     1      1 DA Warner                     14                 8         2         1
 2     1      1 S Dhawan                      40                31         5         0
 3     1      1 MC Henriques                  52                37         3         2
 4     1      1 Yuvraj Singh                  62                27         7         3
 5     1      1 DJ Hooda                      16                12         0         1
 6     1      1 BCJ Cutting                   16                 6         0         2
 7     1      2 CH Gayle                      32                21         2         3
 8     1      2 Mandeep Singh                 24                16         5         0
 9     1      2 TM Head                       30                22         3         0
10     1      2 KM Jadhav                     31                16         4         1
# ... with 11,325 more rows

There are also 2 batsmen who didn't face any delivery:

batsman_aggregate %>% filter(deliveries_played == 0)
# A tibble: 2 x 7
     id inning batsman        total_batsman_runs deliveries_played fours_hit sixes_hit
  <int>  <int> <fct>                       <int>             <dbl>     <dbl>     <dbl>
1   482      2 MK Pandey                       0                 0         0         0
2  7907      1 MJ McClenaghan                  2                 0         0         0

One of them apparently scored 2 runs! So I think the batsman_runs column has some errors. The match record clearly shows that on the second-to-last delivery of the first innings, 2 wides were scored, not runs to the batsman.
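For reference, the join step can also be written with the key columns spelled out, which silences dplyr's "Joining, by = ..." message and guards against an accidental extra common column. The sketch below uses tiny made-up stand-ins for two of the tables (the real data is not shown in the answer) and fills the missing count with coalesce() rather than replace(is.na(.), 0).

```r
library(dplyr)

# Hypothetical miniature versions of two of the tables, for illustration only.
batsman_score_in_a_match <- data.frame(
  id = c(1, 1), inning = c(1, 1), batsman = c("A", "B"),
  total_batsman_runs = c(14, 0)
)
sixes_hit <- data.frame(id = 1, inning = 1, batsman = "A", sixes_hit = 1)

# The three key columns named in the answer.
keys <- c("id", "inning", "batsman")

toy <- batsman_score_in_a_match %>%
  left_join(sixes_hit, by = keys) %>%
  # Batsman "B" has no row in sixes_hit, so the join leaves NA; turn it into 0.
  mutate(sixes_hit = coalesce(sixes_hit, 0))
toy
```

Filling one named column at a time is more verbose than a blanket replace, but it makes it explicit which NA values are being treated as zero.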

Merging data by 2 variables in R

You need two separate merges. Wrap the list of by variables in c(), and since the key variables have different names in the two data sets, use by.x and by.y. Afterward you can rename the rank variables.

I'll call your data winlose and teamrank, respectively. Then you'd need:

first_merge <- merge(winlose, teamrank, by.x = c('Year', 'Winning_Tm'), by.y = c('Year', 'Team'))
second_merge <- merge(first_merge, teamrank, by.x = c('Year', 'Losing_Tm'), by.y = c('Year', 'Team'))

Renaming the variables:

names(second_merge)[names(second_merge) == 'Rank.x'] <- 'Winning_Tm_rank'
names(second_merge)[names(second_merge) == 'Rank.y'] <- 'Losing_Tm_rank'
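As a possible shortcut, merge has a suffixes argument that controls how clashing non-key column names are disambiguated, so the second merge can label the two Rank columns directly. The sketch below runs the two-step merge on made-up data in the shape the answer assumes; the values and the "_win" / "_lose" suffixes are illustrative choices, not part of the original answer.

```r
# Hypothetical data in the shape the answer assumes.
winlose <- data.frame(
  Year       = c(2020, 2021),
  Winning_Tm = c("A", "B"),
  Losing_Tm  = c("B", "C")
)
teamrank <- data.frame(
  Year = c(2020, 2020, 2021, 2021),
  Team = c("A", "B", "B", "C"),
  Rank = c(1, 2, 1, 3)
)

# First merge attaches the winning team's rank as a plain "Rank" column.
first_merge <- merge(winlose, teamrank,
                     by.x = c("Year", "Winning_Tm"),
                     by.y = c("Year", "Team"))

# Second merge brings in another "Rank"; suffixes names the pair directly
# instead of the default Rank.x / Rank.y.
second_merge <- merge(first_merge, teamrank,
                      by.x = c("Year", "Losing_Tm"),
                      by.y = c("Year", "Team"),
                      suffixes = c("_win", "_lose"))
names(second_merge)
```

The suffix is only applied when names actually clash, which is why it belongs on the second merge and has no effect on the first.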

How can I combine several columns into one variable, tacking each onto the end of the other and grouping by values in an ID variable?

Set the cols and values_to arguments of pivot_longer() correctly. cols = ... selects the columns whose values you want to stack. values_to = ... names the new column that receives those values. Your attempt was actually close: by default pivot_longer() also returns a column holding the names of the source columns, so you only need to drop that column afterwards.

library(tidyverse)

df <- data.frame(
  a = c("string1", "string2"),
  b = c("string11", "string12"),
  c = c("string21", "string22"),
  ID = c("1111", "2222")
)

df %>%
  pivot_longer(cols = names(df)[1:3],
               values_to = "newvar") %>%
  select(newvar, ID)

Output:

# A tibble: 6 x 2
  newvar   ID
  <chr>    <chr>
1 string1  1111
2 string11 1111
3 string21 1111
4 string2  2222
5 string12 2222
6 string22 2222
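If the names column is never needed, the select() step can probably be avoided. The sketch below assumes a tidyr version recent enough to accept names_to = NULL, which its documentation describes as creating no names column; on older versions the select() approach above remains the safe choice. Note that the column order differs: ID comes first here.

```r
library(tidyr)

df <- data.frame(
  a = c("string1", "string2"),
  b = c("string11", "string12"),
  c = c("string21", "string22"),
  ID = c("1111", "2222")
)

# cols = -ID stacks every column except ID;
# names_to = NULL suppresses the column of source-column names
# (assumes a tidyr version that supports it).
long <- pivot_longer(df, cols = -ID, names_to = NULL, values_to = "newvar")
long
```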

