How do I combine two data-frames based on two columns?
See the documentation on ?merge, which states:
By default the data frames are merged on the columns with names they both have,
but separate specifications of the columns can be given by by.x and by.y.
This clearly implies that merge will merge data frames based on more than one column. From the final example given in the documentation:
x <- data.frame(k1=c(NA,NA,3,4,5), k2=c(1,NA,NA,4,5), data=1:5)
y <- data.frame(k1=c(NA,2,NA,4,5), k2=c(NA,NA,3,4,5), data=1:5)
merge(x, y, by=c("k1","k2")) # NA's match
This example was meant to demonstrate the use of incomparables, but it illustrates merging using multiple columns as well. You can also specify separate columns in each of x and y using by.x and by.y.
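To illustrate by.x and by.y with more than one key column, here is a minimal sketch. The data frames orders and targets and all of their column names are made up for this example; only merge and its by.x/by.y arguments come from the documentation quoted above.

```r
# Hypothetical data: the key columns are named differently in each frame
orders <- data.frame(cust_id = c(1, 1, 2),
                     yr      = c(2020, 2021, 2020),
                     amount  = c(10, 20, 30))
targets <- data.frame(customer = c(1, 2),
                      year     = c(2021, 2020),
                      target   = c(15, 25))

# Keys are paired by position: cust_id with customer, yr with year
merged <- merge(orders, targets,
                by.x = c("cust_id", "yr"),
                by.y = c("customer", "year"))
print(merged)
```

The key columns in the result keep the names they had in x (here cust_id and yr).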
Combining multiple columns/variables into a single column
That is what dplyr::coalesce was made for:
library(dplyr)
df$v4 <- coalesce(!!!df)
# Also works:
df %>%
  mutate(v4 = coalesce(v1, v2, v3))
Output:
v1 v2 v3 v4
1 1 NA NA 1
2 3 NA NA 3
3 6 NA NA 6
4 NA 5 NA 5
5 NA 1 NA 1
6 NA 3 NA 3
7 NA NA 4 4
8 NA NA 2 2
9 NA NA 1 1
10 NA NA NA NA
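The answer does not show how df was built. As a self-contained sketch, the data frame below is reconstructed from the output above (that reconstruction is an assumption), and coalesce is called with the columns spelled out.

```r
library(dplyr)

# Reconstructed from the printed output; the original df is not shown
df <- data.frame(
  v1 = c(1, 3, 6, NA, NA, NA, NA, NA, NA, NA),
  v2 = c(NA, NA, NA, 5, 1, 3, NA, NA, NA, NA),
  v3 = c(NA, NA, NA, NA, NA, NA, 4, 2, 1, NA)
)

# coalesce() takes, for each row, the first non-missing value
# among its arguments, in the order given
df$v4 <- coalesce(df$v1, df$v2, df$v3)
print(df)
```

The last row stays NA because all three source columns are missing there.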
How to merge multiple variables and create a new data set?
You'll have to join those four tables, not combine them using c.
The join type should be left_join so that all batsmen are included in the output. Those who didn't face any balls or hit any boundaries will have NA, but you can easily replace these with 0.
I've omitted the by argument, since dplyr will assume you want c("id", "inning", "batsman"), the only three columns common to all four data sets.
batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
  left_join(fours_hit) %>%
  left_join(sixes_hit) %>%
  select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
  replace(is.na(.), 0)
# A tibble: 11,335 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 1 1 DA Warner 14 8 2 1
2 1 1 S Dhawan 40 31 5 0
3 1 1 MC Henriques 52 37 3 2
4 1 1 Yuvraj Singh 62 27 7 3
5 1 1 DJ Hooda 16 12 0 1
6 1 1 BCJ Cutting 16 6 0 2
7 1 2 CH Gayle 32 21 2 3
8 1 2 Mandeep Singh 24 16 5 0
9 1 2 TM Head 30 22 3 0
10 1 2 KM Jadhav 31 16 4 1
# ... with 11,325 more rows
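If you would rather not rely on dplyr picking the join columns for you, the keys can be spelled out. The sketch below uses two tiny made-up tables (scores and fours, with invented values) that stand in for the real ones; only the column names id, inning, batsman, total_batsman_runs and fours_hit are taken from the output above.

```r
library(dplyr)

# Hypothetical miniature stand-ins for two of the four tables
scores <- tibble(id = c(1L, 1L), inning = c(1L, 1L),
                 batsman = c("A", "B"),
                 total_batsman_runs = c(14L, 0L))
fours <- tibble(id = 1L, inning = 1L, batsman = "A", fours_hit = 2)

# Explicit keys; batsman "B" has no row in fours, so it gets NA,
# which is then replaced with 0
result <- left_join(scores, fours, by = c("id", "inning", "batsman")) %>%
  mutate(fours_hit = coalesce(fours_hit, 0))
print(result)
```

Naming the keys also silences the message dplyr prints when it guesses them.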
There are also 2 batsmen who didn't face any delivery:
batsman_aggregate %>% filter(deliveries_played==0)
# A tibble: 2 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 482 2 MK Pandey 0 0 0 0
2 7907 1 MJ McClenaghan 2 0 0 0
One of them apparently scored 2 runs without facing a ball, so I think the batsman_runs column has some errors. The record of that game clearly says that on the second-last delivery of the first innings, 2 wides were scored, not runs to the batsman.
Merging data by 2 variables in R
Use two separate merges. You need to wrap your list of by variables in c(), and since the variables have different names in the two data sets, you need by.x and by.y. Afterward you can rename the rank variables.
I'll call your data winlose and teamrank, respectively. Then you'd need:
first_merge <- merge(winlose, teamrank, by.x = c('Year', 'Winning_Tm'), by.y = c('Year', 'Team'))
second_merge <- merge(first_merge, teamrank, by.x = c('Year', 'Losing_Tm'), by.y = c('Year', 'Team'))
Renaming the variables:
names(second_merge)[names(second_merge) == 'Rank.x'] <- 'Winning_Tm_rank'
names(second_merge)[names(second_merge) == 'Rank.y'] <- 'Losing_Tm_rank'
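To see why the columns end up as Rank.x and Rank.y, here is the same sequence on a small made-up data set. The values are invented; the column names Year, Winning_Tm, Losing_Tm, Team and Rank are taken from the calls above.

```r
# Hypothetical data with the column names used above
winlose <- data.frame(Year = c(2000, 2001),
                      Winning_Tm = c("A", "B"),
                      Losing_Tm  = c("B", "C"))
teamrank <- data.frame(Year = c(2000, 2000, 2001, 2001),
                       Team = c("A", "B", "B", "C"),
                       Rank = c(1, 2, 1, 3))

# First merge adds the winner's rank as a plain Rank column
first_merge <- merge(winlose, teamrank,
                     by.x = c("Year", "Winning_Tm"), by.y = c("Year", "Team"))
# Second merge adds the loser's rank; both inputs now have a Rank column,
# so merge() disambiguates them as Rank.x (from first_merge) and Rank.y
second_merge <- merge(first_merge, teamrank,
                      by.x = c("Year", "Losing_Tm"), by.y = c("Year", "Team"))

names(second_merge)[names(second_merge) == "Rank.x"] <- "Winning_Tm_rank"
names(second_merge)[names(second_merge) == "Rank.y"] <- "Losing_Tm_rank"
print(second_merge)
```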
How can I combine several columns into one variable, tacking each onto the end of the other and grouping by values in an ID variable?
Set the cols and values_to arguments of pivot_longer() correctly. cols = ... selects the columns whose values you are gathering. values_to = ... names the new column that receives those values. I think you were on the right track; it is just that pivot_longer always returns the names of the source columns as well (in a column called name by default), unless you do something trickier, so drop that column with select afterwards.
library(tidyverse)
df <- data.frame(
  a = c("string1", "string2"),
  b = c("string11", "string12"),
  c = c("string21", "string22"),
  ID = c("1111", "2222")
)
df %>%
  pivot_longer(cols = names(df)[1:3],
               values_to = "newvar") %>%
  select(newvar, ID)
Output:
# A tibble: 6 x 2
newvar ID
<chr> <chr>
1 string1 1111
2 string11 1111
3 string21 1111
4 string2 2222
5 string12 2222
6 string22 2222
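If you want to keep track of which column each value came from, a variant is to give the names column a name of your own and keep it. This is only a sketch on the same toy data; the column name source is an arbitrary choice, and cols = -ID simply means "every column except ID".

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  a = c("string1", "string2"),
  b = c("string11", "string12"),
  c = c("string21", "string22"),
  ID = c("1111", "2222")
)

long <- df %>%
  pivot_longer(cols = -ID,           # all columns except ID
               names_to = "source",  # where the old column names go
               values_to = "newvar") %>%
  select(ID, source, newvar)
print(long)
```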