How to specify names of columns for x and y when joining in dplyr?
This feature was added in dplyr v0.3. You can now pass a named character vector to the by argument of left_join (and the other joining functions) to specify which columns to join on in each data frame. With the example given in the original question, the code would be:
left_join(test_data, kantrowitz, by = c("first_name" = "name"))
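The same pattern can be tried on made-up data. This is only a minimal sketch: the data frames people and genders and their columns are hypothetical, not the ones from the original question. The name on the left of = is the column in the first data frame, and the value on the right is the column in the second.

```r
library(dplyr)

# Hypothetical data: the key is called first_name in one table and name in the other
people  <- data.frame(first_name = c("ann", "bob", "cy"), age = c(30, 41, 25))
genders <- data.frame(name = c("ann", "bob"), gender = c("female", "male"))

# Left side of "=" is the column in people, right side is the column in genders
joined <- left_join(people, genders, by = c("first_name" = "name"))
joined
```

The key column keeps its name from the first data frame, and rows without a match (here "cy") get NA in the columns that come from the second.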
How to pass column names for inner join by 2 column sets as variables with dplyr
An option would be to use setNames
...
inner_join(distinct(newData), by = setNames(c('x', 'y'), c(xvar, yvar)))
...
Full code
mtcars %>%
  mutate(rn = row_number()) %>%
  inner_join(distinct(newData), by = setNames(c('x', 'y'), c(xvar, yvar))) %>%
  pull(rn)
#[1] 1 2 3 3 5 9 9 10 11
Actual full code:
mtcars %>%
  mutate(Selected = if_else(row_number() %in% {
    mtcars %>%
      mutate(rn = row_number()) %>%
      inner_join(distinct(newData), by = setNames(c('x', 'y'), c(xvar, yvar))) %>%
      pull(rn)
  },
  !Selected, Selected))
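To see what the setNames call builds, here is a small self-contained sketch. The objects dat, newData, xvar and yvar below are hypothetical stand-ins for the ones in the question, which are not shown here. The names of the by vector refer to columns of the left-hand data frame, and the values refer to columns of newData.

```r
library(dplyr)

# Hypothetical stand-ins for the objects in the question
dat     <- data.frame(mpg = c(21, 21, 22.8, 18.7), cyl = c(6, 6, 4, 8))
newData <- data.frame(x = c(21, 22.8, 21), y = c(6, 4, 6))  # contains a duplicate row
xvar <- "mpg"   # column of dat to match against newData$x
yvar <- "cyl"   # column of dat to match against newData$y

# Names of the vector are columns in dat, values are columns in newData
by_vec <- setNames(c("x", "y"), c(xvar, yvar))

# Row numbers of dat that have a matching (x, y) pair in newData
rows <- dat %>%
  mutate(rn = row_number()) %>%
  inner_join(distinct(newData), by = by_vec) %>%
  pull(rn)
rows
```

Calling distinct() on newData first keeps a duplicated selection from matching the same row of dat twice.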
Join creates .x and .y column but they have identical content - why?
Let's take this simple example.
library(dplyr)
set.seed(123)
df1 <- data.frame(a = 1:4, b = 1:4, c = rnorm(4))
df2 <- data.frame(a = 4:1, b = 4:1, c = rnorm(4))
df1
# a b c
#1 1 1 -0.56047565
#2 2 2 -0.23017749
#3 3 3 1.55870831
#4 4 4 0.07050839
df2
# a b c
#1 4 4 0.1292877
#2 3 3 1.7150650
#3 2 2 0.4609162
#4 1 1 -1.2650612
Notice that the values in columns a and b are the same in both data frames (although the order is different).
When you join only by a you get
df1 %>% left_join(df2, by = 'a')
# a b.x c.x b.y c.y
#1 1 1 -0.56047565 1 -1.2650612
#2 2 2 -0.23017749 2 0.4609162
#3 3 3 1.55870831 3 1.7150650
#4 4 4 0.07050839 4 0.1292877
You asked to join only by a, so only the a column is matched. The remaining columns are treated as distinct even if their values are the same, hence you get b.x and b.y as well as c.x and c.y.
If you do not want b.x and b.y to be generated because they are the same, include b in by.
df1 %>% left_join(df2, by = c('a', 'b'))
# a b c.x c.y
#1 1 1 -0.56047565 -1.2650612
#2 2 2 -0.23017749 0.4609162
#3 3 3 1.55870831 1.7150650
#4 4 4 0.07050839 0.1292877
Now the only additional columns are c.x and c.y.
Avoiding and renaming .x and .y columns when merging or joining in r
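This was an open issue with dplyr when the answer was written, so you had to rename columns before or after the join, or use merge from base R, which takes a suffixes argument. More recent dplyr releases add a suffix argument to the join functions for the same purpose.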
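A minimal sketch of both routes, assuming an installed dplyr whose join functions accept suffix (check ?left_join if unsure). The data frames and the suffix strings are made up for illustration.

```r
library(dplyr)

df1 <- data.frame(a = 1:3, c = c(10, 20, 30))
df2 <- data.frame(a = 1:3, c = c(0.1, 0.2, 0.3))

# suffix replaces the default c(".x", ".y") for non-key columns present in both inputs
res_dplyr <- left_join(df1, df2, by = "a", suffix = c("_left", "_right"))

# Base R equivalent; note that the argument is spelled suffixes here
res_base <- merge(df1, df2, by = "a", suffixes = c("_left", "_right"))

names(res_dplyr)
names(res_base)
```

Both calls should give the columns a, c_left and c_right, so no renaming step is needed afterwards.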
How to join two dataframes with dplyr based on two columns with different names in each dataframe?
df3 <- dplyr::left_join(df1, df2, by=c("name1" = "name3", "name2" = "name4"))
tidyverse join two data sets with dynamic column names for by column
left_join(x, y, by = setNames(y_name, x_name))
# a1 new
# 1 1 a
# 2 2 b
# 3 3 <NA>
What your by= vector was doing:
setNames(c('x', 'y'), c(x_name, y_name))
# a1 a2
# "x" "y"
Whereas you needed
c(a1 = "a2")
# a1
# "a2"
setNames(y_name, x_name)
# a1
# "a2"
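Put together as a runnable sketch: the inputs x and y below are hypothetical, chosen only to be consistent with the output shown above, since the question's data is not reproduced here.

```r
library(dplyr)

# Hypothetical inputs chosen to be consistent with the output shown above
x <- data.frame(a1 = 1:3)
y <- data.frame(a2 = 1:2, new = c("a", "b"))
x_name <- "a1"   # key column in x
y_name <- "a2"   # key column in y

# Names come from x, values from y: equivalent to c(a1 = "a2")
by_vec <- setNames(y_name, x_name)
res <- left_join(x, y, by = by_vec)
res
```

The rule of thumb is that setNames(values, names) should receive the second data frame's column names as values and the first data frame's column names as names.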
use dplyr to combine columns of data.frame when column names are not known
With a little trial and error:
colNames_as_symbols <- syms(names(myTibble))
transmute(myTibble, concat = paste(!!!colNames_as_symbols, sep = '.'))
Here was the hint that put me on to the solution, from the documentation for !!!:
The big-bang operator !!! forces-splice a list of objects. The elements of the list are spliced in place, meaning that they each become one single argument.
vars <- syms(c("height", "mass"))
Force-splicing is equivalent to supplying the elements separately:
starwars %>% select(!!!vars)
starwars %>% select(height, mass)
In fact, the entire documentation entitled "Force parts of an expression" is fascinating reading. It can be accessed by issuing ?qq_show
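For a self-contained check of the syms() and !!! approach, here is a small sketch. The tibble myTibble and its columns are invented for illustration, and syms() is taken from rlang explicitly.

```r
library(dplyr)
library(rlang)

# Hypothetical tibble; the code below never refers to its column names directly
myTibble <- tibble(id = c("a", "b"), grp = c("x", "y"), n = c(1, 2))

# One symbol per column, spliced into paste() as separate arguments
cols <- syms(names(myTibble))
res <- transmute(myTibble, concat = paste(!!!cols, sep = "."))
res$concat
```

With these three columns the spliced call behaves like paste(id, grp, n, sep = "."), so each row is collapsed into a single dot-separated string.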
Can dplyr join on multiple columns or composite key?
Updating to use tibble()
You can pass a named vector of length greater than 1 to the by argument of left_join():
library(dplyr)
d1 <- tibble(
  x = letters[1:3],
  y = LETTERS[1:3],
  a = rnorm(3)
)
d2 <- tibble(
  x2 = letters[3:1],
  y2 = LETTERS[3:1],
  b = rnorm(3)
)
left_join(d1, d2, by = c("x" = "x2", "y" = "y2"))