How to Specify Names of Columns for X and Y When Joining in Dplyr

How to specify names of columns for x and y when joining in dplyr?

This feature was added in dplyr v0.3. You can pass a named character vector to the by argument of left_join (and the other join functions) to specify which column of each data frame to join on: the names refer to columns in x, the values to columns in y. With the example given in the original question, the code would be:

left_join(test_data, kantrowitz, by = c("first_name" = "name"))
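
To make the call above concrete, here is a minimal sketch with made-up data. The data frame names test_data and kantrowitz come from the question; the columns other than first_name and name, and all of the values, are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical stand-ins for the question's data frames
test_data <- data.frame(
  first_name = c("abby", "bill", "zed"),
  score = c(1, 2, 3)
)
kantrowitz <- data.frame(
  name = c("abby", "bill"),
  gender = c("female", "male")
)

# Names of `by` are columns in x (test_data); values are columns in y (kantrowitz)
joined <- left_join(test_data, kantrowitz, by = c("first_name" = "name"))

# The key column keeps the name it has in x; unmatched rows get NA in gender
names(joined)
joined
```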

How to pass column names for inner join by 2 column sets as variables with dplyr

An option would be to use setNames

 ...
inner_join(distinct(newData), by = setNames(c('x', 'y'), c(xvar, yvar)))
...

Full code

mtcars %>%
  mutate(rn = row_number()) %>%
  inner_join(distinct(newData), by = setNames(c('x', 'y'), c(xvar, yvar))) %>%
  pull(rn)
#[1] 1 2 3 3 5 9 9 10 11

actual full code:

mtcars %>%
  mutate(Selected = if_else(
    row_number() %in% {
      mtcars %>%
        mutate(rn = row_number()) %>%
        inner_join(distinct(newData), by = setNames(c('x', 'y'), c(xvar, yvar))) %>%
        pull(rn)
    },
    !Selected, Selected))
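
The snippets above rely on xvar, yvar and newData from the question, which are not shown here. A minimal sketch under assumed definitions (the choice of mpg and wt as key columns and the contents of newData are hypothetical) shows how setNames builds the by vector from variables:

```r
library(dplyr)

# Assumed inputs: column names held in variables, and a lookup table whose
# key columns are called x and y (values copied from mtcars rows 1 and 3)
xvar <- "mpg"
yvar <- "wt"
newData <- data.frame(
  x = mtcars$mpg[c(1, 3, 3)],
  y = mtcars$wt[c(1, 3, 3)]
)

# setNames(values, names): names are columns in mtcars, values are columns in newData
by_vec <- setNames(c("x", "y"), c(xvar, yvar))
by_vec

# Row numbers of mtcars whose (mpg, wt) pair appears in newData
rn <- mtcars %>%
  mutate(rn = row_number()) %>%
  inner_join(distinct(newData), by = by_vec) %>%
  pull(rn)
rn
```

Because the names of the by vector come from variables, the same pipeline works unchanged when xvar and yvar are supplied at run time.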

Join creates .x and .y column but they have identical content - why?

Let's take this simple example.

library(dplyr)
set.seed(123)
df1 <- data.frame(a = 1:4, b = 1:4, c = rnorm(4))
df2 <- data.frame(a = 4:1, b = 4:1, c = rnorm(4))
df1

# a b c
#1 1 1 -0.56047565
#2 2 2 -0.23017749
#3 3 3 1.55870831
#4 4 4 0.07050839

df2
# a b c
#1 4 4 0.1292877
#2 3 3 1.7150650
#3 2 2 0.4609162
#4 1 1 -1.2650612

Notice that columns a and b hold the same values in both data frames, although in a different row order.

When you join only by a you get

df1 %>% left_join(df2, by = 'a')
# a b.x c.x b.y c.y
#1 1 1 -0.56047565 1 -1.2650612
#2 2 2 -0.23017749 2 0.4609162
#3 3 3 1.55870831 3 1.7150650
#4 4 4 0.07050839 4 0.1292877

You asked to join only by a, so only column a is used for matching. The remaining columns are treated as distinct variables even when their values happen to be the same, so they are disambiguated with suffixes: you get b.x and b.y as well as c.x and c.y.

If you do not want b.x and b.y to be generated, since they hold the same values, include b in by.

df1 %>% left_join(df2, by = c('a', 'b'))

# a b c.x c.y
#1 1 1 -0.56047565 -1.2650612
#2 2 2 -0.23017749 0.4609162
#3 3 3 1.55870831 1.7150650
#4 4 4 0.07050839 0.1292877

Now you get only c.x and c.y additional columns.

Avoiding and renaming .x and .y columns when merging or joining in r

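When this answer was written, this was an open issue with dplyr, and you had to rename before or after the join or use merge from base R, which takes a suffixes argument. The dplyr join functions have since gained a suffix argument that serves the same purpose.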
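
As a sketch of the renaming options, using the df1 and df2 defined in the previous example: base R's merge takes suffixes, and the join functions in current dplyr releases accept a similar suffix argument. The suffix strings _left and _right are arbitrary choices for illustration.

```r
library(dplyr)

set.seed(123)
df1 <- data.frame(a = 1:4, b = 1:4, c = rnorm(4))
df2 <- data.frame(a = 4:1, b = 4:1, c = rnorm(4))

# dplyr: `suffix` replaces the default c(".x", ".y")
joined <- left_join(df1, df2, by = "a", suffix = c("_left", "_right"))
names(joined)

# Base R: `suffixes` plays the same role in merge()
merged <- merge(df1, df2, by = "a", suffixes = c("_left", "_right"))
names(merged)
```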

How to join two dataframes with dplyr based on two columns with different names in each dataframe?

df3 <- dplyr::left_join(df1, df2, by=c("name1" = "name3", "name2" = "name4"))

tidyverse join two data sets with dynamic column names for by column

left_join(x, y, by = setNames(y_name, x_name))
# a1 new
# 1 1 a
# 2 2 b
# 3 3 <NA>

What your by= vector was doing

setNames(c('x', 'y'), c(x_name, y_name))
# a1 a2
# "x" "y"

Whereas you needed

c(a1 = "a2")
# a1
# "a2"
setNames(y_name, x_name)
# a1
# "a2"
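
Putting the pieces together, here is a self-contained sketch. The contents of x and y are assumptions, chosen only to be consistent with the output shown above; x_name and y_name hold the key column names as in the question.

```r
library(dplyr)

# Assumed data, reconstructed to match the output above
x <- data.frame(a1 = 1:3)
y <- data.frame(a2 = 1:2, new = c("a", "b"))
x_name <- "a1"
y_name <- "a2"

# Names must be the x column, values the y column
by_vec <- setNames(y_name, x_name)
by_vec

joined <- left_join(x, y, by = by_vec)
joined
```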

use dplyr to combine columns of data.frame when column names are not known

With a little trial and error:

colNames_as_symbols <- syms(names(myTibble))
transmute(myTibble, concat = paste(!!!colNames_as_symbols, sep = '.'))

Here was the hint that put me on to the solution... From the documentation for !!!:

The big-bang operator !!! forces-splice a list of objects. The
elements of the list are spliced in place, meaning that they each
become one single argument.

vars <- syms(c("height", "mass"))

Force-splicing is equivalent to supplying the elements separately:

starwars %>% select(!!!vars)
starwars %>% select(height, mass)

In fact, the entire documentation page titled "Force parts of an expression" is fascinating reading. It can be opened with ?qq_show.
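
A self-contained sketch of the same idea follows. The tibble and its column names are made up for illustration, and syms is called through rlang explicitly.

```r
library(dplyr)

# Hypothetical tibble; the code below never refers to its column names directly
myTibble <- tibble(
  first = c("a", "b"),
  second = c("x", "y"),
  third = c("1", "2")
)

# Turn every column name into a symbol, then splice the symbols into paste()
colNames_as_symbols <- rlang::syms(names(myTibble))
result <- transmute(myTibble, concat = paste(!!!colNames_as_symbols, sep = "."))
result$concat
```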

Can dplyr join on multiple columns or composite key?

Updating to use tibble()

You can pass a named vector of length greater than 1 to the by argument of left_join():

library(dplyr)

d1 <- tibble(
x = letters[1:3],
y = LETTERS[1:3],
a = rnorm(3)
)

d2 <- tibble(
x2 = letters[3:1],
y2 = LETTERS[3:1],
b = rnorm(3)
)

left_join(d1, d2, by = c("x" = "x2", "y" = "y2"))

