Merging on Multiple Columns R

How do I combine two data-frames based on two columns?

See the documentation on ?merge, which states:

By default the data frames are merged on the columns with names they both have, 
but separate specifications of the columns can be given by by.x and by.y.

This clearly implies that merge will merge data frames based on more than one column. From the final example given in the documentation:

x <- data.frame(k1=c(NA,NA,3,4,5), k2=c(1,NA,NA,4,5), data=1:5)
y <- data.frame(k1=c(NA,2,NA,4,5), k2=c(NA,NA,3,4,5), data=1:5)
merge(x, y, by=c("k1","k2")) # NA's match

This example was meant to demonstrate the use of incomparables, but it illustrates merging using multiple columns as well. You can also specify separate columns in each of x and y using by.x and by.y.
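
As a minimal sketch of the by.x/by.y form with two key columns, assume two hypothetical data frames, left and right, whose key columns are named differently (the frames and column names are illustrative and do not come from the documentation):

```r
# Hypothetical data: the key columns have different names in each frame.
left  <- data.frame(id = c(1, 2, 3), yr = c(2001, 2001, 2002), a = c("p", "q", "r"))
right <- data.frame(key = c(1, 2, 4), year = c(2001, 2001, 2002), b = c(10, 20, 40))

# by.x and by.y are paired by position: id <-> key, yr <-> year.
res <- merge(left, right, by.x = c("id", "yr"), by.y = c("key", "year"))
print(res)
```

The key columns in the result keep the names from the first data frame (id and yr here).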

Merging on multiple columns in R

Try this:

library(dplyr)

full_join(data1, data2, by = c("id", "spp"))

Output:

  spp trait.1 id trait.2
1 A 1.1 1 NA
2 B 1.1 2 2
3 C 1.1 3 NA
4 C NA 9 2
5 D NA 7 2

Alternatively, merge would also work:

merge(data1, data2, by = c("id", "spp"), all = TRUE)
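
The original data1 and data2 are not shown, so the frames below are hypothetical, chosen only to be consistent with the output above. This is a sketch of the merge variant; the row and column order differs from what full_join prints:

```r
# Hypothetical inputs (assumed, not from the original question).
data1 <- data.frame(id = c(1, 2, 3), spp = c("A", "B", "C"), trait.1 = 1.1)
data2 <- data.frame(id = c(2, 9, 7), spp = c("B", "C", "D"), trait.2 = 2)

# Full outer join on both key columns; unmatched rows are filled with NA.
both <- merge(data1, data2, by = c("id", "spp"), all = TRUE)
print(both)
```

Only the pair id = 2, spp = "B" occurs in both frames, so every other row has NA in one of the trait columns.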

Merging two dataframes by multiple columns without losing data

merge has an argument all that specifies whether to keep all rows from both sides (i.e. all rows from x and all rows from y), including rows that have no match in the other data frame.

total <- merge(df1,df2,by=c("id","year"), all=TRUE)
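
To see what all changes, here is a small sketch comparing the default inner merge with all.x = TRUE and all = TRUE. The data frames df1 and df2 below are made up for illustration:

```r
# Hypothetical data keyed on id and year.
df1 <- data.frame(id = c(1, 1, 2), year = c(2000, 2001, 2000), x = c(5, 6, 7))
df2 <- data.frame(id = c(1, 2, 3), year = c(2000, 2000, 2000), y = c("a", "b", "c"))

inner <- merge(df1, df2, by = c("id", "year"))                # matching rows only
left  <- merge(df1, df2, by = c("id", "year"), all.x = TRUE)  # every row of df1
full  <- merge(df1, df2, by = c("id", "year"), all = TRUE)    # every row of both

# Row counts of the three variants.
print(c(inner = nrow(inner), left = nrow(left), full = nrow(full)))
```

Rows without a partner get NA in the columns that come from the other data frame.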

How can I combine several columns into one variable, tacking each onto the end of the other and grouping by values in an ID variable?

Set the cols and values_to arguments of pivot_longer() correctly. cols = ... selects the columns to take the values from. values_to = ... gives the name of the new column that receives those values. Your attempt was close: pivot_longer() always returns a column holding the names of the source columns as well, so drop it afterwards with select().

library(tidyverse)

df = data.frame(
  a = c("string1", "string2"),
  b = c("string11", "string12"),
  c = c("string21", "string22"),
  ID = c("1111", "2222")
)

df %>%
  pivot_longer(cols = names(df)[1:3],
               values_to = "newvar") %>%
  select(newvar, ID)

Output:

# A tibble: 6 x 2
newvar ID
<chr> <chr>
1 string1 1111
2 string11 1111
3 string21 1111
4 string2 2222
5 string12 2222
6 string22 2222
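
For comparison, here is a base R sketch that builds the same long layout without tidyr. It is an alternative technique, not the approach used above, and the helper name value_cols is made up for this example:

```r
df <- data.frame(
  a = c("string1", "string2"),
  b = c("string11", "string12"),
  c = c("string21", "string22"),
  ID = c("1111", "2222"),
  stringsAsFactors = FALSE
)

value_cols <- c("a", "b", "c")  # columns to stack (hypothetical helper name)

# t() puts one original row in each matrix column, so as.vector() reads the
# values row by row: a, b, c for the first ID, then a, b, c for the second.
long <- data.frame(
  newvar = as.vector(t(df[value_cols])),
  ID = rep(df$ID, each = length(value_cols)),
  stringsAsFactors = FALSE
)
print(long)
```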

merge dataframes based on multiple columns and thresholds

I first turned the city names into character vectors, since (if I understood correctly) you want to include city names that are contained within df2.

df1$city<-as.character(df1$city)
df2$city<-as.character(df2$city)

Then merge them by country:

df = merge(df1, df2, by = "ctry")

> df
ctry date.x city.x number col date.y city.y other_number other_col
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue

The stringr package lets you check whether city.x occurs within city.y (see the last column):

library(stringr)
df$city_keep<-str_detect(df$city.y,df$city.x) # TRUE where city.x occurs in city.y (one direction only; city.x is treated as a regular expression)
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige TRUE
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red TRUE
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue TRUE
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red TRUE
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple FALSE
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black TRUE
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue TRUE
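
If the city names may contain characters that are special in regular expressions, a literal match is safer. The following base R sketch swaps in grepl with fixed = TRUE instead of str_detect; the vectors city_x and city_y are a small hypothetical subset of the columns above:

```r
# Hypothetical subset of the city.x and city.y columns.
city_x <- c("Vienna", "Paris", "Bern", "London")
city_y <- c("Vienna", "East-Paris", "Zurich", "near London")

# Pairwise literal test: is city_x[i] contained in city_y[i]?
# mapply passes city_x as grepl's pattern and city_y as its x argument.
city_keep <- unname(mapply(grepl, city_x, city_y, MoreArgs = list(fixed = TRUE)))
print(city_keep)
```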

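Then you can get the difference in days between dates. Note that yday is the day of the year, so this compares positions within the year and ignores the year itself (row 9 in the output further down reports 212, although those two dates are more than a year apart):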

df$dayDiff<-abs(as.POSIXlt(df$date.x)$yday - as.POSIXlt(df$date.y)$yday)

and the difference in numbers:

df$numDiff<-abs(df$number - df$other_number)
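
If the dates can fall in different years, subtracting Date objects gives the true number of elapsed days. The sketch below contrasts it with the yday-based day difference, using two date pairs taken from the table above (the variable names are made up for this example):

```r
# Rows 1 and 9 of the merged table (Austria and Tunisia).
date_x <- as.Date(c("2002-07-30", "2001-08-29"))
date_y <- as.Date(c("2002-07-01", "2000-01-29"))

# Elapsed days: subtracting two Date vectors yields a difftime in days.
day_diff <- abs(as.numeric(date_x - date_y))

# Day-of-year difference, as computed with yday (the year is ignored).
yday_diff <- abs(as.POSIXlt(date_x)$yday - as.POSIXlt(date_y)$yday)

print(data.frame(day_diff, yday_diff))
```

Both methods agree for the first pair, which lies within one year, and disagree for the second pair, which spans two years. With a threshold of 5 days the filtered result in this example is the same either way.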

Here is what the resulting dataframe looks like:

> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep dayDiff numDiff
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige TRUE 29 1
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE 1 1
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE 0 3
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE 0 3
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red TRUE 0 5
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue TRUE 2 710
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red TRUE 1 5
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple FALSE 0 1
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black TRUE 212 0
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue TRUE 0 3070

Now drop the rows where city.x was not found within city.y, where the day difference is greater than 5, or where the number difference is greater than 3:

df<-df[df$dayDiff<=5 & df$numDiff<=3 & df$city_keep==TRUE,]

> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep dayDiff numDiff
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE 1 1
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE 0 3
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE 0 3

What is left are the three rows that you had above (which contained dots in column 1).

Now we can drop the three columns we created, and the date and city from df2:

> df<-subset(df, select=-c(city.y, date.y, city_keep, dayDiff, numDiff))
> df
ctry date.x city.x number col other_number other_col
2 Denmark 1999-06-30 Copenhagen 60 cucumber 61 orange
3 France 1999-06-12 Paris 20 banana 17 green
4 Germany 2003-08-29 Berlin 10 apple 13 yellow

How can I combine multiple columns into one in an R dataset?

A solution using tidyverse. dat4 is the final output.

library(tidyverse)

dat2 <- dat %>%
  mutate(ID = 1:n())

dat3 <- dat2 %>%
  pivot_longer(a:f, names_to = "value", values_to = "number") %>%
  filter(number == 1) %>%
  select(-number)

dat4 <- dat2 %>%
  left_join(dat3) %>%
  select(-ID, -c(a:f)) %>%
  replace_na(list(value = "none"))

dat4
# age gender race insured value
# 1 13 Female white 0 none
# 2 12 Female white 1 none
# 3 19 Male other 0 f
# 4 19 Female white 0 b
# 5 13 Female white 0 a
# 6 13 Female white 0 b
# 7 13 Female white 0 f

DATA

dat <- read.table(text = "      age gender a     b     c     d     e     f     race  insured 
1 13 Female 0 0 0 0 0 0 white 0
2 12 Female 0 0 0 0 0 0 white 1
3 19 Male 0 0 0 0 0 1 other 0
4 19 Female 0 1 0 0 0 0 white 0
5 13 Female 1 1 0 0 0 1 white 0",
header = TRUE)

Merging two dataframes by two columns resulting in blank df

You should first tell R that both liver_date columns are dates. The function as.Date lets you do that.

So let's say we have df1 and df2:

date1 <- c("2007-08-01", "2004-10-05", "2014-03-09") # Year-Month-Day
date2 <- c("8/1/07", "10/5/04", "3/9/14") # Month/Day/Year
x <- 1:3
z <- 11:13
w <- 11:13

df1 <- data.frame(date1, x, z)
str(df1$date1)

df1

> df1
date1 x z
1 2007-08-01 1 11
2 2004-10-05 2 12
3 2014-03-09 3 13

df2<-data.frame(date2, x, w)
str(df2$date2)

df2
> df2
date2 x w
1 8/1/07 1 11
2 10/5/04 2 12
3 3/9/14 3 13

With as.Date you specify the format in which the dates are stored. For df1 it is year-month-day:

df1$date1 <- as.Date(df1$date1, format = "%Y-%m-%d")
str(df1$date1)

And for df2 it is month/day/two-digit year:

df2$date1 <- as.Date(df2$date2, format = "%m/%d/%y")
str(df2$date1)
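
As a quick check of the two format strings, this sketch parses the same three dates from both notations and compares the results (the variable names iso, us and bad are made up for this example):

```r
# The same dates written in the two notations used above.
iso <- as.Date(c("2007-08-01", "2004-10-05", "2014-03-09"), format = "%Y-%m-%d")
us  <- as.Date(c("8/1/07", "10/5/04", "3/9/14"), format = "%m/%d/%y")

# Once parsed, both vectors hold the same Date values.
print(identical(iso, us))

# A format string that does not fit the text yields NA rather than an error.
bad <- as.Date("8/1/07", format = "%Y-%m-%d")
print(is.na(bad))
```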

We store the converted df2$date2 in a new column, df2$date1, so that the column name matches the one in df1; merge needs this later. In your case you can convert in place, because both columns already have the same name:

df3<-merge(df1,df2, by =c("date1", "x" ) )
df3

>df3
date1 x z date2 w
1 2004-10-05 2 12 10/5/04 12
2 2007-08-01 1 11 8/1/07 11
3 2014-03-09 3 13 3/9/14 13

As you can see, z and w match perfectly, so we know we did it right.
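
To show why the merge came back empty in the first place, here is a small sketch with hypothetical frames a and b whose date column d is still stored as text in two different notations:

```r
# Hypothetical frames: the same dates, stored as text in different notations.
a <- data.frame(d = c("2007-08-01", "2004-10-05"), x = 1:2, z = 11:12)
b <- data.frame(d = c("8/1/07", "10/5/04"), x = 1:2, w = 11:12)

# As text, no value of d is equal across the frames, so nothing matches.
empty <- merge(a, b, by = c("d", "x"))
print(nrow(empty))

# After converting both columns to Date, the keys line up.
a$d <- as.Date(a$d, format = "%Y-%m-%d")
b$d <- as.Date(b$d, format = "%m/%d/%y")
matched <- merge(a, b, by = c("d", "x"))
print(matched)
```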

In your data:

df1 = qtpo_liver_dates

df2 = labs_v500

date1, date2 = liver_date

x = patient_num

z = Some column in qtpo_liver_dates

w = Some column in labs_v500


