How do I combine two data-frames based on two columns?
See the documentation on ?merge, which states:
By default the data frames are merged on the columns with names they both have,
but separate specifications of the columns can be given by by.x and by.y.
This implies that merge can combine data frames on more than one column. The final example given in the documentation shows this:
x <- data.frame(k1=c(NA,NA,3,4,5), k2=c(1,NA,NA,4,5), data=1:5)
y <- data.frame(k1=c(NA,2,NA,4,5), k2=c(NA,NA,3,4,5), data=1:5)
merge(x, y, by=c("k1","k2")) # NA's match
This example was meant to demonstrate the use of incomparables, but it illustrates merging on multiple columns as well. You can also specify separate columns in each of x and y using by.x and by.y.
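As a minimal sketch of the by.x / by.y form with two key columns: the data frames left and right and their column names are made up for illustration and do not come from the documentation example.

```r
# Hypothetical data: the key columns are named differently in each frame.
left  <- data.frame(id  = c(1, 2, 3), yr   = c(2001, 2001, 2002), a = c("p", "q", "r"))
right <- data.frame(pid = c(1, 2, 4), year = c(2001, 2001, 2002), b = c(10, 20, 40))

# by.x and by.y are paired by position: id <-> pid, yr <-> year.
res <- merge(left, right, by.x = c("id", "yr"), by.y = c("pid", "year"))
res
```

The result keeps the key names from the first data frame (id and yr) and contains only the rows whose key pairs appear in both inputs.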
merging on multiple columns R
Try this:
library(dplyr)
full_join(data1, data2, by = c("id", "spp"))
Output:
spp trait.1 id trait.2
1 A 1.1 1 NA
2 B 1.1 2 2
3 C 1.1 3 NA
4 C NA 9 2
5 D NA 7 2
Alternatively, merge would also work:
merge(data1, data2, by = c("id", "spp"), all = TRUE)
Merging two dataframes by multiple columns without losing data
merge has an argument all that specifies whether to keep all rows from both the left and the right side (i.e. all rows from x and all rows from y):
total <- merge(df1,df2,by=c("id","year"), all=TRUE)
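To see what all changes, here is a small sketch comparing the default merge with all.x = TRUE and all = TRUE. The contents of df1 and df2 are invented for illustration; only the column names id and year follow the call above.

```r
# Hypothetical inputs sharing the key columns id and year.
df1 <- data.frame(id = c(1, 1, 2), year = c(2000, 2001, 2000), x = c("a", "b", "c"))
df2 <- data.frame(id = c(1, 2, 3), year = c(2000, 2000, 2000), y = c(10, 20, 30))

inner <- merge(df1, df2, by = c("id", "year"))                # only matching (id, year) pairs
left  <- merge(df1, df2, by = c("id", "year"), all.x = TRUE)  # every row of df1
total <- merge(df1, df2, by = c("id", "year"), all = TRUE)    # every row of both

nrow(inner)
nrow(left)
nrow(total)
```

Rows without a partner on the other side are kept with NA in the columns that come from the missing side.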
How can I combine several columns into one variable, tacking each onto the end of the other and grouping by values in an ID variable?
Try setting the cols and values_to arguments of pivot_longer() correctly. cols = ... defines the columns you are taking the values from. values_to = ... defines the name of the new column that receives the values taken from cols. I think you were on the right track; it is just that pivot_longer always returns the names of the columns the values came from, unless you try trickier things, so you have to drop that column afterwards.
library(tidyverse)
df = data.frame(
a = c("string1","string2"),
b= c("string11","string12"),
c = c("string21", "string22"),
ID = c("1111","2222")
)
df %>%
pivot_longer(cols = names(df)[1:3],
values_to = "newvar") %>%
select(newvar, ID)
Output:
# A tibble: 6 x 2
newvar ID
<chr> <chr>
1 string1 1111
2 string11 1111
3 string21 1111
4 string2 2222
5 string12 2222
6 string22 2222
merge dataframes based on multiple columns and thresholds
I first turned the city names into character vectors, since (if I understood correctly) you want to include city names that are contained within df2.
df1$city<-as.character(df1$city)
df2$city<-as.character(df2$city)
Then merge them by country:
df = merge(df1, df2, by = ("ctry"))
> df
ctry date.x city.x number col date.y city.y other_number other_col
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue
The stringr library lets you check whether city.x is contained in city.y (see the last column):
library(stringr)
df$city_keep<-str_detect(df$city.y,df$city.x) # this returns logical vector if city.x is contained in city.y (works one way)
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige TRUE
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red TRUE
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue TRUE
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red TRUE
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple FALSE
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black TRUE
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue TRUE
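One caveat with the str_detect call above: by default its pattern argument is treated as a regular expression, so a city name containing characters such as "." is not matched literally. The following base-R sketch shows the difference using grepl with fixed = TRUE; the city names are made up for illustration and are not part of the data above.

```r
# Hypothetical city names; "St. Louis" contains a regex metacharacter (".").
city_x <- c("Paris", "Bern", "St. Louis")
city_y <- c("East-Paris", "Zurich", "StX Louis")

# Element-wise containment test, pattern interpreted as a regular expression.
regex_match <- unname(mapply(function(p, s) grepl(p, s), city_x, city_y))

# Same test, but the pattern is matched literally.
fixed_match <- unname(mapply(function(p, s) grepl(p, s, fixed = TRUE), city_x, city_y))

regex_match
fixed_match
```

With stringr itself, wrapping the pattern in fixed() has the same literal-matching effect.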
Then you can get the difference in days between dates:
df$dayDiff<-abs(as.POSIXlt(df$date.x)$yday - as.POSIXlt(df$date.y)$yday)
and the difference in numbers:
df$numDiff<-abs(df$number - df$other_number)
Here is what the resulting data frame looks like:
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep dayDiff numDiff
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige TRUE 29 1
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE 1 1
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE 0 3
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE 0 3
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red TRUE 0 5
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue TRUE 2 710
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red TRUE 1 5
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple FALSE 0 1
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black TRUE 212 0
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue TRUE 0 3070
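Note that $yday is the day of the year, so the dayDiff column above ignores the year: the Tunisia row shows 212 even though the two dates are more than a year apart. If the real elapsed time matters, a sketch using difftime on Date values could look like this (the vectors date_x and date_y are stand-ins for df$date.x and df$date.y, using two of the date pairs from the table):

```r
# Two date pairs taken from the table above (Tunisia and Austria).
date_x <- as.Date(c("2001-08-29", "2002-07-30"))
date_y <- as.Date(c("2000-01-29", "2002-07-01"))

# Day-of-year difference, as computed in the answer: the year is ignored.
yday_diff <- abs(as.POSIXlt(date_x)$yday - as.POSIXlt(date_y)$yday)

# Elapsed calendar days between the two dates.
day_diff <- abs(as.numeric(difftime(date_x, date_y, units = "days")))

yday_diff
day_diff
```

For dates in the same year both approaches agree; they differ only when the pair spans different years.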
Now we drop the rows where city.x was not found within city.y, where the day difference is greater than 5, or where the number difference is greater than 3:
df<-df[df$dayDiff<=5 & df$numDiff<=3 & df$city_keep==TRUE,]
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep dayDiff numDiff
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE 1 1
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE 0 3
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE 0 3
What is left are the three rows that you had above (which contained dots in column 1).
Now we can drop the three columns we created, and the date and city from df2:
> df<-subset(df, select=-c(city.y, date.y, city_keep, dayDiff, numDiff))
> df
ctry date.x city.x number col other_number other_col
2 Denmark 1999-06-30 Copenhagen 60 cucumber 61 orange
3 France 1999-06-12 Paris 20 banana 17 green
4 Germany 2003-08-29 Berlin 10 apple 13 yellow
How can I combine multiple columns into one in an R dataset?
A solution using tidyverse. dat4 is the final output.
library(tidyverse)
dat2 <- dat %>%
mutate(ID = 1:n())
dat3 <- dat2 %>%
pivot_longer(a:f, names_to = "value", values_to = "number") %>%
filter(number == 1) %>%
select(-number)
dat4 <- dat2 %>%
left_join(dat3) %>%
select(-ID, -c(a:f)) %>%
replace_na(list(value = "none"))
dat4
# age gender race insured value
# 1 13 Female white 0 none
# 2 12 Female white 1 none
# 3 19 Male other 0 f
# 4 19 Female white 0 b
# 5 13 Female white 0 a
# 6 13 Female white 0 b
# 7 13 Female white 0 f
DATA
dat <- read.table(text = " age gender a b c d e f race insured
1 13 Female 0 0 0 0 0 0 white 0
2 12 Female 0 0 0 0 0 0 white 1
3 19 Male 0 0 0 0 0 1 other 0
4 19 Female 0 1 0 0 0 0 white 0
5 13 Female 1 1 0 0 0 1 white 0",
header = TRUE)
Merging two dataframes by two columns resulting in blank df
You should first tell R that both liver_date columns are dates. The function as.Date lets you do that.
So let's say we have df1 and df2:
date1<-(c("2007-08-01", "2004-10-05", "2014-03-09"))#Year - Month - Day
date2<-(c("8/1/07", "10/5/04", "3/9/14"))#Month/Day/Year
x<-(c(1:3))
z<-c(11:13)
w<-c(11:13)
df1<-data.frame(date1, x, z)
str(df1$date1)
df1
> df1
date1 x z
1 2007-08-01 1 11
2 2004-10-05 2 12
3 2014-03-09 3 13
df2<-data.frame(date2, x, w)
str(df2$date2)
df2
> df2
date2 x w
1 8/1/07 1 11
2 10/5/04 2 12
3 3/9/14 3 13
With as.Date you specify the format in which the dates are stored in the column. For df1 it is year-month-day (%Y-%m-%d):
df1$date1<-as.Date(df1$date1,format="%Y-%m-%d")
str(df1$date1)
And for df2 it is month/day/year (%m/%d/%y):
df2$date1<-as.Date(df2$date2,format="%m/%d/%y")
str(df2$date1)
We store the converted df2$date2 in a new column, df2$date1, so that the column name matches the one in df1; the merge function needs this later. In your case you could convert in place, because both columns already have the same name:
df3<-merge(df1,df2, by =c("date1", "x" ) )
df3
>df3
date1 x z date2 w
1 2004-10-05 2 12 10/5/04 12
2 2007-08-01 1 11 8/1/07 11
3 2014-03-09 3 13 3/9/14 13
As you can see, z and w match perfectly, so we know we did it right.
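To see why the conversion matters, here is a small sketch that merges before and after calling as.Date. The data frames a and b are cut-down, hypothetical copies of df1 and df2 with the date column already sharing the name date1.

```r
# Same dates, stored as text in two different formats.
a <- data.frame(date1 = c("2007-08-01", "2004-10-05"), x = 1:2, z = 11:12)
b <- data.frame(date1 = c("8/1/07", "10/5/04"),       x = 1:2, w = 11:12)

# The strings never compare equal, so the merge comes back empty.
empty <- merge(a, b, by = c("date1", "x"))
nrow(empty)  # 0

# After converting both columns to Date, the keys line up.
a$date1 <- as.Date(a$date1, format = "%Y-%m-%d")
b$date1 <- as.Date(b$date1, format = "%m/%d/%y")
fixed <- merge(a, b, by = c("date1", "x"))
nrow(fixed)
```

An empty merge result is therefore often a sign that the key columns hold the same information in different types or formats.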
In your data:
df1 = qtpo_liver_dates
df2 = labs_v500
date1, date2 = liver_date
x = patient_num
z = some column in qtpo_liver_dates
w = some column in labs_v500