Merging on Multiple Columns R

How do I combine two data-frames based on two columns?

See the documentation on ?merge, which states:

By default the data frames are merged on the columns with names they both have, 
 but separate specifications of the columns can be given by by.x and by.y.

This clearly implies that merge will merge data frames based on more than one column. From the final example given in the documentation:

x <- data.frame(k1=c(NA,NA,3,4,5), k2=c(1,NA,NA,4,5), data=1:5)
y <- data.frame(k1=c(NA,2,NA,4,5), k2=c(NA,NA,3,4,5), data=1:5)
merge(x, y, by=c("k1","k2")) # NA's match

This example was meant to demonstrate the use of incomparables, but it illustrates merging using multiple columns as well. You can also specify separate columns in each of x and y using by.x and by.y.
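When the key columns are named differently in the two data frames, by.x and by.y each take a vector of column names, and the two vectors are paired position by position. A minimal sketch; the data frames left and right and all their column names are made up for illustration:

```r
# Hypothetical data: the same two keys appear under different names in each frame.
left  <- data.frame(id = c(1, 2, 3), yr = c(2001, 2001, 2002), a = c("p", "q", "r"))
right <- data.frame(key = c(1, 2, 4), year = c(2001, 2001, 2002), b = c("x", "y", "z"))

# by.x[1] is matched against by.y[1], and by.x[2] against by.y[2].
m <- merge(left, right, by.x = c("id", "yr"), by.y = c("key", "year"))
print(m)
```

The key columns in the result take their names from the first data frame (id and yr here).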

merging on multiple columns R

Try this:

library(dplyr)

full_join(data1, data2, by = c("id", "spp"))

Output:

  spp trait.1 id trait.2
1   A     1.1  1      NA
2   B     1.1  2       2
3   C     1.1  3      NA
4   C      NA  9       2
5   D      NA  7       2

Alternatively, merge would also work:

merge(data1, data2, by = c("id", "spp"), all = TRUE)
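The answer does not show data1 and data2. The sketch below uses inputs reconstructed from the printed output, so treat them as an assumption; the original data may have differed.

```r
# Hypothetical inputs, reconstructed from the output shown above.
data1 <- data.frame(spp = c("A", "B", "C"), trait.1 = 1.1, id = c(1, 2, 3))
data2 <- data.frame(spp = c("B", "C", "D"), id = c(2, 9, 7), trait.2 = 2)

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA.
res <- merge(data1, data2, by = c("id", "spp"), all = TRUE)
print(res)
```

Only the row with id 2 and spp "B" exists in both inputs, so it is the only row without an NA. merge puts the key columns first and sorts by them, so the row and column order differs from the full_join output.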

Merging two dataframes by multiple columns without losing data

merge has an argument all that specifies whether to keep all rows from both the left and the right side (i.e. all rows from x and all rows from y), including rows that have no match on the other side:

 total <- merge(df1,df2,by=c("id","year"), all=TRUE)
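To see what all = TRUE changes, it can help to compare it with the other settings of merge. A small sketch; df1 and df2 are invented here, since the question's data is not shown:

```r
# Hypothetical data frames sharing the key columns id and year.
df1 <- data.frame(id = c(1, 1, 2), year = c(2000, 2001, 2000), v1 = c(10, 11, 12))
df2 <- data.frame(id = c(1, 2, 3), year = c(2000, 2000, 2000), v2 = c(20, 21, 22))

inner <- merge(df1, df2, by = c("id", "year"))                # matching rows only
left  <- merge(df1, df2, by = c("id", "year"), all.x = TRUE)  # every row of df1
right <- merge(df1, df2, by = c("id", "year"), all.y = TRUE)  # every row of df2
total <- merge(df1, df2, by = c("id", "year"), all = TRUE)    # every row of both

# Row counts of the four variants.
sapply(list(inner = inner, left = left, right = right, total = total), nrow)
```

With this data the default merge keeps 2 rows, all.x and all.y keep 3 each, and all = TRUE keeps 4.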

How can I combine several columns into one variable, tacking each onto the end of the other and grouping by values in an ID variable?

Set the cols and values_to arguments of pivot_longer() correctly. cols defines the columns to take the values from, and values_to gives the name of the new column that receives those values. You were on the right track: by default pivot_longer also returns a column holding the names of the source columns, which is why the code below drops it with select().

library(tidyverse)

df = data.frame(
  a = c("string1","string2"),
  b= c("string11","string12"),
  c = c("string21", "string22"),
  ID = c("1111","2222")
)

df %>% 
  pivot_longer(cols = names(df)[1:3],
               values_to = "newvar") %>% 
  select(newvar, ID)

Output:

# A tibble: 6 x 2
  newvar   ID   
  <chr>    <chr>
1 string1  1111 
2 string11 1111 
3 string21 1111 
4 string2  2222 
5 string12 2222 
6 string22 2222
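For comparison, the same stacking can be sketched in base R without tidyr. This is an alternative to the pivot_longer approach, not part of the original answer; value_cols and long are names introduced only for this example.

```r
df <- data.frame(
  a = c("string1", "string2"),
  b = c("string11", "string12"),
  c = c("string21", "string22"),
  ID = c("1111", "2222"),
  stringsAsFactors = FALSE
)

value_cols <- c("a", "b", "c")

# t() gives one column per original row; c() then reads the matrix column by
# column, so the values belonging to each ID stay together.
long <- data.frame(
  newvar = c(t(df[value_cols])),
  ID = rep(df$ID, each = length(value_cols)),
  stringsAsFactors = FALSE
)
print(long)
```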

merge dataframes based on multiple columns and thresholds

I first turned the city names into character vectors, since (if I understood correctly) you want to include city names that are contained within df2.

df1$city<-as.character(df1$city)
df2$city<-as.character(df2$city)

Then merge them by country:

df = merge(df1, df2, by = "ctry")

> df
          ctry     date.x     city.x number      col     date.y      city.y other_number other_col
1      Austria 2002-07-30     Vienna    100   cherry 2002-07-01      Vienna          101     beige
2      Denmark 1999-06-30 Copenhagen     60 cucumber 1999-06-29  Copenhagen           61    orange
3       France 1999-06-12      Paris     20   banana 1999-06-12  East-Paris           17     green
4      Germany 2003-08-29     Berlin     10    apple 2003-08-29      Berlin           13    yellow
5        Italy 1999-02-24       Rome     40   banana 1999-02-24        Rome           45       red
6       Poland 1999-03-16     Warsaw     70    apple 1999-03-14      Warsaw          780      blue
7       Russia 1999-07-16     Moscow     80    peach 1999-07-17      Moscow           85       red
8  Switzerland 2001-04-17       Bern     50    lemon 2001-04-17      Zurich           51    purple
9      Tunisia 2001-08-29      Tunis     90   cherry 2000-01-29       Tunis           90     black
10          UK 2000-08-29     London     30     pear 2000-08-29 near London         3100      blue

The library stringr will allow you to see if city.x is within city.y here (see last column):

library(stringr)
df$city_keep<-str_detect(df$city.y, fixed(df$city.x)) # TRUE where city.x occurs literally within city.y (one direction only); fixed() stops city names being read as regular expressions
> df
          ctry     date.x     city.x number      col     date.y      city.y other_number other_col city_keep
1      Austria 2002-07-30     Vienna    100   cherry 2002-07-01      Vienna          101     beige      TRUE
2      Denmark 1999-06-30 Copenhagen     60 cucumber 1999-06-29  Copenhagen           61    orange      TRUE
3       France 1999-06-12      Paris     20   banana 1999-06-12  East-Paris           17     green      TRUE
4      Germany 2003-08-29     Berlin     10    apple 2003-08-29      Berlin           13    yellow      TRUE
5        Italy 1999-02-24       Rome     40   banana 1999-02-24        Rome           45       red      TRUE
6       Poland 1999-03-16     Warsaw     70    apple 1999-03-14      Warsaw          780      blue      TRUE
7       Russia 1999-07-16     Moscow     80    peach 1999-07-17      Moscow           85       red      TRUE
8  Switzerland 2001-04-17       Bern     50    lemon 2001-04-17      Zurich           51    purple     FALSE
9      Tunisia 2001-08-29      Tunis     90   cherry 2000-01-29       Tunis           90     black      TRUE
10          UK 2000-08-29     London     30     pear 2000-08-29 near London         3100      blue      TRUE

Then you can get the difference in days between the dates. Subtracting Date objects counts full calendar days, so dates in different years are handled correctly (comparing only the day of the year would ignore the year):

df$dayDiff<-abs(as.numeric(as.Date(df$date.x) - as.Date(df$date.y)))

and the difference in numbers:

df$numDiff<-abs(df$number - df$other_number)

Here is what the resulting dataframe looks like:

> df
          ctry     date.x     city.x number      col     date.y      city.y other_number other_col city_keep dayDiff numDiff
1      Austria 2002-07-30     Vienna    100   cherry 2002-07-01      Vienna          101     beige      TRUE      29       1
2      Denmark 1999-06-30 Copenhagen     60 cucumber 1999-06-29  Copenhagen           61    orange      TRUE       1       1
3       France 1999-06-12      Paris     20   banana 1999-06-12  East-Paris           17     green      TRUE       0       3
4      Germany 2003-08-29     Berlin     10    apple 2003-08-29      Berlin           13    yellow      TRUE       0       3
5        Italy 1999-02-24       Rome     40   banana 1999-02-24        Rome           45       red      TRUE       0       5
6       Poland 1999-03-16     Warsaw     70    apple 1999-03-14      Warsaw          780      blue      TRUE       2     710
7       Russia 1999-07-16     Moscow     80    peach 1999-07-17      Moscow           85       red      TRUE       1       5
8  Switzerland 2001-04-17       Bern     50    lemon 2001-04-17      Zurich           51    purple     FALSE       0       1
9      Tunisia 2001-08-29      Tunis     90   cherry 2000-01-29       Tunis           90     black      TRUE     578       0
10          UK 2000-08-29     London     30     pear 2000-08-29 near London         3100      blue      TRUE       0    3070

Now drop the rows where city.x was not found within city.y, where the day difference is greater than 5, or where the number difference is greater than 3:

df<-df[df$dayDiff<=5 & df$numDiff<=3 & df$city_keep==TRUE,]

> df
     ctry     date.x     city.x number      col     date.y     city.y other_number other_col city_keep dayDiff numDiff
2 Denmark 1999-06-30 Copenhagen     60 cucumber 1999-06-29 Copenhagen           61    orange      TRUE       1       1
3  France 1999-06-12      Paris     20   banana 1999-06-12 East-Paris           17     green      TRUE       0       3
4 Germany 2003-08-29     Berlin     10    apple 2003-08-29     Berlin           13    yellow      TRUE       0       3

What is left are the three rows that you had above (which contained dots in column 1).

Now we can drop the three columns we created, and the date and city from df2:

> df<-subset(df, select=-c(city.y, date.y, city_keep, dayDiff, numDiff))
> df
     ctry     date.x     city.x number      col other_number other_col
2 Denmark 1999-06-30 Copenhagen     60 cucumber           61    orange
3  France 1999-06-12      Paris     20   banana           17     green
4 Germany 2003-08-29     Berlin     10    apple           13    yellow
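The steps above can be condensed into a short base R sketch. The miniature df1 and df2 below are assumptions rebuilt from three rows of the merged output (the original inputs are not shown), grepl with fixed = TRUE stands in for stringr, and the day difference is taken by subtracting Date objects.

```r
# Hypothetical miniature versions of df1 and df2 (three countries only).
df1 <- data.frame(
  ctry = c("France", "Poland", "Switzerland"),
  date = c("1999-06-12", "1999-03-16", "2001-04-17"),
  city = c("Paris", "Warsaw", "Bern"),
  number = c(20, 70, 50),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ctry = c("France", "Poland", "Switzerland"),
  date = c("1999-06-12", "1999-03-14", "2001-04-17"),
  city = c("East-Paris", "Warsaw", "Zurich"),
  other_number = c(17, 780, 51),
  stringsAsFactors = FALSE
)

m <- merge(df1, df2, by = "ctry")

# Literal (non-regex) check that city.x occurs inside city.y, row by row.
city_ok <- mapply(grepl, m$city.x, m$city.y, MoreArgs = list(fixed = TRUE))
# Calendar-day difference between the two dates, at most 5 days.
day_ok <- abs(as.numeric(as.Date(m$date.x) - as.Date(m$date.y))) <= 5
# Difference between the two numbers, at most 3.
num_ok <- abs(m$number - m$other_number) <= 3

res <- m[city_ok & day_ok & num_ok,
         c("ctry", "date.x", "city.x", "number", "other_number")]
print(res)
```

Keeping the three conditions as separate logical vectors avoids adding helper columns to the data frame that have to be dropped again afterwards.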

How can I combine multiple columns into one in an R dataset?

A solution using tidyverse. dat4 is the final output.

library(tidyverse)

dat2 <- dat %>%
  mutate(ID = 1:n())

dat3 <- dat2 %>%
  pivot_longer(a:f, names_to = "value", values_to = "number") %>%
  filter(number == 1) %>%
  select(-number)

dat4 <- dat2 %>%
  left_join(dat3, by = c("age", "gender", "race", "insured", "ID")) %>%
  select(-ID, -c(a:f)) %>%
  replace_na(list(value = "none"))

dat4
#   age gender  race insured value
# 1  13 Female white       0  none
# 2  12 Female white       1  none
# 3  19   Male other       0     f
# 4  19 Female white       0     b
# 5  13 Female white       0     a
# 6  13 Female white       0     b
# 7  13 Female white       0     f

DATA

dat <- read.table(text = "      age gender a     b     c     d     e     f     race  insured 
 1     13 Female 0     0     0     0     0     0     white 0      
 2     12 Female 0     0     0     0     0     0     white 1      
 3     19 Male   0     0     0     0     0     1     other 0      
 4     19 Female 0     1     0     0     0     0     white 0      
 5     13 Female 1     1     0     0     0     1     white 0",
                  header = TRUE)

Merging two dataframes by two columns resulting in blank df

You should first tell R that both liver_date columns are dates. The function as.Date lets you do that.

So let's say we have df1 and df2:

date1<-(c("2007-08-01", "2004-10-05", "2014-03-09"))#Year - Month - Day
date2<-(c("8/1/07", "10/5/04", "3/9/14"))#Month/Day/Year 
x<-(c(1:3))
z<-c(11:13)
w<-c(11:13)

df1<-data.frame(date1, x, z)
str(df1$date1)
  
df1

> df1
       date1 x  z
1 2007-08-01 1 11
2 2004-10-05 2 12
3 2014-03-09 3 13

df2<-data.frame(date2, x, w)
str(df2$date2)

df2 
> df2
    date2 x  w
1  8/1/07 1 11
2 10/5/04 2 12
3  3/9/14 3 13

With as.Date you specify the format in which the dates are stored. For df1 it is Y-M-D:

df1$date1<-as.Date(df1$date1,format="%Y-%m-%d")
str(df1$date1)

For df2 it is m/d/y:

df2$date1<-as.Date(df2$date2,format="%m/%d/%y")
str(df2$date1)

The converted df2$date2 is stored in a new column, df2$date1, so that the key column has the same name in both data frames, which the by argument of merge needs below. In your case you can convert in place, because both columns already have the same name:

df3<-merge(df1,df2, by =c("date1", "x" ) )
df3

>df3
       date1 x  z   date2  w
1 2004-10-05 2 12 10/5/04 12
2 2007-08-01 1 11  8/1/07 11
3 2014-03-09 3 13  3/9/14 13

As you can see, z and w match perfectly, so we know we did it right.

In your data:

df1 = qtpo_liver_dates

df2 = labs_v500

date1, date2 = liver_date

x = patient_num

z = Some column in qtpo_liver_dates

w = Some column in labs_v500
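To illustrate why the merge came back blank, here is a small sketch using the column names liver_date and patient_num from the question. The values, and the columns z and w, are invented for the example.

```r
# Hypothetical frames whose date columns hold the same days in different text formats.
df1 <- data.frame(liver_date = c("2007-08-01", "2004-10-05"), patient_num = 1:2,
                  z = 11:12, stringsAsFactors = FALSE)
df2 <- data.frame(liver_date = c("8/1/07", "10/5/04"), patient_num = 1:2,
                  w = 11:12, stringsAsFactors = FALSE)

# Compared as text, no liver_date value matches, so the merge returns zero rows.
blank <- merge(df1, df2, by = c("liver_date", "patient_num"))
nrow(blank)

# After converting both columns to Date, the keys agree and the rows match.
df1$liver_date <- as.Date(df1$liver_date, format = "%Y-%m-%d")
df2$liver_date <- as.Date(df2$liver_date, format = "%m/%d/%y")
matched <- merge(df1, df2, by = c("liver_date", "patient_num"))
nrow(matched)
```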