Merge/Combine Columns with Same Name But Incomplete Data

merge/combine columns with same name but incomplete data

Here's an approach that involves melting your data, merging the molten data, and using dcast to get it back to a wide form. I've added comments to help understand what is going on.

## Required packages
library(data.table)
library(reshape2)

dcast.data.table(
  merge(
    ## melt the first data.frame and set the key as ID and variable
    setkey(melt(as.data.table(df1), id.vars = "ID"), ID, variable), 
    ## melt the second data.frame
    melt(as.data.table(df2), id.vars = "ID"), 
    ## you'll have 2 value columns...
    all = TRUE)[, value := ifelse(
      ## ... combine them into 1 with ifelse
      is.na(value.x), value.y, value.x)], 
  ## This is your reshaping formula
  ID ~ variable, value.var = "value")
#    ID hello world football baseball hockey soccer
# 1:  1     2     3       43        6      7      4
# 2:  2     5     1       24       32      2      5
# 3:  3    10     8        2       23      8     23
# 4:  4     4    17        5       15      5     12
# 5:  5     9     7       12       23      3     43
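
The nested call above can be hard to follow, so here is the same pipeline unrolled into separate steps. This is only a sketch: the two small data frames are hypothetical stand-ins (the original df1 and df2 are not shown), and variable.factor = FALSE is an extra choice made here so that the join column is plain character.

```r
library(data.table)

# Hypothetical inputs: both share ID and column "a", each with gaps (NA)
df1 <- data.frame(ID = 1:3, a = c(1, NA, 3), b = c(10, 20, 30))
df2 <- data.frame(ID = 1:3, a = c(NA, 2, NA), c = c(7, 8, 9))

# Step 1: melt each table to long form (one row per ID/variable pair)
long1 <- melt(as.data.table(df1), id.vars = "ID", variable.factor = FALSE)
long2 <- melt(as.data.table(df2), id.vars = "ID", variable.factor = FALSE)

# Step 2: full outer join; the two value columns become value.x and value.y
merged <- merge(long1, long2, by = c("ID", "variable"), all = TRUE)

# Step 3: keep value.x unless it is NA, in which case fall back to value.y
merged[, value := ifelse(is.na(value.x), value.y, value.x)]

# Step 4: reshape back to wide form, one column per variable
wide <- dcast(merged, ID ~ variable, value.var = "value")
wide
```

With these inputs, column a ends up complete because each table fills the other's gaps, while b and c are carried over unchanged.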

R - merge/combine columns with same name but some data values equal zero

There are many ways to do this, e.g. using base R, data.table, or dplyr. The choice depends on the volume of your data: if you work with very large matrices (which is usually the case with natural language processing and bag-of-words representations), you may need to try several approaches and profile them to find the fastest.
I did what you wanted with dplyr. It's a bit ugly, but it works: I merge the two data frames on a row-number ID, then loop over the variables that exist in both, summing each pair (variable.x and variable.y) into a single column and deleting the suffixed columns. Note that I changed your column names slightly for reproducibility, but that shouldn't have any impact. Please let me know if that works for you.

df1 <- read.table(text = 
'     cough nasal sputum yellow intermitt
1      1     0      0      0         0
2      1     0      0      0         0
3      0     0      0      0         0
4      0     0      0      0         0
5      0     0      0      0         0
6      1     0      0      0         0
7      0     0      0      0         0
8      0     0      0      0         0
9      0     0      0      0         0
10     0     0      0      0         0')

df2 <- read.table(text = 
'   Data_number cough coughing_up_blood dehydration dental_abscess
1            1     0                 0           0              0
2            3     1                 0           0              0
3            6     0                 0           0              0
4            8     0                 0           0              0
5            9     0                 0           0              0
6           11     1                 0           0              0
7           12     0                 0           0              0
8           13     0                 0           0              0
9           15     0                 0           0              0
10          16     1                 0           0              0')

# Check what variables are common
common <- intersect(names(df1),names(df2))

# Set key IDs for data
df1$ID <- seq(1,nrow(df1))
df2$ID <- seq(1,nrow(df2))

# Merge dataframes
df <- merge(df1, df2,by = "ID")

# Sum and clean common variables left in merged dataframe
library(dplyr)

for (variable in common){
  # Create a summed variable
  df[[variable]] <- df %>% select(starts_with(paste0(variable,"."))) %>% rowSums()
  # Delete columns with .x and .y suffixes
  df <- df %>% select(-one_of(c(paste0(variable,".x"), paste0(variable,".y"))))
}

df
   ID nasal sputum yellow intermitt Data_number coughing_up_blood dehydration dental_abscess cough
1   1     0      0      0         0           1                 0           0              0     1
2   2     0      0      0         0           3                 0           0              0     2
3   3     0      0      0         0           6                 0           0              0     0
4   4     0      0      0         0           8                 0           0              0     0
5   5     0      0      0         0           9                 0           0              0     0
6   6     0      0      0         0          11                 0           0              0     2
7   7     0      0      0         0          12                 0           0              0     0
8   8     0      0      0         0          13                 0           0              0     0
9   9     0      0      0         0          15                 0           0              0     0
10 10     0      0      0         0          16                 0           0              0     1
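
For comparison, the same merge-then-sum idea can be written in base R alone. This is a minimal sketch on hypothetical miniature inputs (three rows, with an ID column already present), not a replacement for the dplyr version above.

```r
# Hypothetical miniature inputs sharing the "cough" column
df1 <- data.frame(ID = 1:3, cough = c(1, 1, 0), nasal = c(0, 0, 1))
df2 <- data.frame(ID = 1:3, cough = c(0, 1, 0), dehydration = c(0, 0, 1))

# Columns present in both inputs, excluding the join key
common <- setdiff(intersect(names(df1), names(df2)), "ID")

# merge() renames the clashing columns to cough.x and cough.y
df <- merge(df1, df2, by = "ID")

for (variable in common) {
  suffixed <- paste0(variable, c(".x", ".y"))
  # Sum the two suffixed columns into one column with the original name
  df[[variable]] <- rowSums(df[suffixed])
  # Drop the suffixed columns
  df[suffixed] <- NULL
}

df
```

As in the dplyr loop, the summed column is appended at the end, so cough appears last.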

R: merging columns and the values if they have the same column name

Solution 1

Using split(), lapply(), rowSums(), and do.call()/cbind():

do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
##      B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4
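
To see what the one-liner does, it helps to unroll it. The sketch below uses the same small data.frame that the benchmarking section defines and looks at the intermediate index list produced by split().

```r
# Same test data as in the benchmarking section: B and C each appear twice
df <- data.frame(B = 1:4, C = 1:4, U = 1:4, B = 1:4, C = 1:4,
                 check.names = FALSE)

# Group the column positions by column name
idx <- split(seq_len(ncol(df)), names(df))
# idx$B holds positions 1 and 4, idx$C holds 2 and 5, idx$U holds 3

# Sum the columns in each group row-wise
summed <- lapply(idx, function(x) rowSums(df[x]))

# Bind the per-name sums back into one matrix with columns B, C, U
result <- do.call(cbind, summed)
result
```

Indexing by position rather than by name matters here, because df["B"] would only ever return the first column called B.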

Solution 2

Replacing the rowSums() call with Reduce()/`+`():

do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
##      B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Solution 3

Replacing the index vector middleman with splitting the data.frame (as an unclassed list) directly:

do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
##      B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Benchmarking

library(microbenchmark);

bgoldst1 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
bgoldst2 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
bgoldst3 <- function(df) do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
sotos <- function(df) sapply(unique(names(df)), function(i)rowSums(df[names(df) == i]));

df <- data.frame(B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),U=c(1L,2L,3L,4L),B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),check.names=F);

ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE

microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: microseconds
##          expr     min       lq     mean   median      uq      max neval
##  bgoldst1(df) 245.473 258.3030 278.9499 272.4155 286.742  641.052   100
##  bgoldst2(df) 156.949 166.3580 184.2206 171.7030 181.539 1042.618   100
##  bgoldst3(df)  82.110  92.5875 100.9138  97.2915 107.128  170.207   100
##     sotos(df) 200.997 211.9030 226.7977 223.6630 235.210  328.010   100

set.seed(1L);
NR <- 1e3L; NC <- 1e3L;
df <- setNames(nm=LETTERS[sample(seq_along(LETTERS),NC,T)],data.frame(replicate(NC,sample(seq_len(NR*3L),NR,T))));

ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE

microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: milliseconds
##          expr       min        lq      mean    median        uq      max neval
##  bgoldst1(df) 11.070218 11.586182 12.745706 12.870209 13.234997 16.15929   100
##  bgoldst2(df)  4.534402  4.680446  6.161428  6.097900  6.425697 44.83254   100
##  bgoldst3(df)  3.430203  3.555505  5.355128  4.919931  5.219930 41.79279   100
##     sotos(df) 19.953848 21.419628 22.713282 21.829533 22.280279 60.86525   100

Joining two incomplete data.tables with the same column names

You can group by ID and get the unique values after omitting NAs, i.e.

library(data.table)

merge(dt1, dt2, all = TRUE)[, 
        lapply(.SD, function(i)na.omit(unique(i))), 
                            by = id][]

#   id v1   v2
#1:  1  w    a
#2:  2  x    b
#3:  3  y    c
#4:  4  z <NA>
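
Since dt1 and dt2 are not shown, here is a self-contained sketch with hypothetical tables chosen to give the same result. It uses a slightly different reduction, taking the first non-NA value per group, which assumes each id has at most one distinct non-NA value per column and returns NA when a group has none.

```r
library(data.table)

# Hypothetical inputs: each table knows only part of v1 and v2
dt1 <- data.table(id = 1:4,
                  v1 = c("w", NA, "y", "z"),
                  v2 = c("a", "b", NA, NA))
dt2 <- data.table(id = 1:3,
                  v1 = c(NA, "x", "y"),
                  v2 = c("a", NA, "c"))

# Full outer join on all shared columns, so differing rows are all kept
merged <- merge(dt1, dt2, by = c("id", "v1", "v2"), all = TRUE)

# Per id, collapse each column to its first non-NA value (NA if none)
res <- merged[, lapply(.SD, function(i) i[!is.na(i)][1L]), by = id]
res
```

Indexing an empty vector with [1L] yields NA, which is why id 4 keeps an NA in v2 instead of dropping out.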

Combine two columns with same name pandas

You could do:

df.T.reset_index().groupby('index').agg(','.join).T

index             city country house_number  ...           road state     unit
0      greensboro,7611      us         3200  ...  northline ave    nc  ste

How do I merge data sets with some of the same columns without matching the elements but rather adding them to the vector?

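The bind_rows() function from the dplyr package is what you need! It stacks the rows of the datasets on top of each other, matching columns by name and filling columns that are missing from a dataset with NA. To 'merge' three datasets into one while respecting column names, use it like this: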

library(dplyr)
dfAll<-bind_rows(dfA, dfB, dfC)
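
As a quick illustration, here is a sketch with three hypothetical data frames that only partly share columns. The names dfA, dfB and dfC follow the call above, but their contents are made up.

```r
library(dplyr)

# Hypothetical inputs with overlapping but not identical columns
dfA <- data.frame(id = 1:2, x = c("a", "b"))
dfB <- data.frame(id = 3L, y = 10)
dfC <- data.frame(id = 4L, x = "d", y = 20)

# Rows are stacked; columns are matched by name, gaps become NA
dfAll <- bind_rows(dfA, dfB, dfC)
dfAll
```

The result has four rows and the columns id, x and y, with NA wherever an input did not have that column.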

Edit: updated to pass all three datasets to bind_rows() directly, removing the intermediate step from the original post.
