Merge/Combine Columns with Same Name But Incomplete Data

Here's an approach that involves melting your data, merging the molten data, and using dcast to get it back to a wide form. I've added comments to help understand what is going on.

## Required package: data.table (>= 1.9.6) provides its own melt() and dcast()
## for data.tables, so reshape2 is not needed
library(data.table)

dcast(
  merge(
    ## melt the first data.frame and set the key as ID and variable
    setkey(melt(as.data.table(df1), id.vars = "ID"), ID, variable),
    ## melt the second data.frame
    melt(as.data.table(df2), id.vars = "ID"),
    ## you'll have 2 value columns...
    all = TRUE)[, value := ifelse(
      ## ... combine them into 1 with ifelse
      is.na(value.x), value.y, value.x)],
  ## This is your reshaping formula
  ID ~ variable, value.var = "value")
#    ID hello world football baseball hockey soccer
# 1:  1     2     3       43        6      7      4
# 2:  2     5     1       24       32      2      5
# 3:  3    10     8        2       23      8     23
# 4:  4     4    17        5       15      5     12
# 5:  5     9     7       12       23      3     43
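
The same "take value.x unless it is NA, otherwise value.y" rule can also be sketched in base R without reshaping. This is only an illustration under assumptions: the two small data frames below are hypothetical stand-ins (the question's df1 and df2 are not reproduced here), and the loop relies on merge()'s default .x/.y suffixes.

```r
# Hypothetical inputs: both data frames share ID, hello and world,
# and each has gaps (NA) that the other one fills.
df1 <- data.frame(ID = 1:3, hello = c(2, NA, 10), world = c(3, 1, NA))
df2 <- data.frame(ID = 1:3, hello = c(NA, 5, NA), world = c(NA, NA, 8))

# A full outer join keeps every ID; shared columns get .x / .y suffixes
merged <- merge(df1, df2, by = "ID", all = TRUE)

# Columns present in both inputs, other than the key
shared <- setdiff(intersect(names(df1), names(df2)), "ID")

for (v in shared) {
  x <- merged[[paste0(v, ".x")]]
  y <- merged[[paste0(v, ".y")]]
  # Prefer the value from df1, fall back to df2 where df1 is NA
  merged[[v]] <- ifelse(is.na(x), y, x)
  # Drop the suffixed helper columns
  merged[[paste0(v, ".x")]] <- NULL
  merged[[paste0(v, ".y")]] <- NULL
}

merged
```

With these inputs, merged ends up with the columns ID, hello and world and no remaining NA values. For many columns or large data, the data.table version above is likely the better fit.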

R - merge/combine columns with same name but some data values equal zero

There are many ways to do that, for example with base R, data.table, or dplyr. The choice depends on the volume of your data: if you work with very large matrices (which is usually the case with natural language processing and bag-of-words representations), you may need to try several approaches and profile them to find the fastest one.
Here is a dplyr version. It is a bit ugly, but it works: I merge the two data frames, then loop over the variables that exist in both, sum each pair (variable.x and variable.y) into a single column, and delete the suffixed columns. Note that I changed your column names slightly for reproducibility, but that shouldn't have any impact. Please let me know if that works for you.

df1 <- read.table(text = 
' cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0')

df2 <- read.table(text =
' Data_number cough coughing_up_blood dehydration dental_abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0')

# Check what variables are common
common <- intersect(names(df1),names(df2))

# Set key IDs for data
df1$ID <- seq(1,nrow(df1))
df2$ID <- seq(1,nrow(df2))

# Merge dataframes
df <- merge(df1, df2,by = "ID")

# Sum and clean common variables left in merged dataframe
library(dplyr)

for (variable in common){
  # Create a summed variable
  df[[variable]] <- df %>% select(starts_with(paste0(variable,"."))) %>% rowSums()
  # Delete columns with .x and .y suffixes
  df <- df %>% select(-one_of(c(paste0(variable,".x"), paste0(variable,".y"))))
}

df
   ID nasal sputum yellow intermitt Data_number coughing_up_blood dehydration dental_abscess cough
1   1     0      0      0         0           1                 0           0              0     1
2   2     0      0      0         0           3                 0           0              0     2
3   3     0      0      0         0           6                 0           0              0     0
4   4     0      0      0         0           8                 0           0              0     0
5   5     0      0      0         0           9                 0           0              0     0
6   6     0      0      0         0          11                 0           0              0     2
7   7     0      0      0         0          12                 0           0              0     0
8   8     0      0      0         0          13                 0           0              0     0
9   9     0      0      0         0          15                 0           0              0     0
10 10     0      0      0         0          16                 0           0              0     1
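
Because the ID above is just the row number, the merge pairs rows by position. Under that same assumption, the "sum the shared columns, keep the rest" step can be sketched in base R without a join. The data frames a and b below are hypothetical toy inputs, not the question's data.

```r
# Hypothetical toy inputs with one shared column (cough)
a <- data.frame(cough = c(1, 0, 1), nasal = c(0, 0, 1))
b <- data.frame(cough = c(0, 1, 1), fever = c(1, 0, 0))

# Rows are assumed to line up one-to-one by position
stopifnot(nrow(a) == nrow(b))

common <- intersect(names(a), names(b))

out <- cbind(
  a[setdiff(names(a), common)],  # columns only in a
  b[setdiff(names(b), common)],  # columns only in b
  a[common] + b[common]          # element-wise sum of the shared columns
)

out
```

If the rows of the two data frames do not correspond by position, this shortcut is wrong and a real key column with merge() is needed instead.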

R: merging columns and the values if they have the same column name

Solution 1

Using split(), lapply(), rowSums(), and do.call()/cbind():

do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
##      B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Solution 2

Replacing the rowSums() call with Reduce()/`+`():

do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
##      B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Solution 3

Replacing the index vector middleman with splitting the data.frame (as an unclassed list) directly:

do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
##      B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Benchmarking

library(microbenchmark);

bgoldst1 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
bgoldst2 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
bgoldst3 <- function(df) do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
sotos <- function(df) sapply(unique(names(df)), function(i)rowSums(df[names(df) == i]));

df <- data.frame(B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),U=c(1L,2L,3L,4L),B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),check.names=F);

ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE

microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: microseconds
##          expr     min       lq     mean   median      uq      max neval
##  bgoldst1(df) 245.473 258.3030 278.9499 272.4155 286.742  641.052   100
##  bgoldst2(df) 156.949 166.3580 184.2206 171.7030 181.539 1042.618   100
##  bgoldst3(df)  82.110  92.5875 100.9138  97.2915 107.128  170.207   100
##     sotos(df) 200.997 211.9030 226.7977 223.6630 235.210  328.010   100

set.seed(1L);
NR <- 1e3L; NC <- 1e3L;
df <- setNames(nm=LETTERS[sample(seq_along(LETTERS),NC,T)],data.frame(replicate(NC,sample(seq_len(NR*3L),NR,T))));

ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE

microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: milliseconds
##          expr       min        lq      mean    median        uq      max neval
##  bgoldst1(df) 11.070218 11.586182 12.745706 12.870209 13.234997 16.15929   100
##  bgoldst2(df)  4.534402  4.680446  6.161428  6.097900  6.425697 44.83254   100
##  bgoldst3(df)  3.430203  3.555505  5.355128  4.919931  5.219930 41.79279   100
##     sotos(df) 19.953848 21.419628 22.713282 21.829533 22.280279 60.86525   100

Joining two incomplete data.tables with the same column names

You can group by ID and get the unique values after omitting NAs, i.e.

library(data.table)

merge(dt1, dt2, all = TRUE)[,
  lapply(.SD, function(i) na.omit(unique(i))),
  by = id][]

#   id v1   v2
#1:  1  w    a
#2:  2  x    b
#3:  3  y    c
#4:  4  z <NA>
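
For comparison, a similar per-id collapse can be sketched in base R. This is an alternative under assumptions, not the data.table method above: d1 and d2 are hypothetical inputs (the question's dt1 and dt2 are not shown here), the rows are stacked with rbind() instead of merged, and each column keeps the first non-missing value per id.

```r
# Hypothetical inputs with complementary gaps
d1 <- data.frame(id = 1:4,
                 v1 = c("w", "x", NA, "z"),
                 v2 = c("a", NA, "c", NA),
                 stringsAsFactors = FALSE)
d2 <- data.frame(id = 1:3,
                 v1 = c(NA, "x", "y"),
                 v2 = c("a", "b", NA),
                 stringsAsFactors = FALSE)

# Stack the rows; every id may now appear more than once
stacked <- rbind(d1, d2)

# First non-missing value; indexing past the end yields NA of the same type
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse to one row per id
out <- aggregate(stacked[c("v1", "v2")],
                 by = list(id = stacked$id),
                 FUN = first_non_na)
out
```

One design difference worth knowing: if an id carries two different non-missing values in the same column, this sketch silently keeps the first one, whereas the unique()-based version returns more than one row for that id.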

Combine two columns with same name pandas

You could do:

df.T.reset_index().groupby('index').agg(','.join).T

index city country house_number ... road state unit
0 greensboro,7611 us 3200 ... northline ave nc ste

How do I merge data sets with some of the same columns without matching the elements but rather adding them to the vector?

The bind_rows() function from the dplyr library is what you need! To 'merge' three datasets into one, while respecting column names, use the command like this:

library(dplyr)
dfAll<-bind_rows(dfA, dfB, dfC)

Edit: Update, directly call all three datasets. Removed intermediate step as first posted.
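
To show what "respecting column names" means here, the following base R sketch imitates the behaviour: rows are stacked, columns are matched by name, and columns missing from one input are filled with NA. The helper bind_rows_base() and the two toy data frames are hypothetical names made up for this illustration; in practice dplyr's bind_rows() does this for you.

```r
# Hypothetical helper: stack data frames, matching columns by name
bind_rows_base <- function(...) {
  dfs <- list(...)
  # Union of all column names, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    # Add any column this data frame lacks, filled with NA
    for (m in setdiff(all_cols, names(d))) d[[m]] <- NA
    # Put columns in a common order
    d[all_cols]
  })
  do.call(rbind, padded)
}

# Toy inputs: x is shared, y and z each exist in only one data frame
dfA <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
dfB <- data.frame(x = 3L, z = TRUE)

dfAll <- bind_rows_base(dfA, dfB)
dfAll
```

The result has three rows and the columns x, y and z, with NA wherever an input had no such column. Nothing is matched or joined on values, which is the difference from merge().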