Merge rows in a dataframe where the rows are disjoint and contain NAs
You can use aggregate. Assuming that you want to merge rows with identical values in column name:
aggregate(x=DF[c("v1","v2","v3","v4")], by=list(name=DF$name), min, na.rm = TRUE)
name v1 v2 v3 v4
1 Yemen 4 2 3 5
This is like the SQL SELECT name, min(v1) GROUP BY name. The choice of min is arbitrary: you could also use max or mean, since with na.rm = TRUE each of them returns the non-NA value when given one NA and one non-NA value.
(An SQL-like coalesce() function would be a more natural fit if one existed in base R.)
However, you should first check that all non-NA values for a given name are identical. For example, run the aggregate with both min and max and compare the results, or run it with range.
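The consistency check described above can be sketched as follows. The data frame and the names value_cols, lowest, highest and no_conflicts are made up for illustration. Note that a column that is entirely NA within a group would make min and max return Inf and -Inf with a warning, so this sketch assumes every group has at least one non-NA value per column.

```r
# Hypothetical data: two partial rows per name, with no conflicting values.
DF <- data.frame(
  name = c("Yemen", "Yemen", "Oman", "Oman"),
  v1   = c(4, NA, NA, 7),
  v2   = c(NA, 2, 1, 1),
  stringsAsFactors = FALSE
)
value_cols <- c("v1", "v2")

# Smallest and largest non-NA value per name and column.
lowest  <- aggregate(DF[value_cols], by = list(name = DF$name), FUN = min, na.rm = TRUE)
highest <- aggregate(DF[value_cols], by = list(name = DF$name), FUN = max, na.rm = TRUE)

# TRUE only when min and max agree everywhere, i.e. each group holds a
# single distinct non-NA value per column and merging loses nothing.
no_conflicts <- isTRUE(all.equal(lowest, highest))
print(no_conflicts)
```

If no_conflicts is FALSE, comparing lowest and highest column by column shows which name and variable disagree.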
Finally, if you have many more variables than just v1-4, you could use DF[,!(names(DF) %in% c("code","name"))] to define the columns.
combine rows in data frame containing NA to make complete row
I haven't figured out how to put the coalesce_by_column function inside the dplyr pipeline, but this works:
coalesce_by_column <- function(df) {
return(coalesce(df[1], df[2]))
}
df %>%
group_by(A) %>%
summarise_all(coalesce_by_column)
## A B C D E
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 3 2 5
## 2 2 4 5 3 4
Edit: include @Jon Harmon's solution for more than 2 members of a group
# Supply lists by splicing them into dots:
coalesce_by_column <- function(df) {
return(dplyr::coalesce(!!! as.list(df)))
}
df %>%
group_by(A) %>%
summarise_all(coalesce_by_column)
#> # A tibble: 2 x 5
#> A B C D E
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 2 5
#> 2 2 4 5 3 4
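To see why the spliced version handles groups of any size, here is a small sketch with three rows in one group. The data are invented, and across() is used in place of summarise_all(), which assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical data: group 1 has three partial rows, group 2 has two.
df <- tibble(
  A = c(1, 1, 1, 2, 2),
  B = c(NA, 2, NA, 4, NA),
  C = c(3, NA, NA, NA, 5)
)

# as.list() turns the group's column into one element per row, and !!!
# splices those elements into coalesce() as separate arguments, so the
# first non-NA value wins regardless of how many rows the group has.
coalesce_by_column <- function(x) dplyr::coalesce(!!!as.list(x))

merged <- df %>%
  group_by(A) %>%
  summarise(across(everything(), coalesce_by_column))

print(merged)
```

The two-argument version coalesce(df[1], df[2]) would ignore the third row of group 1, which is the case the splice is meant to cover.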
r - merge rows in group while replacing NAs
No need to delete the question, it may be helpful to some users. This summarises each group to the first non-NA occurrence for each column.
library(dplyr)
df_start <- data.frame(
id = c("as", "as", "as", "as", "as", "bs", "bs", "bs", "bs", "bs"),
b = c(NA, NA, NA, NA, "A", NA, NA, 6, NA, NA),
c = c(2, NA, NA, NA, NA, 7, NA, NA, NA, NA),
d = c(NA, 4, NA, NA, NA, NA, 8, NA, NA, NA),
e = c(NA, NA, NA, 3, NA, NA, NA, NA, "B", NA),
f = c(NA, NA, 5, NA, NA, NA, NA, NA, NA, 10))
df_start %>%
group_by(id) %>%
summarise_all(list(~first(na.omit(.))))
Output:
# A tibble: 2 x 6
id b c d e f
<fct> <fct> <dbl> <dbl> <fct> <dbl>
1 as A 2. 4. 3 5.
2 bs 6 7. 8. B 10.
You will, of course, lose data if a column has more than one non-NA value within a group, because only the first one is kept.
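One way to find out beforehand whether anything would be dropped is to count the distinct non-NA values per group and column. This is only a sketch: df_check and conflicts are made-up names, and across() assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical data: group "bs" has two different values in column c.
df_check <- data.frame(
  id = c("as", "as", "bs", "bs"),
  c  = c(2, NA, 7, 9),
  d  = c(NA, 4, 8, NA)
)

# Number of distinct non-NA values per group and column.
# Any count above 1 marks a cell where first(na.omit(.)) discards data.
conflicts <- df_check %>%
  group_by(id) %>%
  summarise(across(everything(), ~ length(unique(na.omit(.x)))))

print(conflicts)
```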
Merge two rows in data.frame
An idea via dplyr:
library(dplyr)
df %>%
group_by(Date, Origin) %>%
summarise_all(funs(trimws(paste(., collapse = ''))))
# A tibble: 4 x 5
# Groups: Date [?]
Date Origin Checkin Checkout Destination
<chr> <chr> <chr> <chr> <chr>
1 03-07-17 A 08:00 09:00 B
2 03-07-17 B 17:00 18:00 A
3 04-07-17 A 08:00 09:00 B
4 04-07-17 B 17:00 18:00 A
DATA
dput(df)
structure(list(Date = c(" 03-07-17 ", " 03-07-17 ", " 03-07-17 ",
" 03-07-17 ", " 04-07-17 ", " 04-07-17 ", " 04-07-17 ", " 04-07-17 "
), Checkin = c(" 08:00 ", " ", " 17:00 ", " ",
" 08:00 ", " ", " 17:00 ", " "), Origin = c(" A ",
" A ", " B ", " B ", " A ", " A ", " B ",
" B "), Checkout = c(" ", " 09:00 ", " ",
" 18:00 ", " ", " 09:00 ", " ", " 18:00 "
), Destination = c(" ", " B ", " ",
" A ", " ", " B ", " ",
" A ")), .Names = c("Date", "Checkin", "Origin", "Checkout",
"Destination"), row.names = c(NA, -8L), class = "data.frame")
Combine rows by group with differing NAs in each row
Is this what you want? It uses zoo + dplyr.
df %>%
group_by(groupid) %>%
mutate_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE)))%>%filter(row_number()==n())
# A tibble: 1 x 5
# Groups: groupid [1]
groupid col1 col2 col3 col4
<int> <int> <chr> <int> <int>
1 1 0 n 2 2
EDIT1
Without the filter, this returns the whole data frame.
df %>%
group_by(groupid) %>%
mutate_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE)))
# A tibble: 2 x 5
# Groups: groupid [1]
groupid col1 col2 col3 col4
<int> <int> <chr> <int> <int>
1 1 0 n NA 2
2 1 0 n 2 2
The filter here just keeps the last row of each group. na.locf carries the previous non-NA value forward, which means the last row in each group is the one you want.
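A minimal illustration of the carry-forward behaviour on a single vector, using made-up values:

```r
library(zoo)

x <- c(NA, 1, NA, 3, NA)

# na.rm = FALSE keeps the leading NA, which has no earlier value to copy.
filled <- na.locf(x, na.rm = FALSE)
print(filled)  # NA  1  1  3  3

# The last element holds the most recent non-NA value.
last_value <- filled[length(filled)]
print(last_value)  # 3
```

Applied per column within a group, the last row therefore collects the most recent non-NA value of every column.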
Also, based on @thelatemail's recommendation, you can do the following, which gives the same answer.
df %>% group_by(groupid) %>% summarise_all(funs(.[!is.na(.)][1]))
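The expression .[!is.na(.)][1] can be read as a small helper in base R. The name first_non_na is hypothetical and is used here only to show the edge case of an all-NA column.

```r
# Drop the NAs, then take the first remaining element.
first_non_na <- function(x) x[!is.na(x)][1]

some_value <- first_non_na(c(NA, "n", NA))
print(some_value)  # "n"

# With nothing left after dropping NAs, indexing position 1 of an empty
# vector yields NA of the same type rather than an error.
all_missing <- first_non_na(c(NA_integer_, NA_integer_))
print(all_missing)  # NA
```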
EDIT2
Assuming you have conflicts and you want to show them all.
df <- read.table(text="groupid col1 col2 col3 col4
1 0 n NA 2
1 1 NA 2 2",
header=TRUE,stringsAsFactors=FALSE)
df
groupid col1 col2 col3 col4
1 1 0 n NA 2
2 1 1 <NA> 2 2
df %>%
group_by(groupid) %>%
summarise_all(funs(toString(unique(na.omit(.)))))#unique for duplicated like col4
groupid col1 col2 col3 col4
<int> <chr> <chr> <chr> <chr>
1 1 0, 1 n 2 2
Merge rows in one data.frame
We could use data.table. We convert the 'data.frame' to 'data.table' (setDT(data)); then, grouped by 'name', we unlist the columns specified in .SDcols and paste them together.
library(data.table)
setDT(data)[, unlist(.SD), name, .SDcols=v1:v4][V1!='', paste(V1, collapse=', '), name]
As the expected output is not shown, it could also be
setDT(data)[, lapply(.SD, function(x) paste(x[x!=''], collapse='')) , name, .SDcols= v1:v4]
Update
Based on the expected output, we convert the 'factor' columns ('v1:v4') to 'character' class, then use the formula method of aggregate and paste the columns grouped by 'name'.
data[3:6] <- lapply(data[3:6], as.character)
aggregate(.~name, data[-1], FUN=function(x) paste(x[x!=''], collapse=', '))
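Since the question's data are not reproduced here, the following sketch uses an invented two-column table to show what the per-column data.table variant does. It builds the table with data.table() directly instead of setDT(), and it assumes empty cells are stored as empty strings.

```r
library(data.table)

# Hypothetical data: empty strings mark the missing cells.
data <- data.table(
  name = c("a", "a", "b"),
  v1   = c("x", "",  "z"),
  v2   = c("",  "y", "")
)

# For each name, drop the empty strings in every column of .SD and paste
# what is left into a single value.
merged <- data[, lapply(.SD, function(x) paste(x[x != ""], collapse = "")),
               by = name, .SDcols = c("v1", "v2")]

print(merged)
```

A group with no non-empty value in a column, such as v2 for name "b", ends up with an empty string in the result.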
Collapsing rows where some are all NA, others are disjoint with some NAs
Try
library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE)))
Edit: To account for the case in which a column has all NAs for a certain ID, we need a sum_NA() function which returns NA if all values are NA.
txt <- "ID Col1 Col2 Col3 Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 NA 30 NA NA
2 NA NA 35 40"
DF <- read.table(text = txt, header = TRUE)
# original code
DF %>%
group_by(ID) %>%
summarise_each(funs(sum(., na.rm = TRUE)))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
ID Col1 Col2 Col3 Col4
<int> <int> <int> <int> <int>
1 1 5 10 15 20
2 2 0 30 35 40
sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}
DF %>%
group_by(ID) %>%
summarise_all(funs(sum_NA))
DF %>%
group_by(ID) %>%
summarise_if(is.numeric, funs(sum_NA))
# A tibble: 2 x 5
ID Col1 Col2 Col3 Col4
<int> <int> <int> <int> <int>
1 1 5 10 15 20
2 2 NA 30 35 40
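The difference between the two results comes down to how sum() treats an all-NA input. A short check on plain vectors (the variable names are made up):

```r
sum_NA <- function(x) if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)

all_na <- c(NA_integer_, NA_integer_)

# Plain sum() with na.rm = TRUE sums an empty set, which gives 0.
plain_sum <- sum(all_na, na.rm = TRUE)
print(plain_sum)  # 0

# x[NA_integer_] returns a single NA of the same type as x, so the
# summarised column keeps its integer type.
guarded_sum <- sum_NA(all_na)
print(guarded_sum)  # NA

# Partially missing input is summed as usual.
partial_sum <- sum_NA(c(5L, NA))
print(partial_sum)  # 5
```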
R- combine rows of a data frame to be unique by 3 columns
After I made sure all column classes are numeric (not factors) by defining the classes of the columns while reading the data in, this worked for me:
CompleteCoxObs<-aggregate(x=CompleteCoxObs[c("stop","Value_EVS current weight kg CAL","Value_EVS hr heart rate NU EE0A","Value_EVS temp celsius CAL 113C")], by=list(VisitIDCode=CompleteCoxObs$VisitIDCode,start=CompleteCoxObs$start), max, na.rm = FALSE);
How to merge 2 columns within the same dataframe in R
You can use coalesce
library(dplyr)
df %>%
mutate(Var1.2 = coalesce(Var1, Var2))
#> Year Var1 Var2 Var1.2
#> 1 2014 123 123 123
#> 2 2014 NA 155 155
#> 3 2015 541 NA 541
#> 4 2015 432 432 432
#> 5 2016 NA 124 124
Created on 2019-04-11 by the reprex package (v0.2.1.9000)