Collapsing rows where some are all NA, others are disjoint with some NAs
Try
library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE)))
Edit: To account for the case in which one column has all NAs for a certain ID, we need a sum_NA() function that returns NA if all values are NA:
txt <- "ID Col1 Col2 Col3 Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 NA 30 NA NA
2 NA NA 35 40"
DF <- read.table(text = txt, header = TRUE)
# original code
DF %>%
group_by(ID) %>%
summarise_each(funs(sum(., na.rm = TRUE)))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
ID Col1 Col2 Col3 Col4
<int> <int> <int> <int> <int>
1 1 5 10 15 20
2 2 0 30 35 40
sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}
DF %>%
group_by(ID) %>%
summarise_all(funs(sum_NA))
DF %>%
group_by(ID) %>%
summarise_if(is.numeric, funs(sum_NA))
# A tibble: 2 x 5
ID Col1 Col2 Col3 Col4
<int> <int> <int> <int> <int>
1 1 5 10 15 20
2 2 NA 30 35 40
Create all possible combinations of non-NA values for each group ID
Grouped by 'ID', fill the other columns, ungroup to remove the group attribute, and keep the distinct rows.
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
fill(everything(), .direction = 'updown') %>%
ungroup %>%
distinct(.keep_all = TRUE)
Or it may also be done as:
DF %>%
group_by(ID) %>%
mutate(across(everything(), ~ replace(., is.na(.),
rep(.[!is.na(.)], length.out = sum(is.na(.))))))
Or, based on the comments:
DF %>%
group_by(ID) %>%
mutate(across(where(~ any(is.na(.))), ~ {
i1 <- is.na(.)
ind <- which(i1)
i2 <- !i1
if(i1[1] == 1) rep(.[i2], each = n()/sum(i2)) else
rep(.[i2], length.out = n())
})) %>%
ungroup %>%
distinct(.keep_all = TRUE)
Output:
# A tibble: 6 x 5
ID Col1 Col2 Col3 Col4
<int> <int> <int> <int> <int>
1 1 6 10 15 20
2 1 5 10 15 20
3 2 17 25 21 34
4 2 13 25 21 34
5 2 17 25 35 40
6 2 13 25 35 40
Collapse Elements in R with NA
If we don't mind losing the order, then maybe try this:
apply(df1, 2, sort, na.last = TRUE)
To keep the order:
sapply(1:ncol(df1),
function(i){
c(
df1[, i][!is.na(df1[, i])],
df1[, i][ is.na(df1[, i])]
)
})
Merge rows in a dataframe where the rows are disjoint and contain NAs
You can use aggregate. Assuming that you want to merge rows with identical values in the column name:
aggregate(x=DF[c("v1","v2","v3","v4")], by=list(name=DF$name), min, na.rm = TRUE)
name v1 v2 v3 v4
1 Yemen 4 2 3 5
This is like the SQL query SELECT name, min(v1), ... GROUP BY name. The choice of min is arbitrary; you could also use max or mean, since all of them return the single non-NA value from an NA and a non-NA value when na.rm = TRUE.
(An SQL-like coalesce() function would be a better fit; it does not exist in base R, though dplyr provides one.)
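In fact, dplyr's coalesce() does exactly this elementwise across vectors; a minimal sketch:

```r
library(dplyr)

# coalesce() picks the first non-NA value at each position
x <- c(NA, 2, NA)
y <- c(1, NA, NA)
z <- c(9, 9, 3)
coalesce(x, y, z)
# [1] 1 2 3
```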
However, you should first check whether all non-NA values for a given name are identical. For example, run the aggregate both with min and with max and compare the results, or run it with range.
Finally, if you have many more variables than just v1-v4, you can use DF[, !(names(DF) %in% c("code", "name"))] to select the columns.
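As a sketch of that column selection (assuming hypothetical identifier columns code and name alongside the value columns), the names can be computed first and reused:

```r
# Hypothetical data with identifier columns and value columns
DF <- data.frame(code = c("Y", "Y"), name = c("Yemen", "Yemen"),
                 v1 = c(4, NA), v2 = c(NA, 2))

# Keep every column except the identifier ones
value_cols <- setdiff(names(DF), c("code", "name"))
aggregate(x = DF[value_cols], by = list(name = DF$name), FUN = min, na.rm = TRUE)
#    name v1 v2
# 1 Yemen  4  2
```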
r - merge rows in group while replacing NAs
No need to delete the question; it may be helpful to some users. This summarises each group to the first non-NA occurrence for each column.
library(dplyr)
df_start <- data.frame(
id = c("as", "as", "as", "as", "as", "bs", "bs", "bs", "bs", "bs"),
b = c(NA, NA, NA, NA, "A", NA, NA, 6, NA, NA),
c = c(2, NA, NA, NA, NA, 7, NA, NA, NA, NA),
d = c(NA, 4, NA, NA, NA, NA, 8, NA, NA, NA),
e = c(NA, NA, NA, 3, NA, NA, NA, NA, "B", NA),
f = c(NA, NA, 5, NA, NA, NA, NA, NA, NA, 10))
df_start %>%
group_by(id) %>%
summarise_all(list(~first(na.omit(.))))
Output:
# A tibble: 2 x 6
id b c d e f
<fct> <fct> <dbl> <dbl> <fct> <dbl>
1 as A 2. 4. 3 5.
2 bs 6 7. 8. B 10.
You will, of course, lose some data if there is more than one occurrence of a value within a group for a given column.
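If you would rather keep every distinct value instead of only the first one, one option (a sketch, not part of the original answer) is to collapse them into a comma-separated string:

```r
library(dplyr)

# Minimal example: column c has two non-NA values in group "a"
d <- data.frame(id = c("a", "a"), b = c(NA, "x"), c = c(1, 2))

d %>%
  group_by(id) %>%
  summarise(across(everything(), ~ toString(unique(na.omit(.x)))))
# id    b     c
# a     x     1, 2
```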
Combine rows by group with differing NAs in each row
Is this what you want? zoo + dplyr:
library(zoo)
df %>%
  group_by(groupid) %>%
  mutate_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>%
  filter(row_number() == n())
# A tibble: 1 x 5
# Groups: groupid [1]
groupid col1 col2 col3 col4
<int> <int> <chr> <int> <int>
1 1 0 n 2 2
EDIT1
Without the filter, it gives back the whole dataframe.
df %>%
group_by(groupid) %>%
mutate_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE)))
# A tibble: 2 x 5
# Groups: groupid [1]
groupid col1 col2 col3 col4
<int> <int> <chr> <int> <int>
1 1 0 n NA 2
2 1 0 n 2 2
filter here just slices out the last row: na.locf carries forward the previous non-NA value, which means the last row in each group is the one you want.
Also, based on @thelatemail's recommendation, you can do the following, which gives back the same answer.
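To illustrate what na.locf() (last observation carried forward) does on its own:

```r
library(zoo)

x <- c(NA, 1, NA, 2)
# na.rm = FALSE keeps the leading NA instead of dropping it
na.locf(x, na.rm = FALSE)
# [1] NA  1  1  2
```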
df %>% group_by(groupid) %>% summarise_all(funs(.[!is.na(.)][1]))
EDIT2
Assuming you have conflicts and you want to show them all:
df <- read.table(text="groupid col1 col2 col3 col4
1 0 n NA 2
1 1 NA 2 2",
header=TRUE,stringsAsFactors=FALSE)
df
  groupid col1 col2 col3 col4
1       1    0    n   NA    2
2       1    1 <NA>    2    2
df %>%
group_by(groupid) %>%
summarise_all(funs(toString(unique(na.omit(.))))) # unique() handles duplicated values like col4
groupid col1 col2 col3 col4
<int> <chr> <chr> <chr> <chr>
1 1 0, 1 n 2 2
combine rows in data frame containing NA to make complete row
I haven't figured out how to put the coalesce_by_column
function inside the dplyr
pipeline, but this works:
coalesce_by_column <- function(df) {
return(coalesce(df[1], df[2]))
}
df %>%
group_by(A) %>%
summarise_all(coalesce_by_column)
## A B C D E
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 3 2 5
## 2 2 4 5 3 4
Edit: include @Jon Harmon's solution for more than 2 members of a group
# Supply lists by splicing them into dots:
coalesce_by_column <- function(df) {
return(dplyr::coalesce(!!! as.list(df)))
}
df %>%
group_by(A) %>%
summarise_all(coalesce_by_column)
#> # A tibble: 2 x 5
#> A B C D E
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 2 5
#> 2 2 4 5 3 4
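To keep everything inline in the pipeline without a named helper (a sketch; do.call() is used here to splice each column into coalesce()):

```r
library(dplyr)

# Hypothetical data matching the question: rows within a group are disjoint
df <- data.frame(A = c(1, 1, 2, 2),
                 B = c(2, NA, 4, NA),
                 C = c(NA, 3, NA, 5))

df %>%
  group_by(A) %>%
  summarise(across(everything(), ~ do.call(coalesce, as.list(.x))))
```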
How to collapse many records into one while removing NA values
Here's an option with dplyr:
library(dplyr)
df %>%
group_by(name) %>%
summarise_each(funs(first(.[!is.na(.)]))) # or summarise_each(funs(first(na.omit(.))))
#Source: local data frame [3 x 3]
#
# name address favteam
#1 Bill 123 Main St Dodgers
#2 Joe 456 North Ave Pirates
#3 Rob 234 Broad St Mets
And with data.table:
library(data.table)
setDT(df)[, lapply(.SD, function(x) x[!is.na(x)][1L]), by = name]
# name address favteam
#1: Bill 123 Main St Dodgers
#2: Rob 234 Broad St Mets
#3: Joe 456 North Ave Pirates
Or
setDT(df)[, lapply(.SD, function(x) head(na.omit(x), 1L)), by = name]
Edit:
You say in your actual data you have varying numbers of non-NA responses per name. In that case, the following approach may be helpful.
Consider this modified sample data (look at last row):
name <- c("Bill", "Rob", "Joe", "Joe", "Joe")
address <- c("123 Main St", "234 Broad St", NA, "456 North Ave", "123 Boulevard")
favteam <- c("Dodgers", "Mets", "Pirates", NA, NA)
df <- data.frame(name = name,
address = address,
favteam = favteam)
df
# name address favteam
#1 Bill 123 Main St Dodgers
#2 Rob 234 Broad St Mets
#3 Joe <NA> Pirates
#4 Joe 456 North Ave <NA>
#5 Joe 123 Boulevard <NA>
Then, you can use this data.table approach to get the non-NA responses that can be varying in number by name:
setDT(df)[, lapply(.SD, function(x) unique(na.omit(x))), by = name]
# name address favteam
#1: Bill 123 Main St Dodgers
#2: Rob 234 Broad St Mets
#3: Joe 456 North Ave Pirates
#4: Joe 123 Boulevard Pirates
A better way to collapse rows with numerical value and NA
The issue is when a group contains only NAs ("no non-missing arguments" warning). Here are workarounds using dplyr and data.table:
abc %>%
group_by(ID) %>%
summarize_all(~ if (length(na.omit(.))) max(., na.rm = TRUE) else NA_real_ ) %>%
ungroup()
library(data.table)
setDT(abc)
abc[,
lapply(.SD, function(.) if (length(na.omit(.))) max(., na.rm = TRUE) else NA_real_),
by = ID]