How to Collapse Many Records into One While Removing NA Values

Here's an option with dplyr:

library(dplyr)

df %>%
  group_by(name) %>%
  summarise_each(funs(first(.[!is.na(.)]))) # or summarise_each(funs(first(na.omit(.))))

#Source: local data frame [3 x 3]
#
# name address favteam
#1 Bill 123 Main St Dodgers
#2 Joe 456 North Ave Pirates
#3 Rob 234 Broad St Mets

And with data.table:

library(data.table)
setDT(df)[, lapply(.SD, function(x) x[!is.na(x)][1L]), by = name]
# name address favteam
#1: Bill 123 Main St Dodgers
#2: Rob 234 Broad St Mets
#3: Joe 456 North Ave Pirates

Or

setDT(df)[, lapply(.SD, function(x) head(na.omit(x), 1L)), by = name]

Edit:

You say in your actual data you have varying numbers of non-NA responses per name. In that case, the following approach may be helpful.

Consider this modified sample data (note the extra Joe rows at the end):

name <- c("Bill", "Rob", "Joe", "Joe", "Joe")
address <- c("123 Main St", "234 Broad St", NA, "456 North Ave", "123 Boulevard")
favteam <- c("Dodgers", "Mets", "Pirates", NA, NA)

df <- data.frame(name = name,
                 address = address,
                 favteam = favteam)

df
# name address favteam
#1 Bill 123 Main St Dodgers
#2 Rob 234 Broad St Mets
#3 Joe <NA> Pirates
#4 Joe 456 North Ave <NA>
#5 Joe 123 Boulevard <NA>

Then, you can use this data.table approach to get the non-NA responses, which can vary in number by name:

setDT(df)[, lapply(.SD, function(x) unique(na.omit(x))), by = name]
# name address favteam
#1: Bill 123 Main St Dodgers
#2: Rob 234 Broad St Mets
#3: Joe 456 North Ave Pirates
#4: Joe 123 Boulevard Pirates
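
A rough dplyr equivalent, assuming dplyr >= 1.1 (reframe() allows a summary to return a varying number of rows per group, and recycles length-1 results):

library(dplyr)

df %>%
  group_by(name) %>%
  reframe(address = unique(na.omit(address)),   # assumes dplyr >= 1.1 for reframe()
          favteam = unique(na.omit(favteam)))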

Collapsing rows where some are all NA, others are disjoint with some NAs

Try

library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE)))

Edit: To account for the case in which one column has all NAs for a certain ID, we need a sum_NA() function that returns NA if all values are NA:

txt <- "ID    Col1    Col2    Col3    Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 NA 30 NA NA
2 NA NA 35 40"
DF <- read.table(text = txt, header = TRUE)

# original code
DF %>%
  group_by(ID) %>%
  summarise_each(funs(sum(., na.rm = TRUE)))

# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2     0    30    35    40

# returns NA if the whole group is NA, otherwise the sum of the non-NA values
sum_NA <- function(x) {
  if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)
}

DF %>%
  group_by(ID) %>%
  summarise_all(funs(sum_NA))

DF %>%
  group_by(ID) %>%
  summarise_if(is.numeric, funs(sum_NA))

# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40
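
In current dplyr, funs() is deprecated; the same helper can be applied with across():

DF %>%
  group_by(ID) %>%
  summarise(across(everything(), sum_NA))  # assumes dplyr >= 1.0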

Collapse rows across group and remove duplicates and NAs

We can use group_by with summarise(across(everything(), ...)) to apply a function to every column. That function in our case is written as a formula (the ~-notation), in which the column is called .x.

As you suggested, we can paste (with collapse = ", ") the rows together. I remove the NA values with .x[!is.na(.x)].

df_in %>%
  group_by(group, subgroup) %>%
  summarise(across(everything(), ~ paste(unique(.x[!is.na(.x)]), collapse = ", "))) %>%
  ungroup()

The only difference with your expected output is that the shape column is now an empty string instead of an NA value:

# A tibble: 1 x 6
  group subgroup color shape emotion    shade
  <dbl> <chr>    <chr> <chr> <chr>      <chr>
1     1 a        red   ""    happy, sad striped

That can be fixed by, for example, a small helper function that returns NA when there are no non-NA values and pastes otherwise.

paste_rows <- function(x) {
  unique_x <- unique(x[!is.na(x)])
  if (length(unique_x) == 0) {
    return(NA_character_)  # a real NA, rather than pasting NA into the string "NA"
  }

  paste(unique_x, collapse = ", ")
}

df_in %>%
  group_by(group, subgroup) %>%
  summarise(across(everything(), paste_rows)) %>%
  ungroup()
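
With that helper the result should match the expected output, with shape back to a true NA:

# A tibble: 1 x 6
  group subgroup color shape emotion    shade
  <dbl> <chr>    <chr> <chr> <chr>      <chr>
1     1 a        red   NA    happy, sad striped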

Pandas: Collapse many rows into a single row by removing NaNs in a MultiIndex dataframe

If there is always exactly one non-missing value per group, use GroupBy.first to return the first non-NaN value per the first level of the MultiIndex:

df = df.groupby(level=0).first()
print (df)
       D
       a             b      c
      G2     G3     G4     G1     G5
x  100.0  200.0  300.0    NaN    NaN
y    NaN    NaN    NaN  200.0  100.0

If there are multiple non-missing values, only the first is returned, and a group where all values are missing comes back as a single row of NaNs:

print (df)
         D
         a             b      c
        G2     G3     G4     G1     G5
x 1  100.0    NaN    NaN    NaN    NaN
  2    8.0  200.0    NaN    NaN    NaN  <- multiple values
  3    NaN    NaN  300.0    NaN    NaN
y 4    NaN    NaN    NaN    NaN    NaN  <- all missing values
  5    NaN    NaN    NaN    NaN    NaN  <- all missing values

df = df.groupby(level=0).first()
print (df)
       D
       a             b      c
      G2     G3     G4     G1     G5
x  100.0  200.0  300.0    NaN    NaN
y    NaN    NaN    NaN    NaN    NaN

EDIT:

If there is no MultiIndex, then a different solution is needed:

df = df.pivot(index=None, columns=['A', 'B', 'C'])

#no MultiIndex
print (df.index)
Int64Index([0, 1, 2, 3, 4], dtype='int64')



if df.index.nlevels == 1:
    df1 = df.apply(lambda x: pd.Series(x.dropna().to_numpy())).iloc[[0]]
else:
    df1 = df.groupby(level=0).first()

print (df1)
     D
A    a         b    c
B   ab   bc   cd   de   ef
C   G1   G1   G2   G3   G2
0  1.0  2.0  3.0  4.0  5.0

R collapse rows by group with non-missing values when values are character

You can do:

df %>%
  group_by(store) %>%
  summarise_all(~ .[nchar(.) > 1])

store item1 item2
<chr> <chr> <chr>
1 A apple pear
2 B milk bread
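
For reference, a hypothetical input of the shape this assumes: missing entries stored as empty strings, and exactly one real value per store and column.

# hypothetical data, reconstructed from the output shown above
df <- data.frame(
  store = c("A", "A", "B", "B"),
  item1 = c("apple", "", "", "milk"),
  item2 = c("", "pear", "bread", ""),
  stringsAsFactors = FALSE
)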

How do I collapse rows to fill NAs in groups with uneven number of rows per column?

We can group by 'year', create a row number within each year, and then spread to wide format:

library(dplyr)
library(tidyr)
test %>%
  group_by(year) %>%
  mutate(rn = row_number()) %>%
  ungroup %>%
  spread(year, name) %>%
  select(-rn)
# A tibble: 5 x 6
# group `1988` `1997` `2000` `2001` `2002`
# <chr> <chr> <chr> <chr> <chr> <chr>
#1 A Steve <NA> <NA> <NA> <NA>
#2 B <NA> <NA> <NA> Mike Jaimie
#3 B <NA> <NA> <NA> Paul <NA>
#4 C <NA> John <NA> <NA> <NA>
#5 D <NA> <NA> Marco <NA> <NA>
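
For reference, an input of the shape this assumes (hypothetical, reconstructed from the output above):

# hypothetical data matching the wide output shown above
test <- data.frame(
  group = c("A", "B", "B", "B", "C", "D"),
  name  = c("Steve", "Mike", "Jaimie", "Paul", "John", "Marco"),
  year  = c(1988, 2001, 2002, 2001, 1997, 2000)
)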

In newer versions of tidyr, it is better to use pivot_wider:

test %>%
  group_by(year) %>%
  mutate(rn = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = year, values_from = name) %>%
  select(-rn)

Collapse rows in R

We get the distinct rows to generate the first expected output:

library(dplyr)
df %>%
  distinct
id1 id2 id3 n1 n2 n3 n4
1 a <NA> a 2 2 0 0
2 b a a 2 1 1 1
3 c <NA> e 3 1 3 2

For the final output, we start from the distinct rows above, group by the coalesced 'id2'/'id1' together with 'id3', and then sum the numeric columns:

df %>%
  distinct %>%
  group_by(id1 = coalesce(id2, id1), id3) %>%
  summarise(across(where(is.numeric), sum), .groups = 'drop')

-output

# A tibble: 2 × 6
id1 id3 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 4 3 1 1
2 c e 3 1 3 2
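
For reference, a hypothetical input consistent with the outputs above (the distinct rows, with the first one duplicated so that distinct() has something to remove):

# hypothetical data, reconstructed from the distinct and summed outputs above
df <- data.frame(
  id1 = c("a", "a", "b", "c"),
  id2 = c(NA, NA, "a", NA),
  id3 = c("a", "a", "a", "e"),
  n1  = c(2, 2, 2, 3),
  n2  = c(2, 2, 1, 1),
  n3  = c(0, 0, 1, 3),
  n4  = c(0, 0, 1, 2)
)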

