How to Collapse Many Records into One While Removing NA Values

Here's an option with dplyr:

library(dplyr)

df %>%
  group_by(name) %>%
  summarise_each(funs(first(.[!is.na(.)]))) # or summarise_each(funs(first(na.omit(.))))

#Source: local data frame [3 x 3]
#
# name address favteam
#1 Bill 123 Main St Dodgers
#2 Joe 456 North Ave Pirates
#3 Rob 234 Broad St Mets

And with data.table:

library(data.table)
setDT(df)[, lapply(.SD, function(x) x[!is.na(x)][1L]), by = name]
# name address favteam
#1: Bill 123 Main St Dodgers
#2: Rob 234 Broad St Mets
#3: Joe 456 North Ave Pirates

Or

setDT(df)[, lapply(.SD, function(x) head(na.omit(x), 1L)), by = name]

Edit:

You say in your actual data you have varying numbers of non-NA responses per name. In that case, the following approach may be helpful.

Consider this modified sample data (note the extra Joe rows at the end):

name <- c("Bill", "Rob", "Joe", "Joe", "Joe")
address <- c("123 Main St", "234 Broad St", NA, "456 North Ave", "123 Boulevard")
favteam <- c("Dodgers", "Mets", "Pirates", NA, NA)

df <- data.frame(name = name,
                 address = address,
                 favteam = favteam)

df
# name address favteam
#1 Bill 123 Main St Dodgers
#2 Rob 234 Broad St Mets
#3 Joe <NA> Pirates
#4 Joe 456 North Ave <NA>
#5 Joe 123 Boulevard <NA>

Then, you can use this data.table approach to get the non-NA responses, which can vary in number by name:

setDT(df)[, lapply(.SD, function(x) unique(na.omit(x))), by = name]
# name address favteam
#1: Bill 123 Main St Dodgers
#2: Rob 234 Broad St Mets
#3: Joe 456 North Ave Pirates
#4: Joe 123 Boulevard Pirates
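
A rough dplyr equivalent, assuming dplyr >= 1.1 (reframe() allows a summary to return a varying number of rows per group, and recycles length-1 results):

library(dplyr)

df %>%
  group_by(name) %>%
  reframe(address = unique(na.omit(address)),   # assumes dplyr >= 1.1 for reframe()
          favteam = unique(na.omit(favteam)))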

Collapsing rows where some are all NA, others are disjoint with some NAs

Try

library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE)))

Edit: To account for the case in which one column has all NAs for a certain ID, we need a sum_NA() function that returns NA if all values are NA:

txt <- "ID    Col1    Col2    Col3    Col4
1 NA NA NA NA
1 5 10 NA NA
1 NA NA 15 20
2 NA NA NA NA
2 NA 30 NA NA
2 NA NA 35 40"
DF <- read.table(text = txt, header = TRUE)

# original code
DF %>%
  group_by(ID) %>%
  summarise_each(funs(sum(., na.rm = TRUE)))

# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2     0    30    35    40

# returns NA if the whole group is NA, otherwise the sum of the non-NA values
sum_NA <- function(x) {
  if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)
}

DF %>%
  group_by(ID) %>%
  summarise_all(funs(sum_NA))

DF %>%
  group_by(ID) %>%
  summarise_if(is.numeric, funs(sum_NA))

# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40
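
In current dplyr, funs() is deprecated; the same helper can be applied with across():

DF %>%
  group_by(ID) %>%
  summarise(across(everything(), sum_NA))  # assumes dplyr >= 1.0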

Collapse rows across group and remove duplicates and NAs

We can use group_by with summarise(across(everything(), ...)) to apply a function to every column. That function in our case is written as a formula (the ~-notation), in which the column is called .x.

As you suggested, we can paste (with collapse = ", ") the rows together. I remove the NA values with .x[!is.na(.x)].

df_in %>%
  group_by(group, subgroup) %>%
  summarise(across(everything(), ~ paste(unique(.x[!is.na(.x)]), collapse = ", "))) %>%
  ungroup()

The only difference with your expected output is that the shape column is now an empty string instead of an NA value:

# A tibble: 1 x 6
  group subgroup color shape emotion    shade
  <dbl> <chr>    <chr> <chr> <chr>      <chr>
1     1 a        red   ""    happy, sad striped

That can be fixed by, for example, a small helper function that returns NA when there are no non-NA values and pastes otherwise.

paste_rows <- function(x) {
  unique_x <- unique(x[!is.na(x)])
  if (length(unique_x) == 0) {
    return(NA_character_)  # a real NA, rather than pasting NA into the string "NA"
  }

  paste(unique_x, collapse = ", ")
}

df_in %>%
  group_by(group, subgroup) %>%
  summarise(across(everything(), paste_rows)) %>%
  ungroup()
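
With that helper the result should match the expected output, with shape back to a true NA:

# A tibble: 1 x 6
  group subgroup color shape emotion    shade
  <dbl> <chr>    <chr> <chr> <chr>      <chr>
1     1 a        red   NA    happy, sad striped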

Pandas: Collapse many rows into a single row by removing NaNs in a MultiIndex dataframe

If there is always exactly one non-missing value per group, use GroupBy.first to return the first non-NaN value per the first level of the MultiIndex:

df = df.groupby(level=0).first()
print (df)
       D
       a             b      c
      G2     G3     G4     G1     G5
x  100.0  200.0  300.0    NaN    NaN
y    NaN    NaN    NaN  200.0  100.0

If there are multiple non-missing values, only the first is returned, and a group where all values are missing comes back as a single row of NaNs:

print (df)
         D
         a             b      c
        G2     G3     G4     G1     G5
x 1  100.0    NaN    NaN    NaN    NaN
  2    8.0  200.0    NaN    NaN    NaN  <- multiple values
  3    NaN    NaN  300.0    NaN    NaN
y 4    NaN    NaN    NaN    NaN    NaN  <- all missing values
  5    NaN    NaN    NaN    NaN    NaN  <- all missing values

df = df.groupby(level=0).first()
print (df)
       D
       a             b      c
      G2     G3     G4     G1     G5
x  100.0  200.0  300.0    NaN    NaN
y    NaN    NaN    NaN    NaN    NaN

EDIT:

If there is no MultiIndex, then a different solution is needed:

df = df.pivot(index=None, columns=['A', 'B', 'C'])

#no MultiIndex
print (df.index)
Int64Index([0, 1, 2, 3, 4], dtype='int64')



if df.index.nlevels == 1:
    df1 = df.apply(lambda x: pd.Series(x.dropna().to_numpy())).iloc[[0]]
else:
    df1 = df.groupby(level=0).first()

print (df1)
     D
A    a         b    c
B   ab   bc   cd   de   ef
C   G1   G1   G2   G3   G2
0  1.0  2.0  3.0  4.0  5.0

R collapse rows by group with non-missing values when values are character

You can do:

df %>%
  group_by(store) %>%
  summarise_all(~ .[nchar(.) > 1])

store item1 item2
<chr> <chr> <chr>
1 A apple pear
2 B milk bread
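
For reference, a hypothetical input of the shape this assumes: missing entries stored as empty strings, and exactly one real value per store and column.

# hypothetical data, reconstructed from the output shown above
df <- data.frame(
  store = c("A", "A", "B", "B"),
  item1 = c("apple", "", "", "milk"),
  item2 = c("", "pear", "bread", ""),
  stringsAsFactors = FALSE
)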

How do I collapse rows to fill NAs in groups with uneven number of rows per column?

We can group by 'year', create a row number within each year, and then spread to wide format:

library(dplyr)
library(tidyr)
test %>%
  group_by(year) %>%
  mutate(rn = row_number()) %>%
  ungroup %>%
  spread(year, name) %>%
  select(-rn)
# A tibble: 5 x 6
# group `1988` `1997` `2000` `2001` `2002`
# <chr> <chr> <chr> <chr> <chr> <chr>
#1 A Steve <NA> <NA> <NA> <NA>
#2 B <NA> <NA> <NA> Mike Jaimie
#3 B <NA> <NA> <NA> Paul <NA>
#4 C <NA> John <NA> <NA> <NA>
#5 D <NA> <NA> Marco <NA> <NA>
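
For reference, an input of the shape this assumes (hypothetical, reconstructed from the output above):

# hypothetical data matching the wide output shown above
test <- data.frame(
  group = c("A", "B", "B", "B", "C", "D"),
  name  = c("Steve", "Mike", "Jaimie", "Paul", "John", "Marco"),
  year  = c(1988, 2001, 2002, 2001, 1997, 2000)
)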

In newer versions of tidyr, it is better to use pivot_wider:

test %>%
  group_by(year) %>%
  mutate(rn = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = year, values_from = name) %>%
  select(-rn)

Collapse rows in R

We get the distinct rows to generate the first expected output:

library(dplyr)
df %>%
  distinct
id1 id2 id3 n1 n2 n3 n4
1 a <NA> a 2 2 0 0
2 b a a 2 1 1 1
3 c <NA> e 3 1 3 2

For the final output, we start from the distinct rows above, group by the coalesced 'id2'/'id1' together with 'id3', and then sum the numeric columns:

df %>%
  distinct %>%
  group_by(id1 = coalesce(id2, id1), id3) %>%
  summarise(across(where(is.numeric), sum), .groups = 'drop')

-output

# A tibble: 2 × 6
id1 id3 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 4 3 1 1
2 c e 3 1 3 2
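
For reference, a hypothetical input consistent with the outputs above (the distinct rows, with the first one duplicated so that distinct() has something to remove):

# hypothetical data, reconstructed from the distinct and summed outputs above
df <- data.frame(
  id1 = c("a", "a", "b", "c"),
  id2 = c(NA, NA, "a", NA),
  id3 = c("a", "a", "a", "e"),
  n1  = c(2, 2, 2, 3),
  n2  = c(2, 2, 1, 1),
  n3  = c(0, 0, 1, 3),
  n4  = c(0, 0, 1, 2)
)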

