Collapsing Data Frame by Selecting One Row Per Group

Collapsing data frame by selecting one row per group

Maybe duplicated() can help:

R> d[!duplicated(d$x), ]
  x  y  z
1 1 10 20
3 2 12 18
4 4 13 17
R>

Edit: Shucks, never mind. This picks the first row in each block of repetitions, but you wanted the last. So here is another attempt using plyr:

R> library(plyr)
R> ddply(d, "x", function(z) tail(z, 1))
  x  y  z
1 1 11 19
2 2 12 18
3 4 13 17
R>

Here plyr does the hard work of finding the unique subsets, looping over them, and applying the supplied function -- which simply returns the last row of each block z via tail(z, 1).
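
For what it's worth, base R can also keep the last row per block directly: duplicated() has a fromLast argument that scans from the other end (a small addition, not part of the original answer):

# keep the last row of each block of repeated x values
d[!duplicated(d$x, fromLast = TRUE), ]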

Collapse specific rows/cases of dataframe

We could create a grouping variable based on the presence of those IDs using %in%:
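
The examples below appear to be based on the first five rows of iris with an ID column added; a sketch of that setup (an assumption inferred from the output shown) is:

# assumed sample data: first five rows of iris plus an ID column
df1 <- head(iris, 5)
df1$ID <- paste0("ID_", 1:5)   # ID_1 ... ID_5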

library(dplyr)
library(stringr)
df1 %>%
  group_by(grp = case_when(ID %in% c("ID_1", "ID_2") ~ 0L,
                           TRUE ~ row_number()), Species) %>%
  summarise(across(starts_with("Sepal"), mean),
            across(starts_with("Petal"), sum),
            ID = str_c(ID, collapse = "+"),
            .groups = 'drop') %>%
  select(-grp)

Output:

# A tibble: 4 x 6
  Species Sepal.Length Sepal.Width Petal.Length Petal.Width ID
  <fct>          <dbl>       <dbl>        <dbl>       <dbl> <chr>
1 setosa           5          3.25          2.8         0.4 ID_1+ID_2
2 setosa           4.7        3.2           1.3         0.2 ID_3
3 setosa           4.6        3.1           1.5         0.2 ID_4
4 setosa           5          3.6           1.4         0.2 ID_5

If there is only a single 'Species', then we could also use first():

df1 %>%
  group_by(grp = case_when(ID %in% c("ID_1", "ID_2") ~ 0L,
                           TRUE ~ row_number())) %>%
  summarise(across(starts_with("Sepal"), mean),
            across(starts_with("Petal"), sum),
            Species = first(Species),
            ID = str_c(ID, collapse = "+"),
            .groups = 'drop') %>%
  select(-grp)
# A tibble: 4 x 6
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
         <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>
1          5          3.25          2.8         0.4 setosa  ID_1+ID_2
2          4.7        3.2           1.3         0.2 setosa  ID_3
3          4.6        3.1           1.5         0.2 setosa  ID_4
4          5          3.6           1.4         0.2 setosa  ID_5

Or another option is to create a new level by collapsing the IDs of interest with fct_collapse:

library(forcats)
df1 %>%
  group_by(grp = fct_collapse(ID, other = c("ID_1", "ID_2"))) %>%
  summarise(across(starts_with("Sepal"), mean),
            across(starts_with("Petal"), sum),
            Species = first(Species),
            ID = str_c(ID, collapse = "+"),
            .groups = 'drop') %>%
  select(-grp)
# A tibble: 4 x 6
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
         <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>
1          5          3.25          2.8         0.4 setosa  ID_1+ID_2
2          4.7        3.2           1.3         0.2 setosa  ID_3
3          4.6        3.1           1.5         0.2 setosa  ID_4
4          5          3.6           1.4         0.2 setosa  ID_5

How do I collapse staggered data to one row in R by participant number and timepoint?

Edited: a simpler version that does not depend on custom functions

Using na.omit to keep only the valid observations (per time/PartNum):

x %>%
  select(-event) %>%
  group_by(time, PartNum) %>%
  summarise_all(na.omit)   # assumes exactly one non-NA value per column per group
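
summarise_all() is superseded in recent dplyr; a sketch of the same idea with across(), assuming at most one valid value per column per group, would be:

x %>%
  select(-event) %>%
  group_by(time, PartNum) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")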

Previous version:

The following will solve your question using dplyr:

x_clean <- x %>%                        # (0)
  select(-event) %>%                    # (1)
  group_by(time, PartNum) %>%           # (2)
  mutate(across(.cols = everything(),   # (3)
                .fns = getmode)) %>%
  distinct()                            # (4)

Each step can be understood as doing the following:

  0. pick data set x, and THEN (read %>% as "and then"),
  1. remove the variable event from the data set, THEN
  2. group by time and PartNum, THEN
  3. mutate across all (grouped) variables and take the mode of each (per time and
     PartNum); this replaces the NAs with the most common observation per grouping.
     If you stopped here, you would get repeated rows for each grouping, so finally
  4. keep only the distinct rows of the resulting data set.

The result is assigned to x_clean.

The whole code for reproduction:

## your data.frame
x <- data.frame(time = rep("week_1", 6), PartNum = c(1, 1, 1, 2, 2, 2),
                event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"),
                S1Q1 = c(NA, 3, NA, NA, 1, NA), S1Q2 = c(NA, 4, NA, NA, 2, NA),
                S1date = c(NA, "2020-03-02", NA, NA, "2020-03-04", NA),
                S2Q1 = c(NA, NA, 5, NA, NA, 3), S2Q2 = c(NA, NA, 3, NA, NA, 2),
                S2date = c(NA, NA, "2020-03-02", NA, NA, "2020-03-04"),
                race = c(0, NA, NA, 1, NA, NA), age = c(60, NA, NA, 58, NA, NA))


# helper function that works for numeric and character data;
# retrieves the most common value.
getmode <- function(v, na.rm = TRUE) {
  if (na.rm) v <- na.exclude(v)
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
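
# quick sanity check of the helper (added example, not part of the original answer):
getmode(c(NA, 3, NA))      # returns 3
getmode(c("a", "b", "b"))  # returns "b"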

## solution
library(tidyverse)

x_clean <- x %>%                        # (0)
  select(-event) %>%                    # (1)
  group_by(time, PartNum) %>%           # (2)
  mutate(across(.cols = everything(),   # (3)
                .fns = getmode)) %>%
  distinct()                            # (4)
x_clean
#> # A tibble: 2 x 10
#> # Groups:   time, PartNum [2]
#>   time   PartNum  S1Q1  S1Q2 S1date      S2Q1  S2Q2 S2date      race   age
#>   <chr>    <dbl> <dbl> <dbl> <chr>      <dbl> <dbl> <chr>      <dbl> <dbl>
#> 1 week_1       1     3     4 2020-03-02     5     3 2020-03-02     0    60
#> 2 week_1       2     1     2 2020-03-04     3     2 2020-03-04     1    58

Collapse text by group in data frame
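
The question's df isn't shown; a sketch that reproduces the output below (an assumption) is:

# assumed sample data, inferred from the output shown below
df <- data.frame(group = rep(c("a", "b", "c"), c(3, 2, 3)),
                 text  = c("a1", "a2", "a3", "b1", "b2", "c1", "c2", "c3"))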

Simply use aggregate:

aggregate(df$text, list(df$group), paste, collapse = "")
##   Group.1      x
## 1       a a1a2a3
## 2       b   b1b2
## 3       c c1c2c3

Or with plyr:

library(plyr)
ddply(df, .(group), summarize, text = paste(text, collapse = ""))
##   group   text
## 1     a a1a2a3
## 2     b   b1b2
## 3     c c1c2c3

ddply is faster than aggregate if you have a large dataset.

EDIT: with the suggestion from @SeDur:

aggregate(text ~ group, data = df, FUN = paste, collapse = "")
##   group   text
## 1     a a1a2a3
## 2     b   b1b2
## 3     c c1c2c3

To get the same result with the earlier method you would have to do:

aggregate(x = list(text = df$text), by = list(group = df$group), paste, collapse = "")

EDIT 2: with data.table:

library("data.table")
dt <- as.data.table(df)
dt[, list(text = paste(text, collapse="")), by = group]
## group text
## 1: a a1a2a3
## 2: b b1b2
## 3: c c1c2c3
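
For completeness, a dplyr equivalent (not part of the original answer) would be:

library(dplyr)
df %>%
  group_by(group) %>%
  summarise(text = paste(text, collapse = ""))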

Select the first row by group

You can use duplicated to do this very quickly.

test[!duplicated(test$id),]

Benchmarks, for the speed freaks:

ju   <- function() test[!duplicated(test$id), ]
gs1  <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
gs2  <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
jply <- function() ddply(test, .(id), function(x) head(x, 1))
jdt  <- function() {
  testd <- as.data.table(test)
  setkey(testd, id)
  # Initial solution (slow):
  # testd[, lapply(.SD, function(x) head(x, 1)), by = key(testd)]
  # Faster options:
  testd[!duplicated(id)]                 # (1)
  # testd[, .SD[1L], by = key(testd)]    # (2)
  # testd[J(unique(id)), mult = "first"] # (3)
  # testd[ testd[, .I[1L], by = id] ]    # (4) needs v1.8.3; allows 2nd, 3rd, etc.
}

library(plyr)
library(data.table)
library(rbenchmark)

# sample data
set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]

benchmark(ju(), gs1(), gs2(), jply(), jdt(),
          replications = 5, order = "relative")[, 1:6]
#     test replications elapsed relative user.self sys.self
# 1   ju()            5    0.03    1.000      0.03     0.00
# 5  jdt()            5    0.03    1.000      0.03     0.00
# 3  gs2()            5    3.49  116.333      2.87     0.58
# 2  gs1()            5    3.58  119.333      3.00     0.58
# 4 jply()            5    3.69  123.000      3.11     0.51

Let's try that again, but with just the contenders from the first heat and with more data and more replications.

set.seed(21)
test <- data.frame(id = sample(1e4, 1e6, TRUE), string = sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), jdt(), order = "relative")[, 1:6]
#    test replications elapsed relative user.self sys.self
# 1  ju()          100    5.48    1.000      4.44     1.00
# 2 jdt()          100    6.92    1.263      5.70     1.15
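
In modern dplyr, an equivalent (added for reference, not benchmarked here) would be:

library(dplyr)
test %>%
  group_by(id) %>%
  slice(1) %>%
  ungroup()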

From data.table, randomly select one row per group

The OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample one row from the sequence of rows per group, get the row index (.I), extract the column holding the row index ($V1), and use that to subset the rows of 'dt'.

dt[dt[, .I[sample(.N, 1)], by = z]$V1]
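
A minimal runnable sketch, assuming a small two-column dt (the data and seed are made up for illustration):

library(data.table)
set.seed(1)
dt <- data.table(z = rep(1:3, each = 4), val = rnorm(12))
dt[dt[, .I[sample(.N, 1)], by = z]$V1]   # one randomly chosen row per z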

Collapse rows of a dataframe with common values and fill in blanks

You can use groupby + first; first skips over NaN values by default:

collapsed_df = df.groupby("feature_id").first().reset_index()

If the empty cells are not NaN values, you will probably want to replace them with NaN first:

import numpy as np

df = df.replace('', np.nan)

