Collapsing data frame by selecting one row per group
Maybe duplicated() can help:
R> d[ !duplicated(d$x), ]
  x  y  z
1 1 10 20
3 2 12 18
4 4 13 17
R>
Edit: Shucks, never mind. This picks the first row in each block of repetitions, but you wanted the last. So here is another attempt using plyr:
R> ddply(d, "x", function(z) tail(z,1))
  x  y  z
1 1 11 19
2 2 12 18
3 4 13 17
R>
Here plyr does the hard work of finding unique subsets, looping over them and applying the supplied function -- which simply returns the last set of observations in a block z using tail(z, 1).
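Side note: duplicated() can also pick the last row of each block directly via its fromLast argument. A minimal base-R sketch, assuming d is reconstructed from the output shown above:

# d as implied by the printed output above (an assumption)
d <- data.frame(x = c(1, 1, 2, 4), y = c(10, 11, 12, 13), z = c(20, 19, 18, 17))

# keep the last row within each block of repeated x values
d[!duplicated(d$x, fromLast = TRUE), ]
#   x  y  z
# 2 1 11 19
# 3 2 12 18
# 4 4 13 17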
Collapse specific rows/cases of dataframe
We could create a grouping based on the presence of those IDs using %in%:
library(dplyr)
library(stringr)
df1 %>%
  group_by(grp = case_when(ID %in% c("ID_1", "ID_2") ~ 0L,
                           TRUE ~ row_number()), Species) %>%
  summarise(across(starts_with("Sepal"), mean),
            across(starts_with("Petal"), sum), ID = str_c(ID, collapse = "+"),
            .groups = 'drop') %>%
  select(-grp)
Output:
# A tibble: 4 x 6
Species Sepal.Length Sepal.Width Petal.Length Petal.Width ID
<fct> <dbl> <dbl> <dbl> <dbl> <chr>
1 setosa 5 3.25 2.8 0.4 ID_1+ID_2
2 setosa 4.7 3.2 1.3 0.2 ID_3
3 setosa 4.6 3.1 1.5 0.2 ID_4
4 setosa 5 3.6 1.4 0.2 ID_5
If there is only a single 'Species', then we could also use first:
df1 %>%
  group_by(grp = case_when(ID %in% c("ID_1", "ID_2") ~ 0L,
                           TRUE ~ row_number())) %>%
  summarise(across(starts_with("Sepal"), mean),
            across(starts_with("Petal"), sum), Species = first(Species),
            ID = str_c(ID, collapse = "+"),
            .groups = 'drop') %>%
  select(-grp)
# A tibble: 4 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
<dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 5 3.25 2.8 0.4 setosa ID_1+ID_2
2 4.7 3.2 1.3 0.2 setosa ID_3
3 4.6 3.1 1.5 0.2 setosa ID_4
4 5 3.6 1.4 0.2 setosa ID_5
Or another option is to create a new level by collapsing the IDs of interest with fct_collapse:
library(forcats)
df1 %>%
  group_by(grp = fct_collapse(ID, other = c("ID_1", "ID_2"))) %>%
  summarise(across(starts_with("Sepal"), mean),
            across(starts_with("Petal"), sum), Species = first(Species),
            ID = str_c(ID, collapse = "+"),
            .groups = 'drop') %>%
  select(-grp)
# A tibble: 4 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
<dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 5 3.25 2.8 0.4 setosa ID_1+ID_2
2 4.7 3.2 1.3 0.2 setosa ID_3
3 4.6 3.1 1.5 0.2 setosa ID_4
4 5 3.6 1.4 0.2 setosa ID_5
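For reproducibility: df1 is not shown in the answer, but an input consistent with the output above (an assumption) is the first five rows of iris with an added ID column:

# assumed input, inferred from the printed results
df1 <- head(iris, 5)
df1$ID <- paste0("ID_", 1:5)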
How do I collapse staggered data to one row in R by participant number and timepoint?
Edited: a simpler version that does not depend on custom functions, using na.omit to get only the valid observations (per time/PartNum):
x %>%
  select(-event) %>%
  group_by(time, PartNum) %>%
  summarise_all(na.omit)
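Side note, not part of the original answer: summarise_all() is superseded in recent dplyr releases. A minimal sketch of the same idea with across(), assuming (as in this data) that each group holds exactly one non-NA value per column:

# same idea with across(); assumes exactly one non-NA value per column within each group
x %>%
  select(-event) %>%
  group_by(time, PartNum) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)]), .groups = "drop")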
Previous version: the following will solve your question using dplyr:
x_clean <- x %>%                        # (0)
  select(-event) %>%                    # (1)
  group_by(time, PartNum) %>%           # (2)
  mutate(across(.cols = everything(),   # (3)
                .fns = getmode)) %>%
  distinct()                            # (4)
Each step can be understood as doing the following (read %>% as "and then"):
(0) pick the data set x, and THEN
(1) remove the variable event from the data set, THEN
(2) group by time and PartNum, THEN
(3) mutate across all (grouped) variables and get the mode of each (per time and PartNum); this replaces the NAs with the most common observation per grouping. If you stopped here, you would get repeated rows for each grouping, so finally
(4) keep only the distinct rows of the resulting data set.
The resulting data set is assigned to x_clean.
The whole code for reproduction:
## your data.frame
x <- data.frame(time = rep("week_1", 6), PartNum = c(1,1,1,2,2,2),
                event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"),
                S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA),
                S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA),
                S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2),
                S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"),
                race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))
# helper function that works for numeric and character data
# will retrieve the most common value.
getmode <- function(v, na.rm = TRUE) {
  if (na.rm) v <- na.exclude(v)
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
## solution
library(tidyverse)
x_clean <- x %>%                        # (0)
  select(-event) %>%                    # (1)
  group_by(time, PartNum) %>%           # (2)
  mutate(across(.cols = everything(),   # (3)
                .fns = getmode)) %>%
  distinct()                            # (4)
x_clean
#> # A tibble: 2 x 10
#> # Groups: time, PartNum [2]
#> time PartNum S1Q1 S1Q2 S1date S2Q1 S2Q2 S2date race age
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 week_1 1 3 4 2020-03-02 5 3 2020-03-02 0 60
#> 2 week_1 2 1 2 2020-03-04 3 2 2020-03-04 1 58
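A quick, hypothetical spot-check of the getmode() helper defined above, showing that it returns the most frequent non-missing value:

getmode(c(NA, 3, NA))      # 3   (NAs dropped, single remaining value)
getmode(c("a", "a", "b"))  # "a" (most frequent value wins)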
Collapse text by group in data frame
Simply use aggregate:
aggregate(df$text, list(df$group), paste, collapse="")
## Group.1 x
## 1 a a1a2a3
## 2 b b1b2
## 3 c c1c2c3
Or with plyr
library(plyr)
ddply(df, .(group), summarize, text=paste(text, collapse=""))
## group text
## 1 a a1a2a3
## 2 b b1b2
## 3 c c1c2c3
ddply is faster than aggregate if you have a large dataset.
EDIT: with the suggestion from @SeDur:
aggregate(text ~ group, data = df, FUN = paste, collapse = "")
## group text
## 1 a a1a2a3
## 2 b b1b2
## 3 c c1c2c3
For the same result with the earlier method, you have to do:
aggregate(x=list(text=df$text), by=list(group=df$group), paste, collapse="")
EDIT2: with data.table:
library("data.table")
dt <- as.data.table(df)
dt[, list(text = paste(text, collapse="")), by = group]
## group text
## 1: a a1a2a3
## 2: b b1b2
## 3: c c1c2c3
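For completeness, an equivalent dplyr sketch (not part of the original answer, assuming the same df with group and text columns):

library(dplyr)
df %>%
  group_by(group) %>%
  summarise(text = paste(text, collapse = ""))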
Select the first row by group
You can use duplicated to do this very quickly.
test[!duplicated(test$id),]
Benchmarks, for the speed freaks:
ju <- function() test[!duplicated(test$id), ]
gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
jply <- function() ddply(test, .(id), function(x) head(x, 1))
jdt <- function() {
  testd <- as.data.table(test)
  setkey(testd, id)
  # Initial solution (slow)
  # testd[, lapply(.SD, function(x) head(x, 1)), by = key(testd)]
  # Faster options:
  testd[!duplicated(id)]                  # (1)
  # testd[, .SD[1L], by = key(testd)]     # (2)
  # testd[J(unique(id)), mult = "first"]  # (3)
  # testd[ testd[, .I[1L], by = id] ]     # (4) needs v1.8.3. Allows 2nd, 3rd etc
}
library(plyr)
library(data.table)
library(rbenchmark)
# sample data
set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), gs1(), gs2(), jply(), jdt(),
replications=5, order="relative")[,1:6]
# test replications elapsed relative user.self sys.self
# 1 ju() 5 0.03 1.000 0.03 0.00
# 5 jdt() 5 0.03 1.000 0.03 0.00
# 3 gs2() 5 3.49 116.333 2.87 0.58
# 2 gs1() 5 3.58 119.333 3.00 0.58
# 4 jply() 5 3.69 123.000 3.11 0.51
Let's try that again, but with just the contenders from the first heat and with more data and more replications.
set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), jdt(), order="relative")[,1:6]
# test replications elapsed relative user.self sys.self
# 1 ju() 100 5.48 1.000 4.44 1.00
# 2 jdt() 100 6.92 1.263 5.70 1.15
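Side note, not part of the original benchmark: current dplyr and data.table also offer direct idioms for the first row per group. A minimal sketch, assuming the same test data:

library(dplyr)
library(data.table)

# dplyr: first row within each id
test %>% group_by(id) %>% slice_head(n = 1)

# data.table: first row per id, no key required
unique(as.data.table(test), by = "id")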
From a data.table, randomly select one row per group
The OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract the column holding the row index ($V1) and use that to subset the rows of 'dt'.
dt[dt[ , .I[sample(.N,1)] , by = z]$V1]
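A dplyr counterpart (not from the original answer) uses slice_sample() within groups; a minimal sketch assuming the same dt:

library(dplyr)

# one randomly sampled row per value of z
dt %>% group_by(z) %>% slice_sample(n = 1) %>% ungroup()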
Collapse rows of a dataframe with common values and fill in blanks
You can use groupby + first; first skips over NaN values by default:
collapsed_df = df.groupby("feature_id").first().reset_index()
If the empty spaces are not NaN values, you will probably want to replace them with NaN first:
import numpy as np  # np.nan used below

df = df.replace('', np.nan)
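Since the rest of this page is R-focused, a hedged dplyr analogue of the same collapse-and-fill idea (assuming a data frame df with a feature_id column and NA in the blanks) might look like:

library(dplyr)

# first non-missing value of every column within each feature_id
collapsed_df <- df %>%
  group_by(feature_id) %>%
  summarise(across(everything(), ~ first(na.omit(.x))))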