R Dataframe: Aggregating Strings Within Column, Across Rows, by Group


Here are two ways:

base R

aggregate(
text ~ page + passage + person,
data=df,
FUN=paste, collapse=' '
)
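The answer does not show the data. As a minimal sketch, assuming a hypothetical df whose column names match the call above (the values are invented for illustration), the base R version can be tried like this:

```r
# Hypothetical sample data: column names follow the answer above,
# the values are made up for illustration only.
df <- data.frame(
  page    = c(1, 1, 1, 2),
  passage = c(1, 1, 2, 1),
  person  = c("A", "A", "B", "A"),
  text    = c("Hello", "world", "Goodbye", "Again"),
  stringsAsFactors = FALSE
)

# Extra arguments after FUN (here collapse = " ") are passed on to paste()
res <- aggregate(
  text ~ page + passage + person,
  data = df,
  FUN = paste, collapse = " "
)
print(res)
```

Each distinct combination of page, passage and person becomes one row, with that group's text values joined by a space.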

dplyr

library(dplyr)
df %>%
group_by(page, passage, person) %>%
summarise(text = paste(text, collapse = ' '))

How to aggregate character strings by group in R?

One option is to group by 'DocID', fill the 'ElementA' and 'ElementB' columns with the adjacent non-NA values within each group, and keep the distinct rows:

library(dplyr)
library(tidyr)
df1 %>%
group_by(DocID) %>%
fill(ElementA, ElementB, .direction = "downup") %>%
ungroup %>%
distinct

-output

# A tibble: 3 x 3
# DocID ElementA ElementB
# <int> <chr> <chr>
#1 1 A1 B1
#2 2 A2 B2
#3 3 A3 B3

data

df1 <- structure(list(DocID = c(1L, 1L, 2L, 2L, 3L, 3L), ElementA = c("A1", 
NA, "A2", NA, "A3", NA), ElementB = c(NA, "B1", NA, "B2", NA,
"B3")), class = "data.frame", row.names = c(NA, -6L))


How to list row values in a column based on grouping value in R?

I would suggest the following base R approach:

#Data
df <- structure(list(GeneID = c("am1001", "am1001", "am1002", "am1002",
"am1002"), GO = c(190909L, 600510L, 500050L, 432323L, 100209L
)), class = "data.frame", row.names = c(NA, -5L))

The code:

#Aggregation
aggregate(GO~GeneID,data=df,FUN = function(x) paste0(x,collapse = '; '))

The output:

  GeneID                     GO
1 am1001 190909; 600510
2 am1002 500050; 432323; 100209

Concatenate strings by group with dplyr

You could simply do

data %>% 
group_by(foo) %>%
mutate(bars_by_foo = paste0(bar, collapse = ""))

This works without any helper functions.
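Note that mutate() keeps every input row and repeats the concatenated string within each group, whereas summarise() returns one row per group. A minimal sketch of the difference, assuming a hypothetical data frame that reuses the names foo and bar from the snippet above:

```r
library(dplyr)

# Hypothetical data; `foo` and `bar` mirror the names used in the answer.
data <- data.frame(
  foo = c("x", "x", "y"),
  bar = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# mutate(): all three rows are kept, the collapsed string is repeated per group
kept <- data %>%
  group_by(foo) %>%
  mutate(bars_by_foo = paste0(bar, collapse = "")) %>%
  ungroup()

# summarise(): one row per value of foo
collapsed <- data %>%
  group_by(foo) %>%
  summarise(bars_by_foo = paste0(bar, collapse = ""))

print(kept)
print(collapsed)
```

Which of the two is appropriate depends on whether the other columns of the original rows are still needed afterwards.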

Aggregate Data Frame Containing Strings and Numbers

With dplyr, we can apply several aggregations to blocks of columns by group. The 'IDENTIFICATION' values differ within a group, so, based on the expected output, we can take the first element of that column for each group:

library(dplyr) # >= 1.0.0
df1 %>%
group_by(COUNTY, COMMON_FIELD) %>%
# use across() for more than one column;
# where(is.numeric) selects the numeric columns, which are then summed
summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
IDENTIFICATION = first(IDENTIFICATION))

The OP's original dataset code can be changed to

GAcatalistDupes %>% 
group_by(FIPS, CAT_JOIN) %>%
# summarise numeric columns
summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
# get the first value for the specified columns
# (FIPS is a grouping column: across() cannot select it, and it is kept anyway)
across(c(geography, CONG, SS, SH, Field23), first))

Collapse text by group in data frame

Simply use aggregate:

aggregate(df$text, list(df$group), paste, collapse="")
## Group.1 x
## 1 a a1a2a3
## 2 b b1b2
## 3 c c1c2c3
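The answer does not show how df was built. A minimal sketch, assuming a data frame reconstructed from the output above (the input values are inferred from that output, not taken from the original question):

```r
# Assumed input, inferred from the printed result above
df <- data.frame(
  group = c("a", "a", "a", "b", "b", "c", "c", "c"),
  text  = c("a1", "a2", "a3", "b1", "b2", "c1", "c2", "c3"),
  stringsAsFactors = FALSE
)

# Non-formula interface: the result columns get the default
# names Group.1 (the grouping values) and x (the aggregated values)
res <- aggregate(df$text, list(df$group), paste, collapse = "")
print(res)
```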

Or with plyr

library(plyr)
ddply(df, .(group), summarize, text=paste(text, collapse=""))
## group text
## 1 a a1a2a3
## 2 b b1b2
## 3 c c1c2c3

ddply may be faster than aggregate on a large dataset.

EDIT:
With the suggestion from @SeDur:

aggregate(text ~ group, data = df, FUN = paste, collapse = "")
## group text
## 1 a a1a2a3
## 2 b b1b2
## 3 c c1c2c3

To get the same column names with the earlier (non-formula) method, name the list elements:

aggregate(x=list(text=df$text), by=list(group=df$group), paste, collapse="")

EDIT2: With data.table:

library("data.table")
dt <- as.data.table(df)
dt[, list(text = paste(text, collapse="")), by = group]
## group text
## 1: a a1a2a3
## 2: b b1b2
## 3: c c1c2c3

Collapse / concatenate / aggregate a column to a single comma separated string within each group

Here are some options using toString, a function that concatenates a vector of strings using comma and space to separate components. If you don't want commas, you can use paste() with the collapse argument instead.
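As a quick illustration of that relationship, a minimal sketch on a hypothetical character vector x (not part of the original data):

```r
# Hypothetical vector, for illustration only
x <- c("a", "b", "c")

# toString() joins the elements with a comma followed by a space
with_tostring <- toString(x)

# The same result written explicitly with paste()
with_paste <- paste(x, collapse = ", ")

# A different separator is only possible with paste()
with_semicolon <- paste(x, collapse = "; ")

print(with_tostring)
print(with_semicolon)
```

In the grouped examples that follow, toString(C) can therefore be swapped for paste(C, collapse = "; ") or any other separator.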

data.table

# alternative using data.table
library(data.table)
as.data.table(data)[, toString(C), by = list(A, B)]

aggregate

This uses no packages:

# alternative using aggregate from the stats package in the core of R
aggregate(C ~., data, toString)

sqldf

And here is an alternative that uses the SQL function group_concat via the sqldf package:

library(sqldf)
sqldf("select A, B, group_concat(C) C from data group by A, B", method = "raw")

dplyr

A dplyr alternative:

library(dplyr)
data %>%
group_by(A, B) %>%
summarise(test = toString(C)) %>%
ungroup()

plyr

# plyr
library(plyr)
ddply(data, .(A,B), summarize, C = toString(C))

Search and combine text from the same columns related to a specific variable in R

You can try this

> aggregate(. ~ ID, unique(D), c)
ID VAR
1 1 A, B
2 2 C, D
3 3 E
4 4 F
5 5 G

or

> aggregate(. ~ ID, unique(D), toString)
ID VAR
1 1 A, B
2 2 C, D
3 3 E
4 4 F
5 5 G
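The two calls print alike but do not return the same structure. A minimal sketch, assuming R >= 4.0 (where strings are not converted to factors by default) and a hypothetical data frame D with the column names ID and VAR used above:

```r
# Hypothetical data with one duplicated row, which unique() removes
D <- data.frame(
  ID  = c(1, 1, 1, 2),
  VAR = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

# FUN = c keeps the individual values; with groups of unequal size
# the VAR column of the result is a list
res_c <- aggregate(. ~ ID, unique(D), c)

# FUN = toString collapses each group into a single string,
# so VAR is an ordinary character vector
res_s <- aggregate(. ~ ID, unique(D), toString)

str(res_c)
str(res_s)
```

The toString version is usually easier to write to a file or join against, while the c version keeps the values available individually.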

