Aggregate a Data Frame Based on Unordered Pairs of Columns

Aggregate a data frame based on unordered three columns

Use apply to sort the first three columns rowwise and then use aggregate to obtain the sum of count

aggregate(d$count, as.data.frame(t(apply(d[1:3], 1, sort))), sum)
# V1 V2 V3 x
#1 A B C 3
#2 A D N 4

OR

d %>%
rowwise() %>%
mutate(grp = toString(sort(c(X1, X2, X3)))) %>%
ungroup() %>%
group_by(grp) %>%
summarize(count = sum(count)) %>%
separate(grp, into = c("x1", "x2", "x3"))
## A tibble: 2 x 4
# x1 x2 x3 count
# <chr> <chr> <chr> <int>
#1 A B C 3
#2 A D N 4

OR

f = as.character(sort(unique(unlist(d[1:3]))))
d %>%
rowwise() %>%
mutate(grp = toString(f[f %in% c(X1, X2, X3)])) %>%
ungroup() %>%
group_by(grp) %>%
summarize(count = sum(count)) %>%
separate(grp, into = c("x1", "x2", "x3"))
## A tibble: 2 x 4
# x1 x2 x3 count
# <chr> <chr> <chr> <int>
#1 A B C 3
#2 A D N 4

Aggregate a data frame based on unordered pairs of columns

One way is to create extra columns with pmax and pmin of id1 and id2as follows. I'll use data.table solution here.

require(data.table)
DT <- data.table(DF)
# Following mnel's suggestion, g1, g2 could be used directly in by
# and it could be even shortened by using `id1` and id2` as their names
DT.OUT <- DT[, list(size=sum(size)),
by=list(id1 = pmin(id1, id2), id2 = pmax(id1, id2))]
# id1 id2 size
# 1: 5400 5505 18
# 2: 5033 5458 1
# 3: 5452 2873 24
# 4: 5452 5213 2
# 5: 5452 4242 26
# 6: 4823 4823 4

Summarize columns in R Data Frame independent of the order, (df$A,df$B) = (df$B,df$A)

How about this?

library(tidyverse)
df %>%
mutate_if(is.factor, as.character) %>%
rowwise() %>%
mutate(grp = paste0(sort(c(destiny, origin)), collapse = "_")) %>%
ungroup() %>%
group_by(grp) %>%
summarise(Count = sum(Count)) %>%
separate(grp, into = c("destiny", "origin"))
# # A tibble: 4 x 3
# destiny origin Count
# <chr> <chr> <int>
#1 KJFK SAEZ 4683
#2 KJFK SBBR 6
#3 KJFK SBGL 4989
#4 KJFK SBGR 12911

Note that since you don't care about the order of destiny, origin, so here we order them alphabetically. So in the example you give above, KJFK -> SBBR and SBBR -> KJFK will become destiny = KJFK, origin = SBBR.


Sample data

df <- read.table(text =
" destiny origin Count
1 KJFK SBBR 4
2 KJFK SAEZ 4683
3 SBGL KJFK 2
4 SBBR KJFK 2
5 KJFK SBGL 4987
6 KJFK SBGR 12911", header =T)

Combining columns and count combinations (pairs)

If we always write the earliest letter in the alphabet first in the pair := assignment, the code will produce the desired result. We'll use ifelse() to decide whether to write V1 before V2 as follows.

library(data.table)
set.seed(126)
dt <- data.table(V1 = sample(LETTERS[1:4], 30, replace = T),
V2 = sample(LETTERS[1:4], 30, replace = T))


# adjusted version where first letter always < second letter

#Exclude rows with the same name
dt <- dt[V1 != V2]

#Create pairs by combining V1 and V2
dt[, pair := ifelse(V1 < V2,paste(V1, V2, sep="_"), paste(V2, V1, sep = "_"))]

#Count the pairs
dt[, .N, by=.(pair)]

...and the output:

> #Count the pairs 
> dt[, .N, by=.(pair)]
pair N
1: A_C 3
2: B_C 9
3: C_D 5
4: A_B 4
5: B_D 3
6: A_D 1
>

Take Sum of a Variable if Combination of Values in Two Other Columns are Unique

We could either use the base R method by first sorting the first two columns by row. We use apply with MARGIN=1 to do that, transpose the output, convert to 'data.frame' to create 'df1', use the formula method of aggregate to get the sum of 'num_email' grouped by the first two columns of the transformed dataset.

df1 <- data.frame(t(apply(df[1:2], 1, sort)), df[3])
aggregate(num_email~., df1, FUN=sum)

# X1 X2 num_email
# 1 Beth Mable 2
# 2 Beth Susan 3
# 3 Mable Susan 1

Or using data.table, we convert the first two columns to character class, unname to change the column names of the first two columns to the default 'V1', 'V2', and convert to 'data.table'. Using the lexicographic ordering of character columns, we create the logical index for i (V1 > V2), assign (:=) the columns that meet the condition by reversing the order of columns (.(V2, V1)), and get the sum of 'num_email' grouped by 'V1', 'V2'.

library(data.table)
dt = do.call(data.table, c(lapply(unname(df[1:2]), as.character), df[3]))
dt[V1 > V2, c("V1", "V2") := .(V2, V1)]
dt[, .(num_email = sum(num_email)), by= .(V1, V2)]

# V1 V2 num_email
# 1: Beth Mable 2
# 2: Beth Susan 3
# 3: Mable Susan 1

Or using dplyr, we use mutate_each to convert the columns to character class, then reverse the order with pmin and pmax, group by 'V1', 'V2' and get the sum of 'num_email'.

library(dplyr)
df %>%
mutate_each(funs(as.character), senders, receivers) %>%
mutate( V1 = pmin(senders, receivers),
V2 = pmax(senders, receivers) ) %>%
group_by(V1, V2) %>%
summarise(num_email=sum(num_email))

# V1 V2 num_email
# (chr) (chr) (dbl)
# 1 Beth Mable 2
# 2 Beth Susan 3
# 3 Mable Susan 1

NOTE: The data.table solution was updated by @Frank.

deedle aggregate/group based on running numbers in a column of Frame

I belive a recursive function is your friend here.

Feed a list of tuples

let data = [(Joe, 51, 50);
(Tomas, 52, 100);
(Eve, 65, 20);
(Suzanne, 67, 10)]

to a function

let groupBySequencialId list =
let rec group result acc data lastId =
match data with
| [] -> acc :: result
| (name, id, amount) :: tail ->
if lastId + 1 = id then
group result ((name, id, amount) :: acc) tail id
else
group (acc :: result) ([(name, id, amount)]) tail id

group [] [] data 0

and you'll get the result you are looking for.
Grouped result

This should get the job done save three caveats.

  1. You need to parse your string into the tuples required
  2. There's an empty list in the result set because the first recursion wont match and appends the empty accumulator to the result set
  3. The list will come out be reversed

Also note that this is a highly specialized function.
If I was you, I'd try to make this more general, if you ever plan on reusing it.

Have fun.

R aggregate columns based on a factor column's value and create a new data frame

You can do this easily with data.table:

library(data.table)
options(stringsAsFactors=F)
##
dt <- data.table(
matrix(rep(1,96),ncol=16))
dt[,grp:=c(
"a-4","a-3","a-6",
"a-5","a-3","a-5")]
##
> dt[,lapply(.SD,sum),by=grp]
grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: a-4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2: a-3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3: a-6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
4: a-5 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Edit

Here's how I would try to visualize the data. I'm going to use a slightly different set of data - same structure, different numbers:

library(data.table)
library(ggplot2)
options(stringsAsFactors=F)
##
dt <- data.table(
matrix(1:96,ncol=16))
dt[,grp:=c(
"a-4","a-3","a-6",
"a-5","a-3","a-5")]
##
gt <- dt[,lapply(.SD,sum),by=grp]
> gt
grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: a-4 1 7 13 19 25 31 37 43 49 55 61 67 73 79 85 91
2: a-3 7 19 31 43 55 67 79 91 103 115 127 139 151 163 175 187
3: a-6 3 9 15 21 27 33 39 45 51 57 63 69 75 81 87 93
4: a-5 10 22 34 46 58 70 82 94 106 118 130 142 154 166 178 190

Start by reshaping the data.table from wide format to long format:

gt_long <- reshape(
gt,
direction="long",
varying=list(names(gt)[-1]),
v.names="Value",
idvar="grp",
timevar="V_ID",
times=paste0("V",1:16))
> head(gt_long)
grp V_ID Value
1: a-4 V1 1
2: a-3 V1 7
3: a-6 V1 3
4: a-5 V1 10
5: a-4 V2 7
6: a-3 V2 19

Then you can treat the Vis as factor variables, and you have some options with ggplot2:

ggplot(
data=gt_long,
aes(x=V_ID,y=Value,color=grp))+
geom_point(size=5,alpha=.75)+
scale_colour_brewer(type="div",palette=4)

Sample Image

Or, if that is too cluttered for you:

ggplot(
data=gt_long,
aes(x=V_ID,y=Value,color=grp))+
geom_point(size=4)+
facet_grid(grp ~ .)

Sample Image

Edit 2
There might be a slightly more concise way to order the levels correctly, but this works. I made a copy of the gt_long object just so I could check that it worked without modifying the original, but you can just use the original object.

gt_long2 <- copy(gt_long)
v_levels <- unique(gt_long2$V_ID)
gt_long2[,V_ID:=factor(
V_ID,
levels=v_levels,
labels=v_levels)]

I'm not going to add the plots, but I reran them with gt_long2 and it looked good.



Related Topics



Leave a reply



Submit