sum multiple columns by group with tapply
tapply works on a vector; for a data.frame you can use by (which is a wrapper for tapply; take a look at the code):
> by(df.1[,c(3:5)], df.1$state, FUN=colSums)
df.1$state: AA
apples cherries plums
111 222 333
-------------------------------------------------------------------------------------
df.1$state: BB
apples cherries plums
-111 -222 -333
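If a single table is more convenient than the per-group printout above, base R's rowsum() is one option. The following is a minimal sketch: the contents of df.1 are not shown in the answer, so the data frame below is a hypothetical stand-in that only mimics the column layout.

```r
# Hypothetical stand-in for df.1 (the real data is not shown in the answer).
df.1 <- data.frame(
  id       = 1:4,
  state    = c("AA", "AA", "BB", "BB"),
  apples   = c(100, 11, -100, -11),
  cherries = c(200, 22, -200, -22),
  plums    = c(300, 33, -300, -33)
)

# by() splits the data frame by state and runs colSums() on each piece.
res_by <- by(df.1[, 3:5], df.1$state, FUN = colSums)

# rowsum() computes the same per-group column sums as one table,
# with one row per state.
res_rs <- rowsum(df.1[, 3:5], group = df.1$state)
print(res_rs)
```

Which of the two to use mostly depends on whether you want a list-like result (by) or a rectangular one (rowsum).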
How to run tapply() on multiple columns of data frame using R?
That's because tapply works on vectors, and transforms df[,2:10] to a vector. On top of that, sum will give you the total sum, not the sum per column. Use aggregate(), e.g.:
aggregate(df[,2:10],by=list(df$a), sum)
If you want a list returned, you could use by() for that. Make sure to specify colSums instead of sum, as by works on a split data frame:
by(df[,2:10],df$a,FUN=colSums)
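To make the difference between sum and colSums concrete, here is a small sketch. The data frame is hypothetical (a grouping column a and two numeric columns), since the question's data is not reproduced here.

```r
# Hypothetical data: grouping column `a` plus two numeric columns.
df <- data.frame(a = c("x", "x", "y"), b = c(1, 2, 3), c = c(10, 20, 40))

# sum() collapses all columns of each piece into one number per group.
totals <- by(df[, 2:3], df$a, FUN = sum)

# colSums() keeps one sum per column within each group.
per_col <- by(df[, 2:3], df$a, FUN = colSums)

# aggregate() applies sum() column by column and returns a data frame.
agg <- aggregate(df[, 2:3], by = list(a = df$a), FUN = sum)
print(agg)
```

With sum, group x yields a single 33 (1 + 2 + 10 + 20); with colSums or aggregate, it yields 3 for b and 30 for c.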
groupby multiple columns and sum then sum in dplyr
The code I posted works; the issue turned out to be in my package version. It is solved by specifying the package name explicitly:
df %>%
group_by(v1, v2, v3) %>%
dplyr::summarise(sumv4 = sum(v4))
Apply multiple functions to column using tapply
You can certainly do stuff like this using ddply from the plyr package:
library(plyr)
dat <- data.frame(x = rep(letters[1:3],3),y = 1:9)
ddply(dat,.(x),summarise,total = NROW(piece), count = sum(y))
x total count
1 a 3 12
2 b 3 15
3 c 3 18
You can keep listing more summary functions, beyond just two, if you like. Note I'm being a little tricky here in calling NROW on an internal variable in ddply called piece. You could have just done something like length(y) instead. (And probably should; referencing the internal variable piece isn't guaranteed to work in future versions, I think. Do as I say, not as I do and just use length().)
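For readers who would rather stay in base R, roughly the same summary can be sketched with split() and sapply(). This is an alternative technique, not what the answer above uses, and the names n and total are chosen here purely for illustration.

```r
dat <- data.frame(x = rep(letters[1:3], 3), y = 1:9)

# Split y by x, then return a named vector of summaries for each group.
stats <- sapply(split(dat$y, dat$x),
                function(v) c(n = length(v), total = sum(v)))

# sapply() puts the groups in columns; transpose for one row per group.
by_group <- t(stats)
print(by_group)
```

Adding another summary is a matter of adding another named element to the vector returned by the anonymous function.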
Add multiple columns with the same group and sum
If I have understood you well, this will solve your problem:
narc_auth_total <-
narc_auth %>%
group_by(Full.Name) %>%
summarise(
`2019_words` = sum(`2019`),
`2020_words` = sum(`2020`)
) %>%
left_join(totaltweetsyear, ., by = "Full.Name")
Using tapply and cumsum function for multiple vectors in R
If there is more than one grouping variable, wrap them in a list, but note that tapply is a summarising function, so its result is split up by group (one vector per cell) when we specify a function like cumsum that returns more than one value.
tapply(date_country$n, list(date_country$country, date_country$pangolin_lineage), cumsum)
But this is much easier with ave, i.e. if we want to create a new column, avoid the hassle of unlist etc. by just using ave:
ave(date_country$n, date_country$country,
date_country$pangolin_lineage, FUN = cumsum)
#[1] 1 2 3 1 4 1
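The contrast between the two calls can be sketched on a small example. The date_country data frame below is hypothetical (the question's data is not shown), and the column cum_n is a name introduced here for illustration.

```r
# Hypothetical stand-in for date_country.
date_country <- data.frame(
  country          = c("A", "A", "A", "B", "A", "B"),
  pangolin_lineage = c("x", "x", "x", "x", "x", "y"),
  n                = c(2, 3, 1, 5, 4, 6)
)

# tapply() returns a list-mode array: one cumsum vector per
# country/lineage cell, which is awkward to put back into the data frame.
per_cell <- tapply(date_country$n,
                   list(date_country$country, date_country$pangolin_lineage),
                   cumsum)

# ave() returns a vector aligned with the original rows, so it can be
# assigned directly as a new column.
date_country$cum_n <- ave(date_country$n,
                          date_country$country,
                          date_country$pangolin_lineage,
                          FUN = cumsum)
print(date_country)
```

Because ave keeps the original row order, no unlist or re-sorting step is needed afterwards.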
Grouping functions (tapply, by, aggregate) and the *apply family
R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.
Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.
This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.
apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.
# Two dimensional matrix
M <- matrix(seq(1,16), 4, 4)
# apply min to rows
apply(M, 1, min)
[1] 1 2 3 4
# apply max to columns
apply(M, 2, max)
[1] 4 8 12 16
# 3 dimensional array
M <- array( seq(32), dim = c(4,4,2))
# Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
apply(M, 1, sum)
# Result is one-dimensional
[1] 120 128 136 144
# Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
apply(M, c(1,2), sum)
# Result is two-dimensional
[,1] [,2] [,3] [,4]
[1,] 18 26 34 42
[2,] 20 28 36 44
[3,] 22 30 38 46
[4,] 24 32 40 48
If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
lapply - When you want to apply a function to each element of a list in turn and get a list back.
This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.
x <- list(a = 1, b = 1:3, c = 10:100)
lapply(x, FUN = length)
$a
[1] 1
$b
[1] 3
$c
[1] 91
lapply(x, FUN = sum)
$a
[1] 1
$b
[1] 6
$c
[1] 5005
sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
If you find yourself typing unlist(lapply(...)), stop and consider sapply.
x <- list(a = 1, b = 1:3, c = 10:100)
# Compare with above; a named vector, not a list
sapply(x, FUN = length)
a b c
1 3 91
sapply(x, FUN = sum)
a b c
1 6 5005
In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:
sapply(1:5,function(x) rnorm(3,x))
If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:
sapply(1:5,function(x) matrix(x,2,2))
Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:
sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
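The shapes produced in each case can be checked with dim(). This sketch swaps rnorm for a deterministic rep() so the results are reproducible; the variable names are illustrative only.

```r
# Equal-length vectors become the columns of a matrix (3 rows, 5 columns).
m <- sapply(1:5, function(x) rep(x, 3))
dim(m)

# Each 2 x 2 matrix is flattened into a column of length 4.
flat <- sapply(1:5, function(x) matrix(x, 2, 2))
dim(flat)

# simplify = "array" stacks the matrices into a 2 x 2 x 5 array instead.
arr <- sapply(1:5, function(x) matrix(x, 2, 2), simplify = "array")
dim(arr)

# Results of differing length cannot be simplified; a list comes back.
ragged <- sapply(1:3, function(x) seq_len(x))
is.list(ragged)
```

The last call shows the flip side of the contingency: when the lengths differ, sapply quietly falls back to returning a list.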
vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code or want more type safety.
For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.
x <- list(a = 1, b = 1:3, c = 10:100)
#Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
a b c
1 3 91
mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.
This is multivariate in the sense that your function must accept multiple arguments.
#Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1] 3 6 9 12 15
#To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3
[[2]]
[1] 6
[[3]]
[1] 9
[[4]]
[1] 12
[[5]]
[1] 15
rapply - For when you want to apply a function to each element of a nested list structure, recursively.
To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:
# Append ! to string, otherwise increment
myFun <- function(x){
if(is.character(x)){
return(paste(x,"!",sep=""))
}
else{
return(x + 1)
}
}
#A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
b = 3, c = "Yikes",
d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
# Result is named vector, coerced to character
rapply(l, myFun)
# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")
tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.
The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.
A vector:
x <- 1:20
A factor (of the same length!) defining groups:
y <- factor(rep(letters[1:5], each = 4))
Add up the values in x within each subgroup defined by y:
tapply(x, y, sum)
a b c d e
10 26 42 58 74
More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
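As a sketch of the several-factors case, the example above can be extended with a second grouping factor. The factor z is hypothetical and added here only for illustration.

```r
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))
# Hypothetical second grouping: alternate "odd" / "even" along x.
z <- factor(rep(c("odd", "even"), times = 10))

# One sum per combination of y and z, returned as a 5 x 2 table
# (rows are the levels of y, columns the levels of z).
tab <- tapply(x, list(y, z), sum)
print(tab)
```

For group a (values 1 to 4), the odd cell is 1 + 3 = 4 and the even cell is 2 + 4 = 6.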