Idiomatic R code for partitioning a vector by an index and performing an operation on that partition
Yet another option is ave. For good measure, I've collected the answers above, tried my best to make their output equivalent (a vector), and provided timings over 1000 runs using your example data as input. First, my answer using ave: ave(df$x, df$index, FUN = function(z) z/sum(z)). I also show an example using the data.table package since it is usually pretty quick, but I know you're looking for base solutions, so you can ignore that if you want.
And now a bunch of timings:
library(data.table)
library(plyr)
dt <- data.table(df)
plyr <- function() ddply(df, .(index), transform, z = x / sum(x))
av <- function() ave(df$x, df$index, FUN = function(z) z/sum(z))
t.apply <- function() unlist(tapply(df$x, df$index, function(x) x/sum(x)))
l.apply <- function() unlist(lapply(split(df$x, df$index), function(x){x/sum(x)}))
b.y <- function() unlist(by(df$x, df$index, function(x){x/sum(x)}))
agg <- function() aggregate(df$x, list(df$index), function(x){x/sum(x)})
d.t <- function() dt[, x/sum(x), by = index]
library(rbenchmark)
benchmark(plyr(), av(), t.apply(), l.apply(), b.y(), agg(), d.t(),
replications = 1000,
columns = c("test", "elapsed", "relative"),
order = "elapsed")
#-----
test elapsed relative
4 l.apply() 0.052 1.000000
2 av() 0.168 3.230769
3 t.apply() 0.257 4.942308
5 b.y() 0.694 13.346154
6 agg() 1.020 19.615385
7 d.t() 2.380 45.769231
1 plyr() 5.119 98.442308
The lapply() solution seems to win in this case, and data.table() is surprisingly slow. Let's see how this scales to a bigger aggregation problem:
df <- data.frame(x = sample(1:100, 1e5, TRUE), index = gl(1000, 100))
dt <- data.table(df)
#Replication code omitted for brevity, used 100 replications and dropped plyr() since I know it
#will be slow by comparison:
test elapsed relative
6 d.t() 2.052 1.000000
1 av() 2.401 1.170078
3 l.apply() 4.660 2.270955
2 t.apply() 9.500 4.629630
4 b.y() 16.329 7.957602
5 agg() 20.541 10.010234
That seems more consistent with what I'd expect.
In summary, you've got plenty of good options. Find one or two methods that work with your mental model of how aggregation tasks should work and master that function. Many ways to skin a cat.
Edit - and an example with 1e7 rows
Probably not large enough for Matt, but as big as my laptop can handle without crashing:
df <- data.frame(x = sample(1:100, 1e7, TRUE), index = gl(10000, 1000))
dt <- data.table(df)
#-----
test elapsed relative
6 d.t() 0.61 1.000000
1 av() 1.45 2.377049
3 l.apply() 4.61 7.557377
2 t.apply() 8.80 14.426230
4 b.y() 8.92 14.622951
5 agg() 18.20 29.836066
How to find average of a col1 grouping by col2
With a data.frame named dat that looked like:
rank name country category sales profits assets marketvalue
21 21 DaimlerChrysler Germany Consumer_dur 157.13 5.12 195.58 47.43
Try (untested, since the numerous spaces in the posted text prevent read.table from parsing it):
aggregate(dat[ , c("sales", "profits", "assets", "marketvalue")], # cols to aggregate
dat["country"], # group column
FUN=mean) # aggregation function
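Since the snippet above is untested, here's a minimal runnable sketch of the same call on made-up data (the country names and figures below are invented for illustration, not from the original table):

```r
# Toy data standing in for the Forbes-style table (values are made up)
dat <- data.frame(
  country     = c("Germany", "Germany", "USA"),
  sales       = c(157.13, 50.00, 200.00),
  profits     = c(5.12, 2.00, 10.00),
  assets      = c(195.58, 100.00, 300.00),
  marketvalue = c(47.43, 20.00, 90.00)
)

# Mean of each numeric column within each country
res <- aggregate(dat[, c("sales", "profits", "assets", "marketvalue")],
                 dat["country"],
                 FUN = mean)
res
```

The second argument must be a list (here, a one-column data frame), which is why `dat["country"]` is used rather than `dat$country`.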
How to get standardized column for specific rows only?
Using ave might be a good option here:
Get your data:
test <- read.csv(textConnection("Score,Quarter
98.7,Round 1 2011
88.6,Round 1 2011
76.5,Round 1 2011
93.5,Round 2 2011
97.7,Round 2 2011
89.1,Round 1 2012
79.4,Round 1 2012
80.3,Round 1 2012"),header=TRUE)
scale the data within each Quarter group:
test$score_scale <- ave(test$Score,test$Quarter,FUN=scale)
test
Score Quarter score_scale
1 98.7 Round 1 2011 0.96866054
2 88.6 Round 1 2011 0.05997898
3 76.5 Round 1 2011 -1.02863953
4 93.5 Round 2 2011 -0.70710678
5 97.7 Round 2 2011 0.70710678
6 89.1 Round 1 2012 1.15062301
7 79.4 Round 1 2012 -0.65927589
8 80.3 Round 1 2012 -0.49134712
Just to make it obvious that this works, here are the individual results for each Quarter group:
> as.vector(scale(test$Score[test$Quarter=="Round 1 2011"]))
[1] 0.96866054 0.05997898 -1.02863953
> as.vector(scale(test$Score[test$Quarter=="Round 2 2011"]))
[1] -0.7071068 0.7071068
> as.vector(scale(test$Score[test$Quarter=="Round 1 2012"]))
[1] 1.1506230 -0.6592759 -0.4913471
I am trying to sum probes count within subgroup of a dataframe in R
One way with the dplyr package would be the following. Your data frame is called mydf.
library(dplyr)
group_by(mydf, Patient, Chrom) %>%
mutate(whatever = sum(ProbeCount))
#Source: local data frame [8 x 6]
#Groups: Patient, Chrom
#
# Patient Chrom Start End ProbeCount whatever
#1 1 1 51599 62640 8 8684
#2 1 1 88466 16022503 8676 8684
#3 1 2 2785 285255 186 3089
#4 1 2 290880 4178544 2903 3089
#5 2 1 51599 4098530 1282 26511
#6 2 1 4101675 46753618 25229 26511
#7 2 2 2785 36178040 25931 25952
#8 2 2 36185342 36192717 21 25952
If your data is large, you may want to use the data.table package.
library(data.table)
setDT(mydf)[, whatever := sum(ProbeCount), by = list(Patient, Chrom)][]
# Patient Chrom Start End ProbeCount whatever
#1: 1 1 51599 62640 8 8684
#2: 1 1 88466 16022503 8676 8684
#3: 1 2 2785 285255 186 3089
#4: 1 2 290880 4178544 2903 3089
#5: 2 1 51599 4098530 1282 26511
#6: 2 1 4101675 46753618 25229 26511
#7: 2 2 2785 36178040 25931 25952
#8: 2 2 36185342 36192717 21 25952
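If neither package is available, base R's ave covers this case too, consistent with the other answers on this page. A sketch, rebuilding mydf from the output shown above:

```r
# Reconstruct the example data from the printed output
mydf <- data.frame(
  Patient    = c(1, 1, 1, 1, 2, 2, 2, 2),
  Chrom      = c(1, 1, 2, 2, 1, 1, 2, 2),
  ProbeCount = c(8, 8676, 186, 2903, 1282, 25229, 25931, 21)
)

# ave() accepts several grouping variables; this sums ProbeCount
# within each (Patient, Chrom) pair, repeated on every row
mydf$whatever <- ave(mydf$ProbeCount, mydf$Patient, mydf$Chrom, FUN = sum)
mydf
```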
How can I operate on elements of a data.frame in r, that creates a new column?
library(dplyr)
df <- read.table(text = "a b d
1 2 4
1 2 5
1 2 6
2 1 5
2 3 6
2 1 1
" , header = T)
df %>%
group_by(a , b) %>%
mutate(m = mean(d))
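For comparison with the ave approach used elsewhere on this page, the same grouped mean can be added in base R; a sketch against the same toy data frame:

```r
# Same toy data as in the dplyr example
df <- read.table(text = "a b d
1 2 4
1 2 5
1 2 6
2 1 5
2 3 6
2 1 1", header = TRUE)

# ave() accepts multiple grouping variables:
# mean of d within each (a, b) pair, repeated on every row
df$m <- ave(df$d, df$a, df$b, FUN = mean)
df
```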
error when trying to extract row from a table with a condition in R
Proof that your syntax is fine:
#Create a minimal, reproducible example
library(plyr)
gene_id <- gl(3, 3, 9, labels = letters[1:3])
start <- rep(1:3, 3)
href_pos <- data.frame(gene_id = gene_id, start = start)
d1 <- ddply(href_pos, "gene_id", function(href_pos) href_pos[which.min(href_pos$start), ])
d1
gene_id start
1 a 1
2 b 1
3 c 1
To do it with data.table
as Chase suggests, this should work:
require(data.table)
HREF_POS <- data.table(href_pos)
setkey(HREF_POS, gene_id)
MINS <- HREF_POS[HREF_POS[,start] %in% HREF_POS[ ,min(start), by=gene_id]$V1,]
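A more idiomatic data.table sketch for the same task uses .SD, which also avoids a subtle issue with the %in% matching above (it can pick up extra rows when one group's minimum value happens to occur in another group):

```r
library(data.table)

# Toy data matching the reproducible example above
HREF_POS <- data.table(gene_id = gl(3, 3, 9, labels = letters[1:3]),
                       start   = rep(1:3, 3))

# One row per gene_id: the row of .SD where start is smallest
MINS <- HREF_POS[, .SD[which.min(start)], by = gene_id]
MINS
```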
Add a new column of the sum by group
I agree with @mnel at least on his first point. I didn't see ave demonstrated in the answers he cited, and I think it's the "simplest" base-R method. Using that data.frame(cbind(...)) construction should be outlawed, and teachers who demonstrate it should be stripped of their credentials.
set.seed(123)
df <- data.frame(y = sample(c("A","B","C"), 10, TRUE),
                 X = sample(c(1,2,3), 10, TRUE))
df <- df[order(df$y), ] # that step is not necessary for success.
df
df$sum <- ave(df$X, df$y, FUN=sum)
df
y X sum
1 A 3 6
6 A 3 6
3 B 3 8
7 B 1 8
9 B 1 8
10 B 3 8
2 C 2 6
4 C 2 6
5 C 1 6
8 C 1 6
estimate frequency for multiple subsets of data frame in R
We can use the data.table package:
library(data.table)
setDT(df)[, freq:= val/sum(val) , by = fac2]
df
# fac1 fac2 val freq
#1: a x 10 0.1666667
#2: b x 20 0.3333333
#3: c x 30 0.5000000
#4: a y 40 0.2666667
#5: b y 50 0.3333333
#6: c y 60 0.4000000
#7: a z 70 0.2916667
#8: b z 80 0.3333333
#9: c z 90 0.3750000
Or using base R:
df$freq <- with(df, val/ave(val, fac2, FUN=sum))