How to Create a Line Plot with Groups in Base R Without Loops

How to create a line plot with groups in Base R without loops?

What about something like this as a base to work from:

test = data.frame(x = rep(1:3, each = 2),
                  group = rep(c("Group 1", "Group 2"), 3),
                  group_cd = rep(c(1, 2), 3),
                  y = c(22, 8, 11, 4, 7, 5))

xvals <- split(test$x, test$group)
yvals <- split(test$y, test$group)

plot(1:max(unlist(xvals)), ylim = c(0, max(unlist(yvals))), type = "n")
# thanks to @BenBolker for refining this next key line
mapply(lines, xvals, yvals, col = c("red", "blue"), pch = 1:2, type = "o")

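
As a possible alternative (not part of the answer above), matplot() can draw one line per column once the data is in wide form. This is a minimal sketch that reuses the toy data; the reshaping step with tapply() and the name `wide` are my own assumptions.

```r
# Same toy data as in the answer above
test <- data.frame(x = rep(1:3, each = 2),
                   group = rep(c("Group 1", "Group 2"), 3),
                   y = c(22, 8, 11, 4, 7, 5))

# Reshape to wide form: one row per x value, one column per group.
# Each cell holds a single observation here, so mean() just returns it.
wide <- tapply(test$y, list(test$x, test$group), mean)

# matplot() draws one line per column of the matrix, no explicit loop needed
matplot(as.numeric(rownames(wide)), wide,
        type = "o", pch = 1:2, lty = 1, col = c("red", "blue"),
        ylim = c(0, max(wide)), xlab = "x", ylab = "y")
legend("topright", legend = colnames(wide),
       col = c("red", "blue"), pch = 1:2, lty = 1)
```

This only works cleanly when every group is observed at the same x values; with ragged groups the split() plus mapply() approach is the safer choice.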

How to add trend line to plot for each individual without using a loop in Base R?

Here's a proposal. First fit linear models to your data (using lm). Then use these fits to plot the lines:

fits <- mapply(function(x, y) lm(y ~ as.numeric(x)),
               xvals, yvals, SIMPLIFY = FALSE)
mapply(abline, fits, col = seq_along(xvals))

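
To check what the fitted lines actually are, the coefficients can be pulled out of the list of models. This is a minimal, self-contained sketch that assumes the toy data from the first answer; the name `coefs` and the scatter plot underneath the lines are my own additions.

```r
# Toy data from the first answer (assumed here so the sketch runs on its own)
test <- data.frame(x = rep(1:3, each = 2),
                   group = rep(c("Group 1", "Group 2"), 3),
                   group_cd = rep(c(1, 2), 3),
                   y = c(22, 8, 11, 4, 7, 5))
xvals <- split(test$x, test$group)
yvals <- split(test$y, test$group)

# One linear model per group, kept as a named list
fits <- mapply(function(x, y) lm(y ~ x), xvals, yvals, SIMPLIFY = FALSE)

# Intercept and slope per group: rows are coefficients, columns are groups
coefs <- sapply(fits, coef)
print(coefs)

# Points coloured by group, then one trend line per fit;
# invisible() hides the list of NULLs that mapply() would otherwise print
plot(test$x, test$y, col = test$group_cd, xlab = "x", ylab = "y")
invisible(mapply(abline, fits, col = seq_along(fits)))
```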

Creating base R plots in loop

There's just a minor issue in how you subset the data in your for loop; once that is fixed, it should work.

for(i in unique(test$ID)){

[...]

linedata <-
data.frame(Amount=test[test$ID == i, ]$Amount, Year=test[test$ID == i, ]$Year)

[...]

}
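
The parts of the loop marked [...] are not shown, so the following is only a sketch of the subsetting pattern on made-up data. The column names ID, Amount and Year come from the snippet above; the values, the plotting calls and the loop index `k` are assumptions.

```r
# Hypothetical data with the column names used in the snippet above
test <- data.frame(ID = rep(c("a", "b"), each = 3),
                   Year = rep(2019:2021, times = 2),
                   Amount = c(10, 12, 15, 7, 9, 8))

# Empty canvas sized to the full data, so every ID fits
plot(NA, xlim = range(test$Year), ylim = range(test$Amount),
     xlab = "Year", ylab = "Amount")

ids <- unique(test$ID)
for (k in seq_along(ids)) {
  # Subset rows for the current ID only, keeping the two columns of interest
  linedata <- test[test$ID == ids[k], c("Amount", "Year")]
  lines(linedata$Year, linedata$Amount, col = k)
}
```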

How do I plot multiple lines (by levels of factor) for year series?

Other than my comments above, here is a basic base implementation. If initially your $COUNTRY is a factor (is.factor(df$COUNTRY)), then you can skip the creation of ctryfctr and change the lines call to lines(..., col=x$COUNTRY[1]):

df$ctryfctr <- factor(df$COUNTRY)
plot(NA, xlim=range(df$YEAR), ylim=range(df$METRIC))
for (x in split(df, df$COUNTRY)) lines(x$YEAR, x$METRIC, col=x$ctryfctr[1])

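
The question's data is not shown, so here is a runnable sketch of the factor variant on invented data. Only the column names COUNTRY, YEAR and METRIC come from the answer; the values, the name `by_country` and the legend are assumptions.

```r
# Hypothetical data; COUNTRY is already a factor, so no helper column is needed
df <- data.frame(COUNTRY = factor(rep(c("A", "B", "C"), each = 4)),
                 YEAR = rep(2001:2004, times = 3),
                 METRIC = c(1, 3, 2, 5, 2, 2, 4, 6, 5, 4, 3, 1))

# split() keeps all factor levels in every piece, so the integer code of
# COUNTRY[1] differs per country and picks a different palette() colour
by_country <- split(df, df$COUNTRY)

plot(NA, xlim = range(df$YEAR), ylim = range(df$METRIC),
     xlab = "YEAR", ylab = "METRIC")
for (x in by_country) lines(x$YEAR, x$METRIC, col = x$COUNTRY[1])
legend("topleft", legend = levels(df$COUNTRY),
       col = seq_along(levels(df$COUNTRY)), lty = 1)
```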

Use for loop to plot multiple lines in single plot with ggplot2

Instead of running a loop, you should do this the ggplot2 way.
ggplot2 wants the data in long format (you can convert it with reshape2::melt()). Then split the lines via a grouping column (here Var2).

y <- matrix(rnorm(100), 10, 10)
require(reshape2)
y_m <- melt(y)

require(ggplot2)
ggplot() +
  geom_line(data = y_m, aes(x = Var1, y = value, group = Var2))

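
To make the long format concrete, the same three columns can be built by hand in base R. This is a sketch, not what melt() does internally; it assumes a matrix without dimnames, for which, as far as I know, melt() reports plain row and column indices. The name `y_long` and the seed are my own.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
y <- matrix(rnorm(100), 10, 10)

# Long format by hand: one row per matrix cell, in column-major order
y_long <- data.frame(Var1 = as.vector(row(y)),   # row index of each cell
                     Var2 = as.vector(col(y)),   # column index of each cell
                     value = as.vector(y))
head(y_long)

# Plot only if ggplot2 is installed; one line per original matrix column
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(y_long, aes(x = Var1, y = value, group = Var2)) +
    geom_line()
  print(p)
}
```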

Looping in R to create many plots when you have one extra variable

This is a well-discussed topic for R; see the SO posts here and here. Answers to this question highlight that *apply() alternatives to for() improve clarity, make parallelization easier, and under some circumstances speed up the problem. However, presumably your real question is "how do I do this faster?" because it is taking long enough that you're unhappy. Inside your loop you are doing three distinct tasks:

  1. Break out a chunk of the dataframe using filter()
  2. Make a plot.
  3. Save the plot to a jpeg.

There are multiple ways to do all three of these steps, so let's try to evaluate all of them. I'll use the diamonds data from ggplot2 because it is bigger than the cars data; I hope differences in performance between methods will be noticeable this way. I learned a lot from this chapter of Hadley Wickham's book on measuring performance.
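
For readers who have not used R's sampling profiler before, here is a minimal sketch of the Rprof() / summaryRprof() cycle on a stand-in workload. The function `busy_work()` and the file name are hypothetical; it only shows the mechanics, not the timings discussed in this answer.

```r
# Hypothetical CPU-bound workload that keeps R busy for about half a second
busy_work <- function(seconds = 0.5) {
  start <- proc.time()[["elapsed"]]
  total <- 0
  while (proc.time()[["elapsed"]] - start < seconds) {
    total <- total + sum(sqrt(runif(1e4)))
  }
  total
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)            # start sampling the call stack
invisible(busy_work())
Rprof(NULL)                 # stop profiling

# Per-function summary; by.self lists time spent in each function itself
prof <- summaryRprof(prof_file)
head(prof$by.self)
```

Line-level output via summaryRprof(lines = "show") additionally needs line profiling switched on and source references kept, which is why the code in this answer is sourced with keep.source=TRUE.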

So that I can use profiling, I put the following block of code inside a function and save it in a separate R file named for_solution.r.

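f <- function(){
  param <- unique(diamonds$cut)
  for (i in param){
    # keep one cut, plot price against carat, one panel per color
    mcplt <- diamonds %>% filter(cut==i) %>% ggplot(aes(x=carat, y=price)) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ",i,sep=""))
    # name the arguments: ggsave() takes the filename first, the plot second
    ggsave(filename=paste("Cut",i,".jpeg",sep=""), plot=mcplt)
  }
}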

and then I do:

library(dplyr)
library(ggplot2)
source("for_solution.r",keep.source=TRUE)
Rprof(line=TRUE)
f()
Rprof(NULL)
summaryRprof(lines="show")

Examining that output, I see that the block of code spends 97.25% of its time just saving the files. Examining the source for ggsave(), I can see that the function does a lot of defensive programming to identify the type of output, then opens the graphics device, prints, and closes the device. So I wonder if doing just that step manually would help. I'm also going to take advantage of the fact that a jpeg device will automatically produce a new file for each page, so the device only has to be opened and closed once.

f1 <- function(){
  param <- unique(diamonds$cut)
  # open the jpeg device, change defaults to match ggsave()
  jpeg("cut%03d.jpg",width=par("din")[1],height=par("din")[2],units="in",res=300)
  for (i in param){
    mcplt <- diamonds %>% filter(cut==i) %>% ggplot(aes(x=carat, y=price)) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ",i,sep=""))
    print(mcplt)
  }
  dev.off()
}

and now profiling again

Rprof(line=TRUE)
f1()
Rprof(NULL)
summaryRprof(lines="show")

f1() still spends most of its time on print(mcplt), and it is only slightly faster than before (1.96 seconds compared to 2.18 seconds). One possible way to speed things up is to use a smaller device (lower resolution or a smaller image); when I used the defaults for jpeg() the difference was larger, more like 25% faster. I also tried changing the device to png(), but that made no difference.

Based on the profiling, I don't expect this to help, but for completeness I'm going to try doing away with the for loop and running everything inside dplyr with do(). I found this question and this one helpful here.

jpeg("cut%03d.jpg",width=par("din")[1],height=par("din")[2],units="in",res=300) # open the jpeg device, change defaults to match ggsave()
plots = diamonds %>% group_by(cut) %>%
do({plot=ggplot(aes(x=carat, y=price),data=.) +
geom_point() +
facet_wrap(~color) +
ggtitle(paste("Cut: ",.$cut,sep=""))
print(plot)})

dev.off()

Running that code gives

Error: Results are not data frames at positions: 1, 2, 3

but it seems to work. I believe the error arises when do() returns, because the print() method isn't returning a data.frame. Profiling seems to indicate it runs a bit faster, 1.78 seconds overall. But I don't like solutions that generate errors, even if they aren't causing problems.
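
One way to get the same per-group printing without the error is to drop do() and use base split() with Map(), which does not care what the function returns. This is a sketch of a swapped-in technique, not something I profiled: the helper `walk_groups()` is hypothetical, and the output path and the 7 x 7 inch device size are assumptions.

```r
# Hypothetical helper: call f(chunk, key) once per group, discard the results
walk_groups <- function(data, group, f) {
  chunks <- split(data, data[[group]], drop = TRUE)
  invisible(Map(f, chunks, names(chunks)))
}

# Usage with the diamonds data; runs only if ggplot2 and a jpeg device exist
if (requireNamespace("ggplot2", quietly = TRUE) && capabilities("jpeg")) {
  library(ggplot2)
  # assumption: write into the session's temp directory, one file per page
  jpeg(file.path(tempdir(), "cut%03d.jpg"),
       width = 7, height = 7, units = "in", res = 300)
  walk_groups(diamonds, "cut", function(chunk, key) {
    p <- ggplot(chunk, aes(x = carat, y = price)) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste0("Cut: ", key))
    print(p)
  })
  dev.off()
}
```

Since the profiling above points at printing as the slow part, I would not expect this to be faster, only quieter.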

I have to stop here, but I've already learned a great deal about where to focus the attention. Other things to try would include:

  1. Using parallel or something similar to run each chunk of the dataframe in a separate process. I'm not sure that would help if the problem is saving the file, but if rendering the image is done by the CPU it would, I think.
  2. Try data.table instead of dplyr, but again, it's the printing part that's slow.
  3. Try Base graphics and lattice graphics and plotly instead of ggplot2. I've no idea about the relative speed, but it could vary.
  4. Buy a faster hard drive! I just compared the speed of f() on my home computer, which has a regular hard drive, with my work machine, which has an SSD: the home computer is about 3x slower than the timings above.

