Memory Profiling in R - Tools for Summarizing

profvis looks like the solution to this question.

It generates an interactive .html file (using htmlwidgets) showing the profiling of your code.

The introduction vignette is a good guide to its capabilities.

Taking directly from the introduction, you would use it like this:

devtools::install_github("rstudio/profvis")
library(profvis)

# Generate data
times <- 4e5
cols <- 150
data <- as.data.frame(x = matrix(rnorm(times * cols, mean = 5), ncol = cols))
data <- cbind(id = paste0("g", seq_len(times)), data)
profvis({
  data1 <- data   # Store in another variable for this run

  # Get column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)

  # Subtract mean from each column
  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <- data1[, names(data1) != "id"][, i] - means[i]
  }
}, height = "400px")

Which gives an interactive flame graph of time and memory usage per line of code (screenshot omitted).

Interpretation of memory profiling output of `Rprof`

?summaryRprof says:

If memory = "both" the same list but with memory consumption in Mb in addition to the timings.

So mem.total is in MB.

With memory = "both" the change in total memory (truncated at zero) is reported [...]
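To see this column yourself, here is a minimal sketch (the toy function `f`, the vector size, and the 0.01 s sampling interval are my choices; memory profiling must be enabled in your R build, which is the default for CRAN binaries):

```r
# Profile a toy allocation-heavy function with memory profiling on
f <- function(n) {
  x <- rnorm(n)   # allocates a large double vector
  sum(x * x)      # plus temporary vectors
}

Rprof("Rprof.out", memory.profiling = TRUE, interval = 0.01)
invisible(f(5e6))
Rprof(NULL)

# mem.total (in MB) appears as an extra column next to the timings
head(summaryRprof("Rprof.out", memory = "both")$by.total)
```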

You have 8 GB RAM + 12 GB swap, yet mem.total reports that you have used 50 GB?

Because it is the aggregated delta between subsequent probes (memory-usage snapshots taken by Rprof at regular time intervals): if a probe is taken while execution is inside function f, the memory-usage delta since the last probe is added to the mem.total of f.

The memory-usage delta can be negative, but I have never seen negative mem.total values, so I am guessing (!) that only positive deltas are added to mem.total.

This would explain the 50 GB total usage you are seeing: it is not the amount of memory allocated at a single point in time but the aggregated positive memory deltas over the complete execution time.
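A toy calculation (with made-up probe values) shows how summing only the positive deltas can far exceed the peak:

```r
# Hypothetical memory snapshots (MB) taken while execution is inside f()
probes_mb <- c(100, 600, 200, 700, 300, 800)
deltas    <- diff(probes_mb)          # 500 -400 500 -400 500
mem_total <- sum(pmax(deltas, 0))     # truncate negative deltas at zero
mem_total                             # 1500 MB, although peak usage was only 800 MB
```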

This also explains why gc only shows 3 GB as "max used (Mb)": the memory is allocated and freed/deallocated many times, so you never run into memory pressure, but this costs a lot of time (moving so much data through RAM invalidates the caches and is therefore slow) on top of the calculation logic the CPU applies.
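You can observe the peak-vs-current distinction with gc directly; a sketch (the 1e7 vector size is arbitrary):

```r
gc(reset = TRUE)   # reset the "max used" statistics
x <- rnorm(1e7)    # ~76 MB of Vcells
rm(x)
g <- gc()          # x is already unreachable here...
g[, "max used"]    # ...but the peak since the reset is remembered
```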

This summary (IMHO) also seems to hide the fact that the garbage collector (gc) starts at non-deterministic points in time to clean up unreferenced memory.

Since the gc runs lazily (non-deterministically), it would IMHO be unfair to attribute the negative memory deltas to whichever single function happened to be probed.

I would interpret mem.total as mem.total.used.during.runtime which would possibly be a better label for the column.

profvis has a more detailed memory usage summary (as you can see in the screenshot in your question): it also aggregates the negative memory deltas (freed memory), but the profvis documentation warns about the shortcomings:

The code panel also shows memory allocation and deallocation.
Interpreting this information can be a little tricky, because it does
not necessarily reflect memory allocated and deallocated at that line
of code. The sampling profiler records information about memory
allocations that happen between the previous sample and the current
one. This means that the allocation/deallocation values on that line
may have actually occurred in a previous line of code.

A more detailed answer would require more research time (which I don't have):
- to look into the C and R sources
- to understand (replicate) the aggregation logic of summaryRprof based on the data files created by Rprof

Rprof data files (Rprof.out) look like this:

:376447:6176258:30587312:152:1#2 "test" 1#1 "test2"

The first four numbers (separated by colons) mean (see ?summaryRprof):
- R_SmallVallocSize: the vector memory [number of buckets] in small blocks on the R heap
- R_LargeVallocSize: the vector memory [number of buckets] in large blocks (from malloc)
- the memory in nodes on the R heap
- the number of calls to the internal function duplicate in the time interval (used to duplicate vectors, e.g. for the copy-on-first-write semantics of function arguments)

The strings are the function call stack.

Only the first two numbers are relevant for calculating the current memory usage (of vectors) in MB:

total_buckets <- R_SmallVallocSize + R_LargeVallocSize  # each Vcell bucket is 8 bytes
mem.used <- total_buckets * 8 / 1024 / 1024             # in MB
# ~50 MB for the `Rprof` probe line above:
(376447 + 6176258) * 8 / 1024 / 1024                    # = 49.99

For details about the Vcells see ?Memory.
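Putting it together, a sketch that recomputes the MB value from the probe line above (assuming the field layout described in ?summaryRprof):

```r
probe  <- ':376447:6176258:30587312:152:1#2 "test" 1#1 "test2"'
fields <- strsplit(probe, ":", fixed = TRUE)[[1]]
small  <- as.numeric(fields[2])   # R_SmallVallocSize (buckets)
large  <- as.numeric(fields[3])   # R_LargeVallocSize (buckets)
mem_mb <- (small + large) * 8 / 1024 / 1024
round(mem_mb, 1)                  # 50
```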

BTW: I wanted to try out summaryRprof(memory = "stats", diff = FALSE) to get a current memory summary, but I get an error message with R 3.4.4 (64-bit) on Ubuntu:

Error in tapply(seq_len(1L), list(index = c("1::#File", "\"test2\":1#1",  : 
arguments must have same length

Can you reproduce this (looks like "stats" is broken)?

Compiling R packages for a memory-profiling configuration

The short answer is 'Nope', as packages are unaffected by this optional feature in the R engine.

If you have particular questions concerning R use on Debian and Ubuntu, come to the r-sig-debian list.


