Exceeding memory limit in R (even with 24GB RAM)

To follow up on my comments, use data.table. I put together a quick example matching your data to illustrate:

library(data.table)

dt1 <- data.table(id = 1:908450, matrix(rnorm(908450*32), ncol = 32))
dt2 <- data.table(id = 1:908450, rnorm(908450))
#set keys
setkey(dt1, id)
setkey(dt2, id)
#check dims
> dim(dt1)
[1] 908450     33
> dim(dt2)
[1] 908450      2
#merge together and check system time:
> system.time(dt3 <- dt1[dt2])
   user  system elapsed 
   0.43    0.03    0.47 

So it took less than half a second to merge the two tables. I watched my machine's memory before and after the merge: before it, I was using 3.4 GB of RAM; when the merge ran, usage jumped to 3.7 GB and leveled off. I think you'll be hard-pressed to find anything more memory- or time-efficient than that.
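If you want to check the overhead from inside R rather than with a system monitor, base R can report it directly. A minimal sketch, using the objects from the example above:

# approximate in-memory size of each table
format(object.size(dt1), units = "MB")
format(object.size(dt2), units = "MB")

# trigger a garbage collection and report R's current memory use
gc()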

Error when trying to extract a row from a table with a condition in R

Proof that your syntax is fine:

library(plyr)

# Create a minimal, reproducible example
gene_id <- gl(3, 3, 9, labels = letters[1:3])
start <- rep(1:3, 3)
href_pos <- data.frame(gene_id = gene_id, start = start)

d1 <- ddply(as.data.frame(href_pos), "gene_id", function(href_pos) href_pos[which.min(href_pos$start), ])
d1
  gene_id start
1       a     1
2       b     1
3       c     1

To do it with data.table as Chase suggests, this should work:

require(data.table)
HREF_POS <- data.table(href_pos)
setkey(HREF_POS, gene_id)
# keep rows whose start matches one of the per-gene minimum start values
MINS <- HREF_POS[HREF_POS[, start] %in% HREF_POS[, min(start), by = gene_id]$V1, ]
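If you prefer a more direct data.table idiom, a grouped which.min() does the same job in one step. A sketch using the HREF_POS table built above (MINS2 is just an illustrative name):

# one row per gene_id: the row with the smallest start
MINS2 <- HREF_POS[, .SD[which.min(start)], by = gene_id]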

R loop consumes 5GB of RAM even when rm(df) is set

As @G5W posted in the comments, calling gc() after rm(df) will free up the memory the removed object was using.
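A minimal sketch of that pattern inside a loop; the file list and the processing step are placeholders:

files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # placeholder input files

for (f in files) {
  df <- read.csv(f)
  # ... work with df, write any results to disk ...
  rm(df)  # drop the reference to the large object
  gc()    # collect now, so the memory can be reused on the next iteration
}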

Memory error while using write.csv

You have to understand that R functions often copy their arguments if they modify them: the functional programming paradigm R employs decrees that functions do not change the objects passed in as arguments, so R copies them whenever changes need to be made in the course of executing a function.

If you build R with memory-tracing support, you can see this copying in action for any operation you are having trouble with. Using the airquality example data set and tracing memory use, I see:

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> tracemem(airquality)
[1] "<0x12b4f78>"
> write.csv(airquality, "airquality.csv")
tracemem[0x12b4f78 -> 0x1aac0d8]: as.list.data.frame as.list lapply unlist which write.table eval eval eval.parent write.csv
tracemem[0x12b4f78 -> 0x1aabf20]: as.list.data.frame as.list lapply sapply write.table eval eval eval.parent write.csv
tracemem[0x12b4f78 -> 0xf8ae08]: as.list.data.frame as.list lapply write.table eval eval eval.parent write.csv
tracemem[0x12b4f78 -> 0xf8aca8]: write.table eval eval eval.parent write.csv
tracemem[0xf8aca8 -> 0xca7fe0]: [<-.data.frame [<- write.table eval eval eval.parent write.csv
tracemem[0xca7fe0 -> 0xcaac50]: [<-.data.frame [<- write.table eval eval eval.parent write.csv

Each tracemem line records a copy, so that indicates six copies of the data are being made as R prepares it for writing to file.

Clearly that is eating up the 24GB of RAM you have available; the error says that R needs another 1.2GB of RAM to complete an operation.

The simplest solution to start with would be to write the file in chunks: write the first chunk of rows with the header, then append the remaining chunks. Note that write.csv() ignores the append argument, so for the later chunks use write.table() with sep = ",", append = TRUE and col.names = FALSE. You may need to play around to find a chunk size that does not exceed the available memory.
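A minimal sketch of that approach, assuming a large data frame called dat and an arbitrary chunk size (tune chunk_size to whatever fits comfortably in memory):

chunk_size <- 100000
n <- nrow(dat)
starts <- seq(1, n, by = chunk_size)

for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1, n)
  if (i == 1) {
    # first chunk: create the file and write the header
    write.table(dat[rows, ], "out.csv", sep = ",", row.names = FALSE,
                col.names = TRUE, append = FALSE)
  } else {
    # later chunks: append without repeating the header
    write.table(dat[rows, ], "out.csv", sep = ",", row.names = FALSE,
                col.names = FALSE, append = TRUE)
  }
}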

R stats - memory issues when allocating a big matrix / Linux

Let me build slightly on what @richardh said. All of the data you load into R chews up RAM. So you load your main data and it uses some hunk of RAM. Then you subset the data, so the subset uses a smaller hunk. Then the regression algorithm needs a hunk that is greater than your subset because it does some manipulations and gyrations. Sometimes I am able to make better use of RAM by doing the following:

  1. save the initial dataset to disk using save()
  2. take a subset of the data
  3. rm() the initial dataset so it is no longer in memory
  4. do analysis on the subset
  5. save results from the analysis
  6. totally dump all items in memory: rm(list=ls())
  7. load the initial dataset from step 1 back into RAM using load()
  8. loop steps 2-7 as needed

Be careful with step 6 and try not to shoot your eye out: that dumps EVERYTHING in R's memory, and if it hasn't been saved, it's gone. A more subtle approach is to rm() only the big objects you are sure you don't need, rather than wiping the whole workspace with rm(list=ls()). A sketch of the whole workflow is below.
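A minimal sketch of that workflow; the object names, file names, subset condition, and model are all placeholders:

# 1. save the full dataset to disk
save(big_data, file = "big_data.RData")

# 2-3. keep only the subset you need and drop the full object
sub_data <- subset(big_data, group == "A")
rm(big_data)
gc()

# 4-5. analyse the subset and save the results
fit <- lm(y ~ x, data = sub_data)
save(fit, file = "results_groupA.RData")

# 6. wipe everything still in memory (anything unsaved is gone)
rm(list = ls())
gc()

# 7. reload the full dataset and repeat with the next subset
load("big_data.RData")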

If you still need more RAM, you might want to run your analysis in Amazon's cloud. Their High-Memory Quadruple Extra Large Instance has over 68GB of RAM. Sometimes when I run into memory constraints I find the easiest thing to do is just go to the cloud where I can be as sloppy with RAM as I want to be.

Jeremy Anglim has a good blog post that includes a few tips on memory management in R. In that post, Jeremy links to a previous Stack Overflow question that I found helpful.

Limiting the allowed RAM for a service, possible using MaxWorkingSet

Strictly speaking, MaxWorkingSet only affects the working set, which is the amount of physical memory the process uses. To restrict overall memory usage, you need the Job Object API. But this is dangerous if your program really needs that much memory: a lot of code doesn't handle OutOfMemoryException, and the .NET runtime can behave strangely when memory runs short.

You need to:

  • Create a Win32 Job object
  • Set the maximum memory to the job
  • Assign your process to the job

There are .NET wrappers for the Job Object API available.

In addition, you could try forcing a compacting garbage collection (for .NET 4.6 or newer):

// compact the large object heap on the next full collection
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
// blocking, compacting collection of all generations
GC.Collect(2, GCCollectionMode.Forced, true, true);

(for older versions, though it sometimes doesn't help)

GC.Collect(2, GCCollectionMode.Forced);

The third parameter of the .NET 4.6 overload of GC.Collect() tells the runtime whether to perform the collection immediately as a blocking call. In older versions, GC.Collect() only requests a collection and leaves the timing to the runtime.

As for programming advice, I suggest wrapping each query in its own class and disposing it explicitly once the query is done. That can help the GC reclaim memory sooner.

Finally, there are indeed things in the .NET Framework that you need to manage yourself. For example, the GDI handle returned by Bitmap.GetHbitmap() must be released manually (with the GDI DeleteObject function).
