R: Loops to Process Large Dataset (GBs) in Chunks

Looks like you're on the right track. Just open the connection once (you don't need <<-, plain <- is fine) and use a larger chunk size so that R's vectorized operations can process each chunk efficiently, along the lines of:

filename <- "nameoffile.txt"
nrows <- 1000000

con <- file(description = filename, open = "r")
## N.B.: skip = 17 is from the original problem; usually not needed (thx @Moody_Mudskipper)
data <- read.table(con, nrows = nrows, skip = 17, header = FALSE)
repeat {
    if (nrow(data) == 0)
        break
    ## process chunk 'data' here, then...
    ## ...read the next chunk
    if (nrow(data) != nrows)   # short read: that was the final chunk
        break
    data <- tryCatch({
        read.table(con, nrows = nrows, skip = 0, header = FALSE)
    }, error = function(err) {
        ## matching the condition message only works when the message is not translated
        if (identical(conditionMessage(err), "no lines available in input"))
            data.frame()
        else stop(err)
    })
}
close(con)

Iteration seems to me like a good strategy, especially for a file that you're going to process once rather than, say, reference repeatedly like a database. The answer has been modified to be more robust about detecting a read at the end of the file.
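
If it helps to see the accumulator pattern spelled out, here is a minimal sketch (not from the original answer) that reuses the same loop skeleton to keep a running row count across chunks; swap the running count for whatever per-chunk summary you actually need:

filename <- "nameoffile.txt"
nrows <- 1000000
total_rows <- 0

con <- file(description = filename, open = "r")
data <- read.table(con, nrows = nrows, header = FALSE)
repeat {
    if (nrow(data) == 0)
        break

    ## per-chunk work: here just a running row count, but any chunk-wise
    ## aggregation that can be combined across chunks works the same way
    total_rows <- total_rows + nrow(data)

    if (nrow(data) != nrows)   # short read: that was the last chunk
        break
    data <- tryCatch({
        read.table(con, nrows = nrows, header = FALSE)
    }, error = function(err) {
        if (identical(conditionMessage(err), "no lines available in input"))
            data.frame()
        else stop(err)
    })
}
close(con)
total_rows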

Row-wise Manipulation of Large Files

I can't say I've done this myself before, but I think this should work.

library( data.table )

# set the input and output files
input.file <- "foo.csv"
output.file <- sub( "\\.csv$", "_output\\.csv", input.file )

# get column names by importing the first few lines
column.names <- names( fread( input.file, header = TRUE, nrows = 3L ) )

# write those column names as a line of text (header)
cat( paste( c( column.names, "MM" ), collapse = "," ),
     file = output.file, append = FALSE )
cat( "\n", file = output.file, append = TRUE )

# decide how many rows to read at a time
rows.at.a.time <- 1E4L

# begin looping
start.row <- 1L
while( TRUE ) {

    # read in only the specified lines
    input <- fread( input.file,
                    header = FALSE,
                    skip = start.row,
                    nrows = rows.at.a.time
    )

    # stop looping if no data was read
    if( nrow( input ) == 0L ) break

    # create the "MM" column
    input[ , MM := rowSums( .SD[ , 5:7 ] ) ]

    # append the data to the output file
    fwrite( input,
            file = output.file,
            append = TRUE, col.names = FALSE )

    # bump the `start.row` parameter
    start.row <- start.row + rows.at.a.time

    # stop reading if the end of the file was reached
    if( nrow( input ) < rows.at.a.time ) break

}

UPDATE: to preserve character strings, you can import all columns as character by specifying the following in the fread call within the loop:

colClasses = rep( "character", 280 )

Then, to perform the row sums (since you now have all character columns), you need to include a conversion there. The following would replace the single line (the one with this same comment above it) in the code:

# create the "MM" column
input[ , MM := .SD[ , 5:7 ] %>%
           lapply( as.numeric ) %>%
           do.call( what = cbind ) %>%
           rowSums()
       ]

Where 5:7 is specified here, you could replace it with any vector of column references to be passed to rowSums().

Note that if you use the %>% pipes above, you'll need library(magrittr) at the top of your code to load that operator.
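
Putting both changes together, the read-and-compute step inside the while() loop would look something like this sketch (the 280 column count is taken from the question; adjust it to the actual number of columns in your file):

# read every column as character so nothing is coerced
input <- fread( input.file,
                header = FALSE,
                skip = start.row,
                nrows = rows.at.a.time,
                colClasses = rep( "character", 280 )
)

# convert the columns of interest back to numeric before summing
input[ , MM := .SD[ , 5:7 ] %>%
           lapply( as.numeric ) %>%
           do.call( what = cbind ) %>%
           rowSums()
       ]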

read.table in Chunks - error message

Ah got it!

library(data.table)

## 'infile' is an open file connection and 'headers' a vector of column names,
## both defined earlier in the original question
repeat {
    temp <- read.table(infile, header = FALSE, nrows = 10, sep = ",",
                       stringsAsFactors = FALSE)

    temp <- data.table(temp)
    setnames(temp, colnames(temp), headers)
    setkey(temp, Id)
    print(temp[1, Tags])

    if (nrow(temp) < 10) break
}

print("hi")

This still produces a warning message but no more errors:

Warning message:
In read.table(infile, header = FALSE, nrows = 10, sep = ",", stringsAsFactors = FALSE) :
incomplete final line found by readTableHeader on 'data/temp.csv'
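
That particular warning just means the file's last line has no trailing newline; read.table still parses the row correctly. If you want the loop to run silently, one option (a sketch, not part of the original answer) is to wrap the read in suppressWarnings():

temp <- suppressWarnings(
    read.table(infile, header = FALSE, nrows = 10, sep = ",",
               stringsAsFactors = FALSE)
)

Keep in mind this silences every warning raised by that call, not just the incomplete-final-line one.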

dplyr - mutate with seq_along memory issues on a large dataset

We can use row_number() from dplyr

library(dplyr)
df %>%
    group_by(name) %>%
    mutate(id2 = row_number())
# A tibble: 9 x 3
# Groups:   name [2]
#     V2 name    id2
#  <int> <chr> <int>
#1     1 A_185     1
#2     8 A_185     2
#3    17 A_185     3
#4    25 A_185     4
#5    33 A_185     5
#6     1 A_123     1
#7     5 A_123     2
#8    13 A_123     3
#9    23 A_123     4

Or make it faster with := from data.table

library(data.table)
setDT(df)[, id2 := seq_len(.N), by = name]

The error "Reached total allocation..." when using the fread function

The error message actually gives you the direct cause: you are trying to read a 30 GB file into 4 GB of RAM. The expensive solution is to upgrade your machine to 32 GB of RAM.

Unfortunately, R keeps the entire environment in RAM at all times.

The less expensive solution is to process the dataset in chunks.

You will find some help here and also in this article
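
If the per-chunk result is something small (a sum, a count, a filtered subset), the chunked pattern looks roughly like the sketch below; the file name, chunk size, and the column being summed are placeholders, not taken from your data:

library(data.table)

input.file <- "big_file.csv"   # placeholder path
chunk.size <- 1e6L             # rows per chunk; tune this to your RAM

header <- names(fread(input.file, nrows = 0L))  # read only the column names
offset <- 1L                   # skip the header line on the first pass
total  <- 0

repeat {
    chunk <- tryCatch(
        fread(input.file, header = FALSE, skip = offset,
              nrows = chunk.size, col.names = header),
        error = function(e) data.table()  # fread errors once skip passes the last line
    )
    if (nrow(chunk) == 0L) break

    ## per-chunk work: e.g. accumulate the sum of one (assumed numeric) column
    total <- total + sum(chunk[[1L]], na.rm = TRUE)

    offset <- offset + nrow(chunk)
    if (nrow(chunk) < chunk.size) break   # reached the end of the file
}

total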

Work with large raster mosaics in R without merging them to a single file (like lidR catalog)

You should look into the terra package, which provides exactly the functionality you're looking for through virtual raster tiles (VRTs). These let you treat a collection of raster files on disk as a single raster, while the terra API covers most of the same tasks you would otherwise do through the raster package.

First, let's create a sample of 4 rasters using the example straight from the ?terra::vrt() documentation.

library(terra)

r <- rast(ncols=100, nrows=100)
values(r) <- 1:ncell(r)
x <- rast(ncols=2, nrows=2)
filename <- paste0(tempfile(), "_.tif")
ff <- makeTiles(r, x, filename)
ff
#> [1] "/var/folders/b7/_6hwb39d43l71kpy59b_clhr0000gn/T//RtmpACJYNv/filedf6b65d4fca4_1.tif"
#> [2] "/var/folders/b7/_6hwb39d43l71kpy59b_clhr0000gn/T//RtmpACJYNv/filedf6b65d4fca4_2.tif"
#> [3] "/var/folders/b7/_6hwb39d43l71kpy59b_clhr0000gn/T//RtmpACJYNv/filedf6b65d4fca4_3.tif"
#> [4] "/var/folders/b7/_6hwb39d43l71kpy59b_clhr0000gn/T//RtmpACJYNv/filedf6b65d4fca4_4.tif"

Now, we'll read them in as a VRT, again straight from the same example. This lets us treat the four tiles on disk as a single SpatRaster.

vrtfile <- paste0(tempfile(), ".vrt")
v <- vrt(ff, vrtfile)
head(readLines(vrtfile))
#> [1] "<VRTDataset rasterXSize=\"100\" rasterYSize=\"100\">"
#> [2] " <SRS dataAxisToSRSAxisMapping=\"2,1\">GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Latitude\",NORTH],AXIS[\"Longitude\",EAST],AUTHORITY[\"EPSG\",\"4326\"]]</SRS>"
#> [3] " <GeoTransform> -1.8000000000000000e+02, 3.6000000000000001e+00, 0.0000000000000000e+00, 9.0000000000000000e+01, 0.0000000000000000e+00, -1.8000000000000000e+00</GeoTransform>"
#> [4] " <VRTRasterBand dataType=\"Float32\" band=\"1\">"
#> [5] " <NoDataValue>nan</NoDataValue>"
#> [6] " <ColorInterp>Gray</ColorInterp>"
v
#> class : SpatRaster
#> dimensions : 100, 100, 1 (nrow, ncol, nlyr)
#> resolution : 3.6, 1.8 (x, y)
#> extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> source : filedf6b216a737.vrt
#> name : filedf6b216a737
#> min value : 1
#> max value : 10000

Now we can create a simple example polygon to restrict our region of interest from the full -180 to 180 extent down to -90 to 90 longitude.

library(sf)

pl <- list(rbind(c(-90,-90), c(-90,90), c(90,90), c(90,-90), c(-90,-90)))
roi <- st_sfc(st_polygon(pl), crs = "EPSG:4326")

crop(v, roi)
#> class : SpatRaster
#> dimensions : 100, 50, 1 (nrow, ncol, nlyr)
#> resolution : 3.6, 1.8 (x, y)
#> extent : -90, 90, -90, 90 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> source : memory
#> name : filedf6b216a737
#> min value : 26
#> max value : 9975
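
From here you can keep treating v like any single raster; for example (a sketch, not from the original answer), write the cropped region to its own file or extract cell values at a few points, with the VRT reading from the underlying tiles on demand:

# save the cropped region as its own GeoTIFF
cropped <- crop(v, roi)
writeRaster(cropped, filename = file.path(tempdir(), "roi_crop.tif"), overwrite = TRUE)

# extract values at a couple of (lon, lat) points
pts <- cbind(x = c(-45, 30), y = c(10, -20))
extract(v, pts)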

Save simulated datasets individually (speed + memory limit)

As mentioned by F.Privé, if you need to save those files, it's better to use saveRDS(); that way you avoid redundant saving and loading.

jj <- 1:2000
for (i in 1:10) {
    data_list <- vector("list", length(jj))   # initialise the list before filling it
    for (j in jj) {
        dataA <- cbind(rnorm(j), rnorm(j), rnorm(j), rnorm(j),
                       rnorm(j), rnorm(j), rnorm(j), rnorm(j))
        dataB <- cbind(rnorm(j), rnorm(j), rnorm(j), rnorm(j),
                       rnorm(j), rnorm(j), rnorm(j), rnorm(j))
        data_list[[j]] <- dataA - dataB
    }
    saveRDS(data_list, paste0("Data", i, ".rds"))
}
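
Loading one of those files back in is then a single readRDS() call, e.g.:

data_list <- readRDS("Data1.rds")
str(data_list[[5]])   # the 5 x 8 matrix simulated for j = 5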

As for this particular data simulation, I would try to avoid loops: generate all the data at once (or at least in parts) and then store it in a data.frame with an index column. Something like:

dataA <- replicate(8, rnorm(sum(jj)))
dataB <- replicate(8, rnorm(sum(jj)))
data_list <- dataA - dataB
data <- as.data.frame(data_list)
data[, "ind"] <- rep(jj, times = jj)

But as I assume this is not your real data simulation, it is crucial to understand why you are simulating 1k lists of 2k data sets. Do they all need to be in separate lists? Are they all simulated the same way? And so on...


