Dealing with readLines() function in R

Suppose txt is the text from line 1 of your data that you read in with readLines.

If you want to split it into separate strings, each of which is a word, you can use strsplit, splitting at the space between each word.

> txt <- paste0(letters[1:10], LETTERS[1:10], collapse = " ")
> txt
## [1] "aA bB cC dD eE fF gG hH iI jJ" ## character vector of length 1
> length(txt)
[1] 1
> newTxt <- unlist(strsplit(txt, split = "\\s")) ## split the string at the spaces
> newTxt
## [1] "aA" "bB" "cC" "dD" "eE" "fF" "gG" "hH" "iI" "jJ"
## now the text is a character vector of length 10
> length(newTxt)
[1] 10
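
If the line might contain runs of whitespace (multiple spaces or tabs), splitting on "\\s+" avoids empty strings in the result. A small variation on the above (txt2 is just a made-up example):

> txt2 <- "aA  bB\tcC"                     ## two spaces and a tab
> unlist(strsplit(txt2, split = "\\s+"))
## [1] "aA" "bB" "cC"
> unlist(strsplit(txt2, split = "\\s"))
## [1] "aA" ""   "bB" "cC"                 ## note the empty string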

readLines taking up too much storage

Working from test.csv in your previous (since deleted) question, there is a stark difference in object size after conversion to numeric.

For the record, the file looks like

996; 1160.32; 1774.51; 4321.05; 2530.97; 2817.63; 1796.18; ...
1008; 1774.51; 1796.18; 1192.42; 1285.69; 1225.96; 2229.92; ...
1020; 1796.18; 1285.69; 711.67; 1761.44; 1016.74; 1671.90; ...
1032; 1285.69; 1761.44; 1016.74; 1671.90; 725.51; 2466.49; ...
1044; 1761.44; 1016.74; 725.51; 2466.49; 661.82; 1378.85; ...
1056; 1761.44; 1016.74; 2466.49; 661.82; 1378.85; 972.94; ...
1068; 2466.49; 661.82; 1378.85; 972.94; 2259.46; 3648.49; ...
1080; 2466.49; 1378.85; 972.94; 2259.46; 1287.72; 1074.63; ...

though the real test.csv has 751 lines of text, and each line has between 10001 and 10017 ;-delimited fields. This (unabridged) file is just under 64 MiB.
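
(If you want to reproduce the measurements below without the original file, a file of roughly the same shape can be generated; the dimensions here are taken from the description above, the values are random, and the exact sizes will differ.)

# generate 751 lines of ~10000 ";"-separated numbers (sizes are approximate)
set.seed(1)
fake <- vapply(seq_len(751), function(i) {
  paste(sprintf("%.2f", runif(10000, 500, 5000)), collapse = "; ")
}, character(1))
writeLines(fake, "test_synthetic.csv")
file.size("test_synthetic.csv") / 1024^2
# roughly 60-70 MiB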

Reading it in, parsing it, and then converting to numbers has a dramatic effect on its object sizes:

object.size(aa1 <- readLines("test.csv"))
# Warning in readLines("test.csv") :
# incomplete final line found on 'test.csv'
# 67063368 bytes

object.size(aa2 <- strsplit(aa1, "[; ]+"))
# 476021832 bytes

object.size(aa3 <- lapply(aa2, as.numeric))
# 60171040 bytes

and we end up with:

length(aa3)
# [1] 751

str(aa3[1:4])
# List of 4
# $ : num [1:10006] 996 1160 1775 4321 2531 ...
# $ : num [1:10008] 1008 1775 1796 1192 1286 ...
# $ : num [1:10009] 1020 1796 1286 712 1761 ...
# $ : num [1:10012] 1032 1286 1761 1017 1672 ...

So reading the file in as full-length strings isn't what explodes the memory; splitting each line into 10000+ fields is what does you in, because every character element carries its own fixed overhead:

### vec-1, nchar-0
object.size("")
# 112 bytes

### vec-5, nchar-0
object.size(c("","","","",""))
# 152 bytes

### vec-5, nchar-10
object.size(c("Dealing with Readlines() Function in Raa","Dealing with Readlines() Function in Raa","Dealing with Readlines() Function in Raa","Dealing with Readlines() Function in Raa","Dealing with Readlines() Function in Raa"))
# 160 bytes

If we look at the original data, we'll see this exploding:

object.size(aa1[1])   # whole lines at a time, so one line is 10000+ characters
# 89312 bytes
object.size(aa2[[1]]) # vector of 10000+ strings, each between 3-8 characters
# 633160 bytes

But fortunately, numbers are much smaller in memory:

object.size(1)
# 56 bytes
object.size(c(1,2,3,4,5))
# 96 bytes

and it scales much better: enough to take the data from 453 MiB (split, as strings) down to 57 MiB (split, as numbers) in R's storage.
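
To see that scaling at the sizes involved here (a sketch; the exact byte counts vary a little by R version and platform):

n <- 10000
x_chr <- sprintf("%.2f", runif(n, 500, 5000))  # 10,000 short strings
x_num <- as.numeric(x_chr)                     # the same values as doubles
object.size(x_chr)
# on the order of 600 KB, comparable to aa2[[1]] above
object.size(x_num)
# about 80 KB: 8 bytes per double plus a small header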


You will still see a bloom in R's memory usage when reading these files in. You can try to reduce it by converting to numbers immediately after strsplit; to be honest, I don't know how quickly R's garbage collector (a common feature of high-level languages) will return the memory, nor am I certain how this behaves in light of R's "global string pool". But if you are interested, you can try this adaptation of your function.

func <- function(path) {
  aa1 <- readLines(path)
  ## split each line on ";" (plus surrounding spaces) and convert to numeric
  aa2 <- lapply(aa1, function(st) as.numeric(strsplit(st, "[; ]+")[[1]]))
  aa2
}

(I make no promises that it will not still "bloom" your memory usage, but perhaps it's a little less-bad.)
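
If you want to nudge the collector along, one option (a sketch only, untested against your data; func2 is a name I'm making up here) is to drop the raw lines explicitly and call gc() before returning:

func2 <- function(path) {
  aa1 <- readLines(path)
  out <- lapply(aa1, function(st) as.numeric(strsplit(st, "[; ]+")[[1]]))
  rm(aa1)  # the raw lines are no longer needed
  gc()     # run the garbage collector now rather than waiting
  out
}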

And then the canonical replacement for your for loop (though that loop is fine) is

dat <- lapply(files, func)

readLines function with new version of R

My findings using various docker images:

  • The example works fine using R version 3.4.4 (2018-03-15) -- "Someone to Lean On" from rocker/r-ver:3.4.4.
  • The example hangs as described using R version 3.5.0 (2018-04-23) -- "Joy in Playing" from rocker/r-ver:3.5.0.
  • The example hangs as described using R Under development (unstable) (2018-05-19 r74746) -- "Unsuffered Consequences" from rocker/drd.

It looks as if the change mentioned in the release notes for version 3.5.1 is unrelated. I have sent my findings to r-devel and will report back the outcome (in chronological order):

  • This is considered a bug, but it's unclear how and when it will be fixed.

  • A reasonable workaround: send end-of-file (EOF, Ctrl-D) in addition to end-of-line.

  • The bug has since been marked as fixed. I can confirm that R Under development (unstable) (2018-06-02 r74838) -- "Unsuffered Consequences" works as expected.

  • The example works fine using R version 3.5.1 (2018-07-02) -- "Feather Spray".

readLines killing R in purrr::map

This will work:

result <- map(paths, readLines, n = 1)

From `?purrr::map`:

Usage
map(.x, .f, ...)
... Additional arguments passed on to .f.
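
The same thing can be written with purrr's formula shorthand, which makes it explicit where the extra argument goes (a sketch; paths is assumed to be your character vector of file paths):

library(purrr)

result <- map(paths, ~ readLines(.x, n = 1))  # equivalent to map(paths, readLines, n = 1)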

R and readLines of webpage text

Here's one approach to read this in (using two packages I maintain and the terrific splitstackshape package). You'll need the dev version of qdapTools.

devtools::install_github("trinker/qdapTools")
library(qdapTools); library(qdapRegex); library(splitstackshape)
url<-"http://www.arrs.net/MaraList/ML_2014.htm"

m <- readLines(url)[-c(1:7, 2760:2767)]

## Split into lists by country
x <- loc_split(m, unique(grep("<B><FONT", m)))

## Clean up country names
nms <- rm_angle(sapply(x, `[`, 1))

## remove the html country name from the data and convert to a data.frame
dat <- list2df(setNames(lapply(x, `[`, -1), nms), "dats", "Country")[, 2:1]

## Use a hand-parsing technique to locate the column widths
## I added a # before each column in row one of the data
## gregexpr tells us the locations of the # characters
det <- "AAR #26#Jan #King George Island # #27+25 #White Continent #4:03:30 #Steve Hibbs (USA) #4:13:02 #Suzy Seeley (54,TX/USA) "
widths <- gregexpr("#", det)[[1]]

## insert a # character at those positions, as # appears nowhere else in the data set
for (i in widths){
  substring(dat[["dats"]], i, i) <- "#"
}

## split columns on # character
out <- cSplit(dat, 2, sep="#")

out

ReadLines using multiple sources in R

As all of these documents are csv files with 38 columns, you can combine them very easily using:

MasterList <- c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt")
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)

raw_dat <- lapply(urls, read.csv, skip = 3, header = FALSE)
dat <- do.call(rbind, raw_dat)

What happens here and how is this looping?
The lapply function basically creates a list with 3 (= length(urls)) entries and populates them with: read.csv(urls[i], skip = 3, header = FALSE). So raw_dat is a list of 3 data.frames containing your data, and do.call(rbind, raw_dat) binds them together.
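
Written out as an explicit loop, just to show what the lapply call is doing (same result):

raw_dat <- vector("list", length(urls))
for (i in seq_along(urls)) {
  raw_dat[[i]] <- read.csv(urls[i], skip = 3, header = FALSE)
}
dat <- do.call(rbind, raw_dat)  # stack the data.frames row-wise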

The header row seems somehow broken; that's why I use skip = 3, header = FALSE, which is equivalent to your bod[-c(1,2,3)].

If all the scraped data fits into memory, you can combine it this way and, at the end, write it to a file using:

write.csv(dat, "[Directory]/ScrapeTest.txt")

