How to Read a Subset of a Large Dataset in R

How to read a subset of a large dataset in R?

Use the skip= and nrows= parameters of read.table:

read.table("file.txt", skip = , nrows = )

skip= takes the number of lines to skip before reading starts, and nrows= takes the maximum number of rows to read, so just supply those counts after the =.

In other words, nrows= controls how far into the file the import reaches.
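
For example, to read 100 rows starting after the first 1000 lines (the file name and the counts here are only placeholders):

# skip the first 1000 lines, then read at most 100 rows
subset_df <- read.table("file.txt", skip = 1000, nrows = 100)

Note that a header line counts toward the lines skipped, so if you skip past it you may need to read the column names separately.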

I suggest reading https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html if you haven't done so already.

Also, please see one of my questions:

R - Reading lines from a .txt-file after a specific line

It touches on a similar subject.

Another possible way is to use grep() inside skip=:

read.table(...,skip=grep("2005-12-31", readLines("File.txt")),nrows=365)

Here grep() returns the number of the line in readLines("File.txt") that matches "2005-12-31", so skip= passes over everything up to and including that line and reading starts right after it. nrows= then stops the import after 365 lines, so you end up with one year of data (provided one line equals one date).

This seems kinda complicated, but it's the only way I know how to solve this.
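
One caveat (my addition, not part of the original answer): skipping into the middle of the file also skips any header line, so you may want to read the column names separately and reattach them, roughly like this:

# read just the header row, then the block of rows we want (file name is hypothetical)
hdr  <- read.table("File.txt", nrows = 1, header = FALSE, stringsAsFactors = FALSE)
year <- read.table("File.txt",
                   skip = grep("2005-12-31", readLines("File.txt")),
                   nrows = 365)
names(year) <- unlist(hdr)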

Subsetting a large dataset in R by Months

The documentation for selectByDate() states that the first argument is

A data frame containing a date field in hourly or high resolution format.

That means that you need two changes to your code.

Firstly, the Date field needs to be named date (with a lower-case d). You can do this when you convert it from character. (paste() isn't doing anything here, so you can drop it.)

COVIDcases$date <- as.Date(COVIDcases$Date, "%m/%d/%y")

Secondly, you need to pass the whole of your data frame, not just that one column.

JanuaryCasesdata <- selectByDate(COVIDcases, start = "2020-01-01", end = "2020-01-31")
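
As an aside (my suggestion, not part of the original answer), selectByDate() from the openair package also accepts month and year arguments, which may read more naturally when subsetting by months; check ?selectByDate for your version:

library(openair)

# equivalent selection of January 2020 using the month/year arguments
JanuaryCasesdata <- selectByDate(COVIDcases, year = 2020, month = 1)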

Fast subset/lookup/filter in large datasets

If you can call many tuples of values in parallel instead of sequentially...

set.seed(1)
DF <- data.frame(factorA = rep(letters[1:3], 100000),
                 factorB = sample(rep(letters[1:3], 100000),
                                  3*100000, replace = FALSE),
                 numC = round(rnorm(3*100000), 2),
                 numD = round(rnorm(3*100000), 2))

library(data.table)
DT = data.table(DF)

f = function(vA, vB, nC, nD, dat = DT){

  # 1) match factorA/factorB exactly and roll numC to the nearest value,
  #    recording each tuple's group number, matched row numbers and numD values
  # 2) within each group, roll nD to the nearest numD and keep the first match
  rs <- dat[.(vA, vB, nC), on=.(factorA, factorB, numC), roll="nearest",
            .(g = .GRP, r = .I, numD), by=.EACHI][
            .(seq_along(vA), nD), on=.(g, numD), roll="nearest", mult="first",
            r]

  # return the matched rows of the original table
  dat[rs]
}

# example usage
mDT = data.table(vA = c("a", "b"), vB = c("c", "c"), nC = c(.3, .5), nD = c(.6, .8))

mDT[, do.call(f, .SD)]

# factorA factorB numC numD
# 1: a c 0.3 0.60
# 2: b c 0.5 0.76

Comparing with the other answers' solutions, which must be run rowwise (closest3() below comes from one of those answers)...

# check the results match
library(magrittr)
dt = copy(DT)
mDT[, closest3(vA, vB, nC, nD), by=.(mr = seq_len(nrow(mDT)))]

# mr factorA factorB numC numD
# 1: 1 a c 0.3 0.60
# 2: 2 b c 0.5 0.76

# check speed for a larger number of comparisons

nr = 100
system.time( mDT[rep(1:2, each=nr), do.call(f, .SD)] )
# user system elapsed
# 0.07 0.00 0.06

system.time( mDT[rep(1:2, each=nr), closest3(vA, vB, nC, nD), by=.(mr = seq_len(nr*nrow(mDT)))] )
# user system elapsed
# 10.65 2.30 12.60

How it works

For each tuple in .(vA, vB, nC), we look up rows that match vA and vB exactly and then "roll" to the nearest value of nC -- this doesn't quite match the OP's rule (of looking within a bound of nC*[0.9, 1.1]), but that rule could easily be applied after-the-fact. For each match, we take the tuple's "group number," .GRP, the row numbers that were matched, and the values of numD on those rows.

Then we join on group number and nD, matching exactly on the former and rolling to nearest on the latter. If there are multiple nearest matches, we take the first with mult="first".

We can then take the row number of each tuple's match and look it up in the original table.
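
To see the rolling-join mechanic in isolation, here is a minimal sketch on a toy table of my own (not from the answer above): an exact match on a key column combined with rolling a numeric column to the nearest value.

library(data.table)
toy <- data.table(grp = c("a", "a", "b"), x = c(1.0, 2.0, 1.5), label = c("low", "high", "mid"))

# exact match on grp, roll x to the nearest available value
toy[.("a", 1.8), on = .(grp, x), roll = "nearest"]
#    grp   x label
# 1:   a 1.8  high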

Performance

So the vectorized solution seems to have a big performance benefit, as usual with R.

If you can only pass ~5 tuples at a time (as for the OP) instead of 200, there will still probably be benefits from this approach vs which.min and similar, thanks to binary search, as @F.Privé suggested in a comment.

As noted in @HarlanNelson's answer, adding indices to the table might further improve performance. See his answer and ?setindex.
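
A minimal sketch of that idea, assuming the same lookup columns as above:

# build a secondary index once so repeated on=.(...) joins can reuse it
setindex(DT, factorA, factorB, numC)
indices(DT)
# [1] "factorA__factorB__numC"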

Fix for numC rolling to one value

Thanks to the OP for identifying this problem:

DT2 = data.table(id = "A", numC = rep(c(1.01,1.02), each=5), numD = seq(.01,.1,.01))
DT2[.("A", 1.011), on=.(id, numC), roll="nearest"]
# id numC numD
# 1: A 1.011 0.05

Here, we see one row, but we should be seeing five. One fix (though I'm not sure why) is converting to integers:

DT3 = copy(DT2)
DT3[, numC := as.integer(numC*100)]
DT3[, numD := as.integer(numD*100)]
DT3[.("A", 101.1), on=.(id, numC), roll="nearest"]
# id numC numD
# 1: A 101 1
# 2: A 101 2
# 3: A 101 3
# 4: A 101 4
# 5: A 101 5

R: Read in a subset of lines and turn it into a conventional format (data.table approach preferred)

Let's replicate the process up to the fread() import:

# your example string
text_file <-"My\tname\tis\tAlpha\nMy\tname\tis\t\t\tBravo\nMy\tname\tis\tCharlie\nMy\tname\tis\t\t\tDelta\nMy\tname\tis\tEcho"

# import
library(data.table)
lines <- fread(text_file, sep = NULL, header = FALSE, skip = 1, nrows = 5)
lines
V1
1: My\tname\tis\t\t\tBravo
2: My\tname\tis\tCharlie
3: My\tname\tis\t\t\tDelta
4: My\tname\tis\tEcho

When you try

as.character(lines)
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"

it converts the whole data.table to character: as.character() works column by column, so each column becomes a single string of its concatenated values. See below:

as.character(data.table(lines$V1, lines$V1))
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
[2] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"

What you want is to extract just lines$V1, which is already a character vector.

lines$V1
[1] "My\tname\tis\t\t\tBravo" "My\tname\tis\tCharlie" "My\tname\tis\t\t\tDelta" "My\tname\tis\tEcho"

