How to read a subset of large dataset in R?
Use the skip= parameter in read.table:
read.table("file.txt", skip = , nrows = )
Both skip= and nrows= take whole numbers, so just add them after the =. skip= is the number of lines to skip before reading begins, and nrows= is the maximum number of rows to read, i.e. how deep into the file the import goes.
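As a minimal sketch of the two arguments, the following writes a small throwaway file and reads only rows 4 to 7 of it. The temporary file, its contents, and the name first_chunk are made up for illustration.

```r
# Hypothetical demo file: 10 rows, two columns, no header
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:10, value = (1:10)^2), tmp,
            row.names = FALSE, col.names = FALSE)

# Skip the first 3 lines, then read at most 4 rows (lines 4 to 7)
first_chunk <- read.table(tmp, skip = 3, nrows = 4)
first_chunk$V1  # ids of the rows that were read
```

If the file has a header line, that line counts towards skip=, so skipping past it loses the column names; passing col.names= is one way to restore them.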
I suggest reading https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html if you haven't done so already.
Also, please see one of my questions, "R - Reading lines from a .txt-file after a specific line", which touches on the same subject.
Another possible way is to use grep() inside skip=:
read.table(..., skip = grep("2005-12-31", readLines("File.txt")), nrows = 365)
This line skips every line up to and including the one that grep() matches and starts reading on the line after it. nrows= then stops the reading after 365 lines (this way you have read one year of dates, provided one line equals one date).
This seems kinda complicated, but it's the only way I know how to solve this.
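Here is a small, self-contained sketch of the grep() approach. The file contents, the column names, and the variables start_line and after are assumptions made for the demo. It also assumes the pattern matches at least one line; [1] keeps the first match, because skip= expects a single number.

```r
# Hypothetical demo file: one date per line plus a value, no header
tmp <- tempfile(fileext = ".txt")
dates <- seq(as.Date("2005-12-29"), as.Date("2006-01-05"), by = "day")
write.table(data.frame(date = format(dates), value = seq_along(dates)), tmp,
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# Line number of the first line containing the marker date
start_line <- grep("2005-12-31", readLines(tmp), fixed = TRUE)[1]

# Skip through the marker line, then read the next 3 lines
after <- read.table(tmp, skip = start_line, nrows = 3,
                    col.names = c("date", "value"))
after
```

Note that readLines() reads the whole file once just to find the marker, so on a very large file this costs one full pass before read.table() starts.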
Subsetting a large dataset in R by Months
The documentation for selectByDate() states that the first argument is
A data frame containing a date field in hourly or high resolution format.
That means that you need two changes to your code.
Firstly, the Date field needs to be named date (with a lower case d). You can do this when you convert from character. (paste() isn't doing anything here, so you can get rid of it.)
COVIDcases$date <- as.Date(COVIDcases$Date, "%m/%d/%y")
Secondly, you need to pass the whole of your data frame, not just that one column.
JanuaryCasesdata <- selectByDate(COVIDcases, start = "2020-01-01", end = "2020-01-31")
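If you want to check the date conversion without loading the package, the same January subset can be sketched in base R. This is a swapped-in alternative, not what selectByDate() does internally, and the three-row COVIDcases data frame below is invented for the demo.

```r
# Hypothetical stand-in for the real data
COVIDcases <- data.frame(Date = c("1/15/20", "1/31/20", "2/1/20"),
                         cases = c(1, 5, 8))

# Same conversion as above: character "m/d/yy" to Date, stored in a lower-case column
COVIDcases$date <- as.Date(COVIDcases$Date, "%m/%d/%y")

# Keep the rows whose date falls inside January 2020 (both ends inclusive)
in_january <- COVIDcases$date >= as.Date("2020-01-01") &
  COVIDcases$date <= as.Date("2020-01-31")
JanuaryCasesdata <- COVIDcases[in_january, ]
JanuaryCasesdata
```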
Fast subset/lookup/filter in large datasets
If you can call many tuples of values in parallel instead of sequentially...
set.seed(1)
DF <- data.frame(factorA = rep(letters[1:3], 100000),
                 factorB = sample(rep(letters[1:3], 100000),
                                  3*100000, replace = FALSE),
                 numC = round(rnorm(3*100000), 2),
                 numD = round(rnorm(3*100000), 2))
library(data.table)
DT = data.table(DF)
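f = function(vA, vB, nC, nD, dat = DT){
  # step 1: match factorA/factorB exactly, roll to nearest numC, keep group, row number and numD
  # step 2: within each group, roll to nearest numD and keep the first match's row number
  rs <- dat[.(vA, vB, nC), on=.(factorA, factorB, numC), roll="nearest",
            .(g = .GRP, r = .I, numD), by=.EACHI][
              .(seq_along(vA), nD), on=.(g, numD), roll="nearest", mult="first", r]
  # look the matched rows up in the table that was passed in
  dat[rs]
}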
# example usage
mDT = data.table(vA = c("a", "b"), vB = c("c", "c"), nC = c(.3, .5), nD = c(.6, .8))
mDT[, do.call(f, .SD)]
# factorA factorB numC numD
# 1: a c 0.3 0.60
# 2: b c 0.5 0.76
Comparing with the other solutions, which must be run rowwise (closest3() comes from one of those solutions and is not defined here)...
# check the results match
library(magrittr)
dt = copy(DT)
mDT[, closest3(vA, vB, nC, nD), by=.(mr = seq_len(nrow(mDT)))]
# mr factorA factorB numC numD
# 1: 1 a c 0.3 0.60
# 2: 2 b c 0.5 0.76
# check speed for a larger number of comparisons
nr = 100
system.time( mDT[rep(1:2, each=nr), do.call(f, .SD)] )
# user system elapsed
# 0.07 0.00 0.06
system.time( mDT[rep(1:2, each=nr), closest3(vA, vB, nC, nD), by=.(mr = seq_len(nr*nrow(mDT)))] )
# user system elapsed
# 10.65 2.30 12.60
How it works
For each tuple in .(vA, vB, nC), we look up rows that match vA and vB exactly and then "roll" to the nearest value of nC. This doesn't quite match the OP's rule (of looking within a bound of nC*[0.9, 1.1]), but that rule could easily be applied after the fact. For each match, we take the tuple's "group number," .GRP, the row numbers that were matched, and the values of numD on those rows.
Then we join on group number and nD, matching exactly on the former and rolling to nearest on the latter. If there are multiple nearest matches, we take the first with mult="first".
We can then take the row number of each tuple's match and look it up in the original table.
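To see a single rolling step in isolation, here is a minimal sketch on a toy table. The table toy, its columns grp and x, and the name nearest_row are invented for the demo. The last column listed in on= is the one that rolls, and the columns before it must match exactly.

```r
library(data.table)

# Hypothetical toy table: group "a" has x values 1, 2 and 4
toy <- data.table(grp = c("a", "a", "a", "b"), x = c(1, 2, 4, 10))

# Exact match on grp, then roll to the nearest x;
# which = TRUE returns the matched row number instead of the joined row
nearest_row <- toy[.("a", 2.9), on = .(grp, x), roll = "nearest", which = TRUE]

# Look that row number up in the original table
toy[nearest_row]
```

Within group "a", 2.9 is closer to 2 than to 4, so the lookup lands on the second row.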
Performance
In the timings above, the vectorized solution takes about 0.06 seconds elapsed for 200 tuples, against about 12.6 seconds for the rowwise one, so it has a big performance benefit, as usual with R.
If you can only pass ~5 tuples at a time (as for the OP) instead of 200, there will still probably be benefits from this approach vs which.min and similar, thanks to binary search, as @F.Privé suggested in a comment.
As noted in @HarlanNelson's answer, adding indices to the table might further improve performance. See his answer and ?setindex.
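For reference, adding an index looks roughly like this. It is a sketch on a made-up three-row table (idx_demo is a hypothetical name), and whether the index actually speeds up the rolling join above is something to measure rather than assume.

```r
library(data.table)

# Hypothetical small table with the same column names as the join columns above
idx_demo <- data.table(factorA = c("a", "b", "a"),
                       factorB = c("c", "c", "b"),
                       numC = c(0.3, 0.5, 0.1))

# Build a secondary index on the join columns; unlike setkey(), rows are not reordered
setindex(idx_demo, factorA, factorB, numC)

# List the indices currently stored on the table
indices(idx_demo)
```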
Fix for numC rolling to one value
Thanks to the OP for identifying this problem:
DT2 = data.table(id = "A", numC = rep(c(1.01,1.02), each=5), numD = seq(.01,.1,.01))
DT2[.("A", 1.011), on=.(id, numC), roll="nearest"]
# id numC numD
# 1: A 1.011 0.05
Here, we see one row, but we should be seeing five. One fix (though I'm not sure why) is converting to integers:
DT3 = copy(DT2)
DT3[, numC := as.integer(numC*100)]
DT3[, numD := as.integer(numD*100)]
DT3[.("A", 101.1), on=.(id, numC), roll="nearest"]
# id numC numD
# 1: A 101 1
# 2: A 101 2
# 3: A 101 3
# 4: A 101 4
# 5: A 101 5
R: Read in a subset of lines and turn it into a conventional format (data.table approach preferred)
Let's replicate the process up to the fread() import:
# your example string
text_file <-"My\tname\tis\tAlpha\nMy\tname\tis\t\t\tBravo\nMy\tname\tis\tCharlie\nMy\tname\tis\t\t\tDelta\nMy\tname\tis\tEcho"
# import
library(data.table)
lines <- fread(text_file, sep = NULL, header = FALSE, skip = 1, nrows = 5)
lines
V1
1: My\tname\tis\t\t\tBravo
2: My\tname\tis\tCharlie
3: My\tname\tis\t\t\tDelta
4: My\tname\tis\tEcho
When you try
as.character(lines)
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
as.character() converts the whole data.table to character one column at a time, so each column is collapsed into a single deparsed string rather than staying a vector. See below:
as.character(data.table(lines$V1, lines$V1))
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
[2] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
What you want is to extract just lines$V1, which is already a character vector.
lines$V1
[1] "My\tname\tis\t\t\tBravo" "My\tname\tis\tCharlie" "My\tname\tis\t\t\tDelta" "My\tname\tis\tEcho"