set.seed with R 2.15.2
set.seed() reinitializes the random number generator, so the same seed reproduces the same draws every time:
set.seed(12345)
rnorm(5)
[1] 0.5855288 0.7094660 -0.1093033 -0.4534972 0.6058875
set.seed(12345)
rnorm(5)
[1] 0.5855288 0.7094660 -0.1093033 -0.4534972 0.6058875
set.seed(12345)
rnorm(5)
[1] 0.5855288 0.7094660 -0.1093033 -0.4534972 0.6058875
R: Strange behavior while saving list() with save() from function output
look at
> attr(ok[[1]]$terms,".Environment")
<environment: 0x9bcf3f8>
> attr(ok2[[1]]$terms,".Environment")
<environment: R_GlobalEnv>
also
> ls(envir = attr(ok[[1]]$terms,".Environment"))
[1] "i" "k" "tt"
So ok is dragging around the environment of the function with it.
Also read ?object.size
The calculation is of the size of the object, and excludes the
space needed to store its name in the symbol table.
Associated space (e.g. the environment of a function and what the
pointer in a ‘EXTPTRSXP’ points to) is not included in the
calculation.
For example, define a test2 and an ok3:
test2 = function(k){
  tt = vector('list',k)
  for(i in 1:k) tt[[i]] = lm(a0~b1+b2+b3,data = data)
  rr = tt
  tt
}
ok3 <- test2(2)
save(ok3, file = 'ok3.RData')
> file.info('ok3.RData')$size
[1] 5043933
> file.info('ok.RData')$size
[1] 3366005
> file.info('ok2.RData')$size
[1] 1678851
> ls(envir = attr(ok3[[1]]$terms,".Environment"))
[1] "i" "k" "rr" "tt"
So the saved file for ok is roughly twice as big as the one for ok2 because its environment holds the extra tt, and the file for ok3 is three times as big because its environment holds both tt and rr.
> c(object.size(ok),object.size(ok2),object.size(ok3))
[1] 4019336 4019336 4019336
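The environment capture described above can be reproduced with a built-in dataset. This is only a sketch: fit_inside is a hypothetical helper, and mtcars stands in for the data from the question.

```r
# Hypothetical helper: fit models inside a function, as test() does in the question.
fit_inside <- function(k) {
  tt <- vector("list", k)
  for (i in seq_len(k)) tt[[i]] <- lm(mpg ~ wt, data = mtcars)
  tt
}

fits <- fit_inside(2)

# The formula was created inside the function, so its environment is the
# function's evaluation frame, not the global environment.
env <- attr(fits[[1]]$terms, ".Environment")
is_global <- identical(env, globalenv())

# Everything still living in that frame travels with the fitted model.
captured <- ls(envir = env)
print(captured)
```

Anything listed in captured is written out by save() along with the models, which is why the file grows with every extra local variable.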
There is related discussion here
table() generating NAs when there are no NAs in the underlying data
After installing the data.table package and doing some preliminaries...
require(data.table)
n0<- 1e5
n <- 1e6
DT <- data.table(A1 = sample(1:n0, n, replace = TRUE),B1 = sample(1:n0, n, replace = TRUE))
This does the trick:
setkey(DT,A1)
DT[
DT[,.N,by=A1],
countC:=N
]
When you access a data.table with DT[i,j], you can select rows with i and do something else with j, just like in data.frames.
DT[,.N,by=A1] selects all rows (since i is blank) and counts rows for each "A1" using the special variable .N.
After setting column "A1" as key for DT, we can pass a data.table -- in this case DT[,.N,by=A1] -- in i to merge back the information in the latter data.table. In j, we create a new column in DT using countC:=N. The three vignettes on data.table's CRAN page are a good place to start learning more about how this works.
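As a rough illustration on a toy table (the six values are made up), the same per-group count can also be attached in one step with a grouped := and .N. Note that this is a swapped-in shortcut rather than the keyed join shown above, and it assumes the data.table package is installed.

```r
library(data.table)

# Toy data: A1 takes the values 1, 2 and 3 with different frequencies.
DT <- data.table(A1 = c(1, 1, 2, 3, 3, 3))

# .N is the number of rows in the current group; := adds the column by
# reference, so every row receives the count of its own A1 group.
DT[, countC := .N, by = A1]

print(DT)
```

Because := works by reference, DT itself is modified and no merge step is needed.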
The question at hand.
Oh, I think I see what the original problem was. Suppose unique(x)=c(1,2,4). If you try table(x)[x], you will be trying to access table(x)[1], table(x)[2] and table(x)[4]. The last one is undefined since the length of the table is only 3. R always returns NA when we access indices greater than the length of a vector. For example, look at (1:3)[4].
In your case, if you are missing any unique values in 1:n0 that are not at the very top, you will see NAs.
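A small base R sketch of the problem just described, using a made-up vector x. Indexing by name instead of by position is one possible way around it; it is offered here as an assumption about what was intended, not as the approach from the question.

```r
# Made-up data: the value 3 never occurs, so the table has only 3 cells.
x <- c(1, 2, 4, 4)
tab <- table(x)

# Positional indexing: position 4 does not exist, so those lookups are NA.
by_position <- as.vector(tab[x])

# Indexing by name looks up the cell labelled "4" instead of position 4.
by_name <- as.vector(tab[as.character(x)])

print(by_position)
print(by_name)
```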
Assignment by reference with sum() in data.table() yields incorrect result
This is not a problem with data.table, but rather, human error ;)
To replicate, here is some sample data. I've included some NA values to see the results of the sum function with and without the argument to remove NAs, which is na.rm, not na.remove:
set.seed(1)
test <- data.table(Year = rep("Y1", 15),
ID = c(rep(210, 9), rep(3197, 6)),
Count = sample(c(0, 1, NA), 15,
prob=c(.2, .65, .15),
replace=TRUE),
key = "Year,ID")
test
# Year ID Count
# 1: Y1 210 1
# 2: Y1 210 1
# 3: Y1 210 1
# 4: Y1 210 NA
# 5: Y1 210 1
# 6: Y1 210 NA
# 7: Y1 210 NA
# 8: Y1 210 0
# 9: Y1 210 1
# 10: Y1 3197 1
# 11: Y1 3197 1
# 12: Y1 3197 1
# 13: Y1 3197 0
# 14: Y1 3197 1
# 15: Y1 3197 0
Before we create our new column, let's just do some aggregation to see what happens with the different options for sum.
test[, list(annualCount = sum(Count)), by = key(test)]
# Year ID annualCount
# 1: Y1 210 NA
# 2: Y1 3197 4
test[, list(annualCount = sum(Count, na.rm = TRUE)), by = key(test)]
# Year ID annualCount
# 1: Y1 210 5
# 2: Y1 3197 4
Now, create your new column, with the results you expected.
test[, annualCount := sum(Count, na.rm = TRUE), by = key(test)][]
# Year ID Count annualCount
# 1: Y1 210 1 5
# 2: Y1 210 1 5
# 3: Y1 210 1 5
# 4: Y1 210 NA 5
# 5: Y1 210 1 5
# 6: Y1 210 NA 5
# 7: Y1 210 NA 5
# 8: Y1 210 0 5
# 9: Y1 210 1 5
# 10: Y1 3197 1 4
# 11: Y1 3197 1 4
# 12: Y1 3197 1 4
# 13: Y1 3197 0 4
# 14: Y1 3197 1 4
# 15: Y1 3197 0 4
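To see why a misspelled argument goes unnoticed, here is a short base R sketch with made-up vectors. sum() takes its values through ..., so an argument it does not recognise, such as na.remove, is treated as one more value to add instead of raising an error.

```r
counts <- c(1, 1, NA, 0)

# Default behaviour: any NA makes the whole sum NA.
plain <- sum(counts)

# Correct argument name: NAs are dropped before summing.
with_na_rm <- sum(counts, na.rm = TRUE)

# Misspelled name: na.remove = TRUE is summed as an extra value (TRUE counts
# as 1) and the NA is still there, so the result is NA again.
misspelled <- sum(counts, na.remove = TRUE)

# Without any NA the stray TRUE silently inflates the total by one.
misspelled_no_na <- sum(c(1, 1, 0), na.remove = TRUE)

print(c(plain, with_na_rm, misspelled, misspelled_no_na))
```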
Quickly reading very large tables as dataframes
An update, several years later
This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:
Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.
Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.
Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).
read.csv.raw from iotools provides a third option for quickly reading CSV files.
Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.
Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
Set nrows = the number of records in your data (nmax in scan).
Make sure that comment.char="" to turn off interpretation of comments.
Explicitly define the classes of each column using colClasses in read.table.
Setting multi.line=FALSE may also improve performance in scan.
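The read.table settings above might be combined as in the sketch below. The temporary file, its three columns, and the row count are made up for illustration; with real data you would substitute your own path, nrows and column classes.

```r
# Write a tiny made-up CSV file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, score = c(1.5, 2.5, 3.5), label = c("a", "b", "c")),
  csv_path,
  row.names = FALSE
)

dat <- read.table(
  csv_path,
  header = TRUE,
  sep = ",",
  nrows = 3,                 # known number of records
  comment.char = "",         # turn off interpretation of comments
  colClasses = c("integer", "numeric", "character")  # skip type guessing
)

str(dat)
```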
If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with saveRDS, then next time you can retrieve it faster with readRDS.
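A minimal sketch of that read-once, reload-fast workflow, with a small made-up data frame standing in for the large table and a temporary file as the cache:

```r
# Stand-in for a data frame that was expensive to read from a flat file.
df <- data.frame(id = 1:3, score = c(1.5, 2.5, 3.5))

# Save it once in R's binary serialization format.
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)

# In a later session, restore it without re-parsing any text.
restored <- readRDS(rds_path)

print(identical(df, restored))
```

Unlike load, readRDS returns the object, so you can assign it to whatever name you like.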