How to Read Specific Rows of a CSV File with the fread Function

How can I read specific, non-consecutive rows using data.table::fread (something like the select argument, but for rows)?

One method (although a little brute-force) is to use sed to extract just the lines you want.

Recall that fread takes file= as well as cmd=, as in

library(data.table)
fwrite(iris, "iris.csv")
fread(cmd = "head -n 3 iris.csv")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.1 3.5 1.4 0.2 setosa
# 2: 4.9 3.0 1.4 0.2 setosa

(Two data rows, since head counts the header as one of its three lines; it doesn't know or care about header rows.)
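To get the first n data rows plus the header, then, ask head for n + 1 lines, for example:

fread(cmd = "head -n 4 iris.csv")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.1 3.5 1.4 0.2 setosa
# 2: 4.9 3.0 1.4 0.2 setosa
# 3: 4.7 3.2 1.3 0.2 setosa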

Try this:

want_rows <- c(1, 3, 147:149)
# due to the header row, prepend line 1 (the header) and add 1 to each of want_rows
paste0(c(1, 1 + want_rows), "p")
# [1] "1p"   "2p"   "4p"   "148p" "149p" "150p"
writeLines(paste0(c(1, 1 + want_rows), "p"), "commands.sed")

fread(cmd = "sed -n -f commands.sed iris.csv")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.1 3.5 1.4 0.2 setosa
# 2: 4.7 3.2 1.3 0.2 setosa
# 3: 6.3 2.5 5.0 1.9 virginica
# 4: 6.5 3.0 5.2 2.0 virginica
# 5: 6.2 3.4 5.4 2.3 virginica
iris[want_rows,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 147 6.3 2.5 5.0 1.9 virginica
# 148 6.5 3.0 5.2 2.0 virginica
# 149 6.2 3.4 5.4 2.3 virginica

If your wanted rows include significant runs of consecutive lines ("ranges"), you could optimize this a little for sed by collapsing them, giving an effective command line of sed -ne '1p;2p;4p;148,150p' for the same effect.
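One way to build that collapsed expression in R, a sketch that just groups consecutive file line numbers into start,end pairs:

lines_wanted <- sort(unique(c(1, 1 + want_rows)))  # file line numbers, header included
grp <- cumsum(c(1, diff(lines_wanted) != 1))       # consecutive runs share a group id
ranges <- tapply(lines_wanted, grp, function(v)
  if (length(v) == 1) paste0(v, "p") else paste0(v[1], ",", v[length(v)], "p"))
sed_expr <- paste(ranges, collapse = ";")
sed_expr
# [1] "1,2p;4p;148,150p"
fread(cmd = paste0("sed -ne '", sed_expr, "' iris.csv"))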

There is another method, à la "every so many rows", described here: https://www.thegeekstuff.com/2009/09/unix-sed-tutorial-printing-file-lines-using-address-and-patterns/. I don't know how tightly you can control this (every nth row starting from some arbitrary number, for instance), and I don't know that it matches your intent or need, since it sounds like you will have arbitrary line numbers.
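If "every nth row" does turn out to be what you need and you have GNU sed, its first~step address form does exactly that (this is a GNU extension, so an assumption on my part; BSD/macOS sed does not support it). For example, the header plus every 5th data row, starting from the first data row on file line 2:

# GNU sed only: '2~5p' prints lines 2, 7, 12, ...; '1p' keeps the header
fread(cmd = "sed -n -e '1p' -e '2~5p' iris.csv")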

How to read in the last N rows of a CSV file with data.table fread

You can have fread read the last N rows of a .csv using the cmd parameter:

fread(cmd="tail -10 iris.csv")

Since tail skips the header line, fread will assign default column names (V1, V2, ...). If you want the header information also, you can wrap the above in a call to data.table::setnames, where the names are obtained by reading the file with no data rows (nrows = 0):

setnames(
  fread(cmd = "tail -10 iris.csv"),
  names(fread("iris.csv", nrows = 0))
)
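An equivalent way to attach the header is to pass the names straight to fread's col.names argument, avoiding the extra setnames call:

fread(cmd = "tail -10 iris.csv",
      col.names = names(fread("iris.csv", nrows = 0)))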

If you have a full path, you can do this:

fpath = "C:/Users/xxxx/Desktop/folder 1/iris.csv"

setnames(
  fread(cmd = paste0("tail -10 '", fpath, "'")),
  names(fread(fpath, nrows = 0))
)[]
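If you do this often, the pattern is easy to wrap in a small helper (fread_tail is a hypothetical name of my own, not a data.table function):

fread_tail <- function(path, n = 10) {
  setnames(
    fread(cmd = paste0("tail -", n, " '", path, "'")),
    names(fread(path, nrows = 0))
  )[]
}

fread_tail(fpath, n = 10)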

How can I use do.call() to read specific rows?

There are several strategies for tackling this. You can import all the data into a list using lapply and then filter each list element based on your criteria, using data.table::rbindlist to make the final data.table. Another option is to do the filtering in the same step as the read, e.g. (not tested, obviously):

library(data.table)

files = list.files(path = "G:/SafeGraph201708MidWest",
                   pattern = "*.csv",
                   recursive = TRUE,
                   full.names = TRUE)

xy <- lapply(files, FUN = function(x) {
  out <- fread(x)
  # filter with data.table syntax (dplyr::filter would require library(dplyr))
  out <- out[longitude >= -86.97 & longitude <= -86.78 &
             latitude >= 40.35 & latitude <= 40.49]
  out
})

xy <- rbindlist(xy)
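For the other strategy mentioned above (import everything first, then filter once at the end), a minimal sketch assuming the same longitude/latitude columns; filtering inside lapply keeps peak memory lower, while binding first is simpler when everything fits in RAM:

xy_all <- rbindlist(lapply(files, fread))
xy <- xy_all[longitude >= -86.97 & longitude <= -86.78 &
             latitude >= 40.35 & latitude <= 40.49]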

Using fread() to select rows and columns, the way read.csv.sql() does

It is not yet possible to select rows with fread() the way read.csv.sql() can. But it is still better to read the entire data (memory permitting) and then subset it as per your criteria. For a 200 MB file, fread() + subset() gave roughly 4 times better performance than read.csv.sql().

So, using @Arun's suggestion,

ans = rbindlist(lapply(files, function(x) fread(x)[, fn := x]))
subset(ans, 'your criteria')  # placeholder: substitute your actual filter expression

is better than the approach in the original question.
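Column selection, at least, can be pushed down into fread itself via its select argument, so only the row filtering has to happen after the read. A sketch, where colA, colB, and the filter are placeholders for your own columns and criteria:

# read only the needed columns; rows are still filtered after the read
ans <- rbindlist(lapply(files, function(x)
  fread(x, select = c("colA", "colB"))[, fn := x]))
ans[colA > 0]  # replace with your actual criteria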


