Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?)

Sometimes I do something like this when I have the data in a tab-delimited file:

df <- read.table(pipe("cut -f1,5,28 myFile.txt"))

That lets cut do the data selection, which it can do without using much memory at all.
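As a rough, self-contained sketch of the same idea, the snippet below writes a small tab-delimited file and reads back two of its columns through cut. It assumes a Unix-like system with cut on the PATH; the file contents and the names tmp, cmd and picked are made up for illustration.

```r
# Write a small tab-delimited file to a temporary location (illustrative data).
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:3, x = c(0.5, 1.5, 2.5), label = c("a", "b", "c")),
            file = tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Build the shell command; shQuote() protects paths that contain spaces.
cmd <- paste("cut -f1,3", shQuote(tmp))

# cut passes the header line through too, so header = TRUE still applies.
picked <- read.table(pipe(cmd), header = TRUE, sep = "\t")
```

cut splits on tabs by default; for another delimiter it takes a -d option, and the sep argument of read.table would need to match.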

For a pure R version, see "Only read limited number of columns", which uses "NULL" in the colClasses argument to read.table.

Only read selected columns

Say the data are in the file data.txt. You can use the colClasses argument of read.table() to skip columns. Here the first 7 columns are read as "integer" and the remaining 6 are set to "NULL", indicating that they should be skipped:

> read.table("data.txt", colClasses = c(rep("integer", 7), rep("NULL", 6)), 
+ header = TRUE)
Year Jan Feb Mar Apr May Jun
1 2009 -41 -27 -25 -31 -31 -39
2 2010 -41 -27 -25 -31 -31 -39
3 2011 -21 -27 -2 -6 -10 -32

Change "integer" to one of the accepted types as detailed in ?read.table depending on the real type of data.

data.txt looks like this:

$ cat data.txt 
"Year" "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
2009 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2010 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2011 -21 -27 -2 -6 -10 -32 -13 -12 -27 -30 -38 -29

and was created by using

write.table(dat, file = "data.txt", row.names = FALSE)

where dat is

dat <- structure(list(Year = 2009:2011, Jan = c(-41L, -41L, -21L), Feb = c(-27L, 
-27L, -27L), Mar = c(-25L, -25L, -2L), Apr = c(-31L, -31L, -6L
), May = c(-31L, -31L, -10L), Jun = c(-39L, -39L, -32L), Jul = c(-25L,
-25L, -13L), Aug = c(-15L, -15L, -12L), Sep = c(-30L, -30L, -27L
), Oct = c(-27L, -27L, -30L), Nov = c(-21L, -21L, -38L), Dec = c(-25L,
-25L, -29L)), .Names = c("Year", "Jan", "Feb", "Mar", "Apr",
"May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), class = "data.frame",
row.names = c(NA, -3L))

If the number of columns is not known beforehand, the utility function count.fields will read through the file and count the number of fields in each line.

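## returns a vector with one element per line, giving that line's field count
## (data.txt is whitespace-separated, the default; use sep = "\t" for tab-delimited files)
count.fields("data.txt")
## the maximum is the number of columns to cover in colClasses
max(count.fields("data.txt"))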
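Putting the two pieces together, here is a minimal sketch that uses count.fields to size a colClasses vector and then keeps an arbitrary, non-contiguous set of columns. The shortened data set and the names tmp, keep and col_classes are illustrative assumptions, not part of the original answer.

```r
# Recreate a small whitespace-separated file, similar to data.txt.
dat <- data.frame(Year = 2009:2011,
                  Jan = c(-41L, -41L, -21L),
                  Feb = c(-27L, -27L, -27L),
                  Mar = c(-25L, -25L, -2L))
tmp <- tempfile(fileext = ".txt")
write.table(dat, file = tmp, row.names = FALSE)

# Count the fields on each line and take the maximum as the column count.
n_cols <- max(count.fields(tmp))

# Start with every column skipped, then mark the ones to keep.
# NA lets read.table guess the type of a kept column.
keep <- c(1, 3)                       # hypothetical choice of columns
col_classes <- rep("NULL", n_cols)
col_classes[keep] <- NA

subset_df <- read.table(tmp, header = TRUE, colClasses = col_classes)
```

Using NA for the kept columns avoids having to know their types in advance; replace it with "integer", "character" and so on when the types are known.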

Read only select columns with read.table when number of columns is unknown

It is easy to know how many columns you have if you know your separator. You can use a construct such as this for each file:

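my.read.table <- function(file, sep = ",", colClasses3 = rep("double", 3), skip = 0, ...) {
  ## Read the first line after any skipped lines, so that the column
  ## count comes from the data rather than from leading metadata.
  first.line <- readLines(file, n = skip + 1)[skip + 1]

  ## Split that line on the separator.
  ## fixed = TRUE avoids the need to escape the separator when splitting.
  ncols <- length(strsplit(first.line, sep, fixed = TRUE)[[1]])

  ## Keep the first three columns and skip the rest with "NULL".
  read.table(file, sep = sep, skip = skip,
             colClasses = c(colClasses3, rep("NULL", ncols - 3)), ...)
}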

And then use your solution:

lapply(files, my.read.table, skip=19, header=TRUE)

Note also that you need to consider whether the file contains row names and column names, because read.table applies some intelligence when they are present: if the header line has one fewer field than the data lines, the first column is used as row names. The above solution is written assuming none. Please read about colClasses in ?read.table to tweak this further to suit your needs.
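To illustrate the row-name behaviour, here is a small sketch in which the header line has one fewer field than the data lines. In that case colClasses is specified per input column, so it has to include an entry for the row-name column. The file contents and the names tmp, n_cols and col_classes are invented for this example.

```r
# Header has 3 fields, data lines have 4: the first column holds row names.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c",
             "r1,1,2,3",
             "r2,4,5,6"), tmp)

# Use the widest line to decide how many colClasses entries are needed.
n_cols <- max(count.fields(tmp, sep = ","))

# One entry for the row names, one for column "a", the rest skipped.
col_classes <- c("character", "numeric", rep("NULL", n_cols - 2))

out <- read.table(tmp, sep = ",", header = TRUE, colClasses = col_classes)
```

Counting fields on the header line alone would give 3 here, which is one short of what colClasses needs.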

Read only certain columns from xls

You can use the XLConnect package to read .xls files. Its readWorksheet() function lets you specify which columns and rows to import.

library(XLConnect)
wb<-loadWorkbook("wb.xls")
data <- readWorksheet(wb, sheet = "Sheet1",startCol=1,endCol=7)

How to read only lines that fulfil a condition from a csv into R?

You could use the read.csv.sql function in the sqldf package and filter using SQL select. From the help page of read.csv.sql:

library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
sql = "select * from file where `Sepal.Length` > 5", eol = "\n")
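If installing sqldf is not an option, a different technique with a similar effect is to read the file in chunks with base R and keep only the matching rows from each chunk. This is a sketch of that alternative, not what read.csv.sql does internally; the chunk size of 50 and the names tmp, con, chunks and iris_filtered are arbitrary choices for illustration.

```r
# Write the example data to a temporary CSV file.
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, quote = FALSE, row.names = FALSE)

con <- file(tmp, open = "r")
header <- readLines(con, n = 1)        # keep the header for parsing each chunk
chunks <- list()
repeat {
  lines <- readLines(con, n = 50)      # read the next block of lines
  if (length(lines) == 0) break        # stop at end of file
  chunk <- read.csv(text = c(header, lines))
  # Keep only the rows that satisfy the condition.
  chunks[[length(chunks) + 1]] <- chunk[chunk$Sepal.Length > 5, ]
}
close(con)

iris_filtered <- do.call(rbind, chunks)
```

Only one chunk plus the rows kept so far are held in memory at a time, which is the point when the full file is too large to load.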

Selecting non-consecutive columns in R tables

You simply first generate the indexes you want. The c function allows you to concatenate values. The values can be either column indices or column names (but not mixed).

df <- data.frame(matrix(runif(100), 10))
cols <- c(1, 4:8, 10)
df[,cols]

You can also select which column indices to remove by specifying a negative index:

df[, -c(3, 5)] # all but the third and fifth columns
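Negative indices only work with numeric positions. A short sketch of the equivalent operations by column name follows; the names X1 to X10 are the defaults that data.frame() assigns to an unnamed matrix, and by_name and dropped are illustrative variable names.

```r
set.seed(1)                                   # reproducible example data
df <- data.frame(matrix(runif(100), 10))      # columns are named X1 ... X10

# Select non-consecutive columns by name.
by_name <- df[, c("X1", "X4", "X10")]

# Drop columns by name: setdiff() keeps every name except those listed.
dropped <- df[, setdiff(names(df), c("X3", "X5"))]
```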

