Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?)
Sometimes I do something like this when I have the data in a tab-delimited file:
df <- read.table(pipe("cut -f1,5,28 myFile.txt"))
That lets `cut` do the data selection, which it can do without using much memory at all.
See "Only read limited number of columns" for a pure R version, using `"NULL"` in the `colClasses` argument to `read.table`.
Only read selected columns
If the data are in the file `data.txt`, you can use the `colClasses` argument of `read.table()` to skip columns. Here the data in the first 7 columns are `"integer"` and we set the remaining 6 columns to `"NULL"`, indicating that they should be skipped:
> read.table("data.txt", colClasses = c(rep("integer", 7), rep("NULL", 6)),
+ header = TRUE)
Year Jan Feb Mar Apr May Jun
1 2009 -41 -27 -25 -31 -31 -39
2 2010 -41 -27 -25 -31 -31 -39
3 2011 -21 -27 -2 -6 -10 -32
Change `"integer"` to one of the accepted types as detailed in `?read.table`, depending on the real type of the data.
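As a minimal sketch of mixing types, the example below writes a small made-up file (the names `sample_dat`, `id`, `name`, `score` and `flag` are hypothetical, not from the question) and reads it back with one `colClasses` entry per column, dropping the third column:

```r
# Hypothetical sample file: integer, character, numeric and logical columns.
tmp <- tempfile(fileext = ".txt")
sample_dat <- data.frame(id = 1:3,
                         name = c("a", "b", "c"),
                         score = c(1.5, 2.5, 3.5),
                         flag = c(TRUE, FALSE, TRUE))
write.table(sample_dat, file = tmp, row.names = FALSE)

# One colClasses entry per column in the file; "NULL" drops that column
# while reading, so "score" never ends up in the result.
res <- read.table(tmp, header = TRUE,
                  colClasses = c("integer", "character", "NULL", "logical"))
str(res)
unlink(tmp)
```

The vector has to line up with the columns as they appear in the file, so its length should equal the total number of columns, not just the number you keep.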
`data.txt` looks like this:
$ cat data.txt
"Year" "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
2009 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2010 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2011 -21 -27 -2 -6 -10 -32 -13 -12 -27 -30 -38 -29
and was created by using
write.table(dat, file = "data.txt", row.names = FALSE)
where `dat` is
dat <- structure(list(Year = 2009:2011, Jan = c(-41L, -41L, -21L), Feb = c(-27L,
-27L, -27L), Mar = c(-25L, -25L, -2L), Apr = c(-31L, -31L, -6L
), May = c(-31L, -31L, -10L), Jun = c(-39L, -39L, -32L), Jul = c(-25L,
-25L, -13L), Aug = c(-15L, -15L, -12L), Sep = c(-30L, -30L, -27L
), Oct = c(-27L, -27L, -30L), Nov = c(-21L, -21L, -38L), Dec = c(-25L,
-25L, -29L)), .Names = c("Year", "Jan", "Feb", "Mar", "Apr",
"May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), class = "data.frame",
row.names = c(NA, -3L))
If the number of columns is not known beforehand, the utility function `count.fields` will read through the file and count the number of fields in each line.
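## returns a vector with one element per line: the number of fields on that line
## (sep defaults to whitespace, which matches data.txt; pass sep = "\t" for a tab-delimited file)
count.fields("data.txt")
## returns the maximum, which gives the length needed for colClasses
max(count.fields("data.txt"))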
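Putting the two pieces together, here is a rough sketch that keeps only the first few columns of a file whose width is not known in advance. The temporary file, `n_keep` and `first_cols` are made up for illustration, and the file is assumed to be whitespace-separated as `write.table()` produces by default:

```r
# Hypothetical 10-column file of integers, written with write.table() defaults.
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(matrix(1:20, nrow = 2)), file = tmp, row.names = FALSE)

n_keep <- 3L
# Widest line in the file; the default sep (whitespace) matches how it was written.
n_cols <- max(count.fields(tmp))

# Keep the first n_keep columns as integers and skip everything after them.
first_cols <- read.table(tmp, header = TRUE,
                         colClasses = c(rep("integer", n_keep),
                                        rep("NULL", n_cols - n_keep)))
unlink(tmp)
```

Note that `count.fields` scans the whole file, so on very large inputs it costs one extra pass before the actual read.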
Read only select columns with read.table when number of columns is unknown
It is easy to know how many columns you have if you know your separator. You can use a construct such as this for each file:
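my.read.table <- function(file, sep = ",", colClasses3 = rep("numeric", 3), skip = 0, ...) {
  ## Read the first line that read.table() will parse, i.e. the one after any skipped lines.
  first.line <- readLines(file, n = skip + 1)[skip + 1]
  ## Split that line on the separator; fixed = TRUE avoids having to escape the separator.
  ncols <- length(strsplit(first.line, sep, fixed = TRUE)[[1]])
  ## Keep the leading columns and set every remaining column to "NULL" so it is skipped.
  n.keep <- length(colClasses3)
  read.table(file, sep = sep, skip = skip,
             colClasses = c(colClasses3, rep("NULL", ncols - n.keep)), ...)
}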
And then use your solution:
lapply(files, my.read.table, skip=19, header=TRUE)
Also note that you will have to account for whether the file contains row names and column names, because `read.table` applies some automatic detection when they are present. The above solution is written assuming neither is present. Please read about `colClasses` in `?read.table` to tweak this further to suit your needs.
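To illustrate the row-name caveat, here is a small sketch with a made-up file (the data frame `dat_rn` and its column names are hypothetical). When a file is written with row names, the header line has one field fewer than the data lines, `read.table` then treats the first column as row names, and `colClasses` needs one extra leading entry for that column:

```r
# Hypothetical data written with row names (the write.table() default).
dat_rn <- data.frame(a = 1:3, b = c(0.5, 1.5, 2.5), label = c("x", "y", "z"),
                     row.names = c("r1", "r2", "r3"))
tmp <- tempfile(fileext = ".txt")
write.table(dat_rn, file = tmp, sep = "\t", quote = FALSE)

# Field count per line: the header is one short because the
# row-name column has no header entry.
fields <- count.fields(tmp, sep = "\t")

# colClasses is sized to the data lines: the first entry covers the row names,
# then one entry per named column; "label" is skipped.
res <- read.table(tmp, sep = "\t", header = TRUE,
                  colClasses = c("character", "integer", "numeric", "NULL"))
unlink(tmp)
```

Counting fields on the header line alone would therefore undercount by one for such files, which is why comparing the header against a data line is a useful check.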
Read only certain columns from xls
You can use the `XLConnect` package to read .xls files. The function `readWorksheet()` lets you set the columns and rows you need to import.
library(XLConnect)
wb <- loadWorkbook("wb.xls")
data <- readWorksheet(wb, sheet = "Sheet1", startCol = 1, endCol = 7)
How to read only lines that fulfil a condition from a csv into R?
You could use the `read.csv.sql` function in the `sqldf` package and filter using SQL select. From the help page of `read.csv.sql`:
library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
sql = "select * from file where `Sepal.Length` > 5", eol = "\n")
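If installing `sqldf` is not an option, a similar effect can be approximated in base R by reading the file in chunks and keeping only the matching rows of each chunk. This is a different technique from `read.csv.sql`, not what that function does internally. The helper `filter_csv` and its arguments are hypothetical, and the sketch assumes a simple CSV with a header line and no quoted commas:

```r
# Hypothetical helper: read a CSV chunk by chunk, keeping rows where keep_row() is TRUE.
filter_csv <- function(path, keep_row, chunk_size = 50L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Assumes a plain header line without quoted separators.
  header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break          # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    pieces[[length(pieces) + 1L]] <- chunk[keep_row(chunk), , drop = FALSE]
  }
  do.call(rbind, pieces)
}

csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, quote = FALSE, row.names = FALSE)
big <- filter_csv(csv_path, function(d) d$Sepal.Length > 5)
unlink(csv_path)
```

Only one chunk plus the rows kept so far are held in memory at a time, at the cost of parsing the file in several small `read.csv` calls.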
Selecting non-consecutive columns in R tables
First generate the indexes you want. The `c` function concatenates values, which can be either column indices or column names (but not a mix of both).
df <- data.frame(matrix(runif(100), 10))
cols <- c(1, 4:8, 10)
df[,cols]
You can also select which column indices to remove by specifying a negative index:
df[, -c(3, 5)] # all but the third and fifth columns
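The same idea by column name, as a short sketch (the object names `by_name`, `without` and `mixed` are just for illustration). Negative indexing works with positions only, so dropping columns by name is done here with `setdiff()`, and `match()` converts names to positions when both styles need to be combined:

```r
df <- data.frame(matrix(runif(100), 10))   # columns are named X1 ... X10

# Select non-consecutive columns by name.
by_name <- df[, c("X1", "X4", "X10")]

# Drop columns by name: keep every name except the unwanted ones.
without <- df[, setdiff(names(df), c("X3", "X5"))]

# Combine a name with positions by converting the name to its index first.
mixed <- df[, c(match("X1", names(df)), 4:8)]
```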