R Programming: read.csv() skips lines unexpectedly

Here's an example of using count.fields to determine where to look and perhaps apply fixes. You have a modest number of lines that are 23 'fields' in width:

> table(count.fields("~/Downloads/bugs.csv", quote="", sep=","))
     2     23     30
   502     10 136532
> table(count.fields("~/Downloads/bugs.csv", sep=","))
# Just wanted to see if removing quote-recognition would help.... It didn't.
     2      4     10     12     20     22     23     25     28     30
 11308     24     20     33    642    251     10      2    170 124584
> which(count.fields("~/Downloads/bugs.csv", quote="", sep=",") == 23)
[1] 104843 125158 127876 129734 130988 131456 132515 133048 136764
[10] 136765

I looked at the ten 23-field lines with:

txt <- readLines("~/Downloads/bugs.csv")[
    which(count.fields("~/Downloads/bugs.csv", quote="", sep=",") == 23)]

And they contained octothorpes ("#", hash signs), which are treated as comment characters by default in R's data-reading functions.

> table(count.fields("~/Downloads/bugs.csv", quote="", sep=",", comment.char=""))
30
137044

So use those settings in read.table and you should be good to go.
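As a self-contained sketch of that diagnosis and fix (using a small made-up file in place of bugs.csv), a "#" inside a field makes count.fields and read.table see a short row until comment recognition is switched off:

```r
# Hypothetical stand-in for bugs.csv: the first data row's field contains a "#"
tf <- tempfile(fileext = ".csv")
cat("id,note,value\n1,see #42,10\n2,plain,20\n", file = tf)

# count.fields() defaults to comment.char = "#", so the "#42" row looks short
table(count.fields(tf, sep = ","))

# Disabling comment recognition restores a uniform 3 fields per row
table(count.fields(tf, sep = ",", comment.char = ""))

dat <- read.table(tf, sep = ",", header = TRUE, quote = "", comment.char = "")
nrow(dat)  # both data rows are read
```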

R skips lines from /dev/stdin

head --bytes=4K file | tail -n 3

yields this:

1039
1040
104

This suggests that R creates a 4 KB input buffer on /dev/stdin and fills it during initialisation. When your R code then reads /dev/stdin, it starts in the file at this point:

   1
1042
1043
...

Indeed, if you replace the line 1041 in the file with 1043, you get a "3" instead of a "1" in table(x):

   3 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053
   1    1    1    1    1    1    1    1    1    1    1    1    1
...

The first 1 in table(x) is actually the last digit of 1041; the first 4 KB of the file have been eaten.
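A workaround (my suggestion, not part of the answer above) is to read from the connection file("stdin") rather than the pathname /dev/stdin, so the data goes through R's own standard-input connection instead of a second open of the device. A runnable stand-in, with a textConnection playing the role of stdin:

```r
# In a real script you would write:
#   x <- read.csv(file("stdin"), header = FALSE)
# instead of read.csv("/dev/stdin", ...), which can start reading after the
# 4 KB that R has already buffered.

# Stand-in demonstration with a textConnection in place of stdin:
con <- textConnection("1039\n1040\n1041\n1042")
x <- read.csv(con, header = FALSE)
table(x$V1)  # all four lines are present, none eaten by buffering
```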

Reading in multiple CSVs with different numbers of lines to skip at start of file

The fread function from the data.table package automatically detects the number of rows to skip. At the time that answer was written, this feature was still under development.

Here is an example code:

require(data.table)

cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")

lapply(list.files(pattern = "myfile.*.csv"), fread)
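If you would rather not depend on fread's auto-detection, a base-R sketch of the same idea is to locate the header row yourself (here by the literal "VARIABLE" marker used in the files above, an assumption that every file has such a header) and pass skip to read.csv:

```r
# Base-R alternative: find the header line and skip everything before it.
read_with_skip <- function(path, marker = "^VARIABLE,") {
  header_at <- grep(marker, readLines(path))[1]  # first line matching marker
  read.csv(path, skip = header_at - 1)
}

lapply(list.files(pattern = "myfile.*\\.csv"), read_with_skip)
```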

read.csv in R doesn't import all rows from csv file

The OP indicates that the problem is caused by quotes in the CSV file.

When the records in the CSV file are not quoted but a few contain stray quote characters, the file can be opened with the quote="" option of read.csv, which disables quote recognition entirely.

data <- read.csv(filename, quote="")

Another solution is to remove all quotes from the file, but this also modifies the data (your strings no longer contain any quotes) and causes problems if your fields contain commas.

lines <- readLines(filename)
lines <- gsub('"', '', lines, fixed=TRUE)
data <- read.csv(textConnection(lines))

A slightly safer solution, which only deletes quotes that are not directly before or after a comma:

lines <- readLines(filename)
lines <- gsub('([^,])"([^,])', '\\1\\2', lines)
data <- read.csv(textConnection(lines))
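As a quick check of the deletion variant of that pattern on an invented line: quotes embedded inside a field are dropped, while commas and field boundaries are untouched:

```r
line <- 'id,John "Johnny" Smith,42'
gsub('([^,])"([^,])', '\\1\\2', line)
# "id,John Johnny Smith,42"
```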

Error reading csv as a zoo object - certain lines with 'bad entries'

Your csv has empty values. You can fill them with NAs and then turn the result into a zoo object. You could try this:

library(zoo)
x <- read.csv("OakParkR.csv", header=TRUE)
x <- na.fill(x, NA)
x <- zoo(x)
x[33:35]
#date imax Tmax imin Tmin irain rain cbl wdsp ihm hm iddhm ddhm ihg hg soil
#33 02-Feb-07 0 9.1 0 -1.7 0 0.1 1026.2 3.9 0 10 0 340 0 14 5.970
#34 03-Feb-07 0 9.2 0 -3.0 0 0.0 <NA> 2.4 0 7 0 130 0 11 3.101
#35 04-Feb-07 0 7.7 0 -3.7 0 0.0 1031.8 3.3 0 8 0 330 0 12 2.668
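If the end goal is a series indexed by the date column, read.zoo (an alternative I am suggesting here, not part of the answer above) can parse the dates and build the index in one step. The file below is an invented stand-in for OakParkR.csv with one blank entry; it uses ISO dates rather than the "02-Feb-07" style above, since parsing month abbreviations with %b is locale-dependent:

```r
library(zoo)

# Invented stand-in for OakParkR.csv, with one blank ("bad") entry
tf <- tempfile(fileext = ".csv")
cat("date,Tmax,rain\n2007-02-02,9.1,0.1\n2007-02-03,9.2,\n2007-02-04,7.7,0.0\n",
    file = tf)

# The first column becomes the index; the blank rain value comes through as NA
z <- read.zoo(tf, header = TRUE, sep = ",", format = "%Y-%m-%d")
z
```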

