R Programming: read.csv() skips lines unexpectedly
Here's an example of using count.fields to determine where to look and perhaps apply fixes. You have a modest number of lines that are 23 'fields' in width:
> table(count.fields("~/Downloads/bugs.csv", quote="", sep=","))
     2     23     30
   502     10 136532
> table(count.fields("~/Downloads/bugs.csv", sep=","))
# Just wanted to see if removing quote-recognition would help.... It didn't.
     2      4     10     12     20     22     23     25     28     30
 11308     24     20     33    642    251     10      2    170 124584
> which(count.fields("~/Downloads/bugs.csv", quote="", sep=",") == 23)
[1] 104843 125158 127876 129734 130988 131456 132515 133048 136764
[10] 136765
I looked at the 23-field lines with:
txt <- readLines("~/Downloads/bugs.csv")[
which(count.fields("~/Downloads/bugs.csv", quote="", sep=",") == 23)]
They contained octothorpes ("#", hash signs), which read.table and count.fields treat as comment characters by default.
> table(count.fields("~/Downloads/bugs.csv", quote="", sep=",", comment.char=""))
30
137044
So use those settings (quote="" and comment.char="") in read.table and you should be "good to go".
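A minimal sketch of that recipe on a tiny stand-in file. The file contents and the names csv_path, fields_default, fields_fixed and bugs are made up for illustration; the real bugs.csv is not reproduced here.

```r
# Hypothetical sample file standing in for bugs.csv.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,title,status",
  "1,crash on start,open",
  "2,issue #42 regression,closed",
  "3,5\" display glitch,open"
), csv_path)

# With the default comment.char = "#", the third line is cut off at the "#".
fields_default <- count.fields(csv_path, sep = ",", quote = "")

# With comment and quote handling disabled, every line has the same width.
fields_fixed <- count.fields(csv_path, sep = ",", quote = "", comment.char = "")

# Read the file with the same settings that gave a uniform field count.
bugs <- read.table(csv_path, header = TRUE, sep = ",", quote = "",
                   comment.char = "", stringsAsFactors = FALSE)
```

Comparing fields_default with fields_fixed shows which lines were being truncated before the settings were changed.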
R skips lines from /dev/stdin
head --bytes=4K file | tail -n 3
yields this:
1039
1040
104
This suggests that R creates an input buffer on /dev/stdin, of size 4KB, and fills it during initialisation. When your R code then reads /dev/stdin, it starts in file at this point:
1
1042
1043
...
Indeed, if in file you replace the line 1041 by 1043, you get a "3" instead of "1" in the table(x):
3 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053
1 1 1 1 1 1 1 1 1 1 1 1 1
...
The first 1 in table(x) is actually the last digit of 1041. The first 4KB of file have been eaten.
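The byte arithmetic can be checked without touching /dev/stdin at all. The sketch below assumes that file holds the integers 1, 2, 3, ... one per line (the answer does not show the file, so this is inferred from the output above), and simply splits that text at byte 4096. The names file_text, head_4k and remainder are illustrative.

```r
# Assumption: the file contains 1, 2, 3, ... one number per line.
file_text <- paste0(paste(1:2000, collapse = "\n"), "\n")
file_bytes <- charToRaw(file_text)

# The first 4096 bytes (the part the buffer would swallow) and what is left.
head_4k <- rawToChar(file_bytes[1:4096])
remainder <- rawToChar(file_bytes[4097:length(file_bytes)])

# Last three pieces of the swallowed part, first three of what remains.
eaten_tail <- tail(strsplit(head_4k, "\n", fixed = TRUE)[[1]], 3)
left_head <- head(strsplit(remainder, "\n", fixed = TRUE)[[1]], 3)
```

Under that assumption, eaten_tail ends in the truncated "104" and left_head begins with the orphaned "1", matching the two listings above.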
Reading in multiple CSVs with different numbers of lines to skip at start of file
The function fread from the package data.table automatically detects the number of rows to be skipped. At the time of writing, the function was still in development.
Here is an example:
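library(data.table)
# Three test files with a different number of junk lines before the header
cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")
# Read every matching file; fread works out where the table starts
lapply(list.files(pattern = "^myfile.*\\.csv$"), fread)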
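If data.table is not available, something similar can be sketched in base R by searching each file for its header line. This is an alternative to fread, not how fread works internally. The helper read_after_header and its header_pattern argument are hypothetical, and the sketch assumes every header starts with "VARIABLE," as in the test files above.

```r
# Hypothetical helper: drop everything before the first line that looks
# like the header, then parse the rest as CSV.
read_after_header <- function(path, header_pattern = "^VARIABLE,") {
  lines <- readLines(path)
  header_line <- grep(header_pattern, lines)[1]
  if (is.na(header_line)) {
    stop("no line matching '", header_pattern, "' in ", path)
  }
  read.csv(text = lines[header_line:length(lines)], header = TRUE)
}

# Demo files written to a temporary directory.
tmp_dir <- tempfile("skipdemo")
dir.create(tmp_dir)
cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n",
    file = file.path(tmp_dir, "myfile1.csv"))
cat("blah\nVARIABLE,A1,A2\nA,1,2\n",
    file = file.path(tmp_dir, "myfile2.csv"))

paths <- list.files(tmp_dir, pattern = "^myfile.*\\.csv$", full.names = TRUE)
tables <- lapply(paths, read_after_header)
```

Reading the whole file with readLines first keeps the sketch short, but for very large files it would be cheaper to scan only the first few lines for the header.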
read.csv in R doesn't import all rows from csv file
The OP indicates that the problem is caused by quotes in the CSV-file.
When the fields in the CSV file are not quoted and only a few records contain stray quote characters, the file can be read using the quote="" option of read.csv, which disables quote processing.
data <- read.csv(filename, quote="")
Another solution is to remove all quotes from the file, but this also modifies the data (your strings no longer contain any quotes) and causes problems if your fields contain commas.
lines <- readLines(filename)
lines <- gsub('"', '', lines, fixed=TRUE)
data <- read.csv(textConnection(lines))
A slightly safer solution leaves quotes directly before or after a comma alone and doubles every other quote, so that it can be read as an escaped quote:
lines <- readLines(filename)
lines <- gsub('([^,])"([^,])', '\\1""\\2', lines)
data <- read.csv(textConnection(lines))
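To see what the two substitutions do to the text itself, here is a small sketch on two made-up lines (sample_lines is illustrative data, not taken from the OP's file). It only inspects the strings and does not claim anything about how read.csv parses the result.

```r
# Illustrative input: one stray quote, one properly quoted field with a comma.
sample_lines <- c('1,5" monitor,ok', '2,"quoted, field",ok')

# Removing every quote: the comma inside the quoted field now splits it.
stripped <- gsub('"', '', sample_lines, fixed = TRUE)
stripped_widths <- lengths(strsplit(stripped, ",", fixed = TRUE))

# Doubling only quotes that have a non-comma character on both sides:
# the stray quote is doubled, the field-delimiting quotes are untouched.
escaped <- gsub('([^,])"([^,])', '\\1""\\2', sample_lines)
```

Here stripped_widths is 3 for the first line but 4 for the second, which is the comma problem described above, while escaped changes only the first line.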
Error reading csv as a zoo object - certain lines with 'bad entries'
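Your csv has empty values. You can fill them with NAs and then turn the data into a zoo object. You could try this:
library(zoo)  # provides na.fill() and zoo()
x <- read.csv("OakParkR.csv", header=TRUE)
na.fill(x, NA)  # note: the result is not assigned, so x itself is unchanged
x <- zoo(x)
x[33:35]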
#date imax Tmax imin Tmin irain rain cbl wdsp ihm hm iddhm ddhm ihg hg soil
#33 02-Feb-07 0 9.1 0 -1.7 0 0.1 1026.2 3.9 0 10 0 340 0 14 5.970
#34 03-Feb-07 0 9.2 0 -3.0 0 0.0 <NA> 2.4 0 7 0 130 0 11 3.101
#35 04-Feb-07 0 7.7 0 -3.7 0 0.0 1031.8 3.3 0 8 0 330 0 12 2.668
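To find the rows with empty entries before converting, a base R sketch along these lines may help. The three rows are a cut-down copy of the output above with only three of the columns, and oak_text, oak and bad_rows are illustrative names; na.strings is set explicitly so that empty cells become NA in every column type.

```r
# Cut-down sample with one empty cbl value, mimicking row 34 above.
oak_text <- c(
  "date,Tmax,cbl",
  "02-Feb-07,9.1,1026.2",
  "03-Feb-07,9.2,",
  "04-Feb-07,7.7,1031.8"
)

# Treat empty cells (and the literal string NA) as missing values.
oak <- read.csv(text = oak_text, header = TRUE, na.strings = c("", "NA"))

# Row numbers that contain at least one missing value.
bad_rows <- which(!complete.cases(oak))
```

With the real file, the same call on read.csv("OakParkR.csv", ...) would list the rows worth inspecting.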