Preventing column-class inference in fread()
Option 1: Using a system command
fread() allows the use of a system command in its first argument. We can use it to remove the quotes in the first column of the file.
indt <- data.table::fread("cat test.csv | tr -d '\"'", nrows = 100)
str(indt)
# Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
# $ x: int 1 2 3 4 5 6 7 8 9 10 ...
# $ y: int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
The system command cat test.csv | tr -d '\"' explained:
- cat test.csv reads the file to standard output
- | is a pipe, using the output of the previous command as input for the next command
- tr -d '\"' deletes (-d) all occurrences of double quotes ('\"') from the current input
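Where a shell with tr is not available, the same quote-stripping step can be sketched in plain R. This is only an illustration, not the approach from the answer above: the temporary file and its contents are made-up stand-ins for test.csv, and base read.csv() is used in place of fread() so the snippet runs without extra packages.

```r
# Hypothetical stand-in for test.csv: first column quoted, second not
csv_path <- tempfile(fileext = ".csv")
writeLines(c("x,y", '"1",1', '"2",2', '"3",3'), csv_path)

# Same idea as `tr -d '"'`: drop every double quote before parsing
clean <- gsub('"', "", readLines(csv_path), fixed = TRUE)

# Parse the cleaned lines from memory instead of from the file
indt_base <- read.csv(text = clean)
str(indt_base)
```

On large files the system command is likely to be the better choice, since readLines() pulls the whole file into memory before anything is parsed.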
Option 2: Coercion after reading
Since option 1 doesn't seem to be working on your system, another possibility is to read the file as you did, but convert the x column with type.convert().
library(data.table)
indt2 <- fread("test.csv", nrows = 100)[, x := type.convert(x, as.is = TRUE)]
str(indt2)
# Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
# $ x: int 1 2 3 4 5 6 7 8 9 10 ...
# $ y: int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
Side note: I usually prefer to use type.convert() over as.numeric() to avoid the "NAs introduced by coercion" warning triggered in some cases. For example,
x <- c("1", "4", "NA", "6")
as.numeric(x)
# [1] 1 4 NA 6
# Warning message:
# NAs introduced by coercion
type.convert(x, as.is = TRUE)
# [1] 1 4 NA 6
But of course you can use as.numeric() as well.
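As a small sketch of how type.convert() picks a type, the snippet below uses made-up vectors. The as.is = TRUE argument is my addition: it keeps non-numeric input as character instead of turning it into a factor.

```r
x <- c("1", "4", "NA", "6")

# All entries are integer-like or NA, so the result is an integer vector
x_int <- type.convert(x, as.is = TRUE)
class(x_int)

# One non-numeric entry makes the whole vector stay character
x_chr <- type.convert(c("1", "4", "n/a", "6"), as.is = TRUE)
class(x_chr)

# Extra missing-value markers can be declared through na.strings
x_na <- type.convert(c("1", "4", "n/a", "6"),
                     na.strings = c("NA", "n/a"), as.is = TRUE)
class(x_na)
```

The last call is handy when a file uses its own marker for missing values and the column should still end up numeric.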
Note: This answer assumes data.table dev v1.9.5
fread data.table in R doesn't read in column names
By default, read.table converts column names so that they are valid variable names, which is why it puts periods in place of spaces and colons. If you don't want that, you can use check.names=FALSE while using read.table
df1<-read.table("data.txt",check.names = FALSE)
sample(colnames(df1),10)
[1] "simple lobule white matter"
[2] "anterior lobule white matter"
[3] "hippocampus"
[4] "lateral olfactory tract"
[5] "lobules 1-2: lingula and central lobule (ventral)"
[6] "Medial parietal association cortex"
[7] "Primary somatosensory cortex: trunk region"
[8] "midbrain"
[9] "Secondary auditory cortex: ventral area"
[10] "Primary somatosensory cortex: forelimb region"
You can see that the column names are kept as they are.
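To make the renaming visible, here is a minimal sketch on a two-column header. The inline data is invented for illustration (two of the column names from the output above), and read.csv() is used so the snippet is self-contained.

```r
# Made-up two-column file with spaces and a colon in the header
txt <- c('"simple lobule white matter","Secondary auditory cortex: ventral area"',
         "1,2")

# Default behaviour: names are passed through make.names()
default_names <- names(read.csv(text = txt))
default_names

# check.names = FALSE keeps the header exactly as written
kept_names <- names(read.csv(text = txt, check.names = FALSE))
kept_names

# make.names() reproduces the default conversion on its own
make.names(kept_names)
```

Names kept this way are not syntactically valid, so they need backticks or [[ ]] when used, for example df1[["hippocampus"]] works either way but df1$`lateral olfactory tract` needs the backticks.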
Read huge .csv file with some columns in single quotes but not all with fread from the data.table package
Summary of the answers given in comments:
Solution #1:
Thanks to @Psidom and @jangorecki
Install data.table v. 1.9.7:
install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")
Then run:
homimpdt <- fread("homimp.csv", quote = "\'")
EDIT: Current version of data.table on CRAN is 1.9.6
Solution #2 (Linux only):
Thanks to @RichScriven. It can be found here: Preventing column-class inference in fread()
When using it, set as.is = TRUE in the type.convert() function.
fread importing empty as NA
A few possible things going on here:
- Regardless of you writing "0" here, the reading function (fread) is inferring based on looking at a portion of the file. This is not uncommon (readr does it, too), and is controllable (with colClasses=).
- This might be unique to your question here (and not your real data), but your call to write.csv is implicitly putting the literal NA letters in the file (not to be confused with "NA" where you have the literal string). This might be confusing things, even when you override with colClasses=.
- You might already know this, but since fread is inferring that those columns are really integer classes, then they cannot contain empty strings: once determined to be a number column, anything non-number-like will be NA.
Let's redo your first csv-generating side to make sure we don't confound the situation.
write.csv(matrix(c("0","",NA,"NA"),ncol = 2), "MRE.csv", na="")
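To see exactly what the reader is given, it can help to print the raw lines of the file. The sketch below repeats the call above but writes to a temporary path (mre_path is my stand-in for "MRE.csv") so it does not touch the working directory.

```r
mre_path <- tempfile(fileext = ".csv")
write.csv(matrix(c("0", "", NA, "NA"), ncol = 2), mre_path, na = "")

raw_lines <- readLines(mre_path)
writeLines(raw_lines)
# "","V1","V2"
# "1","0",
# "2","","NA"
```

The true NA is written as an unquoted empty field, the empty string as a quoted "", and the literal string as a quoted "NA". The first header field, for the row names, is empty.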
(Below, I'm using magrittr's pipe operator %>% merely for presentation; it is not required.)
The first example demonstrates fread's inference. The second shows our overriding that behavior, and now we have blank strings in each NA spot that is not the literal string "NA".
fread("MRE.csv") %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: int 1 2
# $ V1: int 0 NA
# $ V2: logi NA NA
# - attr(*, ".internal.selfref")=<externalptr>
fread("MRE.csv", colClasses="character") %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: chr "1" "2"
# $ V1: chr "0" ""
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
This can also be controlled on a per-column basis. One issue with this example is that fread is for some reason forcing the column of row-names to be named V1, the same as the next column. This looks like a bug to me; perhaps you can look at Rdatatable's issues and potentially post a new one. (I might be wrong, perhaps this is intentional/known behavior.)
Because of this, per-column overriding seems to stop at the first occurrence of a column name.
fread("MRE.csv", colClasses=c(V1="character", V2="character")) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: chr "1" "2"
# $ V1: int 0 NA
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
One way around this is to go with an unnamed vector, requiring the same number of classes as the number of columns:
fread("MRE.csv", colClasses=c("character","character","character")) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: chr "1" "2"
# $ V1: chr "0" ""
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
Another way (thanks @thelatemail) is with a list:
fread("MRE.csv", colClasses=list(character=2:3)) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: int 1 2
# $ V1: chr "0" ""
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
Side note: if you need to preserve them as ints/nums, then:
- if your concern is about how it affects follow-on calculations, then you can:
  - fix the source of the data so that nulls are not provided;
  - filter out the incomplete observations (rows); or
  - fix the calculations to deal intelligently with missing data.
- if your concern is about how it looks in a report, then whatever tool you are using to render in your report should have a mechanism for how to display NA values; for example, setting options(knitr.kable.NA="") before knitr::kable(...) will present them as empty strings.
- if your concern is about how it looks on your console, you have two options:
  - interfere with the data by iterating over each (intended) column and changing NA values to ""; this only works on character columns, and is irreversible; or
  - write your own subclass of data.frame that changes how it is displayed on the console; the benefit to this is that it is non-destructive; the problem is that you have to re-class each object where you want this behavior, and most (if not all) functions that output frames will likely inadvertently strip or omit that class from your input. (You'll need to write an S3 method of print for your subclass to do this.)
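The subclass idea from the console option could look roughly like this. It is a minimal sketch for plain data.frames only: the class name blank_na_df and the helper as_blank_na() are hypothetical names of mine, not from any package, and I have not tried it on data.table objects.

```r
# Hypothetical helper: tag a data.frame with an extra class
as_blank_na <- function(df) {
  class(df) <- c("blank_na_df", class(df))
  df
}

# S3 print method: blank out NA in a display copy, leave the data alone
print.blank_na_df <- function(x, ...) {
  shown <- x
  class(shown) <- setdiff(class(shown), "blank_na_df")
  shown[] <- lapply(shown, function(col) {
    out <- as.character(col)
    out[is.na(out)] <- ""   # only true NA; the string "NA" is kept
    out
  })
  print(shown, ...)
  invisible(x)
}

df <- as_blank_na(data.frame(a = c(0L, NA), b = c("x", NA)))
print(df)        # NA cells are shown as blanks
is.na(df$a[2])   # the underlying value is still NA
```

As noted above, many functions will drop the extra class from their result, so the object may need to be wrapped again after each transformation.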
read.csv warning 'EOF within quoted string' prevents complete reading of file
You need to disable quoting.
cit <- read.csv("citations.CSV", quote = "",
row.names = NULL,
stringsAsFactors = FALSE)
str(cit)
## 'data.frame': 112543 obs. of 13 variables:
## $ row.names : chr "10.2307/675394" "10.2307/30007362" "10.2307/4254931" "10.2307/20537934" ...
## $ id : chr "10.2307/675394\t" "10.2307/30007362\t" "10.2307/4254931\t" "10.2307/20537934\t" ...
## $ doi : chr "Archaeological Inference and Inductive Confirmation\t" "Sound and Sense in Cath Almaine\t" "Oak Galls Preserved by the Eruption of Mount Vesuvius in A.D. 79_ and Their Probable Use\t" "The Arts Four Thousand Years Ago\t" ...
## $ title : chr "Bruce D. Smith\t" "Tomás Ó Cathasaigh\t" "Hiram G. Larew\t" "\t" ...
## $ author : chr "American Anthropologist\t" "Ériu\t" "Economic Botany\t" "The Illustrated Magazine of Art\t" ...
## $ journaltitle : chr "79\t" "54\t" "41\t" "1\t" ...
## $ volume : chr "3\t" "\t" "1\t" "3\t" ...
## $ issue : chr "1977-09-01T00:00:00Z\t" "2004-01-01T00:00:00Z\t" "1987-01-01T00:00:00Z\t" "1853-01-01T00:00:00Z\t" ...
## $ pubdate : chr "pp. 598-617\t" "pp. 41-47\t" "pp. 33-40\t" "pp. 171-172\t" ...
## $ pagerange : chr "American Anthropological Association\tWiley\t" "Royal Irish Academy\t" "New York Botanical Garden Press\tSpringer\t" "\t" ...
## $ publisher : chr "fla\t" "fla\t" "fla\t" "fla\t" ...
## $ type : logi NA NA NA NA NA NA ...
## $ reviewed.work: logi NA NA NA NA NA NA ...
I think it is because of lines like this one (check "Thorn" and "Minus"):
readLines("citations.CSV")[82]
[1] "10.2307/3642839,10.2307/3642839\t,\"Thorn\" and \"Minus\" in Hieroglyphic Luvian Orthography\t,H. Craig Melchert\t,Anatolian Studies\t,38\t,\t,1988-01-01T00:00:00Z\t,pp. 29-42\t,British Institute at Ankara\t,fla\t,\t,"
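As a small illustration of what quote = "" does, the sketch below reads a made-up two-row file whose first title contains embedded double quotes, similar to the line above. The data is invented and much simpler than citations.CSV.

```r
# Invented sample: the first title contains embedded double quotes
lines <- c("id,title",
           '1,"Thorn" and "Minus" in Orthography',
           "2,Plain title")

# quote = "" treats the quote characters as ordinary text
cit_demo <- read.csv(text = lines, quote = "", stringsAsFactors = FALSE)
nrow(cit_demo)
cit_demo$title[1]
```

The trade-off is that a comma inside a quoted field would now split the field, so this only fits files like the one above where the quotes are part of the text rather than field delimiters.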