Difference between read.csv() and read.csv2() in R
They are (almost) the same function: both are thin wrappers around read.table. The only difference is the default parameters. Look at the source code:
> read.csv
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x5e3fa88>
<environment: namespace:utils>
> read.csv2
function (file, header = TRUE, sep = ";", quote = "\"", dec = ",",
fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x5c0a330>
<environment: namespace:utils>
From the docs (see ?read.table):
read.csv and read.csv2 are identical to read.table except for the defaults. They are intended for reading ‘comma separated value’ files (‘.csv’) or (read.csv2) the variant used in countries that use a comma as decimal point and a semicolon as field separator.
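To see those defaults in action, here is a minimal, self-contained sketch (the temp file stands in for a real European-style .csv) showing that read.csv2() is exactly read.csv() with sep = ";" and dec = ",":

```r
# A semicolon-separated file using comma as the decimal mark (European style)
f <- tempfile(fileext = ".csv")
writeLines(c("x;y", "1,5;2,5", "3,0;4,25"), f)

eu   <- read.csv2(f)                         # defaults: sep = ";", dec = ","
same <- read.csv(f, sep = ";", dec = ",")    # same result via read.csv

eu$x
# [1] 1.5 3.0
identical(eu, same)
# [1] TRUE
```

Both calls funnel into the same read.table invocation, so the results are identical, not merely similar.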
What is the practical difference between read_csv and read.csv? When should one be used over another?
Quoted from the introduction to readr in R for Data Science:
11.2.1 Compared to base R
If you’ve used R before, you might wonder why we’re not using read.csv(). There are a few good reasons to favour readr functions over the base equivalents:
- They are typically much faster (~10x) than their base equivalents. Long-running jobs have a progress bar, so you can see what’s happening. If you’re looking for raw speed, try data.table::fread(). It doesn’t fit quite so well into the tidyverse, but it can be quite a bit faster.
- They produce tibbles; they don’t convert character vectors to factors*, use row names, or munge the column names. These are common sources of frustration with the base R functions.
- They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.
*Note that from R 4.0.0, R [...] uses a stringsAsFactors = FALSE default, and hence by default no longer converts strings to factors in calls to data.frame() and read.table().
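That footnote is easy to check in a couple of lines of base R (the column names here are made up for illustration):

```r
# On R >= 4.0.0 strings stay character by default...
df <- data.frame(name = c("ann", "bob"), score = c(1, 2))
class(df$name)
# [1] "character"   (was "factor" before R 4.0.0)

# ...and the old behaviour must now be requested explicitly:
df2 <- data.frame(name = c("ann", "bob"), stringsAsFactors = TRUE)
class(df2$name)
# [1] "factor"
```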
read.csv vs. read.table
read.csv is a fairly thin wrapper around read.table; I would be quite surprised if you couldn't exactly replicate the behaviour of read.csv by supplying the correct arguments to read.table. However, some of those arguments (such as the way that quotation marks or comment characters are handled) could well change the speed and behaviour of the function.
In particular, this is the full definition of read.csv:
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
fill = TRUE, comment.char = "", ...) {
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
}
so as stated it's just read.table with a particular set of options.
As @Chase states in the comments below, the help page for read.table() says just as much under Details:
read.csv and read.csv2 are identical to read.table except for the defaults. They are intended for reading ‘comma separated value’ files (‘.csv’) or (read.csv2) the variant used in countries that use a comma as decimal point and a semicolon as field separator.
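That exact replication is straightforward to verify; a short sketch using the text= argument (so no file is needed), passing read.csv's defaults to read.table by hand:

```r
txt <- "a,b\n1,2\n3,4"

via_csv   <- read.csv(text = txt)
via_table <- read.table(text = txt, header = TRUE, sep = ",",
                        quote = "\"", dec = ".", fill = TRUE,
                        comment.char = "")

identical(via_csv, via_table)
# [1] TRUE
```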
What are objective benefits and drawbacks of read.csv() versus read_csv()?
read_csv is significantly faster for large .csv files. See here for more information. Personally, I pretty much always use read_csv by default.
Why are the results of read_csv larger than those of read.csv?
Selecting the right functions is of course very important for writing efficient code.
The degree of optimization present in different functions and packages will impact how objects are stored, their size, and the speed of operations running on them. Please consider the following.
library(data.table)
library(microbenchmark)  # needed for the timings below
a <- c(1:1000000)
b <- rnorm(1000000)
mat <- as.matrix(cbind(a, b))
df <- data.frame(a, b)
dt <- data.table::as.data.table(mat)
cat(paste0("Matrix size: ",object.size(mat), "\ndf size: ", object.size(df), " (",round(object.size(df)/object.size(mat),2) ,")\ndt size: ", object.size(dt), " (",round(object.size(dt)/object.size(mat),2),")" ))
Matrix size: 16000568
df size: 12000848 (0.75)
dt size: 4001152 (0.25)
So here already you see that data.table stores the same data using 4 times less space than the matrix does, and 3 times less than the data.frame. Now about operation speed:
> microbenchmark(df[df$a*df$b>500,], mat[mat[,1]*mat[,2]>500,], dt[a*b>500])
Unit: milliseconds
expr min lq mean median uq max neval
df[df$a * df$b > 500, ] 23.766201 24.136201 26.49715 24.34380 30.243300 32.7245 100
mat[mat[, 1] * mat[, 2] > 500, ] 13.010000 13.146301 17.18246 13.41555 20.105450 117.9497 100
dt[a * b > 500] 8.502102 8.644001 10.90873 8.72690 8.879352 112.7840 100
data.table does the filtering 1.7 times faster than base R on a data.frame, and 2.5 times faster than using a matrix.
And that's not all: for almost any CSV import, using data.table::fread will change your life. Give it a try instead of read.csv or read_csv.
IMHO data.table doesn't get half the love it deserves: it is the best all-round package for performance, with a very concise syntax. The package's vignettes should put you on your way quickly, and that is worth the effort, trust me.
For further performance improvements, Rfast contains many Rcpp implementations of popular functions and problems, such as rowSort() for example.
EDIT: fread's speed is due to optimizations done at the C-code level, involving the use of pointers for memory mapping and coerce-as-you-go techniques, which frankly are beyond my knowledge to explain. This post contains some explanations by the author, Matt Dowle, as well as an interesting, if short, piece of discussion between him and the author of dplyr, Hadley Wickham.
Read in CSV in mixed English and French number format
Most of it is resolved with dec=",":
# saved your data to 'file.csv'
out <- read.csv("file.csv", dec=",")
head(out)
# praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38 pjnk
# 1 26.4 13.2 8.82 18.3 58.80 6.61 17.0 414,00 17.00 44.9 40.00
# 2 35.9 16.5 12.30 16.8 8.13 18.60 32.5 352,00 3.37 16.5 61.50
# 3 59.4 44.1 14.60 10.2 13.00 14.90 32.5 403,00 11.40 31.9 19.50
# 4 62.1 51.9 13.60 30.2 10.60 14.30 37.9 692,00 6.49 25.0 91.40
# 5 75.0 33.4 1.00 31.6 1.00 19.80 27.6 505,00 18.60 31.1 7.64
# 6 20.4 15.1 7.99 101.0 35.90 9.14 22.9 400,00 11.70 22.7 6.85
Only one column comes in as a string:
sapply(out, class)
# praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38
# "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "character" "numeric" "numeric"
# pjnk
# "numeric"
This can be resolved post-read with:
ischr <- sapply(out, is.character)
# chartr swaps ',' -> '.' and '.' -> ' '; gsub then drops the spaces
# (the former '.' thousands separators), leaving clean numbers to convert
out[ischr] <- lapply(out[ischr], function(z) as.numeric(gsub(" ", "", chartr(",.", ". ", z))))
out$PKA
# [1] 414 352 403 692 505 400 956 1407 207 3051
If you'd rather read it in without post-processing, you can pipe(.) it, assuming you have sed available[^1]:
out <- read.csv(pipe("sed -E 's/([0-9])[.]([0-9])/\\1\\2/g;s/([0-9]),([0-9])/\\1.\\2/g' < file.csv"))
Notes:
sed is generally available on all Linux/macOS systems, and on Windows computers it is included with Rtools.
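If sed isn't available, the same two substitutions can be done in base R with gsub() before converting. A self-contained sketch on made-up French-formatted values ('.' as thousands separator, ',' as decimal mark):

```r
x <- c("414,00", "1.407,00", "3.051,25")
x <- gsub("([0-9])[.]([0-9])", "\\1\\2", x)  # drop '.' thousands: "1.407,00" -> "1407,00"
x <- gsub("([0-9]),([0-9])", "\\1.\\2", x)   # ',' decimal to '.': "1407,00"  -> "1407.00"
as.numeric(x)
# [1]  414.00 1407.00 3051.25
```

For a whole file, apply the same two gsub() calls to readLines() output and feed the result to read.csv(text = ...).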