Difference Between read.csv() and read.csv2() in R

Difference between read.csv() and read.csv2() in R

They are (almost) the same function: both are thin wrappers around read.table. The only difference is their default parameters. Look at the source code:

> read.csv
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x5e3fa88>
<environment: namespace:utils>
> read.csv2
function (file, header = TRUE, sep = ";", quote = "\"", dec = ",",
fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x5c0a330>
<environment: namespace:utils>

From the documentation (see ?read.table):

read.csv and read.csv2 are identical to read.table except for the defaults. They are intended for reading ‘comma separated value’ files (‘.csv’) or (read.csv2) the variant used in countries that use a comma as decimal point and a semicolon as field separator.
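
A minimal sketch of what that means in practice, using the text= argument instead of a file so it is self-contained:

euro <- "x;y\n1,5;2,25\n3,0;4,75"

a <- read.csv2(text = euro)
b <- read.table(text = euro, header = TRUE, sep = ";", quote = "\"",
                dec = ",", fill = TRUE, comment.char = "")

identical(a, b)   # TRUE -- read.csv2 simply supplies these defaults
str(a)            # both columns are parsed as numeric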

What is the practical difference between read_csv and read.csv? When should one be used over the other?

Quoted from the data-import chapter of R for Data Science:

11.2.1 Compared to base R

If you’ve used R before, you might wonder why we’re not using read.csv(). There are a few good reasons to favour readr functions over the base equivalents:

They are typically much faster (~10x) than their base equivalents. Long running jobs have a progress bar, so you can see what’s happening. If you’re looking for raw speed, try data.table::fread(). It doesn’t fit quite so well into the tidyverse, but it can be quite a bit faster.

They produce tibbles, they don’t convert character vectors to factors*, use row names, or munge the column names. These are common sources of frustration with the base R functions.

They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.


*Note that as of R 4.0.0:

R [...] uses a stringsAsFactors = FALSE default, and hence by default no longer converts strings to factors in calls to data.frame() and read.table().
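
A minimal sketch of the first two points, writing a tiny file to a temporary path (the column name "my col" and the data are made up for illustration, and readr is assumed to be installed):

library(readr)

tmp <- tempfile(fileext = ".csv")
writeLines(c("my col,type", "1,a", "2,b"), tmp)

base_df  <- read.csv(tmp)    # data.frame; the name "my col" is munged to "my.col"
tidy_tbl <- read_csv(tmp)    # tibble; the name "my col" is kept as-is

names(base_df)    # "my.col" "type"
names(tidy_tbl)   # "my col" "type"
class(tidy_tbl)   # includes "tbl_df", i.e. a tibble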

read.csv vs. read.table

read.csv is a fairly thin wrapper around read.table; I would be quite surprised if you couldn't exactly replicate the behaviour of read.csv by supplying the correct arguments to read.table. However, some of those arguments (such as the way that quotation marks or comment characters are handled) could well change the speed and behaviour of the function.

In particular, this is the full definition of read.csv:

function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", 
fill = TRUE, comment.char = "", ...) {
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
}

So, as stated, it's just read.table with a particular set of options.
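
As a small illustration of how one of those arguments changes behaviour, consider comment.char (a minimal sketch using the text= argument rather than a file):

txt <- "x,y\n# this line is a comment\n1,2\n3,4"

read.csv(text = txt)                       # comment.char = "": the "#" line is read as data
read.csv(text = txt, comment.char = "#")   # the "#" line is dropped before parsing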

As @Chase notes, the help page for read.table() says just as much under Details:

read.csv and read.csv2 are identical to read.table except for the defaults. They are intended for reading ‘comma separated value’ files (‘.csv’) or (read.csv2) the variant used in countries that use a comma as decimal point and a semicolon as field separator.

What are objective benefits and drawbacks of read.csv() versus read_csv()?

read_csv is significantly faster for large .csv files. See here for more information. Personally, I pretty much always use read_csv by default.
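
If you want to check this on your own data, a rough comparison is easy to run (the file name "big.csv" is a placeholder; readr is assumed to be installed):

# crude timing comparison on a hypothetical local file
system.time(base_df <- read.csv("big.csv"))
system.time(tidy_df <- readr::read_csv("big.csv"))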

Why are the results of read_csv larger than those of read.csv?

Selecting the right function is of course very important for writing efficient code.
The degree of optimization in different functions and packages affects how objects are stored, how large they are, and how fast operations on them run. Consider the following.

library(data.table)
library(microbenchmark)

a <- c(1:1000000)
b <- rnorm(1000000)
mat <- as.matrix(cbind(a, b))
df <- data.frame(a, b)
dt <- data.table::as.data.table(mat)

cat(paste0("Matrix size: ", object.size(mat),
           "\ndf size: ", object.size(df),
           " (", round(object.size(df)/object.size(mat), 2), ")",
           "\ndt size: ", object.size(dt),
           " (", round(object.size(dt)/object.size(mat), 2), ")"))
Matrix size: 16000568
df size: 12000848 (0.75)
dt size: 4001152 (0.25)

So here already you see that data.table stores the same data in a quarter of the space the matrix uses, and a third of the space the data.frame uses. Now for operation speed:

> microbenchmark(df[df$a*df$b>500,], mat[mat[,1]*mat[,2]>500,], dt[a*b>500])
Unit: milliseconds
                             expr       min        lq     mean   median        uq      max neval
         df[df$a * df$b > 500, ]  23.766201 24.136201 26.49715 24.34380 30.243300  32.7245   100
 mat[mat[, 1] * mat[, 2] > 500, ] 13.010000 13.146301 17.18246 13.41555 20.105450 117.9497   100
                  dt[a * b > 500]  8.502102  8.644001 10.90873  8.72690  8.879352 112.7840   100

Going by the medians above, data.table does the filtering about 2.8 times faster than base R on a data.frame, and about 1.5 times faster than using a matrix.

And that's not all: for almost any CSV import, data.table::fread will change your life. Give it a try instead of read.csv or read_csv.
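
A minimal sketch of what that swap looks like (the file name "big.csv" is a placeholder):

library(data.table)

DT <- fread("big.csv")                      # separator and column types are detected automatically
df <- fread("big.csv", data.table = FALSE)  # or ask for a plain data.frame instead of a data.table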

IMHO data.table doesn't get half the love it deserves: it is the best all-round package for performance, with a very concise syntax. Its vignettes should put you on your way quickly, and working through them is worth the effort, trust me.

For further performance improvements, Rfast contains many Rcpp implementations of popular functions, such as rowSort().
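
For example (a minimal sketch; it assumes the Rfast package is installed):

library(Rfast)

m <- matrix(rnorm(12), nrow = 3)
rowSort(m)    # sorts the values within each row of the numeric matrix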


EDIT: fread's speed comes from optimizations at the C level, involving memory-mapped file access and coerce-as-you-go techniques, which frankly are beyond my knowledge to explain. This post contains some explanations by the author, Matt Dowle, as well as an interesting, if short, exchange between him and the author of dplyr, Hadley Wickham.

Read in CSV in mixed English and French number format

Most of it is resolved with dec=",":

# saved your data to 'file.csv'
out <- read.csv("file.csv", dec=",")
head(out)
# praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38 pjnk
# 1 26.4 13.2 8.82 18.3 58.80 6.61 17.0 414,00 17.00 44.9 40.00
# 2 35.9 16.5 12.30 16.8 8.13 18.60 32.5 352,00 3.37 16.5 61.50
# 3 59.4 44.1 14.60 10.2 13.00 14.90 32.5 403,00 11.40 31.9 19.50
# 4 62.1 51.9 13.60 30.2 10.60 14.30 37.9 692,00 6.49 25.0 91.40
# 5 75.0 33.4 1.00 31.6 1.00 19.80 27.6 505,00 18.60 31.1 7.64
# 6 20.4 15.1 7.99 101.0 35.90 9.14 22.9 400,00 11.70 22.7 6.85

Only one column is still character:

sapply(out, class)
# praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38
# "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "character" "numeric" "numeric"
# pjnk
# "numeric"

This can be resolved post-read with:

ischr <- sapply(out, is.character)
# For the character column(s): chartr(",.", ". ", z) turns "," into "." and "." into a space,
# gsub() then drops those spaces (the former thousands separators), and as.numeric() converts.
out[ischr] <- lapply(out[ischr], function(z) as.numeric(gsub(" ", "", chartr(",.", ". ", z))))
out$PKA
# [1] 414 352 403 692 505 400 956 1407 207 3051

If you'd rather read it in without post-processing, you can use pipe(), assuming you have sed available (see note 1 below):

out <- read.csv(pipe("sed -E 's/([0-9])[.]([0-9])/\\1\\2/g;s/([0-9]),([0-9])/\\1.\\2/g' < file.csv"))

Notes:

  1. sed is generally available on Linux/macOS systems, and on Windows it is included with Rtools.

