R's read.csv Prepending 1st Column Name with Junk Text

You've got a Unicode UTF-8 BOM at the start of the file:

http://en.wikipedia.org/wiki/Byte_order_mark

A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.

R is giving you the ï and then converting the other two into dots as they are non-alphanumeric characters.
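
If you want to confirm the BOM really is there, one quick way is to inspect the first three bytes of the file (a small sketch; the file name is a placeholder):

# The UTF-8 BOM is the byte sequence EF BB BF
first_bytes <- readBin("my_file.csv", what = "raw", n = 3)
identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
# TRUE means the file starts with a BOM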

Here:

http://r.789695.n4.nabble.com/Writing-Unicode-Text-into-Text-File-from-R-in-Windows-td4684693.html

Duncan Murdoch suggests:

You can declare a file to be in encoding "UTF-8-BOM" if you want to
ignore a BOM on input

So try your read.csv with fileEncoding="UTF-8-BOM" or persuade your SQL wotsit to not output a BOM.

Otherwise you may as well test whether the first name starts with ï.. and strip it with substr (as long as you know you'll never have a column that genuinely starts like that...).
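
For example, something along these lines (a sketch; the file name is a placeholder):

df <- read.csv("my_file.csv")
# If the first column name begins with the mangled BOM ("ï.."), drop those three characters
if (substr(names(df)[1], 1, 3) == "ï..") {
  names(df)[1] <- substring(names(df)[1], 4)
}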

Why is R reading UTF-8 header as text?

So I was going to give you instructions on how to manually open the file and check for and discard the BOM, but then I noticed this (in ?file):

As from R 3.0.0 the encoding "UTF-8-BOM" is accepted and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications).

which means that if you have a sufficiently new R interpreter,

read.csv("my_file.txt", fileEncoding="UTF-8-BOM", ...other args...)

should do what you want.

write.table writes an unwanted leading empty column to the header when the data has row names

Citing ?write.table, section CSV files:

By default there is no column name for a column of row names. If col.names = NA and row.names = TRUE a blank column name is added, which is the convention used for CSV files to be read by spreadsheets.

So you must do

write.table(a, 'a.txt', col.names=NA)

and you get

"" "A" "B" "C"
"A" 1 4 7
"B" 2 5 8
"C" 3 6 9

Progressive appending of data from read.csv

If the data is fairly small relative to your available memory, just read the data in and don't worry about it. After you have read in all the data and done some cleaning, save the result using save() and have your analysis scripts read that file back in using load(). Separating reading/cleaning scripts from analysis scripts is a good way to reduce this problem.
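
A minimal sketch of that split (object and file names are illustrative):

# cleaning script: read the raw CSV once, clean it, save the result
raw <- read.csv("big_input.csv")
clean <- na.omit(raw)  # whatever cleaning you actually need
save(clean, file = "clean_data.RData")

# analysis script: load the cleaned object instead of re-reading the CSV
load("clean_data.RData")
summary(clean)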

A way to speed up read.csv is to use the nrows and colClasses arguments. Since you say that you know the number of rows in each file, telling R this will help speed up the reading. You can extract the column classes using

colClasses <- sapply(read.csv(file, nrows=100), class)

then give the result to the colClasses argument.
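
Putting it together, the full read might look like this (a sketch; known_nrows stands in for the row count you already know for the file):

dat <- read.csv(file, colClasses = colClasses, nrows = known_nrows)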

If the data is getting close to being too large, you may consider processing individual files and saving intermediate versions. There are a number of related discussions about managing memory on the site that cover this topic.

On memory usage tricks:
Tricks to manage the available memory in an R session

On using the garbage collector function:
Forcing garbage collection to run in R with the gc() command

How to add a row of text above the output table when using write.table to copy and paste a data frame?


Clipboard alone

writeLines(
  c("table name is mtcars",
    capture.output(write.table(mtcars[1:3, ], sep = "\t", row.names = FALSE))),
  "clipboard")

... and then paste into Excel. I've run into issues in the past when the data has embedding issues (embedded tabs, etc.) and perhaps something in the chain (including "me") did not handle everything correctly.

On Windows, one could replace writeLines(.., "clipboard") with writeClipboard, but that function is Windows-only. On other OSes, one can install the clipr package for clipboard reading/writing.
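
With clipr that might look like the following (a sketch; clipr::write_clip is the cross-platform counterpart of writing to "clipboard"):

# install.packages("clipr")
library(clipr)
write_clip(c("table name is mtcars",
             capture.output(write.table(mtcars[1:3, ], sep = "\t", row.names = FALSE))))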

Using files

writeLines("table name is mtcars", con = "somefile.csv")
write.table(mtcars[1:3,], "somefile.csv", row.names = FALSE, append = TRUE, sep = ",")
# Warning in write.table(mtcars[1:3, ], "somefile.csv", row.names = FALSE, :
# appending column names to file

(One cannot use write.csv here, since it does not honour append=TRUE and complains "attempt to set 'append' ignored".)

Resulting file:

table name is mtcars
"mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
21,6,160,110,3.9,2.62,16.46,0,1,4,4
21,6,160,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108,93,3.85,2.32,18.61,1,1,4,1

It opens in Excel as

[Excel screenshot]


