R's read.csv prepending 1st column name with junk text
You've got a Unicode UTF-8 BOM at the start of the file:
http://en.wikipedia.org/wiki/Byte_order_mark
A text editor or web browser interpreting the text as ISO-8859-1 or
CP1252 will display the characters ï»¿ for this.
R is giving you the ï and then converting the other two into dots as they are non-alphanumeric characters.
Here:
http://r.789695.n4.nabble.com/Writing-Unicode-Text-into-Text-File-from-R-in-Windows-td4684693.html
Duncan Murdoch suggests:
You can declare a file to be in encoding "UTF-8-BOM" if you want to
ignore a BOM on input
So try your read.csv with fileEncoding="UTF-8-BOM", or persuade your SQL wotsit to not output a BOM.
Otherwise you may as well test if the first name starts with ï.. and strip it with substr (as long as you know you'll never have a column that does start like that genuinely...)
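A minimal sketch of that fallback, assuming the mangled name really does come out as ï.. followed by the real name (the exact junk can differ by platform and locale). The helper name strip_bom_prefix is made up for illustration:

```r
# Hypothetical helper: drop a leading "ï.." from the first column name.
strip_bom_prefix <- function(df) {
  bom_prefix <- "\u00ef.."                       # "ï" followed by two dots
  first <- names(df)[1]
  if (substr(first, 1, nchar(bom_prefix)) == bom_prefix) {
    names(df)[1] <- substr(first, nchar(bom_prefix) + 1, nchar(first))
  }
  df
}

# Stand-in data frame with a mangled first name (not from a real file).
df <- data.frame(1:2, 3:4)
names(df) <- c("\u00ef..id", "value")
cleaned <- strip_bom_prefix(df)
names(cleaned)
```

Data frames whose first name does not start with the prefix pass through unchanged.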
Why is R reading UTF-8 header as text?
So I was going to give you instructions on how to manually open the file and check for and discard the BOM, but then I noticed this (in ?file):
As from R 3.0.0 the encoding "UTF-8-BOM" is accepted and will
remove a Byte Order Mark if present (which it often is for files
and webpages generated by Microsoft applications).
which means that if you have a sufficiently new R interpreter,
read.csv("my_file.txt", fileEncoding="UTF-8-BOM", ...other args...)
should do what you want.
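For anyone who wants to see this in action, here is a small self-contained sketch. It writes a throwaway CSV that starts with the three BOM bytes (EF BB BF), checks for them by hand with readBin, and then reads the file with the encoding declared. The file contents and variable names are invented for the example.

```r
# Write a small CSV that starts with the UTF-8 BOM bytes EF BB BF.
path <- tempfile(fileext = ".csv")
con <- file(path, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("id,value\n1,a\n2,b\n"), con)
close(con)

# Manual check: compare the first three bytes of the file with the BOM.
has_bom <- identical(readBin(path, "raw", n = 3),
                     as.raw(c(0xEF, 0xBB, 0xBF)))

# With the encoding declared, the BOM is dropped on input (R >= 3.0.0).
df <- read.csv(path, fileEncoding = "UTF-8-BOM")
names(df)
```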
write.table writes unwanted leading empty column to header when has rownames
Citing ?write.table, section CSV files:
By default there is no column name for a column of row names. If col.names = NA and row.names = TRUE a blank column name is added, which is the convention used for CSV files to be read by spreadsheets.
So you must do
write.table(a, 'a.txt', col.names=NA)
and you get
"" "A" "B" "C"
"A" 1 4 7
"B" 2 5 8
"C" 3 6 9
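A quick round trip shows that a file written this way reads back cleanly. The question's `a` is not shown, so the matrix below is an assumption chosen to match the output above:

```r
# Assumed stand-in for `a`: a 3x3 matrix with row and column names A, B, C.
a <- matrix(1:9, nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

path <- tempfile(fileext = ".txt")
write.table(a, path, col.names = NA)   # blank header cell above the row names
header_line <- readLines(path)[1]

# Read it back, using the first column as row names.
b <- read.table(path, header = TRUE, row.names = 1)
b
```

Without col.names = NA the header would have one field fewer than the data rows, which is the layout read.table expects but spreadsheets do not.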
Progressive appending of data from read.csv
If the data is fairly small relative to your available memory, just read the data in and don't worry about it. After you have read in all the data and done some cleaning, save the file using save() and have your analysis scripts read in that file using load(). Separating reading/cleaning scripts from analysis scripts is a good way to reduce this problem.
One way to speed up read.csv is to use the nrows and colClasses arguments. Since you say that you know the number of rows in each file, telling R this will help speed up the reading. You can extract the column classes using
colClasses <- sapply(read.csv(file, nrows = 100), class)
then give the result to the colClasses argument.
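Put together, that might look like the sketch below. The file is a generated stand-in so the example runs on its own, and the 500-row count is an assumption standing in for your known row count.

```r
# Generated stand-in file; in practice `path` would be one of your CSVs.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:500, score = runif(500), label = "x"),
          path, row.names = FALSE)

# Guess the column classes from the first 100 rows only.
col_classes <- sapply(read.csv(path, nrows = 100), class)

# Read the whole file with the classes and the row count supplied up front.
full <- read.csv(path, colClasses = col_classes, nrows = 500)
str(full)
```

One caveat: classes guessed from a sample can be wrong for later rows (for example, a column that looks integer at first but contains decimals further down), in which case read.csv stops with an error and the class for that column needs adjusting by hand.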
If the data is getting close to being too large, you may consider processing individual files and saving intermediate versions. There are a number of related discussions on managing memory on the site that cover this topic.
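A rough sketch of the file-by-file approach, using save() and load() as above. The directory, file names, and the cleaning step are placeholders invented so the example runs on its own:

```r
# Stand-in input files (hypothetical names), created so the sketch runs.
work_dir <- tempfile("chunks")
dir.create(work_dir)
for (i in 1:3) {
  write.csv(data.frame(file_id = i, x = 1:4),
            file.path(work_dir, sprintf("part%d.csv", i)),
            row.names = FALSE)
}

# Reading/cleaning script: handle one file at a time, save the cleaned version.
csv_files <- list.files(work_dir, pattern = "\\.csv$", full.names = TRUE)
for (f in csv_files) {
  chunk <- read.csv(f)
  chunk <- chunk[chunk$x > 1, ]                  # placeholder cleaning step
  save(chunk, file = sub("\\.csv$", ".RData", f))
  rm(chunk)                                      # release before the next file
}

# Analysis script: load the cleaned pieces and combine them.
rdata_files <- list.files(work_dir, pattern = "\\.RData$", full.names = TRUE)
pieces <- lapply(rdata_files, function(f) {
  env <- new.env()
  load(f, envir = env)                           # restores `chunk` into env
  env$chunk
})
combined <- do.call(rbind, pieces)
```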
On memory usage tricks:
Tricks to manage the available memory in an R session
On using the garbage collector function:
Forcing garbage collection to run in R with the gc() command
How to add a row of text above the output table when using write.table to copy and paste a data frame?
Clipboard alone
writeLines(
  c("table name is mtcars",
    capture.output(write.table(mtcars[1:3, ], sep = "\t", row.names = FALSE))),
  "clipboard")
... and then paste into Excel. I've run into issues in the past when the data has embedding issues (embedded tabs, etc) and perhaps something in the chain (including "me") did not handle all things correctly.
On Windows, one could replace writeLines(.., "clipboard") with writeClipboard, but that function is Windows-only. On other OSes, one can install the clipr package for clipboard reading/writing.
Using files
writeLines("table name is mtcars", con = "somefile.csv")
write.table(mtcars[1:3,], "somefile.csv", row.names = FALSE, append = TRUE, sep = ",")
# Warning in write.table(mtcars[1:3, ], "somefile.csv", row.names = FALSE, :
# appending column names to file
(One cannot use write.csv, since it does not tolerate append=TRUE, complaining attempt to set 'append' ignored.)
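If the warning is a nuisance, one alternative (a sketch, not the only way) is to open a single file connection and write both pieces through it. No append = TRUE is needed then, so the warning does not appear, and the file contents come out the same. A temporary path is used here in place of somefile.csv.

```r
path <- tempfile(fileext = ".csv")

# Open one connection and send both the title line and the table through it.
con <- file(path, open = "w")
writeLines("table name is mtcars", con)
write.table(mtcars[1:3, ], con, row.names = FALSE, sep = ",")
close(con)

file_lines <- readLines(path)

# To read the table back in R, skip the title line.
tbl <- read.csv(path, skip = 1)
```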
Resulting file:
table name is mtcars
"mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
21,6,160,110,3.9,2.62,16.46,0,1,4,4
21,6,160,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108,93,3.85,2.32,18.61,1,1,4,1