How to read a CSV file with unknown formatting and unknown encoding in an R program? (example file provided)
This is a Windows-related encoding problem.
When I open your file in Notepad++ it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading files with unusual encodings into R: pass a file() connection that declares the encoding. In your case this seems to do the trick:
read.delim(file("temp.csv", encoding = "UCS-2LE"))
(adapted from R: can't read unicode text files even when specifying the encoding).
BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.
As for your second question - could we build logic in R that guesses the encoding and delimiter and reads in many types of file without us specifying them explicitly? - yes, that would certainly be possible. Whether it is desirable I'm not sure.
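For the delimiter half of that guess, here is a minimal base-R sketch. The heuristic - pick whichever candidate separator occurs most often on the first line - is my own assumption for illustration, not a robust parser:

```r
# Guess the field separator by counting candidate characters on the first line
guess_sep <- function(path, candidates = c(",", "\t", ";", "|")) {
  first_line <- readLines(path, n = 1)
  counts <- vapply(candidates, function(s)
    lengths(regmatches(first_line, gregexpr(s, first_line, fixed = TRUE))),
    integer(1))
  candidates[which.max(counts)]
}

# Demo on a small tab-separated file
tmp <- tempfile(fileext = ".txt")
writeLines("id\tname\tvalue", tmp)
guess_sep(tmp)  # "\t"
```

A real implementation would also need to check that the separator count is consistent across several lines, and handle quoted fields.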
How to read an unknown separator csv file in to R
This may introduce unforeseen errors, but it appears to provide the expected output:
library(data.table)
library(tidyverse)
test <- fread(file = "~/Downloads/1.csv")
#> Warning in fread(file = "~/Downloads/1.csv"): Detected 1 column names but the
#> data has 140 columns (i.e. invalid file). Added 139 extra default column names
#> at the end.
test_df <- as.data.frame(matrix(unlist(test, use.names = FALSE), ncol = 4, byrow = TRUE))
test_df %>%
  separate(V1, c("id", "station"), extra = "merge") %>%
  mutate(station = gsub(pattern = "0", replacement = "", x = station)) %>%
  rename("lon" = V2,
         "lat" = V3,
         "RASTERVALU" = V4)
#> id station lon lat RASTERVALU
#> 1 1 东四 116.417 39.929 0.240687
#> 2 2 天坛 116.407 39.886 0.0992821
#> 3 3 官园 116.339 39.929 0.124302
#> 4 4 万寿西宫 116.352 39.878 0.239412
#> 5 5 奥体中心 116.397 39.982 0.236881
#> 6 6 农展馆 116.461 39.937 0.23076
#> 7 7 万柳 116.287 39.987 0.201353
#> 8 8 北部新区 116.174 40.09 0.170883
#> 9 9 植物园 116.207 40.002 0.210636
#> 10 10 丰台花园 116.279 39.863 0.225224
#> 11 11 云岗 116.146 39.824 0.23084
#> 12 12 古城 116.184 39.914 0.17514
#> 13 13 房山良乡 116.136 39.742 0.243377
#> 14 14 大兴黄村镇 116.404 39.718 0.295714
#> 15 15 亦庄开发区 116.506 39.795 0.315679
#> 16 16 通州新城 116.663 39.886 0.255555
#> 17 17 顺义新城 116.655 40.127 0.212804
#> 18 18 昌平镇 116.23 40.217 0.160067
#> 19 19 门头沟龙泉镇 116.106 39.937 0.17251
#> 20 20 平谷镇 117.1 40.143 0.275457
#> 21 21 怀柔镇 116.628 40.328 0.177003
#> 22 22 密云镇 116.832 40.37 0.253771
#> 23 23 延庆镇 115.972 40.453 0.219738
#> 24 24 昌平定陵 116.22 40.292 0.15908
#> 25 25 京西北八达岭 115.988 40.365 -9999
#> 26 26 京东北密云水库 116.911 40.499 0.173666
#> 27 27 京东东高村 117.12 40.1 0.276452
#> 28 28 京东南永乐店 116.783 39.712 0.278231
#> 29 29 京南榆垡 116.3 39.52 0.533654
#> 30 30 京西南琉璃河 116 39.58 0.449057
#> 31 31 前门东大街 116.395 39.899 0.236876
#> 32 32 永定门内大街 116.394 39.876 0.148231
#> 33 33 西直门北大街 116.349 39.954 0.234347
#> 34 34 南三环西路 116.368 39.856 0.177043
#> 35 35 东四环北路 116.483 39.939 0.253252
Created on 2021-07-26 by the reprex package (v2.0.0)
How to detect the right encoding for read.csv?
First of all, as discussed in a more general question on StackOverflow, it is not possible to detect the encoding of a file with 100% certainty.
I've struggled with this many times and arrived at a non-automatic solution:
Use iconvlist() to get all possible encodings:
codepages <- setNames(iconvlist(), iconvlist())
Then read the data using each of them:
x <- lapply(codepages, function(enc) try(read.table("encoding.asc",
                                                    fileEncoding = enc,
                                                    nrows = 3, header = TRUE,
                                                    sep = "\t"))) # you get lots of errors/warnings here
The important thing here is to know the structure of the file (separator, headers). Set the encoding using the fileEncoding argument, and read only a few rows.
Now you can look at the results:
unique(do.call(rbind, sapply(x, dim)))
# [,1] [,2]
# 437 14 2
# CP1200 3 29
# CP12000 0 1
It seems the correct one is the one with 3 rows and 29 columns, so let's see them:
maybe_ok <- sapply(x, function(x) isTRUE(all.equal(dim(x), c(3,29))))
codepages[maybe_ok]
# CP1200 UCS-2LE UTF-16 UTF-16LE UTF16 UTF16LE
# "CP1200" "UCS-2LE" "UTF-16" "UTF-16LE" "UTF16" "UTF16LE"
You can look at the data too:
x[maybe_ok]
For your file, all of these encodings return identical data (partly because, as you can see, there is some redundancy).
If you don't know the specifics of your file, you need to use readLines with some changes in the workflow (e.g. you can't use fileEncoding, you must use length instead of dim, and you need more magic to find the correct ones).
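A sketch of that readLines variant. To keep it self-contained, a small UTF-16LE file stands in for the unknown input; the filtering rule - keep encodings that decode without error and yield the expected number of lines - mirrors the dim() check above:

```r
codepages <- setNames(iconvlist(), iconvlist())

# Write a tiny UTF-16LE file to stand in for the unknown input
tmp <- tempfile()
writeBin(iconv("a\tb\n1\t2\n", to = "UTF-16LE", toRaw = TRUE)[[1]], tmp)

# Try readLines() with every encoding; open the file through a connection
# that declares the encoding (fileEncoding is not available here)
x <- lapply(codepages, function(enc) {
  con <- file(tmp, encoding = enc)
  on.exit(close(con))
  try(suppressWarnings(readLines(con, n = 2)), silent = TRUE)
})

# length() instead of dim(): keep encodings that yielded 2 readable lines
maybe_ok <- sapply(x, function(r) !inherits(r, "try-error") && length(r) == 2)
head(codepages[maybe_ok])
```

As with the read.table approach, several encodings will pass this filter, so you still have to inspect the candidates by eye.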
How to read a file with unknown encoding (FDF)
The encoding of the file is mixed.
Most of the PDF seems to be in latin1, as the first characters should be "%âãÏÓ". (See: PDF File header sequence: Why '25 e2 e3 cf d3' bits stream used in many document?)
However the text within the "/V" command is encoded in UTF-16. The "fe ff" bytes at its start are a byte order mark, indicating big-endian byte order.
You will probably need to resort to reading the raw bytes with readBin and converting them to the right encoding; PDFs are horrible to parse.
See this post on how to read files with mixed encodings using readBin: http://stat545.com/block034_useR-encoding-case-study.html. The iconv function may be useful as well for encoding conversion.
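A self-contained sketch of that approach. The file layout and byte offsets below are invented for illustration; a real FDF would need actual parsing to locate the /V string:

```r
# Build a stand-in file: an ASCII prefix followed by a UTF-16 value
# ("Hi" as fe ff 00 48 00 69, i.e. BOM + two big-endian code units)
tmp <- tempfile()
writeBin(c(charToRaw("/V ("), as.raw(c(0xfe, 0xff, 0x00, 0x48, 0x00, 0x69))), tmp)

# Read the raw bytes, then convert just the UTF-16 span with iconv();
# from = "UTF-16" lets the BOM determine the byte order
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
value <- iconv(list(bytes[5:10]), from = "UTF-16", to = "UTF-8")
value  # "Hi"
```

Passing a list of raw vectors to iconv() is the documented way to convert bytes you have read with readBin, and it avoids guessing where any embedded nulls would truncate a character string.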
Read a CSV file in R with Spanish characters (´, ñ)
Use the encoding option inside your read.csv call:
religion <- read.csv("religion.csv", header = TRUE, sep = ",", dec = ".",
                     fill = TRUE, comment.char = "", strip.white = TRUE,
                     stringsAsFactors = TRUE, encoding = "UTF-8")
Remember, you can always check the documentation in R using help(function).
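One caveat (my addition, not part of the original answer): read.csv's encoding argument only declares how the strings are marked, while fileEncoding actually re-encodes the file as it is read. So if accented characters still come out garbled, fileEncoding is often the argument you want:

```r
# Write a small UTF-8 CSV containing an enye, then read it back while
# telling read.csv what encoding the file itself is in
tmp <- tempfile(fileext = ".csv")
writeBin(charToRaw("nombre,valor\nni\u00f1o,1\n"), tmp)

religion <- read.csv(tmp, fileEncoding = "UTF-8")
religion$nombre  # "niño" as a proper 4-character string
```

The difference matters most on Windows, where the native encoding is not UTF-8 and a UTF-8 file read without fileEncoding can show mojibake.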