How to Read CSV Data with Unknown Encoding in R

How do you read a CSV file with unknown formatting and unknown encoding in R? (example file provided)

This is a Windows-related encoding problem.

When I open your file in Notepad++ it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading files with unusual encodings into R; in your case the following seems to work:

# file() creates a connection that decodes the bytes as UCS-2LE;
# read.delim() then reads the tab-separated content through it
read.delim(file("temp.csv", encoding = "UCS-2LE"))

(adapted from R: can't read unicode text files even when specifying the encoding).
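If you prefer readr, the same idea can be expressed through its locale() argument. This is an alternative sketch of mine, not part of the original answer; it assumes the same file name and uses "UTF-16LE", which for a file like this is effectively another name for UCS-2LE:

library(readr)

# locale(encoding = ...) tells readr how to decode the raw bytes before parsing;
# the file is tab-separated, so read_tsv() is the appropriate reader
temp <- read_tsv("temp.csv", locale = locale(encoding = "UTF-16LE"))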

BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.

As for your second question - could we build the same logic in R to guess the encoding and delimiter and read in many types of file without explicitly specifying either? Yes, this would certainly be possible. Whether it is desirable, I'm not sure.
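For what it's worth, rough versions of that logic already exist in packages; the sketch below is my addition, not part of the original answer. readr::guess_encoding() ranks likely encodings (via stringi), and data.table::fread() guesses the delimiter on its own:

library(readr)
library(data.table)

# guess_encoding() returns candidate encodings with confidence scores
guess_encoding("temp.csv")

# fread() auto-detects the separator; note that its encoding argument only
# accepts "UTF-8" or "Latin-1", so files in other encodings still need
# the connection trick shown above
dt <- fread("temp.csv", encoding = "UTF-8")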

How to read an unknown separator CSV file into R

This may introduce unforeseen errors, but it appears to provide the expected output:

library(data.table)
library(tidyverse)
test <- fread(file = "~/Downloads/1.csv")
#> Warning in fread(file = "~/Downloads/1.csv"): Detected 1 column names but the
#> data has 140 columns (i.e. invalid file). Added 139 extra default column names
#> at the end.

# fread() read the file as one long row of 140 fields; rebuild it as a
# 4-column data frame (id + station, lon, lat, RASTERVALU), row by row
test_df <- as.data.frame(matrix(unlist(test, use.names = FALSE), ncol = 4, byrow = TRUE))
test_df %>%
  # split "id station" into two columns, then drop stray "0" characters
  separate(V1, c("id", "station"), extra = "merge") %>%
  mutate(station = gsub(pattern = "0", replacement = "", x = station)) %>%
  rename("lon" = V2,
         "lat" = V3,
         "RASTERVALU" = V4)
#> id station lon lat RASTERVALU
#> 1 1 东四 116.417 39.929 0.240687
#> 2 2 天坛 116.407 39.886 0.0992821
#> 3 3 官园 116.339 39.929 0.124302
#> 4 4 万寿西宫 116.352 39.878 0.239412
#> 5 5 奥体中心 116.397 39.982 0.236881
#> 6 6 农展馆 116.461 39.937 0.23076
#> 7 7 万柳 116.287 39.987 0.201353
#> 8 8 北部新区 116.174 40.09 0.170883
#> 9 9 植物园 116.207 40.002 0.210636
#> 10 10 丰台花园 116.279 39.863 0.225224
#> 11 11 云岗 116.146 39.824 0.23084
#> 12 12 古城 116.184 39.914 0.17514
#> 13 13 房山良乡 116.136 39.742 0.243377
#> 14 14 大兴黄村镇 116.404 39.718 0.295714
#> 15 15 亦庄开发区 116.506 39.795 0.315679
#> 16 16 通州新城 116.663 39.886 0.255555
#> 17 17 顺义新城 116.655 40.127 0.212804
#> 18 18 昌平镇 116.23 40.217 0.160067
#> 19 19 门头沟龙泉镇 116.106 39.937 0.17251
#> 20 20 平谷镇 117.1 40.143 0.275457
#> 21 21 怀柔镇 116.628 40.328 0.177003
#> 22 22 密云镇 116.832 40.37 0.253771
#> 23 23 延庆镇 115.972 40.453 0.219738
#> 24 24 昌平定陵 116.22 40.292 0.15908
#> 25 25 京西北八达岭 115.988 40.365 -9999
#> 26 26 京东北密云水库 116.911 40.499 0.173666
#> 27 27 京东东高村 117.12 40.1 0.276452
#> 28 28 京东南永乐店 116.783 39.712 0.278231
#> 29 29 京南榆垡 116.3 39.52 0.533654
#> 30 30 京西南琉璃河 116 39.58 0.449057
#> 31 31 前门东大街 116.395 39.899 0.236876
#> 32 32 永定门内大街 116.394 39.876 0.148231
#> 33 33 西直门北大街 116.349 39.954 0.234347
#> 34 34 南三环西路 116.368 39.856 0.177043
#> 35 35 东四环北路 116.483 39.939 0.253252

Created on 2021-07-26 by the reprex package (v2.0.0)
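For files that are not malformed like this one, fread's automatic detection is usually enough on its own; a minimal sketch (the file name is hypothetical):

library(data.table)

# sep = "auto" is the default: fread inspects a sample of lines and picks
# the separator (comma, tab, semicolon, pipe, ...) and the header row itself
dt <- fread("some_file.txt")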

How to detect the right encoding for read.csv?

First of all, based on a more general question on Stack Overflow, it is not possible to detect the encoding of a file with 100% certainty.

I've struggled with this many times and have come to a non-automatic solution:

Use iconvlist to get all the encodings your platform supports:

# name the vector after itself so the lapply() results below keep the encoding names
codepages <- setNames(iconvlist(), iconvlist())

Then try to read the data using each of them:

# you get lots of errors/warnings here; try() keeps the loop going
x <- lapply(codepages, function(enc) try(read.table("encoding.asc",
                                                    fileEncoding = enc,
                                                    nrows = 3, header = TRUE,
                                                    sep = "\t")))

The important thing here is to know the structure of your file (separator, headers). Set the encoding using the fileEncoding argument, and read only a few rows.

Now you can look at the results:

unique(do.call(rbind, sapply(x, dim)))
#          [,1] [,2]
# 437        14    2
# CP1200      3   29
# CP12000     0    1

It seems the correct one is the one with 3 rows and 29 columns, so let's see which encodings produced it:

maybe_ok <- sapply(x, function(x) isTRUE(all.equal(dim(x), c(3,29))))
codepages[maybe_ok]
# CP1200 UCS-2LE UTF-16 UTF-16LE UTF16 UTF16LE
# "CP1200" "UCS-2LE" "UTF-16" "UTF-16LE" "UTF16" "UTF16LE"

You can look at the data too:

x[maybe_ok]

For your file, all of these encodings return identical data (partly because there is some redundancy in the list, as you can see: CP1200, UCS-2LE, and UTF-16LE are essentially different names for the same encoding).
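Incidentally, the winners are no surprise: peeking at the file's first bytes would have revealed the byte order mark directly. This quick check is my addition, not part of the original answer:

# read the first few raw bytes and look for a byte order mark
readBin("encoding.asc", what = "raw", n = 4)
# ff fe ...    -> UTF-16LE / UCS-2LE
# fe ff ...    -> UTF-16BE
# ef bb bf ... -> UTF-8 with BOM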

If you don't know the specifics of your file, you need to use readLines with some changes to the workflow (e.g. you can't use fileEncoding, you must use length instead of dim, and you have to do more magic to find the correct ones), as sketched below.
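A minimal sketch of that readLines variant, under the same assumptions about the file name:

x <- lapply(codepages, function(enc) {
  try(readLines(file("encoding.asc", encoding = enc), n = 3), silent = TRUE)
})

# character vectors have no dim(); compare lengths instead, then
# inspect the surviving candidates by eye for readable text
n_lines <- sapply(x, function(r) if (inherits(r, "try-error")) NA else length(r))
x[which(n_lines == 3)]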

How to read a file with unknown encoding (FDF)

The encoding of the file is mixed.

Most of the PDF seems to be in latin1, as the first characters should be "%âãÏÓ". (See: PDF File header sequence: Why '25 e2 e3 cf d3' bits stream used in many document?)

However, the text within the "/V" command is encoded in UTF-16 big endian: the "fe ff" bytes at its start are the byte order mark of that text.

You will probably need to resort to using readBin and converting the bytes to the right encoding. PDFs are horrible to parse.

See this post on how to read files with mixed encodings using readBin: http://stat545.com/block034_useR-encoding-case-study.html. The iconv function may be useful as well for encoding conversion.
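A minimal sketch of that approach; the file name is hypothetical, and a real parser would first need to locate the byte range of each /V (...) value:

# read the whole file as raw bytes so that no encoding is assumed up front
size  <- file.info("form.fdf")$size
bytes <- readBin("form.fdf", what = "raw", n = size)

# suppose start:end spans the bytes of one /V value after its BOM;
# iconv() can convert a list of raw vectors into character strings:
# value <- iconv(list(bytes[start:end]), from = "UTF-16BE", to = "UTF-8")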

Read a CSV file in R with Spanish characters (á, ñ)

Use the encoding argument inside your read.csv call:

religion <- read.csv("religion.csv", header = TRUE, sep = ",", dec = ".",
                     fill = TRUE, comment.char = "", strip.white = TRUE,
                     stringsAsFactors = TRUE, encoding = "UTF-8")
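If UTF-8 doesn't render the accents correctly, the file may be in Latin-1 instead (common for CSVs exported from Excel on Windows). A hedged fallback, under that assumption:

# fileEncoding re-encodes the file as it is read, which is usually the
# safer choice when the file's encoding differs from the native locale
religion <- read.csv("religion.csv", header = TRUE, fileEncoding = "latin1")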

Remember, you can always check the documentation in R using help(), e.g. help(read.csv).


