Convert a file encoding using R? (ANSI to UTF-8)
You can use iconv:
writeLines(iconv(readLines("tmp.html"), from = "ANSI_X3.4-1986", to = "UTF8"), "tmp2.html")
tmp2.html should then be UTF-8 encoded.
Edit by Henrik in June 2015:
A working solution for Windows distilled from the comments is as follows:
writeLines(iconv(readLines("tmp.html"), from = "ANSI_X3.4-1986", to = "UTF8"),
file("tmp2.html", encoding="UTF-8"))
Update 2021: If ANSI is the encoding of the current locale, the following works as well (from = "" tells iconv to use the locale's encoding as the source):
writeLines(iconv(readLines("tmp.html"), from = "", to = "UTF8"),
file("tmp2.html", encoding="UTF-8"))
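One way to check that a conversion like the ones above really produced UTF-8 is to look at the bytes on disk. The following is a minimal sketch, not the answer's own method: it assumes the "ANSI" input is Latin-1 (ISO-8859-1), builds a throwaway input with tempfile() instead of touching tmp.html, and writes with useBytes = TRUE rather than an encoded connection. The names src, dst, lines and utf8 are made up for the illustration.

```r
# Build a small Latin-1 file to stand in for an "ANSI" input (assumption).
src <- tempfile(fileext = ".html")
dst <- tempfile(fileext = ".html")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), src)  # "café" plus newline in Latin-1

# Read the lines, convert them explicitly, and write the resulting bytes unchanged.
lines <- readLines(src, warn = FALSE)
utf8 <- iconv(lines, from = "latin1", to = "UTF-8")
writeLines(utf8, dst, useBytes = TRUE)

# Inspect the first bytes of the output: é should now be the pair c3 a9.
readBin(dst, what = "raw", n = 5)
```

If the last call still shows a single e9 byte, the file was not converted.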
How to convert a set of Unicode .txt files to ANSI for text analysis in R
Get all .txt files:
files <- list.files(path = getwd(), pattern = "\\.txt$", full.names = TRUE, recursive = FALSE)
Loop over the files, convert the encoding, and overwrite each file:
# source_encoding and target_encoding are placeholders: set them to the actual
# encodings of your files, e.g. "UTF-8" and "windows-1252"
for (i in seq_along(files)) {
  input <- readLines(files[i])
  converted_input <- iconv(input, from = source_encoding, to = target_encoding)
  writeLines(converted_input, files[i])
}
Possible encodings can be listed with the iconvlist() command.
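The loop body can also be wrapped in a small function and tried on a temporary file before running it over real data. This is only a sketch under assumptions: convert_file is a hypothetical helper name, Latin-1 stands in for "ANSI", and sub = "?" is one possible policy for characters the target encoding cannot represent.

```r
# Hypothetical helper: re-encode one text file in place.
convert_file <- function(path, from, to) {
  input <- readLines(path, warn = FALSE)
  # sub = "?" substitutes anything the target encoding cannot represent,
  # instead of turning the whole line into NA
  converted <- iconv(input, from = from, to = to, sub = "?")
  # useBytes = TRUE writes the converted bytes as they are, without
  # translating them back to the session's native encoding
  writeLines(converted, path, useBytes = TRUE)
  invisible(path)
}

# Demo on a temporary UTF-8 file rather than on real data.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9, 0x0a)), tmp)  # "café" plus newline in UTF-8
convert_file(tmp, from = "UTF-8", to = "latin1")
readBin(tmp, what = "raw", n = 4)  # é should now be the single byte e9
```

Because the function overwrites its input, it is worth keeping a copy of the original files until the output has been checked.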
Can we convert ANSI encoded CSV file to utf-8 encoded file with javascript?
I am sorry, I misread your question at first. Your trouble comes at the moment of reading the file: it is useless to try to convert the contents after they were ruined by loading them in the wrong encoding. I came up with a FileReader API solution in this fiddle, inspired by this example and an article on html5rocks:
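// f is a File object, e.g. taken from an <input type="file"> element
var r = new FileReader();
r.onload = function () { console.log(r.result); }; // r.result holds the decoded text
r.readAsText(f, 'windows-1252');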
The only trouble I see here is that there is no automatic encoding detection. You need to know the encoding before loading the file.
How do I correct the character encoding of a file?
EDIT: A simple possibility to eliminate before getting into more complicated solutions: have you tried setting the character set to utf8 in the text editor in which you're reading the file? This could just be a case of somebody sending you a utf8 file that you're reading in an editor set to say cp1252.
Just taking the two examples, this is a case of utf8 being read through the lens of a single-byte encoding, likely one of iso-8859-1, iso-8859-15, or cp1252. If you can post examples of other problem characters, it should be possible to narrow that down more.
As visual inspection of the characters can be misleading, you'll also need to look at the underlying bytes: the § you see on screen might be either 0xa7 or 0xc2a7, and that will determine the kind of character set conversion you have to do.
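As a small illustration of looking at the underlying bytes, here is a sketch in R (the environment in the question may well be different, and the variable names are made up). It shows the same § character stored once as UTF-8 and once as Latin-1.

```r
# The section sign (U+00A7) as a UTF-8 string.
s_utf8 <- intToUtf8(0xA7)
charToRaw(s_utf8)    # two bytes: c2 a7

# The same character converted to the single-byte Latin-1 encoding.
s_latin1 <- iconv(s_utf8, from = "UTF-8", to = "latin1")
charToRaw(s_latin1)  # one byte: a7
```

Both strings print as § on screen, which is why the bytes, not the glyphs, decide which conversion is needed.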
Can you assume that all of your data has been distorted in exactly the same way - that it's come from the same source and gone through the same sequence of transformations, so that for example there isn't a single é in your text, it's always ç? If so, the problem can be solved with a sequence of character set conversions. If you can be more specific about the environment you're in and the database you're using, somebody here can probably tell you how to perform the appropriate conversion.
Otherwise, if the problem characters are only occurring in some places in your data, you'll have to take it instance by instance, based on assumptions along the lines of "no author intended to put Ã§ in their text, so whenever you see it, replace it with ç". The latter option is riskier, firstly because those assumptions about the intentions of the authors might be wrong, and secondly because you'll have to spot every problem character yourself, which might be impossible if there's too much text to inspect visually or if it's written in a language or writing system that's foreign to you.
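For the first case, where all of the text was distorted in the same way, the "sequence of character set conversions" can be sketched in R as follows. This assumes the damage is UTF-8 bytes that were decoded as Latin-1 and then stored again as UTF-8; if the intermediate encoding was really cp1252 or iso-8859-15, the middle step would need that name instead. The variable names are illustrative only.

```r
# "Ã§" as a UTF-8 string: the two characters U+00C3 and U+00A7.
garbled <- intToUtf8(c(0xC3, 0xA7))

# Step 1: encode back to Latin-1, which recovers the original bytes c3 a7.
bytes <- iconv(garbled, from = "UTF-8", to = "latin1")

# Step 2: declare that those bytes are UTF-8; no bytes change here.
fixed <- bytes
Encoding(fixed) <- "UTF-8"

charToRaw(fixed)  # c3 a7, the UTF-8 encoding of ç
```

Running this over text that was not distorted would corrupt it instead, so it only suits data that went through the same wrong decoding everywhere.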
UTF-8 file output in R
The problem is due to some special behaviour of R on Windows (it uses the default system encoding, or some system write functions; I do not know the specifics, but the behaviour is known).
To write text in UTF-8 encoding on Windows, one has to pass useBytes = TRUE to writeLines and then read the file back with readLines(..., encoding = "UTF-8"):
txt <- "在"
writeLines(txt, "test.txt", useBytes = TRUE)
readLines("test.txt", encoding="UTF-8")
[1] "在"
Find here a really well written article by Kevin Ushey: http://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/ going into much more detail.