R Corpus Is Messing Up My UTF-8 Encoded Text
Well, there seems to be good news and bad news.
The good news is that the data appears to be fine even if it doesn't display correctly with inspect(). Try looking at:

content(corp[[2]])
# [1] "Складское помещение, 345 м²"
The reason it looks funny in inspect() is that the authors changed the way the print.PlainTextDocument function works. It formerly would cat the value to the screen. Now, however, they feed the data through writeLines(). That function uses the locale of the system (viewable with Sys.getlocale()) to format the characters/bytes in the document. It turns out Linux and OS X have a proper "UTF-8" encoding, but Windows uses language-specific code pages. So if the characters aren't in the code page, they get escaped or translated to funny characters. This means it should work just fine on a Mac, but not on a PC.
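You can observe this locale dependence yourself with a small sketch (illustrative only; the exact output depends on your OS and locale settings, and the comments reflect the behaviour described above):

```r
# Illustrative: how the same UTF-8 string fares under cat() vs writeLines()
Sys.getlocale()                     # on Windows often a code-page locale,
                                    # e.g. "English_United States.1252"
x <- "Складское помещение, 345 м²"
cat(x, "\n")                        # old print method: passes the value through
writeLines(x)                       # new print method: re-encodes for the
                                    # native locale, so characters missing
                                    # from a Windows code page get escaped
```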
Try going a step further and building a DocumentTermMatrix
dtm <- DocumentTermMatrix(corp)
Terms(dtm)
Hopefully you will see (as I do) the words correctly displayed.
If you like, this article about writing UTF-8 files on Windows has some more information about this OS-specific issue. I see no easy way to get writeLines() to output UTF-8 to stdout() on Windows. I'm not sure why the package maintainers changed the print method, but one might ask them or submit a feature request to change it back.
How to properly encode UTF-8 txt files for R topic model
I found a workaround that seems to work correctly on the 2 example files that you supplied. What you need to do first is apply NFKD (compatibility decomposition). This splits the "fi" orthographic ligature into f and i. Luckily the stringi package can handle this. So before doing all the special text cleaning, you need to apply the function stringi::stri_trans_nfkd. You can do this in the preprocessing step just after (or before) the tolower step.
Do read the documentation for this function and the references.
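To see what the NFKD step does in isolation, here is a small self-contained example (the word itself is just an illustration):

```r
library(stringi)

# "ef\ufb01cient" contains the single-character "fi" ligature (U+FB01)
s <- "ef\ufb01cient"
nchar(s)                        # 8 characters: the ligature counts as one
# NFKD compatibility decomposition replaces the ligature with plain "f" + "i"
stri_trans_nfkd(s)              # "efficient"
nchar(stri_trans_nfkd(s))       # 9 characters
```

After this transformation, ordinary string matching and tokenization treat the word like any other ASCII text.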
library(tm)
docs <- VCorpus(DirSource(directory = inputdir, encoding = "UTF-8"))
#Preprocessing
docs <- tm_map(docs, content_transformer(tolower))
# use stringi to fix all the orthographic ligature issues
docs <- tm_map(docs, content_transformer(stringi::stri_trans_nfkd))
toSpace <- content_transformer(function(x, pattern) (gsub(pattern, " ", x)))
# add following line as well to remove special quotes.
# this uses a replace from textclean to replace the weird quotes
# which later get removed with removePunctuation
docs <- tm_map(docs, content_transformer(textclean::replace_curly_quote))
# ... rest of the process ...
R tm package: utf-8 text
The problem comes from the default tokenizer. tm by default uses scan_tokenizer, which loses the encoding (maybe you should contact the maintainer to add an encoding argument). Its definition is:
scan_tokenizer
function (x) {
    scan(text = x, what = "character", quote = "", quiet = TRUE)
}
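You can reproduce the encoding loss directly (illustrative; on a UTF-8 locale such as most Linux/macOS systems the tokens may come back intact, so the comments describe the Windows behaviour reported above):

```r
x <- "арман біз еді"
Encoding(x) <- "UTF-8"
tokens <- scan(text = x, what = "character", quote = "", quiet = TRUE)
tokens           # on a non-UTF-8 Windows locale these may print as
                 # escapes like "<U+0430>..." instead of Cyrillic
Encoding(tokens) # the "UTF-8" mark can be dropped in the round trip
```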
One solution is to provide your own tokenizer when creating the term matrix. I am using strsplit:
scanner <- function(x) strsplit(x, " ")
ap.tdm <- TermDocumentMatrix(ap.corpus, control = list(tokenize = scanner))
Then you get the result well encoded:
findFreqTerms(ap.tdm, lowfreq=2)
[1] "арман" "біз" "еді" "әлем" "идеясы" "мәңгілік"
set encoding for reading text files into tm Corpora
From the "C:" it's clear you are using Windows, which assumes a Windows-1252 encoding (on most systems) rather than UTF-8. You could try reading the files in as character and then setting Encoding(myCharVector) <- "UTF-8". If the input encoding was UTF-8, this should cause your system to recognise and display the UTF-8 characters properly.
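A minimal sketch of that suggestion (the file path here is hypothetical; substitute one of your own documents):

```r
# Hypothetical path: adjust to one of your own text files
raw <- readLines("C:/Users/john/Documents/texts/doc1.txt")
Encoding(raw)            # likely "unknown": the bytes were read but not
                         # declared as UTF-8
Encoding(raw) <- "UTF-8" # mark the existing bytes as UTF-8
head(raw)                # should now display the characters properly
```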
Alternatively this will work, although it also makes tm unnecessary:
require(quanteda)
docs <- corpus(textfile("C:/Users/john/Documents/texts/*.txt", encoding = "UTF-8"))
Then you can see the texts using for example:
cat(texts(docs)[1:2])
They should have the encoding bit set and display properly. Then if you prefer, you can get these into tm using:
docsTM <- Corpus(VectorSource(texts(docs)))
UTF-8 Character Encoding with TermDocumentMatrix
I've managed to duplicate your issue and made changes to get Turkish output. Try changing the line
wordCorpus <- Corpus(VectorSource(tweetsg.df$text))
to
wordCorpus <- Corpus(DataframeSource(data.frame(tweetsg.df$text)))
and adding a line similar to this:
Encoding(tweetsg.df$text) <- "UTF-8"
The code I got to work was
library(tm)
sampleTurkish <- "değiştirdik değiştirdik değiştirdik"
Encoding(sampleTurkish) <- "UTF-8"
#looks good. no encoding problems.
wordCorpus <- Corpus(DataframeSource(data.frame(sampleTurkish)))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
#wordCorpus looks fine at this point.
tdm <- TermDocumentMatrix(wordCorpus)
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 1)
df <- data.frame(term = names(term.freq), freq = term.freq)
print(findFreqTerms(tdm, lowfreq=2))
This only worked with a source() command from the console, i.e. clicking the Run or Source button in RStudio didn't work. I also made sure I chose "Save with Encoding" > "UTF-8" (although this is probably only necessary because I have Turkish text).
> source("Turkish.R")
[1] "değiştirdik"
It was the second answer R tm package: utf-8 text that was useful in the end.