R Corpus Is Messing Up My UTF-8 Encoded Text

Well, there seems to be good news and bad news.

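The good news is that the data appears to be fine even if it doesn't display correctly with inspect(). Try looking at the content of a single document directly (for example, by calling content() on one element of corp):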

# [1] "Складское помещение, 345 м²"

The reason it looks funny in inspect() is that the authors changed the way the print.PlainTextDocument function works. It formerly would cat the value to the screen. Now, however, they feed the data through writeLines(). This function uses the locale of the system to format the characters/bytes in the document (this can be viewed with Sys.getlocale()). It turns out Linux and OS X have a proper "UTF-8" encoding, but Windows uses language-specific code pages. So if the characters aren't in the code page, they get escaped or translated to funny characters. This means this should work just fine on a Mac, but not on a PC.
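To check that only the display is affected and not the data, it can help to look at the string's encoding mark, code points, and bytes, none of which go through writeLines(). The following is a minimal base-R sketch; the variable names are made up for illustration, and the sample is the first word of the output shown above.

```r
# First word of the sample output ("Складское"), written with \u escapes so
# the script does not depend on the encoding of the source file.
x <- "\u0421\u043a\u043b\u0430\u0434\u0441\u043a\u043e\u0435"

Sys.getlocale("LC_CTYPE")  # the locale that writeLines() formats for
l10n_info()[["UTF-8"]]     # TRUE in a UTF-8 locale; typically FALSE on older Windows setups

Encoding(x)                      # the encoding mark carried by the string
code_points <- utf8ToInt(x)      # Unicode code points, independent of display
n_bytes <- length(charToRaw(x))  # each Cyrillic letter takes 2 bytes in UTF-8
```

If the code points and byte count are what you expect, the text itself is intact and only the printing step is locale-dependent.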

Try going a step further and building a DocumentTermMatrix:

dtm <- DocumentTermMatrix(corp)

Hopefully you will see (as I do) the words correctly displayed when you list its terms, for example with Terms(dtm).

If you like, this article about writing UTF-8 files on Windows has some more information about this OS-specific issue. I see no easy way to get writeLines to output UTF-8 to stdout() on Windows. I'm not sure why the package maintainers changed the print method, but one might ask them or submit a feature request to change it back.

How to properly encode UTF-8 txt files for R topic model

I found a workaround that seems to work correctly on the two example files that you supplied. What you need to do first is NFKD (Compatibility Decomposition). This splits the "ﬁ" orthographic ligature (U+FB01) into f and i. Luckily the stringi package can handle this. So before doing all the special text cleaning, you need to apply the function stringi::stri_trans_nfkd. You can do this in the preprocessing step just after (or before) the tolower step.

Do read the documentation for this function and the references.

docs <- VCorpus(DirSource(directory = inputdir, encoding = "UTF-8"))

docs <- tm_map(docs, content_transformer(tolower))

# use stringi to fix all the orthographic ligature issues
docs <- tm_map(docs, content_transformer(stringi::stri_trans_nfkd))

toSpace <- content_transformer(function(x, pattern) (gsub(pattern, " ", x)))

# add following line as well to remove special quotes.
# this uses a replace from textclean to replace the weird quotes
# which later get removed with removePunctuation
docs <- tm_map(docs, content_transformer(textclean::replace_curly_quote))

# rest of process
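To see what the NFKD step does in isolation, here is a small sketch outside of tm. It assumes the stringi package is installed; the sample strings and variable names are made up for illustration. The gsub() call at the end is only a base-R stand-in for the curly-quote replacement, not the implementation of textclean::replace_curly_quote.

```r
library(stringi)  # assumption: stringi is installed

# "ﬁle" spelled with the single ligature character U+FB01 followed by "le".
ligature_word <- "\ufb01le"
decomposed <- stri_trans_nfkd(ligature_word)

nchar(ligature_word)  # 3 characters before decomposition
nchar(decomposed)     # 4 characters after: "f", "i", "l", "e"

# NFKD leaves curly quotes untouched, so they need a separate replacement.
quoted <- "\u201cquoted\u201d"
straightened <- gsub("[\u201c\u201d]", "\"", quoted)
```

Note that NFKD also decomposes accented letters into a base letter plus combining marks, which is one more reason to read the documentation before applying it to a whole corpus.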

R tm package: utf-8 text

The problem comes from the default tokenizer. By default tm uses scan_tokenizer, which loses the encoding (maybe you should contact the maintainer to add an encoding argument).

scan_tokenizer <- function(x) {
  scan(text = x, what = "character", quote = "", quiet = TRUE)
}

One solution is to provide your own tokenizer to create the matrix terms. I am using strsplit:

scanner <- function(x) strsplit(x, " ")
ap.tdm <- TermDocumentMatrix(ap.corpus, control = list(tokenize = scanner))

Then you get the result well encoded:

findFreqTerms(ap.tdm, lowfreq=2)
[1] "арман" "біз" "еді" "әлем" "идеясы" "мәңгілік"
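The tokenizer idea can be tried outside of tm as well. Below is a slightly more defensive variant of the strsplit approach, as a base-R sketch only: utf8_tokenizer is a hypothetical name, and converting with enc2utf8() and dropping empty tokens are my additions, not part of the answer above.

```r
# Hypothetical tokenizer: split on runs of whitespace and keep tokens in UTF-8.
utf8_tokenizer <- function(x) {
  x <- enc2utf8(as.character(x))  # make sure the input is UTF-8
  tokens <- unlist(strsplit(x, "[[:space:]]+", perl = TRUE))
  tokens[nzchar(tokens)]          # drop empty strings from leading whitespace
}

# Two of the terms from the output above, written with \u escapes:
# "арман" and "біз".
arman <- "\u0430\u0440\u043c\u0430\u043d"
biz <- "\u0431\u0456\u0437"
doc <- paste(arman, biz, "", arman)  # note the double space in the middle

tokens <- utf8_tokenizer(doc)
counts <- table(tokens)
```

A function like this could be passed as control = list(tokenize = utf8_tokenizer) in the same way as scanner, although I have only sketched it against plain character strings here.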

set encoding for reading text files into tm Corpora

From the "C:" it's clear you are using Windows, which assumes a Windows-1252 encoding (on most systems) rather than UTF-8. You could try reading the files in as character and then setting Encoding(myCharVector) <- "UTF-8". If the input encoding was UTF-8 this should cause your system to recognise and display the UTF-8 characters properly.
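As a rough illustration of the "read as character, then set the encoding" suggestion, here is a base-R sketch. It writes its own small UTF-8 file to a temporary location so that it is self-contained; the sample text and variable names are made up, and your own files would take the place of tmp.

```r
# Create a small UTF-8 encoded file to read back (hypothetical sample text).
sample_text <- "caf\u00e9 na\u00efve"
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeBin(charToRaw(enc2utf8(sample_text)), con)  # write the raw UTF-8 bytes
close(con)

# Read the lines as-is, then declare that the bytes are UTF-8.
lines <- readLines(tmp, warn = FALSE)
Encoding(lines) <- "UTF-8"

Encoding(lines)  # the encoding mark after the assignment
nchar(lines)     # characters, not bytes
```

Setting Encoding() does not convert anything; it only tells R how to interpret the bytes that are already there, so it helps only when the file really is UTF-8.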

Alternatively, this will work using the quanteda package, although it also makes tm unnecessary:

docs <- corpus(textfile("C:/Users/john/Documents/texts/*.txt", encoding = "UTF-8"))

Then you can see the texts, for example by printing the result of texts(docs).


They should have the encoding bit set and display properly. Then if you prefer, you can get these into tm using:

docsTM <- Corpus(VectorSource(texts(docs)))

UTF-8 Character Encoding with TermDocumentMatrix

I've managed to duplicate your issue, and to make changes that give Turkish output. Try changing the line

wordCorpus <- Corpus(VectorSource(tweetsg.df$text))

to

wordCorpus <- Corpus(DataframeSource(data.frame(tweetsg.df$text)))

and adding a line similar to this.

Encoding(tweetsg.df$text) <- "UTF-8"

The code I got to work was

sampleTurkish <- "değiştirdik değiştirdik değiştirdik"
Encoding(sampleTurkish) <- "UTF-8"
#looks good. no encoding problems.
wordCorpus <- Corpus(DataframeSource(data.frame(sampleTurkish)))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
#wordCorpus looks fine at this point.
tdm <- TermDocumentMatrix(wordCorpus)
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 1)
df <- data.frame(term = names(term.freq), freq = term.freq)

print(findFreqTerms(tdm, lowfreq=2))

This only worked with a source command from the console, i.e. clicking the Run or Source button in RStudio didn't work. I also made sure I chose "Save with Encoding" and "UTF-8" (although this is probably only necessary because I have Turkish text).

> source("Turkish.R")
[1] "değiştirdik"
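One way to make the encoding explicit when sourcing is the encoding argument of source(). The following is a minimal sketch, assuming the script is saved as UTF-8; the temporary script, the environment, and the variable names are made up for illustration, and this is not the Turkish.R file from above.

```r
# Write a one-line script as UTF-8 bytes; it assigns the sample word
# "değiştirdik" (given here with \u escapes) to a variable called word.
script <- tempfile(fileext = ".R")
con <- file(script, open = "wb")
writeBin(charToRaw(enc2utf8("word <- \"de\u011fi\u015ftirdik\"\n")), con)
close(con)

# Source it into a separate environment, stating the file encoding explicitly.
env <- new.env()
source(script, local = env, encoding = "UTF-8")

word <- env$word
nchar(word)  # number of characters in the sourced string
```

Whether this also fixes the behaviour of the Run and Source buttons in RStudio is something I have not verified.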

It was the second answer to "R tm package: utf-8 text" that was useful in the end.

