Detect Text Language in R

Detect text language in R

The textcat package does this. It can detect 74 'languages' (more properly, language/encoding combinations), and more with other extensions. Details and examples are in this freely available article:

Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. (2013). The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52(6), 1-17.

Here's the abstract:

Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.

And here's one of their examples:

library("textcat")
textcat(c(
  "This is an English sentence.",
  "Das ist ein deutscher Satz.",
  "Esta es una frase en español."))
# [1] "english" "german"  "spanish"

Detecting and Retrieving text from a column based on language model in R

gl_translate_detect() may not be vectorized, so we can apply it row by row with rowwise():

library(dplyr)
library(googleLanguageR)  # provides gl_translate_detect()

df %>%
  rowwise() %>%
  mutate(out = tryCatch(gl_translate_detect(text),
                        error = function(e) NA_character_))

Or use lapply() to loop over each element of the 'text' column and apply the function:

lapply(df$text, gl_translate_detect)
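
Since lapply() returns a list, one way to get the detections back into a single table (a sketch, assuming each successful gl_translate_detect() call returns a one-row data frame) is to wrap the call in tryCatch() and row-bind the results:

library(dplyr)
library(googleLanguageR)

# return NULL for failed calls so bind_rows() simply skips them
res <- lapply(df$text, function(x) {
  tryCatch(gl_translate_detect(x), error = function(e) NULL)
})
detections <- bind_rows(res)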

Natural language identification and assigning codes like en, fr, tr

Your best shot is probably cldr, which wraps the Compact Language Detector from Google's Chromium project.

library(devtools)
install_github("aykutfirat/cldr")

library(cldr)

docs1 <- c(
  "Detects the language of a set of documents with possible input hints. Returns the top 3 candidate languages and their probabilities as well.",
  "Som nevnt på møte forrige uke er det ulike ting som skjer denne og neste uke.",
  "Ganz besonders wollen wir, dass forthin allenthalben in unseren Städten, Märkten und auf dem Lande zu keinem Bier mehr Stücke als allein Gersten, Hopfen und Wasser verwendet und gebraucht werden sollen.",
  "Роман Гёте «Вильгельм Майстер» заложил основы воспитательного романа эпохи Просвещения.")

detectLanguage(docs1)$detectedLanguage
# [1] "ENGLISH" "NORWEGIAN" "GERMAN" "RUSSIAN"

However, your examples seem to be a bit too short.

docs2 <- c("I am a musician", "я инженер", "Je suis un poète")

detectLanguage(docs2)$detectedLanguage
# [1] "Unknown" "Unknown" "Unknown"

As noted by Ben, textcat seems to perform better on the short examples given by gulnerman, but unlike cldr it doesn't indicate how reliable the matches are. This makes it difficult to say how much you can trust the results, even though two out of three were correct in this case.

library(textcat)
textcat(docs2)
# [1] "latin" "russian-iso8859_5" "french"

Language detection in R with the textcat package : how to restrict to a few languages?

This might work. Presumably you wish to restrict the language choices to English or French to reduce the misclassification rate. Without example text for which the desired result is known, I cannot test the approach below, but it does seem to restrict the language choices to English and French.

my.profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% c("english", "french")]
my.profiles

my.text <- c("This is an English sentence.",
             "Das ist ein deutscher Satz.",
             "Il s'agit d'une phrase française.",
             "Esta es una frase en español.")

textcat(my.text, p = my.profiles)

# [1] "english" "english" "french" "french"

Text preprocessing in a different language

Both German and Greek are found in the stemming and stopword language lists, so both should be easy to apply in quanteda (for Greek, the stopword list comes from a non-default source such as "stopwords-iso").

library("quanteda")
## Package version: 3.2.0.9000
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

txt_german <- "Wie kann ich eine natürliche Sprachverarbeitung für Texte in anderen Sprachen durchführen?"
txt_greek <- "Πώς μπορώ να πραγματοποιήσω επεξεργασία φυσικής γλώσσας σε κείμενα σε άλλες γλώσσες;"

tokens(txt_german, remove_punct = TRUE) %>%
  tokens_remove(stopwords("de")) %>%
  tokens_wordstem(language = "de")
## Tokens consisting of 1 document.
## text1 :
## [1] "natur"           "Sprachverarbeit" "Text"            "Sprach"
## [5] "durchfuhr"

tokens(txt_greek, remove_punct = TRUE) %>%
  # Greek is not in the default "snowball" stopword source, so an
  # explicit source such as "stopwords-iso" is needed here
  tokens_remove(stopwords("el", source = "stopwords-iso")) %>%
  tokens_wordstem(language = "el")
## Tokens consisting of 1 document.
## text1 :
## (Greek function words such as "να" and "σε" are removed and the
## remaining tokens are stemmed by the Greek Snowball stemmer)
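
Tying this back to the detection question above, here is a minimal end-to-end sketch (assuming textcat for the detection step and a hand-written lookup from textcat's profile names to the ISO codes that stopwords() expects):

library(textcat)
library(quanteda)

docs <- c("This is an English sentence.",
          "Das ist ein deutscher Satz.")

# hypothetical lookup from textcat profile names to ISO-639 codes;
# extend it for the languages present in your data
iso <- c(english = "en", german = "de", french = "fr")

langs <- iso[textcat(docs)]

# remove stopwords and stem each document in its detected language
mapply(function(txt, lang) {
  tokens(txt, remove_punct = TRUE) %>%
    tokens_remove(stopwords(lang)) %>%
    tokens_wordstem(language = lang)
}, docs, langs, SIMPLIFY = FALSE)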

