Detect text language in R
The textcat package does this. It can detect 74 "languages" (more properly, language/encoding combinations), and more with other extensions. Details and examples are in this freely available article:
Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52, 1-17.
Here's the abstract:
Identifying the language used will typically be the first step in most
natural language processing tasks. Among the wide variety of language
identification methods discussed in the literature, the ones employing
the Cavnar and Trenkle (1994) approach to text categorization based on
character n-gram frequencies have been particularly successful. This
paper presents the R extension package textcat for n-gram based text
categorization which implements both the Cavnar and Trenkle approach
as well as a reduced n-gram approach designed to remove redundancies
of the original approach. A multi-lingual corpus obtained from the
Wikipedia pages available on a selection of topics is used to
illustrate the functionality of the package and the performance of the
provided language identification methods.
And here's one of their examples:
library("textcat")
textcat(c(
"This is an English sentence.",
"Das ist ein deutscher Satz.",
"Esta es una frase en espa~nol."))
[1] "english" "german" "spanish"
Detecting and Retrieving text from a column based on language model in R
gl_translate_detect() may not be vectorized, so we can apply it one row at a time with rowwise():
library(dplyr)
df %>%
  rowwise() %>%
  mutate(out = tryCatch(gl_translate_detect(text),
                        error = function(e) NA_character_))
Or use lapply() to loop over each element of the 'text' column and apply the function:
lapply(df$text, gl_translate_detect)
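Since gl_translate_detect() needs Google Cloud credentials, here is a self-contained sketch of the same rowwise()/tryCatch() pattern. It substitutes textcat() as the detector purely so the example runs offline; swap gl_translate_detect() back in for real use.

```r
library(dplyr)
library(textcat)

# Toy data frame standing in for your data.
df <- tibble::tibble(text = c("This is an English sentence.",
                              "Das ist ein deutscher Satz.",
                              "Esta es una frase en espa~nol."))

out <- df %>%
  rowwise() %>%
  mutate(lang = tryCatch(as.character(textcat(text)),
                         # any failure for a single row becomes NA
                         # instead of aborting the whole mutate()
                         error = function(e) NA_character_)) %>%
  ungroup()
out
```

The tryCatch() wrapper is what makes the row-by-row approach robust: one malformed row yields NA rather than killing the whole pipeline.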
Natural language identification and assign as like en, fr, tr
Your best shot is probably the cldr package, which uses Chrome's language detection library.
library(devtools)
install_github("aykutfirat/cldr")
library(cldr)
docs1 <- c(
"Detects the language of a set of documents with possible input hints. Returns the top 3 candidate languages and their probabilities as well.",
"Som nevnt på møte forrige uke er det ulike ting som skjer denne og neste uke.",
"Ganz besonders wollen wir, dass forthin allenthalben in unseren Städten, Märkten und auf dem Lande zu keinem Bier mehr Stücke als allein Gersten, Hopfen und Wasser verwendet und gebraucht werden sollen.",
"Роман Гёте «Вильгельм Майстер» заложил основы воспитательного романа эпохи Просвещения.")
detectLanguage(docs1)$detectedLanguage
# [1] "ENGLISH" "NORWEGIAN" "GERMAN" "RUSSIAN"
However, your examples seem to be a bit too short.
docs2 <- c("I am a musician", "я инженер", "Je suis un poète")
detectLanguage(docs2)$detectedLanguage
# [1] "Unknown" "Unknown" "Unknown"
As noted by Ben, textcat seems to perform better on the shorter examples given by gulnerman, but unlike cldr it doesn't indicate how reliable the matches are. That makes it hard to say how much to trust the results, even though two out of three were correct in this case.
library(textcat)
textcat(docs2)
# [1] "latin" "russian-iso8859_5" "french"
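If you want a detector that is maintained on CRAN (cldr is GitHub-only), the cld2 package wraps Google's Compact Language Detector 2, the successor of the library behind cldr. This is a sketch assuming the CRAN cld2 package; detect_language() is vectorized, so no looping is needed, and it returns NA rather than guessing when the text is too short:

```r
library(cld2)

docs2 <- c("I am a musician", "я инженер", "Je suis un poète")
detect_language(docs2)                     # ISO 639-1 codes such as "en"; NA when unsure
detect_language(docs2, lang_code = FALSE)  # full language names instead of codes
```

Returning NA for uncertain input addresses the reliability concern above: an explicit "don't know" is more useful than a confident wrong label.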
Language detection in R with the textcat package : how to restrict to a few languages?
This might work. Presumably you wish to restrict the language choices to English or French to reduce the misclassification rate. Without example text for which the desired result is known I cannot test the approach below. However, it does seem to restrict the language choices to English and French.
my.profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% c("english", "french")]
my.profiles
my.text <- c("This is an English sentence.",
"Das ist ein deutscher Satz.",
"Il s'agit d'une phrase française.",
"Esta es una frase en espa~nol.")
textcat(my.text, p = my.profiles)
# [1] "english" "english" "french" "french"
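To see which profile names are valid choices for the %in% filter, inspect the names of the profile database that ships with textcat (a quick sketch; the exact set of names depends on your textcat version):

```r
library(textcat)

# All built-in byte profiles; "english" and "french" are among them.
sort(names(TC_byte_profiles))

# Keep only the two languages we want to distinguish.
my.profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% c("english", "french")]
length(my.profiles)
```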
Text preprocessing in a different language
Both German and Greek are found in the stemming and stopword language lists, so both should be easy to apply in quanteda.
library("quanteda")
## Package version: 3.2.0.9000
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
txt_german <- "Wie kann ich eine natürliche Sprachverarbeitung für Texte in anderen Sprachen durchführen?"
txt_greek <- "Πώς μπορώ να πραγματοποιήσω επεξεργασία φυσικής γλώσσας σε κείμενα σε άλλες γλώσσες;"
tokens(txt_german, remove_punct = TRUE) %>%
  tokens_remove(stopwords("de")) %>%
  tokens_wordstem(language = "de")
## Tokens consisting of 1 document.
## text1 :
## [1] "natur" "Sprachverarbeit" "Text" "Sprach"
## [5] "durchfuhr"
tokens(txt_greek, remove_punct = TRUE) %>%
  tokens_remove(stopwords("el", source = "stopwords-iso")) %>%
  tokens_wordstem(language = "el")  # Greek stemming requires SnowballC >= 0.7.0