How to Determine the Language of a Piece of Text

How to determine the language of a piece of text?

Have you had a look at langdetect?

from langdetect import detect

lang = detect("Ein, zwei, drei, vier")

print(lang)
# output: de
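
If you want the candidate probabilities rather than just the top guess, langdetect also has detect_langs(). Note that the library is non-deterministic unless you seed it:

from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make results reproducible across runs

# candidates come back ranked by probability
for candidate in detect_langs("Ein, zwei, drei, vier"):
    print(candidate.lang, candidate.prob)
# output: something like de 0.9999...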

Python: What is the best way to determine the language of a text based on each letter's frequency?

I think a dictionary structured as {'English': {'a': ..., 'b': ..., ... }, 'French': {...}, ...} makes more sense for two reasons:

  1. You can immediately get a dictionary with the exact same structure as your frequency dictionary for the sample text.

  2. Each language can have different sets of characters.

Once you do this, a good place to start is by calculating the "distance" between your sample frequencies and the frequencies for each language. There are several "distance" metrics, including Manhattan distance and Euclidean distance. Try several of these to get multiple data points for measuring "closeness".
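
To make that concrete, here is a minimal sketch of the Manhattan-distance version in Python. The frequency tables and their values below are made-up placeholders; a real table would cover each language's full character set:

# per-language letter frequencies, structured as suggested above
# (the values here are hypothetical placeholders)
LANG_FREQS = {
    "English": {"a": 0.082, "b": 0.015, "e": 0.127},
    "French":  {"a": 0.076, "b": 0.009, "e": 0.147},
}

def letter_freqs(text):
    """Relative frequency of each letter in the sample text."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return {}
    return {c: letters.count(c) / len(letters) for c in set(letters)}

def manhattan(sample, lang):
    """Sum of absolute differences over the union of both character sets."""
    chars = set(sample) | set(lang)
    return sum(abs(sample.get(c, 0) - lang.get(c, 0)) for c in chars)

def closest_language(text):
    sample = letter_freqs(text)
    return min(LANG_FREQS, key=lambda name: manhattan(sample, LANG_FREQS[name]))

Swapping in Euclidean distance just means summing squared differences instead; if both metrics agree on the nearest language, that is a good sign.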

How do I tell what language a plain-text file is written in?

There is a package called JLangDetect which seems to do exactly what you want:

langof("un texte en français") = fr : OK
langof("a text in english") = en : OK
langof("un texto en español") = es : OK
langof("un texte un peu plus long en français") = fr : OK
langof("a text a little longer in english") = en : OK
langof("a little longer text in english") = en : OK
langof("un texto un poco mas largo en español") = es : OK
langof("J'aime les bisounours !") = fr : OK
langof("Bienvenue à Montmartre !") = fr : OK
langof("Welcome to London !") = en : OK
// ...

Edit: as Kevin pointed out, the Nutch project provides similar functionality in the package org.apache.nutch.analysis.lang.

Recognizing the language of a short text?

I'd use the guess-language project.

Edit: the project is now hosted on Bitbucket.

How to detect the language of a text?

You can figure out whether the characters are from the Arabic, Chinese, or Japanese sections of the Unicode map.

If you look at the list on Wikipedia, you'll see that each of those languages has many sections of the map. But you're not doing translation, so you don't need to worry about every last glyph.

For example, your Chinese text begins (in hex) 0x8FD9 0x662F 0x4E00, and those are all in the "CJK Unified Ideographs" section, which is Chinese. Here are a few ranges to get you started:

Arabic (0600–06FF)

Japanese

  • Hiragana (3040–309F)
  • Katakana (30A0–30FF)
  • Kanbun (3190–319F)

Chinese

  • CJK Unified Ideographs (4E00–9FFF)

(I got the hex for your Chinese by using a Chinese to Unicode Converter.)
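
Here is a small Python sketch of that range check, using exactly the blocks listed above (the script labels are my own grouping):

# Unicode block ranges from the list above, grouped by script
SCRIPT_RANGES = {
    "Arabic":   [(0x0600, 0x06FF)],
    "Japanese": [(0x3040, 0x309F),   # Hiragana
                 (0x30A0, 0x30FF),   # Katakana
                 (0x3190, 0x319F)],  # Kanbun
    "Chinese":  [(0x4E00, 0x9FFF)],  # CJK Unified Ideographs
}

def script_of(ch):
    """Return the script whose range contains this character, or None."""
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None

print(script_of("\u8FD9"))  # Chinese (first character of the sample text)
print(script_of("\u30A2"))  # Japanese (Katakana letter A)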

How to detect the dominant language of a single word?

Simply because a short word like this belongs to too many languages at once; it is unrealistic to guess the language from a single word. Context always helps.

For example:

import NaturalLanguage

let recognizer = NLLanguageRecognizer()
recognizer.processString("Islam")
print(recognizer.dominantLanguage!.rawValue) // force unwrapping for brevity

prints tr, which stands for Turkish. It's an educated guess.

If you want the other languages that were also possible, you could use languageHypotheses(withMaximum:):

let hypotheses = recognizer.languageHypotheses(withMaximum: 10)

for (lang, confidence) in hypotheses.sorted(by: { $0.value > $1.value }) {
    print(lang.rawValue, confidence)
}

Which prints

tr 0.2332388460636139   // Turkish
hr 0.1371040642261505   // Croatian
en 0.12280254065990448  // English
pt 0.08051242679357529  // Portuguese
de 0.06824589520692825  // German
nl 0.05405258387327194  // Dutch
nb 0.050924140959978104 // Norwegian Bokmål
it 0.037797268480062485 // Italian
pl 0.03097432479262352  // Polish
hu 0.0288708433508873   // Hungarian

Now you could define an acceptable threshold of confidence in order to accept that result.
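
For instance, as a language-agnostic sketch in Python (the 0.5 cutoff is an arbitrary choice, not a recommended value):

# hypotheses maps a language code to a confidence, like the output above
def accept(hypotheses, threshold=0.5):
    best_lang, confidence = max(hypotheses.items(), key=lambda kv: kv[1])
    return best_lang if confidence >= threshold else None

print(accept({"tr": 0.233, "hr": 0.137, "en": 0.123}))  # None: nothing is confident enough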


The language codes above are ISO 639-1 identifiers.


