How to Detect the Language of a String

How to detect the language of a string?

If your code has internet access, you can try the Google API for language detection.
http://code.google.com/apis/ajaxlanguage/documentation/

var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
    if (!result.error) {
        var language = 'unknown';
        // Map the returned language code back to its name in google.language.Languages.
        for (var l in google.language.Languages) {
            if (google.language.Languages[l] == result.language) {
                language = l;
                break;
            }
        }
        var container = document.getElementById("detection");
        container.innerHTML = text + " is: " + language;
    }
});

And, since you are using C#, take a look at this article on how to call the API from C#.

UPDATE:
That C# link is gone; here's a cached copy of the core of it:

string s = TextBoxTranslateEnglishToHebrew.Text;
string key = "YOUR GOOGLE AJAX API KEY";
GoogleLangaugeDetector detector =
    new GoogleLangaugeDetector(s, VERSION.ONE_POINT_ZERO, key);

GoogleTranslator gTranslator = new GoogleTranslator(s, VERSION.ONE_POINT_ZERO,
    detector.LanguageDetected.Equals("iw") ? LANGUAGE.HEBREW : LANGUAGE.ENGLISH,
    detector.LanguageDetected.Equals("iw") ? LANGUAGE.ENGLISH : LANGUAGE.HEBREW,
    key);

TextBoxTranslation.Text = gTranslator.Translation;

Basically, you need to create a URI and send it to Google that looks like:

http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=hello%20world&langpair=en%7ciw&key=your_google_api_key_goes_here

This tells the API that you want to translate "hello world" from English to Hebrew, to which Google's JSON response would look like:

{"responseData": {"translatedText":"שלום העולם"}, "responseDetails": null, "responseStatus": 200}
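As a rough illustration of the request and response shapes shown above, here is a minimal Python sketch. It only builds the URL and parses a response body; it does not send anything, and the old AJAX endpoint is used purely to show the format (it may no longer respond). The helper names build_translate_url and extract_translation are hypothetical, not part of any Google library.

```python
import json
from urllib.parse import quote, urlencode

# Endpoint copied from the example URL above; shown for its shape only.
BASE_URL = "http://ajax.googleapis.com/ajax/services/language/translate"


def build_translate_url(text, source, target, key):
    """Return the request URL with every query parameter percent-encoded."""
    params = {"v": "1.0", "q": text, "langpair": f"{source}|{target}", "key": key}
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


def extract_translation(body):
    """Pull translatedText out of a response shaped like the JSON above."""
    data = json.loads(body)
    if data.get("responseStatus") != 200:
        raise ValueError(f"request failed: {data.get('responseDetails')}")
    return data["responseData"]["translatedText"]


url = build_translate_url("hello world", "en", "iw", "your_google_api_key_goes_here")
print(url)

sample = '{"responseData": {"translatedText":"שלום העולם"}, "responseDetails": null, "responseStatus": 200}'
print(extract_translation(sample))
```

Encoding each parameter separately matters because the text to translate may itself contain characters such as "&" or "=" that would otherwise corrupt the query string.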

I chose to make a base class that represents a typical Google JSON response:

[Serializable]
public class JSONResponse
{
    public string responseDetails = null;
    public string responseStatus = null;
}

Then, a Translation object that inherits from this class:

[Serializable]
public class Translation: JSONResponse
{
    public TranslationResponseData responseData =
        new TranslationResponseData();
}

This Translation class has a TranslationResponseData object that looks like this:

[Serializable]
public class TranslationResponseData
{
    public string translatedText;
}

Finally, we can make the GoogleTranslator class:

using System;
using System.Collections.Generic;
using System.Text;

using System.Web;
using System.Net;
using System.IO;
using System.Runtime.Serialization.Json;

namespace GoogleTranslationAPI
{
    public class GoogleTranslator
    {
        private string _q = "";
        private string _v = "";
        private string _key = "";
        private string _langPair = "";
        private string _requestUrl = "";
        private string _translation = "";

        public GoogleTranslator(string queryTerm, VERSION version, LANGUAGE languageFrom,
            LANGUAGE languageTo, string key)
        {
            // EscapeDataString also escapes '&', '=' and '#', which UrlPathEncode leaves alone.
            _q = Uri.EscapeDataString(queryTerm);
            _v = HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(version));
            _langPair =
                HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(languageFrom) +
                "|" + EnumStringUtil.GetStringValue(languageTo));
            _key = HttpUtility.UrlEncode(key);

            string encodedRequestUrlFragment =
                string.Format("?v={0}&q={1}&langpair={2}&key={3}",
                _v, _q, _langPair, _key);

            _requestUrl = EnumStringUtil.GetStringValue(BASEURL.TRANSLATE) + encodedRequestUrlFragment;

            GetTranslation();
        }

        public string Translation
        {
            get { return _translation; }
            private set { _translation = value; }
        }

        private void GetTranslation()
        {
            try
            {
                WebRequest request = WebRequest.Create(_requestUrl);

                // Dispose the response and reader, and read the whole body
                // rather than only its first line.
                using (WebResponse response = request.GetResponse())
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    string json = reader.ReadToEnd();
                    using (MemoryStream ms = new MemoryStream(Encoding.Unicode.GetBytes(json)))
                    {
                        DataContractJsonSerializer ser =
                            new DataContractJsonSerializer(typeof(Translation));
                        Translation translation = ser.ReadObject(ms) as Translation;

                        _translation = translation.responseData.translatedText;
                    }
                }
            }
            catch (Exception)
            {
                // Best effort: on any failure, Translation stays empty.
            }
        }
    }
}

How to determine the language of a piece of text?

Have you had a look at langdetect?

from langdetect import detect

lang = detect("Ein, zwei, drei, vier")

print(lang)
# output: de

Detect language from string in PHP

You cannot detect the language from the character type alone, and there is no foolproof way to do this.

With any method, you are just making an educated guess. There are some math-related articles available on the subject.
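To make the "educated guess" point concrete, here is a toy Python sketch that scores a string by how many of its words appear in small per-language stopword lists. The word lists and the guess_language name are made up for illustration; this is not how any particular library works, and real detectors rely on much larger statistical models.

```python
import re

# Tiny, hand-picked stopword lists; purely illustrative, far too small for real use.
STOPWORDS = {
    "en": {"the", "is", "and", "of", "to", "in", "it", "where"},
    "es": {"el", "la", "es", "y", "de", "que", "en", "está", "dónde"},
    "de": {"der", "die", "das", "ist", "und", "ein", "nicht", "wo"},
}


def guess_language(text):
    """Return (language, score) for the best stopword overlap, or (None, 0)."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(1 for word in words if word in stopwords)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        # No known word matched: refuse to guess instead of picking arbitrarily.
        return None, 0
    return best, scores[best]


print(guess_language("¿Dónde está el baño?"))
print(guess_language("Where is the library?"))
print(guess_language("000747322"))
```

Returning the score alongside the guess lets the caller decide how much to trust it; a short string with a single matching word is weak evidence either way.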

Detect a language of a string in node.js

You can use the languagedetect node.js library to detect the language of the string.

However, since your requirement is to send the message based on the user's language, it is better to let users select their preferred language, or to use JavaScript to read the browser's language setting from navigator.language.
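If the choice has to be made on the server instead, browsers typically send the same preference in the Accept-Language request header. That header is an assumption here (the answer itself only mentions navigator.language), and the sketch below is in Python rather than Node.js; the function names are hypothetical. It parses the header by hand and picks the first preferred language the application supports.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language value, highest q first."""
    weighted = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without an explicit q value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # ignore tags with a malformed weight
        weighted.append((tag.strip().lower(), quality))
    # The sort is stable, so tags with equal q keep their header order.
    weighted.sort(key=lambda item: item[1], reverse=True)
    return [tag for tag, quality in weighted if quality > 0]


def pick_language(header, supported, default="en"):
    """Pick the first preferred language whose primary subtag is supported."""
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0]
        if primary in supported:
            return primary
    return default


print(pick_language("es-ES,es;q=0.9,en;q=0.8", {"en", "es"}))
```

A declared preference like this is usually more reliable than detecting the language of a short message, which is the point the answer is making.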

How to detect language of user entered text?

Here are two options

  • LanguageIdentifier
  • Rosette Language Identifier

How to detect text (string) language in iOS?


Latest versions (iOS 12+)

Briefly:

You could achieve it by using NLLanguageRecognizer, as:

import NaturalLanguage

func detectedLanguage(for string: String) -> String? {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(string)
    guard let languageCode = recognizer.dominantLanguage?.rawValue else { return nil }
    let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode)
    return detectedLanguage
}

Older versions (iOS 11+)

Briefly:

You could achieve it by using NSLinguisticTagger, as:

import Foundation

func detectedLanguage<T: StringProtocol>(for string: T) -> String? {
    // NSLinguisticTagger returns a BCP-47 language code such as "en", or nil.
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(string)) else { return nil }
    let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode)
    return detectedLanguage
}

Details:

First of all, be aware that what you are asking about belongs mainly to the world of Natural Language Processing (NLP).

Since NLP covers much more than language detection, the rest of this answer will not go into NLP specifics.

Implementing this kind of functionality is not easy, especially once you start caring about the details of the process, such as splitting text into sentences and even into words, and then recognising names, punctuation, and so on. You might well think, "What a painful process! It does not even make sense to do it myself." Fortunately, iOS supports NLP (in fact, the NLP APIs are available on all Apple platforms, not only iOS), which makes what you are aiming for easy to implement. The core component that you would work with is NSLinguisticTagger:

Analyze natural language text to tag part of speech and lexical class,
identify names, perform lemmatization, and determine the language and
script.

NSLinguisticTagger provides a uniform interface to a variety of
natural language processing functionality with support for many
different languages and scripts. You can use this class to segment
natural language text into paragraphs, sentences, or words, and tag
information about those segments, such as part of speech, lexical
class, lemma, script, and language.

As mentioned in the class documentation, the method that you are looking for (under the "Determining the Dominant Language and Orthography" section) is dominantLanguage(for:):

Returns the dominant language for the specified string.

.

.

Return Value

The BCP-47 tag identifying the dominant language of the string, or the
tag "und" if a specific language cannot be determined.

You might notice that NSLinguisticTagger has existed since iOS 5. However, the dominantLanguage(for:) method is only supported on iOS 11 and above, because it was developed on top of the Core ML framework:

. . .

Core ML is the foundation for domain-specific frameworks and
functionality. Core ML supports Vision for image analysis, Foundation
for natural language processing (for example, the NSLinguisticTagger
class), and GameplayKit for evaluating learned decision trees. Core ML
itself builds on top of low-level primitives like Accelerate and BNNS,
as well as Metal Performance Shaders.


The value returned from calling dominantLanguage(for:) with "The quick brown fox jumps over the lazy dog":

NSLinguisticTagger.dominantLanguage(for: "The quick brown fox jumps over the lazy dog")

would be the optional string "en". However, that is not yet the desired output; the expectation is to get "English" instead. That is exactly what you get by calling the localizedString(forIdentifier:) method of the Locale structure and passing the language code you obtained:

Locale.current.localizedString(forIdentifier: "en") // English

Putting it all together:

As shown in the snippet under "Briefly", the function would be:

func detectedLanguage<T: StringProtocol>(_ forString: T) -> String? {
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(forString)) else {
        return nil
    }

    let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode)

    return detectedLanguage
}

Output:

It would be as expected:

let englishDetectedLanguage = detectedLanguage(textEN) // => English
let spanishDetectedLanguage = detectedLanguage(textES) // => Spanish
let arabicDetectedLanguage = detectedLanguage(textAR) // => Arabic
let germanDetectedLanguage = detectedLanguage(textDE) // => German

Note that:

There are still cases where you do not get a language name for a given string, such as:

let textUND = "SdsOE"
let undefinedDetectedLanguage = detectedLanguage(textUND) // => Unknown language

Or it could even be nil:

let rubbish = "000747322"
let rubbishDetectedLanguage = detectedLanguage(rubbish) // => nil

I still find this a reasonable result for providing useful output.


Furthermore:

About NSLinguisticTagger:

Although I am not going to dive deep into NSLinguisticTagger usage, I would like to note that it has a couple of really useful features beyond simply detecting the language of a given text. As a simple example, using the lemma when enumerating tags is helpful when working on information retrieval, since you would be able to recognize the word "driving" when searching for "drive".

Official Resources

Apple Video Sessions:

  • For more about Natural Language Processing and how NSLinguisticTagger works: Natural Language Processing and your Apps.

Also, to get familiar with Core ML:

  • Introducing Core ML.
  • Core ML in depth.

