How to Detect Language of User Entered Text

How to detect language of user entered text?

Here are two options

  • LanguageIdentifier
  • Rosette Language Identifier

How to detect language of text?

You can figure out whether the characters are from the Arabic, Chinese, or Japanese sections of the Unicode map.

If you look at the list on Wikipedia, you'll see that each of those languages has many sections of the map. But you're not doing translation, so you don't need to worry about every last glyph.

For example, your Chinese text begins (in hex) 0x8FD9 0x662F 0x4E00 - and those are all in the "CJK Unified Ideographs" section, which is Chinese. Here are a few ranges to get you started:

Arabic (0600–06FF)

Japanese

  • Hiragana (3040–309F)
  • Katakana (30A0–30FF)
  • Kanbun (3190–319F)

Chinese

  • CJK Unified Ideographs (4E00–9FFF)

(I got the hex for your Chinese by using a Chinese to Unicode Converter.)

How to determine the language of a piece of text?

Have you had a look at langdetect?

from langdetect import detect

lang = detect("Ein, zwei, drei, vier")

print lang
#output: de

detect user input language javascript

https://stackoverflow.com/a/14824756/104380

I had come up with a new, much shorter solution:

function isRTL(s){           
var ltrChars = 'A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8\u0300-\u0590\u0800-\u1FFF'+'\u2C00-\uFB1C\uFDFE-\uFE6F\uFEFD-\uFFFF',
rtlChars = '\u0591-\u07FF\uFB1D-\uFDFD\uFE70-\uFEFC',
rtlDirCheck = new RegExp('^[^'+ltrChars+']*['+rtlChars+']');

return rtlDirCheck.test(s);
};

playground page

How to detect language

Depending on what you're doing, you might want to check out the python Natural Language Processing Toolkit (NLTK), which has some support for Bayesian Learning Algorithms.

In general, the letter and word frequencies would probably be the fastest evaluation, but the NLTK (or a bayesian learning algorithm in general) will probably be useful if you need to do anything beyond identification of the language. Bayesian methods will probably be useful also if you discover the first two methods have too high of an error rate.

How to detect the language of a string?

If the context of your code have internet access, you can try to use the Google API for language detection.
http://code.google.com/apis/ajaxlanguage/documentation/

var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
if (!result.error) {
var language = 'unknown';
for (l in google.language.Languages) {
if (google.language.Languages[l] == result.language) {
language = l;
break;
}
}
var container = document.getElementById("detection");
container.innerHTML = text + " is: " + language + "";
}
});

And, since you are using c#, take a look at this article on how to call the API from c#.

UPDATE:
That c# link is gone, here's a cached copy of the core of it:

string s = TextBoxTranslateEnglishToHebrew.Text;
string key = "YOUR GOOGLE AJAX API KEY";
GoogleLangaugeDetector detector =
new GoogleLangaugeDetector(s, VERSION.ONE_POINT_ZERO, key);

GoogleTranslator gTranslator = new GoogleTranslator(s, VERSION.ONE_POINT_ZERO,
detector.LanguageDetected.Equals("iw") ? LANGUAGE.HEBREW : LANGUAGE.ENGLISH,
detector.LanguageDetected.Equals("iw") ? LANGUAGE.ENGLISH : LANGUAGE.HEBREW,
key);

TextBoxTranslation.Text = gTranslator.Translation;

Basically, you need to create a URI and send it to Google that looks like:

http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=hello%20worled&langpair=en%7ciw&key=your_google_api_key_goes_here

This tells the API that you want to translate "hello world" from English to Hebrew, to which Google's JSON response would look like:

{"responseData": {"translatedText":"שלום העולם"}, "responseDetails": null, "responseStatus": 200}

I chose to make a base class that represents a typical Google JSON response:

[Serializable]
public class JSONResponse
{
public string responseDetails = null;
public string responseStatus = null;
}

Then, a Translation object that inherits from this class:

[Serializable]
public class Translation: JSONResponse
{
public TranslationResponseData responseData =
new TranslationResponseData();
}

This Translation class has a TranslationResponseData object that looks like this:

[Serializable]
public class TranslationResponseData
{
public string translatedText;
}

Finally, we can make the GoogleTranslator class:

using System;
using System.Collections.Generic;
using System.Text;

using System.Web;
using System.Net;
using System.IO;
using System.Runtime.Serialization.Json;

namespace GoogleTranslationAPI
{

public class GoogleTranslator
{
private string _q = "";
private string _v = "";
private string _key = "";
private string _langPair = "";
private string _requestUrl = "";
private string _translation = "";

public GoogleTranslator(string queryTerm, VERSION version, LANGUAGE languageFrom,
LANGUAGE languageTo, string key)
{
_q = HttpUtility.UrlPathEncode(queryTerm);
_v = HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(version));
_langPair =
HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(languageFrom) +
"|" + EnumStringUtil.GetStringValue(languageTo));
_key = HttpUtility.UrlEncode(key);

string encodedRequestUrlFragment =
string.Format("?v={0}&q={1}&langpair={2}&key={3}",
_v, _q, _langPair, _key);

_requestUrl = EnumStringUtil.GetStringValue(BASEURL.TRANSLATE) + encodedRequestUrlFragment;

GetTranslation();
}

public string Translation
{
get { return _translation; }
private set { _translation = value; }
}

private void GetTranslation()
{
try
{
WebRequest request = WebRequest.Create(_requestUrl);
WebResponse response = request.GetResponse();

StreamReader reader = new StreamReader(response.GetResponseStream());
string json = reader.ReadLine();
using (MemoryStream ms = new MemoryStream(Encoding.Unicode.GetBytes(json)))
{
DataContractJsonSerializer ser =
new DataContractJsonSerializer(typeof(Translation));
Translation translation = ser.ReadObject(ms) as Translation;

_translation = translation.responseData.translatedText;
}
}
catch (Exception) { }
}
}
}


Related Topics



Leave a reply



Submit