Detect language from string in PHP
You cannot detect the language from the character type, and there is no foolproof way to do this.
With any method, you're just making an educated guess. There are some math-related articles on the topic out there.
How to detect language of text?
You can figure out whether the characters are from the Arabic, Chinese, or Japanese sections of the Unicode map.
If you look at the list on Wikipedia, you'll see that each of those languages has many sections of the map. But you're not doing translation, so you don't need to worry about every last glyph.
For example, your Chinese text begins (in hex) 0x8FD9 0x662F 0x4E00, and those are all in the "CJK Unified Ideographs" section, which covers Chinese characters (the same block is shared with Japanese kanji and Korean hanja). Here are a few ranges to get you started:
Arabic (0600–06FF)
Japanese
- Hiragana (3040–309F)
- Katakana (30A0–30FF)
- Kanbun (3190–319F)
Chinese
- CJK Unified Ideographs (4E00–9FFF)
(I got the hex for your Chinese by using a Chinese to Unicode Converter.)
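As a rough sketch of the range idea above, the following counts how many characters fall into each listed block and makes a guess from the counts. The function names and the decision rule (any kana means Japanese, ideographs alone mean Chinese) are assumptions for illustration, not part of the original answer, and the block list is deliberately partial.

```php
<?php
// Hypothetical helper: count characters per Unicode block (ranges from the list above).
function count_scripts(string $text): array
{
    $ranges = [
        'arabic'   => '\x{0600}-\x{06FF}',
        'hiragana' => '\x{3040}-\x{309F}',
        'katakana' => '\x{30A0}-\x{30FF}',
        'cjk'      => '\x{4E00}-\x{9FFF}',
    ];
    $counts = [];
    foreach ($ranges as $name => $range) {
        // preg_match_all() returns the number of matches (false on error, cast to 0).
        $counts[$name] = (int) preg_match_all('/[' . $range . ']/u', $text);
    }
    return $counts;
}

// Hypothetical decision rule: an educated guess, nothing more.
function guess_language_by_script(string $text): string
{
    $c = count_scripts($text);
    $kana = $c['hiragana'] + $c['katakana'];
    if ($c['arabic'] > 0 && $c['arabic'] >= $kana + $c['cjk']) {
        return 'arabic';   // Arabic script; could also be Persian, Urdu, etc.
    }
    if ($kana > 0) {
        return 'japanese'; // kana is specific to Japanese
    }
    if ($c['cjk'] > 0) {
        return 'chinese';  // ideographs without kana: probably Chinese
    }
    return 'unknown';
}
```

Calling guess_language_by_script() on text made only of ideographs returns 'chinese', while text containing any kana returns 'japanese'. Short Japanese strings written entirely in kanji would be misclassified, which is the kind of limitation mentioned above.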
how to detect language from string like my php function in python 3
You can probably use a language detection library in Python instead of a regex match. One option is langdetect, a language detection library that currently supports 55 languages.
How to detect language of user input
Use Text_LanguageDetect from PEAR.
Installation:
sudo pear install Text_LanguageDetect
Usage example:
<?php
require_once 'Text/LanguageDetect.php';
$l = new Text_LanguageDetect();
echo "Supported languages:\n";
try {
    $langs = $l->getLanguages();
    sort($langs);
    echo implode(', ', $langs) . "\n\n";
} catch (Text_LanguageDetect_Exception $e) {
    die($e->getMessage());
}
$text = <<<EOD
Hallo! Das ist ein Text in deutscher Sprache.
Mal sehen, ob die Klasse erkennt, welche Sprache das hier ist.
EOD;
try {
    // return 2-letter language codes only
    $l->setNameMode(2);
    // ask for the 4 most likely languages
    $result = $l->detect($text, 4);
    print_r($result);
} catch (Text_LanguageDetect_Exception $e) {
    die($e->getMessage());
}
?>
Output:
Supported languages:
albanian, arabic, azeri, bengali, bulgarian, cebuano, croatian, czech,
danish, dutch, english, estonian, farsi, finnish, french, german, hausa,
hawaiian, hindi, hungarian, icelandic, indonesian, italian, kazakh, kyrgyz,
latin, latvian, lithuanian, macedonian, mongolian, nepali, norwegian, pashto,
pidgin, polish, portuguese, romanian, russian, serbian, slovak, slovene, somali,
spanish, swahili, swedish, tagalog, turkish, ukrainian, urdu, uzbek, vietnamese,
welsh
Array
(
[de] => 0.40703703703704
[nl] => 0.2880658436214
[en] => 0.28333333333333
[da] => 0.23452674897119
)
Note: This package is not maintained.
Another example of a PHP language detector:
crodas/LanguageDetector
Detect Browser Language in PHP
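Why not keep it simple and clean?
<?php
// Fall back to an empty string when the browser sends no Accept-Language header.
$lang = strtolower(substr($_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '', 0, 2));
$acceptLang = ['fr', 'it', 'en'];
// Strict comparison; anything not on the whitelist falls back to English.
$lang = in_array($lang, $acceptLang, true) ? $lang : 'en';
require_once "index_{$lang}.php";
?>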
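Taking only the first two characters ignores the rest of the header, including the quality values (q=...) a browser may send. If that matters, something along these lines could work. It is a minimal sketch: pick_language() is a hypothetical name, and it only looks at the primary subtag (the part before the first "-").

```php
<?php
// Hypothetical helper: choose the best supported language from an
// Accept-Language header value, honoring q-values.
function pick_language(string $header, array $supported, string $default): string
{
    $candidates = [];
    foreach (explode(',', $header) as $i => $part) {
        $pieces = explode(';', trim($part));
        $tag = strtolower(trim($pieces[0]));
        if ($tag === '') {
            continue; // empty header or stray comma
        }
        $q = 1.0; // default quality when no q parameter is given
        foreach (array_slice($pieces, 1) as $param) {
            $param = trim($param);
            if (strncasecmp($param, 'q=', 2) === 0) {
                $q = (float) substr($param, 2);
            }
        }
        // Keep only the primary subtag, e.g. "fr" from "fr-ch".
        $candidates[] = [$q, $i, explode('-', $tag)[0]];
    }
    // Highest q first; ties keep the order used in the header.
    usort($candidates, function ($a, $b) {
        return ($b[0] <=> $a[0]) ?: ($a[1] <=> $b[1]);
    });
    foreach ($candidates as [$q, $i, $lang]) {
        if ($q > 0 && in_array($lang, $supported, true)) {
            return $lang;
        }
    }
    return $default;
}
```

In the snippet above it would be called as pick_language($_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '', ['fr', 'it', 'en'], 'en'). The whitelist still guards the require_once, so only known file names can be included.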
PHP check text is in English language?
If you only have English and Gujarati, why not do it the other way around?
// Character class covering the Gujarati block (U+0A80 to U+0AFF)
if (preg_match('/[\x{0A80}-\x{0AFF}]/u', $Query)) {
    echo 'Gujarati';
} else {
    echo 'English';
}
Basically, if the string contains at least one Gujarati character it is detected as Gujarati; otherwise it is treated as English. Note, however, that 月, ありがとう, élève, etc. will also be considered English.
Have a look at this Unicode chart: https://unicode.org/charts/PDF/U0A80.pdf to define exactly the range that must be taken into account.
Explanations:
- [\x{0A80}-\x{0AFF}] is a character class that matches any character between code points U+0A80 and U+0AFF
- /u enables Unicode (UTF-8) support in the regex
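If the "everything else is English" caveat is a problem, a slightly stricter variant is possible. The sketch below is an assumption on my part rather than something from the answer: classify_text() is a hypothetical name, and it reports 'Other' for any non-ASCII character outside the Gujarati block instead of calling it English.

```php
<?php
// Hypothetical helper: three-way classification instead of two-way.
function classify_text(string $text): string
{
    // At least one character from the Gujarati block (U+0A80 to U+0AFF).
    if (preg_match('/[\x{0A80}-\x{0AFF}]/u', $text) === 1) {
        return 'Gujarati';
    }
    // Any other non-ASCII character: neither Gujarati nor plain English.
    if (preg_match('/[^\x{00}-\x{7F}]/u', $text) === 1) {
        return 'Other';
    }
    // Only ASCII is left (this includes the empty string).
    return 'English';
}
```

With this version 月 and élève come back as 'Other'. Note that ASCII-only text in another language (for example unaccented Spanish) is still reported as 'English', and invalid UTF-8 makes preg_match() fail, which also ends up in the last branch.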
PHP regex: Detect language string from beginning of url
This part ([en|ru]{2}) does not do what you think it does. It is a character class repeated 2 times, each time matching one of the chars e, n, |, r or u.
To prevent getting the empty entry, you could shorten your pattern to a single capturing group with the alternation | and make matching the / and the rest of the line after it optional.
^(en|ru)(?:/.*)?$
- ^ Start of string
- (en|ru) Capture group 1, match either en or ru
- (?:/.*)? Optionally match / and the rest of the string
- $ End of string
See a regex demo and a PHP demo.
$strings = [
    "en/buy-a-ticket",
    "en/tickets/something",
    "en",
    "en/",
    "enhu",
    "en-hu/niecohu",
    "sk/hu-ngary",
    "hu/contact"
];
$regex = "~^(en|ru)(?:/.*)?$~";
foreach ($strings as $s) {
    // Only strings that start with a supported language prefix are printed.
    if (preg_match($regex, $s, $match)) {
        print_r($match);
    }
}
Output
Array
(
[0] => en/buy-a-ticket
[1] => en
)
Array
(
[0] => en/tickets/something
[1] => en
)
Array
(
[0] => en
[1] => en
)
Array
(
[0] => en/
[1] => en
)
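The same pattern can be wrapped in a small function so the list of languages is not hard-coded. This is only a sketch: language_from_path() is a hypothetical name, and the pattern is built the same way as the one above.

```php
<?php
// Hypothetical helper: return the language prefix of a path, or null.
function language_from_path(string $path, array $languages): ?string
{
    if ($languages === []) {
        return null; // an empty alternation would match the empty prefix
    }
    // Escape each code so it is safe inside the ~...~ delimited pattern.
    $alternatives = implode('|', array_map(function ($lang) {
        return preg_quote($lang, '~');
    }, $languages));
    // Same shape as ^(en|ru)(?:/.*)?$ from the answer above.
    $regex = '~^(' . $alternatives . ')(?:/.*)?$~';
    if (preg_match($regex, $path, $match) === 1) {
        return $match[1];
    }
    return null;
}
```

For the sample strings above, language_from_path($s, ['en', 'ru']) gives 'en' for the first four entries and null for the rest, matching the printed output.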
PHP: How do I detect if an input string is Arabic
Hmm, I may offer an improved version of DimaKrasun's function:
function is_arabic($string) {
    if ($string === 'arabic') {
        return true;
    }
    return false;
}
Okay, enough joking!
Pekka's suggestion to use the Google Translate API is a good one, but you would be relying on an external service, which always makes things more complicated.
I think Rushyo's approach is good! It's just not that easy.
I wrote the following function for you. It's not tested, but it should work...
<?php
function uniord($u) {
    // Copied from the php.net comments: returns the code point of a single
    // UTF-8 character (UCS-2, so Basic Multilingual Plane only).
    $k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
    $k1 = ord(substr($k, 0, 1));
    $k2 = ord(substr($k, 1, 1));
    return $k2 * 256 + $k1;
}
function is_arabic($str) {
    $encoding = mb_detect_encoding($str);
    if ($encoding !== false && $encoding !== 'UTF-8') {
        // mb_convert_encoding() takes the target encoding first, then the source
        $str = mb_convert_encoding($str, 'UTF-8', $encoding);
    }
    /*
    $str = str_split($str); <- not multibyte safe: it splits by bytes, not characters, so we cannot use it
    $str = preg_split('//u', $str); <- would probably work fine, but a bug was reported in some PHP versions where it splits by bytes as well
    */
    preg_match_all('/.|\n/u', $str, $matches);
    $chars = $matches[0];
    $arabic_count = 0;
    $latin_count = 0;
    $total_count = 0;
    foreach ($chars as $char) {
        // $pos = ord($char); we can't use that, it only looks at the first byte
        $pos = uniord($char);
        echo $char . " --> " . $pos . PHP_EOL; // debug output
        if ($pos >= 1536 && $pos <= 1791) {
            // 0x0600 to 0x06FF, the Arabic block
            $arabic_count++;
        } else if ($pos > 123 && $pos < 123) {
            // placeholder range that never matches; put a real Latin range here
            $latin_count++;
        }
        $total_count++;
    }
    if ($total_count === 0) {
        // empty input: avoid a division by zero
        return false;
    }
    if (($arabic_count / $total_count) > 0.6) {
        // more than 60% Arabic chars, it's probably Arabic
        return true;
    }
    return false;
}
$arabic = is_arabic('عربية إخبارية تعمل على مدار اليوم. يمكنك مشاهدة بث القناة من خلال الموقع');
var_dump($arabic);
?>
Final thoughts:
As you see, I added a Latin counter as an example. Its range is just a dummy number, but this way you could detect character sets (Hebrew, Latin, Arabic, Hindi, Chinese, etc.).
You may also want to eliminate some chars first, such as @, spaces, line breaks, slashes, etc.
The PREG_SPLIT_NO_EMPTY flag for the preg_split function would be useful, but because of the bug I didn't use it here.
You can also keep a counter for every character set and see which one occurs the most.
Finally, you should consider chopping your string off after 200 chars or so. This should be enough to tell what character set is used.
And you have to do some error handling, like division by zero, empty strings, etc. Don't forget that, please... Any questions? Comment!
If you want to detect the LANGUAGE of a string, you should split it into words and check for them in some pre-defined tables. You don't need a complete dictionary, just the most common words, and it should work fine. Tokenization/normalization is a must as well! There are libraries for that anyway, and this is not what you asked for :) Just wanted to mention it.
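To make the word-table idea a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the COMMON_WORDS tables are tiny and hand-picked, guess_language_by_words() is a hypothetical name, and normalization is limited to lowercasing (it needs the mbstring extension and PHP 7.3+ for array_key_first()).

```php
<?php
// Tiny illustrative word tables; real tables would hold many more words.
const COMMON_WORDS = [
    'en' => ['the', 'and', 'is', 'this', 'of', 'to', 'in', 'that'],
    'de' => ['der', 'die', 'das', 'und', 'ist', 'ein', 'in', 'ob'],
    'fr' => ['le', 'la', 'les', 'et', 'est', 'un', 'une', 'de'],
];

// Hypothetical helper: score each language by how many words hit its table.
function guess_language_by_words(string $text): ?string
{
    // Tokenize: lowercase, then split on anything that is not a letter.
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return null; // no words at all (empty input, digits only, bad UTF-8)
    }
    $scores = [];
    foreach (COMMON_WORDS as $lang => $table) {
        // array_intersect() keeps duplicates, so repeated words count each time.
        $scores[$lang] = count(array_intersect($words, $table));
    }
    arsort($scores); // highest score first
    $best = array_key_first($scores);
    return $scores[$best] > 0 ? $best : null;
}
```

Called with the German sample text used earlier on this page it returns 'de'. Ties between languages are not handled specially, and short inputs will often return null, so treat the result as a hint only.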