How to Detect Whether a Character Belongs to a Right to Left Language

How to detect whether a character belongs to a Right To Left language?

Unicode characters have different properties associated with them. These properties cannot be derived from the code point; you need a table that tells you if a character has a certain property or not.

You are interested in characters with bidirectional property "R" or "AL" (RandALCat).

A RandALCat character is a character with unambiguously right-to-left directionality.
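
If Java is an option, the lookup table described above ships with the runtime, so no separate data file is needed. The following is a minimal sketch rather than part of the original answer: the class name `BidiClassCheck` is made up for illustration, and the result reflects whichever Unicode version your JDK implements, not Unicode 3.2.

```java
final class BidiClassCheck {
    /** True if the code point has bidirectional class R or AL in the JDK's Unicode tables. */
    static boolean isRandALCat(int codePoint) {
        byte d = Character.getDirectionality(codePoint);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    public static void main(String[] args) {
        System.out.println(isRandALCat(0x05D0)); // HEBREW LETTER ALEF, class R
        System.out.println(isRandALCat(0x0627)); // ARABIC LETTER ALEF, class AL
        System.out.println(isRandALCat('A'));    // LATIN CAPITAL LETTER A, class L
    }
}
```

`DIRECTIONALITY_RIGHT_TO_LEFT` corresponds to "R" and `DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC` to "AL".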

Here's the complete list as of Unicode 3.2 (from RFC 3454):


D. Bidirectional tables

D.1 Characters with bidirectional property "R" or "AL"

----- Start Table D.1 -----
05BE
05C0
05C3
05D0-05EA
05F0-05F4
061B
061F
0621-063A
0640-064A
066D-066F
0671-06D5
06DD
06E5-06E6
06FA-06FE
0700-070D
0710
0712-072C
0780-07A5
07B1
200F
FB1D
FB1F-FB28
FB2A-FB36
FB38-FB3C
FB3E
FB40-FB41
FB43-FB44
FB46-FBB1
FBD3-FD3D
FD50-FD8F
FD92-FDC7
FDF0-FDFC
FE70-FE74
FE76-FEFC
----- End Table D.1 -----

Here's some code to get the complete list as of Unicode 6.0:

using System;
using System.Globalization;
using System.Linq;
using System.Net;

var url = "http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt";

// Field 4 of each semicolon-separated record holds the bidirectional class.
var query = from record in new WebClient().DownloadString(url).Split('\n')
            where !string.IsNullOrEmpty(record)
            let properties = record.Split(';')
            where properties[4] == "R" || properties[4] == "AL"
            select int.Parse(properties[0], NumberStyles.AllowHexSpecifier);

foreach (var codepoint in query)
{
    Console.WriteLine(codepoint.ToString("X4"));
}
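
To get output in the same range notation as Table D.1, consecutive code points can be collapsed into ranges. Here is a rough Java sketch of that idea. The class and method names are hypothetical, and the sample input is a few records in the UnicodeData.txt format rather than a live download.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class UnicodeDataRanges {
    /** Returns the code points whose bidirectional class (field index 4) is R or AL. */
    static List<Integer> randALCat(List<String> records) {
        List<Integer> result = new ArrayList<>();
        for (String record : records) {
            if (record.trim().isEmpty()) continue;          // skip blank lines
            String[] fields = record.split(";", -1);        // keep empty fields
            if (fields.length > 4 && (fields[4].equals("R") || fields[4].equals("AL"))) {
                result.add(Integer.parseInt(fields[0], 16));
            }
        }
        return result;
    }

    /** Collapses an ascending list of code points into "XXXX" or "XXXX-YYYY" entries. */
    static List<String> toRanges(List<Integer> sortedCodePoints) {
        List<String> ranges = new ArrayList<>();
        int i = 0;
        while (i < sortedCodePoints.size()) {
            int start = sortedCodePoints.get(i);
            int end = start;
            // Extend the range while the next code point is adjacent.
            while (i + 1 < sortedCodePoints.size() && sortedCodePoints.get(i + 1) == end + 1) {
                end = sortedCodePoints.get(++i);
            }
            ranges.add(start == end
                ? String.format("%04X", start)
                : String.format("%04X-%04X", start, end));
            i++;
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
            "05D0;HEBREW LETTER ALEF;Lo;0;R;;;;;N;;;;;",
            "05D1;HEBREW LETTER BET;Lo;0;R;;;;;N;;;;;",
            "0627;ARABIC LETTER ALEF;Lo;0;AL;;;;;N;;;;;");
        for (String range : toRanges(randALCat(sample))) {
            System.out.println(range);
        }
    }
}
```

This assumes the records arrive in ascending code point order, which is how UnicodeData.txt is laid out.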

Note that these values are Unicode code points. Strings in C#/.NET are UTF-16 encoded and need to be converted to Unicode code points first (see Char.ConvertToUtf32). Here's a method that checks if a string contains at least one RandALCat character:

static bool IsAnyCharacterRightToLeft(string s)
{
    // Advance by two UTF-16 code units when the current position holds a surrogate pair.
    for (var i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    {
        var codepoint = char.ConvertToUtf32(s, i);
        if (IsRandALCat(codepoint))
        {
            return true;
        }
    }
    return false;
}
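
The method above calls an `IsRandALCat` helper that is not shown. One way such a helper could look, sketched in Java, is a binary search over the ranges of Table D.1. The class name and layout are assumptions made for illustration, and the data covers Unicode 3.2 only.

```java
final class RandALCatTable {
    // {first, last} pairs copied from Table D.1 (Unicode 3.2), in ascending order.
    private static final int[][] RANGES = {
        {0x05BE, 0x05BE}, {0x05C0, 0x05C0}, {0x05C3, 0x05C3}, {0x05D0, 0x05EA},
        {0x05F0, 0x05F4}, {0x061B, 0x061B}, {0x061F, 0x061F}, {0x0621, 0x063A},
        {0x0640, 0x064A}, {0x066D, 0x066F}, {0x0671, 0x06D5}, {0x06DD, 0x06DD},
        {0x06E5, 0x06E6}, {0x06FA, 0x06FE}, {0x0700, 0x070D}, {0x0710, 0x0710},
        {0x0712, 0x072C}, {0x0780, 0x07A5}, {0x07B1, 0x07B1}, {0x200F, 0x200F},
        {0xFB1D, 0xFB1D}, {0xFB1F, 0xFB28}, {0xFB2A, 0xFB36}, {0xFB38, 0xFB3C},
        {0xFB3E, 0xFB3E}, {0xFB40, 0xFB41}, {0xFB43, 0xFB44}, {0xFB46, 0xFBB1},
        {0xFBD3, 0xFD3D}, {0xFD50, 0xFD8F}, {0xFD92, 0xFDC7}, {0xFDF0, 0xFDFC},
        {0xFE70, 0xFE74}, {0xFE76, 0xFEFC},
    };

    /** Binary search for the range containing codePoint, if any. */
    static boolean isRandALCat(int codePoint) {
        int lo = 0;
        int hi = RANGES.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (codePoint < RANGES[mid][0]) {
                hi = mid - 1;          // look in the lower half
            } else if (codePoint > RANGES[mid][1]) {
                lo = mid + 1;          // look in the upper half
            } else {
                return true;           // inside RANGES[mid]
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isRandALCat(0x05D0)); // inside 05D0-05EA
        System.out.println(isRandALCat(0x0041)); // Latin letter, not in the table
    }
}
```

A table generated from a newer UnicodeData.txt can be dropped into `RANGES` unchanged, as long as it stays sorted and non-overlapping.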

How to detect if a string contains any Right-to-Left character?

I came up with the following code:

char[] chars = s.toCharArray();
for (char c : chars) {
    if (c >= 0x600 && c <= 0x6ff) {
        // Text contains an RTL character
        break;
    }
}

This is neither efficient nor accurate (the range U+0600 to U+06FF covers only the Arabic block, so Hebrew and other RTL scripts are missed), but it may give you some ideas.
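
A hard-coded range can be avoided by asking the JDK for each code point's directionality. The sketch below is one possible variant, not the original poster's code: `RtlDetector` and `containsRtl` are illustrative names. It iterates over code points rather than `char` values, so characters outside the Basic Multilingual Plane are handled as well.

```java
final class RtlDetector {
    /** True if any code point in s has bidirectional class R or AL. */
    static boolean containsRtl(String s) {
        return s.codePoints().anyMatch(cp -> {
            byte d = Character.getDirectionality(cp);
            return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
        });
    }

    public static void main(String[] args) {
        System.out.println(containsRtl("Hello"));                    // Latin only
        System.out.println(containsRtl("\u05E9\u05DC\u05D5\u05DD")); // Hebrew, outside 0x600-0x6ff
        System.out.println(containsRtl("\uD802\uDF00"));             // U+10B00 AVESTAN LETTER A
    }
}
```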

JavaScript: how to check if character is RTL?

Thanks for your comments, but it seems I've done this myself:

function is_script_rtl(t) {
    var d, s1, s2, bodies;

    // If the browser doesn't support this, it probably doesn't support Unicode 5.2
    if (!("getBoundingClientRect" in document.documentElement))
        return false;

    // Set up a testing DIV
    d = document.createElement('div');
    d.style.position = 'absolute';
    d.style.visibility = 'hidden';
    d.style.width = 'auto';
    d.style.height = 'auto';
    d.style.fontSize = '10px';
    d.style.fontFamily = "'Ahuramzda'";
    d.appendChild(document.createTextNode(t));

    s1 = document.createElement("span");
    s1.appendChild(document.createTextNode(t));
    d.appendChild(s1);

    s2 = document.createElement("span");
    s2.appendChild(document.createTextNode(t));
    d.appendChild(s2);

    d.appendChild(document.createTextNode(t));

    bodies = document.getElementsByTagName('body');
    if (bodies.length) {
        var body, r1, r2;

        body = bodies[0];
        body.appendChild(d);
        r1 = s1.getBoundingClientRect();
        r2 = s2.getBoundingClientRect();
        body.removeChild(d);

        // In RTL text the first span is laid out to the right of the second one
        return r1.left > r2.left;
    }

    return false;
}

Example usage:

Avestan is <script>document.write(is_script_rtl('') ? "RTL" : "LTR")</script>,
Arabic is <script>document.write(is_script_rtl('العربية') ? "RTL" : "LTR")</script>,
English is <script>document.write(is_script_rtl('English') ? "RTL" : "LTR")</script>.

It seems to work. :)

Detect string direction from string in PHP

This is actually more tricky than it should be. Each Unicode character has information which tells us if it is a RTL or LTR character, but I don't see a way of reading this information in PHP - instead you need to look up this information in a table of the Unicode characters.

I've put together a rather inefficient solution below, but I would suggest looking at this PHP implementation of Stringprep if you need something more robust. This library also checks the validity of strings, e.g. it can enforce rules such as "no mix of RTL and LTR characters in the same string". However, it is designed for preparing strings for use in internet protocols rather than for ordinary text, so the restrictions it imposes might get in the way if all you want is to check the text direction.

Thanks to this StackOverflow answer for information about where to get the Unicode data and how to interpret it.

First, we can create a file containing just the characters whose bidirectional property is "R" or "AL" (RandALCat); this property is stored in the 5th field of the Unicode data. The following command fetches the data from that URL, drops characters that do not have AL or R in the 5th field, pads the resultant hex codes to 6 characters, and saves them in a file called RandALCat.txt.

curl http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt |  \
egrep -e "([^;]*;){4}(AL|R);.*" | \
awk -F";" '{ printf("%06s\n", $1) }' > RandALCat.txt

We can then use this file in a function which tests each character in a string against it:

<?php

function isRTL($testString) {

    // One hex code point per line, as written by the command above
    $RandALCat = file('RandALCat.txt', FILE_IGNORE_NEW_LINES);
    $codePoints = unpack('V*', iconv('UTF-8', 'UTF-32LE', $testString));

    foreach ($codePoints as $codePoint) {
        $hexCode = strtoupper(str_pad(dechex($codePoint), 6, '0', STR_PAD_LEFT));
        // in_array() instead of array_search(): a match at index 0 would otherwise be treated as false
        if (in_array($hexCode, $RandALCat, true)) {
            return true;
        }
    }

    return false;

}

$englishText = 'Hello';
$arabicText = 'السلام عليكم';

var_dump(isRTL($englishText));
var_dump(isRTL($arabicText));

If you save this as test.php or something then run it, you should see this output:

$ php -q test.php
bool(false)
bool(true)

Is there any way to detect an RTL language in .NET?

Use the TextInfo.IsRightToLeft property of a CultureInfo instance, for example CultureInfo.CurrentCulture.TextInfo.IsRightToLeft.

In which order are RTL language (Hebrew, Arabic, etc.) strings stored in memory?

According to "How to detect whether a character belongs to a Right To Left language?", it seems they are stored in logical order (the first character read comes first in memory), and it's the character codes that dictate whether the text is in an RTL language.

How to handle right to left language

RTL reading is only a matter of presentation. In memory (and that is what counts for the ANTLR4 lexer) the characters are stored in increasing memory address order, just like for any other language. ANTLR4 is now fully Unicode aware, and you should be able to write your rules in any language that is supported by Unicode (both the grammar rule names and the lexer content).
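
The storage order is easy to inspect directly. The following Java sketch (class name and sample word chosen purely for illustration) dumps the UTF-16 code units of the Hebrew word "shalom" by index. Index 0 holds the first letter a reader would read, even though that letter is drawn at the right-hand edge of the word.

```java
final class LogicalOrderDemo {
    /** Lists each UTF-16 code unit of s with its index, in storage order. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%d:U+%04X", i, (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // shin, lamed, vav, final mem
        System.out.println(dump("\u05E9\u05DC\u05D5\u05DD"));
        // prints: 0:U+05E9 1:U+05DC 2:U+05D5 3:U+05DD
    }
}
```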


