How to Translate Arabic/Persian Numbers to English Using Ruby

How to translate Arabic/Persian numbers to english using Ruby?

For these one on one transformations the tr-method is very convenient and fast. It has a mutating counterpart in tr!

#encoding: utf-8

str1 = "١۲١۲"
str2 = "12١۲"
str3 = "some string that contains persian digits like ١۲"

[str1, str2, str3].each{|str| str.tr!('۰١۲۳۴۵۶۷۸۹','0123456789')}

p str1, str2, str3
#"1212"
#"1212"
#"some string that contains persian digits like 12"

String.maketrans for English and Persian numbers

See unidecode library which converts all strings into UTF8. It is very useful in case of number input in different languages.

In Python 2:

>>> from unidecode import unidecode
>>> a = unidecode(u"۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'

In Python 3:

>>> from unidecode import unidecode
>>> a = unidecode("۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'

Localization of numeric number in Rails

Ok, I've come up with that:

  def number to_convert, locale, text = nil,
    to_convert = to_convert.to_i.to_s
    case locale
    when 'ar'
      to_convert = to_convert.unpack('U*').map{ |e| e + 1584 }.pack('U*')
      text ? to_convert + ' ' + text : to_convert
    when 'fa'
      to_convert = to_convert.unpack('U*').map{ |e| e + 1728 }.pack('U*')
      text ? to_convert + ' ' + text : to_convert
    when 'hi'
      to_convert = to_convert.unpack('U*').map{ |e| e + 2358 }.pack('U*')
      text ? to_convert + ' ' + text : to_convert
    else
      text ? to_convert + ' ' + text : to_convert
    end
  end

Other language don't need custom localization. Ie. Chineese/Japanese people understand our number and it will be weird to support their local number as local people are using our number on the web.

Ruby extract arabic text from PDF

The text in this PDF is not properly encoded: the relation between what appears on the screen and what character code it represents is not stored in this PDF. That's why you get 'random' text.

character definitions

Also notable: the text appears in the correct order, but that is because the font characters are drawn mirrored and the text itself is also drawn mirrored:

characters drawn in mirrored shapes

-- a typical hack-ish workaround to properly typeset Arabic using Quark XPress (there used to be an XTension (sp.?) that 'enabled' this).

As it seems this wrong encoding is actually defined as such inside the fonts ("Font uses built-in encoding", according to Acrobat Pro's "Inventory" function), you might be able to find a translation table between the characters you are reading and what they actually should be. Be aware that these tables may very well differ for each of the fonts in this document, so you have to check what font each of your text strings is using.

Addition

I did some further investigations, and they agree with your own, and Acrobat Pro's, findings. Your sample text appears like this:

/F1 1 Tf        % set font and size "HGKECF+PHBagdad"
...
[ (´Mb ) -24.4 (¢'b¥b ) -24.4 («®{05}d«ØU¢Nr, ) -24.4 (Ë«ù´öÂ ) -24.4 (°LDU{03}&Nr.) ] TJ

Usually, font entries in a PDF contain a table that 'translates' into actual character codes. That is also true for this font (and all others):

<<
  /Type     /Font
  /Subtype  /Type1
  /BaseFont     /HGKECF+PHBagdad
  /Encoding     66 0 R
  /ToUnicode    69 0 R
>>

(only relevant entries listed). The /Encoding entry points to a simple array of index > character codes list, and /ToUnicode to a more formal table, which essentially contains the same. Both lists translate to the same text.

As you can see in the top image, the font contains Arabic glyphs (mirrored), but the code linked to these glyphs is not the correct one for Arabic. It's like the old "Symbol" font hack: type 'a' to get an alpha, 'b' for a beta, 'g' for a gamma: text on your screen appears to be "ɑβɣ" but in truth it says "abg".

Addition 2

See also this Adobe Forum thread: Arabic - ToUnicode Map incorrect?

Quote:

Arabic XT fonts are not Arabic fonts from the operating system point of view (MacOS or Windows). They use the Mac Roman encoding; the Arabic glyphs are placed in place of the Roman glyphs.

I tried to find a "correcting" encoding for your fonts but have this far not been successful. If I could locate a translation table, it ought to be possible to exchange the existing /ToUnicode table with a corrected one, and you'd get the correct text when extracting. (Although it may be simpler to use the same table to change the text strings after extraction in your programming language of choice.)

Regular Expression For Arabic Language

Try this:-

function HasArabicCharacters(text)
{
    var arregex = /[\u0600-\u06FF]/;
    alert(arregex.test(text));
}

Arabic character set of list

[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufc3f]|[\ufe70-\ufefc]

Arabic script in Unicode:

As of Unicode 6.1, the Arabic script is contained in the following blocks:

Arabic (0600—06FF, 225 characters)
Arabic Supplement (0750—077F, 48 characters)
Arabic Extended-A (08A0—08FF, 39 characters)
Arabic Presentation Forms-A (FB50—FDFF, 608 characters)
Arabic Presentation Forms-B (FE70—FEFF, 140 characters)
Rumi Numeral Symbols (10E60—10E7F, 31 characters)
Arabic Mathematical Alphabetic Symbols (1EE00—1EEFF, 143 characters)

Contents are taken from wikipedia - Arabic script in Unicode

Pitfalls when performing internationalization / localization with numbers?

In the big right-to-left scripts - Arabic, Hebrew and Thaana - numbers always run left to right. (When I say "Arabic", I refer to all the languages that are written in the Arabic script - Arabic, Farsi, Urdu, Pasto and many others.)

Hebrew and Thaana always use European digits, the same 0-9 set as English. There's nothing much to do there, because Unicode automatically takes care of ordering the numbers correctly. But see the comments about isolation below.

It's possible to use European digits in Arabic, too; for example, the Arabic Wikipedia uses them. However, very frequently Arabic texts use a different set of digits - https://en.wikipedia.org/wiki/Eastern_Arabic_numerals . It depends on your users' preferences. Notice also, that in the Persian language the digits are slightly different. From the point of view of right-to-left layout they behave pretty much the same way as European digits, although there are slight differences in the behavior of mathematical signs - for example, the minus can go on the other side. There are some subtleties here, but they are mostly edge cases.

In both Hebrew and Arabic you may run into a problem with bidi-isolation. For example, if you have a Hebrew paragraph in which you have an English word, and after the word you have numbers, the numbers will appear to the right of the word, although you may have wanted them to appear on the left. That's how the Unicode bidi algorithm works by default. To resolve such things you can use the Unicode control characters RLM and LRM. If you are using HTML5, you can also use the <bdi> tag for this, as well as the CSS rule "unicode-bidi: isolate". These CSS and HTML5 solutions are quite powerful and elegant, but aren't supported in all browsers yet.

I am aware of one script in which the digits run right-to-left: N'Ko, which is used for some languages of Africa. I actually saw websites written in it, but it is far less common than Hebrew and Arabic.

Finally, if you're using JavaScript, you can use the free jquery.i18n library for automatic number conversion. See https://github.com/wikimedia/jquery.i18n . (Disclaimer: I am one of this library's developers.)

How to Translate Arabic/Persian Numbers to English Using Ruby