Multilanguage UTF-8 Website with Arabic

Reading http://www.w3.org/International/tutorials/bidi-xhtml/ and http://en.wikipedia.org/wiki/Internationalization_and_localization could be useful.

Some things I can think of:

  • your choice of colors and images could prove offensive or in bad taste in some countries
  • every image with text should be translated (image and alt); every image with directionality should be mirrored (e.g. an arrow)
  • try to avoid class names like class="left" if you don't want future headaches. Top, bottom, before or after are OK I think, but not left/right.
  • you'll have to check every CSS declaration involving text-align, background-position, float, clear and obviously left and right with position: absolute/relative;
  • different fonts need different font sizes (though this problem mainly concerns Asian fonts)
  • as for any other supported language, many bits of text in templates will have to be translated.

By "css selector to switch text alignment", do you mean dir="rtl"? This is an HTML attribute. But you'll still need a class (it'll be fine on the body element) to act like a giant switch for your design needs. Like

.en .yourclass { background: url(images/en/bg.jpg) } 
.ar .yourclass { background: url(images/ar/bg.jpg) }

edit: the :lang() pseudo-class or an attribute selector would do the same, but then older versions of IE don't fully support them ...

:lang(ar) .yourclass { background: url(images/ar/bg.jpg) }
or
[lang|="ar"] .yourclass { background: url(images/ar/bg.jpg) }
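To tie the dir attribute and the body class together, here is a minimal PHP sketch that emits both from a language code. The $languages map and the page_attributes() helper are hypothetical names invented for this illustration, and the map only contains the two languages discussed here; treat it as configuration to extend, not as a complete list of RTL languages.

```php
<?php
// Hypothetical per-language settings (illustrative only, not exhaustive).
$languages = [
    'en' => ['dir' => 'ltr'],
    'ar' => ['dir' => 'rtl'],
];

// Hypothetical helper: builds the attribute strings for <html> and <body>.
function page_attributes(string $lang, array $languages): array
{
    // Fall back to English if the requested language is not configured.
    if (!isset($languages[$lang])) {
        $lang = 'en';
    }
    return [
        'html' => sprintf('lang="%s" dir="%s"', $lang, $languages[$lang]['dir']),
        'body' => sprintf('class="%s"', $lang),
    ];
}

$attrs = page_attributes('ar', $languages);
echo "<html {$attrs['html']}>\n<body {$attrs['body']}>\n";
```

Because the values come from the keys of a trusted configuration array rather than from user input, the sketch does not escape them; anything taken from the request should go through htmlspecialchars() first.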

Can't write in Arabic in an HTML page with UTF-8 encoding on a remote server

Add this line to your script, someplace before it produces any output:

header("Content-Type: text/html; charset=utf-8");

Currently the website emits the following header, which tells browsers to decode the page as ISO-8859-1 and so garbles the Arabic text:

Content-Type: text/html; charset=iso-8859-1

How do I get both Arabic and English text to show in a WordPress blog post?

Looks like a database charset problem. Here are a few things you can try:

Easiest

Check in Settings > Reading that you've got encoding set to UTF-8.

Easy

Open up your wp-config.php and find this line:

 define('DB_CHARSET', 'utf8');

Change it to:

 define('DB_CHARSET', '');

PITA

Convert your database charset to UTF-8, as explained in this article.

Handling multi language website

There are a whole load of areas of your system that need to be considered when looking at multi-lingual systems.

You need to ensure that you are using a suitable character encoding throughout your system. In most cases, the best choice of character encoding is UTF-8. (UTF-16 can encode exactly the same set of characters and is occasionally preferred, for example for text that is mostly East Asian, but these cases are few and far between, and PHP will struggle with UTF-16 anyway, so in general stick with UTF-8 for everything and you'll be fine.)

You need to make sure you're using the same character encoding in the following places:

  • Your database tables.
  • Your web server.
  • Your PHP source code.

The database is easy to deal with: just make sure all tables are created with UTF-8 encoding for their charset. Job done.

Collation is less relevant -- this specifies the sort order. This does matter of course, but it has no bearing on the garbled text display you're seeing. (It's worth saying that some characters are sorted differently in different languages, so it's virtually impossible to pick a collation that will suit everyone if you need to support multiple languages in a single table, but I wouldn't get too worried about this for now.)

The web server is relatively simple too, as long as you're comfortable with Apache config (or whatever server software you're using). You need to ensure that all pages output to the browser are sent using UTF-8 encoding.

Finally, your PHP source code...

Firstly, you should make sure you're editing the actual PHP code files in UTF-8 mode. Otherwise, you may have trouble if you have any non-ASCII characters written in your code.

Secondly, be aware that a number of PHP's standard string handling functions are "not multi-byte aware". This means that they don't work correctly with extended character sets. For example, strlen() will return the number of bytes the string takes up in memory. This will be incorrect if your string includes characters that take up more than one byte. Fortunately, PHP also supplies a set of multi-byte functions to resolve this. So for example, instead of using strlen(), use mb_strlen(). The PHP manual gives more detail about the exact functions available and when to use them.
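As a small illustration of the difference, the sketch below compares the byte-based and multi-byte functions on an Arabic word. It assumes the mbstring extension is enabled and that the file itself is saved as UTF-8; the variable names are made up for the example.

```php
<?php
// "hello" in Arabic: 5 letters, each stored as 2 bytes in UTF-8.
$word = "مرحبا";

$byteCount = strlen($word);              // counts bytes
$charCount = mb_strlen($word, 'UTF-8');  // counts characters (code points)

echo $byteCount, "\n"; // 10
echo $charCount, "\n"; // 5

// substr() works on bytes and can cut a character in half;
// mb_substr() works on whole characters.
$firstTwo = mb_substr($word, 0, 2, 'UTF-8'); // "مر"
echo $firstTwo, "\n";
```

Passing the encoding explicitly, as done here, avoids depending on whatever mb_internal_encoding() happens to be set to on the server.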

Also, make sure that you handle any incoming posted data with the correct character set as well.
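One way to do this, sketched under the assumption that mbstring is available and PHP is 7.1 or newer, is to reject any posted value that is not valid UTF-8 before it reaches the rest of the application. The utf8_or_null() helper is a hypothetical name for this illustration, not a built-in function.

```php
<?php
// Hypothetical helper: returns the string unchanged if it is valid UTF-8,
// otherwise null so the caller can reject or re-request the input.
function utf8_or_null(string $value): ?string
{
    return mb_check_encoding($value, 'UTF-8') ? $value : null;
}

$valid   = utf8_or_null("مرحبا");      // valid UTF-8, returned as-is
$invalid = utf8_or_null("\xE9t\xE9");  // "été" in ISO-8859-1 bytes, not valid UTF-8

var_dump($valid !== null);   // bool(true)
var_dump($invalid === null); // bool(true)
```

In a form handler this might be used as utf8_or_null($_POST['name'] ?? ''), where the field name is only an example.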

Hopefully that will help you. The key here is to ensure that your system uses a consistent character set throughout all its layers. Weird-looking encoding errors tend to happen when one layer in your system is using a different character set from the others. Make sure they're all the same (and preferably UTF-8), and that should resolve your garbled character problems.

Multilanguage sites: Left-to-right and right-to-left

A good practice would be to use the lang attribute to describe which language is being used: http://www.w3.org/TR/REC-html40/struct/dirlang.html

I would define the language for the document as a whole (on the root html element), and if necessary locally on individual elements within the document.

You don't mention which doctype you are using, but if you are using XHTML then there are also xml:lang attributes to consider:
http://www.w3schools.com/Xhtml/xhtml_syntax.asp

I don't know if it is 'best practice', but when I worked on an English and Arabic site recently I found it useful to use CSS classes for setting rtl and ltr.

What UTF-8 languages can be displayed 'safe' in HTML without any specific presentational adjustments?

I am not really sure what you mean by "presentational adjustments", or what they have to do with UTF-8.

First, UTF-8 is just a character encoding, a way to represent Unicode. It can represent the characters of any national script (the only problems could come from the additional 4-byte Chinese characters defined by GB18030:2005, but I believe this will go away with Unicode 6.0).

Second, non-Latin scripts may require adjustments either way: font face and size might need to be changed anyway.

Third, you mentioned direction: rtl, which is a CSS declaration. You should rather use the HTML dir attribute to switch directionality; this is in line with W3C recommendations.

Last, I wouldn't hard-code the list of RTL languages as, say, Arabic, Hebrew, Urdu and Persian (Farsi), as there are other bidirectional languages.

In other words, I would consider what to do to allow style and directionality modifications for the end user rather than hard-coding it.


