What Are Best Practices For Multi-Language Database Design

What are best practices for multi-language database design?

What we do, is to create two tables for each multilingual object.

E.g. the first table contains only language-neutral data (primary key, etc.) and the second table contains one record per language, containing the localized data plus the ISO code of the language.

In some cases we add a DefaultLanguage field, so that we can fall-back to that language if no localized data is available for a specified language.

Example:

Table "Product":
----------------
ID : int



Table "ProductTranslations"
---------------------------
ID : int (foreign key referencing the Product)
Language : varchar (e.g. "en-US", "de-CH")
IsDefault : bit
ProductDescription : nvarchar

With this approach, you can handle as many languages as needed (without having to add additional fields for each new language).


Update (2014-12-14): please have a look at this answer, for some additional information about the implementation used to load multilingual data into an application.

What's the best database structure to keep multilingual data?

Your third example is actually the way the problem is usually solved. Hard, but doable.

Remove the reference to product from the translation table and put a reference to translation where you need it (the other way around).

[ products ]
id (INT)
price (DECIMAL)
title_translation_id (INT, FK)

[ translation ]
id (INT, PK)
neutral_text (VARCHAR)
-- other properties that may be useful (date, creator etc.)

[ translation_text ]
translation_id (INT, FK)
language_id (INT, FK)
text (VARCHAR)

As an alternative (not especially a good one) you can have one single field and keep all translations there merged together (as XML, for example).


Supplier
Lieferant
Fournisseur

Schema for a multilanguage database

What do you think about having a related translation table for each translatable table?

CREATE TABLE T_PRODUCT (pr_id int, PRICE NUMBER(18, 2))

CREATE TABLE T_PRODUCT_tr (pr_id INT FK, languagecode varchar, pr_name text, pr_descr text)

This way if you have multiple translatable column it would only require a single join to get it + since you are not autogenerating a translationid it may be easier to import items together with their related translations.

The negative side of this is that if you have a complex language fallback mechanism you may need to implement that for each translation table - if you are relying on some stored procedure to do that. If you do that from the app this will probably not be a problem.

Let me know what you think - I am also about to make a decision on this for our next application.
So far we have used your 3rd type.

Best Practices to Design multi-language web site database

On the DB side, use unicode (nvarchar and nchar) for your text.

Have a locale lookup table (containing the locale info you need - i.e. decimal point format), and for each table that has localized data add a foreign key to it.

Edit: (following comments)

Best practice is to use resource files for localization.

Database design for multiple language website

There are multiple ways on how to store multilanguage data in the database, I usually do it like this:

Item
ID
price
title_translation_key_id (is a TranslationKey foreign key)
desc_translation_key_id (is a TranslationKey foreign key)

TranslationKey
ID
key (string)

Translation
ID
translation_key_id (is a TranslationKey foreign key)
content (string)
language_id (is a Language foreign key)

Language
ID
code

How to retrieve the data?

You can either work with sub-selects or if you database doesn't perform well, you can generate language specific tables out of it.

Database modeling for international and multilingual purposes

Here is the way I would design the database:

Data model

Visualization by DB Designer Fork

The i18n table only contains a PK, so that any table just has to reference this PK to internationalize a field. The table translation is then in charge of linking this generic ID with the correct list of translations.

locale.id_locale is a VARCHAR(5) to manage both of en and en_US ISO syntaxes.

currency.id_currency is a CHAR(3) to manage the ISO 4217 syntax.

You can find two examples: page and newsletter. Both of these admin-managed entites need to internationalize their fields, respectively title/description and subject/content.

Here is an example query:

select
t_subject.tx_translation as subject,
t_content.tx_translation as content

from newsletter n

-- join for subject
inner join translation t_subject
on t_subject.id_i18n = n.i18n_subject

-- join for content
inner join translation t_content
on t_content.id_i18n = n.i18n_content

inner join locale l

-- condition for subject
on l.id_locale = t_subject.id_locale

-- condition for content
and l.id_locale = t_content.id_locale

-- locale condition
where l.id_locale = 'en_GB'

-- other conditions
and n.id_newsletter = 1

Note that this is a normalized data model. If you have a huge dataset, maybe you could think about denormalizing it to optimize your queries. You can also play with indexes to improve the queries performance (in some DB, foreign keys are automatically indexed, e.g. MySQL/InnoDB).

Best practice multi language website

Topic's premise

There are three distinct aspects in a multilingual site:

  • interface translation
  • content
  • url routing

While they all interconnected in different ways, from CMS point of view they are managed using different UI elements and stored differently. You seem to be confident in your implementation and understanding of the first two. The question was about the latter aspect - "URL Translation? Should we do this or not? and in what way?"

What the URL can be made of?

A very important thing is, don't get fancy with IDN. Instead favor transliteration (also: transcription and romanization). While at first glance IDN seems viable option for international URLs, it actually does not work as advertised for two reasons:

  • some browsers will turn the non-ASCII chars like 'ч' or 'ž' into '%D1%87' and '%C5%BE'
  • if user has custom themes, the theme's font is very likely to not have symbols for those letters

I actually tried to IDN approach few years ago in a Yii based project (horrible framework, IMHO). I encountered both of the above mentioned problems before scraping that solution. Also, I suspect that it might be an attack vector.

Available options ... as I see them.

Basically you have two choices, that could be abstracted as:

  • http://site.tld/[:query]: where [:query] determines both language and content choice

  • http://site.tld/[:language]/[:query]: where [:language] part of URL defines the choice of language and [:query] is used only to identify the content

Query is Α and Ω ..

Lets say you pick http://site.tld/[:query].

In that case you have one primary source of language: the content of [:query] segment; and two additional sources:

  • value $_COOKIE['lang'] for that particular browser
  • list of languages in HTTP Accept-Language (1), (2) header

First, you need to match the query to one of defined routing patterns (if your pick is Laravel, then read here). On successful match of pattern you then need to find the language.

You would have to go through all the segments of the pattern. Find the potential translations for all of those segments and determine which language was used. The two additional sources (cookie and header) would be used to resolve routing conflicts, when (not "if") they arise.

Take for example: http://site.tld/blog/novinka.

That's transliteration of "блог, новинка", that in English means approximately "blog", "latest".

As you can already notice, in Russian "блог" will be transliterated as "blog". Which means that for the first part of [:query] you (in the best case scenario) will end up with ['en', 'ru'] list of possible languages. Then you take next segment - "novinka". That might have only one language on the list of possibilities: ['ru'].

When the list has one item, you have successfully found the language.

But if you end up with 2 (example: Russian and Ukrainian) or more possibilities .. or 0 possibilities, as a case might be. You will have to use cookie and/or header to find the correct option.

And if all else fails, you pick the site's default language.

Language as parameter

The alternative is to use URL, that can be defined as http://site.tld/[:language]/[:query]. In this case, when translating query, you do not need to guess the language, because at that point you already know which to use.

There is also a secondary source of language: the cookie value. But here there is no point in messing with Accept-Language header, because you are not dealing with unknown amount of possible languages in case of "cold start" (when user first time opens site with custom query).

Instead you have 3 simple, prioritized options:

  1. if [:language] segment is set, use it
  2. if $_COOKIE['lang'] is set, use it
  3. use default language

When you have the language, you simply attempt to translate the query, and if translation fails, use the "default value" for that particular segment (based on routing results).

Isn't here a third option?

Yes, technically you can combine both approaches, but that would complicate the process and only accommodate people who want to manually change URL of http://site.tld/en/news to http://site.tld/de/news and expect the news page to change to German.

But even this case could probable be mitigated using cookie value (which would contain information about previous choice of language), to implement with less magic and hope.

Which approach to use?

As you might already guessed, I would recommend http://site.tld/[:language]/[:query] as the more sensible option.

Also in real word situation you would have 3rd major part in URL: "title". As in name of the product in online shop or headline of article in news site.

Example: http://site.tld/en/news/article/121415/EU-as-global-reserve-currency

In this case '/news/article/121415' would be the query, and the 'EU-as-global-reserve-currency' is title. Purely for SEO purposes.

Can it be done in Laravel?

Kinda, but not by default.

I am not too familiar with it, but from what I have seen, Laravel uses simple pattern-based routing mechanism. To implement multilingual URLs you will probably have to extend core class(es), because multilingual routing need access to different forms of storage (database, cache and/or configuration files).

It's routed. What now?

As a result of all you would end up with two valuable pieces of information: current language and translated segments of query. These values then can be used to dispatch to the class(es) which will produce the result.

Basically, the following URL: http://site.tld/ru/blog/novinka (or the version without '/ru') gets turned into something like

$parameters = [
'language' => 'ru',
'classname' => 'blog',
'method' => 'latest',
];

Which you just use for dispatching:

$instance = new {$parameter['classname']};
$instance->{'get'.$parameters['method']}( $parameters );

.. or some variation of it, depending on the particular implementation.

How to use multilanguage database schema with ORM?

Using only one table for all translations may quickly lead to severe performance and complexity issues for any operation (select/insert/update/delete) ; just try it to understand what I mean.

I would then go for the following method (two tables per "translatable" object), which seems a good balance between performance, complexity, and maintenance issues.

product
- id (PK)
- number1
- number2
- date1
- date2
...

product_i18n
- id (PK)
- product_id (FK)
- language_id (FK)
- string1
- string2
...

language
- id (PK)
- name

Check how it's done with Propel ORM (PHP) : http://propelorm.org/blog/2011/01/11/propel-gets-i18n-behavior-and-why-it-matters.html

HTH

What are the best practices for multilanguage sites?

In addition to @Quassnoi's answers ensure that you standard RFC 4646 language identifiers (e.g. EN-US, DE-AT); you may already be aware of this. The CLDR project is an excellent repository of internationalization data (the Supplemental Data is really useful).

If a translation of a specific page is not available, use a language fallback mechanism back to the neutral language; for example "DE-AT", "DE", "" (neutral, e.g. "EN").

Most recent browsers and the underlying operating systems will correctly show all of the characters required for a locale selector list if the page is encoded correctly (I'd recommend all pages being UTF-8). Ensure that the locale list contains both the native and current-language names to allow both native and non-native speakers to view the specified translations, e.g. "Deutsch (German)" if the current locale is EN-*.

A lot of sites use a flag icon to show the current locale, but this is more relevant to the location and some people may be offended if you show only a dominant flag (e.g. the US or UK flag for English).

It may be worthwhile to have a more visible (semi-graphical) locale selector on the home page if no locale cookie has been submitted, using a combination of GeoIP and Accept-Language to determine the default locale choice.

Semi-related: if your users are in located in different time zones include a zone preference in their account profile for displaying time values in their local time. And store all time stamps using UTC.



Related Topics



Leave a reply



Submit