Which Is the Best Character Encoding for Japanese Language for DB, PHP, and HTML Display

Which is the best character encoding for Japanese language for DB, PHP, and HTML display?

UTF-8, without a doubt. Make everything UTF-8. To declare that your web page is UTF-8 encoded, use this within your HEAD tag (in HTML5, <meta charset="UTF-8"> is the equivalent short form):

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

As for MySQL, put the following into your my.cnf (config) file. Prefer utf8mb4 over utf8: MySQL's utf8 stores at most three bytes per character, so it cannot hold emoji or some of the rarer kanji.

[mysqld]
character_set_server=utf8mb4
collation_server=utf8mb4_unicode_ci

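If you're getting garbage characters back from queries executed by your application, the connection character set is probably wrong. Execute this query right after connecting, before you read or write any Japanese text:

SET NAMES utf8mb4;

There is no need to run SET CHARACTER SET as well: it resets the connection character set to the database default, which can undo part of what SET NAMES did.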
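
From PHP, the connection character set can usually be requested when the connection is opened, so the query does not have to be sent by hand. The following is a minimal sketch using PDO; the helper names, host, and database name are made up for illustration, and the connect function is only defined, not called, because the credentials are placeholders.

```php
<?php
// Hypothetical helper: builds a PDO DSN that asks the MySQL driver to use
// utf8mb4 for the connection (the driver-level counterpart of SET NAMES).
function build_utf8_dsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// Hypothetical helper: opens the connection. It is not called in this sketch
// because the host and credentials are placeholders.
function connect_utf8(string $host, string $dbName, string $user, string $password): PDO
{
    return new PDO(build_utf8_dsn($host, $dbName), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

echo build_utf8_dsn('localhost', 'example_db'), PHP_EOL;
```

With mysqli, calling $mysqli->set_charset('utf8mb4') after connecting serves the same purpose.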

How to set schema collation in MySQL for Japanese

Small and full-size kana (such as ぁ and あ) are like "lowercase" and "uppercase", correct? Whether they compare as equal depends on the collation:

mysql> SELECT 'あ' = 'ぁ' COLLATE utf8_general_ci;
+---------------------------------------+
| 'あ' = 'ぁ' COLLATE utf8_general_ci |
+---------------------------------------+
| 0 |
+---------------------------------------+

mysql> SELECT 'あ' = 'ぁ' COLLATE utf8_unicode_ci;
+---------------------------------------+
| 'あ' = 'ぁ' COLLATE utf8_unicode_ci |
+---------------------------------------+
| 1 |
+---------------------------------------+

mysql> SELECT 'あ' = 'ぁ' COLLATE utf8_unicode_520_ci;
+-------------------------------------------+
| 'あ' = 'ぁ' COLLATE utf8_unicode_520_ci |
+-------------------------------------------+
| 1 |
+-------------------------------------------+

I recommend changing your column to use COLLATE utf8_unicode_520_ci (or utf8mb4_unicode_520_ci).

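If you expect to include Chinese, be sure to use utf8mb4. The same advice applies to Japanese: some kanji, such as 𠮷 (U+20BB7), lie outside the Basic Multilingual Plane and need four bytes in UTF-8, which MySQL's utf8 cannot store.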
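
To see which text actually needs utf8mb4, it can help to look at how many bytes a character takes in UTF-8. The sketch below assumes the PHP file itself is saved as UTF-8; needs_utf8mb4 is a hypothetical helper name, not a built-in function.

```php
<?php
// Print the number of bytes each sample character occupies in UTF-8.
// strlen() counts bytes, not characters.
$samples = ['A', 'あ', '漢', '𠮷'];

foreach ($samples as $char) {
    printf("%s: %d byte(s)\n", $char, strlen($char));
}

// Hypothetical helper: true if the text contains a character that needs
// four bytes in UTF-8 and therefore requires utf8mb4 in MySQL.
function needs_utf8mb4(string $text): bool
{
    // Code points above U+FFFF are encoded as four bytes in UTF-8.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

var_dump(needs_utf8mb4('吉田'));   // only three-byte characters
var_dump(needs_utf8mb4('𠮷野家')); // first character needs four bytes
```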

Help with proper character encoding

Total encoding confusion! :-)

The table character set

The MySQL table character set only determines what encoding MySQL should use internally, and thus the range of characters permitted.

  • If you set it to Latin-1 (aka ISO 8859-1), you will not be able to store characters outside the Western European repertoire, such as Japanese, in your table.
  • Importantly, the character set does not affect the encoding MySQL uses when communicating with your PHP script.
  • The table collation specifies rules for sorting.

The connection character set

The MySQL connection character set determines the encoding you receive table data in (and should send data to MySQL in).

  • The encoding is set using SET NAMES, e.g. SET NAMES "utf8".
  • If this does not match the table encoding, MySQL automatically converts data on the fly.
  • If this does not match your page character set, you'll have to manually perform character set conversion in PHP, using e.g. mb_convert_encoding or iconv. (utf8_encode only converts from ISO 8859-1 and is deprecated as of PHP 8.2.)
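
As a rough illustration of converting between the connection encoding and a different page encoding, here is a sketch that turns UTF-8 text into Shift_JIS and back with mb_convert_encoding. It assumes the mbstring extension is loaded and the file is saved as UTF-8; the Shift_JIS page is only an example scenario.

```php
<?php
// Text as it would arrive from a UTF-8 database connection.
$utf8 = '日本語';

// Convert UTF-8 text to Shift_JIS, e.g. for a legacy page served as Shift_JIS.
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

// Convert it back to UTF-8 to confirm nothing was lost.
$roundTrip = mb_convert_encoding($sjis, 'UTF-8', 'SJIS');

printf("UTF-8 bytes: %d, Shift_JIS bytes: %d\n", strlen($utf8), strlen($sjis));
var_dump($roundTrip === $utf8);
```

If all three layers already use UTF-8, no conversion step like this is needed at all.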

Page character set

The page character set, specified using the Content-Type header, tells the browser how to interpret the PHP script output.

  • As an HTTP header, it is not saved when you save the file from within your browser. The information is thus not available to OpenOffice or other programs.

Recommendations

Ideally, you should use the same encoding in all three places, and ideally, that encoding should be UTF-8.

However, CSV will cause problems, since the file format does not include encoding information. It is thus up to the application to guess the encoding, and as you've seen, that guess can be wrong.

  • I don't know about OpenOffice, but Microsoft Office will assume the Windows "ANSI" encoding, which on Western European systems means CP1252 (a superset of Latin-1). On Japanese Windows it is CP932, a Shift_JIS variant.
  • Microsoft Office will also cause problems in countries that use "," as a decimal separator, since Office then switches to using ";" as a field separator for CSV-files.

Your best bet is to use Latin-1 for the CSV-file. I'd still use UTF-8 for the table and connection character sets though, and also UTF-8 for HTML pages.

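If you use UTF-8 for the connection character set (by executing SET NAMES "utf8" after connecting), you'll need to convert the text to Latin-1 before writing the CSV, e.g. with mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8'). The older utf8_decode performs the same conversion but is deprecated as of PHP 8.2.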
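
A minimal sketch of that CSV export might look like the following. The function name rows_to_latin1_csv and the sample rows are invented for illustration, the semicolon separator is an assumption for locales where Office expects it, and the mbstring extension is assumed to be loaded. Characters that have no Latin-1 equivalent (Japanese, for instance) cannot survive this conversion.

```php
<?php
// Hypothetical helper: writes rows of UTF-8 strings as a Latin-1 encoded,
// semicolon-separated CSV string.
function rows_to_latin1_csv(array $rows): string
{
    $handle = fopen('php://memory', 'w+');
    foreach ($rows as $row) {
        // Convert every field from UTF-8 to Latin-1 before writing it.
        $converted = array_map(
            fn (string $field): string => mb_convert_encoding($field, 'ISO-8859-1', 'UTF-8'),
            $row
        );
        fputcsv($handle, $converted, ';', '"', '\\');
    }
    rewind($handle);
    $csv = stream_get_contents($handle);
    fclose($handle);
    return $csv;
}

$csv = rows_to_latin1_csv([['name', 'city'], ['João', 'São Paulo']]);
echo bin2hex($csv), PHP_EOL; // hex dump, since a UTF-8 terminal would garble Latin-1
```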

That entity problem

I am also passing these submissions to Salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."

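This sounds like you're passing HTML code in an XML context, and is unrelated to character sets: XML only predefines a handful of entities (such as &amp; and &lt;), so an HTML entity like &Atilde; is undeclared there. Try running the text through html_entity_decode.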
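
As a sketch of that fix, assuming the text is UTF-8 and arrives with HTML named entities in it (the sample string is made up): decode the entities first, then escape only what XML itself requires.

```php
<?php
// Text containing HTML named entities, as it might arrive from a form handler.
$htmlText = 'S&atilde;o Paulo &amp; Rio';

// Step 1: turn HTML entities into real UTF-8 characters.
$plain = html_entity_decode($htmlText, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Step 2: escape only the characters XML treats specially (&, <, >, quotes).
$xmlSafe = htmlspecialchars($plain, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo $plain, PHP_EOL;   // São Paulo & Rio
echo $xmlSafe, PHP_EOL; // São Paulo &amp; Rio
```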

How to display Japanese characters on a php page?

Since you've stated that it works in your development environment but not in your live environment, you might want to check Apache's AddDefaultCharset directive and set it to UTF-8, if it's not already.

I tend to make sure the following are all in place:

  1. PHP sends a Content-Type header with charset=UTF-8
  2. The meta tag (Content-Type) is set to UTF-8
  3. Storage (database tables and connection) is set to UTF-8
  4. Server output (e.g. Apache's AddDefaultCharset) is set to UTF-8

That seems to work for me. Hope this helps.
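
For the PHP side of that checklist, a minimal sketch could look like this. The function name render_utf8_page and the sample text are made up for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: builds a minimal HTML page declared as UTF-8.
function render_utf8_page(string $text): string
{
    // Escape the text for HTML, telling PHP that it is UTF-8.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n<html>\n<head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\n"
        . "</head>\n<body>\n<p>{$safe}</p>\n</body>\n</html>\n";
}

// Send the HTTP header before any output; skip it if output has already started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo render_utf8_page('日本語のテキスト');
```

The header and the meta tag say the same thing on purpose: the header is what the browser trusts first, and the meta tag still applies when the page is saved to disk and opened without the HTTP header.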


