PHP MySQL Character Set: Storing HTML of International Content

MySQL converts character sets on the fly between the stored encoding and what it calls the connection charset. You can specify this charset using the SQL statement

SET NAMES utf8

or use a specific API function such as mysql_set_charset():

mysql_set_charset("utf8", $conn);

If this is done correctly there's no need to use functions such as utf8_encode() and utf8_decode().
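The mysql_* functions shown above belong to the old mysql extension, which was removed in PHP 7. A minimal sketch of the same step with mysqli follows. The function name and the connection parameters are placeholders, and utf8mb4 (MySQL's 4-byte UTF-8 charset) is an assumption in place of the utf8 used above.

```php
<?php
// Sketch only: host, user, password and database name are placeholders.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    // Make mysqli throw exceptions instead of silently returning false.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $conn = new mysqli($host, $user, $pass, $db);

    // Has the same effect on the server as "SET NAMES utf8mb4", and also
    // tells the client library which charset the connection uses.
    $conn->set_charset('utf8mb4');

    return $conn;
}
```

Calling it would look like $conn = open_utf8_connection('localhost', 'user', 'secret', 'mydb'); with your own credentials.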

You also have to make sure that the browser uses the same encoding. This is usually done using a simple header:

header('Content-type: text/html;charset=utf-8');

(Note that the charset is called utf-8 in the browser but utf8 in MySQL.)

In most cases the connection charset and the web charset are the only things you need to keep track of, so if it still doesn't work, there's probably something else you're doing wrong. Try experimenting with it a bit; it usually takes a while to fully understand.

Correct PHP method to store special chars in MySQL DB

Use utf8 encoding to store these values.

To avoid injections use mysql_real_escape_string() (or prepared statements).

To protect against XSS, use htmlspecialchars() when you output the values.
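The two steps above could look roughly like this. It is a sketch under assumptions: the table `comments` and its column `body` are hypothetical, and PDO is used for the prepared statement instead of the removed mysql_real_escape_string().

```php
<?php
// Escape a UTF-8 string for output inside HTML text or attribute values.
function escape_html(string $text): string
{
    // Naming the charset explicitly keeps multi-byte characters intact.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Hypothetical table `comments` with a text column `body`.
// The prepared statement sends the value separately from the SQL text,
// so no manual escaping is needed when storing it.
function store_comment(PDO $pdo, string $body): void
{
    $stmt = $pdo->prepare('INSERT INTO comments (body) VALUES (?)');
    $stmt->execute([$body]);
}
```

Note the split: the raw value goes into the database, and escape_html() is applied only at output time.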

Resolving incorrect character encoding when displaying MySQL database results after upgrade to PHP 5.3

If you have made sure that both the tables and the output encoding are UTF-8, almost the only thing left is the connection encoding.

The reason for the change in behaviour when updating servers could be a change of the default connection encoding:

[mysql]
default-character-set=utf8

However, I can't see any changes in the default encoding between versions, so if those were brand-new installs, I can't see that happening.

Anyway, what happens if you run these queries from within your PHP script and output the results? Are there any differences from the command line output?

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
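One way to run those two statements from PHP is sketched below. It assumes an already opened mysqli connection; the function name is made up for illustration.

```php
<?php
// Sketch: $conn is an already opened mysqli connection.
// Returns the charset and collation variables as name => value pairs.
function fetch_charset_variables(mysqli $conn): array
{
    $settings = [];
    foreach (['character_set%', 'collation%'] as $pattern) {
        $result = $conn->query("SHOW VARIABLES LIKE '" . $pattern . "'");
        while ($row = $result->fetch_assoc()) {
            // SHOW VARIABLES returns the columns Variable_name and Value.
            $settings[$row['Variable_name']] = $row['Value'];
        }
        $result->free();
    }
    return $settings;
}
```

Dumping the result with print_r() and comparing it line by line with the command line output should show whether the PHP connection uses a different charset.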

How can character encoding be made correctly in both php and mysql database

Character set issues are often really tricky to figure out. Basically, you need to make sure that all of the following are true:

  • The DB connection is using UTF-8
  • The DB tables are using UTF-8
  • The individual columns in the DB tables are using UTF-8
  • The data is actually stored properly in the UTF-8 encoding inside the database (often not the case if you've imported from bad sources, or changed table or column collations)
  • The web page is requesting UTF-8
  • Apache is serving UTF-8

Here's a good tutorial on dealing with that list, from start to finish: https://web.archive.org/web/20110303024445/http://www.bluebox.net/news/2009/07/mysql_encoding/

It sounds like your problem is specifically that you've got double-encoded (or triple-encoded) characters, probably from changing character sets or importing already-encoded data with the wrong charset. There's a whole section on fixing that in the above tutorial.
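To make double encoding concrete, here is a small sketch that reproduces the effect and reverses one layer of it. It assumes the wrong charset was Latin-1 (ISO-8859-1) and that the mbstring extension is available; the function names are illustrative. The repair is a heuristic, so try it on a copy of the data first.

```php
<?php
// Simulate the bug: UTF-8 bytes are misread as Latin-1 and encoded again.
function double_encode(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// Try to undo one layer of double encoding; return the input unchanged
// when it does not look double-encoded.
function undo_double_encoding(string $text): string
{
    // Characters above U+00FF cannot come from a Latin-1 misreading
    // (this also bails out if $text is not valid UTF-8 at all).
    if (preg_match('/[^\x{00}-\x{FF}]/u', $text) !== 0) {
        return $text;
    }
    $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
    // Accept the result only if the recovered bytes are valid UTF-8.
    return mb_check_encoding($candidate, 'UTF-8') ? $candidate : $text;
}
```

For triple-encoded text the repair function would have to be applied twice.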

Help with multi-lingual text, php, and mysql

I don't think you have any practical alternatives to UTF-8. You're going to have to track down where the encoding and/or decoding breaks. Start by checking whether you can round-trip multi-language text to the database from the mysql command line, or perhaps through phpMyAdmin. Track down and eliminate problems at that level. Then move out one more level by simulating input to your PHP script and examining the output, again dealing with any problems. Finally, add browsers into the mix.
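When comparing levels like this, it helps to look at the raw bytes rather than at what a terminal or browser renders. A minimal sketch, assuming the mbstring extension; the helper name is hypothetical:

```php
<?php
// Describe a string at the byte level so the value can be compared
// at each hop (form input, before the query, after the query, output).
function describe_bytes(string $value): string
{
    $valid = mb_check_encoding($value, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
    return sprintf('%d bytes, hex %s, %s', strlen($value), bin2hex($value), $valid);
}
```

An "é" encoded as UTF-8 shows up as hex c3a9, while the same letter in Latin-1 is the single byte e9, so the hop where the hex changes is the one that breaks the encoding.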

Strange characters from PHP form. Character set?

It's also good to make sure that the web server is advertising UTF-8, but that's not the culprit here. I use the Live HTTP Headers extension in Firefox to test. MySQL defaults to the latin1 character set (in versions before 8.0), and you must explicitly set it otherwise with mysql_set_charset(). PHP itself is not very good at multi-byte character sets like UTF-8, but as long as it doesn't need to understand those characters (as it does for regular expression matching, for example) you are safe. You just need to make sure that all input and output, both to the user (via the meta tag) and to the database, is aware of the character encoding.
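The point about PHP and multi-byte characters can be seen in a short sketch (assuming the mbstring extension is installed): the plain string functions count bytes, and regular expressions only treat the input as UTF-8 when the /u modifier is given.

```php
<?php
$word = "na\u{00EF}ve"; // "naive" with a diaeresis: 5 characters, 6 bytes in UTF-8

$byteLength = strlen($word);             // counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // counts characters

// Without /u the dot matches one byte; with /u it matches one character.
$matchesBytewise = preg_match('/^.{5}$/', $word);
$matchesAsUtf8   = preg_match('/^.{5}$/u', $word);
```

As long as the script only passes such strings through unchanged, the byte-oriented behaviour does no harm.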

Help with proper character encoding

Total encoding confusion! :-)

The table character set

The MySQL table character set only determines what encoding MySQL should use internally, and thus the range of characters permitted.

  • If you set it to Latin-1 (aka ISO 8859-1), you will not be able to store characters outside the Latin-1 range (for example Cyrillic, Greek or Asian scripts) in your table.
  • Importantly, the character set does not affect the encoding MySQL uses when communicating with your PHP script.
  • The table collation specifies rules for sorting.

The connection character set

The MySQL connection character set determines the encoding you receive table data in (and should send data to MySQL in).

  • The encoding is set using SET NAMES, e.g. SET NAMES "utf8".
  • If this does not match the table encoding, MySQL automatically converts data on the fly.
  • If this does not match your page character set, you'll have to manually perform character set conversion in PHP, using e.g. utf8_encode or mb_convert_encoding.

Page character set

The page character set, specified using the Content-Type header, tells the browser how to interpret the PHP script output.

  • As an HTTP header, it is not saved when you save the file from within your browser. The information is thus not available to OpenOffice or other programs.

Recommendations

Ideally, you should use the same encoding in all three places, and ideally, that encoding should be UTF-8.

However, CSV will cause problems, since the file format does not include encoding information. It is thus up to the application to guess the encoding, and as you've seen, the guess will be wrong.

  • I don't know about OpenOffice, but Microsoft Office will assume the Windows "ANSI" encoding, which usually means Latin-1 (or CP1252 to be specific).
  • Microsoft Office will also cause problems in countries that use "," as a decimal separator, since Office then switches to using ";" as a field separator for CSV-files.

Your best bet is to use Latin-1 for the CSV-file. I'd still use UTF-8 for the table and connection character sets though, and also UTF-8 for HTML pages.

If you use UTF-8 for the connection character set (by executing SET NAMES "utf8" after connecting), you'll need to run the text through utf8_decode to convert to Latin-1.
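A sketch of that conversion for one CSV row follows. Note that utf8_decode() is deprecated as of PHP 8.2, so this example uses mb_convert_encoding() instead, targets Windows-1252 rather than strict Latin-1 (an assumption based on the "ANSI" remark above), and uses ";" as separator. The function name is made up.

```php
<?php
// Build one CSV line for Microsoft Office from UTF-8 field values.
function csv_line_for_office(array $fields, string $separator = ';'): string
{
    $converted = [];
    foreach ($fields as $field) {
        // Convert each field from the UTF-8 connection charset to CP1252.
        $converted[] = mb_convert_encoding((string) $field, 'Windows-1252', 'UTF-8');
    }

    // Let fputcsv() handle quoting, writing into an in-memory stream.
    $stream = fopen('php://memory', 'w+');
    fputcsv($stream, $converted, $separator, '"', '\\');
    rewind($stream);
    $line = stream_get_contents($stream);
    fclose($stream);

    return $line;
}
```

Characters that do not exist in the target charset are replaced by mbstring's substitute character, so some data loss is possible with text outside Western European languages.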

That entity problem

I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."

This sounds like you're passing HTML code in an XML context, and is unrelated to character sets. Try running the text through html_entity_decode.
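As a sketch of that suggestion: decode the HTML entities to real UTF-8 characters, then escape only the characters that are special in XML. Whether the second step is needed depends on how the XML is built, which the question does not show, so treat it as an assumption; the function name is illustrative.

```php
<?php
// Turn HTML-encoded text into text that is safe inside an XML element.
function html_to_xml_text(string $html): string
{
    // "&Atilde;" and other named HTML entities become real characters.
    $decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Re-escape only &, <, > and quotes, which XML itself understands.
    return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

If the decoded text then shows sequences such as "Ã£" where a single accented letter is expected, the data was also double-encoded before it was turned into entities.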

Foreign characters turn into garbage in mysql

Either change your document's header to

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

or, better, change your tables' character set to UTF-8. Doing that is not entirely trivial: just changing the tables' collation won't do the trick. This SO question might give some pointers.


