PHP MySQL character set: storing HTML of international content
MySQL converts character sets on the fly to and from what it calls the connection charset. You can specify this charset using the SQL statement
SET NAMES utf8
or use a specific API function such as mysql_set_charset():
mysql_set_charset("utf8", $conn);
If this is done correctly there's no need to use functions such as utf8_encode() and utf8_decode().
You also have to make sure that the browser uses the same encoding. This is usually done using a simple header:
header('Content-type: text/html;charset=utf-8');
(Note that the charset is called utf-8 in the browser but utf8 in MySQL.)
In most cases the connection charset and the web charset are the only things you need to keep track of, so if it still doesn't work there's probably something else you're doing wrong. Try experimenting with it a bit; it usually takes a while to fully understand.
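The mysql_* functions shown above were removed in PHP 7, so here is a minimal sketch of the same two steps with mysqli. The function names and connection parameters are placeholders of mine, not from the question, and I've assumed utf8mb4 (MySQL's full 4-byte UTF-8) rather than utf8:

```php
<?php
// Hypothetical helper: open a connection and set the connection charset.
// Host, user, password and database name are supplied by the caller.
function open_utf8_connection(string $host, string $user, string $password, string $database): mysqli
{
    $conn = new mysqli($host, $user, $password, $database);
    if ($conn->connect_error) {
        throw new RuntimeException('Connect failed: ' . $conn->connect_error);
    }
    // Same effect as SET NAMES, but it also informs the client library.
    if (!$conn->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not set charset: ' . $conn->error);
    }
    return $conn;
}

// Hypothetical helper: build the header that tells the browser the page encoding.
function content_type_header(string $charset = 'utf-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Typical use, before any output is sent:
// header(content_type_header());
```

Nothing connects until you call open_utf8_connection() yourself, so the snippet is safe to paste into an include file.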
Correct PHP method to store special chars in MySQL DB
Use utf8 encoding to store these values.
To avoid injections use mysql_real_escape_string() (or prepared statements).
To protect from XSS use htmlspecialchars.
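A rough sketch of the last two points, using a mysqli prepared statement instead of mysql_real_escape_string(). The comments table, its body column and both function names are made up for illustration:

```php
<?php
// Hypothetical: store raw user text with a prepared statement, so no manual
// escaping is needed. Assumes a table comments(body) and a connection whose
// charset has already been set to UTF-8.
function save_comment(mysqli $conn, string $text): bool
{
    $stmt = $conn->prepare('INSERT INTO comments (body) VALUES (?)');
    if ($stmt === false) {
        return false;
    }
    $stmt->bind_param('s', $text);
    $ok = $stmt->execute();
    $stmt->close();
    return $ok;
}

// Escape at output time, telling htmlspecialchars() the text is UTF-8.
function render_comment(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}
```

The idea is to store the text as typed and escape only when writing it into HTML, so the database never contains entity-encoded data.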
Resolving incorrect character encoding when displaying MySQL database results after upgrade to PHP 5.3
If you have made sure that both the tables and the output encoding are UTF-8, almost the only thing left is the connection encoding.
The reason for the change in behaviour when updating servers could be a change of the default connection encoding:
[mysql]
default-character-set=utf8
However, I can't see any changes in the default encoding between versions, so if those were brand-new installs, I can't see that happening.
Anyway, what happens if you run these queries from within your PHP script and output the results? Are there any differences from the command line output?
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
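To run those two queries from PHP, something like the following should do. It is only a sketch: both function names are mine, and it assumes you already have an open mysqli connection.

```php
<?php
// Hypothetical helper: collect the charset and collation variables as seen
// by this particular connection.
function charset_variables(mysqli $conn): array
{
    $vars = [];
    $queries = [
        "SHOW VARIABLES LIKE 'character_set%'",
        "SHOW VARIABLES LIKE 'collation%'",
    ];
    foreach ($queries as $sql) {
        $result = $conn->query($sql);
        while ($row = $result->fetch_assoc()) {
            $vars[$row['Variable_name']] = $row['Value'];
        }
        $result->free();
    }
    return $vars;
}

// Hypothetical helper: list the connection-related variables that differ
// from the charset you expected (these are the three that SET NAMES sets).
function charset_mismatches(array $vars, string $expected): array
{
    $watched = ['character_set_client', 'character_set_connection', 'character_set_results'];
    $bad = [];
    foreach ($watched as $name) {
        if (isset($vars[$name]) && $vars[$name] !== $expected) {
            $bad[$name] = $vars[$name];
        }
    }
    return $bad;
}
```

Comparing charset_mismatches(charset_variables($conn), 'utf8') on both servers should show quickly whether the connection default changed.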
How can character encoding be made correctly in both php and mysql database
Character set issues are often really tricky to figure out. Basically, you need to make sure that all of the following are true:
- The DB connection is using UTF-8
- The DB tables are using UTF-8
- The individual columns in the DB tables are using UTF-8
- The data is actually stored properly in the UTF-8 encoding inside the database (often not the case if you've imported from bad sources, or changed table or column collations)
- The web page is requesting UTF-8
- Apache is serving UTF-8
Here's a good tutorial on dealing with that list, from start to finish: https://web.archive.org/web/20110303024445/http://www.bluebox.net/news/2009/07/mysql_encoding/
It sounds like your problem is specifically that you've got double-encoded (or triple-encoded) characters, probably from changing character sets or importing already-encoded data with the wrong charset. There's a whole section on fixing that in the above tutorial.
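For what it's worth, here is a small sketch of what double encoding looks like at the byte level and one heuristic way to undo a single layer of it in PHP. It assumes the mbstring extension is loaded; fix_double_encoded() is a name I made up. Note that MySQL's latin1 is really cp1252, so real-world data may need 'Windows-1252' in place of 'ISO-8859-1'.

```php
<?php
// Heuristic: if the text can be losslessly converted from UTF-8 back to
// ISO-8859-1 and the result is itself valid UTF-8, assume it was encoded
// twice and return the once-encoded version. Otherwise return it unchanged.
function fix_double_encoded(string $text): string
{
    $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
    // Guard against characters outside Latin-1 being replaced by "?".
    $lossless = mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1') === $text;
    if ($lossless && $candidate !== $text && mb_check_encoding($candidate, 'UTF-8')) {
        return $candidate;
    }
    return $text;
}

$good = "\xC3\xA9";                                          // "é" as UTF-8
$double = mb_convert_encoding($good, 'UTF-8', 'ISO-8859-1'); // encoded a second time
echo bin2hex($double), "\n";                                 // c383c2a9
echo bin2hex(fix_double_encoded($double)), "\n";             // c3a9
```

Being a heuristic, it will also "fix" text that legitimately contains sequences such as "Ã©", so try it on a copy of the data first.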
Help with multi-lingual text, php, and mysql
I don't think you have any practical alternatives to UTF-8. You're going to have to track down where the encoding and/or decoding breaks. Start by checking whether you can round-trip multi-language text to the database from the mysql command line, or perhaps through phpMyAdmin. Track down and eliminate problems at that level. Then move out one more level by simulating input to your PHP and examining the output, again dealing with any problems. Finally add browsers into the mix.
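When checking each level, it helps to compare raw bytes rather than what the terminal or browser shows you. A minimal sketch (both helper names are mine):

```php
<?php
// Show a string as space-separated hex bytes, e.g. "c3 a9".
function hex_bytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// Compare what was sent into a layer with what came back out of it.
function compare_round_trip(string $sent, string $received): string
{
    if ($sent === $received) {
        return 'ok';
    }
    return sprintf('mismatch: sent [%s], received [%s]', hex_bytes($sent), hex_bytes($received));
}

// "é" sent as UTF-8 but returned double-encoded:
echo compare_round_trip("\xC3\xA9", "\xC3\x83\xC2\xA9"), "\n";
// mismatch: sent [c3 a9], received [c3 83 c2 a9]
```

For "é", a lone e9 means Latin-1, c3 a9 means UTF-8, and c3 83 c2 a9 means UTF-8 that was encoded twice.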
Strange characters from PHP form. Character set?
It's also good to make sure that the web server is advertising UTF-8, but that's not the culprit here. I use the Live HTTP Headers extension in Firefox to test. MySQL has historically defaulted to the latin1 character set for connections, so you must explicitly set it otherwise with mysql_set_charset(). PHP itself is not very good at multi-byte character sets like UTF-8, but as long as it doesn't need to understand those characters (as it would for regular expression matching, for example) you are safe. You just need to make sure that all input from and output to the user (via the meta tag) and the database are aware of the character encoding.
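To illustrate where PHP does need to understand the characters, here is a small sketch comparing byte-oriented and UTF-8-aware calls. It assumes the mbstring extension is available:

```php
<?php
$word = "na\xC3\xAFve"; // "naïve" as UTF-8: 5 characters, 6 bytes

echo strlen($word), "\n";                      // 6 (bytes)
echo mb_strlen($word, 'UTF-8'), "\n";          // 5 (characters)

// Without the u modifier, "." matches one byte; with it, one character.
echo preg_match('/^.{5}$/', $word), "\n";      // 0
echo preg_match('/^.{5}$/u', $word), "\n";     // 1
```

Storing the string and echoing it back works fine either way; only counting, slicing and pattern matching need the multi-byte variants.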
Help with proper character encoding
Total encoding confusion! :-)
The table character set
The MySQL table character set only determines what encoding MySQL should use internally, and thus the range of characters permitted.
- If you set it to Latin-1 (aka ISO 8859-1), you will not be able to store international characters in your table.
- Importantly, the character set does not affect the encoding MySQL uses when communicating with your PHP script.
- The table collation specifies rules for sorting.
The connection character set
The MySQL connection character set determines the encoding you receive table data in (and should send data to MySQL in).
- The encoding is set using SET NAMES, e.g. SET NAMES "utf8".
- If this does not match the table encoding, MySQL automatically converts data on the fly.
- If this does not match your page character set, you'll have to manually perform character set conversion in PHP, using e.g. utf8_encode or mb_convert_encoding.
Page character set
The page character set, specified using the Content-Type header, tells the browser how to interpret the PHP script output.
- As an HTTP header, it is not saved when you save the file from within your browser. The information is thus not available to OpenOffice or other programs.
Recommendations
Ideally, you should use the same encoding in all three places, and ideally, that encoding should be UTF-8.
However, CSV will cause problems, since the file format does not include encoding information. It is thus up to the application to guess the encoding, and as you've seen, the guess will be wrong.
- I don't know about OpenOffice, but Microsoft Office will assume the Windows "ANSI" encoding, which usually means Latin-1 (or CP1252 to be specific).
- Microsoft Office will also cause problems in countries that use "," as a decimal separator, since Office then switches to using ";" as a field separator for CSV-files.
Your best bet is to use Latin-1 for the CSV-file. I'd still use UTF-8 for the table and connection character sets though, and also UTF-8 for HTML pages.
If you use UTF-8 for the connection character set (by executing SET NAMES "utf8" after connecting), you'll need to run the text through utf8_decode to convert to Latin-1.
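As a sketch of that conversion step: utf8_decode is deprecated as of PHP 8.2, so this version uses mb_convert_encoding instead and targets CP1252, which is what Office actually assumes. It assumes the mbstring extension; the function names and the ";" delimiter are my choices for illustration.

```php
<?php
// Convert one UTF-8 value to Windows-1252 for the CSV export.
function to_cp1252(string $utf8): string
{
    return mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
}

// Build a single CSV line in CP1252. ";" is used as the field separator
// because that is what Office expects in locales where "," is the decimal
// separator.
function csv_line_cp1252(array $fields, string $delimiter = ';'): string
{
    $handle = fopen('php://memory', 'w+');
    fputcsv($handle, array_map('to_cp1252', $fields), $delimiter, '"', '\\');
    rewind($handle);
    $line = stream_get_contents($handle);
    fclose($handle);
    return $line;
}

// "Müller" goes in as UTF-8 (c3 bc) and comes out as one byte (fc).
echo bin2hex(csv_line_cp1252(["M\xC3\xBCller", "3,5"])), "\n";
```

Characters that have no CP1252 equivalent cannot survive this conversion, which is the price of the Latin-1 workaround.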
That entity problem
I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."
This sounds like you're passing HTML code in an XML context, and is unrelated to character sets. Try running the text through html_entity_decode.
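A minimal sketch of that, assuming the text is UTF-8 and is being placed into an XML document. The function name is hypothetical: it decodes the HTML entities first, then re-escapes only the characters XML itself requires.

```php
<?php
// Turn HTML-entity-encoded text into text that is safe inside XML.
function html_to_xml_text(string $html): string
{
    // &atilde; and friends become real UTF-8 characters...
    $plain = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // ...and only &, <, > and quotes are escaped again, as XML expects.
    return htmlspecialchars($plain, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo html_to_xml_text('Tom &amp; S&atilde;o Paulo'), "\n"; // Tom &amp; São Paulo
```

XML only predefines five entities (amp, lt, gt, quot, apos), which is why a named HTML entity like Atilde is reported as undeclared.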
Foreign characters turn into garbage in mysql
Either change your document's header to
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
or - better - change your tables' character set to UTF-8. To do that is not entirely trivial, just changing the tables' collation won't do the trick. This SO question might give some pointers.
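As a rough illustration of the second option, this sketch builds the ALTER TABLE ... CONVERT TO CHARACTER SET statement for one table. The function name, the posts table and the default charset and collation are my own picks. It converts correctly only when the stored bytes really match the charset the columns currently declare, so back up first and fix any mis-labelled data beforehand.

```php
<?php
// Hypothetical helper: build the statement that converts a table and all of
// its text columns to a new character set.
function convert_table_sql(
    string $table,
    string $charset = 'utf8mb4',
    string $collation = 'utf8mb4_unicode_ci'
): string {
    // Accept simple identifiers only, since they are interpolated into SQL.
    foreach ([$table, $charset, $collation] as $name) {
        if (!preg_match('/^[A-Za-z0-9_]+$/', $name)) {
            throw new InvalidArgumentException("Unexpected identifier: $name");
        }
    }
    return sprintf(
        'ALTER TABLE `%s` CONVERT TO CHARACTER SET %s COLLATE %s',
        $table,
        $charset,
        $collation
    );
}

echo convert_table_sql('posts'), "\n";
// ALTER TABLE `posts` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
```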