Convert All Types of Smart Quotes with PHP

You need something like this (assuming UTF-8 input, and ignoring CJK (Chinese, Japanese, Korean)):

$chr_map = array(
// Windows codepage 1252
"\xC2\x82" => "'", // U+0082⇒U+201A single low-9 quotation mark
"\xC2\x84" => '"', // U+0084⇒U+201E double low-9 quotation mark
"\xC2\x8B" => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
"\xC2\x91" => "'", // U+0091⇒U+2018 left single quotation mark
"\xC2\x92" => "'", // U+0092⇒U+2019 right single quotation mark
"\xC2\x93" => '"', // U+0093⇒U+201C left double quotation mark
"\xC2\x94" => '"', // U+0094⇒U+201D right double quotation mark
"\xC2\x9B" => "'", // U+009B⇒U+203A single right-pointing angle quotation mark

// Regular Unicode
// (the targets: U+0022 quotation mark (") and U+0027 apostrophe ('))
"\xC2\xAB" => '"', // U+00AB left-pointing double angle quotation mark
"\xC2\xBB" => '"', // U+00BB right-pointing double angle quotation mark
"\xE2\x80\x98" => "'", // U+2018 left single quotation mark
"\xE2\x80\x99" => "'", // U+2019 right single quotation mark
"\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
"\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
"\xE2\x80\x9C" => '"', // U+201C left double quotation mark
"\xE2\x80\x9D" => '"', // U+201D right double quotation mark
"\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
"\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
"\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
"\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
);
$chr = array_keys ($chr_map); // but: for efficiency you should
$rpl = array_values($chr_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, html_entity_decode($str, ENT_QUOTES, "UTF-8"));

Here comes the background:

Every Unicode character belongs to exactly one "General Category", of which the ones that can contain quote characters are the following:

  • Ps "Punctuation, Open"
  • Pe "Punctuation, Close"
  • Pi "Punctuation, Initial quote (may behave like Ps or Pe depending on usage)"
  • Pf "Punctuation, Final quote (may behave like Ps or Pe depending on usage)"
  • Po "Punctuation, Other"

(these pages are handy for checking that you didn't miss anything - there is also an index of categories)

It is sometimes useful to match these categories in a Unicode-enabled regex.
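
As a hedged illustration of such a match in PHP (the sample string and variable names are made up for this sketch), the Pi and Pf categories can be used in a character class together with the /u modifier. Note that this is a blunt tool: it maps single and double quotes alike to one replacement, and the low-9 quotes U+201A and U+201E are in Ps, so this class alone does not catch them.

```php
<?php
// Hypothetical sample input: curly double quotes and guillemets, UTF-8 encoded.
$sample = "\u{201C}a\u{201D} \u{00AB}b\u{00BB}";

// \p{Pi} = initial quote punctuation, \p{Pf} = final quote punctuation.
// The /u modifier makes PCRE treat pattern and subject as UTF-8.
$plain = preg_replace('/[\p{Pi}\p{Pf}]/u', '"', $sample);

// U+201E (double low-9 quotation mark) is in Ps, not in Pi or Pf.
$lowNineIsPs = preg_match('/^\p{Ps}$/u', "\u{201E}");

echo $plain, "\n";
```

For a precise translation (single versus double quotes), the explicit map above remains the safer choice.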

Furthermore, Unicode characters have "properties", of which the one you are interested in is Quotation_Mark. Unfortunately, these are not accessible in a regex.

In Wikipedia you can find the group of characters with the Quotation_Mark property. The authoritative reference is PropList.txt on unicode.org, though this is a plain ASCII text file.

In case you need to translate CJK characters too, you only have to get their code points, decide their translation, and find their UTF-8 encoding, e.g., by looking it up in fileformat.info (e.g., for U+301E: http://www.fileformat.info/info/unicode/char/301e/index.htm).
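
Instead of looking the bytes up by hand, PHP 7.0+ can produce them from the code point with the "\u{...}" escape. The following is only a sketch: the name $cjk_map and the chosen translations are assumptions for illustration, not part of the map above.

```php
<?php
// "\u{...}" escapes (PHP 7.0+) expand to the UTF-8 bytes of a code point.
// The translations chosen here are an assumption; decide your own.
$cjk_map = array(
    "\u{300C}" => '"', // U+300C left corner bracket
    "\u{300D}" => '"', // U+300D right corner bracket
    "\u{301D}" => '"', // U+301D reversed double prime quotation mark
    "\u{301E}" => '"', // U+301E double prime quotation mark
);

// The UTF-8 bytes of U+301E, as you would otherwise look them up.
$bytes = strtoupper(bin2hex("\u{301E}"));

// strtr() accepts such a map directly.
$converted = strtr("\u{300C}text\u{300D}", $cjk_map);

echo $bytes, ' ', $converted, "\n";
```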

Regarding Windows codepage 1252: Unicode defines the first 256 code points to represent exactly the same characters as ISO-8859-1, but ISO-8859-1 is often confused with Windows codepage 1252, so that all browsers render the range 0x80-0x9F, which is "empty" in ISO-8859-1 (more exactly: it contains control characters), as if it were Windows codepage 1252. The table in the Wikipedia page lists the Unicode equivalents.

Note: strtr() is often slower than str_replace(). Time it with your input and your PHP version. If it's fast enough, you can directly use a map like my $chr_map.
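
A rough way to time both, as suggested above. This is a sketch only: $map is a shortened, hypothetical stand-in for $chr_map, the input is synthetic, and the numbers will vary with PHP version and data.

```php
<?php
// Shortened stand-in for $chr_map (hypothetical subset).
$map = array(
    "\xE2\x80\x98" => "'", // U+2018
    "\xE2\x80\x99" => "'", // U+2019
    "\xE2\x80\x9C" => '"', // U+201C
    "\xE2\x80\x9D" => '"', // U+201D
);

// Synthetic input: the same short phrase repeated 1000 times.
$input = str_repeat("\xE2\x80\x9CIt\xE2\x80\x99s\xE2\x80\x9D ", 1000);

$t0 = microtime(true);
$viaStrtr = strtr($input, $map);           // the map is used directly
$t1 = microtime(true);
$viaReplace = str_replace(array_keys($map), array_values($map), $input);
$t2 = microtime(true);

printf("strtr: %.6fs, str_replace: %.6fs\n", $t1 - $t0, $t2 - $t1);
```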


If you are not sure that your input is UTF-8 encoded, AND are willing to assume that if it's not, then it's ISO-8859-1 or Windows codepage 1252, then you can do this before anything else:

if (!preg_match('/^\\X*$/u', $str)) {
    // Not valid UTF-8: treat it as ISO-8859-1. (utf8_encode() is deprecated as of PHP 8.2.)
    $str = utf8_encode($str);
}

Warning: this regex can in very rare cases fail to detect a non-UTF-8 encoding, though. E.g.: "Gruß…"/*CP-1252*/=="Gru\xDF\x85" looks like UTF-8 to this regex (U+07C5 is the N'ko digit 5). This regex can be slightly enhanced, but unfortunately it can be shown that there exists NO completely foolproof solution to the problem of encoding detection.
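
An alternative check, not taken from the answer above, is sketched below. It assumes the mbstring extension is available, and the helper name to_utf8_assuming_cp1252 is hypothetical. It converts from Windows-1252 directly, so the range 0x80-0x9F ends up as regular Unicode code points, and it is subject to the same false positive as the regex.

```php
<?php
// Hypothetical helper: returns $str as UTF-8, assuming that any input
// which is not valid UTF-8 is Windows-1252.
function to_utf8_assuming_cp1252($str)
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // valid UTF-8 (or a false positive, see below)
    }
    return mb_convert_encoding($str, 'UTF-8', 'Windows-1252');
}

// CP-1252 curly quotes (0x93, 0x94) are not valid UTF-8, so they get converted.
$fixed = to_utf8_assuming_cp1252("\x93quoted\x94");

// The example from the warning above still looks like valid UTF-8.
$ambiguous = mb_check_encoding("Gru\xDF\x85", 'UTF-8');
```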


If you want to normalize the range 0x80-0x9F that stems from Windows codepage 1252 to regular Unicode codepoints, you can do this (and remove the first part of the $chr_map above):

$normalization_map = array(
"\xC2\x80" => "\xE2\x82\xAC", // U+20AC Euro sign
"\xC2\x82" => "\xE2\x80\x9A", // U+201A single low-9 quotation mark
"\xC2\x83" => "\xC6\x92", // U+0192 latin small letter f with hook
"\xC2\x84" => "\xE2\x80\x9E", // U+201E double low-9 quotation mark
"\xC2\x85" => "\xE2\x80\xA6", // U+2026 horizontal ellipsis
"\xC2\x86" => "\xE2\x80\xA0", // U+2020 dagger
"\xC2\x87" => "\xE2\x80\xA1", // U+2021 double dagger
"\xC2\x88" => "\xCB\x86", // U+02C6 modifier letter circumflex accent
"\xC2\x89" => "\xE2\x80\xB0", // U+2030 per mille sign
"\xC2\x8A" => "\xC5\xA0", // U+0160 latin capital letter s with caron
"\xC2\x8B" => "\xE2\x80\xB9", // U+2039 single left-pointing angle quotation mark
"\xC2\x8C" => "\xC5\x92", // U+0152 latin capital ligature oe
"\xC2\x8E" => "\xC5\xBD", // U+017D latin capital letter z with caron
"\xC2\x91" => "\xE2\x80\x98", // U+2018 left single quotation mark
"\xC2\x92" => "\xE2\x80\x99", // U+2019 right single quotation mark
"\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quotation mark
"\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quotation mark
"\xC2\x95" => "\xE2\x80\xA2", // U+2022 bullet
"\xC2\x96" => "\xE2\x80\x93", // U+2013 en dash
"\xC2\x97" => "\xE2\x80\x94", // U+2014 em dash
"\xC2\x98" => "\xCB\x9C", // U+02DC small tilde
"\xC2\x99" => "\xE2\x84\xA2", // U+2122 trade mark sign
"\xC2\x9A" => "\xC5\xA1", // U+0161 latin small letter s with caron
"\xC2\x9B" => "\xE2\x80\xBA", // U+203A single right-pointing angle quotation mark
"\xC2\x9C" => "\xC5\x93", // U+0153 latin small ligature oe
"\xC2\x9E" => "\xC5\xBE", // U+017E latin small letter z with caron
"\xC2\x9F" => "\xC5\xB8", // U+0178 latin capital letter y with diaeresis
);
$chr = array_keys ($normalization_map); // but: for efficiency you should
$rpl = array_values($normalization_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, $str);

Can't replace a smart quote in POST request with PHP

Upon testing (and I had my doubts about it being an encoding issue; I accidentally deleted my comment about that), I was able to find out why your code is failing.

It's because your file's encoding may be set to UTF-8 without BOM.

If that is the case, change it to be with BOM (byte order mark) and it will work as expected.

Reference:

  • https://en.wikipedia.org/wiki/Byte_order_mark

Note:

Saving the file in ANSI encoding also replaced the curly quote with a regular quote, so you have a choice: ANSI, or UTF-8 with BOM.

You can use an editor such as Notepad++ for this.

  • https://notepad-plus-plus.org/

From the dropdown menu, you would choose:

  • Encoding, Convert to UTF-8 with BOM, then save.
  • Or, Encoding, Convert to ANSI, then save.
  • The choice is yours.

Important sidenote: Do not choose "Encode in...", because that will not convert your file once you save it. You must choose "Convert to".

There are other code editors out there that you can use which will give you the same result.
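
One way to sidestep the dependency on how the script file is saved, offered here as an alternative and not as part of the answer above, is to write the search characters as byte escapes so that the source stays pure ASCII. The sketch assumes the POST data arrives as UTF-8; $input is a stand-in for a value from $_POST.

```php
<?php
// Stand-in for a UTF-8 value received via $_POST.
$input = "\xE2\x80\x9CHello\xE2\x80\x9D";

// U+201C and U+201D written as UTF-8 byte escapes, so the encoding
// of this script file does not affect what is searched for.
$search = array("\xE2\x80\x9C", "\xE2\x80\x9D");
$output = str_replace($search, '"', $input);

echo $output, "\n";
```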

How to replace Microsoft-encoded quotes in PHP

Considering you only want to replace a few specific and well identified characters, I would go for str_replace with an array: you obviously don't need the heavy artillery regex will bring you ;-)

And if you encounter other special characters (damn copy-paste from Microsoft Word...), you can just add them to that array whenever necessary, as they are identified.



The best answer I can give to your comment is probably this link: Convert Smart Quotes with PHP

And the associated code (quoting that page):

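function convert_smart_quotes($string)
{
    // Windows-1252 byte values (single bytes, not UTF-8 sequences)
    $search = array(chr(145),  // left single quotation mark
                    chr(146),  // right single quotation mark
                    chr(147),  // left double quotation mark
                    chr(148),  // right double quotation mark
                    chr(151)); // em dash

    $replace = array("'",
                     "'",
                     '"',
                     '"',
                     '-');

    return str_replace($search, $replace, $string);
}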

(I don't have Microsoft Word on this computer, so I can't test by myself)

I don't remember exactly what we used at work (I was not the one having to deal with that kind of input), but it was the same kind of stuff...
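
A short check of what this function does and does not handle, as a sketch with made-up inputs. The byte values 145-151 only mean "smart quote" in single-byte Windows-1252 input. On UTF-8 input the curly quotes are left alone, and a byte such as 0x94 that happens to be part of another character (the em dash is E2 80 94 in UTF-8) gets replaced, which corrupts the string.

```php
<?php
// Same function as above, condensed.
function convert_smart_quotes($string)
{
    $search  = array(chr(145), chr(146), chr(147), chr(148), chr(151));
    $replace = array("'", "'", '"', '"', '-');
    return str_replace($search, $replace, $string);
}

// Windows-1252 input: the curly quotes are the single bytes 0x93 and 0x94.
$cp1252 = convert_smart_quotes("\x93Hi\x94");

// UTF-8 input: U+201C / U+201D are E2 80 9C / E2 80 9D; none of these
// bytes is in the search list, so the string comes back unchanged.
$utf8 = convert_smart_quotes("\xE2\x80\x9CHi\xE2\x80\x9D");

// UTF-8 em dash (E2 80 94): the trailing byte 0x94 is replaced by '"',
// leaving an invalid UTF-8 sequence.
$dash = convert_smart_quotes("\xE2\x80\x94");
```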

How do I convert Word smart quotes and em dashes in a string?

This sounds like a Unicode issue. Joel Spolsky has a good jumping off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html

how to convert this to htmlcharacters: “ and ”

This answer comes from http://shiflett.org/blog/2005/oct/convert-smart-quotes-with-php, which was posted as a comment by Tomalak Geret'kal:

The trick is to use the ascii character position to find the characters that we are targeting, and then convert them with the ones we can work with. Eg:

<?php
function convert_smart_quotes($string)
{
$search = array(chr(145),
chr(146),
chr(147),
chr(148),
chr(151));

$replace = array("'",
"'",
'"',
'"',
'-');

return str_replace($search, $replace, $string);
}
?>

smart quotes not converting properly into UTF8

If your XML string (i.e. file contents) is not encoded as UTF-8, you need an XML declaration that denotes the file encoding. If an XML declaration is missing, the parser will assume UTF-8.

As long as you do not use "special" characters (i.e. anything outside of the ASCII range), it will work without a declaration even if your file is not really UTF-8-encoded. This is because UTF-8 is byte-compatible to ASCII. But as soon as characters are used that are on one of the code pages — like the "smart quotes" — it will break because these are represented by different bytes in UTF-8.

In your case there are text files in a legacy encoding that you wrap with a root element to turn them into well-formed XML. Therefore you need to add the XML declaration yourself:

'<?xml version="1.0" encoding="Windows-1252"?><article>'.file_get_contents($xmlfile).'</article>'

This way you instruct the DOMDocument how to interpret the bytes in your string. I assumed Windows-1252 for you because you said ANSI and mentioned the curly quotes.

In fact, 95% of the time this is what people really mean, even on Linux and even if they say ISO-8859-1 (or latin-1), which is almost, but not exactly the same thing.

To be extra sure you can open your text files in a hex editor, spot a few special characters and compare their byte values with the suspected encoding. For Windows-1252, the expected byte values for the curly quotes would be:

  • 147 (0x93)
  • 148 (0x94)

Once the meaning of the individual bytes in your string is declared, DOMDocument can make sense of them and does the right thing.
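
A minimal sketch of the effect, assuming the DOM extension is loaded and that the underlying libxml2 build supports Windows-1252 (typically via iconv). $legacy is a stand-in for file_get_contents($xmlfile) and is assumed to contain no markup characters. The declaration here includes version="1.0", which the XML specification requires.

```php
<?php
// Stand-in for the legacy file contents: Windows-1252 bytes with curly quotes.
$legacy = "He said \x93hello\x94";

$xml = '<?xml version="1.0" encoding="Windows-1252"?><article>' . $legacy . '</article>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// DOM hands strings back as UTF-8, whatever the declared input encoding was.
$text = $doc->documentElement->textContent;

echo bin2hex($text), "\n";
```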

As for what ends up in the DB, I strongly suspect there is some automagic encoding conversion going on. I admit that I don't know enough about PHP/MySQL/Unicode integration to say for sure.

Can I use iconv to convert multi-byte smart quotes to extended ASCII smart quotes?

You're looking for CP-1252 which contains "curly quotes" at 0x91-0x94 (145-148).

$content = iconv("UTF-8", "cp1252//TRANSLIT", $content);
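
To confirm the result, you can inspect the bytes. This is a sketch with a made-up input string; it assumes an iconv implementation that knows cp1252 and supports //TRANSLIT (glibc does, others may not).

```php
<?php
// UTF-8 input with U+201C and U+201D around a plain ASCII word.
$content = "\u{201C}quoted\u{201D}";

$cp1252 = iconv("UTF-8", "cp1252//TRANSLIT", $content);

// The curly quotes should now be the single bytes 0x93 and 0x94.
echo bin2hex($cp1252), "\n";
```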

With PHP and MySQL, how do I properly write smart quotes to the database?

First, make sure your MySQL table is using UTF-8 as its encoding. If it is, it will look like this:

mysql> SHOW CREATE TABLE Users;
CREATE TABLE `Users` (
...
) ENGINE=InnoDB AUTO_INCREMENT=30 DEFAULT CHARSET=utf8 |

Next, make sure your HTML page is set to display UTF-8:

<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
</head>
....
</html>

Then it should work.


EDIT: I purposefully did not talk about collation, because I thought it was already considered, but for the benefit of everyone, let me add some more to this answer.

You state,

I have the charset set to UTF-8 … in the MySQL table collation.

Table collation is not the same thing as charset.

Collation is the act of automagically trying to convert one charset to another FOR THE PURPOSES OF QUERYING. E.g., if you have a charset of latin1 and a collation of UTF-8, and you run something like SELECT * FROM foo WHERE bar LIKE '%—%'; (UTF-8 U+2014) on that table, it will match either L+0151 or U+2014.

Not so coincidentally... if you were to output this latin1 encoded character onto a UTF-8 encoded web page, you would get the following:

This is a “testâ€.

That seems to be the output of your problem, exactly. Here's the HTML to duplicate it:

<?php
$string = "This is a “test”.";
?>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf8"/>
</head>
<body>
<p><?php echo $string; ?></p>
</body>
</html>

Make sure you save this file in latin1...
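
The same kind of garbling can be reproduced without a browser. As a sketch (assuming the mbstring extension; the variable names are made up), reading the UTF-8 bytes of U+201C as if they were Windows-1252 yields the familiar three-character sequence in place of the opening quote:

```php
<?php
$utf8Quote = "\u{201C}"; // UTF-8 bytes E2 80 9C

// Misinterpret those three bytes as Windows-1252 characters
// and re-encode the result as UTF-8.
$garbled = mb_convert_encoding($utf8Quote, 'UTF-8', 'Windows-1252');

echo $garbled, "\n";
```

Whenever you see this pattern, some step in the chain is decoding bytes with a different charset than the one they were written in.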

To see what charset your table is set to, run this query:

SELECT CCSA.character_set_name, TABLE_COLLATION FROM information_schema.`TABLES` T,
information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA
WHERE CCSA.collation_name = T.table_collation
AND T.table_schema = "database"
AND T.table_name = "table";

The only proper result for your uses (unless you're using multiple non-English languages) is:

+--------------------+-----------------+
| character_set_name | TABLE_COLLATION |
+--------------------+-----------------+
| utf8               | utf8_general_ci |
+--------------------+-----------------+

Thanks for the upvotes ;-)


