How to Convert Word Smart Quotes and Em Dashes in a String

How do I convert Word smart quotes and em dashes in a string?

This sounds like a Unicode issue. Joel Spolsky has a good jumping off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html

Is there a category or name for characters like smart quotes and that dash that always breaks?

Unicode has room for 1,114,112 code points (U+0000 through U+10FFFF), most of them still unassigned. My US-standard keyboard makes those that fall between 1 and 127 (base 10) reasonably easy to access.

When you venture beyond that range, you get into either old-style locale-specific encodings or the more modern UTF-8 (or another Unicode encoding). Many of these code points are easily accessible from a keyboard somewhere in the world. But from the comfort of your own home or office, only a fairly small subset of those 1.1 million is easily accessible from your keyboard.

There is a Unicode property called QMark (the short name), or Quotation_Mark (the long name), that includes 29 quotation-style code points (given here as code point values in hex, not as UTF-8 bytes): 0x0022, 0x0027, 0x00ab, 0x00bb, 0x2018, 0x2019, 0x201a, 0x201b, 0x201c, 0x201d, 0x201e, 0x201f, 0x2039, 0x203a, 0x300c, 0x300d, 0x300e, 0x300f, 0x301d, 0x301e, 0x301f, 0xfe41, 0xfe42, 0xfe43, 0xfe44, 0xff02, 0xff07, 0xff62, and 0xff63.
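If you want to inspect those code points yourself, here is a minimal Python sketch (Python is my choice for illustration; the answer itself used Perl). As far as I know, Python's standard unicodedata module has no lookup for the Quotation_Mark property, so the list is simply hard-coded from the paragraph above. The name QMARK_CODE_POINTS is made up for this example.

```python
import unicodedata

# The 29 Quotation_Mark code points listed above, hard-coded because the
# standard library offers no lookup for this property (assumption: you are
# happy to maintain the list by hand).
QMARK_CODE_POINTS = [
    0x0022, 0x0027, 0x00AB, 0x00BB, 0x2018, 0x2019, 0x201A, 0x201B,
    0x201C, 0x201D, 0x201E, 0x201F, 0x2039, 0x203A, 0x300C, 0x300D,
    0x300E, 0x300F, 0x301D, 0x301E, 0x301F, 0xFE41, 0xFE42, 0xFE43,
    0xFE44, 0xFF02, 0xFF07, 0xFF62, 0xFF63,
]

# Print code point, glyph, and official character name for each entry.
for cp in QMARK_CODE_POINTS:
    print(f"U+{cp:04X}  {chr(cp)}  {unicodedata.name(chr(cp))}")
```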

Here's how they look (assuming your fonts support them all):

"'«»‘’‚‛“”„‟‹›「」『』〝〞〟﹁﹂﹃﹄"'「」

There happens to be a Unicode property ASCII, which not surprisingly contains 128 code points between 0 and 127.

I can't seem to find a Unicode property that specifies "Everything that is not ASCII", but you will know it by virtue of the fact that it falls outside of the 0 .. 127 range.
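The "outside of 0 .. 127" test is easy to apply directly. A small Python sketch, with a hypothetical helper name, that reports every non-ASCII character in a string along with its position and code point:

```python
def non_ascii_chars(text):
    """Return (index, character, code point) for every character above U+007F."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# Sample text with curly double quotes and an em dash, written as escapes.
sample = "He said \u201chello\u201d \u2014 twice."
for index, ch, code_point in non_ascii_chars(sample):
    print(index, ch, code_point)
```

This is handy as a first diagnostic: it tells you exactly which "pesky" characters a given input contains before you decide how to replace them.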

There is also a Hyphen Unicode property that contains eleven code points: 0x002d, 0x00ad, 0x058a, 0x1806, 0x2010, 0x2011, 0x2e17, 0x30fb, 0xfe63, 0xff0d, and 0xff65. I'm reluctant to paste them all here, as at least two of them don't render in my terminal. But here goes:

-­֊᠆‐‑⸗・﹣-・

As you can see, some are indistinguishable from others. When I use the Hyphen property in Perl 5.16 I get a warning that the particular Unicode property is deprecated. I don't know if that's just for Perl, or if it's for Unicode in general.
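Since the glyphs are hard to tell apart, printing the official character names is more reliable than looking at them. A short Python sketch (again with the list hard-coded from the paragraph above, and an invented variable name):

```python
import unicodedata

# The eleven code points of the Hyphen property, as listed above.
HYPHEN_CODE_POINTS = [
    0x002D, 0x00AD, 0x058A, 0x1806, 0x2010, 0x2011,
    0x2E17, 0x30FB, 0xFE63, 0xFF0D, 0xFF65,
]

# The names distinguish characters that may render identically.
for cp in HYPHEN_CODE_POINTS:
    print(f"U+{cp:04X}  {unicodedata.name(chr(cp))}")
```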

There is also a Dash property containing 27 code points, and another named Dash_Punctuation with 23 code points. I think you get the idea, so I won't enumerate them here. Note that a code point can have more than one Unicode property, so it's possible that there is overlap between Hyphen and Dash, and probably even more overlap between Dash and Dash_Punctuation -- I don't know and haven't checked.

I know this isn't a Perl-centric question by any means, but I've found that Perl has pretty good documentation of the Unicode properties here: perldoc perluniprops.

So I guess the short answer to the question, "Are there more?" is yes, there are about 1.1 million more.

Update: Regarding what these pesky characters are called.... You sort of have to differentiate between code points and glyphs. A code point is the unambiguous representation of a Unicode entity, whereas the glyph is what it looks like. Different fonts may implement a given glyph differently from each other. So what looks the same in one font may look a little different in another. Start thinking of Unicode code points, and their associated full names as having semantic meaning, whereas glyphs are simple graphical (unreliable) representations.

Update 2: In some programming languages (specifically Perl, but possibly others) you may create custom character classes using set logic. In Perl, these are referred to as Extended Bracketed Character Classes, and are discussed in perldoc perlrecharclass. If you wanted to match all quotes that are not within the ASCII range, you could use this subexpression:

(?[\p{QMark}-\p{ASCII}])

The subexpression above creates a character class that matches all quote-like marks excluding those in the ASCII range. This feature was introduced in Perl 5.18. Given that this "Update 2" was added in 2019, and Perl 5.18 was released in 2013, the feature has been available for roughly six years. Unfortunately, I find no indication that it has found its way into the PCRE libraries outside of Perl.

Though it has been around for about six years already, this feature (as of Perl 5.28) is still marked 'experimental'. Therefore, to use it you should add the following pragma in the scope where it is used:

no warnings qw(experimental::regex_sets);

That will squelch the experimental warning. I would not be surprised to see that warning lifted in a near-future release of Perl.
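For languages without set logic in character classes, the same "QMark minus ASCII" class can be spelled out by hand. Here is a rough Python equivalent. Python's built-in re module has no \p{...} support as far as I know, so the class below is written out from the 29 code points listed earlier, minus U+0022 and U+0027. The name NON_ASCII_QUOTE_RE is illustrative.

```python
import re

# Hand-written equivalent of (?[\p{QMark}-\p{ASCII}]): the 27 non-ASCII
# Quotation_Mark code points, expressed as single characters and ranges.
NON_ASCII_QUOTE_RE = re.compile(
    "[\u00ab\u00bb\u2018-\u201f\u2039\u203a"
    "\u300c-\u300f\u301d-\u301f\ufe41-\ufe44"
    "\uff02\uff07\uff62\uff63]"
)

text = 'plain "quotes" and \u201ccurly\u201d and \u00abangle\u00bb'
print(NON_ASCII_QUOTE_RE.findall(text))
```

The ASCII quotes in the sample are left alone; only the curly and angle quotes match.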

How do I read and write smart quotes (and other silly characters) in C#

TL;DR: that file is definitely not UTF-8, and you are not even using UTF-8 to view the resulting file. Read as Windows-1252, write as Windows-1252 (if you are going to use the same viewing method to view the resulting file).


Well, let's first just say that there is no way a file made by a regular user will be in UTF-8. Not all programs on Windows even support it (Excel, Notepad...), let alone have it as the default encoding (even most developer tools don't default to UTF-8, which drives me insane). Since a lot of developers don't understand that such a thing as encoding even exists, what chance do regular users have of saving their files as UTF-8 in such a hostile environment?

This is where your problems first start. According to documentation, the overload you are using File.ReadAllText(filePath); can only detect UTF-8 or UTF-32.

Indeed, simply reading a Windows-1252 file that contains "a”a" results in the string "a�a", where � is the Unicode replacement character (U+FFFD), used to replace invalid bytes. (Read the Wikipedia section on it; it describes exactly the situation you are in!) When the replacement character is then encoded as UTF-8 and interpreted as Windows-1252, you see ï¿½, because the bytes for � in UTF-8 are 0xEF, 0xBF, 0xBD, which are the bytes for ï, ¿, and ½ in Windows-1252.
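The whole chain can be reproduced in a few lines. This is a Python sketch rather than C#, and it only simulates the decode/encode steps in memory; the variable names are made up for the illustration.

```python
# A right double quotation mark (U+201D) between two letters,
# stored as Windows-1252: the quote becomes the single byte 0x94.
original_bytes = "a\u201da".encode("cp1252")

# Step 1: the bytes are wrongly decoded as UTF-8. 0x94 is not valid there,
# so the decoder substitutes U+FFFD, the replacement character.
misread = original_bytes.decode("utf-8", errors="replace")

# Step 2: the damaged string is written back out as UTF-8.
rewritten = misread.encode("utf-8")

# Step 3: a viewer that assumes Windows-1252 shows three characters
# for the three UTF-8 bytes of U+FFFD.
viewed = rewritten.decode("cp1252")

print(original_bytes, repr(misread), rewritten, viewed)
```

Note that after step 1 the original quote is gone for good: U+FFFD carries no information about which byte it replaced.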

So read it as Windows-1252 and you're half-way there:

Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
String result = File.ReadAllText(@"C:\myfile.txt", windows1252);
Console.WriteLine(result); //Correctly prints "a”a" now

Because you saw ï¿½, the tool you are viewing the newly made file with is also using Windows-1252. So if the goal is to have the file show correct characters in that tool, you must encode the output as Windows-1252:

Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
File.WriteAllText(@"C:\myFile", sb.ToString(), windows1252);
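For comparison, the same read-as-1252, write-as-1252 round trip looks like this in Python. It is only a sketch: the temporary file stands in for the real input, and "cp1252" is Python's codec name for Windows-1252.

```python
import os
import tempfile

# Create a stand-in input file containing a, 0x94 (right double quote
# in Windows-1252), a.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "wb") as f:
    f.write(b"a\x94a")

# Read with the encoding the file was actually saved in.
with open(path, encoding="cp1252") as f:
    result = f.read()

# Write with the encoding the viewing tool expects.
out_path = path + ".out"
with open(out_path, "w", encoding="cp1252", newline="") as f:
    f.write(result)
```

If the viewer can handle UTF-8, writing UTF-8 instead is usually the better long-term choice; the point here is only that the reader and the writer must each agree with the bytes on disk.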

replace MSWord smart quotes in asp.net webform

These smart quotes are ordinary Unicode code points (U+201C and U+201D for the double quotes). All you need is a simple String.Replace to sort them out.

-edit- Something like:

mystring.Replace("\u201C","\"").Replace("\u201D","\"")
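The same idea extends to single quotes and dashes. A Python sketch using a translation table; the choice of replacements ("--" for an em dash, "..." for an ellipsis) is my own assumption, so adjust it to taste. The names SMART_TO_ASCII and to_plain_ascii_punctuation are hypothetical.

```python
# Map common Word "smart" punctuation to plain ASCII stand-ins.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201a": "'",    # single low-9 quotation mark
    "\u201b": "'",    # single high-reversed-9 quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u201e": '"',    # double low-9 quotation mark
    "\u201f": '"',    # double high-reversed-9 quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash (replacement is a matter of taste)
    "\u2026": "...",  # horizontal ellipsis
})


def to_plain_ascii_punctuation(text):
    """Replace smart quotes, dashes, and ellipses in a single pass."""
    return text.translate(SMART_TO_ASCII)


print(to_plain_ascii_punctuation("\u201cIt\u2019s here\u201d \u2014 ok\u2026"))
```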

Convert all types of smart quotes with PHP

You need something like this (assuming UTF-8 input, and ignoring CJK (Chinese, Japanese, Korean)):

$chr_map = array(
// Windows codepage 1252
"\xC2\x82" => "'", // U+0082⇒U+201A single low-9 quotation mark
"\xC2\x84" => '"', // U+0084⇒U+201E double low-9 quotation mark
"\xC2\x8B" => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
"\xC2\x91" => "'", // U+0091⇒U+2018 left single quotation mark
"\xC2\x92" => "'", // U+0092⇒U+2019 right single quotation mark
"\xC2\x93" => '"', // U+0093⇒U+201C left double quotation mark
"\xC2\x94" => '"', // U+0094⇒U+201D right double quotation mark
"\xC2\x9B" => "'", // U+009B⇒U+203A single right-pointing angle quotation mark

// Regular Unicode
// (not mapped: U+0022 quotation mark (") and U+0027 apostrophe ('), which are already ASCII)
"\xC2\xAB" => '"', // U+00AB left-pointing double angle quotation mark
"\xC2\xBB" => '"', // U+00BB right-pointing double angle quotation mark
"\xE2\x80\x98" => "'", // U+2018 left single quotation mark
"\xE2\x80\x99" => "'", // U+2019 right single quotation mark
"\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
"\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
"\xE2\x80\x9C" => '"', // U+201C left double quotation mark
"\xE2\x80\x9D" => '"', // U+201D right double quotation mark
"\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
"\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
"\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
"\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
);
$chr = array_keys ($chr_map); // but: for efficiency you should
$rpl = array_values($chr_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, html_entity_decode($str, ENT_QUOTES, "UTF-8"));

Here comes the background:

Every Unicode character belongs to exactly one "General Category", of which the ones that can contain quote characters are the following:

  • Ps "Punctuation, Open"
  • Pe "Punctuation, Close"
  • Pi "Punctuation, Initial quote (may behave like Ps or Pe depending on usage)"
  • Pf "Punctuation, Final quote (may behave like Ps or Pe depending on usage)"
  • Po "Punctuation, Other"

(these pages are handy for checking that you didn't miss anything - there is also an index of categories)

It is sometimes useful to match these categories in a Unicode-enabled regex.
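To see which General Category a given quote character falls into, you can ask a Unicode database directly. A small Python sketch using the standard unicodedata module (the helper name is invented; Python's built-in re module does not support category escapes as far as I know, so this checks characters one at a time):

```python
import unicodedata

# The five categories named above that can contain quote characters.
QUOTE_CATEGORIES = {"Ps", "Pe", "Pi", "Pf", "Po"}


def in_quote_category(ch):
    """True if ch belongs to a category that can contain quote characters.

    Many non-quote characters (brackets, commas, ...) also live in these
    categories, so this is a necessary condition, not a sufficient one.
    """
    return unicodedata.category(ch) in QUOTE_CATEGORIES


for ch in '"\u00ab\u2018\u2019\u201a\u201c\u201d\u300c\u300d':
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")
```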

Furthermore, Unicode characters have "properties", of which the one you are interested in is Quotation_Mark. Unfortunately, PHP's PCRE-based regex functions have traditionally not exposed these properties (Perl's regex engine, by contrast, does).

In Wikipedia you can find the group of characters with the Quotation_Mark property. The authoritative reference is PropList.txt on unicode.org, but that is a plain ASCII text file, so it lists code points and names without showing the characters themselves.

In case you need to translate CJK characters too, you only have to get their code points, decide their translation, and find their UTF-8 encoding, e.g., by looking it up in fileformat.info (e.g., for U+301E: http://www.fileformat.info/info/unicode/char/301e/index.htm).

Regarding Windows codepage 1252: Unicode defines the first 256 code points to represent exactly the same characters as ISO-8859-1, but ISO-8859-1 is often confused with Windows codepage 1252, so that all browsers render the range 0x80-0x9F, which is "empty" in ISO-8859-1 (more exactly: it contains control characters), as if it were Windows codepage 1252. The table in the Wikipedia page lists the Unicode equivalents.

Note: strtr() is often slower than str_replace(). Time it with your input and your PHP version. If it's fast enough, you can directly use a map like my $chr_map.


If you are not sure that your input is UTF-8 encoded, AND are willing to assume that if it's not, then it's ISO-8859-1 or Windows codepage 1252, then you can do this before anything else:

if ( !preg_match('/^\\X*$/u', $str)) {
$str = utf8_encode($str);
}

Warning: this regex can in very rare cases fail to detect a non-UTF-8 encoding, though. E.g.: "Gruß…"/*CP-1252*/=="Gru\xDF\x85" looks like UTF-8 to this regex (U+07C5 is the N'ko digit 5). This regex can be slightly enhanced, but unfortunately it can be shown that there exists NO completely foolproof solution to the problem of encoding detection.
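The same "try UTF-8 first, otherwise assume Windows-1252" heuristic, and the false positive from the warning above, can be sketched in Python as follows. The function name is hypothetical, and the fallback to Windows-1252 is exactly the assumption stated earlier, not a general-purpose detector.

```python
def decode_utf8_or_cp1252(data: bytes) -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, else as Windows-1252."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes that Windows-1252 leaves unassigned become U+FFFD.
        return data.decode("cp1252", errors="replace")


# Invalid as UTF-8 (0xDF starts a two-byte sequence that never finishes),
# so the fallback kicks in and the sharp s is recovered.
print(decode_utf8_or_cp1252(b"Gru\xdf"))

# The false positive: 0xDF 0x85 happens to be a valid UTF-8 sequence
# (U+07C5), so the Windows-1252 text is misdetected as UTF-8.
print(repr(decode_utf8_or_cp1252(b"Gru\xdf\x85")))
```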


If you want to normalize the range 0x80-0x9F that stems from Windows codepage 1252 to regular Unicode codepoints, you can do this (and remove the first part of the $chr_map above):

$normalization_map = array(
"\xC2\x80" => "\xE2\x82\xAC", // U+20AC Euro sign
"\xC2\x82" => "\xE2\x80\x9A", // U+201A single low-9 quotation mark
"\xC2\x83" => "\xC6\x92", // U+0192 latin small letter f with hook
"\xC2\x84" => "\xE2\x80\x9E", // U+201E double low-9 quotation mark
"\xC2\x85" => "\xE2\x80\xA6", // U+2026 horizontal ellipsis
"\xC2\x86" => "\xE2\x80\xA0", // U+2020 dagger
"\xC2\x87" => "\xE2\x80\xA1", // U+2021 double dagger
"\xC2\x88" => "\xCB\x86", // U+02C6 modifier letter circumflex accent
"\xC2\x89" => "\xE2\x80\xB0", // U+2030 per mille sign
"\xC2\x8A" => "\xC5\xA0", // U+0160 latin capital letter s with caron
"\xC2\x8B" => "\xE2\x80\xB9", // U+2039 single left-pointing angle quotation mark
"\xC2\x8C" => "\xC5\x92", // U+0152 latin capital ligature oe
"\xC2\x8E" => "\xC5\xBD", // U+017D latin capital letter z with caron
"\xC2\x91" => "\xE2\x80\x98", // U+2018 left single quotation mark
"\xC2\x92" => "\xE2\x80\x99", // U+2019 right single quotation mark
"\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quotation mark
"\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quotation mark
"\xC2\x95" => "\xE2\x80\xA2", // U+2022 bullet
"\xC2\x96" => "\xE2\x80\x93", // U+2013 en dash
"\xC2\x97" => "\xE2\x80\x94", // U+2014 em dash
"\xC2\x98" => "\xCB\x9C", // U+02DC small tilde
"\xC2\x99" => "\xE2\x84\xA2", // U+2122 trade mark sign
"\xC2\x9A" => "\xC5\xA1", // U+0161 latin small letter s with caron
"\xC2\x9B" => "\xE2\x80\xBA", // U+203A single right-pointing angle quotation mark
"\xC2\x9C" => "\xC5\x93", // U+0153 latin small ligature oe
"\xC2\x9E" => "\xC5\xBE", // U+017E latin small letter z with caron
"\xC2\x9F" => "\xC5\xB8", // U+0178 latin capital letter y with diaeresis
);
$chr = array_keys ($normalization_map); // but: for efficiency you should
$rpl = array_values($normalization_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, $str);
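The same normalization can be expressed without a hand-written table in a language whose standard library ships a Windows-1252 codec. A Python sketch, operating on an already-decoded string in which the range U+0080 to U+009F is assumed to stem from mislabelled Windows-1252 input; the function name is made up.

```python
def normalize_c1_controls(text: str) -> str:
    """Reinterpret U+0080..U+009F as the Windows-1252 characters they
    most likely stand for, leaving everything else untouched."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in
                # Python's cp1252 codec; keep the character as it is.
                pass
        out.append(ch)
    return "".join(out)


print(normalize_c1_controls("\x93quoted\x94 \x97 done"))
```

The five unassigned positions are the same ones that are missing from the PHP map above.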

How to replace Microsoft-encoded quotes in PHP

Considering you only want to replace a few specific and well identified characters, I would go for str_replace with an array: you obviously don't need the heavy artillery regex will bring you ;-)

And if you encounter other special characters (damn copy-paste from Microsoft Word...), you can just add them to that array whenever necessary, as they are identified.



The best answer I can give to your comment is probably this link: Convert Smart Quotes with PHP

And the associated code (quoting that page):

function convert_smart_quotes($string) 
{
$search = array(chr(145),
chr(146),
chr(147),
chr(148),
chr(151));

$replace = array("'",
"'",
'"',
'"',
'-');

return str_replace($search, $replace, $string);
}
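Note that chr(145) through chr(151) are single bytes, so this function only makes sense for input that is still Windows-1252 encoded. A Python sketch of the same byte-level replacement makes that explicit by working on bytes rather than text (the table name is invented):

```python
# Windows-1252 bytes 145-148 are the curly single and double quotes,
# and 151 is the em dash; map each to an ASCII stand-in.
_SMART_BYTES_TABLE = bytes.maketrans(
    bytes([145, 146, 147, 148, 151]),
    b"''\"\"-",
)


def convert_smart_quotes(data: bytes) -> bytes:
    """Replace Windows-1252 smart quotes and em dashes in raw bytes."""
    return data.translate(_SMART_BYTES_TABLE)


print(convert_smart_quotes(b"\x93hi\x94 \x97 it\x92s"))
```

Do not run this on UTF-8 data: the same byte values occur there as parts of multi-byte sequences, so unrelated characters would be corrupted.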

(I don't have Microsoft Word on this computer, so I can't test by myself)

I don't remember exactly what we used at work (I was not the one having to deal with that kind of input), but it was the same kind of stuff...


