Text Mining R Package & Regex to Handle Replace Smart Curly Quotes

Use two gsub operations: 1) to replace double curly quotes, 2) to replace single quotes:

> gsub("[“”]", "\"", gsub("[‘’]", "'", text))
[1] "You don't get \"your\" money's worth"

See the online R demo. Tested in both Linux and Windows, and works the same.

The [“”] construct is a positive character class that matches any single char defined in the class.
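For readers working in Python rather than R, the same two-step replacement can be sketched with re.sub. This is only an illustrative analogue of the gsub call above; the function name straighten_quotes is made up for the example.

```python
import re


def straighten_quotes(text: str) -> str:
    """Replace curly single and double quotes with their ASCII counterparts."""
    # Step 1: left/right single curly quotes (U+2018, U+2019) -> '
    text = re.sub("[\u2018\u2019]", "'", text)
    # Step 2: left/right double curly quotes (U+201C, U+201D) -> "
    return re.sub("[\u201C\u201D]", '"', text)


print(straighten_quotes("You don\u2019t get \u201Cyour\u201D money\u2019s worth"))
# You don't get "your" money's worth
```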

To normalize all chars similar to double quotes, you might want to use

> sngl_quot_rx = "[ʻʼʽ٬‘’‚‛՚︐]"
> dbl_quot_rx = "[«»״“”„‟≪≫《》〝〞〟＂″‶]"
> res = gsub(dbl_quot_rx, "\"", gsub(sngl_quot_rx, "'", `Encoding<-`(text, "UTF-8")))
> cat(res, sep="\n")
You don't get "your" money's worth

Here, [«»״“”„‟≪≫《》〝〞〟＂″‶] matches

«   00AB  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
»   00BB  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
״   05F4  HEBREW PUNCTUATION GERSHAYIM
“   201C  LEFT DOUBLE QUOTATION MARK
”   201D  RIGHT DOUBLE QUOTATION MARK
„   201E  DOUBLE LOW-9 QUOTATION MARK
‟   201F  DOUBLE HIGH-REVERSED-9 QUOTATION MARK
≪  226A  MUCH LESS-THAN
≫  226B  MUCH GREATER-THAN
《  300A  LEFT DOUBLE ANGLE BRACKET
》  300B  RIGHT DOUBLE ANGLE BRACKET
〝  301D  REVERSED DOUBLE PRIME QUOTATION MARK
〞  301E  DOUBLE PRIME QUOTATION MARK
〟  301F  LOW DOUBLE PRIME QUOTATION MARK
＂  FF02  FULLWIDTH QUOTATION MARK
″   2033  DOUBLE PRIME
‶   2036  REVERSED DOUBLE PRIME

The [ʻʼʽ٬‘’‚‛՚︐] is used to normalize some chars similar to single quotes:

ʻ  02BB  MODIFIER LETTER TURNED COMMA
ʼ  02BC  MODIFIER LETTER APOSTROPHE
ʽ  02BD  MODIFIER LETTER REVERSED COMMA
٬  066C  ARABIC THOUSANDS SEPARATOR
‘  2018  LEFT SINGLE QUOTATION MARK
’  2019  RIGHT SINGLE QUOTATION MARK
‚  201A  SINGLE LOW-9 QUOTATION MARK
‛  201B  SINGLE HIGH-REVERSED-9 QUOTATION MARK
՚   055A  ARMENIAN APOSTROPHE
︐  FE10  PRESENTATION FORM FOR VERTICAL COMMA
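A rough Python counterpart of the two character classes, using str.translate instead of regex replacement. The code points are copied from the two tables above; the names SINGLE_LIKE, DOUBLE_LIKE, QUOTE_TABLE and normalize_quotes are hypothetical and only serve the example.

```python
# Code points taken from the tables above, written as escapes so the
# source stays readable regardless of editor font support.
SINGLE_LIKE = "\u02BB\u02BC\u02BD\u066C\u2018\u2019\u201A\u201B\u055A\uFE10"
DOUBLE_LIKE = (
    "\u00AB\u00BB\u05F4\u201C\u201D\u201E\u201F\u226A\u226B"
    "\u300A\u300B\u301D\u301E\u301F\uFF02\u2033\u2036"
)

# One lookup table: every single-quote lookalike maps to ', every
# double-quote lookalike maps to ".
QUOTE_TABLE = {
    **dict.fromkeys(map(ord, SINGLE_LIKE), "'"),
    **dict.fromkeys(map(ord, DOUBLE_LIKE), '"'),
}


def normalize_quotes(text: str) -> str:
    """Map all listed quote lookalikes to ASCII quotes in a single pass."""
    return text.translate(QUOTE_TABLE)


print(normalize_quotes("\u00ABhi\u00BB \u201Ayo\u201B"))
# "hi" 'yo'
```

A translation table avoids building a regex at all, which may be convenient when the set of characters is fixed and each one maps to a single replacement.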

Use gsub to replace curly apostrophe with straight apostrophe in R list of character vectors

You might be running up against a bug in R on Windows. Try using utf8::as_utf8 on your input. Alternatively, this also works:

library(utf8)
list_TestWords <- as.list(c("this", "isn't", "ideal", "but", "we", "can’t", "fix", "it"))
lapply(list_TestWords, utf8_normalize, map_quote = TRUE)

This will replace the following characters with ASCII apostrophe:

U+055A ARMENIAN APOSTROPHE
U+2018 LEFT SINGLE QUOTATION MARK
U+2019 RIGHT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+FF07 FULLWIDTH APOSTROPHE

It will also convert your text to composed normal form (NFC).
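To illustrate what "map the apostrophe lookalikes, then compose to NFC" means, here is a small Python sketch using the standard unicodedata module. It is not the utf8 package's implementation; it only mimics the behaviour described above for the five listed characters, and normalize_words is a made-up name.

```python
import unicodedata

# The five characters listed above, all mapped to the ASCII apostrophe.
APOSTROPHE_LIKE = dict.fromkeys(map(ord, "\u055A\u2018\u2019\u201B\uFF07"), "'")


def normalize_words(words):
    """Map apostrophe lookalikes to ' and compose each word to NFC."""
    return [unicodedata.normalize("NFC", w.translate(APOSTROPHE_LIKE)) for w in words]


# "cafe\u0301" is "e" followed by a combining acute accent; NFC composes
# the pair into the single code point U+00E9.
result = normalize_words(["isn't", "can\u2019t", "cafe\u0301"])
print(result)
```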

How to just remove (\) from string with (\) while keeping ()?

Thanks to @Ronak Shah, @Chelmy88 and @Konrad Rudolph for helping me see where my interpretation went wrong.

Basically, it comes down to how R renders a string in the console: printing shows the escaped form, so a single backslash is displayed as \\.

Using cat() shows the string's actual contents and clears up the confusion.
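Python draws the same distinction, which may make the point easier to see. The sketch below is an analogy, not R code: repr() plays roughly the role of R's print() (escaped display form) and Python's print() plays the role of cat() (actual characters). The sample string is invented for the example.

```python
s = "a\\b(c)"  # the literal is written with two backslashes, the string holds one

print(repr(s))  # escaped display form: 'a\\b(c)'
print(s)        # actual characters:    a\b(c)
print(len(s))   # 6, so there is only one backslash in the string

# Removing the single backslash leaves the parentheses untouched.
cleaned = s.replace("\\", "")
print(cleaned)  # ab(c)
```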

Python: Replace dumb quotation marks with “curly ones” in a string

You can use the HTMLParser to unescape the html entities returned from smartypants:

In [32]: from HTMLParser import HTMLParser

In [33]: s = "&#8220;But that gentleman,&#8221;"

In [34]: print HTMLParser().unescape(s)
“But that gentleman,”
In [35]: HTMLParser().unescape(s)
Out[35]: u'\u201cBut that gentleman,\u201d'

To avoid encoding errors, either open the file with io.open and specify encoding="the_encoding", or decode the byte strings to unicode before unescaping:

In [11]: s
Out[11]: '&#8220;But that gentleman,&#8221;\xe2'

In [12]: print  HTMLParser().unescape(s.decode("latin-1"))
“But that gentleman,”â
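The session above is Python 2. On Python 3 the standard library exposes the same functionality as html.unescape, so a roughly equivalent sketch would look like this (the sample strings mirror the ones above; the latin-1 choice is just the encoding used in that example, not a recommendation):

```python
from html import unescape

# Numeric character references as returned in the example above.
s = "&#8220;But that gentleman,&#8221;"
print(unescape(s))

# Bytes read from a file must be decoded before unescaping; in practice
# open(path, encoding="the_encoding") does this step while reading.
raw = b"&#8220;But that gentleman,&#8221;\xe2"
decoded = unescape(raw.decode("latin-1"))
print(decoded)
```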

Getting linux command syntax with escaped quotes correct from R

If you run the two gsub calls one after another, pass the message variable (not the original input) to the second call. Alternatively, nest them:

message <- gsub("\"", "\\\"", gsub("\'", "\\\'", input$mailAndStoreModalText, fixed=TRUE), fixed=TRUE)

Or a regex based replacement:

message <- gsub("([\"'])", "\\\\\\1", input$mailAndStoreModalText)

Both produce This\'s the \"best\" music.

See the R demo online. Note that cat(message, "\n") shows the literal text that message holds, not the escaped string literal you get when you just print message.

The ([\"']) regex matches either a " or a ' and captures it into Group 1. The "\\\\\\1" replacement pattern replaces the match with a literal \ (written as 4 backslashes) followed by the Group 1 value (\\1).
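The capture-and-prefix idea carries over to other regex engines. As a hedged illustration in Python (raw strings halve the number of backslashes needed; escape_quotes is a made-up helper name):

```python
import re


def escape_quotes(text: str) -> str:
    """Prefix every single or double quote with a backslash."""
    # (["']) captures the quote; in the replacement, \\ is one literal
    # backslash and \1 is the captured quote.
    return re.sub(r"""(["'])""", r"\\\1", text)


print(escape_quotes("This's the \"best\" music"))
# This\'s the \"best\" music
```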

Regex to match quote with minimum number of words

You need to "unroll" the character class by moving the whitespace-matching pattern out of it, and use a pattern like [<chars>]+(?:\s+[<chars>]+){4,}. Note that you should not use lookarounds here, because " can be both a leading and a trailing marker, and that may result in unwanted matches. Use a capturing group instead and access its value via matcher.group(1).

You may use

String regex = "[“\"]([A-Za-z0-9.-][A-Za-z,:’]*(?:\\s+[A-Za-z0-9.-][A-Za-z,:’]*){4,})[”\"]";

See the regex demo.

Then, just grab the Group 1 value:

String line = "Attorney General William Barr said the volume of information compromised was “staggering” and the largest breach in U.S. history.“This theft not only caused significant financial damage to Equifax but invaded the privacy of many, millions of Americans and imposed substantial costs and burdens on them as they had to take measures to protect themselves from identity theft,” said Mr. Barr.";
String regex = "[“\"]([A-Za-z0-9.-][A-Za-z,:’]*(?:\\s+[A-Za-z0-9.-][A-Za-z,:’]*){4,})[”\"]";
Matcher m = Pattern.compile(regex).matcher(line);
List<String> res = new ArrayList<>();
while(m.find()) {
    res.add(m.group(1));
}
System.out.println(res);

See the online Java demo.

Pattern details

[“"] - “ or "
([A-Za-z0-9.-][A-Za-z,:’]*(?:\s+[A-Za-z0-9.-][A-Za-z,:’]*){4,}) - Group 1:
- [A-Za-z0-9.-][A-Za-z,:’]* - an ASCII alphanumeric, . or -, then zero or more ASCII letters, ,, : or ’ chars
- (?:\s+[A-Za-z0-9.-][A-Za-z,:’]*){4,} - four or more occurrences of
  - \s+ - one or more whitespace chars
  - [A-Za-z0-9.-][A-Za-z,:’]* - the same word pattern as above
[”"] - ” or "
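The same pattern can be tried in Python, where building it from a named word fragment may make the "first word, then four or more further words" structure easier to read. The names WORD, QUOTE_RX and the sample sentence are invented for this sketch.

```python
import re

# One "word": an ASCII alphanumeric, . or -, then letters, commas,
# colons or curly apostrophes.
WORD = r"[A-Za-z0-9.-][A-Za-z,:’]*"

# Opening quote, Group 1 = first word plus at least four more
# whitespace-separated words, closing quote.
QUOTE_RX = re.compile(r'[“"](' + WORD + r"(?:\s+" + WORD + r'){4,})[”"]')

text = "He said “one two three four five” and then “too short” today."
print(QUOTE_RX.findall(text))
# ['one two three four five']
```

With a single capturing group, findall returns the Group 1 values directly, which corresponds to collecting m.group(1) in the Java loop.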

regex for even no. of single quotes

We can use the following regex (count only ' not preceded by ").

\bAND\b(?=(?:(?:[^']*[^'"]'){2})*[^']*$)
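A quick way to see what the lookahead does is to run it on a short string. In this Python sketch (sample text invented for the example), an AND is matched only if an even number of single quotes follows it up to the end of the string, that is, only when it sits outside a quoted part:

```python
import re

AND_OUTSIDE_QUOTES = re.compile(
    r"""\bAND\b(?=(?:(?:[^']*[^'"]'){2})*[^']*$)"""
)

text = "a AND b 'x AND y' c"

# The first AND is followed by two single quotes (even), so it matches.
# The second AND is followed by one single quote (odd), so it does not.
print([m.start() for m in AND_OUTSIDE_QUOTES.finditer(text)])
# [2]
```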