PHP: Regex to Ignore Escaped Quotes Within Quotes

For most strings, you need to allow any escaped character, not just escaped quotes. For example, you most likely need to allow escapes like "\n" and "\t" and, of course, the escaped escape: "\\".

This is a frequently asked question, and one which was solved (and optimized) long ago. Jeffrey Friedl covers this question in depth (as an example) in his classic work: Mastering Regular Expressions (3rd Edition). Here is the regex you are looking for:

Good:

"([^"\\]|\\.)*"

Version 1: Works correctly but is not terribly efficient.

Better:

"([^"\\]++|\\.)*" or "((?>[^"\\]+)|\\.)*"

Version 2: More efficient if you have possessive quantifiers or atomic groups (See: sin's correct answer which uses the atomic group method).

Best:

"[^"\\]*(?:\\.[^"\\]*)*"

Version 3: More efficient still. Implements Friedl's "unrolling-the-loop" technique. Does not require possessive quantifiers or atomic groups, so it can be used in JavaScript and other less-featured regex engines.

Here are the recommended regexes in PHP syntax for both double and single quoted sub-strings:

$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';
$re_sq = "/'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'/s";
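
As a quick check of the two patterns above, here is a minimal sketch that runs them over one sample line. The $sample text and the $dq / $sq variable names are assumptions made for illustration; nowdoc is used so that every backslash in the sample stays literal.

```php
<?php
// Patterns from the text above; $sample is an illustrative assumption.
$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';
$re_sq = "/'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'/s";

// Nowdoc keeps every backslash literal, so the sample is exactly what is typed.
$sample = <<<'TXT'
say("He said \"hi\"\n", 'it\'s', "tail\\")
TXT;

preg_match_all($re_dq, $sample, $dq);
preg_match_all($re_sq, $sample, $sq);

print_r($dq[0]); // the two double-quoted substrings, escapes included
print_r($sq[0]); // the one single-quoted substring
```

Note that each pattern scans the whole subject on its own, so $re_sq has no idea whether a single quote sits inside a double-quoted string. If that matters, combine both alternatives in one pattern.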

regex - ignore escaped chars within quotation marks

You want to find an opening quote and its closing one, so neither of them may be an escaped quote.

(?<!\\)'.*?(?<!\\)' will do that.

Explanation :

(?<! negative lookbehind

\\) escaped backslash and closing lookbehind

' the quote which has not been escaped (the negative look behind has checked it)

.*? any characters: .* in lazy mode (the trailing ?), so the next quote will be evaluated

(?<!\\) again negative lookbehind to check if the quote has been escaped

' the final, unescaped quote
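
To see how the lookbehind approach behaves, here is a minimal sketch with two made-up inputs (both are assumptions, not from the question). It also runs an unrolled pattern for comparison, because the lookbehind only checks the single character before a quote: a string that ends in an escaped backslash (\\ right before the closing quote) is not closed where you might expect.

```php
<?php
// The lookbehind pattern from the text, and an unrolled pattern for comparison.
$re_lookbehind = '/(?<!\\\\)\'.*?(?<!\\\\)\'/';
$re_unrolled   = '/\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'/';

// Illustrative inputs (assumptions). Nowdoc keeps every backslash literal.
$escaped_quote = <<<'TXT'
x = 'it\'s' + 'ok'
TXT;
$escaped_backslash = <<<'TXT'
y = 'dir\\' + 'next'
TXT;

preg_match($re_lookbehind, $escaped_quote, $m1);
preg_match($re_lookbehind, $escaped_backslash, $m2);
preg_match($re_unrolled, $escaped_backslash, $m3);

echo $m1[0], "\n";  // 'it\'s'
echo $m2[0], "\n";  // 'dir\\' + '
echo $m3[0], "\n";  // 'dir\\'
```

So the lookbehind version is fine when backslashes only ever escape quotes, but it runs past the real closing quote as soon as the input can contain an escaped backslash.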

Regex - Get strings in Quotes ignore escaped Quotes and Comments

We can use a negative lookbehind if you know exactly how many characters sit between the comment marker and the string, because a lookbehind cannot contain an unbounded quantifier such as * or +. Something like this:

(?<!\/\/.)".*?[^\\]"

Or do it in two steps. First remove every comment that starts with // using this regex:

\/\/.*

Then use this to get all the strings:

".*?[^\\]"

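The two-step approach above could look roughly like this. It is only a sketch: the $source text and the variable names are assumptions, and step 1 would also cut off a // that appears inside a string (a URL, for instance), so it only suits input where that cannot happen.

```php
<?php
// $source is an illustrative assumption, not taken from the original question.
$source = <<<'TXT'
a = "first"; // "ignored"
b = "two \"words\"";
TXT;

// Step 1: drop everything from // to the end of each line.
$without_comments = preg_replace('~//.*~', '', $source);

// Step 2: collect the quoted strings that remain.
preg_match_all('/".*?[^\\\\]"/', $without_comments, $found);

print_r($found[0]); // "first" and "two \"words\"", but not "ignored"
```
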
Regex pattern for matching single quoted words in a string and ignore the escaped single quotes

Without this condition, a simple...

/('[^']*')/

...would suffice, of course: match all sequences of "single quote, followed by any number of non-single-quote symbols, followed by a single quote again".

But we need to be ready for two kinds of characters here: "normal" and "escaped" ones. So we should add some spice to our pattern:

/('[^'\\]*(?:\\.[^'\\]*)*')/

It might look odd (and it is), but it's actually pretty simple too: match sequences of...

  • single quote symbol...
  • ...followed by zero or more "normal" characters (not ' or \),
  • ...followed by a subexpression of ("escaped" symbol, then zero or more "normal" ones), repeated 0 or more times...
  • followed by a single quote symbol.

Example:

$input   = "City.name = 'New \\' York (And Some Backslash Fun)\\\\'\\'";
# ...each backslash is doubled so it stays literal; in a single-quoted literal, \' would collapse to a bare quote

$pattern = "/('[^'\\\\]*(?:\\\\.[^'\\\\]*)*')/";
# ... a choice: escape either slashes or single quotes; I choose the former

preg_match($pattern, $input, $token);
echo $token[0]; // 'New \' York (And Some Backslash Fun)\\'

Extracting double quoted strings with escape sequences

If you echo your pattern, you'll see it's indeed passed as %"(?:\"|.)*?"% to the regex parser. The single backslash will be treated as an escape character even by the regex parser.

So if the pattern is inside single quotes, you need to add at least one more backslash, so that two backslashes reach the regex parser (one to escape the other). The parser then sees: %"(?:\\"|.)*?"%

preg_match_all('%"(?:\\\"|.)*?"%', $msg, $matches);
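
To make the difference visible, here is a small sketch that echoes both literals and then runs them on one sample. The $msg text and the variable names are assumptions for illustration.

```php
<?php
// What you type in single quotes versus what the regex engine receives.
$one_backslash   = '%"(?:\"|.)*?"%';
$three_backslash = '%"(?:\\\"|.)*?"%';

echo $one_backslash, "\n";    // %"(?:\"|.)*?"%
echo $three_backslash, "\n";  // %"(?:\\"|.)*?"%

// $msg is an illustrative assumption. Nowdoc keeps its backslashes literal.
$msg = <<<'TXT'
say "a \"b\" c" end
TXT;

preg_match($one_backslash, $msg, $short);
preg_match($three_backslash, $msg, $full);

echo $short[0], "\n";  // "a \"
echo $full[0], "\n";   // "a \"b\" c"
```

With a single backslash the regex engine reads \" as a plain quote, so the match stops at the first escaped quote. With the extra backslashes it reads \\" as "a backslash, then a quote" and carries on to the real closing quote.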

Still, this isn't a very efficient pattern. The question actually seems to be a duplicate of this one.

There is a better pattern available in this answer (what some would call unrolled).

preg_match_all('%"[^"\\\]*(?:\\\.[^"\\\]*)*"%', $msg, $matches);

See demo at eval.in or compare steps with other patterns in regex101.

Regex (PHP) Remove all horizontal whitespace except between quotes ("" and '') (include escaped quotes)

You may use

'~(?<!\\\\)(?:\\\\{2})*(?:"[^\\\\"]*(?:\\\\.[^"\\\\]*)*"|\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\')(*SKIP)(*F)|\h+~s'

See the regex demo

Details

  • (?<!\\)(?:\\{2})*(?:"[^\\"]*(?:\\.[^"\\]*)*"|'[^\\']*(?:\\.[^'\\]*)*')(*SKIP)(*F) - a '...' or "...." substring where the first quotation mark is not itself escaped, which is skipped once matched (so, nothing inside them gets removed)

    • (?<!\\) - no \ char allowed immediately to the left of the current location
    • (?:\\{2})* - zero or more repetitions of double backslashes
    • (?:"[^\\"]*(?:\\.[^"\\]*)*"|'[^\\']*(?:\\.[^'\\]*)*') - either of the two alternatives:

      • "[^\\"]*(?:\\.[^"\\]*)*" - a string literal inside double quotation marks:

        • " - a double quote
        • [^\\"]* - 0 or more chars other than \ and "
        • (?:\\.[^"\\]*)* - zero or more repetitions of a \ followed by any char (\\.) and then any 0 or more chars other than " and \ ([^"\\]*)
        • " - the closing double quote
      • | - or
      • '[^\\']*(?:\\.[^'\\]*)*' - a string literal inside single quotation marks
    • (*SKIP)(*F) - PCRE verbs that omit the found match and make the regex engine go on searching for a next match starting at the current regex index
  • |\h+ - or 1 or more horizontal whitespaces

PHP demo:

$strs = ['2 + 2', 'f( " ")', 'f("Test \\"mystring\\" .")', 'f("\' ",   " ")'];
$rx = '~(?<!\\\\)(?:\\\\{2})*(?:"[^\\\\"]*(?:\\\\.[^"\\\\]*)*"|\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\')(*SKIP)(*F)|\h+~s';
print_r( preg_replace($rx, '', $strs) );

Output:

Array
(
[0] => 2+2
[1] => f(" ")
[2] => f("Test \"mystring\" .")
[3] => f("' "," ")
)

How can regex ignore escaped-quotes when matching strings?

<?php
// A single literal backslash, kept in a variable so the heredoc below stays readable.
$backslash = '\\';

// After interpolation the pattern is: #(["'])(?:\\?+.)*?\1#
$pattern = <<< PATTERN
#(["'])(?:{$backslash}{$backslash}?+.)*?{$backslash}1#
PATTERN;

foreach (array(
    "<?php \$s = 'Hi everyone, we\\'re ready now.'; ?>",
    '<?php $s = "Hi everyone, we\\"re ready now."; ?>',
    "xyz'a\\'bc\\d'123",
    "x = 'My string ends with with a backslash\\\\';"
) as $subject) {
    preg_match($pattern, $subject, $matches);
    echo $subject , ' => ', $matches[0], "\n\n";
}

prints

<?php $s = 'Hi everyone, we\'re ready now.'; ?> => 'Hi everyone, we\'re ready now.'

<?php $s = "Hi everyone, we\"re ready now."; ?> => "Hi everyone, we\"re ready now."

xyz'a\'bc\d'123 => 'a\'bc\d'

x = 'My string ends with with a backslash\\'; => 'My string ends with with a backslash\\'
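
If the one-line pattern is hard to read, the sketch below spells out the same pattern in extended mode (the x flag) so each part can carry a comment. The ~ delimiter and the nowdoc layout are my own choices for the illustration, not part of the original code.

```php
<?php
// Same pattern as above, rewritten with the x flag so each part can carry a comment.
// The ~ delimiter replaces # because # starts a comment in extended mode.
$pattern = <<<'PATTERN'
~
  (["'])   # group 1: the opening quote, single or double
  (?:
    \\?+   # an optional backslash, matched possessively
    .      # the character that follows, escaped or plain
  )*?      # repeated as few times as possible
  \1       # the same kind of quote that opened the string
~x
PATTERN;

// Illustrative subject taken from the list above; nowdoc keeps backslashes literal.
$subject = <<<'TXT'
xyz'a\'bc\d'123
TXT;

preg_match($pattern, $subject, $matches);
echo $matches[0], "\n";  // 'a\'bc\d'
```

The possessive \\?+ is what keeps a backslash glued to the character after it: once the backslash is taken, the engine cannot give it back and treat the following quote as a closing one.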

Considering escaped quotes in an all characters except type regex

I would use DOMDocument to do this as it won't care about the actual contents of the attribute as long as they are already valid:

function wrap_js($js) {
    $confirm_text = "It will not be possible to modify your responses anymore if you continue.\\n\\nAre you sure you want to continue?";
    $new_js_start = 'if( window.confirm("' . $confirm_text . '") ) { ';
    $new_js_end = ' } else { event.preventDefault(); }';
    return $new_js_start . $js . $new_js_end;
}
$html = "<input type='submit' id='gform_submit_button_4' class='gform_button button' value='Envoyer' onclick='/* Lots of JS */' onkeypress='/* Lots of JS */' />";
$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);
foreach ($xpath->query("//input[@type='submit']") as $submit_input) {
    foreach (['onclick', 'onkeypress'] as $attribute) {
        if (($js = $submit_input->getAttribute($attribute)) != '') {
            $submit_input->setAttribute($attribute, wrap_js($js));
        }
    }
}
echo $doc->saveHTML();

Output:

<input type="submit"
id="gform_submit_button_4"
class="gform_button button"
value="Envoyer"
onclick='if( window.confirm("It will not be possible to modify your responses anymore if you continue.\n\nAre you sure you want to continue?") ) { /* Lots of JS */ } else { event.preventDefault(); }'
onkeypress='if( window.confirm("It will not be possible to modify your responses anymore if you continue.\n\nAre you sure you want to continue?") ) { /* Lots of JS */ } else { event.preventDefault(); }'
>

Demo on 3v4l.org

Get html or text from inside quotes including escape quotes with RegEx

You can use negative lookbehind to avoid matching escaped quotes:

(?<!\\)"(.+?)(?<!\\)"

RegEx Demo

Here (?<!\\) is a negative lookbehind that avoids matching \".

However, I would caution you against using regex to parse HTML; it is better to use DOM for that.


PHP Code:

$value_regex = '~(?<!\\\\)"(.+?)(?<!\\\\)"~';
if (preg_match($value_regex, $line, $matches)) {
    $result = $matches[1];
}
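
For a quick test of the snippet above, here is a self-contained sketch. The $line value is an assumption made up for the example; the pattern is the one shown above.

```php
<?php
// $line is an illustrative assumption; nowdoc keeps its backslashes literal.
$line = <<<'TXT'
title="A \"quoted\" word" class="x"
TXT;

$value_regex = '~(?<!\\\\)"(.+?)(?<!\\\\)"~';
$result = null;
if (preg_match($value_regex, $line, $matches)) {
    $result = $matches[1]; // text between the first pair of unescaped quotes
}

echo $result, "\n";  // A \"quoted\" word
```

Because the group is .+? and not .*?, an empty value such as "" will not match with this pattern.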

