Extra Backslash Needed in PHP Regexp Pattern

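You need 4 backslashes in the PHP string literal to match 1 literal backslash in a regex, because there are two layers of unescaping:

  • The PHP string parser turns each pair of backslashes into one ("\\\\" -> \\)
  • The regex engine then turns the remaining pair into one literal backslash (\\ -> \)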
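
The two layers can be checked directly. This is a minimal sketch, assuming a single-quoted pattern with / delimiters; the variable names are only illustrative:

```php
<?php
// PHP source: four backslashes between the delimiters.
$pattern = '/\\\\/';

// After PHP string parsing the pattern is 4 characters long: / \ \ /
$length = strlen($pattern);

// The regex engine reads \\ as one literal backslash, so the pattern
// matches the single backslash in this subject (written '\\' in source).
$subject = 'a\\b';
$matches = preg_match($pattern, $subject);

var_dump($length, $matches); // int(4), int(1)
```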

From the PHP doc,

"escaping any other character will result in the backslash being printed too"

Hence for \\\[:

  • The first two backslashes collapse into one in the string, and the third stays because \[ is not a valid string escape sequence ("\\\[" -> \\[)
  • The regex engine then unescapes the remaining pair into one literal backslash (\\[ -> \[)
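
As an illustration, the sketch below assumes a hypothetical pattern in which the backslash is followed by a character class, [trn]. The pattern and subject are made up for the example; they only show that the three-backslash and four-backslash spellings produce the same string:

```php
<?php
// Hypothetical patterns: a literal backslash followed by t, r or n.
$three = "/\\\[trn]/";  // \\ -> \, and \[ is kept as-is
$four  = "/\\\\[trn]/"; // \\ -> \, twice

// Both literals produce the same string: /\\[trn]/
$same = ($three === $four);

// The subject is a, backslash, t, b (not a tab character).
$found = preg_match($three, 'a\\tb');

var_dump($same, $found); // bool(true), int(1)
```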

Yes, it works, but it is not good practice.

Backslash in Regex- PHP

The backslash has a special meaning in both regexen and PHP. In both cases it is used as an escape character. For example, if you want to write a literal quote character inside a PHP string literal, this won't work:

$str = ''';

PHP would get "confused" which ' ends the string and which is part of the string. That's where \ comes in:

$str = '\'';

It escapes the special meaning of ', so instead of terminating the string literal, it is now just a normal character in the string. There are more escape sequences like \n as well.

This now means that \ is a special character with a special meaning. To escape this conundrum when you want to write a literal \, you'll have to escape literal backslashes as \\:

$str = '\\'; // string literal representing one backslash

This works the same in both PHP and regexen. If you want to write a literal backslash in a regex, you have to write /\\/. Now, since you're writing your regexen as PHP strings, you need to double escape them:

$regex = '/\\\\/';

Each pair of \\ is first reduced to one \ by the PHP string escaping mechanism, so the actual regex is /\\/, which is a regex that means "one backslash".
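
To see why the doubling is needed, here is a small sketch comparing two and four backslashes in the source. The subject string is an arbitrary example, and the @ operator is used only to silence the warning PHP raises for the invalid pattern:

```php
<?php
// Two backslashes in source become one in the string: the pattern is /\/
$tooFew = '/\\/';

// The regex engine sees an escaped delimiter and no closing delimiter,
// so compilation fails; @ hides the warning and preg_match returns false.
$broken = @preg_match($tooFew, 'a\\b');

// Four backslashes in source yield the regex /\\/, one literal backslash.
$working = preg_match('/\\\\/', 'a\\b');

var_dump($broken, $working); // bool(false), int(1)
```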

Right way to escape backslash [ \ ] in PHP regex?

The thing is, you're using a character class, [], so it doesn't matter how many literal backslashes are embedded in it, it'll be treated as a single backslash.

e.g. the following two regexes:

/[a]/
/[aa]/

are for all intents and purposes identical as far as the regex engine is concerned. Character classes take a list of characters and "collapse" them down to match a single character, along the lines of "for the current character being considered, is it any of the characters listed inside the []?". If you list two backslashes in the class, then it'll be "is the char a backslash or is it a backslash?".
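
A quick way to convince yourself is to run both variants against the same input. In this sketch the subject and the replacement character are arbitrary choices for illustration:

```php
<?php
$subject = 'a\\\\b'; // a, two backslashes, b

// Regex [\\]: a class containing one (escaped) backslash.
$once = preg_replace('/[\\\\]/', '-', $subject);

// Regex [\\\\]: the same class with the backslash listed twice.
$twice = preg_replace('/[\\\\\\\\]/', '-', $subject);

var_dump($once, $twice); // both "a--b"
```

Either class matches one backslash at a time, so each backslash in the subject is replaced separately.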

Use preg_replace() to add two backslashes before each match

Welcome to the joys of "leaning toothpick syndrome" - backslash is such a commonly used escape character that it frequently requires escaping multiple times. Let's have a look at your case:

  • Required output (presumably because of some other escaping context): \\
  • Escape each \ with an additional \ for use in the PCRE regex engine: \\\\
  • Escape each \ there for use in a PHP string: \\\\\\\\

$value = 'mercedes-benz';
$pattern = '/(\+|-|\/|&&|\|\||!|\(|\)|\{|}|\[|]|\^|"|~|\*|\?|:|\\\)/';
$replace = '\\\\\\\\${1}';
echo preg_replace($pattern, $replace, $value);

As mickmackusa points out, you can get away with six rather than eight backslashes in some cases, such as a replacement of '\\\\\\'; this works because the regex engine sees \\\, which is an escaped backslash (\\) followed by a single backslash (\) that can't be escaping anything because it's the end of the string. Simply doubling for each "layer" of escaping is probably safer than learning when this short-cut is and isn't valid, though.
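
The end-of-string case can be sketched as follows. The pattern and subject here are simplified stand-ins rather than the ones from the question, and the replacement deliberately contains nothing after the backslashes:

```php
<?php
// Replace each hyphen with two literal backslashes.
$eight = preg_replace('/-/', '\\\\\\\\', 'a-b'); // replacement string: \\\\
$six   = preg_replace('/-/', '\\\\\\', 'a-b');   // replacement string: \\\

// Expected result in both cases: a, two backslashes, b.
$expected = 'a\\\\b';

var_dump($eight === $expected, $six === $expected);
```

This only covers a replacement that ends with the backslashes; when something such as a group reference follows them, the fully doubled form is the one to rely on.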

Which symbols should be escaped with a backslash in php regex?

There is a list of special Regex characters in the PHP documentation here: http://php.net/manual/en/function.preg-quote.php

The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : - (and # is also quoted as of PHP 7.3)
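
Rather than escaping these by hand, preg_quote() can do it for a literal string. The sketch below uses a made-up file path as the needle and passes / as the second argument so that the delimiter is escaped too:

```php
<?php
// A literal string full of regex metacharacters.
$needle = 'C:\\temp\\file.txt (1)';

// preg_quote escapes them; the second argument also escapes the delimiter.
$pattern = '/' . preg_quote($needle, '/') . '/';

$found = preg_match($pattern, 'Saved to C:\\temp\\file.txt (1) today');

var_dump(preg_quote('a.b'), $found); // string(4) "a\.b", int(1)
```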

why 3 backslash equal 4 backslash in php?

$b='/\\\\/';

PHP parses the string literal (more or less) character by character. The first input symbol is the forward slash. The result is a forward slash in the result (of the parsing step) and the input symbol (one character, the /) is taken away from the input.

The next input symbol is a backslash. It's taken from the input and the next character/symbol is inspected. It's also a backslash. That's a valid combination, so the second symbol is also taken from the input and the result is a single backslash (for both input symbols).

The same happens with the third and fourth backslash.

The last input symbol (within the literal) is the forward slash -> forward slash in the result.

-> /\\/

Now for the string with three backslashes:

$a='/\\\/';

PHP "finds" the first backslash; the next character is a backslash, which is a valid combination, resulting in one single backslash in the result and both characters taken from the input literal.
PHP then "finds" the third backslash; the next character is a forward slash, which is not a valid combination. So the result is a single backslash (because PHP loves and forgives you...) and only one character is taken from the input.
The next input character is the forward slash, resulting in a forward slash in the result.

-> /\\/

=> both literals encode the same string.
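
The equivalence is easy to confirm with a direct comparison. This is a minimal check using the same two literals:

```php
<?php
$a = '/\\\/';  // three backslashes in source
$b = '/\\\\/'; // four backslashes in source

var_dump($a === $b, strlen($a)); // bool(true), int(4)
```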

php replace group of double backslash

A literal backslash in PHP single-quoted strings must be declared with 2 backslashes: to print 1|2\|2|3\\|4\\\|4 you need $str = '1|2\\|2|3\\\\|4\\\\\\|4';.

In a regex, a literal backslash can be matched by writing 4 backslashes in the PHP string that holds the pattern.

Here is the updated PHP code:

$str = '1|2\\|2|3\\\\|4\\\\\\|4';
// echo $str . PHP_EOL; => 1|2\|2|3\\|4\\\|4
$r = preg_split('~\\\\.(*SKIP)(*FAIL)|\\|~s', $str);
var_dump($r);

Result:

array(4) {
[0]=>
string(1) "1"
[1]=>
string(4) "2\|2"
[2]=>
string(3) "3\\"
[3]=>
string(6) "4\\\|4"
}

And to obtain **a from \\a you can thus use

$str = '\\\\a'; // the string \\a
$r = preg_replace('~\\\\~s', '*', $str); // each backslash becomes *
echo $r; // => **a

How to properly escape a backslash to match a literal backslash in single-quoted and double-quoted PHP regex patterns

A backslash character (\) is treated as an escape character by both PHP's parser and the regular expression engine (PCRE). If you write a single backslash, the PHP parser may treat it as the start of an escape sequence. If you write two backslashes, the PHP parser interprets them as one literal backslash, but the regular expression engine then picks that backslash up as an escape character. To avoid this, you need to write four backslash characters (or, in some cases, three), depending upon how you quote the pattern.

To understand the difference between the two types of quoting patterns, consider the following two var_dump() statements:

var_dump('~\\\~');
var_dump("~\\\\~");

Output:

string(4) "~\\~"
string(4) "~\\~"

The escape sequence \~ has no special meaning in PHP when it's used in a single-quoted string, so three backslashes also work: \\ becomes \, but \~ remains \~.

Which one should you use:

For clarity, I'd always use ~\\\\~ when I want to match a literal backslash. The other one works too, but I think ~\\\\~ is clearer.


