What Does [\S\S]* Mean in Regex in PHP

What does [\S\s]* mean in regex in PHP?

By default . doesn't match new lines - [\s\S] is a hack around that problem.

This is common in JavaScript, but in PHP you can use the /s flag to to make the dot match all characters.

Is [\s\S] same as . (dot)?

"No" it is not the same. It has an important difference if you are not using the single line flag (meaning that . does not match all).

The [\s\S] comes handy when you want to do mix of matches when the . does not match all.

It is easier to explain it with an example. Suppose you want to capture whatever is between a and b, so you can use pattern a(.*?)b (? is for ungreedy matches and parentheses for capturing the content), but if there are new lines suppose you don't want to capture this in the same group, so you can have another regex like a([\s\S]*?)b.

Therefore if we create one pattern using both approaches it results in:

a(.*)b|a([\s\S]*?)b

Sample Image

In this case, if you see the scenario in regex101, then you will have a colorful and easy way to differentiate the scenarios (in green capturing group #1 and in red capturing group #2):
Sample Image

So, in conclusion, the [\s\S] is a regex trick when you want to match multiple lines and the . does not suit your needs. It basically depends on your use case.

However, if you use the single line flag where . matches new lines, then you don't need the regex trick, below you can see that all is green and group 2 (red above) is not matched:Sample Image

Have also created a javascript performance test and it impacts in the performance around 25%:

https://jsperf.com/ss-vs-dot

Sample Image

PHP Regex: Difference between \s and \\s

Trying to make sense of the text from the manual.

http://php.net/manual/en/regexp.reference.escape.php

Single and double quoted PHP strings have special meaning of
backslash.
Thus if \ has to be matched with a regular expression \\, then "\\\\"
or '\\\\' must be used in PHP code.

I might not be correct but here I go.

When you use something like

preg_match("/\\s/", $myString)

What it is doing is converting \\ to \, which in turns make the string to be \s thus it behaves normally i.e its meaning doesn't change and the created regex is '/\s/' internally which matches "spaces"

To match \s in a string you would have to do something as follows

preg_match("/\\\\s/", $myString)

So the answer is \s or \\s in the regex string doesn't make any difference, personally I think using \s is simpler and easy to understand.

Regex to match a word, even is there are spaces between letters

This is the only way to match such words. You have to consume these spaces, otherwise you won't have a match. Actually, your pattern is the same as

t\s*e\s*s\s*t

If the word appears inside a larger string, you can consider a word boundary version:

\bt\s*e\s*s\s*t\b

NOTE: If only one whitespace is allowed in between each letter, you can use ? quantifier instead of *:

t\s?e\s?s\s?t

What is the meaning of [\w\-] regular expression in PHP

Regex 101

\w explained

\w match any word character [a-zA-Z0-9_]

\w\- explained

\w\-
\w match any word character [a-zA-Z0-9_]
\- matches the character - literally

Matching Email Addresses Simple, not future proof

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b

How to understand the output of the PHP regex matching below?

The reason is that group 12 (^(\{)) matches successfully, and therefore the regex engine stops. It never even gets to try to match group 16 (^([\S\s]+)).

If you put group 16 before group 12, then it will match the entire string. However, since [\s\S] matches any character at all (and can be abbreviated to . when using the s modifier), then none of the other groups will ever match.

All in all, your regex looks quite weird. I wouldn't be surprised if it contained a few bugs aside from the obvious warts like trying to put several spaces into a single character class.

Reference - What does this regex mean?

The Stack Overflow Regular Expressions FAQ

See also a lot of general hints and useful links at the regex tag details page.


Online tutorials

  • RegexOne ↪
  • Regular Expressions Info ↪

Quantifiers

  • Zero-or-more: *:greedy, *?:reluctant, *+:possessive
  • One-or-more: +:greedy, +?:reluctant, ++:possessive
  • ?:optional (zero-or-one)
  • Min/max ranges (all inclusive): {n,m}:between n & m, {n,}:n-or-more, {n}:exactly n
  • Differences between greedy, reluctant (a.k.a. "lazy", "ungreedy") and possessive quantifier:
    • Greedy vs. Reluctant vs. Possessive Quantifiers
    • In-depth discussion on the differences between greedy versus non-greedy
    • What's the difference between {n} and {n}?
    • Can someone explain Possessive Quantifiers to me? php, perl, java, ruby
    • Emulating possessive quantifiers .net
    • Non-Stack Overflow references: From Oracle, regular-expressions.info

Character Classes

  • What is the difference between square brackets and parentheses?
  • [...]: any one character, [^...]: negated/any character but
  • [^] matches any one character including newlines javascript
  • [\w-[\d]] / [a-z-[qz]]: set subtraction .net, xml-schema, xpath, JGSoft
  • [\w&&[^\d]]: set intersection java, ruby 1.9+
  • [[:alpha:]]:POSIX character classes
  • [[:<:]] and [[:>:]] Word boundaries
  • Why do [^\\D2], [^[^0-9]2], [^2[^0-9]] get different results in Java? java
  • Shorthand:
    • Digit: \d:digit, \D:non-digit
    • Word character (Letter, digit, underscore): \w:word character, \W:non-word character
    • Whitespace: \s:whitespace, \S:non-whitespace
  • Unicode categories (\p{L}, \P{L}, etc.)

Escape Sequences

  • Horizontal whitespace: \h:space-or-tab, \t:tab
  • Newlines:
    • \r, \n:carriage return and line feed
    • \R:generic newline php java-8
  • Negated whitespace sequences: \H:Non horizontal whitespace character, \V:Non vertical whitespace character, \N:Non line feed character pcre php5 java-8
  • Other: \v:vertical tab, \e:the escape character

Anchors








































































anchormatchesflavors
^Start of stringCommon*
^Start of lineCommonm
$End of lineCommonm
$End of textCommon* except javascript
$Very end of stringjavascript*, phpD
\AStart of stringCommon except javascript
\ZEnd of textCommon except javascript python
\ZVery end of stringpython
\zVery end of stringCommon except javascript python
\bWord boundaryCommon
\BNot a word boundaryCommon
\GEnd of previous matchCommon except javascript, python

preg_match: and-not expression

My suggested regex does not require ! with preg_match, and matches only a multiline string without both "lorem" and "ipsum" in any upper or lower case and as whole words:

^(?si)(?!.*?\bipsum\b.*$)(?!.*\blorem\b.*\bipsum\b|\bipsum\b.*\blorem\b.*$).*$

(?si) sets case-insensitive and singleline modes so that . could match a newline and match both "Lorem" and "lorem". \bs are used to match whole words only. It will also fail a string that has ipsum (a second condition (?!.*?\bipsum\b.*$) is set to handle that).

See demo here

$re = "/^(?si)(?!.*?\\bipsum\\b.*$)(?!.*\\blorem\\b.*\\bipsum\\b|\\bipsum\\b.*\\blorem\\b.*$).*$/"; 
$str = "dolor lorem sit amet, consectetur adipisicing elit, \nsed do eiusmod tempor incididunt ut labore et dolore magna aliqua.";
if (preg_match($re, $str, $matches)) {
...
}

PHP preg_replace() does not match white spaces and new lines: why?

Add + after <!--[\s\r\n]* in your pattern :

$pattern = '<!--[\s\r\n]*+[^![].*-->';

$string = '<!-- one line comment -->';
var_dump(preg_replace('/' . $pattern . '/Uis', '', $string));

$string = '<!--
multiple line comment
-->';
var_dump(preg_replace('/' . $pattern . '/Uis', '', $string));

$string = '<!-- ! one line comment -->';
var_dump(preg_replace('/' . $pattern . '/Uis', '', $string));

$string = '<!--! one line comment -->';
var_dump(preg_replace('/' . $pattern . '/Uis', '', $string));

$string = '<!--!
multiple line comment
-->';
var_dump(preg_replace('/' . $pattern . '/Uis', '', $string));

$string = '<!-- !
multiple line comment
-->';
var_dump(preg_replace('/' . $pattern . '/Uis', '', $string));

$string = '<!--
!multiple line comment
-->';
var_dump(preg_replace('/' . $pattern . '/Uis', '', $string));

$string = '<!--[if lt IE 9]>';
var_dump(preg_replace('/' . $pattern . '/Uis', '', $string));

Output :

string '' (length=0)

string '' (length=0)

string '<!-- ! one line comment -->' (length=27)

string '<!--! one line comment -->' (length=26)

string '<!--!

multiple line comment

-->' (length=33)

string '<!-- !

multiple line comment

-->' (length=34)

string '<!--

!multiple line comment

-->' (length=33)

string '<!--[if lt IE 9]>' (length=17)

RegEx for capturing a repeating pattern

You can make this work with only minor changes, but the issue is that last part. The general form of a tempered greedy token is this:

(.(?!notAllowed))+

so, using this pattern for your case, plus adding named groups for clarity:

(?<hours>[0-9]{1,2}h)[ ]*(?<minutes>[0-9]{1,2}min):\s*(?<description>(?:.(?!\dh\s\d{1,2}min))+)

PS: if you cannot turn on a "dot matches newline" mode, you may be able to use [\s\S] to simulate.

regex101 demo



Related Topics



Leave a reply



Submit