What Does the "[^][]" Regex Mean

What does the regex [^\s]*? mean?

Alright, so to answer your first question, I'll break down [^\s]*?.

  • The square brackets ([]) indicate a character class. A character class basically means that you want to match anything in the class, at that position, one time. [abc] will match the strings a, b, and c. In this case, your character class is negated using the caret (^) at the beginning - this inverts its meaning, making it match anything but the characters in it.

  • \s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.

  • *? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy - it will match as little as it can, going from left to right one character at a time.

In this case, what the whole pattern snippet [^\s]*? means is "match any sequence of non-whitespace characters, including the empty string". As mentioned in the comments, this can more succinctly be written as \S*?.
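To make the lazy behaviour concrete, here is a minimal sketch using Python's re module (an assumption on my part, since the question does not name a language). The sample strings are made up for illustration.

```python
import re

text = "foo bar.png baz"

# [^\s] and \S describe the same class: any single non-whitespace character.
same_tokens = re.findall(r"[^\s]+", text) == re.findall(r"\S+", text)

# On its own, the lazy form is already satisfied by the empty string.
lazy_alone = re.match(r"[^\s]*?", text).group()

# Followed by more pattern, the lazy form grows only as far as it has to...
lazy_then_dot = re.match(r"[^\s]*?\.", "a.b.c").group()

# ...whereas the greedy form takes as much as it can and then backs off.
greedy_then_dot = re.match(r"[^\s]*\.", "a.b.c").group()

print(same_tokens, repr(lazy_alone), lazy_then_dot, greedy_then_dot)
```

The lazy and greedy forms accept the same strings overall; they differ only in which match they settle on first.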

To answer the second part of your question, I'll compare the two regexes you give:

http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)

They both start the same way: attempting to match the protocol at the beginning of a URL and the subsequent colon (:) character. The first then matches any string that does not contain any whitespace and ends with the specified file extensions. The second, meanwhile, will match two literal slash characters (/) before matching any sequence of characters followed by a valid extension.

Now, it's obvious that both patterns are meant to match a URL, but both are incorrect. The first pattern, for instance, will match strings like

http:foo.bar.png
http:.png

Both of these are invalid URLs. Likewise, the second pattern permits spaces, allowing strings like these:

http:// .jpg
http://foo bar.png

Spaces are just as illegal in a URL. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:

https?://\S+\.(jpe?g|png|gif)

This matches URLs starting with either http or https, and accepts both the jpg and jpeg spellings of the extension.
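The comparison above can be checked with a short sketch in Python's re module (my choice of language, not the asker's). The last sample URL is a made-up address, and fullmatch is used so each string has to match as a whole.

```python
import re

# The three patterns discussed above, copied as-is.
first = re.compile(r"http:[^\s]*?(\.jpg|\.png|\.gif)")
second = re.compile(r"http://.*?(\.jpg|\.png|\.gif)")
better = re.compile(r"https?://\S+\.(jpe?g|png|gif)")

samples = [
    "http:foo.bar.png",
    "http:.png",
    "http:// .jpg",
    "http://foo bar.png",
    "https://example.com/cat.jpeg",  # hypothetical URL, for illustration only
]

# For each sample, record whether (first, second, better) accept the whole string.
results = {
    s: tuple(bool(p.fullmatch(s)) for p in (first, second, better))
    for s in samples
}

for s, flags in results.items():
    print(f"{s!r:35} first={flags[0]} second={flags[1]} better={flags[2]}")
```

Even the third pattern is only a rough filter: it checks the shape of the string, not whether it is a well-formed URL.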

Reference - What does this regex mean?

The Stack Overflow Regular Expressions FAQ

See also a lot of general hints and useful links at the regex tag details page.


Online tutorials

  • RegexOne
  • Regular Expressions Info

Quantifiers

  • Zero-or-more: *:greedy, *?:reluctant, *+:possessive
  • One-or-more: +:greedy, +?:reluctant, ++:possessive
  • ?:optional (zero-or-one)
  • Min/max ranges (all inclusive): {n,m}:between n & m, {n,}:n-or-more, {n}:exactly n
  • Differences between greedy, reluctant (a.k.a. "lazy", "ungreedy") and possessive quantifier:
    • Greedy vs. Reluctant vs. Possessive Quantifiers
    • In-depth discussion on the differences between greedy versus non-greedy
    • What's the difference between {n} and {n}? (the second being the lazy form)
    • Can someone explain Possessive Quantifiers to me? php, perl, java, ruby
    • Emulating possessive quantifiers .net
    • Non-Stack Overflow references: From Oracle, regular-expressions.info

Character Classes

  • What is the difference between square brackets and parentheses?
  • [...]: any one character, [^...]: negated/any character but
  • [^] matches any one character including newlines javascript
  • [\w-[\d]] / [a-z-[qz]]: set subtraction .net, xml-schema, xpath, JGSoft
  • [\w&&[^\d]]: set intersection java, ruby 1.9+
  • [[:alpha:]]:POSIX character classes
  • [[:<:]] and [[:>:]] Word boundaries
  • Why do [^\\D2], [^[^0-9]2], [^2[^0-9]] get different results in Java? java
  • Shorthand:
    • Digit: \d:digit, \D:non-digit
    • Word character (Letter, digit, underscore): \w:word character, \W:non-word character
    • Whitespace: \s:whitespace, \S:non-whitespace
  • Unicode categories (\p{L}, \P{L}, etc.)

Escape Sequences

  • Horizontal whitespace: \h:space-or-tab, \t:tab
  • Newlines:
    • \r, \n:carriage return and line feed
    • \R:generic newline php java-8
  • Negated whitespace sequences: \H:Non horizontal whitespace character, \V:Non vertical whitespace character, \N:Non line feed character pcre php5 java-8
  • Other: \v:vertical tab, \e:the escape character

Anchors

anchor  matches                flavors
^       Start of string        Common*
^       Start of line          Common (m)
$       End of line            Common (m)
$       End of text            Common* except javascript
$       Very end of string     javascript*, php (D)
\A      Start of string        Common except javascript
\Z      End of text            Common except javascript python
\Z      Very end of string     python
\z      Very end of string     Common except javascript python
\b      Word boundary          Common
\B      Not a word boundary    Common
\G      End of previous match  Common except javascript, python

* = default behaviour, (m) = multi-line mode, (D) = dollar-end-only mode
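A few of these anchors can be tried out in Python's re module; this is only a sketch for that one flavor, and other engines differ as the list above shows. The sample text is invented.

```python
import re

text = "first line\nsecond line\n"

# Without flags, ^ matches only at the start of the string.
starts_default = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# In Python, $ (no flags) matches at the end of the string and also
# just before a single trailing newline.
dollar_hit = re.search(r"line$", text) is not None

# \Z in Python matches only at the very end of the string.
very_end_hit = re.search(r"line\Z", text) is not None

# \b marks a word boundary, so "inline" and "lines" are skipped.
whole_words = re.findall(r"\bline\b", "line inline lines line")

print(starts_default, starts_multiline, dollar_hit, very_end_hit, whole_words)
```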

Regex Explanation ^.*$

  • ^ matches position just before the first character of the string
  • $ matches position just after the last character of the string
  • . matches a single character. Does not matter what character it is, except newline
  • * matches preceding match zero or more times

So, ^.*$ means - match, from beginning to end, any character that appears zero or more times. Basically, that means - match everything from start to end of the string. This regex pattern is not very useful.

Let's take a regex pattern that may be a bit more useful. Say I have two strings: "The bat of Matt Jones" and "Matthew's last name is Jones". The pattern ^Matt.*Jones$ will match only "Matthew's last name is Jones". Why? The pattern says the string must start with Matt and end with Jones, with zero or more characters (any characters except newline) in between.
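Here is the same example as a small Python sketch (Python is my assumption; the answer itself is language-neutral).

```python
import re

pattern = re.compile(r"^Matt.*Jones$")

candidates = [
    "The bat of Matt Jones",
    "Matthew's last name is Jones",
]

# Keep only the strings that start with "Matt" and end with "Jones".
matched = [s for s in candidates if pattern.search(s)]

# ^.*$ accepts any single-line string, including the empty one.
accepts_anything = all(re.search(r"^.*$", s) is not None for s in ["", "abc", "x y z"])

print(matched, accepts_anything)
```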

Feel free to use an online tool like https://regex101.com/ to test out regex patterns and strings.

What does [^.]* mean in regular expression?

Within the [] the . means just a dot. And the leading ^ means "anything but ...".

So [^.]* matches zero or more non-dots.
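A quick check in Python's re module (assumed here only for illustration) shows both points: the dot inside brackets is literal, and [^.]* stops at the first dot. The file names are made up.

```python
import re

# [^.]* consumes everything up to, but not including, the first dot.
first_label = re.match(r"[^.]*", "archive.tar.gz").group()

# "Zero or more" means the match can be empty when the string starts with a dot.
empty_match = re.match(r"[^.]*", ".hidden").group()

# Inside a class the dot is a plain dot, so [.] does not match other characters.
dot_is_literal = re.fullmatch(r"[.]", "x") is None and re.fullmatch(r"[.]", ".") is not None

print(repr(first_label), repr(empty_match), dot_is_literal)
```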

What does the regex /\\*{2,}/ mean?

That regex is invalid syntax.

You have this piece:

*{2,}

Which would read: repeat "zero or more times" 2 or more times, i.e. a quantifier applied to another quantifier.


The following regex:

/\\*.{2,}/

Is the simplest and closest regex to the one you have, which would read as:

match 0 or more '\' and 2 or more characters that aren't newlines

If you are talking about the string itself, it may be interpreted in two ways:

  • /\\*{2,}/

    Read as: match a literal \ zero or more times, with that repetition repeated 2 or more times

    This is invalid syntax
  • /\*{2,}/

    Read as: match 2 or more *
    This is valid syntax

It all varies, depending on the escape character.
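The same distinction can be reproduced in Python's re module. This is a sketch in a different engine from the one in the question, so the exact error differs, but the stacked quantifier is rejected there too. The helper name compiles is hypothetical.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the engine accepts the pattern, False on a syntax error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Literal backslash, then *, then {2,}: a quantifier stacked on a quantifier.
stacked_ok = compiles(r"\\*{2,}")

# Escaped asterisk followed by {2,}: two or more literal asterisks.
escaped_ok = compiles(r"\*{2,}")
hits = re.findall(r"\*{2,}", "a*b**c***")

# The suggested repair: zero or more backslashes, then two or more characters.
repaired_hit = re.fullmatch(r"\\*.{2,}", "\\\\ab") is not None

print(stacked_ok, escaped_ok, hits, repaired_hit)
```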


Edit:

Since the question was updated to show which language and engine it is being used, I've updated to add the following information:

You have to pass the regex as '/\*{2,}/' OR as "/\\*{2,}/" (watch the quotes).

Both are very similar, except that single quotes ('') only support the following escape sequences:

  • \' - Produces '
  • \\ - Produces \

Double-quoted strings are treated differently in PHP: they support many more escape sequences, including:

  • \" - Produces "
  • \\ - Produces \
  • \x<1-2 digit hex number> - Same as chr(0x<hex number>)
  • \0 - Produces a null char
  • \1 - Produces a control char (same as chr(1))
  • \u{<hex code point>} - Produces that code point encoded as UTF-8 (PHP 7.0+)
  • \r - Produces a carriage return (the line ending on classic Mac OS)
  • \n - Produces a line feed (the line ending on Linux and newer macOS; on Windows, when writing a file without b)
  • \t - Produces a tab
  • \<number> or \0<number> - Same as \x, but the numbers are in octal (e.g.: "\75" and "\075" produce =)
  • ... (some more that I probably forgot) ...
  • \<anything else> - Left as is, backslash included (so "\'" produces \' and "\*" produces \*)

Read more about this on https://php.net/manual/en/language.types.string.php
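The underlying idea, that the string literal is processed before the regex engine ever sees the pattern, is not specific to PHP. As a loose analogy only, here is how the same two layers look in Python, where a raw string plays roughly the role of PHP's single quotes. None of this is PHP syntax.

```python
import re

# Raw literal: the backslash reaches the regex engine untouched.
raw = r"\*{2,}"

# Ordinary literal: the string layer collapses \\ into one backslash first.
doubled = "\\*{2,}"

# Both spellings produce the same six characters: \ * { 2 , }
same_pattern = raw == doubled
pattern_length = len(doubled)

# Either way, the engine sees "two or more literal asterisks".
found = re.findall(doubled, "x ** y *** z")

print(same_pattern, pattern_length, found)
```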

Meaning of regular expressions like - \\d , \\D, ^ , $ etc

From ?regexp, in the Extended Regular Expressions section:

The caret ‘^’ and the dollar sign ‘$’ are metacharacters that
respectively match the empty string at the beginning and end of a
line. The symbols ‘\<’ and ‘\>’ match the empty string at the
beginning and end of a word. The symbol ‘\b’ matches the empty
string at either edge of a word, and ‘\B’ matches the empty string
provided it is not at an edge of a word. (The interpretation of
‘word’ depends on the locale and implementation: these are all
extensions.)

From Perl-like Regular Expressions:

The escape sequences ‘\d’, ‘\s’ and ‘\w’ represent any decimal
digit, space character and ‘word’ character (letter, digit or
underscore in the current locale: in UTF-8 mode only ASCII letters
and digits are considered) respectively, and their upper-case
versions represent their negation. Vertical tab was not regarded
as a space character in a ‘C’ locale before PCRE 8.34 (included in
R 3.0.3). Sequences ‘\h’, ‘\v’, ‘\H’ and ‘\V’ match horizontal
and vertical space or the negation. (In UTF-8 mode, these do
match non-ASCII Unicode code points.)

Note that backslashes usually need to be doubled/protected in R input, e.g. you would use "\\h" to match horizontal space.

From ?Quotes:

Backslash is used to start an escape sequence inside character
constants. Escaping a character not in the following table is an
error.

\n newline

\r carriage return

\t tab
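For readers who want to see the shorthand classes in action, here is a small sketch. It uses Python's re module rather than R (an assumption made purely for illustration), and raw strings there avoid the backslash doubling that R input needs. The sample sentence is invented.

```python
import re

sample = "Order 66 shipped on day 7"

# \d matches a decimal digit, \D anything that is not a digit.
digits = re.findall(r"\d+", sample)
non_digits = re.findall(r"\D+", sample)

# \w matches a word character (letter, digit or underscore).
words = re.findall(r"\w+", sample)

# ^ and $ pin the match to the start and end of the string.
starts_with_order = re.search(r"^Order", sample) is not None
ends_with_digit = re.search(r"\d$", sample) is not None

print(digits, non_digits, words, starts_with_order, ends_with_digit)
```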

As others comment above, you may need a little more help if you're getting started with regular expressions for the first time. This is a little bit off-topic for StackOverflow (links to off-site resources), but there are some links to regular expression resources at the bottom of the gsubfn package overview. Or Google "regular expression tutorial" ...

What does ?: in a regular expression mean?

It marks a non-capturing group. After a successful match, the first (\d*) is captured in $1 and the second in $2, while (?: \D.*?) is not captured at all.

$string =~ m/^(\d*)(?: \D.*?)(\d*)$/

From perldoc perlretut

Non-capturing groupings

A group that is required to bundle a set of alternatives may or may not be useful as a capturing group. If it isn't, it just creates a superfluous addition to the set of available capture group values, inside as well as outside the regexp. Non-capturing groupings, denoted by (?:regexp), still allow the regexp to be treated as a single unit, but don't establish a capturing group at the same time.
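The effect on group numbering is easy to see side by side. This sketch ports the Perl pattern above to Python's re module (my assumption, for illustration); the input string is made up.

```python
import re

text = "123 abc 456"

# With (?:...), the middle group only groups; it does not take a number.
non_capturing = re.match(r"^(\d*)(?: \D.*?)(\d*)$", text)
groups_without = non_capturing.groups()

# With plain (...), the middle group is captured and shifts the numbering.
capturing = re.match(r"^(\d*)( \D.*?)(\d*)$", text)
groups_with = capturing.groups()

print(groups_without, groups_with)
```

In the second form the trailing digits land in group 3 instead of group 2, which is exactly the renumbering that (?:...) avoids.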


