What does the regex [^\s]*? mean?
Alright, so to answer your first question, I'll break down [^\s]*?.
The square brackets ([]) indicate a character class. A character class basically means that you want to match anything in the class, at that position, one time. [abc] will match the strings a, b, and c. In this case, your character class is negated using the caret (^) at the beginning - this inverts its meaning, making it match anything but the characters in it.
\s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.
*? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy - it will match as little as it can, going from left to right one character at a time.
In this case, what the whole pattern snippet [^\s]*? means is "match any sequence of non-whitespace characters, including the empty string". As mentioned in the comments, this can more succinctly be written as \S*?.
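To see the lazy behaviour concretely, here is a small sketch using Python's re module (the sample strings are made up for illustration; the question itself does not name a language):

```python
import re

# On its own, the lazy class is already satisfied by the empty string.
alone = re.match(r"[^\s]*?", "foo.bar baz").group()

# Followed by a literal dot, it expands only as far as the first dot.
lazy = re.match(r"[^\s]*?\.", "a.b.c d").group()

# The greedy version grabs everything up to the space, then backtracks to the last dot.
greedy = re.match(r"[^\s]*\.", "a.b.c d").group()

# \S is shorthand for [^\s], so this behaves exactly like the lazy version above.
shorthand = re.match(r"\S*?\.", "a.b.c d").group()

print(repr(alone))  # ''
print(lazy)         # a.
print(greedy)       # a.b.
print(shorthand)    # a.
```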
To answer the second part of your question, I'll compare the two regexes you give:
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
They both start the same way: attempting to match the protocol at the beginning of a URL and the subsequent colon (:) character. The first then matches any string that does not contain any whitespace and ends with the specified file extensions. The second, meanwhile, will match two literal slash characters (/) before matching any sequence of characters followed by a valid extension.
Now, it's obvious that both patterns are meant to match a URL, but both are incorrect. The first pattern, for instance, will match strings like
http:foo.bar.png
http:.png
Both of these are invalid. Likewise, the second pattern will permit spaces, allowing stuff like this:
http:// .jpg
http://foo bar.png
These are equally invalid as URLs. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:
https?://\S+\.(jpe?g|png|gif)
In this case, it'll match URLs starting with both http and https, as well as files that end in both variations of jpg (jpg and jpeg).
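A quick way to check these claims is to run all three patterns over the strings above. This is a sketch in Python; the two example.com URLs are hypothetical additions, and fullmatch is used so that each pattern has to account for the whole string:

```python
import re

first = re.compile(r"http:[^\s]*?(\.jpg|\.png|\.gif)")
second = re.compile(r"http://.*?(\.jpg|\.png|\.gif)")
better = re.compile(r"https?://\S+\.(jpe?g|png|gif)")

samples = [
    "http:foo.bar.png",             # no slashes after the colon
    "http:.png",                    # nothing but an extension
    "http:// .jpg",                 # contains a space
    "http://foo bar.png",           # contains a space
    "http://example.com/cat.jpeg",  # hypothetical valid URL
    "https://example.com/dog.gif",  # hypothetical valid URL
]

def accepts(pattern, s):
    """True when the pattern matches the entire string."""
    return pattern.fullmatch(s) is not None

results = {s: (accepts(first, s), accepts(second, s), accepts(better, s)) for s in samples}

for s, flags in results.items():
    print(f"{s!r:32} first={flags[0]} second={flags[1]} better={flags[2]}")
```

Only the last two strings are accepted by the improved pattern, while each of the original patterns lets two of the invalid strings through.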
Reference - What does this regex mean?
The Stack Overflow Regular Expressions FAQ
See also a lot of general hints and useful links at the regex tag details page.
Online tutorials
- RegexOne
- Regular Expressions Info
Quantifiers
- Zero-or-more: *:greedy, *?:reluctant, *+:possessive
- One-or-more: +:greedy, +?:reluctant, ++:possessive
- ?:optional (zero-or-one)
- Min/max ranges (all inclusive): {n,m}:between n & m, {n,}:n-or-more, {n}:exactly n
- Differences between greedy, reluctant (a.k.a. "lazy", "ungreedy") and possessive quantifier:
  - Greedy vs. Reluctant vs. Possessive Quantifiers
  - In-depth discussion on the differences between greedy versus non-greedy
  - What's the difference between {n} and {n}?
  - Can someone explain Possessive Quantifiers to me? php, perl, java, ruby
  - Emulating possessive quantifiers .net
  - Non-Stack Overflow references: From Oracle, regular-expressions.info
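As a rough illustration of the range quantifiers listed above, here is a Python sketch. The possessive forms are left out because Python's re module only supports them from version 3.11 onward:

```python
import re

# {n}: exactly n, {n,}: n or more, {n,m}: between n and m (inclusive).
exactly_three = re.fullmatch(r"a{3}", "aaa") is not None
two_or_more = re.fullmatch(r"a{2,}", "aaaaa") is not None
too_many = re.fullmatch(r"a{2,4}", "aaaaa") is not None  # five a's exceed the maximum of 4

# Greedy takes the longest run the range allows; reluctant takes the shortest.
greedy = re.match(r"a{2,4}", "aaaaa").group()
reluctant = re.match(r"a{2,4}?", "aaaaa").group()

print(exactly_three, two_or_more, too_many)  # True True False
print(greedy, reluctant)                     # aaaa aa
```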
Character Classes
- What is the difference between square brackets and parentheses?
- [...]: any one character, [^...]: negated/any character but
- [^] matches any one character including newlines javascript
- [\w-[\d]] / [a-z-[qz]]: set subtraction .net, xml-schema, xpath, JGSoft
- [\w&&[^\d]]: set intersection java, ruby 1.9+
- [[:alpha:]]:POSIX character classes
- [[:<:]] and [[:>:]] Word boundaries
- Why do [^\\D2], [^[^0-9]2], [^2[^0-9]] get different results in Java? java
- Shorthand:
  - Digit: \d:digit, \D:non-digit
  - Word character (Letter, digit, underscore): \w:word character, \W:non-word character
  - Whitespace: \s:whitespace, \S:non-whitespace
- Unicode categories (\p{L}, \P{L}, etc.)
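The shorthand and negated classes can be tried out with a short Python sketch like the one below. The sample text is invented, and flavour-specific features such as set subtraction or POSIX classes are not shown because Python's re does not support them:

```python
import re

s = "Order 66: 3 x blue_pens @ 4.50"

digits = re.findall(r"\d+", s)     # runs of digits
words = re.findall(r"\w+", s)      # runs of letters, digits, underscores
non_space = re.findall(r"\S+", s)  # runs of anything that is not whitespace

# A plain class matches any one listed character; a leading ^ negates it.
vowels = re.findall(r"[aeiou]", "regex")
non_vowels = re.findall(r"[^aeiou]", "regex")

print(digits)      # ['66', '3', '4', '50']
print(words)       # ['Order', '66', '3', 'x', 'blue_pens', '4', '50']
print(non_space)   # ['Order', '66:', '3', 'x', 'blue_pens', '@', '4.50']
print(vowels)      # ['e', 'e']
print(non_vowels)  # ['r', 'g', 'x']
```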
Escape Sequences
- Horizontal whitespace: \h:space-or-tab, \t:tab
- Newlines:
  - \r, \n:carriage return and line feed
  - \R:generic newline php java-8
- Negated whitespace sequences: \H:Non horizontal whitespace character, \V:Non vertical whitespace character, \N:Non line feed character pcre php5 java-8
- Other: \v:vertical tab, \e:the escape character
Anchors
anchor | matches | flavors |
---|---|---|
^ | Start of string | Common (default mode) |
^ | Start of line | Common (multiline mode) |
$ | End of line | Common (multiline mode) |
$ | End of text | Common (default mode) except javascript |
$ | Very end of string | javascript (default mode), php (D modifier) |
\A | Start of string | Common except javascript |
\Z | End of text | Common except javascript python |
\Z | Very end of string | python |
\z | Very end of string | Common except javascript python |
\b | Word boundary | Common |
\B | Not a word boundary | Common |
\G | End of previous match | Common except javascript, python |
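The table can be spot-checked in one flavour. The sketch below uses Python, where (as the table notes) \Z means the very end of the string; the sample text is made up:

```python
import re

text = "first line\nsecond line\n"

# Default mode: ^ only matches at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# Multiline mode: ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.M)

# Default $ matches at the end of the text, which may be just before a final newline.
end_dollar = re.search(r"line$", text) is not None
# Python's \Z only matches at the very end, after the final newline.
end_Z = re.search(r"line\Z", text) is not None

# \b matches at word boundaries, so only the standalone word is found.
whole_word = re.findall(r"\bline\b", "line lines inline line")

print(starts_default)     # ['first']
print(starts_multiline)   # ['first', 'second']
print(end_dollar, end_Z)  # True False
print(whole_word)         # ['line', 'line']
```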
Regex Explanation ^.*$
^ matches position just before the first character of the string
$ matches position just after the last character of the string
. matches a single character. Does not matter what character it is, except newline
* matches preceding match zero or more times
So, ^.*$ means - match, from beginning to end, any character that appears zero or more times. Basically, that means - match everything from start to end of the string. This regex pattern is not very useful.
Let's take a regex pattern that may be a bit useful. Let's say I have two strings The bat of Matt Jones and Matthew's last name is Jones. The pattern ^Matt.*Jones$ will match Matthew's last name is Jones. Why? The pattern says - the string should start with Matt and end with Jones and there can be zero or more characters (any characters) in between them.
Feel free to use an online tool like https://regex101.com/ to test out regex patterns and strings.
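The same check can be done locally. Here is a minimal Python sketch of the two example strings, plus one extra made-up string showing that . does not cross a newline by default:

```python
import re

pattern = re.compile(r"^Matt.*Jones$")

a = "The bat of Matt Jones"
b = "Matthew's last name is Jones"

matches_a = pattern.search(a) is not None  # fails: the string does not start with Matt
matches_b = pattern.search(b) is not None  # succeeds: starts with Matt, ends with Jones

# ^.*$ accepts any single-line string, but . stops at a newline,
# so a string with a newline in the middle is not matched in default mode.
single_line = re.search(r"^.*$", "anything at all") is not None
two_lines = re.search(r"^.*$", "two\nlines") is not None

print(matches_a, matches_b)    # False True
print(single_line, two_lines)  # True False
```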
What does [^.]* mean in regular expression?
Within the [] the . means just a dot. And the leading ^ means "anything but ...".
So [^.]* matches zero or more non-dots.
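For instance, in Python (the file names here are invented for illustration):

```python
import re

filename = "archive.tar.gz"

# [^.]* stops at the first dot, because a dot is the one thing it refuses to match.
stem = re.match(r"[^.]*", filename).group()

# "Zero or more" means an empty match is fine when the string starts with a dot.
leading_dot = re.match(r"[^.]*", ".hidden").group()

# Outside the brackets, . is a wildcard, so .* runs through the dots as well.
any_char = re.match(r".*", filename).group()

print(stem)               # archive
print(repr(leading_dot))  # ''
print(any_char)           # archive.tar.gz
```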
What does the regex /\\*{2,}/ mean?
That regex is invalid syntax.
You have this piece:
*{2,}
Which would read: repeat the preceding repetition (*) 2 or more times. PCRE rejects a quantifier applied directly to another quantifier.
The following regex:
/\\*.{2,}/
is the simplest and closest valid regex to the one you have, which would read as: match 0 or more '\' followed by 2 or more characters that aren't newlines.
If you are talking about the string itself, it may be interpreted as 2 things:
- /\\*{2,}/ reads as: match a literal \ zero or more times, then repeat that 2 or more times. This is invalid syntax.
- /\*{2,}/ reads as: match 2 or more *. This is valid syntax.
It all varies, depending on the escape character.
Edit:
Since the question was updated to show which language and engine are being used, I've added the following information:
You have to pass the regex as '/\*{2,}/' OR as "/\\*{2,}/" (watch the quotes).
Both are very similar, except that single quotes ('') only support the following escape sequences:
- \' - Produces '
- \\ - Produces \
Double-quoted strings are treated differently in PHP. They support many more escape sequences, like:
- \" - Produces "
- \\ - Produces \
- \x<1- or 2-digit hex number> - Same as chr(0x<hex number>)
- \0 - Produces a null char
- \1 - Produces a control char (same as chr(1))
- \u{<hex number>} - Produces a UTF-8 encoded character (PHP 7.0 and later)
- \r - Produces a carriage return (the newline on old Mac OS)
- \n - Produces a newline on Linux/newer OSX/Windows (when writing a file without b)
- \t - Produces a tab
- \<number> or \0<number> - Same as \x, but the numbers are in octal (e.g.: "\75" and "\075" produce =)
- ... (some more that I probably forgot) ...
- \<anything else> - Is left as-is, backslash included (e.g.: "\'" produces \')
Read more about this on https://php.net/manual/en/language.types.string.php
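The difference between the two readings can be reproduced outside PHP. The sketch below uses Python's re module rather than PCRE, so the error message differs, but the stacked quantifier is rejected there too; the compiles helper is a hypothetical name used only for this illustration:

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if the pattern is accepted by Python's re module."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Pattern text \\*{2,}: a literal backslash, then *, then {2,} stacked on top of it.
stacked = compiles(r"\\*{2,}")

# Pattern text \*{2,}: an escaped asterisk repeated two or more times.
stars = re.compile(r"\*{2,}")
found = stars.findall("a * b ** c ****")

# Without a raw string, the backslash must be doubled to get the same pattern text,
# much like the double-quoted PHP string above.
same = "\\*{2,}" == r"\*{2,}"

print(stacked)  # False
print(found)    # ['**', '****']
print(same)     # True
```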
Meaning of regular expressions like - \\d , \\D, ^ , $ etc
From ?regexp, in the Extended Regular Expressions section:
The caret ‘^’ and the dollar sign ‘$’ are metacharacters that
respectively match the empty string at the beginning and end of a
line. The symbols ‘\<’ and ‘\>’ match the empty string at the
beginning and end of a word. The symbol ‘\b’ matches the empty
string at either edge of a word, and ‘\B’ matches the empty string
provided it is not at an edge of a word. (The interpretation of
‘word’ depends on the locale and implementation: these are all
extensions.)
From Perl-like Regular Expressions:
The escape sequences ‘\d’, ‘\s’ and ‘\w’ represent any decimal
digit, space character and ‘word’ character (letter, digit or
underscore in the current locale: in UTF-8 mode only ASCII letters
and digits are considered) respectively, and their upper-case
versions represent their negation. Vertical tab was not regarded
as a space character in a ‘C’ locale before PCRE 8.34 (included in
R 3.0.3). Sequences ‘\h’, ‘\v’, ‘\H’ and ‘\V’ match horizontal
and vertical space or the negation. (In UTF-8 mode, these do
match non-ASCII Unicode code points.)
Note that backslashes usually need to be doubled/protected in R input, e.g. you would use "\\h" to match horizontal space.
From ?Quotes:
Backslash is used to start an escape sequence inside character
constants. Escaping a character not in the following table is an
error.
\n newline
\r carriage return
\t tab
As others comment above, you may need a little more help if you're getting started with regular expressions for the first time. This is a little bit off-topic for StackOverflow (links to off-site resources), but there are some links to regular expression resources at the bottom of the gsubfn package overview. Or Google "regular expression tutorial" ...
What does ?: in a regular expression mean?
It means that it is a non-capturing group. After a successful match, the first (\d*) will be captured in $1 and the second in $2, while (?: \D.*?) will not be captured at all.
$string =~ m/^(\d*)(?: \D.*?)(\d*)$/
From perldoc perlretut
Non-capturing groupings
A group that is required to bundle a set of alternatives may or may not be useful as a capturing group. If it isn't, it just creates a superfluous addition to the set of available capture group values, inside as well as outside the regexp. Non-capturing groupings, denoted by (?:regexp), still allow the regexp to be treated as a single unit, but don't establish a capturing group at the same time.
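The effect on group numbering is easy to see by running the same pattern with and without ?:. This is a Python sketch rather than Perl, and the input string is an assumed example:

```python
import re

line = "123 abc 456"  # assumed sample input

# With (?:...), the middle part is grouped but does not get a capture slot.
non_capturing = re.match(r"^(\d*)(?: \D.*?)(\d*)$", line).groups()

# With plain parentheses, the middle part becomes group 2 and shifts the last group to 3.
capturing = re.match(r"^(\d*)( \D.*?)(\d*)$", line).groups()

print(non_capturing)  # ('123', '456')
print(capturing)      # ('123', ' abc ', '456')
```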