PHP Validation/Regex For Url

PHP validation/regex for URL

I've used this on a few projects. I don't believe I've run into issues, but I'm sure it's not exhaustive:

// $1 = full URL, $3 = URL without the scheme, $4 = trailing delimiter
$text = preg_replace(
    '#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i',
    '<a href="$1" target="_blank">$3</a>$4',
    $text
);

Most of the random junk at the end is there to deal with situations like http://domain.example. in a sentence (to avoid matching the trailing period). I'm sure it could be cleaned up, but since it worked, I've more or less just copied it over from project to project.
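As a rough sketch of how the snippet behaves on a sentence with a trailing period, the same pattern can be wrapped in a small helper. The function name linkify and the sample sentence are made up for illustration, and the replacement string is written so that only the anchor tag is emitted.

```php
<?php
// Hypothetical helper around the pattern above; not part of the original answer.
function linkify(string $text): string
{
    return preg_replace(
        '#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i',
        '<a href="$1" target="_blank">$3</a>$4',
        $text
    );
}

// The period after "example" ends the sentence and stays outside the link.
$html = linkify('See http://domain.example. Thanks');
echo $html, PHP_EOL;
// See <a href="http://domain.example" target="_blank">domain.example</a>. Thanks
```

Note that the link text is inserted as-is, so this sketch assumes the input has no markup that needs escaping.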

validate url with regular expressions

Try this expression:

/[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi

It will accept all the cases that you have mentioned above.

regex to check valid url either http or www

This example may help you:

<?php
// Variable to check
$url = "http://www.w3schools.com";

// Validate URL: filter_var() returns the URL on success and false on failure
if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
    echo("$url is a valid URL");
} else {
    echo("$url is not a valid URL");
}
?>

PHP regex match all urls

This one correctly matches everything you posted:

preg_match_all('#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#si', $targetString, $result);
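To see what ends up in $result, here is a small run of the same call. The sample string is invented for illustration.

```php
<?php
// Sample input is invented for illustration.
$targetString = 'Visit http://example.com/path and www.test.org today';

preg_match_all(
    '#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#si',
    $targetString,
    $result
);

// $result[0] holds the full matches; $result[1] holds the optional path group.
print_r($result[0]);
// Array ( [0] => http://example.com/path [1] => www.test.org )
```

Keep in mind that [a-z]{2,4} only allows top-level domains of two to four letters, so longer ones such as .museum would not be matched by this pattern.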

Validating a URL in PHP

Yes, there is! Use filter_var:

if (filter_var($url, FILTER_VALIDATE_URL) !== false) ...

FILTER_VALIDATE_URL validates URLs according to RFC 2396.
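Because the filter accepts any scheme, it can help to check the scheme separately when only web links are wanted. The following is a minimal sketch under that assumption; the helper name isHttpUrl and the list of allowed schemes are not from the original answer.

```php
<?php
// Hypothetical helper: accepts only http/https URLs that also pass FILTER_VALIDATE_URL.
function isHttpUrl(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

var_dump(isHttpUrl('http://www.example.com'));  // bool(true)
var_dump(isHttpUrl('www.example.com'));         // bool(false): no scheme
var_dump(isHttpUrl('ftp://example.com/file'));  // bool(false): scheme not in the allowed list
```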

regular expression to validate URL not working correctly in PHP

Wow, that is a big expression. I found several faults in it, and I shall hopefully explain them to you. Let's break it apart:

$pattern ="/

Here was your first mistake. As a forward slash is used in multiple sections of a URL, you should use a different delimiter. I would suggest a tilde ~, as this is not used in a URL very often. This would mean you don't have to keep escaping the forward slash everywhere with \/.
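A small comparison of the two delimiter choices, using a shortened pattern and a sample URL chosen only for illustration:

```php
<?php
$url = 'http://www.example.com/';

// Slash delimiter: every literal "/" in the pattern needs a backslash.
$withSlashes = preg_match('/^https?:\/\/www\./', $url);

// Tilde delimiter: the same pattern with no escaping of "/".
$withTildes = preg_match('~^https?://www\.~', $url);

var_dump($withSlashes, $withTildes); // int(1) int(1)
```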

^(http|https|ftp)\:\/\/www\.([a-zA-Z0-9\.\-]+

This character class contains the next error. Within a character class, a dot just means a dot. There is no need to escape it. Furthermore, with placing the dash at the end, it also does not need escaping as it cannot possibly mean a range. The character class can be shortened to become [a-zA-Z0-9.-]+.

(\:[a-zA-Z0-9\.&amp;%\$\-]+

Here we have the next error, &amp; within the character class. This will match an & or an a or an m or a p or a ;, not just an &. You don't need to convert it to the HTML entity, as doing so means matching any of the characters that the entity contains. And using the previous knowledge, you don't need to escape the dot, or the dash if it is at the end. You also don't need to escape the dollar sign, as in a character class it just means a dollar. Remember, within a character class, all meta characters are just standard characters except the caret ^, the backslash \, the closing square bracket ], the dash - (but this can be left unescaped if it's at the end), and whatever you choose as your delimiter, e.g. tilde ~. This character class can then become [a-zA-Z0-9.&%$-]+.
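The character class rules can be checked quickly. The test strings and the ^...$ anchors below are made up for the demonstration and are not part of the original pattern.

```php
<?php
// Inside [...] the dot, ampersand, percent and dollar are literals;
// the dash is literal because it sits at the end of the class.
$class = '~^[a-zA-Z0-9.&%$-]+$~';
$ok  = preg_match($class, 'a.b&c%d$e-f'); // every character is in the class
$bad = preg_match($class, 'a b');         // the space is not in the class

// A class built from the HTML entity matches each of its characters, including "p".
$entity = preg_match('~^[&amp;]$~', 'p');

var_dump($ok, $bad, $entity); // int(1) int(0) int(1)
```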

)*@)*(\.){1}

Part of this might be an error, or it might not be. Basically, is there any need to capture the dot here? If there is no need to capture it, drop the brackets. However, there is a definite error in the repetition. {1} is completely and utterly superfluous: every token is matched exactly once unless a quantifier says otherwise, so it just makes the pattern messy. The above can be shortened to )*@)*\. (ending in a plain escaped dot).

((25[0-5]|2[0-4][0-9]|[0-1]{1}

Again, the {1} is not needed. Remove it to get ((25[0-5]|2[0-4][0-9]|[0-1]

[0-9]{2}|[1-9]{1}[0-9]{1}

And again, twice. This becomes [0-9]{2}|[1-9][0-9].

You keep doing this; the next block of code you have can be shortened:

|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])

into:

|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])
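A quick way to confirm that dropping {1} changes nothing is to compare one octet group in both spellings over a range of inputs. The ^...$ anchors and the 0 to 999 test range are additions for this sketch.

```php
<?php
// One octet group from the pattern, with and without the redundant {1} quantifiers.
$verbose = '~^(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)$~';
$short   = '~^(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)$~';

$mismatches = 0;
for ($n = 0; $n <= 999; $n++) {
    $s = (string) $n;
    // Both calls return 1 or 0; any difference would mean the rewrite changed behavior.
    if (preg_match($verbose, $s) !== preg_match($short, $s)) {
        $mismatches++;
    }
}

echo "mismatches: $mismatches", PHP_EOL; // mismatches: 0
```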

It's not amazingly better, but every little helps. Next:

|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+

The two character classes can be optimized, |([a-zA-Z0-9-]+\.)*[a-zA-Z0-9-]+.

\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2})

This is very restrictive, but I assume you have it like this for a reason so I'll leave it.

)(\:[0-9]+)*(/

And here is the cause of your error: you did not escape the forward slash. However, I am going to leave it, as using a different delimiter avoids the problem and also tidies up your pattern.

($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$/";

That character class can be greatly shortened now knowing that we don't need to escape everything within them. It can become, ($|[a-zA-Z0-9.,?'\\+&%$#=~_-]+))*$/";.

Using everything we now know, your pattern can be made much prettier and easier to handle.

It can become instead:

$pattern = "~^(http|https|ftp)://www\.([a-zA-Z0-9.-]+(:[a-zA-Z0-9.&%$-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])|([a-zA-Z0-9-]+\.)+(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(:[0-9]+)*(/($|[a-zA-Z0-9.,?'\\+&%$#=\~_-]+))*$~";

Now that you have a smaller expression, finding faults and more customization should be a little easier.

Just a quick note

I keep noticing that you have used the following syntax at the beginning of some groupings, (\:. I have removed the backslash as it is not needed for a colon. However, were you trying to make it so the group was not captured? If so, the syntax for that is, (?:.
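The difference between a capturing and a non-capturing group shows up in the matches array. The host-and-port pattern below is a simplified stand-in chosen for illustration, not a piece of the original expression.

```php
<?php
$input = 'example.com:8080';

// Capturing group: the port lands in $withCapture[2].
preg_match('~^([a-z.]+)(:\d+)?$~', $input, $withCapture);

// Non-capturing group: the port is still matched but not stored.
preg_match('~^([a-z.]+)(?::\d+)?$~', $input, $withoutCapture);

var_dump(count($withCapture), count($withoutCapture)); // int(3) int(2)
```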

Edit: You can also shorten the pattern further by using shorthand character classes:

\d = [0-9]

\w = [a-zA-Z0-9_]

Adding i after the closing delimiter of the pattern turns on case insensitivity too, which means that instead of writing [a-zA-Z] you can just write [a-z].

Also, the http|https can just become https?

So your pattern could be shortened further too:

$pattern = "~^(https?|ftp)://www\.([a-z\d.-]+(:[a-z\d.&%$-]+)*@)*((25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9])\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9]|0)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9]|0)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|\d)|([a-z\d-]+\.)+(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-z]{2}))(:\d+)*(/($|[\w.,?'\\+&%$#=\~-]+))*$~i";
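A short trial of the shortened pattern, with sample URLs picked for illustration. It also shows a restriction carried over from the original expression: the host must start with www.

```php
<?php
$pattern = "~^(https?|ftp)://www\.([a-z\d.-]+(:[a-z\d.&%$-]+)*@)*((25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9])\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9]|0)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9]|0)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|\d)|([a-z\d-]+\.)+(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-z]{2}))(:\d+)*(/($|[\w.,?'\\+&%$#=\~-]+))*$~i";

// Matches: scheme, "www.", a host ending in an allowed TLD, then a path segment.
$withWww = preg_match($pattern, 'http://www.example.com/path');

// Does not match: the pattern requires "www." right after the scheme.
$withoutWww = preg_match($pattern, 'http://example.com');

var_dump($withWww, $withoutWww); // int(1) int(0)
```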

PHP URL validation

^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)

Explanation

^                # start-of-line
(?: # begin non-capturing group
https? # "http" or "https"
:// # "://"
)? # end non-capturing group, make optional
(?: # start non-capturing group
[a-z0-9-]+\. # a name part (numbers, ASCII letters, dashes) & a dot
)* # end non-capturing group, match as often as possible
( # begin group 1 (this will be the domain name)
(?: # start non-capturing group
[a-z0-9-]+\. # a name part, same as above
) # end non-capturing group
[a-z]+ # the TLD
) # end group 1

http://rubular.com/r/g6s9bQpNnC
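In PHP the expression could be used roughly like this. The ~ delimiter, the i modifier, and the sample URL are assumptions of this sketch.

```php
<?php
$pattern = '~^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)~i';

$domain = null;
if (preg_match($pattern, 'http://www.example.com/page', $m)) {
    // Group 1 is the last name part plus the TLD.
    $domain = $m[1];
}

echo $domain, PHP_EOL; // example.com
```

Since group 1 is always the last two labels, a host such as www.example.co.uk yields co.uk rather than example.co.uk.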

Validating Youtube URL using Regex

There are a lot of redundancies in this regular expression of yours (and also, the leaning toothpick syndrome). This, though, should produce results:

$rx = '~
^(?:https?://)? # Optional protocol
(?:www[.])? # Optional sub-domain
(?:youtube[.]com/watch[?]v=|youtu[.]be/) # Mandatory domain name (w/ query string in .com)
([^&]{11}) # Video id of 11 characters as capture group 1
~x';

$has_match = preg_match($rx, $url, $matches);

// if matching succeeded, $matches[1] would contain the video ID

Some notes:

  • use the tilde character ~ as delimiter, to avoid LTS
  • use [.] instead of \. to improve visual legibility and avoid LTS. ("Special" characters - such as the dot . - have no effect in character classes (within square brackets))
  • to make regular expressions more "readable" you can use the x modifier (which has further implications; see the docs on Pattern modifiers), which also allows for comments in regular expressions
  • capturing can be suppressed using non-capturing groups: (?: <pattern> ). This makes the expression more efficient.
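Putting the notes together, the expression can be wrapped in a small function. The name youtubeIdFromUrl and the sample video ID abcdefghijk are placeholders for this sketch.

```php
<?php
// Same expression as above, wrapped in a hypothetical helper function.
function youtubeIdFromUrl(string $url): ?string
{
    $rx = '~
        ^(?:https?://)?                            # Optional protocol
        (?:www[.])?                                # Optional sub-domain
        (?:youtube[.]com/watch[?]v=|youtu[.]be/)   # Domain (with query string for .com)
        ([^&]{11})                                 # Video id of 11 characters
    ~x';

    // preg_match() returns 1 on a match; group 1 then holds the video ID.
    return preg_match($rx, $url, $matches) === 1 ? $matches[1] : null;
}

var_dump(youtubeIdFromUrl('https://www.youtube.com/watch?v=abcdefghijk')); // string(11) "abcdefghijk"
var_dump(youtubeIdFromUrl('http://youtu.be/abcdefghijk'));                 // string(11) "abcdefghijk"
var_dump(youtubeIdFromUrl('https://example.com/watch?v=abcdefghijk'));     // NULL
```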

Optionally, to extract values from a (more or less complete) URL, you might want to make use of parse_url():

$url = 'http://youtube.com/watch?v=VIDEOID';
$parts = parse_url($url);
print_r($parts);

Output:

Array
(
[scheme] => http
[host] => youtube.com
[path] => /watch
[query] => v=VIDEOID
)

Validating the domain name and extracting the video ID is left as an exercise to the reader.
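One possible way to do that exercise is sketched below, combining parse_url() with parse_str(). The function name and the list of accepted hosts are assumptions of this sketch, and it does not check the length or character set of the ID.

```php
<?php
// Hypothetical helper; the accepted hosts are an assumption of this sketch.
function youtubeIdViaParseUrl(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }

    $host = strtolower($parts['host']);
    if ($host === 'youtu.be') {
        // Short form: the ID is the path without its leading slash.
        $id = ltrim($parts['path'] ?? '', '/');
        return $id !== '' ? $id : null;
    }
    if ($host === 'youtube.com' || $host === 'www.youtube.com') {
        // Long form: the ID is the "v" query parameter.
        parse_str($parts['query'] ?? '', $query);
        return isset($query['v']) && is_string($query['v']) ? $query['v'] : null;
    }
    return null;
}

var_dump(youtubeIdViaParseUrl('http://youtube.com/watch?v=VIDEOID')); // string(7) "VIDEOID"
var_dump(youtubeIdViaParseUrl('https://youtu.be/VIDEOID'));           // string(7) "VIDEOID"
var_dump(youtubeIdViaParseUrl('http://example.com/watch?v=VIDEOID')); // NULL
```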


I gave in to the comment war below; thanks to Toni Oriol, the regular expression now works on short (youtu.be) URLs as well.

Regular expression pattern to match URL with or without http://www

For matching all kinds of URLs, the following code should work:

<?php
// Single-quoted strings keep PHP from interpolating "$_" inside the character classes.
$regex  = '((https?|ftp)://)?'; // SCHEME
$regex .= '([a-z0-9+!*(),;?&=$_.-]+(:[a-z0-9+!*(),;?&=$_.-]+)?@)?'; // User and Pass
$regex .= '([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3})))'; // Host or IP
$regex .= '(:[0-9]{2,5})?'; // Port
$regex .= '(/([a-z0-9+$_%-]\.?)+)*/?'; // Path
$regex .= '(\?[a-z+&$_.-][a-z0-9;:@&%=+/$_.-]*)?'; // GET Query
$regex .= '(#[a-z_.-][a-z0-9+$%_.-]*)?'; // Anchor
?>

Then, the correct way to check against the regex is as follows:

<?php
if (preg_match("~^$regex$~i", 'www.example.com/etcetc', $m)) {
    var_dump($m);
}

if (preg_match("~^$regex$~i", 'http://www.example.com/etcetc', $m)) {
    var_dump($m);
}
?>

Courtesy: Comments made by splattermania in the PHP manual: http://php.net/manual/en/function.preg-match.php



