Linkify Regex Function PHP Daring Fireball Method

Try this:

$pattern = '(?xi)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`\!()\[\]{};:\'".,<>?«»“”‘’]))';     
return preg_replace("!$pattern!i", "<a href=\"\\0\" rel=\"nofollow\" target=\"_blank\">\\0</a>", $str);

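PHP's preg functions do need delimiters. The i at the end makes the match case-insensitive (the inline (?xi) at the start of the pattern already sets this, so the trailing i is redundant but harmless).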

Update

If you use # as the delimiter, you won't need to escape the ! in the pattern, so you can use the original pattern string as-is (it does not contain a #): "#$pattern#i"
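
To illustrate the delimiter point, here is a minimal sketch. It uses a deliberately simplified pattern (an assumption for the demo, not the full pattern above) that contains a literal ! inside a character class, and the sample text is made up.

```php
<?php
// Simplified stand-in pattern (assumption): note the literal "!" in the class.
$pattern = 'https?://[^\s!]+';
$subject = 'See HTTP://Example.com/a! for details';

// "#" does not occur in the pattern, so it is safe as a delimiter.
preg_match("#$pattern#i", $subject, $m);
echo $m[0], "\n";

// With "!" as the delimiter, PHP ends the pattern at the first unescaped "!",
// so compilation fails: preg_match() raises a warning and returns false.
$broken = @preg_match("!$pattern!i", $subject);
var_dump($broken);

// Escaping the "!" inside the pattern makes the "!" delimiter work again.
$escaped = str_replace('!', '\!', $pattern);
preg_match("!$escaped!i", $subject, $m2);
echo $m2[0], "\n";
```

Both successful calls match the same substring; only the amount of escaping differs.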

Update 2

To ensure that the links are correct, do this:

$pattern = '(?xi)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function ($matches) {
    $input = $matches[0];
    $url = preg_match('!^https?://!i', $input) ? $input : "http://$input";
    return '<a href="' . $url . '" rel="nofollow" target="_blank">' . "$input</a>";
}, $str);

This now prepends http:// to URLs that lack a scheme, so that the browser doesn't treat them as relative links.
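
The scheme check can be pulled out into a small helper, sketched below. The name ensure_scheme is hypothetical and only used here for illustration.

```php
<?php
// Hypothetical helper: return the URL unchanged if it already starts with
// http:// or https:// (in any letter case), otherwise prefix it with http://.
function ensure_scheme(string $url): string
{
    return preg_match('#^https?://#i', $url) ? $url : "http://$url";
}

echo ensure_scheme('www.example.com'), "\n";      // http://www.example.com
echo ensure_scheme('HTTPS://example.com'), "\n";  // HTTPS://example.com
```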

URL detection in a string

Try this regular expression:

#(https?://)?([a-z0-9-]+\.)+[a-z0-9]+/?#i
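
A quick way to see what this expression does and does not catch is to run it through preg_match_all. The sketch below uses a made-up sample sentence and a hypothetical helper name, detect_urls. It suggests that the pattern stops after the host (plus an optional trailing slash) and also matches dotted tokens such as version numbers, so treat it as a loose detector.

```php
<?php
// Hypothetical helper around the expression above.
function detect_urls(string $text): array
{
    preg_match_all('#(https?://)?([a-z0-9-]+\.)+[a-z0-9]+/?#i', $text, $m);
    return $m[0]; // full matches only
}

$found = detect_urls('Visit example.com or https://www.example.org/ (see section 1.5).');
print_r($found);
// The result holds 'example.com', 'https://www.example.org/' and also '1.5'.
```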

Regular expression pattern to match url with or without http(s) and without tags

Never mind, I found the solution already.

With this solution, everything works fine:

$pattern = '(?xi)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`\!()\[\]{};:\'".,<>?«»“”‘’]))';     
return preg_replace("!$pattern!i", "<a href=\"\\0\" rel=\"nofollow\" target=\"_blank\">\\0</a>", $str);

Mitigate XSS attacks when building links

Your regular expression looks for URLs that start with http or https. That expression seems to be relatively safe, as it does not detect anything that is not a URL.

The XSS vulnerability comes from how the URL is escaped when it is used as an HTML attribute value. You have to make sure that the URL cannot prematurely close the quoted attribute and then add extra attributes to the HTML tag, which is what @Rook has been mentioning.

So I cannot really think of a way an XSS attack could be performed against the following code, which is what @tobyodavies suggested, but without urlencode, which does something else:

$pattern = '(?xi)\b((?:(http)s?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
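return preg_replace_callback("#$pattern#i", function ($matches) {
    $input = $matches[0];
    // Group 2 captures "http" only when the match starts with a scheme.
    // It is missing or empty otherwise, and with the i flag it may be "HTTP",
    // so test for its presence instead of comparing against 'http'.
    $url = !empty($matches[2]) ? $input : "http://$input";
    return '<a href="' . htmlspecialchars($url) . '" rel="nofollow" target="_blank">' . "$input</a>";
}, $str);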

Note that I have also added a small shortcut for checking the http prefix: the scheme is captured in its own group inside the pattern.

Now the anchor links that you generate are safe.

However, you should also sanitize the rest of the text. I suppose that you don't want to allow any HTML at all and want to display any HTML as plain text.
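
One way to do that, sketched under assumptions: locate the URLs first, then escape the text between matches and the URLs separately, so that no unescaped input reaches the output. The function name linkify_escaped and the simplified URL pattern are hypothetical stand-ins, not the pattern above.

```php
<?php
// Hypothetical sketch: escape everything, link only the matched URLs.
function linkify_escaped(string $text): string
{
    // Simplified stand-in pattern (assumption): http(s) URLs only.
    $pattern = '#\bhttps?://[^\s()<>"]+#i';
    $out = '';
    $pos = 0;

    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[0] as [$url, $offset]) {
        // Escape the plain text between the previous match and this one.
        $out .= htmlspecialchars(substr($text, $pos, $offset - $pos), ENT_QUOTES, 'UTF-8');
        // Escape the URL for both the attribute value and the link text.
        $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
        $out .= '<a href="' . $safe . '" rel="nofollow" target="_blank">' . $safe . '</a>';
        $pos = $offset + strlen($url);
    }
    // Escape whatever follows the last match.
    return $out . htmlspecialchars(substr($text, $pos), ENT_QUOTES, 'UTF-8');
}

echo linkify_escaped('<b>Hi</b> see http://example.com/?a=1&b=2 now');
```

Escaping the whole string before running the URL regex is simpler, but entities such as &lt; can then be swallowed into a match, which is why this sketch escapes the pieces after matching.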

Match URL pattern in PHP using a regular expression

I'd use a different regex to be honest. Like this one that Gruber posted in 2009:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

Or this updated version that Gruber posted in 2010 (thanks, @IMSoP):

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
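
To try the 2010 pattern from PHP, something like the following sketch should work. The nowdoc avoids having to escape the quotes inside the pattern, ~ is used as the delimiter because it does not occur in the pattern, and the u modifier is my assumption for handling the typographic quote characters as UTF-8. The helper name find_urls and the sample sentence are made up.

```php
<?php
// Gruber's 2010 pattern, copied verbatim into a nowdoc (no escaping needed).
$gruber = <<<'REGEX'
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
REGEX;

// Hypothetical helper: return every full match of $pattern in $text.
function find_urls(string $text, string $pattern): array
{
    preg_match_all('~' . $pattern . '~u', $text, $m);
    return $m[0] ?? [];
}

$urls = find_urls('Read http://example.com/blah_(wikipedia), then www.example.org.', $gruber);
print_r($urls);
```

In this sample the trailing comma and full stop are left out of the matches, while the balanced parentheses stay part of the first URL.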

Update function to recognize links

I won't go down the rabbit hole about constructing a world-conquering regex pattern to extract all valid urls the world can dream up including unicode while denying urls with valid characters but illogical structures. (I'll go with Gumbo and move on.)

For a regex demo see: https://regex101.com/r/HFCP1Z/1/

Things to note:

  • If a url is matched, there is no capture group, so $m[1] isn't generated. If a user/hash tag is matched, the fullstring match and capture group 1 are generated. If an emoji is matched, the fullstring match is populated, the capture group 1 element is empty (but declared, because PHP generates $m as an indexed array with no gaps), and capture group 2 holds the emoji's captured substring.

  • You need to be sure that you don't accidentally replace part of a url which contains a qualifying hashtag/usertag substring. (Currently, the other answers don't consider this vulnerability.) I am going to prevent that scenario by performing a single pass over the input and consuming whole url substrings before the other patterns get a chance at it.
    (notice: http://example.com/@dave and http://example.com?asdf=1234#anchor)

  • There are two reasons why I am declaring your hashtag/usertag lookup array as a constant.

    1. It does not vary, so it needn't be a variable.
    2. It enjoys global scope, so the use() syntax is not necessary inside of preg_replace_callback().
  • You should avoid adding inline styling to your tags. I recommend assigning a class so that you can simply update a single portion of the stylesheet when you decide to amend/extend the styling at a later time.
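
The first note, about how $m is populated, can be checked with a small sketch. The pattern below is a shortened stand-in with the same group layout (an assumption for the demo), and match_shape is a hypothetical helper name.

```php
<?php
// Stand-in pattern: no group for urls, group 1 for tags, group 2 for emoji.
function match_shape(string $sample): array
{
    preg_match('~\bhttps?://\S+|[@#](\w+)|U\+([A-F\d]{5})~i', $sample, $m);
    return $m;
}

var_export(match_shape('http://example.com')); // only element 0
var_export(match_shape('#hash'));              // elements 0 and 1
var_export(match_shape('U+23232'));            // element 1 is an empty string
```

Trailing groups that did not take part in the match are left out of the array, while an unmatched group in the middle shows up as an empty string. That is why isset() is enough to tell the three cases apart.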

Code:

define('PINGTAGS', [
    '#' => 'hashtag.php?hashtag',
    '@' => 'user.php?user'
]);

function convert_text($str) {
    return preg_replace_callback(
        "~(?i)\bhttps?[-\w.\~:/?#[\]@!$&'()*+,;=]+|[@#](\w+)|U\+([A-F\d]{5})~",
        function($m) {
            // var_export($m); // see for yourself
            if (!isset($m[1])) { // url
                return sprintf('<a href="%s">%s</a>', $m[0], $m[0]);
            }
            if (!isset($m[2])) { // pingtag
                return sprintf('<a href="%s=%s">%s</a>', PINGTAGS[$m[0][0]], $m[1], $m[0]);
            }
            return "<span class=\"emoji\">&#x{$m[2]};</span>"; // emoji
        },
        $str);
}

echo convert_text(
<<<STRING
This is a @ping and a #hash.
This is a www.example.com, this is http://example.com?asdf=1234#anchor
https://www.example.net/a/b/c/?g=5&awesome=foobar# U+23232 http://www5.example.com
https://sub.sub.www.example.org/ @pong@pug#tagged
http://example.com/@dave
more http://example.com/more_(than)_one_(parens)
andU+98765more http://example.com/blah_(wikipedia)#cite-1
and more http://example.com/blah_(wikipedia)_blah#cite-1
and more http://example.com/(something)?after=parens
STRING
);

Raw Output:

This is a <a href="user.php?user=ping">@ping</a> and a <a href="hashtag.php?hashtag=hash">#hash</a>.
This is a www.example.com, this is <a href="http://example.com?asdf=1234#anchor">http://example.com?asdf=1234#anchor</a>
<a href="https://www.example.net/a/b/c/?g=5&awesome=foobar#">https://www.example.net/a/b/c/?g=5&awesome=foobar#</a> <span class="emoji">&#x23232;</span> <a href="http://www5.example.com">http://www5.example.com</a>
<a href="https://sub.sub.www.example.org/">https://sub.sub.www.example.org/</a> <a href="user.php?user=pong">@pong</a><a href="user.php?user=pug">@pug</a><a href="hashtag.php?hashtag=tagged">#tagged</a>
<a href="http://example.com/@dave">http://example.com/@dave</a>
more <a href="http://example.com/more_(than)_one_(parens)">http://example.com/more_(than)_one_(parens)</a>
and<span class="emoji">&#x98765;</span>more <a href="http://example.com/blah_(wikipedia)#cite-1">http://example.com/blah_(wikipedia)#cite-1</a>
and more <a href="http://example.com/blah_(wikipedia)_blah#cite-1">http://example.com/blah_(wikipedia)_blah#cite-1</a>
and more <a href="http://example.com/(something)?after=parens">http://example.com/(something)?after=parens</a>

Stackoverflow-Rendered Output:

This is a @ping and a #hash.
This is a www.example.com, this is http://example.com?asdf=1234#anchor
https://www.example.net/a/b/c/?g=5&awesome=foobar# 𣈲 http://www5.example.com
https://sub.sub.www.example.org/ @pong@pug#tagged
http://example.com/@dave
more http://example.com/more_(than)one(parens)
and򘝥more http://example.com/blah_(wikipedia)#cite-1
and more http://example.com/blah_(wikipedia)_blah#cite-1
and more http://example.com/(something)?after=parens

p.s. The hash and user tags aren't highlighted here, but they are the local links that you asked for.


