Replace URLs in Text With HTML Links

Let's look at the requirements. You have some user-supplied plain text, which you want to display with hyperlinked URLs.

  1. The "http://" protocol prefix should be optional.
  2. Both domains and IP addresses should be accepted.
  3. Any valid top-level domain should be accepted, e.g. .aero and .xn--jxalpdlp.
  4. Port numbers should be allowed.
  5. URLs must be allowed in normal sentence contexts. For instance, in "Visit stackoverflow.com.", the final period is not part of the URL.
  6. You probably want to allow "https://" URLs as well, and perhaps other schemes too.
  7. As always when displaying user-supplied text in HTML, you want to prevent cross-site scripting (XSS). Also, ampersands in URLs should be correctly escaped as &amp; (see the sketch after this list).
  8. You probably don't need support for IPv6 addresses.
  9. Edit: As noted in the comments, support for email addresses is definitely a plus.
  10. Edit: Only plain text input is to be supported – HTML tags in the input should not be honoured. (The Bitbucket version supports HTML input.)
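
As a tiny illustration of requirement 7 (a minimal sketch; the sample string is made up), htmlspecialchars on its own is enough to defuse injected markup and escape ampersands:

<?php
// Untrusted input must be HTML-escaped before display: & becomes &amp;
// and < becomes &lt;, so injected tags render as harmless text.
$untrusted = "<script>alert(1)</script> & co.";
echo htmlspecialchars($untrusted, ENT_QUOTES, 'UTF-8');
// Output: &lt;script&gt;alert(1)&lt;/script&gt; &amp; co.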

Edit: Check out GitHub for the latest version, with support for email addresses, authenticated URLs, URLs in quotes and parentheses, HTML input, as well as an updated TLD list.

Here's my take:

<?php
$text = <<<EOD
Here are some URLs:
stackoverflow.com/questions/1188129/pregreplace-to-detect-html-php
Here's the answer: http://www.google.com/search?rls=en&q=42&ie=utf-8&oe=utf-8&hl=en. What was the question?
A quick look at http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax is helpful.
There is no place like 127.0.0.1! Except maybe http://news.bbc.co.uk/1/hi/england/surrey/8168892.stm?
Ports: 192.168.0.1:8080, https://example.net:1234/.
Beware of Greeks bringing internationalized top-level domains: xn--hxajbheg2az3al.xn--jxalpdlp.
And remember.Nobody is perfect.

<script>alert('Remember kids: Say no to XSS-attacks! Always HTML escape untrusted input!');</script>
EOD;

$rexProtocol = '(https?://)?';
$rexDomain   = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$rexPort     = '(:[0-9]{1,5})?';
$rexPath     = '(/[!$-/0-9:;=@_\'a-zA-Z\x7f-\xff]*?)?';
$rexQuery    = '(\?[!$-/0-9:;=@_\'a-zA-Z\x7f-\xff]+?)?';
$rexFragment = '(#[!$-/0-9:;=@_\'a-zA-Z\x7f-\xff]+?)?';

// Solution 1:

function callback($match)
{
    // Prepend http:// if no protocol was specified.
    $completeUrl = $match[1] ? $match[0] : "http://{$match[0]}";

    return '<a href="' . $completeUrl . '">'
         . $match[2] . $match[3] . $match[4] . '</a>';
}

print "<pre>";
print preg_replace_callback("&\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\s|$))&",
'callback', htmlspecialchars($text));
print "</pre>";
  • To properly escape < and & characters, I throw the whole text through htmlspecialchars before processing. This is not ideal, as the HTML escaping can cause misdetection of URL boundaries (see the sketch after this list).
  • As demonstrated by the "And remember.Nobody is perfect." line (in which remember.Nobody is treated as a URL because of the missing space), further checking of valid top-level domains might be in order.
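
Here is the boundary problem from the first bullet in isolation (a minimal sketch with a made-up URL):

<?php
// After htmlspecialchars, the & separating query parameters becomes &amp;,
// so a pattern running on the escaped text sees extra "amp;" characters.
$raw = 'http://example.com/?a=1&b=2';
echo htmlspecialchars($raw);
// Output: http://example.com/?a=1&amp;b=2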

Edit: The following code fixes the above two problems, but is quite a bit more verbose since I'm more or less re-implementing preg_replace_callback using preg_match.

// Solution 2:

$validTlds = array_fill_keys(explode(" ", ".aero .asia .biz .cat .com .coop .edu .gov .info .int .jobs .mil .mobi .museum .name .net .org .pro .tel .travel .ac .ad .ae .af .ag .ai .al .am .an .ao .aq .ar .as .at .au .aw .ax .az .ba .bb .bd .be .bf .bg .bh .bi .bj .bm .bn .bo .br .bs .bt .bv .bw .by .bz .ca .cc .cd .cf .cg .ch .ci .ck .cl .cm .cn .co .cr .cu .cv .cx .cy .cz .de .dj .dk .dm .do .dz .ec .ee .eg .er .es .et .eu .fi .fj .fk .fm .fo .fr .ga .gb .gd .ge .gf .gg .gh .gi .gl .gm .gn .gp .gq .gr .gs .gt .gu .gw .gy .hk .hm .hn .hr .ht .hu .id .ie .il .im .in .io .iq .ir .is .it .je .jm .jo .jp .ke .kg .kh .ki .km .kn .kp .kr .kw .ky .kz .la .lb .lc .li .lk .lr .ls .lt .lu .lv .ly .ma .mc .md .me .mg .mh .mk .ml .mm .mn .mo .mp .mq .mr .ms .mt .mu .mv .mw .mx .my .mz .na .nc .ne .nf .ng .ni .nl .no .np .nr .nu .nz .om .pa .pe .pf .pg .ph .pk .pl .pm .pn .pr .ps .pt .pw .py .qa .re .ro .rs .ru .rw .sa .sb .sc .sd .se .sg .sh .si .sj .sk .sl .sm .sn .so .sr .st .su .sv .sy .sz .tc .td .tf .tg .th .tj .tk .tl .tm .tn .to .tp .tr .tt .tv .tw .tz .ua .ug .uk .us .uy .uz .va .vc .ve .vg .vi .vn .vu .wf .ws .ye .yt .yu .za .zm .zw .xn--0zwm56d .xn--11b5bs3a9aj6g .xn--80akhbyknj4f .xn--9t4b11yi5a .xn--deba0ad .xn--g6w251d .xn--hgbk6aj7f53bba .xn--hlcj6aya9esc7a .xn--jxalpdlp .xn--kgbechtv .xn--zckzah .arpa"), true);

$position = 0;
while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\s|$))}",
                  $text, $match, PREG_OFFSET_CAPTURE, $position))
{
    list($url, $urlPosition) = $match[0];

    // Print the text leading up to the URL.
    print(htmlspecialchars(substr($text, $position, $urlPosition - $position)));

    $domain = $match[2][0];
    $port   = $match[3][0];
    $path   = $match[4][0];

    // Check that the TLD is valid - or that $domain is an IP address.
    $tld = strtolower(strrchr($domain, '.'));
    if (preg_match('{^\.[0-9]{1,3}$}', $tld) || isset($validTlds[$tld]))
    {
        // Prepend http:// if no protocol was specified.
        $completeUrl = $match[1][0] ? $url : "http://$url";

        // Print the hyperlink.
        printf('<a href="%s">%s</a>', htmlspecialchars($completeUrl),
               htmlspecialchars("$domain$port$path"));
    }
    else
    {
        // Not a valid URL.
        print(htmlspecialchars($url));
    }

    // Continue text parsing from after the URL.
    $position = $urlPosition + strlen($url);
}

// Print the remainder of the text.
print(htmlspecialchars(substr($text, $position)));

Replace all URLs in text with clickable links in PHP

function convert($input) {
    $pattern = '@(http(s)?://)?(([a-zA-Z])([-\w]+\.)+([^\s\.]+[^\s]*)+[^,.\s])@';
    return preg_replace($pattern, '<a href="http$2://$3">$0</a>', $input);
}
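
A quick, hypothetical usage example; note that the pattern performs no HTML escaping, so it should only be fed trusted input:

<?php
// Sample call to convert() above; output shown approximately.
echo convert("Visit www.example.com or https://example.org/page today.");
// Visit <a href="http://www.example.com">www.example.com</a> or
// <a href="https://example.org/page">https://example.org/page</a> today.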


How to replace plain URLs with links?

First off, rolling your own regexp to parse URLs is a terrible idea. This is a common enough problem that someone has surely written, debugged and tested a library for it, according to the RFCs. URIs are complex: check out the code for URL parsing in Node.js and the Wikipedia page on URI schemes.

There are a ton of edge cases when it comes to parsing URLs: international domain names, actual (.museum) vs. nonexistent (.etc) TLDs, weird punctuation including parentheses, punctuation at the end of the URL, IPv6 hostnames, etc.
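
For concreteness, here are a few of those edge cases as test strings (assumed examples, not an exhaustive suite) to feed to whichever linkifier you pick:

<?php
$edgeCases = [
    'Greek IDN: http://xn--hxajbheg2az3al.xn--jxalpdlp/',
    'Real but unusual TLD: visit example.museum today',
    'Nonexistent TLD: not.a.real.hostname.etc',
    'Parentheses: (see http://example.com/path_(disambiguation))',
    'Trailing punctuation: read http://example.com/page.',
    'IPv6 host: http://[2001:db8::1]/',
];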

I've looked at a ton of libraries, and there are a few worth using despite some downsides:

  • Soapbox's linkify has seen some serious effort put into it, and a major refactor in June 2015 removed the jQuery dependency. It still has issues with IDNs.
  • AnchorMe is a newcomer that claims to be faster and leaner. Some IDN issues as well.
  • Autolinker.js lists features very specifically (e.g. "Will properly handle HTML input. The utility will not change the href attribute inside anchor (<a>) tags"). I'll throw some tests at it when a demo becomes available.

Libraries that I've disqualified quickly for this task:

  • Django's urlize didn't handle certain TLDs properly (here is the official list of valid TLDs). No demo.
  • autolink-js wouldn't detect "www.google.com" without http://, so it's not quite suitable for autolinking "casual URLs" (without a scheme/protocol) found in plain text.
  • Ben Alman's linkify hasn't been maintained since 2009.

If you insist on a regular expression, the most comprehensive is the URL regexp from Component, though it will falsely detect some non-existent two-letter TLDs by the looks of it.

Replace any URLs within a string of text with clickable links in PHP

You can use a regexp to do this:

$html_links = preg_replace('"\b(https?://\S+)"', '<a href="$1">$1</a>', $text);
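
For example, on a made-up string:

<?php
$text = "See https://example.com/docs and more at http://example.org";
$html_links = preg_replace('"\b(https?://\S+)"', '<a href="$1">$1</a>', $text);
echo $html_links;
// See <a href="https://example.com/docs">https://example.com/docs</a> and
// more at <a href="http://example.org">http://example.org</a>

Bear in mind that \S+ happily swallows trailing punctuation (a sentence-ending period becomes part of the link), and the input is not HTML-escaped; the longer answers above address both issues.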

replace URLs in text with links to URLs

You can load the document up with a DOM/HTML parsing library (see html5lib), grab all text nodes, match them against a regular expression, and replace the text nodes with a regex replacement of the URI with anchors around it, using a PCRE such as:

/(https?:[;\/?\\@&=+$,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%][\;\/\?\:\@\&\=\+\$\,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%#]*|[KZ]:\\*.*\w+)/g
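
Here is a minimal sketch of that approach in PHP, using the built-in DOMDocument rather than html5lib and a deliberately simplified URL pattern (both substitutions are assumptions, for illustration only):

<?php
$doc = new DOMDocument();
$doc->loadHTML('<p>Visit http://example.com/page for details.</p>');
$xpath = new DOMXPath($doc);

// Snapshot the node list first, since the tree is modified while iterating.
foreach (iterator_to_array($xpath->query('//text()[not(ancestor::a)]')) as $node) {
    $parts = preg_split('~(https?://\S+)~', $node->nodeValue, -1,
                        PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 2) {
        continue; // no URL in this text node
    }
    $frag = $doc->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($i % 2) { // odd indices hold the captured URLs
            $a = $doc->createElement('a');
            $a->setAttribute('href', $part);
            $a->appendChild($doc->createTextNode($part));
            $frag->appendChild($a);
        } else {
            $frag->appendChild($doc->createTextNode($part));
        }
    }
    $node->parentNode->replaceChild($frag, $node);
}
echo $doc->saveHTML();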

I'm quite sure you can scour around and find some sort of utility that does this; I can't think of any off the top of my head, though.

Edit: Try using the answers here: How do I get python-markdown to additionally "urlify" links when formatting plain text?

import re

# Matches IP addresses, scheme-prefixed URLs, and bare "www."/"ftp." hosts,
# with an optional port and path component.
urlfinder = re.compile(
    r"((?:(?:news|telnet|nntp|file|http|ftp|https)://|(?:www|ftp)[-A-Za-z0-9]*\.|"
    r"(?:[0-9]{1,3}\.){3}[0-9]{1,3})"
    r"[-A-Za-z0-9.]*(?::[0-9]+)?"
    r"(?:/[-A-Za-z0-9_$.+!*(),;:@&=?/~#%]*[^]'.}>),\"])?)")

def urlify2(value):
    return urlfinder.sub(r'<a href="\1">\1</a>', value)

Call urlify2 on a string, and that should be it if you aren't dealing with a DOM object.


