PHP Linkify Links in Content

I have an open source project on GitHub, LinkifyURL, which you may want to consider. Its linkify() function plucks URLs from text and converts them to links. Note that this is not a trivial task to do correctly! (See: The Problem With URLs, and be sure to read the comment thread to grasp all the things that can go wrong.)

If you really need to NOT linkify specific domains (i.e. vimeo and youtube), here is a modified PHP function linkify_filtered (in the form of a working test script) that does what you need:

<?php // test.php 20110313_1200

function linkify_filtered($text) {
    $url_pattern = '/# Rev:20100913_0900 github.com\/jmrware\/LinkifyURL
    # Match http & ftp URL that is not already linkified.
      # Alternative 1: URL delimited by (parentheses).
      (\()                     # $1  "(" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $2: URL.
      (\))                     # $3: ")" end delimiter.
    | # Alternative 2: URL delimited by [square brackets].
      (\[)                     # $4: "[" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $5: URL.
      (\])                     # $6: "]" end delimiter.
    | # Alternative 3: URL delimited by {curly braces}.
      (\{)                     # $7: "{" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $8: URL.
      (\})                     # $9: "}" end delimiter.
    | # Alternative 4: URL delimited by <angle brackets>.
      (<|&(?:lt|\#60|\#x3c);)  # $10: "<" start delimiter (or HTML entity).
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $11: URL.
      (>|&(?:gt|\#62|\#x3e);)  # $12: ">" end delimiter (or HTML entity).
    | # Alternative 5: URL not delimited by (), [], {} or <>.
      (                        # $13: Prefix proving URL not already linked.
        (?: ^                  # Can be a beginning of line or string, or
        | [^=\s\'"\]]          # a non-"=", non-quote, non-"]", followed by
        ) \s*[\'"]?            # optional whitespace and optional quote;
      | [^=\s]\s+              # or... a non-equals sign followed by whitespace.
      )                        # End $13. Non-prelinkified-proof prefix.
      ( \b                     # $14: Other non-delimited URL.
        (?:ht|f)tps?:\/\/      # Required literal http, https, ftp or ftps prefix.
        [a-z0-9\-._~!$\'()*+,;=:\/?#[\]@%]+ # All URI chars except "&" (normal*).
        (?:                    # Either on a "&" or at the end of URI.
          (?!                  # Allow a "&" char only if not start of an...
            &(?:gt|\#0*62|\#x0*3e);                  # HTML ">" entity, or
          | &(?:amp|apos|quot|\#0*3[49]|\#x0*2[27]); # a [&\'"] entity if
            [.!&\',:?;]?        # followed by optional punctuation then
            (?:[^a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]|$)  # a non-URI char or EOS.
          ) &                  # If neg-assertion true, match "&" (special).
          [a-z0-9\-._~!$\'()*+,;=:\/?#[\]@%]* # More non-& URI chars (normal*).
        )*                     # Unroll-the-loop (special normal*)*.
        [a-z0-9\-_~$()*+=\/#[\]@%]  # Last char can\'t be [.!&\',;:?]
      )                        # End $14. Other non-delimited URL.
    /imx';
//    $url_replace = '$1$4$7$10$13<a href="$2$5$8$11$14">$2$5$8$11$14</a>$3$6$9$12';
//    return preg_replace($url_pattern, $url_replace, $text);
    $url_replace = '_linkify_filter_callback';
    return preg_replace_callback($url_pattern, $url_replace, $text);
}
function _linkify_filter_callback($m)
{ // Filter out youtube and vimeo domains.
    $pre  = $m[1].$m[4].$m[7].$m[10].$m[13];
    $url  = $m[2].$m[5].$m[8].$m[11].$m[14];
    $post = $m[3].$m[6].$m[9].$m[12];
    if (preg_match('/\b(?:youtube|vimeo)\.com\b/i', $url)) { // "i": the URL pattern is case-insensitive too.
        return $pre . $url . $post;
    } // else linkify...
    return $pre .'<a href="'. $url .'">' . $url .'</a>' .$post;
}

// Create some test data.
$data = 'Plain URLs (not delimited):
foo http://example.com bar...
foo http://example.com:80 bar...
foo http://example.com:80/path/ bar...
foo http://example.com:80/path/file.txt bar...
foo http://example.com:80/path/file.txt?query=val&var2=val2 bar...
foo http://example.com:80/path/file.txt?query=val&var2=val2#fragment bar...
foo http://example.com/(file\'s_name.txt) bar... (with \' and (parentheses))
foo http://[2001:0db8:85a3:08d3:1319:8a2e:0370:7348] bar... ([IPv6 literal])
foo http://[2001:0db8:85a3:08d3:1319:8a2e:0370:7348]/file.txt bar... ([IPv6] with path)
foo http://youtube.com bar...
foo http://youtube.com:80 bar...
foo http://youtube.com:80/path/ bar...
foo http://youtube.com:80/path/file.txt bar...
foo http://youtube.com:80/path/file.txt?query=val&var2=val2 bar...
foo http://youtube.com:80/path/file.txt?query=val&var2=val2#fragment bar...
foo http://youtube.com/(file\'s_name.txt) bar... (with \' and (parentheses))
foo http://vimeo.com bar...
foo http://vimeo.com:80 bar...
foo http://vimeo.com:80/path/ bar...
foo http://vimeo.com:80/path/file.txt bar...
foo http://vimeo.com:80/path/file.txt?query=val&var2=val2 bar...
foo http://vimeo.com:80/path/file.txt?query=val&var2=val2#fragment bar...
foo http://vimeo.com/(file\'s_name.txt) bar... (with \' and (parentheses))
';
// Verify it works...
echo(linkify_filtered($data) ."\n");

?>

This employs a callback function to do the filtering. Yes, the regex is complex (but so is the problem, as it turns out!). You can see the interactive JavaScript version of linkify() in action here: URL Linkification (HTTP/FTP).
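One thing to be aware of: the filter callback tests the whole URL, so a URL on another site that merely mentions youtube.com in its path or query string is left unlinked as well. If that matters to you, a minimal sketch of a host-only check could look like this. The helper name is_blocked_host() and its default domain list are my own assumptions for illustration; they are not part of LinkifyURL.

```php
<?php
// Hypothetical helper (not part of LinkifyURL): returns true when the URL's
// host is one of the blocked domains, or a subdomain of one.
function is_blocked_host(string $url, array $blocked = ['youtube.com', 'vimeo.com']): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // No usable host: treat the URL as not blocked.
    }
    $host = strtolower($host);
    foreach ($blocked as $domain) {
        // Match "vimeo.com" itself or a subdomain such as "player.vimeo.com",
        // but not an unrelated host that only ends in the same letters.
        if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}
```

Inside the callback you would then call is_blocked_host($url) in place of the preg_match() test.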

Also, John Gruber has a pretty good regex to do linkification. See: An Improved Liberal, Accurate Regex Pattern for Matching URLs. However, his regex suffers catastrophic backtracking under certain circumstances. (I've written to him about this, but he has yet to respond.)

Hope this helps! :)

PHP: How do I linkify all links inside a given text?

That's the kind of thing best left to a 3rd party library (which you're doing, so kudos). I'd recommend trying another one before you roll your own. purl is an excellent alternative.

How do I linkify urls in a string with php?

You can use the following:

$string = "Look on http://www.google.com";
$string = preg_replace(
              "~[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]~",
              "<a href=\"\\0\">\\0</a>", 
              $string);

On PHP versions before 5.3 you could use ereg_replace instead; it was deprecated in 5.3 and removed in 7.0, so on anything newer use preg_replace as shown.
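One caveat with the simple replacement: the matched URL is written into the href attribute and the link text unescaped. Here is a rough sketch of the same pattern with a callback that escapes each match. It assumes the input is plain text (not already HTML-escaped), and the function name linkify_escaped() is just a name I made up for the example.

```php
<?php
// Hypothetical helper: same pattern as above, but every match is HTML-escaped
// before it is placed in the href attribute and the link text.
function linkify_escaped(string $text): string
{
    $pattern = '~[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]~';
    return preg_replace_callback($pattern, function (array $m): string {
        $safe = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
        return '<a href="' . $safe . '">' . $safe . '</a>';
    }, $text);
}

echo linkify_escaped('Look on http://www.google.com/?a=1&b=2'), "\n";
```

The "&" in the query string comes out as "&amp;" in both the attribute and the link text, which is what a browser expects inside HTML.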

Linkify URLs with PHP, trim outputted urls length

Create a capture group after the protocol:

$string = preg_replace(
  "~[[:alpha:]]+://([^<>[:space:]]+[[:alnum:]/])~",
  "<a href=\"\\0\">\\1</a>", 
  $string
);

Then \1 will be the URL without the protocol. For limiting the displayed text I'd recommend using CSS (see: Setting a max character length in CSS).
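If you would rather shorten the link text in PHP than in CSS, a callback can do it. This is only a sketch under my own assumptions: the function name linkify_short(), the $max parameter and the "..." suffix are all made up for the example, and strlen()/substr() count bytes, so multibyte text would need the mbstring functions instead.

```php
<?php
// Hypothetical helper: link the full URL, but show at most $max bytes of the
// protocol-less URL as the link text.
function linkify_short(string $text, int $max = 30): string
{
    $pattern = '~[[:alpha:]]+://([^<>[:space:]]+[[:alnum:]/])~';
    return preg_replace_callback($pattern, function (array $m) use ($max): string {
        $label = $m[1]; // URL without the protocol (capture group 1).
        if (strlen($label) > $max) {
            $label = substr($label, 0, $max) . '...';
        }
        return '<a href="' . htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8') . '">'
             . htmlspecialchars($label, ENT_QUOTES, 'UTF-8') . '</a>';
    }, $text);
}
```

The href always keeps the complete URL; only the visible label is cut.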

convert url to links from string except if they are in an attribute of an html tag

Your code as it is should not be much of a problem within iframes and so on, because in there you usually have a " in front of your URL and not a space, as your pattern requires.

However, here is a different solution. It might not work 100% if you have a single < or > within HTML comments or something similar (and I do not know whether this is a problem for you or not). But in any other case, it should serve you well. It uses a negative lookahead to make sure that there is no closing > before any opening < (because this means you are inside a tag).

$content = preg_replace('$(\s|^)(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="$2" target="_blank">$2</a> ', $content." ");
$content = preg_replace('$(\s|^)(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a target="_blank" href="http://$2">$2</a> ', $content." ");

In case you are not familiar with this technique, here is a bit more elaboration.

(?!        # starts the lookahead assertion; now your pattern will only match, if this subpattern does not match
[^<>]      # any character that is neither < nor >; the > is not strictly necessary but might help for optimization
*          # arbitrary many of those characters (but in a row; so not a single < or > in between)
>          # the closing >
)          # ends the lookahead subpattern

Note that I changed the regex delimiters, because I am now using ! within the regex.
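To see the lookahead at work, here is a small test you can run. The sample HTML string is made up for the demonstration; the pattern is the http one from above without the leading (\s|^) group.

```php
<?php
// The URL in plain text is followed by "<" before any ">", so it gets linked.
// The URL inside the src attribute is followed by ">" with no "<" in between,
// so the negative lookahead rejects it and the tag stays untouched.
$pattern = '$(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i';
$html    = 'Visit http://example.com/a <img src="http://example.com/b.png"> now';
$out     = preg_replace($pattern, '<a href="$1">$1</a>', $html);

echo $out, "\n";
```

Only the first URL is wrapped in a link; the img tag comes out exactly as it went in.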

Unless you need the first subpattern (\s|^) for the URLs outside of tags as well, you can now remove that, too (and decrease the capture variables in the replacement).

$content = preg_replace('$(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="$1" target="_blank">$1</a> ', $content." ");
$content = preg_replace('$(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i', '<a target="_blank" href="http://$1">$1</a> ', $content." ");

And lastly... do you intend not to replace URLs that contain anchors at the end? E.g. www.hello.com/index.html#section1? If you missed this by accident, add the # to your allowed URL characters:

$content = preg_replace('$(https?://[a-z0-9_./?=&#-]+)(?![^<>]*>)$i', ' <a href="$1" target="_blank">$1</a> ', $content." ");
$content = preg_replace('$(www\.[a-z0-9_./?=&#-]+)(?![^<>]*>)$i', '<a target="_blank" href="http://$1">$1</a> ', $content." ");

EDIT: Also, what about + and %? There are also a few other characters that are allowed to appear in a URL without being encoded. See this. END OF EDIT

I think this should do the trick for you. However, if you could provide an example that shows working and broken URLs (with the code you have), we could actually provide solutions that are tested to work for all of your cases.

One final thought. The proper solution would be to use a DOM parser. Then you could simply apply the regex you already have only to text nodes. However, your concern for the HTML structure is very restricted, and that makes your problem regular again (as long as you do not have unmatched '<' or '>' in HTML comments or JavaScript or CSS on the page). If you do have those special cases, you should really look into a DOM parser. None of the solutions presented here (so far) will be safe in that case.
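For reference, the DOM parser route could be sketched roughly like this with PHP's DOM extension. Treat it as an outline under assumptions rather than a tested drop-in: the function name linkify_text_nodes() is hypothetical, the URL pattern is the simple one from above with #, % and + added, and the input is assumed to be an ASCII HTML fragment (loadHTML needs an explicit encoding hint for anything else).

```php
<?php
// Hypothetical helper: linkify URLs only in text nodes that are not already
// inside an <a>, <script> or <style> element.
function linkify_text_nodes(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // Silence warnings on sloppy HTML.
    $doc->loadHTML('<div>' . $html . '</div>');   // Wrap so there is one known root.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]';
    // Copy the node list first, because the loop modifies the tree.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        // With DELIM_CAPTURE, odd indexes hold URLs and even indexes plain text.
        $parts = preg_split('~(https?://[a-z0-9_./?=&#%+-]+)~i', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // No URL in this text node.
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $part);
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize the children of the wrapper <div>, not the wrapper itself.
    $root = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because the regex only ever sees text nodes, URLs in attributes and URLs that are already linked are never touched, whatever stray < or > characters appear elsewhere.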

Linkify PHP text

This is working well on the sites I am using it for...

function find_urls($t){
    $reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
    // Handle each URL separately, so that a text with several different URLs
    // does not get all of them replaced by the first one found.
    // If there are no URLs in the text, it is returned unchanged.
    return preg_replace_callback($reg_exUrl, function ($m) {
        $url = $m[0];
        $add = '';
        // Move a trailing ")" or "]" out of the link.
        $last = substr($url, -1);
        if ($last == ')' || $last == ']') {
            $url = substr($url, 0, -1);
            $add = $last;
        }
        // make the urls hyper links
        return '<a href="'.$url.'">'.$url.'</a>'.$add;
    }, $t);
}

how to find image only links in string and linkify them

You can add another argument called $allowed_types, which holds all the extensions you want to allow.

Then you must get the substring after last '.' character, and compare it to your list of allowed extensions.

This is the basic idea; I'm sure it can be improved a lot.

/**
 * Turn all URLs into clickable links.
 * 
 * @param string $value
 * @param array  $allowed_types  file extensions to allow, e.g. jpg, png
 * @param array  $protocols      http/https, ftp, mail, twitter
 * @param array  $attributes
 * @return string|false          FALSE if the extension is not allowed
 */
function linkify($value, $allowed_types = array('jpg', 'png'), $protocols = array('http', 'mail'), array $attributes = array()) {

    /**
     * Get position of last dot in string
     */
    $dot_pos = strrpos($value, '.');
    if($dot_pos === FALSE) {
        return FALSE;
    }

    /**
     * Get substring after last dot
     */
    $extension = substr($value, $dot_pos + 1);

    if(!in_array($extension, $allowed_types)) {
        /**
         * Extension not in allowed types
         */
        return FALSE;
    }

    // Link attributes
    $attr = '';
    foreach ($attributes as $key => $val) {
        $attr .= ' ' . $key . '="' . htmlentities($val) . '"';
    }

    $links = array();

    // Extract existing links and tags
    $value = preg_replace_callback('~(<a .*?>.*?</a>|<.*?>)~i', function ($match) use (&$links) {
        return '<' . array_push($links, $match[1]) . '>';
    }, $value);

    // Extract text links for each protocol
    foreach ((array) $protocols as $protocol) {
        switch ($protocol) {
            case 'http':
            case 'https': $value = preg_replace_callback('~(?:(https?)://([^\s<]+)|(www\.[^\s<]+?\.[^\s<]+))(?<![\.,:])~i', function ($match) use ($protocol, &$links, $attr) {
                    if ($match[1])
                        $protocol = $match[1];
                    $link = $match[2] ? : $match[3];
                    return '<' . array_push($links, "<a $attr href=\"$protocol://$link\">$link</a>") . '>';
                }, $value);
                break;
            case 'mail': $value = preg_replace_callback('~([^\s<]+?@[^\s<]+?\.[^\s<]+)(?<![\.,:])~', function ($match) use (&$links, $attr) {
                    return '<' . array_push($links, "<a $attr href=\"mailto:{$match[1]}\">{$match[1]}</a>") . '>';
                }, $value);
                break;
            case 'twitter': $value = preg_replace_callback('~(?<!\w)[@#](\w++)~', function ($match) use (&$links, $attr) {
                    return '<' . array_push($links, "<a $attr href=\"https://twitter.com/" . ($match[0][0] == '@' ? '' : 'search/%23') . $match[1] . "\">{$match[0]}</a>") . '>';
                }, $value);
                break;
            default: $value = preg_replace_callback('~' . preg_quote($protocol, '~') . '://([^\s<]+?)(?<![\.,:])~i', function ($match) use ($protocol, &$links, $attr) {
                    return '<' . array_push($links, "<a $attr href=\"$protocol://{$match[1]}\">{$match[1]}</a>") . '>';
                }, $value);
                break;
        }
    }

    // Insert all links
    return preg_replace_callback('/<(\d+)>/', function ($match) use (&$links) {
        return $links[$match[1] - 1];
    }, $value);
}
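Note that the function above checks the extension of the whole $value, so it only behaves as intended when $value is a single URL. For a longer text with several URLs in it, the check has to happen per match. Here is a minimal sketch of that idea; linkify_images() is a hypothetical name, and the URL pattern is the simplified http/https one from the function above.

```php
<?php
// Hypothetical helper: link only those URLs whose path ends in an allowed
// file extension; every other URL is left as plain text.
function linkify_images(string $text, array $allowed_types = ['jpg', 'png']): string
{
    $pattern = '~https?://[^\s<]+(?<![.,:])~i';
    return preg_replace_callback($pattern, function (array $m) use ($allowed_types): string {
        // Look at the path only, so a query string like "?x=1" does not
        // hide or fake the extension.
        $path = parse_url($m[0], PHP_URL_PATH);
        $ext  = is_string($path) ? strtolower(pathinfo($path, PATHINFO_EXTENSION)) : '';
        if (!in_array($ext, $allowed_types, true)) {
            return $m[0]; // Not an allowed image type: leave untouched.
        }
        $safe = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
        return '<a href="' . $safe . '">' . $safe . '</a>';
    }, $text);
}
```

The extension comparison is done in lower case, so "a.PNG" is treated the same as "a.png".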
