Regex Matching Links Without <A> Tag

regex matching links without a tag

With all the disclaimers about using regex to parse html, if you want to use regex for this task, this will work:

$regex="~<a.*?</a>(*SKIP)(*F)|http://\S+~";

This problem is a classic case of the technique for "regex-matching a pattern, excluding..." certain contexts, explained in the question listed under Reference below.

The left side of the alternation | matches complete <a ...>...</a> elements, then deliberately fails; (*SKIP) makes the engine resume matching after the skipped element instead of retrying inside it. The right side matches the URLs, and we know they are the right ones because they were not consumed by the expression on the left.

The URL regex I put on the right can be refined (for example, https?:// to also match https links); just use whatever suits your needs.
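JavaScript has no (*SKIP)(*F) verbs, but the same "match what you want to skip on the left, capture what you want on the right" idea can be sketched there with a capture group. This is only an illustration under assumptions: the function name findBareUrls and the simplified URL pattern are hypothetical and not part of the original answer.

```javascript
// Left alternative: a whole <a>...</a> element, matched and then ignored.
// Right alternative: a bare URL, captured in group 1.
// Both the name and the simplified URL pattern are illustrative assumptions.
const anchorOrUrl = /<a\b[\s\S]*?<\/a>|(https?:\/\/[^\s<]+)/gi;

function findBareUrls(html) {
  const urls = [];
  for (const m of html.matchAll(anchorOrUrl)) {
    // Group 1 is undefined when the left alternative (an anchor) matched.
    if (m[1] !== undefined) {
      urls.push(m[1]);
    }
  }
  return urls;
}

console.log(findBareUrls(
  'Visit http://a.example/x and <a href="http://b.example/">http://b.example/</a>'
));
```

Because the anchor alternative consumes the whole element, the URLs inside it are never offered to the right-hand alternative.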

Reference

  • How to match (or replace) a pattern except in situations s1, s2, s3...
  • Article about matching a pattern unless...

Regex match url without a tag - JS

Instead of using regex, you can test whether each candidate is treated by the browser as a complete URL. Here, I split the string on whitespace, then test each part by assigning it to an anchor's href and checking that the browser hands it back unchanged.

const string = `Here is a link https://www.somewebsite.com/ here is a link already in an a tag <a href="https://www.somewebsite.com/">https://www.somewebsite.com/</a>`;

// Split on whitespace, wrap each part that is a complete URL, and rejoin.
const newString = string
  .split(/\s+/)
  .map(part => (isDefinedUrl(part) ? makeUrl(part) : part))
  .join(' ');
console.log(newString);

// A complete (absolute) URL comes back unchanged after being assigned to a.href.
function isDefinedUrl(possibleUrl) {
  const a = document.createElement('a');
  a.href = possibleUrl;
  return possibleUrl === a.href;
}

function makeUrl(url) {
  return `<a href="${url}">${url}</a>`;
}
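The snippet above needs a browser because it relies on document.createElement. A rough equivalent for environments without a DOM can be sketched with the global URL constructor. This assumes a runtime that provides the WHATWG URL class (such as Node.js); the names isAbsoluteUrl and linkifyTokens are hypothetical.

```javascript
// Assumption: a global WHATWG URL class is available (e.g. in Node.js).
// isAbsoluteUrl and linkifyTokens are illustrative names only.
function isAbsoluteUrl(candidate) {
  try {
    // new URL() throws for strings that are not absolute URLs.
    // Comparing with .href mirrors the "comes back unchanged" check.
    return new URL(candidate).href === candidate;
  } catch {
    return false;
  }
}

function linkifyTokens(text) {
  return text
    .split(/\s+/)
    .map(part => (isAbsoluteUrl(part) ? `<a href="${part}">${part}</a>` : part))
    .join(' ');
}

console.log(linkifyTokens('Here is a link https://www.somewebsite.com/ here'));
```

Like the original, this treats any whitespace-delimited token that parses as an absolute URL as a link, so tokens such as mailto: addresses would also be wrapped.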

Javascript regex: Find all URLs outside a tags - Nested Tags

It turned out that probably the best solution is the following:

((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

It looks like the negative lookahead works properly only when it starts with a quantified character class rather than a literal string. In practice, that means we can only rely on backtracking.

Again, we just want to make sure that nothing inside HTML tags, such as attribute values, gets messed up. Then we backtrack from </a to the first " symbol (which is not a valid URL character, whereas < and > do occur with nested tags).

Now nested tags inside <a> tags are also handled properly. Of course, the pattern is not perfect, but it should work with almost any simple HTML markup. You just need to be a bit careful about the following:

  • quotes placed within <a> tags;
  • <a> tags without any attribute (placeholders), on which this algorithm should not be used;
  • multiple nested tags or lines, which are best avoided unless the URL inside the <a> tag comes after a double quote.



Here is a very good and messy example (the last match should not be found but it is):

https://regex101.com/r/pC0jR7/2

It is a pity that this lookahead does not work: (?!<a.*?<\/a>)
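To try the pattern with the trailing negative lookahead on your own markup, a small harness like the following may help. It is only a sketch: the function name findUrlsOutsideTags and the sample string are assumptions for illustration, and the caveats listed above still apply.

```javascript
// The pattern from this answer, with the g flag added to collect all matches.
const urlOutsideTags = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical helper: returns every URL the pattern accepts.
function findUrlsOutsideTags(html) {
  return html.match(urlOutsideTags) || [];
}

const sample =
  'See https://a.example/one and ' +
  '<a href="https://b.example/two">https://b.example/two</a> ' +
  'plus ftp://c.example/three';

console.log(findUrlsOutsideTags(sample));
```

In this sample the URL in the href attribute is rejected by the first lookahead branch (a > follows before any <), and the URL in the link text is rejected by the second branch (</a follows before any double quote).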

Regular expression pattern to match url with or without http(s) and without tags

Never mind, I found the solution already.

With this solution, everything works fine:

$pattern = '(?xi)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`\!()\[\]{};:\'".,<>?«»“”‘’]))';     
return preg_replace("!$pattern!i", "<a href=\"\\0\" rel=\"nofollow\" target=\"_blank\">\\0</a>", $str);

Match link tag with any attribute

Here's the regex. It matches everything from <link to </link>

<link[^>]*href[^>]*>.*?</link>

Here are the results showing what will and will not work with this regex.

<link href="asdsada" />  EMPTY FAILS  
<link href="asdsada">adsasd</link> NORMAL OK
<link href="asdsada"><div>asdasdsa</div></link> NESTED ELEMENTS OK
<link href="asdasda"><link>asdasd</link></link> NESTED Link FAILS

You can also use groups to capture the href attribute and the inner content. Note that capturing the href value this way relies on it being wrapped in double quotes.

<link[^>]*href="([^"]*)"[^>]*>(.*?)</link>
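The four cases listed above can be checked with a short script. This is a sketch for experimentation; the helper names matchesWhole and parseLink are hypothetical. Here "works" is taken to mean that the first match spans the entire input, which is an interpretation of the results list rather than something the answer states.

```javascript
// Both patterns are copied from this answer; only "/" is escaped for JS literals.
const linkElement = /<link[^>]*href[^>]*>.*?<\/link>/;
const linkWithGroups = /<link[^>]*href="([^"]*)"[^>]*>(.*?)<\/link>/;

// Hypothetical helper: true only when the first match covers the whole input.
function matchesWhole(pattern, input) {
  const m = input.match(pattern);
  return m !== null && m[0] === input;
}

// Hypothetical helper: returns the captured href and inner content, or null.
function parseLink(input) {
  const m = input.match(linkWithGroups);
  return m ? { href: m[1], inner: m[2] } : null;
}

const samples = [
  '<link href="asdsada" />',
  '<link href="asdsada">adsasd</link>',
  '<link href="asdsada"><div>asdasdsa</div></link>',
  '<link href="asdasda"><link>asdasd</link></link>',
];

for (const s of samples) {
  console.log(matchesWhole(linkElement, s), s);
}
console.log(parseLink(samples[2]));
```

In the nested case the lazy .*? stops at the first </link>, so the match ends early and the outer closing tag is left over.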

Regular expression to find URLs not inside a hyperlink

You can do it in two steps instead of trying to come up with a single regular expression:

  1. Strip out (replace with nothing) the HTML anchor: the entire element, that is, opening tag, content and closing tag.

  2. Match the URL

In Perl it could be:

my $curLine = $_; # Work on a copy so $_ stays intact if it is needed for something else.
$curLine =~ s/<a[^<]+<\/a>//g; # Remove every HTML anchor: "<a", "</a>" and everything in between.
if ( $curLine =~ /http:\/\// )
{
    print "Matched a URL outside an HTML anchor: $_\n";
}
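The same two-step idea can be sketched in JavaScript. Two details here are assumptions that differ from the Perl above: the anchor pattern uses a lazy [\s\S]*? so that anchors containing nested tags are also removed, and the URL test accepts https as well as http. The function name hasUrlOutsideAnchor is hypothetical.

```javascript
// Hypothetical helper illustrating the two-step approach.
function hasUrlOutsideAnchor(line) {
  // Step 1: remove every <a>...</a> element, including its content.
  const withoutAnchors = line.replace(/<a\b[\s\S]*?<\/a>/gi, '');
  // Step 2: look for a URL scheme in whatever text is left.
  return /https?:\/\//.test(withoutAnchors);
}

console.log(hasUrlOutsideAnchor('<a href="http://x.example/">home</a>'));
console.log(hasUrlOutsideAnchor('<a href="http://x.example/">home</a> and http://y.example/'));
```

Working on a stripped copy keeps each regex simple, at the cost of losing the original match positions.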

Capturing all the occurrences of a specific word when is not part of a link

In PCRE, you may use this regex:

~(?is)<a .*?</a>(*SKIP)(*F)|\bapple\b~

RegEx Details:

  • (?is): Enable ignore case and DOTALL modes
  • <a .*?</a>: Match text from <a to </a> to skip all <a> tags
  • (*SKIP)(*F): Together these provide a nice workaround for the restriction that PCRE does not allow variable-length lookbehind
  • |: OR
  • \bapple\b: Match word apple
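Outside PCRE, where (*SKIP)(*F) is not available, a replace callback can play the same role: anchors are matched and returned unchanged, and only the word captured by the other alternative is rewritten. The sketch below assumes JavaScript; the function name markWordOutsideLinks and the <mark> wrapper are illustrative choices, not part of the original answer.

```javascript
// Hypothetical helper: wraps `word` in <mark> unless it sits inside <a>...</a>.
function markWordOutsideLinks(html, word) {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(`<a\\s[\\s\\S]*?</a>|\\b(${escaped})\\b`, 'gi');
  return html.replace(pattern, (match, hit) =>
    // `hit` is undefined when the anchor alternative matched: keep it as-is.
    hit === undefined ? match : `<mark>${hit}</mark>`
  );
}

console.log(markWordOutsideLinks(
  'An apple and <a href="/apple">apple pie</a>, then Apple.',
  'apple'
));
```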

regular expression for finding 'href' value of a a link

I'd recommend using an HTML parser over a regex, but here is a regex that creates a capturing group over the value of the href attribute of each link. It matches whether double or single quotes are used.

<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1

Snippet playground:

// JavaScript
const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');
document.querySelector('button').addEventListener('click', () => {
  console.log(textToMatchInput.value.match(linkRx));
});

<!-- HTML -->
<label>
  Text to match:
  <input type="text" name="textToMatch" value='<a href="google.com"'>
  <button>Match</button>
</label>
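The playground only reports the first match. To collect every href value from a longer string, the same pattern can be used with the g flag and matchAll. This is a sketch, and the function name extractHrefs is hypothetical.

```javascript
// Same pattern as above, with the g flag so matchAll can iterate all links.
const hrefPattern = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/g;

// Hypothetical helper: group 1 is the quote character, group 2 the href value.
function extractHrefs(html) {
  return Array.from(html.matchAll(hrefPattern), m => m[2]);
}

console.log(extractHrefs(
  '<a href="google.com">G</a> <a class=\'x\' href=\'/local\'>L</a> <a name="top">none</a>'
));
```

Anchors without an href attribute are simply not matched, so they contribute nothing to the result.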

