Regex Ignore Url Already in HTML Tags

Regex ignore URL already in HTML tags

Try this

(?<!href=")(\b[\w]+:\/\/[\w-?&;#~=\.\/\@]+[\w\/])

See it here on Regexr

To make it more general, you can simplify the lookbehind to check only for =", so that a URL directly inside any double-quoted attribute is skipped, not just one inside href:

(?<!=")(\b[\w]+:\/\/[\w-?&;#~=\.\/\@]+[\w\/])

See it on Regexr

(?<!href=") is a negative lookbehind assertion; it ensures that the match is not immediately preceded by href=".

\b is a word boundary that anchors the start of the link to a transition from a non-word character to a word character. Without it the lookbehind would be useless, because the regex would simply match from "ttp://..." onward.
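
To see both pieces at work, here is a small JavaScript sketch. It assumes an engine with lookbehind support (ES2018 or later); the sample string and the variable names are made up for illustration, and the g flag is added only so that every match is returned.

```javascript
// Pattern from the answer above, with the g flag added to collect all matches.
const urlOutsideHref = /(?<!href=")(\b[\w]+:\/\/[\w-?&;#~=\.\/\@]+[\w\/])/g;

// Same pattern without \b, to show why the word boundary matters.
const noBoundary = /(?<!href=")([\w]+:\/\/[\w-?&;#~=\.\/\@]+[\w\/])/g;

// Hypothetical input: one URL already inside href, one bare URL.
const sampleText = '<a href="http://example.com/a">link</a> and http://example.org/b?x=1';

// With \b, only the bare URL is found.
const matches = sampleText.match(urlOutsideHref);
console.log(matches);

// Without \b, the engine steps one character past the "h" and
// matches the href value starting at "ttp://".
const withoutBoundary = sampleText.match(noBoundary);
console.log(withoutBoundary);
```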

Adjust regex to ignore anything else inside link HTML tags

Always Use DOM Parsing instead of regex

This has been suggested many times. Judging by the comments on the increasingly complicated regex, it would be easier to examine the DOM directly. Take the following for example:

function fragmentFromString(strHTML) {
  return document.createRange().createContextualFragment(strHTML);
}

let html = `<a data-popup-text="take me to <a href='http://www.google.com'>a search engine</a>" href="testing.html" data-id="1" data-popup-text="take me to <a href='http://www.google.com'>a search engine</a>"><p>Testing <span>This</span></p></a>`;
let fragment = fragmentFromString(html);
let aTags = Array.from(fragment.querySelectorAll('a'));

// Keep only the resolved href and the visible text of each link.
aTags = aTags.map(a => ({ href: a.href, text: a.textContent }));
console.log(aTags);

Javascript regex: Find all URLs outside a tags - Nested Tags

It turned out that the following is probably the best solution:

((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

It looks like the negative lookahead works properly only when it begins with a quantified pattern rather than a literal string. In practice, that means we can only rely on backtracking.

Again, we just want to make sure that nothing inside HTML tags, such as attribute values, gets messed up. The second branch of the lookahead then backtracks from </a to the first " character (a quote is not a valid URL character, whereas < and > do appear when tags are nested).

Nested tags inside <a> tags are now handled properly too. Of course, the code is not perfect, but it should work with almost any simple HTML markup. You just need to be a bit careful about the following:

  • quotes placed within <a> tags;
  • <a> tags without any attribute (placeholders), on which this algorithm should not be used;
  • multiple nested tags or lines, which you may need to avoid unless the URL inside the <a> tag comes after a double quote.
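
A minimal JavaScript sketch of the pattern and of the second caveat in the list. The input strings and variable names are invented for illustration, and the g flag is added so that match() returns every hit.

```javascript
// Pattern from the answer above, with the g flag added.
const outsideAnchors = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical input: a bare URL, then a link whose href and text are both URLs.
const mixed = 'Visit http://example.com/x now. ' +
  '<a href="http://example.org/y">http://example.org/y</a>';

// The href value is rejected by the first lookahead branch ([^<>]*>),
// the link text by the second ([^"]*?<\/a), so only the bare URL remains.
const found = mixed.match(outsideAnchors) || [];
console.log(found);

// Caveat: an <a> tag with no attributes contains no double quote, so the
// second branch can reach </a from the bare URL and the URL is skipped.
const attributeless = 'See http://bare.example <a>plain</a>';
const missed = attributeless.match(outsideAnchors) || [];
console.log(missed);
```

The second result shows why the pattern should not be trusted on markup that contains attribute-less <a> tags.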



Here is a very good and messy example (the last match should not be found but it is):

https://regex101.com/r/pC0jR7/2

It is a pity that this lookahead does not work: (?!<a.*?<\/a>)

regex matching links without a tag

With all the disclaimers about using regex to parse HTML, if you want to use regex for this task, this will work:

$regex="~<a.*?</a>(*SKIP)(*F)|http://\S+~";

See the demo.

This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

The left side of the alternation | matches complete <a ...> ... </a> elements and then deliberately fails, after which the engine skips to the position just past what it matched. The right side matches the URLs, and we know they are the right ones because they were not matched by the expression on the left.

The URL regex I put on the right can be refined; just use whatever suits your needs.
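
(*SKIP)(*F) is a PCRE feature and is not available in JavaScript. As a rough sketch of the same idea there, you can swap in the capture-group variant of the technique: the left branch still consumes whole <a ...> ... </a> elements, and the URLs you want are the ones that land in group 1. The function name findBareUrls and the sample string are made up for this example.

```javascript
// Left branch: consume an entire <a ...>...</a> element (nothing captured).
// Right branch: capture a bare URL in group 1.
const skipAnchors = /<a.*?<\/a>|(http:\/\/\S+)/g;

// Hypothetical helper, not part of the original answer.
function findBareUrls(html) {
  const urls = [];
  for (const m of html.matchAll(skipAnchors)) {
    // Group 1 is undefined whenever the left (anchor) branch matched.
    if (m[1] !== undefined) {
      urls.push(m[1]);
    }
  }
  return urls;
}

const page = 'See http://example.com/a and ' +
  '<a href="http://example.org/b">http://example.org/b</a>.';
const bareUrls = findBareUrls(page);
console.log(bareUrls);
```

Because the anchor element is consumed as a single match, the URLs inside it are never offered to the right-hand branch, which is the same effect (*SKIP)(*F) gives in PHP.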

Reference

  • How to match (or replace) a pattern except in situations s1, s2, s3...
  • Article about matching a pattern unless...

How to make auto hyperlink regex ignore img tags with src?

You could add a negative look-behind to the beginning of your regex:

(?<!src=["'])

which will prevent a URL matching if it is preceded by the characters src=" or src='.

Demo on 3v4l.org

Note that if you used a parser (e.g. DOMDocument) you could avoid this problem by only replacing links in the text nodes.
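
For illustration, here is the same lookbehind in a JavaScript sketch (ES2018 or later). The URL pattern after the lookbehind is a simplified stand-in, since the asker's full regex is not shown here, and the sample markup is invented.

```javascript
// Lookbehind from the answer above, followed by a deliberately simple,
// assumed URL pattern (not the asker's original one).
const urlNotInSrc = /(?<!src=["'])https?:\/\/[^\s"'<>]+/g;

// Hypothetical input: an image with a src URL, then a bare URL in the text.
const markup = '<img src="http://example.com/pic.png"> see http://example.com/page';

// $& in the replacement string stands for the whole matched URL.
const linked = markup.replace(urlNotInSrc, '<a href="$&">$&</a>');
console.log(linked);
```

Only the bare URL is wrapped; the one preceded by src=" is left untouched.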

regex linkify urls ignoring existing links

You didn't say which flavor of regex you're using. Hopefully it is something with working negative lookbehind, like PCRE.

Combining and expanding from the previous answers:

(?<!["']>|["'])(?:(?:https?:\/\/)|(?<!\/\/)www\.|(?:https?::\/\/)www\.)(?:\w+\.)+\w+(?:\/[a-z0-9-._~:\/?#[\]@!$&'()*+,;=%]*)?

Play with it here: https://regex101.com/r/jCpbgi/1

This should work on a large variety of URLs and domain names, and doesn't match previously-linkified URLs.
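
A short JavaScript sketch of how the pattern might be used to linkify text. It assumes an engine with lookbehind support and adds the g and i flags; the function name linkify, the sample text, and the choice to prefix http:// onto www. matches are assumptions for this example, not part of the answer.

```javascript
// Pattern copied from the answer above; only the g and i flags are added.
const urlPattern = /(?<!["']>|["'])(?:(?:https?:\/\/)|(?<!\/\/)www\.|(?:https?::\/\/)www\.)(?:\w+\.)+\w+(?:\/[a-z0-9-._~:\/?#[\]@!$&'()*+,;=%]*)?/gi;

// Hypothetical helper: wraps each match in an <a> tag.
function linkify(text) {
  return text.replace(urlPattern, (url) => {
    // Assumption: matches that start with www. get an http:// scheme in href.
    const href = /^https?:/i.test(url) ? url : `http://${url}`;
    return `<a href="${href}">${url}</a>`;
  });
}

const input = 'Go to www.example.com and https://example.org/path?q=1 ' +
  'but not <a href="https://example.net/">https://example.net/</a>';
const once = linkify(input);
console.log(once);

// Running it again should change nothing, because each new link is
// preceded by a quote or by "> and is therefore rejected by the lookbehind.
const twice = linkify(once);
console.log(twice === once);
```

Note that the character class for the path includes punctuation such as commas, so a URL followed directly by a comma in running text will take the comma with it.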


