Regex ignore URL already in HTML tags
Try this
(?<!href=")(\b[\w]+:\/\/[\w-?&;#~=\.\/\@]+[\w\/])
See it here on Regexr
To make it more general, you can simplify the lookbehind so that it only checks for =" immediately before the match:
(?<!=")(\b[\w]+:\/\/[\w-?&;#~=\.\/\@]+[\w\/])
See it on Regexr
(?<!href=")
is a negative lookbehind assertion. It ensures that your pattern is not immediately preceded by href=".
\b
is a word boundary that anchors the start of your link at a change from a non-word character to a word character. Without it the lookbehind would be useless, because the regex would simply match from "ttp://..." onward.
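To see the effect of the word boundary, here is a minimal JavaScript sketch. It assumes an engine with lookbehind support (ES2018 or later); the sample string and the helper name findUrls are made up for illustration, and the hyphen in the character class is escaped so that it is unambiguously literal.

```javascript
// Pattern from the answer above, with the hyphen in the class escaped.
const withBoundary = /(?<!href=")(\b[\w]+:\/\/[\w\-?&;#~=.\/@]+[\w\/])/g;
// Same pattern without \b, to show why the boundary matters.
const withoutBoundary = /(?<!href=")([\w]+:\/\/[\w\-?&;#~=.\/@]+[\w\/])/g;

// Hypothetical sample: one bare URL and one URL already inside href="...".
const sample =
  'Visit http://example.com/page or <a href="http://example.org/x">link</a>';

// Hypothetical helper: return every match of the pattern as an array.
function findUrls(pattern, text) {
  return text.match(pattern) || [];
}

console.log(findUrls(withBoundary, sample));
// → [ 'http://example.com/page' ]
console.log(findUrls(withoutBoundary, sample));
// → [ 'http://example.com/page', 'ttp://example.org/x' ]
```

Without \b the engine just moves one character forward, where href=" no longer sits directly before the match position.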
Adjust regex to ignore anything else inside link HTML tags
Always Use DOM Parsing instead of regex
This has been suggested many times. Judging by the comments on the increasingly complicated regexes, it would be easier to just examine the DOM. Take the following, for example:
// Parse an HTML string into a DocumentFragment (browser API).
function fragmentFromString(strHTML) {
  return document.createRange().createContextualFragment(strHTML);
}

let html = `<a data-popup-text="take me to <a href='http://www.google.com'>a search engine</a>" href="testing.html" data-id="1" data-popup-text="take me to <a href='http://www.google.com'>a search engine</a>"><p>Testing <span>This</span></p></a>`;
let fragment = fragmentFromString(html);

// Collect the href and the text content of every real <a> element.
let aTags = Array.from(fragment.querySelectorAll('a'))
  .map(a => ({ href: a.href, text: a.textContent }));
console.log(aTags);
Javascript regex: Find all URLs outside a tags - Nested Tags
It turned out that probably the best solution is the following:
((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)
It looks like the negative lookahead only works properly when its alternatives start with a quantified character class rather than a literal string. In practice, this means we can only rely on backtracking.
Again, we just want to make sure that nothing inside HTML tags (such as attributes) is messed up. Then we backtrack from </a up to the first " symbol (it is not a valid URL character, whereas the < and > symbols do appear with nested tags).
Now nested tags inside <a> tags are also found properly. Of course, the code is not perfect, but it should work with almost any simple HTML markup. You just need to be a bit careful:
- with placing quotes within <a> tags;
- not to use this algorithm on <a> tags without any attributes (placeholders);
- to avoid multiple nested tags or lines, unless the URL inside the <a> tag comes after a double quote.
Here is a very good and messy example (the last match should not be found but it is):
https://regex101.com/r/pC0jR7/2
It is a pity that this lookahead does not work: (?!<a.*?<\/a>)
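As a rough illustration of how the lookahead behaves, here is a small JavaScript sketch. The sample markup and the function name findUrlsOutsideTags are hypothetical, and the input is deliberately simple (quoted attributes, no nesting).

```javascript
// The pattern from this answer, with the g flag to collect all matches.
const outsideTags = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical helper: list the URLs that are neither inside a tag's
// attributes nor inside the text of an <a> element.
function findUrlsOutsideTags(html) {
  return html.match(outsideTags) || [];
}

// Hypothetical sample markup.
const markup =
  'See http://plain.example/a and ' +
  '<a href="http://attr.example/b">http://text.example/c</a> then ' +
  '<img src="http://img.example/d.png"> and http://last.example/e';

console.log(findUrlsOutsideTags(markup));
// → [ 'http://plain.example/a', 'http://last.example/e' ]
```

The first alternative of the lookahead rejects URLs that are followed by > before any other angle bracket (attribute values). The second rejects URLs that reach </a without crossing a double quote (link text).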
regex matching links without a tag
With all the disclaimers about using regex to parse html, if you want to use regex for this task, this will work:
$regex="~<a.*?</a>(*SKIP)(*F)|http://\S+~";
See the demo.
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
The left side of the alternation | matches complete <a ...> ... </a> tags, then deliberately fails, after which the engine skips to the next position in the string. The right side matches the URLs, and we know they are the right ones because they were not matched by the expression on the left.
The URL regex I put on the right can be refined; just use whatever suits your needs.
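The (*SKIP)(*F) verbs are PCRE features and are not available in JavaScript. As a swapped-in equivalent, the sketch below uses plain alternation with a capturing group: the left branch consumes whole <a>...</a> elements, and only matches where group 1 took part are kept. The function name, the URL sub-pattern, and the sample input are assumptions for illustration.

```javascript
// Left branch: swallow a complete <a ...>...</a> element (not captured).
// Right branch: capture a URL that appears anywhere else.
const anchorOrUrl = /<a\b[\s\S]*?<\/a>|(https?:\/\/[^\s<"]+)/gi;

// Hypothetical helper: return only the URLs captured by the right branch.
function urlsOutsideAnchors(html) {
  const found = [];
  for (const m of html.matchAll(anchorOrUrl)) {
    // m[1] is undefined when the left branch matched an <a> element.
    if (m[1] !== undefined) found.push(m[1]);
  }
  return found;
}

console.log(urlsOutsideAnchors(
  'http://a.example/1 <a href="http://b.example/2">http://c.example/3</a> http://d.example/4'
));
// → [ 'http://a.example/1', 'http://d.example/4' ]
```

The effect is the same as in the PCRE version: anything matched by the left side is discarded, so the URLs that remain are the ones outside existing links.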
Reference
- How to match (or replace) a pattern except in situations s1, s2, s3...
- Article about matching a pattern unless...
How to make auto hyperlink regex ignore img tags with src?
You could add a negative look-behind to the beginning of your regex:
(?<!src=["'])
which will prevent a URL from matching if it is preceded by the characters src=" or src='.
Demo on 3v4l.org
Note that if you used a parser (e.g. DOMDocument) you could avoid this problem by only replacing links in the text nodes.
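The lookbehind suggested above can be tried out in JavaScript as well (ES2018 or later). This is only a sketch: the asker's original auto-link regex is not shown here, so the URL sub-pattern and the autoLink function below are stand-ins.

```javascript
// Stand-in URL pattern prefixed with the lookbehind from this answer.
const bareUrl = /(?<!src=["'])https?:\/\/[^\s"'<>]+/g;

// Hypothetical helper: wrap every URL that is not an src value in a link.
function autoLink(html) {
  return html.replace(bareUrl, url => `<a href="${url}">${url}</a>`);
}

console.log(autoLink('<img src="http://x.example/a.png"> see http://y.example/page'));
// → <img src="http://x.example/a.png"> see <a href="http://y.example/page">http://y.example/page</a>
```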
regex linkify urls ignoring existing links
You didn't say which flavor of regex you're using. Hopefully something with working negative lookbehind, like PCRE:
Combining and expanding from the previous answers:
(?<!["']>|["'])(?:(?:https?:\/\/)|(?<!\/\/)www\.|(?:https?::\/\/)www\.)(?:\w+\.)+\w+(?:\/[a-z0-9-._~:\/?#[\]@!$&'()*+,;=%]*)?
Play with it here: https://regex101.com/r/jCpbgi/1
This should work on a large variety of URLs and domain names, and doesn't match previously-linkified URLs.
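For a quick check outside regex101, the same pattern can be run in JavaScript, which also supports this kind of lookbehind (ES2018 or later). The i flag, the sample text, and the function name findUnlinkedUrls are assumptions added for this sketch.

```javascript
// The pattern from this answer, copied as-is, with g and i flags added.
const unlinked = /(?<!["']>|["'])(?:(?:https?:\/\/)|(?<!\/\/)www\.|(?:https?::\/\/)www\.)(?:\w+\.)+\w+(?:\/[a-z0-9-._~:\/?#[\]@!$&'()*+,;=%]*)?/gi;

// Hypothetical helper: list URLs that are not already part of a link.
function findUnlinkedUrls(text) {
  return text.match(unlinked) || [];
}

console.log(findUnlinkedUrls(
  'Go to www.example.com/docs or ' +
  '<a href="https://linked.example.org/x">https://linked.example.org/x</a>'
));
// → [ 'www.example.com/docs' ]
```

The URL in the href value is skipped because it follows a quote, and the one in the link text is skipped because it follows "> from the end of the opening tag.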