PHP Regular Expression to Match Keyword Outside HTML Tag <a>

I managed to do what I wanted (without using Regex) by:

  • parsing each character of my string
  • removing all <a> tags (copying them to a temporary array and leaving a placeholder in the string)
  • running str_replace on the new string to replace all the keywords
  • replacing the placeholders with their original <a> tags

Here's the code I used, in case someone else needs it:

$str = <<<STRA
Moses supposes his toeses are roses,
but <a href="original-moses1.html">Moses</a> supposes erroneously;
for nobody's toeses are posies of roses,
as Moses supposes his toeses to be.
Ganda <span class="cenas"><a href="original-moses2.html" target="_blank">Moses</a></span>!
STRA;

$arr1 = str_split($str);

$arr_links = array();
$phrase_holder = '';
$current_a = 0;
$goto_arr_links = false; // true while collecting the characters of an <a>...</a>
$close_a = false; // true while skipping the rest of a closing </a>

foreach($arr1 as $k => $v)
{
    if ($close_a == true)
    {
        if ($v == '>') {
            $close_a = false;
        }
        continue;
    }

    if ($goto_arr_links == true)
    {
        $arr_links[$current_a] .= $v;
    }

    // look-ahead characters; ?? '' avoids undefined-offset warnings near the end of the string
    $n1 = $arr1[$k+1] ?? '';
    $n2 = $arr1[$k+2] ?? '';
    $n3 = $arr1[$k+3] ?? '';

    if ($goto_arr_links == false && $v == '<' && $n1 == 'a' && ($n2 == '>' || ctype_space($n2))) { /* <a, but not <abbr etc. */
        // keep collecting every char until </a>
        $arr_links[$current_a] = $v; // start a new entry
        $goto_arr_links = true;
    } elseif ($goto_arr_links == true && $v == '<' && $n1 == '/' && $n2 == 'a' && $n3 == '>') { /* </a> */
        $arr_links[$current_a] .= "/a>";

        $goto_arr_links = false;
        $close_a = true;
        $phrase_holder .= "{%$current_a%}"; /* put a parameter holder on the phrase */
        $current_a++;
    }
    elseif ($goto_arr_links == false) {
        $phrase_holder .= $v;
    }
}

echo "Links Array:\n";
print_r($arr_links);
echo "\n\n\nPhrase Holder:\n";
echo $phrase_holder;
echo "\n\n\n(pre) Final Phrase (with my keyword replaced):\n";
$final_phrase = str_replace("Moses", "<a href=\"novo-mega-link.php\">Moses</a>", $phrase_holder);
echo $final_phrase;
echo "\n\n\nFinal Phrase:\n";
foreach($arr_links as $k => $v)
{
    $final_phrase = str_replace("{%$k%}", $v, $final_phrase);
}
echo $final_phrase;

The output:

Links Array:

Array
(
[0] => <a href="original-moses1.html">Moses</a>
[1] => <a href="original-moses2.html" target="_blank">Moses</a>
)

Phrase Holder:

Moses supposes his toeses are roses,
but {%0%} supposes erroneously;
for nobody's toeses are posies of roses,
as Moses supposes his toeses to be.
Ganda <span class="cenas">{%1%}</span>!

(pre) Final Phrase (with my keyword replaced):

<a href="novo-mega-link.php">Moses</a> supposes his toeses are roses,
but {%0%} supposes erroneously;
for nobody's toeses are posies of roses,
as <a href="novo-mega-link.php">Moses</a> supposes his toeses to be.
Ganda <span class="cenas">{%1%}</span>!

Final Phrase:

<a href="novo-mega-link.php">Moses</a> supposes his toeses are roses,
but <a href="original-moses1.html">Moses</a> supposes erroneously;
for nobody's toeses are posies of roses,
as <a href="novo-mega-link.php">Moses</a> supposes his toeses to be.
Ganda <span class="cenas"><a href="original-moses2.html" target="_blank">Moses</a></span>!

regex matching links without a tag

With all the disclaimers about using regex to parse HTML: if you want to use regex for this task, this will work:

$regex="~<a.*?</a>(*SKIP)(*F)|http://\S+~";

This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

The left side of the alternation | matches complete <a ...> ... </a> elements and then deliberately fails; (*SKIP) makes the engine resume after the end of that element rather than retrying inside it. The right side matches the URLs, and we know they are the right ones because they were not matched by the expression on the left.

The URL regex I put on the right can be refined; just use whatever suits your needs.
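
As a rough illustration of the alternation described above, here is a minimal sketch that collects bare URLs while ignoring the ones already inside <a> elements. The sample string and the variable names are made up for the example.

```php
<?php
// Hypothetical input: one URL is already linked, one is bare.
$html = 'See <a href="http://example.com/a">http://example.com/a</a> and http://example.com/b for details.';

// Left branch: consume a whole <a>...</a> element, then discard it with (*SKIP)(*F).
// Right branch: a deliberately simple URL pattern; refine as needed.
$regex = '~<a.*?</a>(*SKIP)(*F)|http://\S+~';

preg_match_all($regex, $html, $matches);
print_r($matches[0]); // only http://example.com/b is reported
```

Note that . does not match newlines by default, so a link spanning several lines would need the s modifier.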

Reference

  • How to match (or replace) a pattern except in situations s1, s2, s3...
  • Article about matching a pattern unless...

Regex to match keywords that aren't within a tags or alt attribs

How about adding another lookahead after your search term? This is a very convoluted pattern, but it seems like it would work:

Word(?![^<]*?>)(?!(?>[^<]*(?:<(?!/?a\b)[^<]*)*)</a>)

Explanation:

Word
(?! # that is not followed by
[^<]* # zero or more of anything that is not <
?> # lazily up to >
) # end lookahead

In <span class="Word">, [^<]*?> matches "> so the negative lookahead rejects this Word.

In <a href="/Word" alt="Word">, [^<]*?> matches " alt="Word"> and the match is rejected as well.

I'll leave the explanation of the following part of the expression to the poster of the thread it comes from, since I'm not totally sure about a couple of the elements in it.

(?!(?>[^<]*(?:<(?!/?a\b)[^<]*)*)</a>)
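
One possible reading of that second lookahead (not the original poster's explanation): scan forward over text and over any tag that is not an opening or closing a tag, and reject the match if the first a tag reached is a closing </a>, which would mean the word sits inside a link. The sketch below tries the full pattern on an invented sample string and brackets every occurrence it accepts.

```php
<?php
// Hypothetical sample: "Word" appears in plain text, in attributes, and inside a link.
$html = 'Word <span class="Word">x</span> <a href="/Word" alt="Word">Word here</a> Word';

$pattern = '~Word(?![^<]*?>)(?!(?>[^<]*(?:<(?!/?a\b)[^<]*)*)</a>)~';

// Mark each accepted occurrence with brackets so the effect is visible.
$result = preg_replace($pattern, '[Word]', $html);
echo $result;
// [Word] <span class="Word">x</span> <a href="/Word" alt="Word">Word here</a> [Word]
```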

Find a word not contained in a tag

Your regex fails if there is any a> in the line.


Skip the links by using the (*SKIP)(*F) verbs, then match the word on the other side of the alternation |.

/<a[\s>][\s\S]*?\/a>(*SKIP)(*F)|stackoverflow/i

\s matches a whitespace, [\s\S] matches any character.
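
A short sketch of that pattern in use, with an invented sample string. Because of the [\s>] after <a, an <abbr> element is not treated as a link, so the keyword inside it is still matched.

```php
<?php
// Hypothetical input: keyword inside a link, inside <abbr>, and in plain text.
$html = '<a href="https://stackoverflow.com">StackOverflow</a>, <abbr>stackoverflow</abbr> and stackoverflow.';

$pattern = '/<a[\s>][\s\S]*?\/a>(*SKIP)(*F)|stackoverflow/i';

// $0 is the whole match, so the keyword keeps its original casing.
$result = preg_replace($pattern, '<b>$0</b>', $html);
echo $result;
// <a href="https://stackoverflow.com">StackOverflow</a>, <abbr><b>stackoverflow</b></abbr> and <b>stackoverflow</b>.
```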

preg_replace keywords OUTSIDE of strong tags

You can use a SKIP-FAIL regex to replace only text that is clearly outside of non-identical delimiters:

<strong>.*?<\/strong>(*SKIP)(*FAIL)|\b(boat|car)\b

Here is an IDEONE demo:

$str = "The man drove in his car.Then <strong>the man walked to the boat.</strong>"; 
$keywords = array('boat','car');
$p = implode('|', array_map('preg_quote', $keywords));
$result = preg_replace("#<strong>.*?<\/strong>(*SKIP)(*FAIL)|\b($p)\b#i", "gokart", $str);
echo $result;

NOTE that in this case, we most probably are not interested in a tempered greedy token solution inside the SKIP-FAIL block (that I posted initially, see revision history) since we do not care what is in between the delimiters.

preg_replace to exclude a href=''' /a PHP

If it must be done with regex I think PCRE verbs are your best option. Exclude all links then search for the term with word boundaries.

<a[\S\s]+?<\/a>(*SKIP)(*FAIL)|\bTERM\b

Demo: https://regex101.com/r/KlE1kc/1/

One flaw with this, though, shows up if the <a> ever contains a </a> inside it, e.g. onclick='write("</a>")'. A parser is really the best approach; there are a lot of gotchas with HTML and regexes.
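
For comparison, here is a minimal sketch of the parser route, assuming PHP's DOM extension (DOMDocument and DOMXPath) is available. The sample markup and the replacement word are invented; the XPath expression selects only text nodes that have no <a> ancestor, so link text is never touched.

```php
<?php
// Hypothetical sample; the link text contains a keyword that must stay untouched.
$html = '<p>He drove his car. <a href="car.html">Buy a car</a> or a boat.</p>';

$doc = new DOMDocument();
// These flags ask libxml not to add <html>/<body> wrappers or a doctype.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($doc);
// Visit every text node that is not inside an <a> element.
foreach ($xpath->query('//text()[not(ancestor::a)]') as $textNode) {
    $textNode->nodeValue = preg_replace('~\b(boat|car)\b~i', 'gokart', $textNode->nodeValue);
}

$result = $doc->saveHTML();
echo $result;
```

This only swaps plain text; inserting new markup (such as a link around the keyword) would mean creating element nodes rather than assigning to nodeValue.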

Regex select all text between tags

You can use "<pre>(.*?)</pre>", (replacing pre with whatever text you want) and extract the first group (for more specific instructions specify a language) but this assumes the simplistic notion that you have very simple and valid HTML.
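
In PHP, for instance, a minimal sketch could look like this. The sample string is made up, and the s modifier is an addition so that . also matches newlines inside the element.

```php
<?php
// Hypothetical sample with a single, simple <pre> element.
$html = '<p>intro</p><pre>line one</pre><p>outro</p>';

// Group 1 holds the text between the tags.
$inner = null;
if (preg_match('~<pre>(.*?)</pre>~s', $html, $m)) {
    $inner = $m[1];
    echo $inner; // line one
}
```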

As other commenters have suggested, if you're doing something complex, use an HTML parser.

preg_replace only OUTSIDE tags ? (... we're not talking full 'html parsing', just a bit of markdown)

Actually, this seems to work ok:

<?php
$item="markdown";
$t="This is essentially plain text apart from a few html tags generated
with some simplified markdown rules: <a href=markdown.html>[see here]</a>";

//_____1. apply emphasis_____
$t = preg_replace("|($item)|","<strong>$1</strong>",$t);

// "This is essentially plain text apart from a few html tags generated
// with some simplified <strong>markdown</strong> rules: <a href=
// <strong>markdown</strong>.html>[see here]</a>"

//_____2. remove emphasis if WITHIN opening and closing tag____
$t = preg_replace("|(<[^>]+?)(<strong>($item)</strong>)([^<]+?>)|","$1$3$4",$t);

// this keeps the text before ($1), the bare keyword ($3) and the text after ($4);
// the <strong>..</strong> wrapper ($2) is dropped

// "This is essentially plain text apart from a few html tags generated
// with some simplified <strong>markdown</strong> rules: <a href=markdown.html>
// [see here]</a>"

?>

A string like $item="odd|string" would cause some problems, since | is both the pattern delimiter and the alternation operator, but I won't be using that kind of string anyway... (it probably needs preg_quote($item, '|') or the like...)
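
A small sketch of one way to make such a keyword safe, offered as an assumption and not something the snippet above does: escape it with preg_quote(), passing the delimiter as the second argument. The sample text is invented.

```php
<?php
$item = "odd|string";
$t = "An odd|string in plain text.";

// preg_quote() escapes regex metacharacters; the second argument also escapes the | delimiter.
$quoted = preg_quote($item, '|');

$result = preg_replace("|($quoted)|", '<strong>$1</strong>', $t);
echo $result; // An <strong>odd|string</strong> in plain text.
```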


