PHP Regex to Match Outside of HTML Tags

php regex to match outside of html tags

You can use an assertion for that: you only have to ensure that the searched words occur somewhere after a >, or before any <. The latter test is easier to accomplish, because lookahead assertions can be variable-length:

/(asf|foo|barr)(?=[^>]*(<|$))/

See also http://www.regular-expressions.info/lookaround.html for a nice explanation of that assertion syntax.
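As a rough illustration of how the lookahead behaves, here is a minimal sketch that runs the pattern above through preg_replace. The sample markup and the "[...]" replacement are assumptions made for this example only; note that the check would be fooled by a literal > inside text or attribute values.

```php
<?php
// Hypothetical sample: keywords appear in attribute values and in text content.
$html = '<p class="foo">foo and <b title="asf">barr</b></p>';

// A keyword matches only if a "<" (or the end of the input) is reached
// before any ">", i.e. the keyword is not inside a tag.
$result = preg_replace('/(asf|foo|barr)(?=[^>]*(<|$))/', '[$1]', $html);

echo $result, "\n";
// <p class="foo">[foo] and <b title="asf">[barr]</b></p>
```

The keywords inside class="..." and title="..." are left alone because the next angle bracket after them is a >.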

Regular expression to match text outside html tags and not between specific tag

Use this pattern to skip (fail) everything between <h1> and </h1>:

Updated per comment below

<h1>[^<>]*<\/h1>(*SKIP)(*F)|\b(sample|text)\b(?=[^>]*(?:<|$))
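A minimal sketch of the (*SKIP)(*F) idea in PHP, assuming a made-up input string and a "[...]" replacement. It uses a variant of the pattern with the word boundaries placed around the whole alternation, so that both keywords are matched as whole words.

```php
<?php
// Hypothetical sample: keywords inside <h1>, inside an attribute, and in plain text.
$html = '<h1>sample text</h1><p title="text">Another sample of text</p>';

// First alternative: consume a whole <h1>...</h1> block, then force a failure
// and skip past it. Second alternative: keywords that are not inside a tag.
$pattern = '~<h1>[^<>]*</h1>(*SKIP)(*F)|\b(sample|text)\b(?=[^>]*(?:<|$))~';

$result = preg_replace($pattern, '[$1]', $html);

echo $result, "\n";
// <h1>sample text</h1><p title="text">Another [sample] of [text]</p>
```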


PHP Regular expression to match keyword outside HTML tag <a>

I managed to do what I wanted (without using Regex) by:

parsing each character of my string
removing all <a> tags (copying them to a temporary array and keeping a placeholder on the string)
str_replace the new string in order to replace all the keywords
replacing the placeholders with their original <a> tags

Here's the code I used, in case someone else needs it:

$str = <<<STRA
Moses supposes his toeses are roses,
but <a href="original-moses1.html">Moses</a> supposes erroneously;
for nobody's toeses are posies of roses,
as Moses supposes his toeses to be.
Ganda <span class="cenas"><a href="original-moses2.html" target="_blank">Moses</a></span>!
STRA;

$arr1 = str_split($str);

$arr_links = array();
$phrase_holder = '';
$current_a = 0;
$goto_arr_links = false;
$close_a = false;

foreach($arr1 as $k => $v)
{
    if ($close_a == true)
    {
        if ($v == '>') {
            $close_a = false;
        } 
        continue;
    }

    if ($goto_arr_links == true)
    {
        $arr_links[$current_a] .= $v;
    }

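    // The ?? '' fallbacks avoid undefined-offset warnings near the end of the string;
    // checking the character after "a" keeps tags such as <abbr> from being treated as links.
    if (!$goto_arr_links && $v == '<' && ($arr1[$k+1] ?? '') == 'a' && in_array($arr1[$k+2] ?? '', array(' ', '>'))) { /* <a */
        // keep collecting every char until </a>
        $arr_links[$current_a] = $v; // initialise the entry instead of appending to an undefined index
        $goto_arr_links = true;
    } elseif ($goto_arr_links && $v == '<' && ($arr1[$k+1] ?? '') == '/' && ($arr1[$k+2] ?? '') == 'a' && ($arr1[$k+3] ?? '') == '>' ) { /* </a> */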
        $arr_links[$current_a] .= "/a>";

        $goto_arr_links = false;
        $close_a = true;
        $phrase_holder .= "{%$current_a%}"; /* put a parameter holder on the phrase */
        $current_a++;
    }    
    elseif ($goto_arr_links == false) {
        $phrase_holder .= $v;
    }
}

echo "Links Array:\n";
print_r($arr_links);
echo "\n\n\nPhrase Holder:\n";
echo $phrase_holder;
echo "\n\n\n(pre) Final Phrase (with my keyword replaced):\n";
$final_phrase = str_replace("Moses", "<a href=\"novo-mega-link.php\">Moses</a>", $phrase_holder);
echo $final_phrase;
echo "\n\n\nFinal Phrase:\n";
foreach($arr_links as $k => $v)
{
    $final_phrase = str_replace("{%$k%}", $v, $final_phrase);
}
echo $final_phrase;

The output:

Links Array:

Array
(
    [0] => <a href="original-moses1.html">Moses</a>
    [1] => <a href="original-moses2.html" target="_blank">Moses</a>
)

Phrase Holder:

Moses supposes his toeses are roses,
but {%0%} supposes erroneously;
for nobody's toeses are posies of roses,
as Moses supposes his toeses to be.
Ganda <span class="cenas">{%1%}</span>!

(pre) Final Phrase (with my keyword replaced):

<a href="novo-mega-link.php">Moses</a> supposes his toeses are roses,
but {%0%} supposes erroneously;
for nobody's toeses are posies of roses,
as <a href="novo-mega-link.php">Moses</a> supposes his toeses to be.
Ganda <span class="cenas">{%1%}</span>!

Final Phrase:

<a href="novo-mega-link.php">Moses</a> supposes his toeses are roses,
but <a href="original-moses1.html">Moses</a> supposes erroneously;
for nobody's toeses are posies of roses,
as <a href="novo-mega-link.php">Moses</a> supposes his toeses to be.
Ganda <span class="cenas"><a href="original-moses2.html" target="_blank">Moses</a></span>!

PHP regex to match HTML tag names except some tags

<(?:(?!input)[^>])*>(?:<\/[^>]*>)?

Try this. See the demo:

https://www.regex101.com/r/fG5pZ8/13

$re = "/<(?:(?!input)[^>])*>(?:<\\/[^>]*>)?/im";
$str = "<input type=\"text\">\n<img src=\">\n<a href=\"\">\n<button type=\"button\"></button>\n<div id=\"some\"></div>\n<p></p>";

preg_match_all($re, $str, $matches);

Edit:

Use

(?!<input)<([A-Z0-9a-z]+)([^>]*>)?

This variant captures the tag name in its own group, in case you want to save it separately.

https://www.regex101.com/r/fG5pZ8/16
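To show what the second pattern captures, here is a small sketch that collects the tag names with preg_match_all. The test string is a shortened, assumed variant of the one above.

```php
<?php
// Shortened sample input (an assumption for this sketch).
$str = "<input type=\"text\">\n<a href=\"\">\n<button type=\"button\"></button>\n<div id=\"some\"></div>\n<p></p>";

// Group 1 holds the tag name; opening <input ...> tags are rejected by the
// negative lookahead, and closing tags never match because "/" is not alphanumeric.
preg_match_all('/(?!<input)<([A-Z0-9a-z]+)([^>]*>)?/', $str, $matches);

print_r($matches[1]);
// Tag names found, in order: a, button, div, p
```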

Extract text outside html tags

You can use PHP's DOMDocument and DOMXPath to get the values that you want. The trick is to wrap the HTML from your database in a container element (for example a <div>). You can then load it into a DOMDocument and use DOMXPath to select the children of the <div> that are pure text nodes, using the text() node test:

$html = 'This should be extracted <p>I do not want this</p> This should also be extracted <a>This may appear after other tags and I do not want this</a>';
$doc = new DOMDocument();
$doc->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);
$texts = array();
foreach ($xpath->query('/div/text()') as $text) {
    $texts[] = $text->nodeValue;
}
print_r($texts);

Output:

Array ( 
    [0] => This should be extracted
    [1] =>  This should also be extracted 
)


Regex replace text outside script tag

My pattern uses (*SKIP)(*FAIL) to disqualify matched script tags and their contents.

text and simple will be matched on every qualifying occurrence.

Regex Pattern: ~<script.*?/script>(*SKIP)(*FAIL)|text|simple~

Code:

$strings=['This has no replacements',
    'This simple text has no script tag',
    'This simple text ends with a script tag <script language="javascript">simple simple text text</script>',
    'This is simple html text is split by a script tag <script language="javascript">simple simple text text</script> text',
    '<script language="javascript">simple simple text text</script> this text starts with a script tag'
];

$strings=preg_replace('~<script.*?/script>(*SKIP)(*FAIL)|text|simple~','***replaced***',$strings);

var_export($strings);

Output:

array (
  0 => 'This has no replacements',
  1 => 'This ***replaced*** ***replaced*** has no script tag',
  2 => 'This ***replaced*** ***replaced*** ends with a script tag <script language="javascript">simple simple text text</script>',
  3 => 'This is ***replaced*** html ***replaced*** is split by a script tag <script language="javascript">simple simple text text</script> ***replaced***',
  4 => '<script language="javascript">simple simple text text</script> this ***replaced*** starts with a script tag',
)

Match text both inside and outside html tags, with grouping

I would replace the .*? everywhere with what you are really looking for.

The regular expression could be this:

(?=.+)((<([^>]+)>)?([^<]+)?(<\/([^>]+)>)?)

(?=.+) makes sure the match starts with something. All our capture groups are optional here, so this lookahead is used to avoid an extra empty match at the end.
When finding the tagname: [^>]+
When finding text in tags: [^<]+
([^<]+)? makes text within spans optional

Regex101 playground:

https://regex101.com/r/1caMOA/2
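The grouping can be inspected from PHP with preg_match_all. This is only a sketch: the input string is an assumption, and the group numbers follow the order of the opening parentheses in the expression above (3 = opening tag name, 4 = text, 6 = closing tag name).

```php
<?php
// Hypothetical input: one tagged run followed by plain text.
$input = '<b>bold</b> plain';

$pattern = '~(?=.+)((<([^>]+)>)?([^<]+)?(<\/([^>]+)>)?)~';

// PREG_SET_ORDER yields one array per match, holding all of its groups.
preg_match_all($pattern, $input, $sets, PREG_SET_ORDER);

foreach ($sets as $set) {
    // Groups that did not participate may be missing or empty, hence the ?? ''.
    printf("open=%s text=%s close=%s\n", $set[3] ?? '', $set[4] ?? '', $set[6] ?? '');
}
// open=b text=bold close=b
// open= text= plain close=
```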

Regex replace text outside html tags

Okay, try using this regex:

(text|simple)(?![^<]*>|[^<>]*</)


Breakdown:

(         # Open capture group
  text    # Match 'text'
|         # Or
  simple  # Match 'simple'
)         # End capture group
(?!       # Negative lookahead start (will cause match to fail if contents match)
  [^<]*   # Any number of non-'<' characters
  >       # A > character
|         # Or
  [^<>]*  # Any number of non-'<' and non-'>' characters
  </      # The characters < and /
)         # End negative lookahead.

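The negative lookahead prevents a match if text or simple appears inside a tag (for example in an attribute value) or is followed directly by a closing tag, i.e. sits at the end of an element's content.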
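A short sketch of the pattern in use with preg_replace; the sample string and the "***" replacement are assumptions for illustration.

```php
<?php
// Hypothetical sample: keywords in plain text, in an attribute, and inside <b>...</b>.
$html = 'This simple text has <b class="simple">text</b> in it';

// Keywords are replaced unless the lookahead finds a ">" before any "<"
// (inside a tag) or a "</" before any other angle bracket (before a closing tag).
$result = preg_replace('~(text|simple)(?![^<]*>|[^<>]*</)~', '***', $html);

echo $result, "\n";
// This *** *** has <b class="simple">text</b> in it
```

Note that the keyword inside <b>...</b> is left alone as well, since it is followed directly by a closing tag.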

PHP Regex to remove HTML-Tags inside <pre></pre> code blocks

You will need to use preg_replace_callback and call strip_tags in the callback body:

preg_replace_callback('~(<pre[^>]*>)([\s\S]*?)(</pre>)~',
function ($m) { return $m[1] . strip_tags($m[2], ['p', 'b', 'strong']) . $m[3]; },
$s);

Some text.
<pre>
a = 5
b = 3
</pre>
More text
<pre>
a2 = "text"
b = 3
</pre>
final text

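Note that the strip_tags call above strips all tags except p, b and strong. Passing the allowed tags as an array requires PHP 7.4 or later; older versions expect a string such as '<p><b><strong>'.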

RegEx Details:

(<pre[^>]*>): Match <pre...> and capture in group #1
([\s\S]*?): Match 0 or more of any character, including newlines (lazy), and capture this in group #2
(</pre>): Match </pre> and capture in group #3
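For a self-contained run, here is a minimal sketch with an assumed input string. It passes the allowed tags as a string, which strip_tags also accepts, so the sketch does not depend on the array form.

```php
<?php
// Hypothetical input: tags outside <pre> must survive, tags inside are stripped
// except for the allowed ones.
$s = "Some <i>text</i>.\n<pre class=\"code\">\n<span>a</span> = <b>5</b>\n</pre>";

$result = preg_replace_callback(
    '~(<pre[^>]*>)([\s\S]*?)(</pre>)~',
    function ($m) {
        // $m[1] = opening <pre ...>, $m[2] = block content, $m[3] = </pre>
        return $m[1] . strip_tags($m[2], '<p><b><strong>') . $m[3];
    },
    $s
);

echo $result, "\n";
// Some <i>text</i>.
// <pre class="code">
// a = <b>5</b>
// </pre>
```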
