Regex/Domdocument - Match and Replace Text Not in a Link

Regex / DOMDocument - match and replace text not in a link

Here is a UTF-8 safe solution that works not only with properly formatted documents but also with document fragments.

mb_convert_encoding() is needed because loadHTML() seems to have a bug with UTF-8 encoding: without an encoding declaration it treats the input as ISO-8859-1 (see references 3 and 4 below).

mb_substr() strips the opening and closing body tags from the output, so you get back your original content without any additional markup.

<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is <a href="#">a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';

$dom = new DOMDocument();
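// loadXML() requires a well-formed document, so loadHTML() is the better choice for fragments,
// but it needs a workaround to handle UTF-8 input correctly.
// Note: the 'HTML-ENTITIES' mbstring encoding is deprecated as of PHP 8.2.
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));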

$xpath = new DOMXPath($dom);

foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
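    // wholeText is entity-decoded, so re-escape &, < and > before appendXML() parses it as XML
    $escaped  = htmlspecialchars($node->wholeText, ENT_NOQUOTES | ENT_XML1, 'UTF-8');
    $replaced = str_ireplace('match this text', 'MATCH', $escaped);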
    $newNode  = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}

// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
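
If the replacement is plain text rather than markup, a variant without mb_convert_encoding() and appendXML() may be enough. The following is only a sketch under assumptions: replaceOutsideLinks() is a hypothetical helper name, and it assumes that a leading meta charset tag is sufficient for libxml to read the fragment as UTF-8.

```php
<?php
// Hypothetical helper, not part of the original answer.
function replaceOutsideLinks(string $html, string $search, string $replacement): string
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Assumption: the meta tag makes libxml decode the rest of the input as UTF-8.
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
        // A new text node is escaped automatically when the document is serialized.
        $text = $dom->createTextNode(str_ireplace($search, $replacement, $node->nodeValue));
        $node->parentNode->replaceChild($text, $node);
    }

    // Serialize only the children of <body>, so no wrapper markup is returned.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo replaceOutsideLinks(
    '<p>Match this text</p><p>Don\'t <a href="/">match this text</a></p>',
    'match this text',
    'MATCH'
);
```

Because the meta tag ends up in the head element, it does not appear in the returned fragment.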

References:

1. find and replace keywords by hyperlinks in an html fragment, via php dom

2. Regex / DOMDocument - match and replace text not in a link

3. php problem with russian language

4. Why Does DOM Change Encoding?

I read dozens of answers on the subject, so I am sorry if I forgot somebody (please leave a comment and I will add yours as well).

Thanks to Gordon and stillstanding for commenting on my other answer.

Use DOMDocument to replace text with href link

As far as I can see, a ' is misplaced:

<a href='http://test.de>test</a>'

That is not what you want. Replace it with this:

<a href='http://test.de'>test</a>

Judging by the linked question, you may also use preg_replace() instead of preg_replace_dom() in your last line of code.

Hope this helps.

PHP Regex replace link if it does not have data attribute

The DOM extension (DOMDocument) is enabled by default in PHP. It is presumably faster than a regex and is designed for exactly what you are trying to achieve. You can use it to load your document and find every link whose data-link attribute is not set to 'keepLink', like this:

$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.example.com'); // load the file

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[not(@data-link=\'keepLink\')]'); // search for links that do not have the 'data-link' attribute set to 'keepLink'

foreach($nodes as $element){
    $textInside = $element->nodeValue; // get the text inside the link
    $parentNode = $element->parentNode; // save parent node
    $parentNode->replaceChild(new DOMText($textInside), $element); // replace the link with a plain text node
}

$myNewHTML = $dom->saveHTML(); // see http://php.net/manual/ro/domdocument.savehtml.php for limitations such as auto-adding of doc-type

echo $myNewHTML;

Proof of concept: https://3v4l.org/ejatQ.

Please bear in mind that this keeps only the text content of each replaced link (every link without a data-link='keepLink' attribute value); any markup nested inside those links is dropped.
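
If the markup nested inside the removed links should survive, one option is to move the link's children in front of it before removing it. This is a minimal sketch, not part of the original answer; unwrapLinks() is a hypothetical helper name and the sample HTML is made up for illustration.

```php
<?php
// Hypothetical helper: removes every <a> whose data-link attribute is not
// 'keepLink', but keeps the link's children (text and nested elements).
function unwrapLinks(DOMDocument $dom): void
{
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query("//a[not(@data-link='keepLink')]") as $link) {
        // Move each child in front of the link, then drop the now-empty link.
        while ($link->firstChild !== null) {
            $link->parentNode->insertBefore($link->firstChild, $link);
        }
        $link->parentNode->removeChild($link);
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<p><a href="/a">plain <strong>bold</strong></a> and <a href="/b" data-link="keepLink">kept</a></p>');
unwrapLinks($dom);
echo $dom->saveHTML();
```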

Replace all link tags containing given href attribute with Regex or DOM

OK, so here you are:

<?php

$html_text = '
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />
<link rel="alternate" type="application/rss+xml" title="Website » Feed" href="/feed/" />
<link rel="stylesheet"  href="http://www.example.com/css/css.css?ver=2.70" type="text/css" media="all" /></head>
<body>...some content...
<link rel="stylesheet" id="css"  href="style.css?ver=3.8.1" type="text/css" media="all" />
</body></html>
';

$d = new DOMDocument();
@$d->loadHTML($html_text);
$xpath = new DOMXPath($d);
$result = $xpath->query("//link");

foreach ($result as $link)
{
    $href = $link->getAttribute("href");

    // strict comparison against the href you want to filter out
    if ($href === "whatyouwanttofilter")
    {
        $link->parentNode->removeChild($link);
    }

}

$output= $d->saveHTML();
echo $output;

?>

Tested and working. Have fun! :-)

The general idea is:

1. Load your HTML into a DOMDocument.
2. Look for link nodes using XPath.
3. Loop through the nodes.
4. Depending on the node's href attribute, delete the node (actually, you remove the child from its parent; that's how the PHP DOM API works).
5. After all the cleaning up, save the HTML back into a string.
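
The steps above could be wrapped in a small function, roughly like this. It is only a sketch: removeLinkTags() is a hypothetical name, and it uses libxml_use_internal_errors() instead of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: removes every <link> element whose href equals $href.
function removeLinkTags(string $html, string $href): string
{
    $previous = libxml_use_internal_errors(true); // collect warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);                         // step 1: load the HTML
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//link[@href]') as $link) {   // steps 2 and 3: find and loop
        if ($link->getAttribute('href') === $href) {
            $link->parentNode->removeChild($link);        // step 4: remove via the parent
        }
    }

    return $doc->saveHTML();                       // step 5: back to a string
}

echo removeLinkTags(
    '<html><head><link rel="stylesheet" href="a.css"><link rel="stylesheet" href="b.css"></head><body>x</body></html>',
    'a.css'
);
```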

Regular Expression for Replacing Content Not Inside HTML Tags

Don't use regexes to parse HTML. Use the PHP DOM:

$DOM = new DOMDocument;
$DOM->loadHTML($str); // Your HTML

//get all tds
$cells = $DOM->getElementsByTagName('td');

// Do stuff to the cells

//get all paragraphs
$paragraphs = $DOM->getElementsByTagName('p');

// Do stuff to the paragraphs

// Etc...
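
As an illustration of what "do stuff to the cells" might look like, here is a sketch that replaces a word only in text nodes inside td elements and leaves tags and other elements alone. replaceInCells() and the sample markup are assumptions made for the example, not part of the original answer.

```php
<?php
// Hypothetical helper: replaces $search with $replacement in table cells only.
function replaceInCells(string $html, string $search, string $replacement): string
{
    $DOM = new DOMDocument;
    $DOM->loadHTML($html);

    $xpath = new DOMXPath($DOM);
    // Every text node that has a <td> ancestor; tag names and attributes are never touched.
    foreach ($xpath->query('//td//text()') as $text) {
        $text->data = str_replace($search, $replacement, $text->data);
    }

    return $DOM->saveHTML();
}

echo replaceInCells(
    '<table><tr><td>foo</td><td><b>foo</b> bar</td></tr></table><p>foo</p>',
    'foo',
    'baz'
);
```

The paragraph outside the table keeps its original text, because the XPath expression only selects text below td elements.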

php DOMDocument preg_replace fail detect

I tried very, VERY hard to implement a DOMDocument+Xpath solution, but I came unstuck while trying to disqualify the text node within the square-tagged caption block. I couldn't manage to isolate the whole caption block to be able to exclude it. In the end, here is a caveman's regex approach to serve as a band-aid until someone smarter can solve this problem properly.

The regex matches the blacklisted tags in the text and discards them; it only replaces text that is not disqualified.

Code:

$tags = ["拜登", "认真"];
$blacklisted = implode(
    '|',
    array_map(
        fn($tag) => "<{$tag}[ >].+?" . ($tag === 'img' ? "/>" : "</$tag>"),
        ['a', 'img', 'iframe', 'figure', 'figcaption']
    )
);
echo preg_replace(
         sprintf('~(?:\[caption[ \]].+?\[/caption]|%s)(*SKIP)(*FAIL)|%s~us', $blacklisted, implode('|', $tags)),
         '<span class="article-tag"><a class="mytag" href="http://outside.com">$0</a></span>',
         $html
     );
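
To see the (*SKIP)(*FAIL) idea in isolation, here is a stripped-down sketch with a single blacklisted tag. boldOutsideLinks() and the sample string are made up for illustration; the pattern follows the same structure as the one above.

```php
<?php
// Hypothetical helper: wraps $word in <b> tags unless it occurs inside an <a> element.
function boldOutsideLinks(string $html, string $word): string
{
    // First alternative: consume a whole <a>...</a> block, then discard it and
    // continue after it. Second alternative: the word we actually want to replace.
    $pattern = '~<a[ >].+?</a>(*SKIP)(*FAIL)|' . preg_quote($word, '~') . '~us';

    return preg_replace($pattern, '<b>$0</b>', $html);
}

echo boldOutsideLinks('cat <a href="/cat">cat</a> cat', 'cat');
// prints: <b>cat</b> <a href="/cat">cat</a> <b>cat</b>
```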

Regex with DOMDocument in PHP

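A quick scan of the underlying engine code shows that functions called through php:function do not support pass-by-reference, so preg_match() cannot be called directly to fill $matches.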

To get around that, use your own wrapper:

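$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPHPFunctions('match_href');
// '//a' searches the whole document; adjust the path if you query relative to a context node
$links = $xpath->query("//a[php:functionString('match_href', @href)]/@href");

// Named match_href() because `match` is a reserved keyword as of PHP 8.0
function match_href($href) {
    $regex = '~\?v=([^&]+)~';
    $rc = preg_match($regex, $href, $matches);
    if ($rc === 1) {
        var_dump($matches[1]); // store this somewhere
    }
    return $rc === 1; // return a real boolean so the predicate is evaluated as true/false
}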

See it live on 3v4l.org.
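
For a self-contained variant that can be run as is, something like the following might work. It is a sketch under assumptions: the class VideoIds, the functions match_video_id() and findVideoHrefs(), and the sample HTML are all hypothetical names and data chosen for the example.

```php
<?php
// Hypothetical storage for the captured ids, since the callback can only
// return a value to XPath and cannot fill a variable by reference.
final class VideoIds
{
    public static array $found = [];
}

// Wrapper around preg_match(); returns a boolean for the XPath predicate.
function match_video_id(string $href): bool
{
    if (preg_match('~\?v=([^&]+)~', $href, $m) === 1) {
        VideoIds::$found[] = $m[1];
        return true;
    }
    return false;
}

// Returns the href values of all links that contain a v= query parameter.
function findVideoHrefs(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('php', 'http://php.net/xpath');
    $xpath->registerPHPFunctions('match_video_id');

    $hrefs = [];
    foreach ($xpath->query("//a[php:functionString('match_video_id', @href)]/@href") as $attr) {
        $hrefs[] = $attr->value;
    }
    return $hrefs;
}

print_r(findVideoHrefs('<p><a href="/watch?v=abc123&amp;t=5">one</a> <a href="/about">two</a></p>'));
print_r(VideoIds::$found);
```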
