Regex / DOMDocument - match and replace text not in a link
Here is a UTF-8-safe solution that works not only with well-formed documents but also with document fragments.
mb_convert_encoding() is needed because loadHTML() seems to have a bug with UTF-8 encoding (see references 3 and 4 below). Note that the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
mb_substr() trims the body tag from the output, so you get back your original content without any additional markup.
<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is <a href="#">a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';
$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//text()[not(ancestor::a)]') as $node)
{
    $replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
    $newNode = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
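If you would rather not depend on the deprecated 'HTML-ENTITIES' conversion, a variant along these lines might work. It is only a sketch: the function name replaceOutsideLinks() is made up for illustration, it encodes non-ASCII characters with mb_encode_numericentity() before parsing, and it writes the replacement back as plain text, so unlike appendXML() above the replacement cannot contain markup.

```php
<?php
// Hypothetical helper (not part of the original answer): replace text that is
// not inside an <a> element, treating the replacement as plain text.
function replaceOutsideLinks(string $html, string $search, string $replace): string
{
    $dom = new DOMDocument();
    // Turn every non-ASCII character into a numeric entity so loadHTML()
    // cannot misinterpret the UTF-8 bytes.
    $encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
    $dom->loadHTML($encoded);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
        // Assigning to ->data changes the text node in place.
        $node->data = str_ireplace($search, $replace, $node->data);
    }

    // Serialize only the children of <body>, so no wrapper markup is returned.
    $body = $dom->getElementsByTagName('body')->item(0);
    $output = '';
    foreach ($body->childNodes as $child) {
        $output .= $dom->saveHTML($child);
    }
    return $output;
}

echo replaceOutsideLinks(
    '<p>Match this text</p><p>Keep <a href="/">match this text</a></p>',
    'match this text',
    'MATCH'
);
```

Serializing the children of the body one by one replaces the mb_substr() trimming, at the cost of a little more code.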
References:
1. find and replace keywords by hyperlinks in an html fragment, via php dom
2. Regex / DOMDocument - match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?
I read dozens of answers on the subject, so I am sorry if I left anybody out (please leave a comment and I will add your answer as well).
Thanks to Gordon and stillstanding for commenting on my other answer.
Use DOMDocument to replace text with href link
As far as I can see, a ' is misplaced:
<a href='http://test.de>test</a>'
This is not what you want. Replace it with this:
<a href='http://test.de'>test</a>
Also, from what I have seen, you can use preg_replace() instead of preg_replace_dom() in your last line of code.
Hope this helps.
PHP Regex replace link if it does not have data attribute
The DOMDocument extension is available by default in PHP. It is presumably faster and is designed exactly for what you are trying to achieve. You can use it to load your document and search for any links whose data-link attribute is not set to keepLink, like this:
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.example.com'); // load the file
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[not(@data-link=\'keepLink\')]'); // search for links that do not have the 'data-link' attribute set to 'keepLink'
foreach ($nodes as $element) {
    $textInside = $element->nodeValue; // get the text inside the link
    $parentNode = $element->parentNode; // save parent node
    $parentNode->replaceChild(new DOMText($textInside), $element); // remove the element
}
$myNewHTML = $dom->saveHTML(); // see http://php.net/manual/ro/domdocument.savehtml.php for limitations such as auto-adding of doc-type
echo $myNewHTML;
Proof of concept: https://3v4l.org/ejatQ.
Please bear in mind that this keeps only the text content of each link that lacks data-link='keepLink'; any markup nested inside such a link is dropped.
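If the nested markup matters, one possible alternative is to unwrap the link instead of replacing it with its text. The sketch below assumes a made-up helper name, unwrapLinks(), and a small sample document; it moves each child of the link in front of the link and then removes the empty link.

```php
<?php
// Hypothetical helper: unwrap every <a> that lacks data-link="keepLink",
// keeping its child nodes (text and nested elements) in place.
function unwrapLinks(DOMDocument $dom): void
{
    $xpath = new DOMXPath($dom);
    // Copy the result to an array so the list is stable while nodes are moved.
    $links = iterator_to_array($xpath->query("//a[not(@data-link='keepLink')]"));
    foreach ($links as $link) {
        // Move each child in front of the link, then drop the now-empty link.
        while ($link->firstChild !== null) {
            $link->parentNode->insertBefore($link->firstChild, $link);
        }
        $link->parentNode->removeChild($link);
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<p><a href="/a">plain <strong>bold</strong></a> and <a href="/b" data-link="keepLink">kept</a></p>');
unwrapLinks($dom);

// Serialize just the paragraph to keep the output short.
$result = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));
echo $result;
```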
Replace all link tags containing given href attribute with Regex or DOM
OK, so here you are:
<?php
$html_text = '
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />
<link rel="alternate" type="application/rss+xml" title="Website » Feed" href="/feed/" />
<link rel="stylesheet" href="http://www.example.com/css/css.css?ver=2.70" type="text/css" media="all" /></head>
<body>...some content...
<link rel="stylesheet" id="css" href="style.css?ver=3.8.1" type="text/css" media="all" />
</body></html>
';
$d = new DOMDocument();
@$d->loadHTML($html_text);
$xpath = new DOMXPath($d);
$result = $xpath->query("//link");
foreach ($result as $link)
{
    $href = $link->getAttribute("href");
    if ($href == "whatyouwanttofilter")
    {
        // a node is deleted by removing it from its parent
        $link->parentNode->removeChild($link);
    }
}
$output= $d->saveHTML();
echo $output;
?>
Tested and working. Have fun! :-)
The general idea is:
- Load your HTML into a DOMDocument
- Look for link nodes, using XPath
- Loop through the nodes
- Depending on the node's href attribute, delete the node (actually, remove the child from its... parent - well, yep, that's the php way... lol)
- After doing all the cleaning-up, re-save the HTML and get it back into a string
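The same steps can also be folded into one function that lets XPath do the filtering. This is just a sketch: removeLinksByHref() is a made-up name, and it assumes the href value contains no double quote, since the value is pasted into the XPath expression as is.

```php
<?php
// Hypothetical helper: remove every <link> element whose href equals $href.
// Assumption: $href contains no double quote character.
function removeLinksByHref(string $html, string $href): string
{
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Copy the matches to an array before removing anything.
    $links = iterator_to_array($xpath->query('//link[@href="' . $href . '"]'));
    foreach ($links as $link) {
        $link->parentNode->removeChild($link);
    }
    return $dom->saveHTML();
}

$page = '<html><head><link rel="stylesheet" href="a.css">'
      . '<link rel="stylesheet" href="b.css"></head><body><p>x</p></body></html>';
echo removeLinksByHref($page, 'a.css');
```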
Regular Expression for Replacing Content Not Inside HTML Tags
Don't use regexes to parse HTML. Use the PHP DOM:
$DOM = new DOMDocument;
$DOM->loadHTML($str); // Your HTML
//get all tds
$cells = $DOM->getElementsByTagName('td');
// Do stuff to the cells
//get all paragraphs
$paragraphs = $DOM->getElementsByTagName('p');
// Do stuff to the paragraphs
// Etc...
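To make the "do stuff" part concrete, here is a rough sketch that changes a word only in the text of td and p elements and leaves attribute values alone. The sample markup and the word being replaced are invented for the example.

```php
<?php
// Sample input (made up): the word "price" appears in text and in attributes.
$str = '<table><tr><td class="price">price: 10</td></tr></table>'
     . '<p title="price">The price is low.</p>';

$DOM = new DOMDocument;
$DOM->loadHTML($str);

// Select only text nodes below <td> and <p>; attributes are never touched.
$xpath = new DOMXPath($DOM);
foreach ($xpath->query('//td//text() | //p//text()') as $textNode) {
    $textNode->data = preg_replace('/\bprice\b/', 'cost', $textNode->data);
}

// Serialize the body only, to skip the doctype that loadHTML() adds.
$result = $DOM->saveHTML($DOM->getElementsByTagName('body')->item(0));
echo $result;
```

The regex still does the matching, but it only ever sees text content, never tags or attributes.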
php DOMDocument preg_replace fail detect
I tried very, VERY hard to implement a DOMDocument+Xpath solution, but I came unstuck while trying to disqualify the text node within the square-tagged caption block. I couldn't manage to isolate the whole caption block to be able to exclude it. In the end, here is a caveman's regex approach to serve as a band-aid until someone smarter can solve this problem properly.
The regex matches the blacklisted tags in the text and discards them; it only replaces text that is not disqualified.
Code:
// $html is assumed to hold the markup from the question
$tags = ["拜登", "认真"];
$blacklisted = implode(
    '|',
    array_map(
        // build "<tag ...>...</tag>" (or "<img ... />") for every tag to skip
        fn($tag) => "<{$tag}[ >].+?" . ($tag === 'img' ? "/>" : "</$tag>"),
        ['a', 'img', 'iframe', 'figure', 'figcaption']
    )
);
echo preg_replace(
    sprintf('~(?:\[caption[ \]].+?\[/caption]|%s)(*SKIP)(*FAIL)|%s~us', $blacklisted, implode('|', $tags)),
    '<span class="article-tag"><a class="mytag" href="http://outside.com">$0</a></span>',
    $html
);
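To see the (*SKIP)(*FAIL) idea in isolation, here is a cut-down sketch with a single skipped tag and a single keyword. The sample string and the keyword are invented for the example.

```php
<?php
// Sample input (made up): "php" appears outside and inside a link.
$html = '<p>php is fun. <a href="/php">php docs</a> More php.</p>';

// First alternative: consume a whole <a>...</a> block, then discard it with
// (*SKIP)(*FAIL) so nothing inside it can match. Second alternative: the keyword.
$pattern = '~<a[ >].+?</a>(*SKIP)(*FAIL)|php~us';

$result = preg_replace($pattern, '<b>$0</b>', $html);
echo $result;
// prints: <p><b>php</b> is fun. <a href="/php">php docs</a> More <b>php</b>.</p>
```

Both the link text and the href value are left alone because the whole link is skipped as one unit.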
Regex with DOMDocument in PHP
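A quick scan of the underlying engine code shows that it does not support pass-by-reference, so a function such as preg_match() cannot hand its $matches back when it is called directly from the XPath expression.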
To get around that, use your own wrapper:
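$xpath->registerNamespace('php', 'http://php.net/xpath');
// "match" is a reserved word since PHP 8.0, so the wrapper uses another name
$xpath->registerPHPFunctions('match_href');
$links = $xpath->query("a[php:functionString('match_href', @href)]/@href");
function match_href($href) {
    $regex = '~\?v=([^&]+)~';
    $rc = preg_match($regex, $href, $matches);
    if ($rc === 1) {
        var_dump($matches[1]); // store this somewhere
    }
    // return a boolean so XPath treats the result as a filter, not as a position
    return $rc === 1;
}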
See it live on 3v4l.org.
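For a self-contained version that actually stores the captures, something like the following might do. It is a sketch under assumptions: the function name capture_video_id(), the $GLOBALS['videoIds'] list and the sample markup are all made up for illustration.

```php
<?php
// Captured values end up here (hypothetical storage for this sketch).
$GLOBALS['videoIds'] = [];

// Wrapper around preg_match(): it keeps the capture itself because the
// XPath bridge cannot pass $matches by reference.
function capture_video_id(string $href): bool
{
    if (preg_match('~\?v=([^&]+)~', $href, $matches) === 1) {
        $GLOBALS['videoIds'][] = $matches[1];
        return true; // boolean result: the node passes the filter
    }
    return false;
}

$dom = new DOMDocument();
$dom->loadHTML('<p><a href="https://example.com/watch?v=abc123&amp;t=5">one</a> <a href="/about">two</a></p>');

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPHPFunctions('capture_video_id');

// Keep only the href attributes for which the wrapper returned true.
$links = $xpath->query("//a[php:functionString('capture_video_id', @href)]/@href");

print_r($GLOBALS['videoIds']);
```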