Find and Replace Keywords by Hyperlinks in an HTML Fragment, via PHP Dom

That's somewhat tricky, but you could do it this way:

$html = <<< HTML
<div><p>The CEO of the Dexia bank <em>has</em> just decided to retire.</p></div>
HTML;

I've added an emphasis element just to illustrate that it works with inline elements too.

Setup

$dom = new DOMDocument;
$dom->formatOutput = TRUE;
$dom->loadXML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//text()[contains(., "Dexia")]');

The interesting part above is, of course, the XPath. It queries the loaded DOM for all DOMText nodes containing the needle "Dexia". The result is a DOMNodeList (as usual).

The replacement

foreach ($nodes as $node) {
    $link = '<a href="info.php?tag=dexia">Dexia</a>';
    $replaced = str_replace('Dexia', $link, $node->wholeText);
    $newNode = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}
echo $dom->saveXML($dom->documentElement);

The found $node contains only the string "The CEO of the Dexia bank " as its wholeText, not the full paragraph text, even though it sits inside the P element. That is because the text node ends where its sibling DOMElement (the emphasis after "bank") begins. I create the link as a string instead of a node and replace all occurrences of "Dexia" in wholeText with it (regardless of word boundaries; handling those would be a good job for a regex). Then I create a DocumentFragment from the resulting string and replace the DOMText node with it.
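The word-boundary idea mentioned above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the sample markup is made up, and escaping the text with htmlspecialchars() before appendXML() is my own assumption about how to keep the string well-formed XML when the text contains an ampersand.

```php
<?php
// Hypothetical sample input: "Dexiabank" must NOT be linked, "Dexia" must be.
$dom = new DOMDocument;
$dom->loadXML('<div><p>Dexia &amp; co: the Dexiabank is not Dexia.</p></div>');
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//text()[contains(., "Dexia")]') as $node) {
    // nodeValue is decoded text, so re-escape it before treating it as XML
    $escaped = htmlspecialchars($node->nodeValue, ENT_XML1 | ENT_NOQUOTES, 'UTF-8');
    // \b restricts the match to the whole word; $0 is the matched text
    $replaced = preg_replace('~\bDexia\b~', '<a href="info.php?tag=dexia">$0</a>', $escaped);

    $fragment = $dom->createDocumentFragment();
    $fragment->appendXML($replaced);
    $node->parentNode->replaceChild($fragment, $node);
}

$out = $dom->saveXML($dom->documentElement);
echo $out;
```

Because the text is escaped before the regex runs, a keyword that happens to equal an entity name (such as "amp") would need extra care; the sketch does not handle that case.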

W3C vs PHP

Using DOMDocumentFragment::appendXML() is a non-standard approach, because the method is not part of the W3C DOM specs.

If you want to do the replacement with the standard API, you first have to create the A element as a new DOMElement. Then you have to find the offset of "Dexia" in the nodeValue of the DOMText and split the DOMText node into two nodes at that position. Remove "Dexia" from the returned sibling and insert the link element before that sibling. Repeat this procedure with the sibling node until no more "Dexia" strings are found in it. Here is how to do it for one occurrence of "Dexia":

foreach ($nodes as $node) {
    $link = $dom->createElement('a', 'Dexia');
    $link->setAttribute('href', 'info.php?tag=dexia');
    $offset = strpos($node->nodeValue, 'Dexia');
    $newNode = $node->splitText($offset);
    $newNode->deleteData(0, strlen('Dexia'));
    $node->parentNode->insertBefore($link, $newNode);
}
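The "repeat this procedure with the sibling node" step is not shown above. A minimal sketch of it might look like this; the sample markup with three occurrences is invented for the illustration, and strpos()/strlen() are used on the assumption that the text is plain ASCII (the DOM methods count characters, so multibyte text would call for mb_strpos() and mb_strlen() instead).

```php
<?php
// Hypothetical input with several occurrences, one of them inside <em>.
$dom = new DOMDocument;
$dom->loadXML('<div><p>Dexia and Dexia again, says <em>Dexia</em>.</p></div>');
$xpath = new DOMXPath($dom);
$needle = 'Dexia';

foreach ($xpath->query('//text()[contains(., "Dexia")]') as $node) {
    // Keep splitting until the remaining text no longer contains the needle.
    while (($offset = strpos($node->nodeValue, $needle)) !== false) {
        $link = $dom->createElement('a');
        $link->setAttribute('href', 'info.php?tag=dexia');
        $link->appendChild($dom->createTextNode($needle));

        $rest = $node->splitText($offset);       // $rest now starts with the needle
        $rest->deleteData(0, strlen($needle));   // drop the needle from the text
        $node->parentNode->insertBefore($link, $rest);

        $node = $rest;                           // continue with the remainder
    }
}

$out = $dom->saveXML($dom->documentElement);
echo $out;
```

Reassigning $node inside the loop is safe here because the XPath result list is a snapshot; the text nodes created by splitText() are not added to it.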

And finally the output

<div>
<p>The CEO of the <a href="info.php?tag=dexia">Dexia</a> bank <em>has</em> just decided to retire.</p>
</div>

How to replace specific text with hyperlinks without modifying pre-existing img and a tags?

I think @Jiwoks' answer was on the right path with using dom parsing calls to isolate the qualifying text nodes.

While his answer works on the OP's sample data, I was unsatisfied to find that his solution failed when there was more than one string to be replaced in a single text node.

I've crafted my own solution with the goal of accommodating case-insensitive matching, word-boundary, multiple replacements in a text node, and fully qualified nodes being inserted (not merely new strings that look like child nodes).

Code: (Demo #1 with 2 replacements in a text node) (Demo #2: with OP's text)
(After receiving fuller, more realistic text from the OP: Demo #3 without trimming saveHTML())

$html = <<<HTML
Meet God's General Kathryn Kuhlman. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517" />
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
Max KANTCHEDE & Kathryn Kuhlman
HTML;

$keywords = [
'Kathryn Kuhlman' => 'https://www.example.com/en-354',
'Max KANTCHEDE' => 'https://www.example.com/MaxKANTCHEDE',
'eneral' => 'https://www.example.com/this-is-not-used',
];

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

// Build a lowercase lookup table and a list of regex-safe needles
$lookup = [];
$regexNeedles = [];
foreach ($keywords as $name => $link) {
    $lookup[strtolower($name)] = $link;
    $regexNeedles[] = preg_quote($name, '~');
}
$pattern = '~\b(' . implode('|', $regexNeedles) . ')\b~i';

foreach ($xpath->query('//*[not(self::img or self::a)]/text()') as $textNode) {
    $newNodes = [];
    $hasReplacement = false;
    // Split into keyword and non-keyword pieces, keeping the keywords
    foreach (preg_split($pattern, $textNode->nodeValue, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) as $fragment) {
        $fragmentLower = strtolower($fragment);
        if (isset($lookup[$fragmentLower])) {
            $hasReplacement = true;
            $a = $dom->createElement('a');
            $a->setAttribute('href', $lookup[$fragmentLower]);
            $a->setAttribute('title', $fragment);
            // a text node keeps characters such as & literal
            $a->appendChild($dom->createTextNode($fragment));
            $newNodes[] = $a;
        } else {
            $newNodes[] = $dom->createTextNode($fragment);
        }
    }
    // Only touch the document when something was actually replaced
    if ($hasReplacement) {
        $newFragment = $dom->createDocumentFragment();
        foreach ($newNodes as $newNode) {
            $newFragment->appendChild($newNode);
        }
        $textNode->parentNode->replaceChild($newFragment, $textNode);
    }
}
// Strip the <p> and </p> that DOMDocument wrapped around the loose text
echo substr(trim($dom->saveHTML()), 3, -4);

Output:

Meet God's General <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517">
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
<a href="https://www.example.com/MaxKANTCHEDE" title="Max KANTCHEDE">Max KANTCHEDE</a> &amp; <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>

Some explanatory points:

  • I am using some DOMDocument silencing and flags because the sample input is missing a parent tag to contain all of the text. (There is nothing wrong with @Jiwoks' technique, this is just a different one -- choose whatever you like.)
  • A lookup array with lowercased keys is declared to allow case-insensitive translations on qualifying text.
  • The regex pattern is constructed dynamically, so each keyword should be passed through preg_quote() to ensure that the pattern logic is upheld. \b is a word boundary metacharacter that prevents matching a substring inside a longer word. Notice that eneral is not replaced in General in the output. The case-insensitive flag i allows greater flexibility for this application and future applications.
  • My XPath query is identical to @Jiwoks'; I see no reason to change it. It is seeking text nodes that are not the children of <img> or <a> tags.

...now it gets a little fiddly... Now that we are dealing with isolated text nodes, regex can be used to differentiate qualifying strings from non-qualifying strings.

  • preg_split() is creating a flat, indexed array of non-empty substrings. Substrings which qualify for translation will be isolated as elements and if there are any non-qualifying substrings, they will be isolated elements.

    • The final text node in my sample will generate 4 elements:

      0 => '
      ', // non-qualifying newline
      1 => 'Max KANTCHEDE', // translatable string
      2 => ' & ', // non-qualifying text
      3 => 'Kathryn Kuhlman' // translatable string
  • For translatable strings, new <a> nodes are created and filled with the appropriate attributes and text, then pushed into a temporary array.

  • For non-translatable strings, text nodes are created, then pushed into a temporary array.

  • If any translations/replacements have been done, then dom is updated; otherwise, no mutation of the document is necessary.

  • In the end, the finalized html document is echoed, but because your sample input has some text that is not inside of tags, the temporary leading <p> and trailing </p> tag that DomDocument applied for stability must be removed to restore the structure to its original form. If all text is enclosed in tags, you can just use saveHTML() without any hacking at the string.
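To make the splitting step described above easier to picture in isolation, here is a small sketch that runs the same kind of pattern over a single string. The sample sentence is made up for the illustration; the keywords are the ones from the lookup array above.

```php
<?php
// Same keyword names as in the answer; the sample text is hypothetical.
$keywords = ['Kathryn Kuhlman', 'Max KANTCHEDE', 'eneral'];

$needles = [];
foreach ($keywords as $keyword) {
    // Escape regex metacharacters, including the ~ delimiter
    $needles[] = preg_quote($keyword, '~');
}
$pattern = '~\b(' . implode('|', $needles) . ')\b~i';

$text = 'Max KANTCHEDE & Kathryn Kuhlman, a General';

// DELIM_CAPTURE keeps the matched keywords; NO_EMPTY drops empty pieces
$parts = preg_split($pattern, $text, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
print_r($parts);
```

The result has four elements: 'Max KANTCHEDE', ' & ', 'Kathryn Kuhlman' and ', a General'. The last piece stays intact because there is no word boundary between "G" and "eneral".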

Regex / DOMDocument - match and replace text not in a link

Here is a UTF-8-safe solution, which works not only with properly formatted documents but also with document fragments.

The mb_convert_encoding() call is needed because loadHTML() seems to have a bug with UTF-8 encoding (see references 3 and 4 below).

The mb_substr() call trims the body tag from the output, so you get back your original content without any additional markup.

<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is <a href="#">a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';

$dom = new DOMDocument();
// loadXML needs properly formatted documents, so it's better to use loadHTML, but it needs a hack to properly handle UTF-8 encoding
// (note: the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2)
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));

$xpath = new DOMXPath($dom);

foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
    $replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
    $newNode = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}

// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
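
One thing to watch with the code above: appendXML() parses its argument as XML, so a text node whose decoded text contains a bare & or < would likely be rejected. When the replacement is plain text rather than markup, a variant that swaps in an ordinary text node avoids this. The following is only a sketch with made-up ASCII sample input; it leaves the UTF-8 handling discussed above aside.

```php
<?php
// Hypothetical fragment: the first paragraph contains an ampersand.
$html = '<p>Fish &amp; chips: match this text</p>'
      . '<p>Don\'t <a href="/">match this text</a></p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
    $replaced = str_ireplace('match this text', 'MATCH', $node->nodeValue);
    if ($replaced !== $node->nodeValue) {
        // createTextNode() takes the string literally; no XML parsing involved
        $node->parentNode->replaceChild($dom->createTextNode($replaced), $node);
    }
}

// Serialize only the children of <body> to get the fragment back
$out = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
echo $out;
```

Serializing the children of body one by one is an alternative to trimming the body tag with mb_substr().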

References:

1. find and replace keywords by hyperlinks in an html fragment, via php dom

2. Regex / DOMDocument - match and replace text not in a link

3. php problem with russian language

4. Why Does DOM Change Encoding?

I read dozens of answers on the subject, so I am sorry if I forgot somebody (please leave a comment and I will add yours as well).

Thanks to Gordon and stillstanding for commenting on my other answer.

How can I replace all the occurrences of the $keyword within the string without replacing the keywords found within URLs and image tags?

Working purely with strings, I would do the following. Because the attribute values always come before the element's text content, it is easy to get the right match; then just use a callback to replace 'sports' with whatever you like.

This is probably closer to what you need:

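function replacer($match)
{
    global $replace_match_with_this, $string_to_replace;
    // swap the keyword only inside the matched link text
    return str_ireplace($string_to_replace, $replace_match_with_this, $match[0]);
}

// preg_quote() keeps regex metacharacters in $keyword from altering the pattern
$pattern = sprintf('/>[^<>]*[\s-]+%s[\s-]+[^<>]*<\/a>/i', preg_quote($keyword, '/'));
$new_string = preg_replace_callback($pattern, 'replacer', $string, 1);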

Presumably $keyword and $string_to_replace hold the same value and can be combined into one variable.


