Find All Hrefs in Page and Replace with Link Maintaining Previous Link - PHP

Use PHP's DomDocument to parse the page

$doc = new DOMDocument();

// load the string into the DOM (this is your page's HTML), see below for more info
$doc->loadHTML('<a href="http://www.google.com">Google</a>');

// Loop through each <a> tag in the DOM and change the href attribute
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $link = $anchor->getAttribute('href');
    $link = 'http://www.example.com/?loadpage=' . urlencode($link);
    $anchor->setAttribute('href', $link);
}
echo $doc->saveHTML();

Check it out here: http://codepad.org/9enqx3Rv

If you don't have the HTML as a string, you can use cURL to fetch it, or you can load it directly with the loadHTMLFile method of DOMDocument.
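As a rough illustration of the loadHTMLFile route, here is a minimal sketch. The helper name rewriteLinksInFile and the temporary file are assumptions made for the example; only the DOMDocument calls come from the answer above.

```php
<?php
// Hypothetical helper: load an HTML file and route every <a href> through a prefix URL.
function rewriteLinksInFile(string $path, string $prefix): string
{
    $doc = new DOMDocument();
    // Collect parser warnings instead of printing them (real pages are rarely valid HTML).
    libxml_use_internal_errors(true);
    $doc->loadHTMLFile($path);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $original = $anchor->getAttribute('href');
        $anchor->setAttribute('href', $prefix . urlencode($original));
    }

    return $doc->saveHTML();
}

// Usage: write a small page to a temporary file, then rewrite its links.
$tmp = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($tmp, '<p><a href="http://www.google.com">Google</a></p>');
$result = rewriteLinksInFile($tmp, 'http://www.example.com/?loadpage=');
unlink($tmp);

echo $result;
```

Because loadHTMLFile parses a full document, saveHTML() returns the page wrapped in the doctype, html and body elements that the parser adds.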

Documentation

  • DOMDocument - http://php.net/manual/en/class.domdocument.php
  • DOMElement - http://www.php.net/manual/en/class.domelement.php
  • DOMElement::getAttribute - http://www.php.net/manual/en/domelement.getattribute.php
  • DOMElement::setAttribute - http://www.php.net/manual/en/domelement.setattribute.php
  • urlencode - http://php.net/manual/en/function.urlencode.php
  • DOMDocument::loadHTMLFile - http://www.php.net/manual/en/domdocument.loadhtmlfile.php
  • cURL - http://php.net/manual/en/book.curl.php

Replace all links in the body of html page using PHP

You can "decapitate" the page: find the <body> tag and split the markup into two variables, one for the head and one for the body. The links are then rewritten only in the body.

// Hypothetical base URL of the page being rewritten (the snippet uses $turl but never defines it)
$turl = 'http://example.com';

//$output = file_get_contents($turl);

$output = "<head> blablabla

Bla bla
</head>
<body>
Foobar
</body>";

// Decapitation: find the <body> tag and split the page into head and body
$head = substr($output, 0, strpos($output, "<body>"));
$body = substr($output, strpos($output, "<body>"));

$newOutput = str_replace('href="http', 'target="_parent" href="http://localhost/e/site.php?turl=http', $body);
$newOutput = str_replace('href="www.', 'target="_parent" href="http://localhost/e/site.php?turl=www.', $newOutput);
$newOutput = str_replace('href="/', 'target="_parent" href="http://localhost/e/site.php?turl='.$turl.'/', $newOutput);

echo $head . $newOutput;

https://3v4l.org/WYcYP
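The split above looks for the literal string <body>, so it would miss a tag written as <BODY> or carrying attributes. A slightly more tolerant sketch follows; the function name splitAtBody and the fallback of treating the whole input as body are assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: split a page into [head, body] at the opening body tag.
function splitAtBody(string $html): array
{
    // stripos() is case-insensitive and '<body' also matches tags with attributes.
    $pos = stripos($html, '<body');
    if ($pos === false) {
        // Assumption: with no body tag, treat everything as body.
        return ['', $html];
    }

    return [substr($html, 0, $pos), substr($html, $pos)];
}

[$head, $body] = splitAtBody(
    '<head><title>t</title></head><BODY class="x"><a href="/a">a</a></BODY>'
);

echo $head, "\n", $body, "\n";
```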

Alter all a href links in php

Once you have detected the URLs, you can use parse_url() and parse_str() to break each URL apart, add the utm and medium parameters, and rebuild it without caring too much about the content of the GET parameters or the hash:

$url_modifier_domain = preg_quote('add-link.com');

$html_text = preg_replace_callback(
    '#((?:https?:)?//'.$url_modifier_domain.'(/[^\'"\#]*)?)(?=[\'"\#])#i',
    function ($matches) {
        $link = $matches[0];
        if (strpos($link, '#') !== false) {
            list($link, $hash) = explode('#', $link);
        }
        $res = parse_url($link);

        $result = '';
        if (isset($res['scheme'])) {
            $result .= $res['scheme'].'://';
        }
        if (isset($res['host'])) {
            $result .= $res['host'];
        }
        if (isset($res['path'])) {
            $result .= $res['path'];
        }
        if (isset($res['query'])) {
            parse_str($res['query'], $res['query']);
        } else {
            $res['query'] = [];
        }

        $res['query']['utm'] = 'some';
        $res['query']['medium'] = 'stuff';

        if (count($res['query']) > 0) {
            $result .= '?'.http_build_query($res['query']);
        }
        if (isset($hash)) {
            $result .= '#'.$hash;
        }

        return $result;
    },
    $html
);

As you can see, the code is longer but simpler
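To see the parse_url() / parse_str() / http_build_query() round trip on a single URL, outside of any regex, here is a minimal sketch. The function name addQueryParams is hypothetical, and the sketch ignores ports and credentials for brevity.

```php
<?php
// Hypothetical helper: add query parameters to one URL, keeping its fragment.
function addQueryParams(string $url, array $extra): string
{
    // Set the fragment aside so it can be re-attached at the very end.
    $hash = null;
    if (strpos($url, '#') !== false) {
        [$url, $hash] = explode('#', $url, 2);
    }

    $parts = parse_url($url);

    // Turn the existing query string into an array and merge the new values in.
    $query = [];
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
    }
    $query = array_merge($query, $extra);

    $result = '';
    if (isset($parts['scheme'])) {
        $result .= $parts['scheme'] . '://';
    } elseif (isset($parts['host'])) {
        $result .= '//'; // scheme-relative URL
    }
    if (isset($parts['host'])) {
        $result .= $parts['host'];
    }
    $result .= $parts['path'] ?? '/';
    if ($query) {
        $result .= '?' . http_build_query($query);
    }
    if ($hash !== null) {
        $result .= '#' . $hash;
    }

    return $result;
}

echo addQueryParams(
    'http://add-link.com/try.php?test=1#hashed',
    ['utm' => 'some', 'medium' => 'stuff']
), "\n";
// prints http://add-link.com/try.php?test=1&utm=some&medium=stuff#hashed
```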

Edit
I made some changes so that the script now searches for every href="xxx" in the text. If the link does not point to add-link.com, the script leaves it unchanged; otherwise it rebuilds the link as cleanly as it can.

$html = 'blabla <a href="http://add-link.com/">a</a>
<a href="http://add-link.com/">a</a>
<a href="http://add-link.com/#hashed">a</a>
<a href="http://abcd.com/#hashed">a</a>
<a href="http://add-link.com/?test=1">a</a>
<a href="http://add-link.com/try.php">a</a>
<a href="http://add-link.com/try.php?test=1">a</a>
<a href="http://add-link.com/try.php#hashed">a</a>
<a href="http://add-link.com/try.php?test=1#hashed">a</a>
<a href="http://add-link.com/try.php?test=1#hashed">a</a>
<a href="//add-link.com?test=test" style="color: rgb(198, 156, 109);">a</a>
';

$url_modifier_domain = preg_quote('add-link.com');

$html_text = preg_replace_callback(
    '/href="([^"]+)"/i',
    function ($matches) {
        $link = $matches[1];

        // ignoring outer links
        if(strpos($link,'add-link.com') === false) return 'href="'.$link.'"';

        if (strpos($link, '#') !== false) {
            list($link, $hash) = explode('#', $link);
        }
        $res = parse_url($link);

        $result = '';
        if (isset($res['scheme'])) {
            $result .= $res['scheme'].'://';
        } else if(isset($res['host'])) {
            $result .= '//';
        }

        if (isset($res['host'])) {
            $result .= $res['host'];
        }
        if (isset($res['path'])) {
            $result .= $res['path'];
        } else {
            $result .= '/';
        }

        if (isset($res['query'])) {
            parse_str($res['query'], $res['query']);
        } else {
            $res['query'] = [];
        }

        $res['query']['utm'] = 'some';
        $res['query']['medium'] = 'stuff';

        if (count($res['query']) > 0) {
            $result .= '?'.http_build_query($res['query']);
        }
        if (isset($hash)) {
            $result .= '#'.$hash;
        }

        return 'href="'.$result.'"';
    },
    $html
);

var_dump($html_text);
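The strpos() check above treats any link that merely contains the text add-link.com as internal, including a foreign URL that carries it in a query string. If that matters, one possible refinement is to compare the parsed host instead. The function name isOnDomain and the decision to accept subdomains are assumptions for this sketch.

```php
<?php
// Hypothetical helper: true when the link's host is $domain or one of its subdomains.
function isOnDomain(string $link, string $domain): bool
{
    $host = parse_url($link, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // relative link or unparsable URL
    }

    $host = strtolower($host);
    $domain = strtolower($domain);
    $suffix = '.' . $domain;

    return $host === $domain || substr($host, -strlen($suffix)) === $suffix;
}

var_dump(isOnDomain('http://add-link.com/try.php', 'add-link.com'));       // bool(true)
var_dump(isOnDomain('//add-link.com?test=test', 'add-link.com'));          // bool(true)
var_dump(isOnDomain('http://abcd.com/?ref=add-link.com', 'add-link.com')); // bool(false)
```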

PHP file_get_contents - Replace all URLs in all a href= links

PHP code that reads the file and replaces the links:

<?php

$message = file_get_contents("myHTML.html");

$content = explode("\n", $message);

$URLs = array();

for ($i = 0; count($content) > $i; $i++)
{
    if (preg_match('/<a href=/', $content[$i]))
    {
        list($Gone, $Keep) = explode("href=\"", trim($content[$i]));
        list($Keep, $Gone) = explode("\">", $Keep);
        $message = strtr($message, array("$Keep" => "http://www.MyWesite.com/?link=$Keep"));
    }
}

echo $message;

?>

Replace all URLs in text to clickable links in PHP

function convert($input) {
    $pattern = '@(http(s)?://)?(([a-zA-Z])([-\w]+\.)+([^\s\.]+[^\s]*)+[^,.\s])@';
    return preg_replace($pattern, '<a href="http$2://$3">$0</a>', $input);
}


Replace all link tags containing given href attribute with Regex or DOM

OK, so here you are:

<?php

$html_text = '
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />
<link rel="alternate" type="application/rss+xml" title="Website » Feed" href="/feed/" />
<link rel="stylesheet" href="http://www.example.com/css/css.css?ver=2.70" type="text/css" media="all" /></head>
<body>...some content...
<link rel="stylesheet" id="css" href="style.css?ver=3.8.1" type="text/css" media="all" />
</body></html>
';

$d = new DOMDocument();
@$d->loadHTML($html_text);
$xpath = new DOMXPath($d);
$result = $xpath->query("//link");

foreach ($result as $link)
{
    $href = $link->getAttribute("href");

    if ($href == "whatyouwanttofilter")
    {
        $link->parentNode->removeChild($link);
    }
}

$output= $d->saveHTML();
echo $output;

?>

Tested and working. Have fun! :-)


The general idea is:

  • Load your HTML into a DOMDocument
  • Look for link nodes, using XPath
  • Loop through the nodes
  • Depending on the node's href attribute, delete the node (actually, remove the child from its... parent - well, yep, that's the php way... lol)
  • After doing all the cleaning-up, re-save the HTML and get it back into a string
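The steps above can be wrapped into a small reusable function. This is only a sketch: the name removeLinksByHref and the sample markup are assumptions. Removing nodes inside the loop is fine here because, as far as I know, the list returned by DOMXPath::query() is a snapshot rather than a live collection.

```php
<?php
// Hypothetical helper: drop every <link> element whose href equals $href.
function removeLinksByHref(string $html, string $href): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//link[@href]') as $link) {
        if ($link->getAttribute('href') === $href) {
            // A node is removed through its parent.
            $link->parentNode->removeChild($link);
        }
    }

    return $doc->saveHTML();
}

$html = '<html><head><link rel="stylesheet" href="a.css">'
      . '<link rel="stylesheet" href="b.css"></head><body><p>x</p></body></html>';

$out = removeLinksByHref($html, 'a.css');
echo $out;
```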

How to replace specific text with hyperlinks without modifying pre-existing img and a tags?

I think @Jiwoks' answer was on the right path with using dom parsing calls to isolate the qualifying text nodes.

While his answer works on the OP's sample data, I was unsatisfied to find that his solution failed when there was more than one string to be replaced in a single text node.

I've crafted my own solution with the goal of accommodating case-insensitive matching, word-boundary, multiple replacements in a text node, and fully qualified nodes being inserted (not merely new strings that look like child nodes).

Code:

$html = <<<HTML
Meet God's General Kathryn Kuhlman. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517" />
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
Max KANTCHEDE & Kathryn Kuhlman
HTML;

$keywords = [
'Kathryn Kuhlman' => 'https://www.example.com/en-354',
'Max KANTCHEDE' => 'https://www.example.com/MaxKANTCHEDE',
'eneral' => 'https://www.example.com/this-is-not-used',
];

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

$lookup = [];
$regexNeedles = [];
foreach ($keywords as $name => $link) {
    $lookup[strtolower($name)] = $link;
    $regexNeedles[] = preg_quote($name, '~');
}
$pattern = '~\b(' . implode('|', $regexNeedles) . ')\b~i';

foreach ($xpath->query('//*[not(self::img or self::a)]/text()') as $textNode) {
    $newNodes = [];
    $hasReplacement = false;
    foreach (preg_split($pattern, $textNode->nodeValue, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) as $fragment) {
        $fragmentLower = strtolower($fragment);
        if (isset($lookup[$fragmentLower])) {
            $hasReplacement = true;
            $a = $dom->createElement('a');
            $a->setAttribute('href', $lookup[$fragmentLower]);
            $a->setAttribute('title', $fragment);
            $a->nodeValue = $fragment;
            $newNodes[] = $a;
        } else {
            $newNodes[] = $dom->createTextNode($fragment);
        }
    }
    if ($hasReplacement) {
        $newFragment = $dom->createDocumentFragment();
        foreach ($newNodes as $newNode) {
            $newFragment->appendChild($newNode);
        }
        $textNode->parentNode->replaceChild($newFragment, $textNode);
    }
}
echo substr(trim($dom->saveHTML()), 3, -4);

Output:

Meet God's General <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517">
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
<a href="https://www.example.com/MaxKANTCHEDE" title="Max KANTCHEDE">Max KANTCHEDE</a> & <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>

Some explanatory points:

  • I am using some DomDocument silencing and flags because the sample input is missing a parent tag to contain all of the text. (There is nothing wrong with @Jiwoks' technique, this is just a different one -- choose whatever you like.)
  • A lookup array with lowercased keys is declared to allow case-insensitive translations on qualifying text.
  • A regex pattern is dynamically constructed, so each keyword should be passed through preg_quote() to ensure that the pattern logic is upheld. \b is a word boundary metacharacter that prevents matching a substring inside a longer word. Notice that eneral is not replaced in General in the output. The case-insensitive flag i allows greater flexibility for this application and future applications.
  • My xpath query is identical to @Jiwoks'; I see no reason to change it. It seeks text nodes that are not children of <img> or <a> tags.

...now it gets a little fiddly... Now that we are dealing with isolated text nodes, regex can be used to differentiate qualifying strings from non-qualifying strings.

  • preg_split() is creating a flat, indexed array of non-empty substrings. Substrings which qualify for translation will be isolated as elements and if there are any non-qualifying substrings, they will be isolated elements.

    • The final text node in my sample will generate 4 elements:

      0 => '
      ', // non-qualifying newline
      1 => 'Max KANTCHEDE', // translatable string
      2 => ' & ', // non-qualifying text
      3 => 'Kathryn Kuhlman' // translatable string
  • For translatable strings, new <a> nodes are created and filled with the appropriate attributes and text, then pushed into a temporary array.

  • For non-translatable strings, text nodes are created, then pushed into a temporary array.

  • If any translations/replacements have been done, then dom is updated; otherwise, no mutation of the document is necessary.

  • In the end, the finalized html document is echoed, but because your sample input has some text that is not inside of tags, the temporary leading <p> and trailing </p> tag that DomDocument applied for stability must be removed to restore the structure to its original form. If all text is enclosed in tags, you can just use saveHTML() without any hacking at the string.
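The preg_split() step described above can be tried in isolation, without any DOM code. The sketch below rebuilds the same pattern from the three keywords and splits two sample strings; the variable names are only for illustration.

```php
<?php
$keywords = ['Kathryn Kuhlman', 'Max KANTCHEDE', 'eneral'];

// Quote each keyword for use inside a ~-delimited pattern.
$needles = array_map(function ($keyword) {
    return preg_quote($keyword, '~');
}, $keywords);
$pattern = '~\b(' . implode('|', $needles) . ')\b~i';

// Keep the matched keywords as their own elements and drop empty strings.
$flags = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;

// The final text node of the sample: two keywords around " & ".
$parts = preg_split($pattern, "\nMax KANTCHEDE & Kathryn Kuhlman", 0, $flags);
var_dump($parts);     // 4 elements: "\n", "Max KANTCHEDE", " & ", "Kathryn Kuhlman"

// "eneral" sits inside "General", so the word boundary prevents a match.
$untouched = preg_split($pattern, "Meet God's General ", 0, $flags);
var_dump($untouched); // 1 element: the whole string
```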

Convert all Relative urls to Absolute urls while maintaining contents

You were right to use preg_replace. For your example, you can try this code:

// [^>]* matches zero or more characters other than >
// both single-quoted and double-quoted attribute values are supported
$regex = '~<a([^>]*)href=["\']([^"\']*)["\']([^>]*)>~';
// replacement for each subpattern (3 in total)
// basically here we are adding missing baseurl to href
$replace = '<a$1href="http://www.example.com/$2"$3>';

$string = "<p>this is text within string</p> and more random strings which contains link like <a href='docs/555text.fileextension'>Download this file</a> <p>Other html follows where another relative link may exist like <a href='files/doc.doc'>This file</a>";
$replaced = preg_replace($regex, $replace, $string);

Result

<p>this is text within string</p> and more random strings which contains link like <a href="http://www.example.com/docs/555text.fileextension">Download this file</a> <p>Other html follows where another relative link may exist like <a href="http://www.example.com/files/doc.doc">This file</a>
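Note that the replacement above prefixes every href, including links that are already absolute. One way to leave those alone is preg_replace_callback(), sketched below. The function name absolutizeLinks and the rule for what counts as absolute (a scheme, a leading //, or a fragment-only link) are assumptions for this example.

```php
<?php
// Hypothetical helper: prefix only relative hrefs in <a> tags with $base.
function absolutizeLinks(string $html, string $base): string
{
    // Groups: 1 = attributes before href, 2 = quote, 3 = href value, 4 = attributes after.
    $regex = '~<a([^>]*)href=(["\'])([^"\']*)\2([^>]*)>~i';

    return preg_replace_callback($regex, function (array $m) use ($base): string {
        $href = $m[3];

        // Leave links with a scheme (http:, mailto:, ...), scheme-relative links and fragments untouched.
        if (preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $href)) {
            return $m[0];
        }

        $absolute = rtrim($base, '/') . '/' . ltrim($href, '/');

        return '<a' . $m[1] . 'href="' . $absolute . '"' . $m[4] . '>';
    }, $html);
}

$string = "<a href='docs/555text.fileextension'>Download this file</a> "
        . '<a href="http://other.example.org/">External</a>';

$fixed = absolutizeLinks($string, 'http://www.example.com/');
echo $fixed, "\n";
```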

