Find all hrefs in page and replace with link maintaining previous link - PHP
Use PHP's DomDocument to parse the page:
$doc = new DOMDocument();
// load the string into the DOM (this is your page's HTML), see below for more info
$doc->loadHTML('<a href="http://www.google.com">Google</a>');
//Loop through each <a> tag in the dom and change the href property
foreach($doc->getElementsByTagName('a') as $anchor) {
$link = $anchor->getAttribute('href');
$link = 'http://www.example.com/?loadpage='.urlencode($link);
$anchor->setAttribute('href', $link);
}
echo $doc->saveHTML();
Check it out here: http://codepad.org/9enqx3Rv
If you don't have the HTML as a string, you can use cURL to fetch it, or you can load it directly with the loadHTMLFile method of DomDocument.
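The loadHTMLFile route could look roughly like the sketch below. The function name rewriteLinksInFile and the temporary file are assumptions made for illustration; the redirect prefix is the one used in the answer above. It also collects libxml warnings rather than printing them, since real pages are rarely valid HTML.

```php
<?php
// Hypothetical helper: rewrites every <a href> in an HTML file so that it points
// at a redirect URL carrying the original link as a query parameter.
function rewriteLinksInFile(string $path, string $prefix): string
{
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTMLFile($path);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if (!$anchor->hasAttribute('href')) {
            continue; // e.g. <a name="top"> has nothing to rewrite
        }
        $anchor->setAttribute('href', $prefix . urlencode($anchor->getAttribute('href')));
    }

    return $doc->saveHTML();
}

// Usage: a temporary file stands in for a saved page.
$path = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($path, '<html><body><a href="http://www.google.com">Google</a></body></html>');
$result = rewriteLinksInFile($path, 'http://www.example.com/?loadpage=');
unlink($path);

echo $result;
```

Anchors without an href are skipped so that they do not end up pointing at the redirect URL with an empty parameter.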
Documentation
- DomDocument - http://php.net/manual/en/class.domdocument.php
- DomElement - http://www.php.net/manual/en/class.domelement.php
- DomElement::getAttribute - http://www.php.net/manual/en/domelement.getattribute.php
- DOMElement::setAttribute - http://www.php.net/manual/en/domelement.setattribute.php
- urlencode - http://php.net/manual/en/function.urlencode.php
- DomDocument::loadHTMLFile - http://www.php.net/manual/en/domdocument.loadhtmlfile.php
- cURL - http://php.net/manual/en/book.curl.php
Replace all links in the body of html page using PHP
You can "decapitate" the page.
Find the <body> tag and split the document there into two variables, one for the head and one for the body, then replace links only in the body.
//$output = file_get_contents($turl);
$output = "<head> blablabla
Bla bla
</head>
<body>
Foobar
</body>";
//Decapitation
$head = substr($output, 0, strpos($output, "<body>"));
$body = substr($output, strpos($output, "<body>"));
// Find body tag and parse body and head to each variable
$newOutput = str_replace('href="http', 'target="_parent" href="http://localhost/e/site.php?turl=http', $body);
$newOutput = str_replace('href="www.', 'target="_parent" href="http://localhost/e/site.php?turl=www.', $newOutput);
$newOutput = str_replace('href="/', 'target="_parent" href="http://localhost/e/site.php?turl='.$turl.'/', $newOutput);
echo $head . $newOutput;
https://3v4l.org/WYcYP
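One caveat with the split above: strpos($output, "<body>") only finds a lowercase <body> tag with no attributes. A minimal sketch of a more tolerant split follows. The helper name splitAtBody and the sample page are assumptions; the proxy URL is the one from the answer.

```php
<?php
// Hypothetical helper: splits a page into [head, body] at the opening <body> tag.
// Unlike strpos($output, "<body>"), this also matches <BODY> and <body class="...">.
function splitAtBody(string $html): array
{
    if (!preg_match('/<body\b[^>]*>/i', $html, $m, PREG_OFFSET_CAPTURE)) {
        return ['', $html]; // no <body> tag: treat everything as body
    }
    $offset = $m[0][1]; // byte offset of the opening <body ...> tag
    return [substr($html, 0, $offset), substr($html, $offset)];
}

// Sample page (made up for this example).
$page = '<head><link href="http://cdn.example/style.css"></head>'
      . '<body class="home"><a href="http://example.org/a">A</a></body>';

[$head, $body] = splitAtBody($page);

// Proxy endpoint taken from the answer above.
$proxy = 'http://localhost/e/site.php?turl=';
$body = str_replace('href="http', 'target="_parent" href="' . $proxy . 'http', $body);

echo $head . $body;
```

The stylesheet link in the head keeps its original href, while the anchor in the body is routed through the proxy.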
Alter all a href links in php
Once you have detected the URLs, you can use parse_url() and parse_str() to break each URL apart, add the utm and medium parameters, and rebuild it without caring too much about the content of the GET parameters or the hash:
$url_modifier_domain = preg_quote('add-link.com');
$html_text = preg_replace_callback(
'#((?:https?:)?//'.$url_modifier_domain.'(/[^\'"\#]*)?)(?=[\'"\#])#i',
function ($matches) {
$link = $matches[0];
if (strpos($link, '#') !== false) {
list($link, $hash) = explode('#', $link);
}
$res = parse_url($link);
$result = '';
if (isset($res['scheme'])) {
$result .= $res['scheme'].'://';
}
if (isset($res['host'])) {
$result .= $res['host'];
}
if (isset($res['path'])) {
$result .= $res['path'];
}
if (isset($res['query'])) {
parse_str($res['query'], $res['query']);
} else {
$res['query'] = [];
}
$res['query']['utm'] = 'some';
$res['query']['medium'] = 'stuff';
if (count($res['query']) > 0) {
$result .= '?'.http_build_query($res['query']);
}
if (isset($hash)) {
$result .= '#'.$hash;
}
return $result;
},
$html
);
As you can see, the code is longer but simpler
Edit
I made some changes: the script now searches for every href="xxx" inside the text. If the link is not from add-link.com the script skips it; otherwise it tries to rebuild the link as cleanly as possible.
$html = 'blabla <a href="http://add-link.com/">a</a>
<a href="http://add-link.com/">a</a>
<a href="http://add-link.com/#hashed">a</a>
<a href="http://abcd.com/#hashed">a</a>
<a href="http://add-link.com/?test=1">a</a>
<a href="http://add-link.com/try.php">a</a>
<a href="http://add-link.com/try.php?test=1">a</a>
<a href="http://add-link.com/try.php#hashed">a</a>
<a href="http://add-link.com/try.php?test=1#hashed">a</a>
<a href="http://add-link.com/try.php?test=1#hashed">a</a>
<a href="//add-link.com?test=test" style="color: rgb(198, 156, 109);">a</a>
';
$url_modifier_domain = preg_quote('add-link.com');
$html_text = preg_replace_callback(
'/href="([^"]+)"/i',
function ($matches) {
$link = $matches[1];
// ignoring outer links
if(strpos($link,'add-link.com') === false) return 'href="'.$link.'"';
if (strpos($link, '#') !== false) {
list($link, $hash) = explode('#', $link);
}
$res = parse_url($link);
$result = '';
if (isset($res['scheme'])) {
$result .= $res['scheme'].'://';
} else if(isset($res['host'])) {
$result .= '//';
}
if (isset($res['host'])) {
$result .= $res['host'];
}
if (isset($res['path'])) {
$result .= $res['path'];
} else {
$result .= '/';
}
if (isset($res['query'])) {
parse_str($res['query'], $res['query']);
} else {
$res['query'] = [];
}
$res['query']['utm'] = 'some';
$res['query']['medium'] = 'stuff';
if (count($res['query']) > 0) {
$result .= '?'.http_build_query($res['query']);
}
if (isset($hash)) {
$result .= '#'.$hash;
}
return 'href="'.$result.'"';
},
$html
);
var_dump($html_text);
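The URL-rebuilding part of the callback can also be pulled out into a small function, which makes it easier to test on its own. This is only a sketch: the name addQueryParams is hypothetical, it relies on the 'fragment' key of parse_url() instead of splitting on # by hand, and it ignores ports and credentials just as the code above does.

```php
<?php
// Hypothetical helper: merges extra query parameters into a URL,
// keeping the scheme, host, path, existing parameters and fragment.
function addQueryParams(string $url, array $extra): string
{
    $parts = parse_url($url);
    if ($parts === false) {
        return $url; // seriously malformed URL: leave it untouched
    }

    $query = [];
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
    }
    // Existing parameters come first; $extra overrides duplicates.
    $query = array_replace($query, $extra);

    $result = '';
    if (isset($parts['scheme'])) {
        $result .= $parts['scheme'] . '://';
    } elseif (isset($parts['host'])) {
        $result .= '//'; // protocol-relative URL such as //add-link.com
    }
    $result .= $parts['host'] ?? '';
    $result .= $parts['path'] ?? '';
    if ($query !== []) {
        $result .= '?' . http_build_query($query);
    }
    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }
    return $result;
}

$tagged = addQueryParams(
    'http://add-link.com/try.php?test=1#hashed',
    ['utm' => 'some', 'medium' => 'stuff']
);
echo $tagged; // http://add-link.com/try.php?test=1&utm=some&medium=stuff#hashed
```

Inside the preg_replace_callback() above, the callback body would then reduce to the domain check plus one call to this function.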
PHP file_get_contents - Replace all URLs in all a href= links
PHP code that works: it reads the file and replaces the links.
<?php
$message = file_get_contents("myHTML.html");
$content = explode("\n", $message);
$URLs = array();
for($i=0;count($content)>$i;$i++)
{
if(preg_match('/<a href=/', $content[$i]))
{
list($Gone,$Keep) = explode("href=\"", trim($content[$i]));
list($Keep,$Gone) = explode("\">", $Keep);
$message= strtr($message, array( "$Keep" => "http://www.MyWesite.com/?link=$Keep", ));
}
}
echo $message;
?>
Replace all URLs in text to clickable links in PHP
function convert($input) {
$pattern = '@(http(s)?://)?(([a-zA-Z])([-\w]+\.)+([^\s\.]+[^\s]*)+[^,.\s])@';
return $output = preg_replace($pattern, '<a href="http$2://$3">$0</a>', $input);
}
Replace all link tags containing given href attribute with Regex or DOM
OK, so here you are :
<?php
$html_text = '
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />
<link rel="alternate" type="application/rss+xml" title="Website » Feed" href="/feed/" />
<link rel="stylesheet" href="http://www.example.com/css/css.css?ver=2.70" type="text/css" media="all" /></head>
<body>...some content...
<link rel="stylesheet" id="css" href="style.css?ver=3.8.1" type="text/css" media="all" />
</body></html>
';
$d = new DOMDocument();
@$d->loadHTML($html_text);
$xpath = new DOMXPath($d);
$result = $xpath->query("//link");
foreach ($result as $link)
{
$href = $link->getAttribute("href");
if ($href=="whatyouwanttofilter")
{
$link->parentNode->removeChild($link);
}
}
$output= $d->saveHTML();
echo $output;
?>
Tested and working. Have fun! :-)
The general idea is:
- Load your HTML into a DOMDocument
- Look for link nodes, using XPath
- Loop through the nodes
- Depending on the node's href attribute, delete the node (actually, remove the child from its... parent - well, yep, that's the php way... lol)
- After doing all the cleaning-up, re-save the HTML and get it back into a string
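The steps above can be wrapped in a reusable function. The following is a sketch under a few assumptions: the name removeLinksByHref and the sample markup are made up, and libxml_use_internal_errors() is used in place of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: removes every <link> element whose href equals $href.
function removeLinksByHref(string $html, string $href): string
{
    $previous = libxml_use_internal_errors(true); // instead of the @ operator
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Collect the matches first, then remove them in a second pass.
    $matches = [];
    foreach ($xpath->query('//link[@href]') as $link) {
        if ($link->getAttribute('href') === $href) {
            $matches[] = $link;
        }
    }
    foreach ($matches as $link) {
        $link->parentNode->removeChild($link);
    }

    return $doc->saveHTML();
}

// Sample markup (made up for this example).
$html = '<html><head>'
      . '<link rel="stylesheet" href="keep.css">'
      . '<link rel="stylesheet" href="drop.css">'
      . '</head><body><p>content</p></body></html>';

$cleaned = removeLinksByHref($html, 'drop.css');
echo $cleaned;
```

Only the link whose href is exactly drop.css is removed; for partial matches, the comparison could be swapped for strpos() or a regex.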
How to replace specific text with hyperlinks without modifying pre-existing img and a tags?
I think @Jiwoks' answer was on the right path with using dom parsing calls to isolate the qualifying text nodes.
While his answer works on the OP's sample data, I was unsatisfied to find that his solution failed when there was more than one string to be replaced in a single text node.
I've crafted my own solution with the goal of accommodating case-insensitive matching, word-boundary, multiple replacements in a text node, and fully qualified nodes being inserted (not merely new strings that look like child nodes).
Code: (Demo #1 with 2 replacements in a text node) (Demo #2: with OP's text)
(After receiving fuller, more realistic text from the OP: Demo #3 without trimming saveHTML())
$html = <<<HTML
Meet God's General Kathryn Kuhlman. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517" />
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
Max KANTCHEDE & Kathryn Kuhlman
HTML;
$keywords = [
'Kathryn Kuhlman' => 'https://www.example.com/en-354',
'Max KANTCHEDE' => 'https://www.example.com/MaxKANTCHEDE',
'eneral' => 'https://www.example.com/this-is-not-used',
];
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$lookup = [];
$regexNeedles = [];
foreach ($keywords as $name => $link) {
$lookup[strtolower($name)] = $link;
$regexNeedles[] = preg_quote($name, '~');
}
$pattern = '~\b(' . implode('|', $regexNeedles) . ')\b~i' ;
foreach($xpath->query('//*[not(self::img or self::a)]/text()') as $textNode) {
$newNodes = [];
$hasReplacement = false;
foreach (preg_split($pattern, $textNode->nodeValue, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) as $fragment) {
$fragmentLower = strtolower($fragment);
if (isset($lookup[$fragmentLower])) {
$hasReplacement = true;
$a = $dom->createElement('a');
$a->setAttribute('href', $lookup[$fragmentLower]);
$a->setAttribute('title', $fragment);
$a->nodeValue = $fragment;
$newNodes[] = $a;
} else {
$newNodes[] = $dom->createTextNode($fragment);
}
}
if ($hasReplacement) {
$newFragment = $dom->createDocumentFragment();
foreach ($newNodes as $newNode) {
$newFragment->appendChild($newNode);
}
$textNode->parentNode->replaceChild($newFragment, $textNode);
}
}
echo substr(trim($dom->saveHTML()), 3, -4);
Output:
Meet God's General <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517">
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
<a href="https://www.example.com/MaxKANTCHEDE" title="Max KANTCHEDE">Max KANTCHEDE</a> & <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
Some explanatory points:
- I am using some DomDocument silencing and flags because the sample input is missing a parent tag to contain all of the text. (There is nothing wrong with @Jiwoks' technique, this is just a different one -- choose whatever you like.)
- A lookup array with lowercased keys is declared to allow case-insensitive translations on qualifying text.
- A regex pattern is dynamically constructed and therefore should be preg_quote()ed to ensure that the pattern logic is upheld. \b is a word boundary metacharacter to prevent matching a substring in a longer word. Notice that eneral is not replaced in General in the output. The case-insensitive flag i will allow greater flexibility for this application and future applications.
- My xpath query is identical to @Jiwoks'; I see no reason to change it. It is seeking text nodes that are not the children of <img> or <a> tags.
- ...now it gets a little fiddly... Now that we are dealing with isolated text nodes, regex can be used to differentiate qualifying strings from non-qualifying strings. preg_split() is creating a flat, indexed array of non-empty substrings. Substrings which qualify for translation will be isolated as elements and if there are any non-qualifying substrings, they will be isolated elements. The final text node in my sample will generate 4 elements:
0 => '
', // non-qualifying newline
1 => 'Max KANTCHEDE', // translatable string
2 => ' & ', // non-qualifying text
3 => 'Kathryn Kuhlman' // translatable string
- For translatable strings, new <a> nodes are created and filled with the appropriate attributes and text, then pushed into a temporary array.
- For non-translatable strings, text nodes are created, then pushed into a temporary array.
- If any translations/replacements have been done, then dom is updated; otherwise, no mutation of the document is necessary.
- In the end, the finalized html document is echoed, but because your sample input has some text that is not inside of tags, the temporary leading <p> and trailing </p> tag that DomDocument applied for stability must be removed to restore the structure to its original form. If all text is enclosed in tags, you can just use saveHTML() without any hacking at the string.
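To see the preg_split() step in isolation, here is a small standalone sketch that runs the same kind of pattern against the final text node from the sample. The variable names are mine; the flags and the needles are the ones used above.

```php
<?php
$needles = ['Kathryn Kuhlman', 'Max KANTCHEDE', 'eneral'];

// Escape each needle for use inside a ~-delimited pattern.
$quoted = array_map(fn($needle) => preg_quote($needle, '~'), $needles);
$pattern = '~\b(' . implode('|', $quoted) . ')\b~i';

// DELIM_CAPTURE keeps the matched keywords in the result;
// NO_EMPTY drops the empty string that would follow the last match.
$parts = preg_split(
    $pattern,
    "\nMax KANTCHEDE & Kathryn Kuhlman",
    0,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);
var_export($parts);

// The word boundaries stop "eneral" from matching inside "General".
$insideWord = preg_match($pattern, "Meet God's General");
var_export($insideWord); // 0
```

The first call yields the four fragments listed above (newline, keyword, separator, keyword), which is what lets the loop decide fragment by fragment whether to build an <a> node or a text node.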
Convert all Relative urls to Absolute urls while maintaining contents
You were right to use preg_replace. For your example, you can try this code:
// [^>]* matches zero or more characters other than >
// single quote AND double quote support
$regex = '~<a([^>]*)href=["\']([^"\']*)["\']([^>]*)>~';
// replacement for each subpattern (3 in total)
// basically here we are adding missing baseurl to href
$replace = '<a$1href="http://www.example.com/$2"$3>';
$string = "<p>this is text within string</p> and more random strings which contains link like <a href='docs/555text.fileextension'>Download this file</a> <p>Other html follows where another relative link may exist like <a href='files/doc.doc'>This file</a>";
$replaced = preg_replace($regex, $replace, $string);
Result
<p>this is text within string</p> and more random strings which contains link like <a href="http://www.example.com/docs/555text.fileextension">Download this file</a> <p>Other html follows where another relative link may exist like <a href="http://www.example.com/files/doc.doc">This file</a>
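Note that the pattern above prefixes every href, including ones that are already absolute. If the input can mix both kinds, a callback can skip those. This is a rough sketch: the name absolutizeLinks is hypothetical, and root-relative paths (/files/x) are treated the same as document-relative ones (files/x), which is a simplification.

```php
<?php
// Hypothetical helper: prefixes $base to relative hrefs in <a> tags and leaves
// absolute URLs, protocol-relative URLs, fragments and mailto:-style links alone.
function absolutizeLinks(string $html, string $base): string
{
    // Same three groups as the pattern above: before href, the URL, after href.
    $regex = '~<a\b([^>]*)href=["\']([^"\']*)["\']([^>]*)>~i';

    return preg_replace_callback($regex, function (array $m) use ($base): string {
        $href = $m[2];
        // Anything starting with a scheme ("http:", "mailto:"), "//" or "#" is kept as-is.
        if (preg_match('~^([a-z][a-z0-9+.-]*:|//|#)~i', $href)) {
            return $m[0];
        }
        $absolute = rtrim($base, '/') . '/' . ltrim($href, '/');
        return '<a' . $m[1] . 'href="' . $absolute . '"' . $m[3] . '>';
    }, $html);
}

$string = "<a href='docs/a.pdf'>A</a> "
        . '<a href="http://other.example/x">B</a> '
        . "<a href='/files/doc.doc'>C</a>";

$replaced = absolutizeLinks($string, 'http://www.example.com/');
echo $replaced;
```

The second link already has a scheme, so it is returned unchanged; the other two get the base URL prepended.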