Get All Hrefs from String But Then Replace Via Another Method

To call a function on each regex match, you can use preg_replace_callback (http://php.net/manual/en/function.preg-replace-callback.php). Something like:

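function modify_href( $matches ) {
    // $matches[1] is the whole URL, because group 1 wraps the entire pattern
    return $matches[1] . '/modified';
}

// "~" is used as the delimiter because the pattern itself contains "/"
$result = preg_replace_callback('~(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)~', 'modify_href', $string);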

I haven't tested this, but I think it should work. I got the regex from here: https://rushi.wordpress.com/2008/04/14/simple-regex-for-matching-urls/
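
For comparison, the same callback idea can be sketched in JavaScript, where String.prototype.replace accepts a function as its second argument. This is only a minimal sketch: the URL pattern is a simplified stand-in for the one above, and appendSuffix, modifyUrls and the sample text are made-up names for illustration.

```javascript
// Simplified URL pattern (an assumption; it will not cover every valid URL).
const urlPattern = /https?:\/\/[-\w.]+(?::\d+)?(?:\/[\w\/.]*(?:\?\S+)?)?/g;

// Hypothetical callback: receives each matched URL and returns its replacement.
function appendSuffix(url) {
  return url + '/modified';
}

// Run the callback on every URL found in the text.
function modifyUrls(text) {
  return text.replace(urlPattern, appendSuffix);
}

const sample = 'Visit http://example.com/page and https://foo.org today';
console.log(modifyUrls(sample));
```

As in the PHP version, the callback only sees the matched text, so any lookup or rewriting logic can live inside it.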

Find all hrefs in page and replace with link maintaining previous link - PHP

Use PHP's DomDocument to parse the page

$doc = new DOMDocument();

// load the string into the DOM (this is your page's HTML), see below for more info
$doc->loadHTML('<a href="http://www.google.com">Google</a>');

// Loop through each <a> tag in the DOM and change the href attribute
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $link = $anchor->getAttribute('href');
    $link = 'http://www.example.com/?loadpage=' . urlencode($link);
    $anchor->setAttribute('href', $link);
}
echo $doc->saveHTML();

Check it out here: http://codepad.org/9enqx3Rv

If you don't have the HTML as a string, you can use cURL to fetch it, or load it directly with the loadHTMLFile method of DomDocument.

Documentation

  • DomDocument - http://php.net/manual/en/class.domdocument.php
  • DomElement - http://www.php.net/manual/en/class.domelement.php
  • DomElement::getAttribute - http://www.php.net/manual/en/domelement.getattribute.php
  • DOMElement::setAttribute - http://www.php.net/manual/en/domelement.setattribute.php
  • urlencode - http://php.net/manual/en/function.urlencode.php
  • DomDocument::loadHTMLFile - http://www.php.net/manual/en/domdocument.loadhtmlfile.php
  • cURL - http://php.net/manual/en/book.curl.php
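
To see what the rewritten links look like and why the previous link is preserved, here is a small JavaScript sketch of the encoding step from the loop above. wrapLink and unwrapLink are hypothetical helper names; encodeURIComponent is used as a rough counterpart of PHP's urlencode (the two differ on spaces, which urlencode writes as "+").

```javascript
const PROXY = 'http://www.example.com/?loadpage='; // same prefix as in the PHP loop

// Wrap the original link so it travels as a single query parameter.
function wrapLink(href) {
  return PROXY + encodeURIComponent(href);
}

// Recover the original link on the receiving side.
function unwrapLink(wrapped) {
  return new URL(wrapped).searchParams.get('loadpage');
}

const wrapped = wrapLink('http://www.google.com');
console.log(wrapped);             // the rewritten href
console.log(unwrapLink(wrapped)); // the original href, decoded again
```

Encoding matters because the original link may itself contain "?" or "&", which would otherwise be read as part of the outer URL.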

How can I replace all internal urls in a string of html with their relative external url?

Use

.replace(/\b((?:href|src)=)(?!\/\/example\.com)(["']?)([^"']+)\2/gi, 
(_,x,y,z) => z.charAt(0) == '/' ?
`${x}${y}//example.com${z}${y}` : `${x}${y}//example.com/${z}${y}`)

See regex proof.

Explanation

\b                  the boundary between a word char (\w) and something that is not a word char
(                   group and capture to \1:
  (?:               group, but do not capture:
    href            'href'
    |               OR
    src             'src'
  )                 end of grouping
  =                 '='
)                   end of \1
(?!                 look ahead to see if there is not:
  \/                '/'
  \/                '/'
  example           'example'
  \.                '.'
  com               'com'
)                   end of look-ahead
(                   group and capture to \2:
  ["']?             any character of: '"', ''' (optional (matching the most amount possible))
)                   end of \2
(                   group and capture to \3:
  [^"']+            any character except: '"', ''' (1 or more times (matching the most amount possible))
)                   end of \3
\2                  what was matched by capture \2

const string = ' href="nowhere"  src="/nothing.js"';
const rx = /\b((?:href|src)=)(?!\/\/example\.com)(["']?)([^"']+)\2/gi;
console.log(string.replace(rx, (_,x,y,z) => z.charAt(0) == '/' ?
`${x}${y}//example.com${z}${y}` : `${x}${y}//example.com/${z}${y}`));
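
Note that the pattern above prefixes every value that does not already start with //example.com, so a value that is already absolute (for example https://other.org/x) would get the prefix too. A possible variant, sketched here under assumptions, lets the built-in URL constructor resolve each value against a base instead. BASE and absolutize are made-up names, the base is assumed to be https, and attribute values are assumed to be quoted.

```javascript
const BASE = 'https://example.com/'; // assumed base URL

// Resolve every quoted href/src value against the base.
// Values that are already absolute come back unchanged from new URL().
function absolutize(html, base = BASE) {
  return html.replace(
    /\b((?:href|src)=)(["'])([^"']+)\2/gi,
    (whole, attr, quote, value) => {
      try {
        return `${attr}${quote}${new URL(value, base).href}${quote}`;
      } catch {
        return whole; // leave values alone if they cannot be parsed as a URL
      }
    }
  );
}

console.log(absolutize(' href="nowhere"  src="/nothing.js"'));
console.log(absolutize('<a href="https://other.org/x">'));
```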

jQuery: Change all href values dynamically

You could bind this logic to the click event instead. When a link is clicked, the handler runs and rewrites the link before navigating. Because the handler is delegated to a parent element (or the document, as in the example below), it also covers links that are added to the page later:

$(document).on('click', 'a', function(e){
    e.preventDefault();

    var link = $(this).attr( 'href' );
    link = link.replace("/link", "https://sub2.domain1.com/link");
    window.location.href = link;
});
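
One thing to keep in mind: replace() with a string pattern swaps the first occurrence of "/link" wherever it appears, so an href such as /blog/link/page would be rewritten in the middle. If only hrefs that start with /link should change, the rewrite step could be sketched as a prefix check. rewriteLink and TARGET_ORIGIN are hypothetical names; the host is taken from the snippet above.

```javascript
const TARGET_ORIGIN = 'https://sub2.domain1.com'; // host from the snippet above

// Only rewrite hrefs that begin with "/link"; leave everything else untouched.
function rewriteLink(href) {
  return href.startsWith('/link') ? TARGET_ORIGIN + href : href;
}

console.log(rewriteLink('/link/page'));      // rewritten
console.log(rewriteLink('/blog/link/page')); // left alone
```

Inside the click handler, window.location.href = rewriteLink(link) would then replace the two lines that build the new link.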

jQuery: replace If this HREF contains

There are a couple of issues in your code. Firstly, $= in the attribute selector is for 'ends with' matches. To match at the start of the attribute value, use ^=.

Secondly you need to replace the existing value, not overwrite the entire thing with the new URL only.

Lastly, you can simplify the logic by providing a function to attr() which is executed against all selected a elements instead of an explicit each() loop.

With all that said, try this:

jQuery($ => {
  $('.post-body-inner a').attr('href', (i, h) => h.replace('https://redirect.affiliatelink1.com', 'https://api.affiliatelink2.com'));
});

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div class="post-body-inner">
  <a href="https://redirect.affiliatelink1.com/foo">Foo</a>
</div>
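
The difference between the two selectors, and between replacing and overwriting, can be illustrated without a DOM. This is a plain JavaScript sketch on an array of strings; the second sample href is invented for the comparison, and endsWith/startsWith stand in for what $= and ^= test.

```javascript
const OLD = 'https://redirect.affiliatelink1.com';
const NEW = 'https://api.affiliatelink2.com';

const hrefs = [
  'https://redirect.affiliatelink1.com/foo',
  'https://other.example/?next=https://redirect.affiliatelink1.com', // invented sample
];

// What a[href$="..."] would select: values that END with the old host.
const endsWithOld = hrefs.filter((h) => h.endsWith(OLD));

// What a[href^="..."] would select: values that START with the old host.
const startsWithOld = hrefs.filter((h) => h.startsWith(OLD));

// Replace only the host part, so the rest of the path ("/foo") survives.
const rewritten = startsWithOld.map((h) => h.replace(OLD, NEW));

console.log(endsWithOld, startsWithOld, rewritten);
```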

How to replace specific text with hyperlinks without modifying pre-existing img and a tags?

I think @Jiwoks' answer was on the right path with using dom parsing calls to isolate the qualifying text nodes.

While his answer works on the OP's sample data, I was unsatisfied to find that his solution failed when there was more than one string to be replaced in a single text node.

I've crafted my own solution with the goal of accommodating case-insensitive matching, word-boundary, multiple replacements in a text node, and fully qualified nodes being inserted (not merely new strings that look like child nodes).

Code: (Demo #1 with 2 replacements in a text node) (Demo #2: with OP's text)
(After receiving fuller, more realistic text from the OP: Demo #3 without trimming saveHTML())

$html = <<<HTML
Meet God's General Kathryn Kuhlman. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517" />
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
Max KANTCHEDE & Kathryn Kuhlman
HTML;

$keywords = [
    'Kathryn Kuhlman' => 'https://www.example.com/en-354',
    'Max KANTCHEDE' => 'https://www.example.com/MaxKANTCHEDE',
    'eneral' => 'https://www.example.com/this-is-not-used',
];

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

$lookup = [];
$regexNeedles = [];
foreach ($keywords as $name => $link) {
    $lookup[strtolower($name)] = $link;
    $regexNeedles[] = preg_quote($name, '~');
}
$pattern = '~\b(' . implode('|', $regexNeedles) . ')\b~i';

foreach ($xpath->query('//*[not(self::img or self::a)]/text()') as $textNode) {
    $newNodes = [];
    $hasReplacement = false;
    foreach (preg_split($pattern, $textNode->nodeValue, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) as $fragment) {
        $fragmentLower = strtolower($fragment);
        if (isset($lookup[$fragmentLower])) {
            $hasReplacement = true;
            $a = $dom->createElement('a');
            $a->setAttribute('href', $lookup[$fragmentLower]);
            $a->setAttribute('title', $fragment);
            $a->nodeValue = $fragment;
            $newNodes[] = $a;
        } else {
            $newNodes[] = $dom->createTextNode($fragment);
        }
    }
    if ($hasReplacement) {
        $newFragment = $dom->createDocumentFragment();
        foreach ($newNodes as $newNode) {
            $newFragment->appendChild($newNode);
        }
        $textNode->parentNode->replaceChild($newFragment, $textNode);
    }
}
echo substr(trim($dom->saveHTML()), 3, -4);

Output:

Meet God's General <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517">
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
<a href="https://www.example.com/MaxKANTCHEDE" title="Max KANTCHEDE">Max KANTCHEDE</a> & <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>

Some explanatory points:

  • I am using some DomDocument silencing and flags because the sample input is missing a parent tag to contain all of the text. (There is nothing wrong with @Jiwoks' technique, this is just a different one -- choose whatever you like.)
  • A lookup array with lowercased keys is declared to allow case-insensitive translations on qualifying text.
  • A regex pattern is dynamically constructed and therefore should be preg_quote()ed to ensure that the pattern logic is upheld. \b is a word boundary metacharacter to prevent matching a substring in a longer word. Notice that eneral is not replaced in General in the output. The case-insensitive flag i will allow greater flexibility for this application and future applications.
  • My xpath query is identical to @Jiwoks'; I see no reason to change it. It is seeking text nodes that are not the children of <img> or <a> tags.

...now it gets a little fiddly... Now that we are dealing with isolated text nodes, regex can be used to differentiate qualifying strings from non-qualifying strings.

  • preg_split() is creating a flat, indexed array of non-empty substrings. Substrings which qualify for translation will be isolated as elements and if there are any non-qualifying substrings, they will be isolated elements.

    • The final text node in my sample will generate 4 elements:

      0 => '
      ', // non-qualifying newline
      1 => 'Max KANTCHEDE', // translatable string
      2 => ' & ', // non-qualifying text
      3 => 'Kathryn Kuhlman' // translatable string
  • For translatable strings, new <a> nodes are created and filled with the appropriate attributes and text, then pushed into a temporary array.

  • For non-translatable strings, text nodes are created, then pushed into a temporary array.

  • If any translations/replacements have been done, then dom is updated; otherwise, no mutation of the document is necessary.

  • In the end, the finalized html document is echoed, but because your sample input has some text that is not inside of tags, the temporary leading <p> and trailing </p> tag that DomDocument applied for stability must be removed to restore the structure to its original form. If all text is enclosed in tags, you can just use saveHTML() without any hacking at the string.
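
The splitting step described above can be tried in isolation. The following is a JavaScript sketch of that one step only, with no DOM handling: split() with a capture group keeps the delimiters much like PREG_SPLIT_DELIM_CAPTURE, and the filter plays the role of PREG_SPLIT_NO_EMPTY. escapeRegex and splitText are hypothetical helper names; the keywords are the ones from the PHP sample.

```javascript
const keywords = {
  'Kathryn Kuhlman': 'https://www.example.com/en-354',
  'Max KANTCHEDE': 'https://www.example.com/MaxKANTCHEDE',
  'eneral': 'https://www.example.com/this-is-not-used',
};

// Escape regex metacharacters, playing the role of preg_quote().
const escapeRegex = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Lowercased lookup table for case-insensitive translation.
const lookup = new Map(
  Object.entries(keywords).map(([name, link]) => [name.toLowerCase(), link])
);

// \b(...)\b with the i flag, built from the escaped keywords.
const pattern = new RegExp(
  '\\b(' + Object.keys(keywords).map(escapeRegex).join('|') + ')\\b',
  'i'
);

// Split a text node's value into fragments; href is null for plain text.
function splitText(text) {
  return text
    .split(pattern)
    .filter((part) => part !== '')
    .map((part) => ({ text: part, href: lookup.get(part.toLowerCase()) || null }));
}

console.log(splitText('\nMax KANTCHEDE & Kathryn Kuhlman'));
console.log(splitText("Meet God's General "));
```

Fragments with an href correspond to the new <a> nodes, and fragments without one correspond to the plain text nodes.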

change all href of external page after insert it

You could do a simple string replace like:

$string = str_replace('href="', 'href="http://www.example.com/', $string);

or with domdocument:

$page = '<html><head></head><body><a href="simple"></a><h1>Hi</h1><a href="simple2"></a></body></html>';
$doc = new DOMDocument();
$doc->loadHTML($page);
$as = $doc->getElementsByTagName('a');
foreach ($as as $a) {
    $a->setAttribute('href', 'http://www.example.com/' . $a->getAttribute('href'));
}
print_r($doc->saveHTML());

output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head></head><body><a href="http://www.example.com/simple"></a><h1>Hi</h1><a href="http://www.example.com/simple2"></a></body></html>

This doesn't take absolute URLs into account; you'll need a regex approach for that.

If the quote types vary, you will also need a regex (preg_replace) in place of the str_replace example. You can capture the quote with something like ('|") and then use $1 in the replacement to reuse the same quote type.
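
The quote-capturing idea can be sketched as follows, shown here in JavaScript (PHP's preg_replace uses the same $1/$2 references in the replacement). This is only an illustration under assumptions: prefixHrefs and BASE are made-up names, and the lookahead that skips values starting with http://, https:// or // is one possible way to leave absolute URLs alone.

```javascript
const BASE = 'http://www.example.com/';

// $1 is the attribute name with "=", $2 is whichever quote character was used.
// The negative lookahead skips values that already look absolute.
function prefixHrefs(html) {
  return html.replace(/(href=)(['"])(?!https?:\/\/|\/\/)/gi, '$1$2' + BASE);
}

console.log(prefixHrefs(`<a href='simple'></a><a href="simple2"></a>`));
console.log(prefixHrefs('<a href="https://other.org/x"></a>'));
```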


