How to Extract HTTP Links from a Paragraph and Store Them in an Array in PHP


$text = 'Lorem ipsum http://thesite.com dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor
incididunt https://www.thesite.com ut labore et dolore magna aliqua. Ut http://www.thesite.com enim ad minim veniam,';

$pattern = '!(https?://[^\s]+)!'; // refine this for better/more specific results

if (preg_match_all($pattern, $text, $matches)) {
    list(, $links) = $matches; // $matches[1] holds the first capture group (the URLs)
    print_r($links);
}

Extract URLs from a string using PHP

A regex is the answer to your problem. Taking the answer by Object Manipulator, all it's missing is excluding commas, so you can try this code, which excludes them and gives three separate URLs as output:

$string = "The text you want to filter goes here. http://google.com, https://www.youtube.com/watch?v=K_m7NEDMrV0,https://instagram.com/hellow/";

preg_match_all('#\bhttps?://[^,\s()<>]+(?:\([\w\d]+\)|([^,[:punct:]\s]|/))#', $string, $match);

echo "<pre>";
print_r($match[0]);
echo "</pre>";

and the output is

Array
(
[0] => http://google.com
[1] => https://www.youtube.com/watch?v=K_m7NEDMrV0
[2] => https://instagram.com/hellow/
)

Find links in page and run it through custom function

You can use preg_replace_callback instead of preg_replace: http://nz.php.net/manual/en/function.preg-replace-callback.php

function link_it($text)
{
    $text = preg_replace_callback("/(^|[\n ])([\w]*?)((ht|f)tp(s)?:\/\/[\w]+[^ \,\"\n\r\t<]*)/is", 'shorturl2full', $text);
    $text = preg_replace_callback("/(^|[\n ])([\w]*?)((www|ftp)\.[^ \,\"\t\n\r<]*)/is", 'shorturl2full', $text);
    $text = preg_replace_callback("/(^|[\n ])([a-z0-9&\-_\.]+?)@([\w\-]+\.([\w\-\.]+)+)/i", 'shorturl2full', $text);
    return $text;
}

function shorturl2full($url)
{
    $fullLink = 'FULLLINK';
    // $url[0] is the complete match
    // ... your code to find the full link goes here
    return '<a href="' . $url[0] . '">' . $fullLink . '</a>';
}

Hope this helps

Finding URLs from a text string via PHP and regex?

$pattern = '#(www\.|https?://)?[a-z0-9]+\.[a-z0-9]{2,4}\S*#i';
preg_match_all($pattern, $str, $matches, PREG_PATTERN_ORDER);
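For instance, this pattern also picks up scheme-less hosts. A quick sketch with a made-up sample string:

```php
<?php
$str = 'Visit www.example.com or https://example.org/page for details.';
$pattern = '#(www\.|https?://)?[a-z0-9]+\.[a-z0-9]{2,4}\S*#i';
preg_match_all($pattern, $str, $matches, PREG_PATTERN_ORDER);
print_r($matches[0]); // both www.example.com and https://example.org/page match
```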

Find links in string with PHP. Differ from normal and youtube links

First of all, ditch eregi. It's deprecated and has been removed entirely as of PHP 7.

Then, doing this in just one pass is maybe a stretch too far. I think you'll be better off splitting this into three phases.

Phase 1 runs a regex search over your input, finding everything that looks like a link, and storing it in a list.

Phase 2 iterates over the list, checking whether a link goes to youtube (parse_url is tremendously useful for this), and putting a suitable replacement into a second list.

Phase 3: you now have two lists, one containing the original matches, one containing the desired replacements. Run str_replace over your original text, providing the match list for the search parameter and the replacement list for the replacements.

There are several advantages to this approach:

  1. The regular expression for extracting links can be kept relatively simple, since it doesn't have to take special hostnames into account
  2. It is easier to debug; you can dump the search and replace arrays prior to phase 3, and see if they contain what you expect
  3. Because you perform all replacements in one go, you avoid problems with overlapping matches or replacing a piece of already-replaced text (after all, the replaced text still contains a URL, and you don't want to replace that again)
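The three phases might be sketched like this (the replacement markup and the YouTube host check are placeholder assumptions; substitute whatever your application needs):

```php
<?php
// Sketch of the three-phase approach: collect matches, build a parallel
// replacement list, then replace everything in one pass.

function linkify_with_youtube($text)
{
    // Phase 1: find everything that looks like a link.
    preg_match_all('~https?://[^\s<>"]+~i', $text, $m);
    $matches = $m[0];

    // Phase 2: build the parallel list of replacements.
    $replacements = array();
    foreach ($matches as $url) {
        $host = parse_url($url, PHP_URL_HOST);
        if ($host && preg_match('/(^|\.)youtube\.com$/i', $host)) {
            // Hypothetical YouTube-specific markup.
            $replacements[] = '<a class="youtube" href="' . $url . '">' . $url . '</a>';
        } else {
            $replacements[] = '<a href="' . $url . '">' . $url . '</a>';
        }
    }

    // Phase 3: search array + replacement array, one str_replace call.
    return str_replace($matches, $replacements, $text);
}
```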

Get ul li a string values and store them in a variable or array php

Try this

$html = '<div class="coursesListed">
<ul>
<li><a href="#"><h3>Item one</h3></a></li>
<li><a href="#"><h3>item two</h3></a></li>
<li><a href="#"><h3>Item three</h3></a></li>
</ul>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$liList = $doc->getElementsByTagName('li');
$liValues = array();
foreach ($liList as $li) {
    $liValues[] = $li->nodeValue;
}

var_dump($liValues);

Web Crawler not following page's links

You should (as usual) first of all make up your mind about what you're actually doing.

As you outline in your question, you're doing a text search for URL patterns of the HTTP protocol. A common regex normally covers the https: URI scheme as well:

~https?://\S*~

That is, everything up to the first whitespace. This normally does the job for detecting a wide range of HTTP URLs within a string. If you need something more advanced, see these Stack Overflow Q&As about making links in text clickable:

  • How to match URIs in text?
  • How to extract http links from a paragraph and store them in a array on php
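A quick sketch of that pattern in use, which also shows why the more refined patterns above exclude commas: the bare \S* happily swallows trailing punctuation.

```php
<?php
$text = 'Crawl https://example.com/page, then stop.';
preg_match_all('~https?://\S*~i', $text, $m);
print_r($m[0]); // the trailing comma is included in the match
```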

This still will not solve all of your crawler problems. For two reasons:

  1. Character encoding: to do this properly, you need to know the correct character encoding of the string and craft the regular expression to fit it.
  2. Regex works on plain text, but websites consist not only of text but also of HTML, which carries its own semantics.

So text analysis alone is not enough; you also need to parse the HTML. That means taking the base URI and resolving every other URI inside the document against it, to obtain the list of all absolute links in that document.

You find this outlined in the following specification:

  • Section 5, "Reference Resolution", of RFC 3986: Uniform Resource Identifier (URI): Generic Syntax

For PHP the two most stable components to work with for this are:

  1. DOMDocument - A PHP extension to parse XML and HTML documents. Here you are naturally looking for the HTML-parsing side.
  2. Net_URL2 - A PEAR package for dealing with URLs, including RFC 3986-conformant reference resolution (you can safely ignore the differences to the previous version of the RFC; the standard is pretty stable, as is the PHP library; two minor bugs in very narrow and specific cases are still open but have patches).
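A minimal sketch of the parse-and-resolve step, assuming a deliberately simplified resolver (it handles only absolute URIs, absolute paths, and plain relative paths; use Net_URL2's resolve() for the full RFC 3986 algorithm, including "./" and "../" segments):

```php
<?php
// Resolve a reference against a base URI. Simplified subset of RFC 3986
// reference resolution; swap in Net_URL2 for the complete algorithm.
function resolve_ref($base, $ref)
{
    if (preg_match('~^[a-z][a-z0-9+.-]*://~i', $ref)) {
        return $ref; // already an absolute URI
    }
    $p = parse_url($base);
    $root = $p['scheme'] . '://' . $p['host'];
    if ($ref === '') {
        return $base;
    }
    if ($ref[0] === '/') {
        return $root . $ref; // absolute path on the same host
    }
    // Relative path: replace the last segment of the base path.
    $dir = isset($p['path']) ? preg_replace('~/[^/]*$~', '/', $p['path']) : '/';
    return $root . $dir . $ref;
}

// Parse the HTML and collect every href as an absolute link.
function absolute_links($html, $baseUri)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from real-world markup
    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = resolve_ref($baseUri, $href);
        }
    }
    return array_unique($links);
}
```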


Related Topics



Leave a reply



Submit