Extract All URLs Inside a String in Ruby

Extract all URLs inside a string in Ruby

A different approach, from the perfect-is-the-enemy-of-the-good school of thought:

urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
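
If trailing punctuation or malformed tokens matter, the same split-and-filter idea can be tightened a little. This is only a sketch: `extract_http_urls` is a hypothetical helper name, and the set of punctuation stripped from the end of each token is an assumption, not part of the answer above. It needs Ruby 2.7+ for `filter_map`.

```ruby
require 'uri'

# Hypothetical helper: split on whitespace, keep tokens that start with
# http:// or https://, strip trailing punctuation, and drop anything
# that URI cannot parse.
def extract_http_urls(content)
  content.split(/\s+/).filter_map do |token|
    next unless token.match?(/\Ahttps?:\/\//i)

    # Assumption: punctuation at the very end belongs to the sentence.
    candidate = token.sub(/[.,;:!?)\]]+\z/, '')
    begin
      uri = URI.parse(candidate)
      candidate if uri.is_a?(URI::HTTP) && uri.host
    rescue URI::InvalidURIError
      nil
    end
  end
end

text = "See https://example.com/a, and http://foo.org. Not ftp://x.y or plain text"
puts extract_http_urls(text)
```

The whitespace split is still the core of it, so URLs glued to other text without a space are missed, just as in the one-liner.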

Extracting all URLs from a page using Ruby

Remember that a URL doesn't have to start with "http": it could be a relative URL, resolved against the address of the current page. IMO it is best to use Nokogiri to parse the HTML:

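require 'open-uri'
require 'nokogiri'

# URI.open comes from open-uri; Kernel#open no longer accepts URLs in Ruby 3.0+
reqt = URI.open("http://www.google.com")
doc = Nokogiri::HTML(reqt)
# Select every <a> element that has an href attribute
doc.xpath('//a[@href]').each do |a|
  puts a.attr('href')
end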

But if you really want to find only the absolute URLs, add a simple condition:

 puts a.attr('href') if a.attr('href') =~ /^http/i
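
Rather than discarding relative links, they can be resolved against the page address with `URI.join` from the standard library. A minimal sketch, assuming the href values have already been collected into an array; `absolutize` is a hypothetical helper name and the example.com addresses are placeholders:

```ruby
require 'uri'

# Hypothetical helper: turn each href into an absolute URL by resolving
# it against the URL of the page it came from.
def absolutize(base_url, hrefs)
  hrefs.filter_map do |href|
    begin
      URI.join(base_url, href).to_s
    rescue URI::InvalidURIError
      nil # skip hrefs that are not valid URI references
    end
  end
end

hrefs = ['/about', 'news/today.html', 'https://other.example/x']
puts absolutize('http://www.example.com/section/index.html', hrefs)
```

Absolute hrefs pass through unchanged, so the same call works for a mix of relative and absolute links.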

Pull all URLs out of this string in Ruby with the scan method

You can use the twitter-text gem to extract URLs.

Extract URLs from String (Ruby) (Regex and link shortened)

Here is one approach using match:

match = /(\w+\.\w+\/\w+)/.match("Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum")
if match
puts match[1]
else
puts "no match"
end

If you also need the simultaneous ability to capture full URLs, then my answer would have to be updated. This only answers your immediate question.
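
One way such an update might look is to make the scheme optional and switch from `match` to `scan`, so every link in the text is returned. This is a sketch under assumptions: the constant name `SHORT_OR_FULL_LINK` is made up for illustration, and the pattern only captures the first path segment.

```ruby
# Optional http(s) scheme, a dotted host name, then one path segment.
# No capture groups are used, so String#scan returns the whole match.
SHORT_OR_FULL_LINK = /(?:https?:\/\/)?[\w-]+(?:\.[\w-]+)+\/\w+/

text = "Lorem swoo.sh/12xfsW ipsum https://bit.ly/abc123 end"
links = text.scan(SHORT_OR_FULL_LINK)

if links.empty?
  puts "no match"
else
  puts links
end
```

Longer paths, query strings and fragments would need a more permissive tail than `\/\w+`.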

How to extract URLs from text

What cases are failing?

According to the library regexpert, you can use

regexp = /(^$)|(^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix

and then perform a scan on the text.

EDIT: It seems the regexp also matches the empty string. Just remove the initial (^$) alternative and you're done.
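
Because the pattern is anchored with ^ and $, it only matches when a URL fills a whole line. For scanning running text, a variant without the anchors may be more useful. The following is a sketch, not the regexpert pattern itself: the `URL_PATTERN` name, the colon before the port number, and the restriction of the path to non-whitespace characters are all assumptions.

```ruby
# Unanchored variant: scheme, dotted host, optional :port, optional path.
# All groups are non-capturing so that scan returns full matches.
URL_PATTERN = /https?:\/\/[a-z0-9]+(?:[\-.][a-z0-9]+)*\.[a-z]{2,5}(?::[0-9]{1,5})?(?:\/\S*)?/i

text = "Docs at http://example.com/page and https://sub.domain.org:8080/x?y=1 today"
found = text.scan(URL_PATTERN)
puts found
```

Since the path runs up to the next whitespace, punctuation that directly follows a URL in a sentence ends up in the match.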

regex to extract URLs from text - Ruby

Find words that look like URLs:

str = "ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.\n\nhttps://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/\n\nwww.jstor.org/stable/24084454\n\nwww.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/\n\ninsu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so\n\nwww.cerege.fr/spip.php?page=pageperso&id_user=94"

str.split.select { |w| w[/\b\.\w+/] }

This gives you an array of the whitespace-separated words that contain a word character, then a ., then more word characters, which MIGHT work for your use case.

puts str.split.select { |w| w[/\b\.\w+/] }
www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94

Updated

Complete solution to modify your string:

str_with_quote = str.clone # make a clone for the `gsub!`

str.split.select { |w| w[/\b\.\w+/] }
   .each { |url| str_with_quote.gsub!(url, '"' + url + '"') }

Now the cloned string has each URL wrapped in double quotes:

puts str_with_quote

This gives the following output:

ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, "www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les" Belles lettres, 2001.

"https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/"

"www.jstor.org/stable/24084454"

"www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/"

"insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so"

"www.cerege.fr/spip.php?page=pageperso&id_user=94"

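The quoting can also be done in a single `gsub` pass with a block, which avoids replacing the same substring twice when one URL is contained in another. A sketch only: `quote_urls` is a hypothetical helper, and treating trailing punctuation as outside the URL is an assumption. It needs Ruby 2.6+ for the endless range.

```ruby
# Hypothetical helper: wrap every URL-looking token in double quotes.
# A token qualifies if it contains a dot between two word characters.
def quote_urls(text)
  text.gsub(/\S*\w\.\w\S*/) do |token|
    # Assumption: punctuation at the end belongs to the sentence.
    url = token.sub(/[.,;:!?]+\z/, '')
    trailing = token[url.length..]
    "\"#{url}\"#{trailing}"
  end
end

puts quote_urls("see www.a.com/x and b.org/y, ok.")
```

A token such as "0422.php,Les" from the sample string is still quoted as a whole, because the comma there sits in the middle of the token rather than at its end.
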
Ruby regex: extract a list of urls from a string

The best answer will depend very much on exactly what input string you expect.

If your test string is accurate then I would not use a regex, do this instead (as suggested by Marnen Laibow-Koser):

mystring.split('?v=3')

If you really don't have constant fluff between your useful strings then regex might be better. Your regex is greedy. This will get you part way:

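mystring.scan(/https?:\/\/[\w.\/-]*?\.(?:jpe?g|gif|png)/)

Note the '?' after the '*' in the part matching the server and path pieces of the URL; this makes that part non-greedy. The extension group is non-capturing, (?:...), because scan returns only the captured groups when a pattern contains any, and the hyphen sits at the end of the character class so that it is a literal hyphen rather than part of a range.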

The problem with this is that if your server name or path contains any of .jpg, .jpeg, .gif or .png then the result will be wrong in that instance.

Figuring out what is best needs more information about your input string. You might for example find it better to pattern match the fluff between your desired URLs.
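
If the fluff after each image URL always starts with a `?`, whitespace, or the end of the string, a lookahead can keep the match from stopping at an extension that appears earlier in the path. That is an assumption about the input, and `IMAGE_URL` and the sample string are made up for illustration:

```ruby
# The lookahead (?=\?|\s|\z) accepts an extension only when it is followed
# by a query string, whitespace, or the end of the input.
IMAGE_URL = /https?:\/\/[\w.\/-]*?\.(?:jpe?g|gif|png)(?=\?|\s|\z)/i

mystring = "http://a.com/x.jpg.backup/pic.png?v=3http://b.org/i/y.jpeg?v=3"
images = mystring.scan(IMAGE_URL)
puts images
```

Here the ".jpg" inside the first path is skipped because it is followed by a dot rather than by one of the accepted terminators.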

Extracting URLs from a String that do not contain 'http'

Use regular expressions:

Here is a basic one that should work for most cases:

/(https?:\/\/)?\w*\.\w+(\.\w+)*(\/\w+)*(\.\w*)?/.match(a).to_s

This only fetches the first URL in the string and returns it as a string.
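
To collect every URL instead of only the first, the same pattern can be passed to `scan`. The groups have to become non-capturing first, because `scan` returns the captured groups rather than the whole match when a pattern contains any. A minimal sketch, with `LOOSE_URL` and the sample text as illustrative assumptions:

```ruby
# Same pattern as above, with every group turned into (?:...).
LOOSE_URL = /(?:https?:\/\/)?\w*\.\w+(?:\.\w+)*(?:\/\w+)*(?:\.\w*)?/

a = "Visit example.com/docs and https://www.ruby.org/en today."
all_urls = a.scan(LOOSE_URL)
puts all_urls
```

Note that `\w` does not match hyphens, so host names such as ruby-lang.org are only partially matched by this pattern.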


