Extract all urls inside a string in Ruby
A different approach, from the perfect-is-the-enemy-of-the-good school of thought:
urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
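One weakness of splitting on whitespace is that sentence punctuation stays glued to the URL. A minimal sketch of the same idea with a cleanup step follows; the helper name `rough_urls`, the sample text, and the set of punctuation characters to strip are all assumptions for illustration, not part of the original answer.

```ruby
# Hypothetical helper: whitespace-split extraction plus trailing-punctuation cleanup.
def rough_urls(content)
  content.split(/\s+/)
         .find_all { |u| u =~ /^https?:/ }          # keep words that start with http: or https:
         .map { |u| u.sub(/[.,;:!?)]+\z/, "") }     # drop punctuation stuck to the end
end

sample = "See https://example.com/a, then http://example.org/b?x=1. Not ftp://example.net/c"
puts rough_urls(sample)
```

This is still a rough filter: a URL that legitimately ends in one of the stripped characters would be shortened.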
Extracting all URLs from a page using Ruby
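Remember that a URL doesn't have to start with "http": it could be a relative URL, that is, a path relative to the current page. In my opinion it is best to use Nokogiri to parse the HTML:
require 'open-uri'
require 'nokogiri'
# URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
reqt = URI.open("http://www.google.com")
doc = Nokogiri::HTML(reqt)
# visit every <a> element that has an href attribute
doc.xpath('//a[@href]').each do |a|
  puts a.attr('href')
end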
But if you really want to find only the absolute URLs, add a simple condition:
puts a.attr('href') if a.attr('href') =~ /^http/i
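If you would rather keep the relative links and turn them into absolute ones, the standard library's `URI.join` can resolve each href against the page URL. This is only a sketch: the helper name `absolutize`, the base URL, and the href list are made up for illustration, and in practice the hrefs would come from the Nokogiri loop.

```ruby
require 'uri'

# Hypothetical helper: resolve each href against the page it was found on.
def absolutize(base_url, hrefs)
  hrefs.map { |href| URI.join(base_url, href).to_s }
end

base  = "http://www.google.com/dir/page.html"              # assumed page URL
hrefs = ["/about", "other.html", "http://example.com/x"]   # assumed href values
puts absolutize(base, hrefs)
```

Note that `URI.join` raises an error for hrefs that are not valid URIs (for example, ones containing spaces), so real-world pages may need a rescue around it.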
Pull all URLS out of this string in Ruby with scan method
You can use the twitter-text gem to extract URLs.
Extract URLs from String (Ruby) (Regex and link shortened)
Here is one approach using match:
match = /(\w+\.\w+\/\w+)/.match("Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum")
if match
  puts match[1]
else
  puts "no match"
end
If you also need to capture full URLs at the same time, this answer would have to be updated; it only addresses your immediate question.
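As a rough sketch of how that could look, `scan` returns every match instead of only the first, and an optional scheme prefix lets the same pattern pick up full URLs too. The helper name `short_links` and the sample text are assumptions for illustration.

```ruby
# Hypothetical helper: find every "host.tld/code" style link, with or without a scheme.
def short_links(text)
  # (?:...) is non-capturing, so scan returns whole matches rather than group contents
  text.scan(/(?:https?:\/\/)?\w+\.\w+\/\w+/)
end

text = "Lorem swoo.sh/12xfsW ipsum https://bit.ly/3abcDE dolor"
puts short_links(text)
```

The pattern is deliberately narrow: it expects exactly one dot in the host and a single path segment, so longer URLs would only be matched in part.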
How to extract URLs from text
What cases are failing?
According to the regexpert library, you can use
regexp = /(^$)|(^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix
and then perform a scan on the text.
EDIT: It seems the regexp also matches the empty string. Just remove the initial (^$)| and you're done.
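Two details are worth keeping in mind when using this pattern with `scan`: the `^` and `$` anchors mean only lines that consist entirely of a URL will match, and capturing groups make `scan` return arrays of group contents instead of strings. The sketch below drops the empty-string alternative and turns the groups into non-capturing ones; that second change, the constant and method names, and the sample text are my assumptions, not part of the library's pattern.

```ruby
# The pattern from above without the (^$) alternative; groups made non-capturing
# (an assumption) so that scan returns plain strings.
LINE_URL = /^(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?:[0-9]{1,5})?\/.*)?$/ix

# Hypothetical helper: return the lines of text that are entirely a URL.
def url_lines(text)
  text.scan(LINE_URL)
end

text = "intro line\nhttp://example.com/path\nsee https://example.org here\nhttps://example.org\n"
puts url_lines(text)
```

The third line of the sample is skipped because the URL is surrounded by other words; extracting URLs from the middle of a line would require removing the anchors.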
regex to extract URLs from text - Ruby
Find words that look like URLs:
str = "ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.\n\nhttps://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/\n\nwww.jstor.org/stable/24084454\n\nwww.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/\n\ninsu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so\n\nwww.cerege.fr/spip.php?page=pageperso&id_user=94"
str.split.select { |w| w[/\b\.\w+/] }
This gives you an array of whitespace-delimited words that contain a . with a word character on each side, which MIGHT work for your use case.
puts str.split.select { |w| w[/\b\.\w+/] }
www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Updated
Complete solution to modify your string:
str_with_quote = str.clone # make a clone for the `gsub!`
str.split.select { |w| w[/\b\.\w+/] }
   .each { |url| str_with_quote.gsub!(url, '"' + url + '"') }
Now the cloned string has the URLs wrapped in double quotes:
puts str_with_quote
This gives the following output:
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, "www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les" Belles lettres, 2001.
"https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/"
"www.jstor.org/stable/24084454"
"www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/"
"insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so"
"www.cerege.fr/spip.php?page=pageperso&id_user=94"
Ruby regex: extract a list of urls from a string
The best answer will depend very much on exactly what input string you expect.
If your test string is accurate then I would not use a regex, do this instead (as suggested by Marnen Laibow-Koser):
mystring.split('?v=3')
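If you really don't have constant fluff between your useful strings, then a regex might be better. Your regex is greedy. This will get you part way:
mystring.scan(/https?:\/\/[\w.\/-]*?\.(?:jpe?g|gif|png)/)
Note the '?' after the '*' in the part matching the server and path pieces of the URL; this makes the regex non-greedy. The extension group is non-capturing, (?:...), so that scan returns whole URLs rather than only the extensions, and the hyphen sits last in the character class so that it is read literally rather than as a range.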
The problem with this is that if your server name or path contains any of .jpg, .jpeg, .gif or .png then the result will be wrong in that instance.
Figuring out what is best needs more information about your input string. You might for example find it better to pattern match the fluff between your desired URLs.
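To make the two options concrete, here is a small sketch run against an invented input in which every image URL is followed by the same `?v=3` suffix. The sample string and the helper names `by_split` and `by_scan` are assumptions; the real input may look different.

```ruby
# Hypothetical helper: cut the string at the constant "fluff" between URLs.
def by_split(mystring)
  mystring.split('?v=3')
end

# Hypothetical helper: non-greedy scan for image URLs.
def by_scan(mystring)
  mystring.scan(/https?:\/\/[\w.\/-]*?\.(?:jpe?g|gif|png)/)
end

mystring = "http://a.example/x.jpg?v=3http://b.example/y.png?v=3"  # assumed input format
puts by_split(mystring)
puts by_scan(mystring)
```

On this input both approaches give the same two URLs; they diverge as soon as the separator varies or a path contains an image extension in the middle.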
Extracting URLs from a String that do not contain 'http'
Use regular expressions.
Here is a basic one that should work for most cases:
/(https?:\/\/)?\w*\.\w+(\.\w+)*(\/\w+)*(\.\w*)?/.match(a).to_s
This will only fetch the first URL in the string and return it as a string.
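To collect every URL rather than just the first, the same pattern can be handed to `scan`. The groups have to become non-capturing for that, otherwise `scan` returns the group contents instead of the full matches. This is a sketch under that assumption; the helper name `all_urls` and the sample string are made up.

```ruby
# Hypothetical helper: the pattern above with non-capturing groups, applied with scan.
def all_urls(a)
  a.scan(/(?:https?:\/\/)?\w*\.\w+(?:\.\w+)*(?:\/\w+)*(?:\.\w*)?/)
end

a = "visit example.com/docs/index.html or https://www.example.org/a/b now"
puts all_urls(a)
```

Because `\w` does not match hyphens or query-string characters, hosts such as `my-site.com` or URLs with `?key=value` parts would be cut short by this basic pattern.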