Regex to Remove the Webpage Part of a Url in Ruby

regex to remove the webpage part of a url in ruby

If your heart is set on using a regex and you know that your URLs will be pretty straightforward, you could use (.*)/.* to capture everything before the last / in your URL.

irb(main):007:0> url = "www.example.com/home/index.html"
=> "www.example.com/home/index.html"
irb(main):008:0> regex = "(.*)/.*"
=> "(.*)/.*"
irb(main):009:0> url =~ /#{regex}/
=> 0
irb(main):010:0> $1
=> "www.example.com/home"
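
A minimal sketch of the same idea wrapped in a method, assuming URLs as simple as the one above. The name strip_page is hypothetical, and returning the input unchanged when it contains no slash is an assumption of this sketch, not part of the original answer.

```ruby
# Hypothetical helper: drops everything after the last "/" in a simple URL.
def strip_page(url)
  match = url.match(%r{\A(.*)/[^/]*\z})
  # Assumption: a string with no "/" at all is returned unchanged.
  match ? match[1] : url
end

puts strip_page("www.example.com/home/index.html") # prints www.example.com/home
```

Note that a bare http://example.com would be cut down to http:/ by this pattern, so it only suits URLs that always carry a path.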

How do I remove a URL from a string in Ruby?

That seems to be working fine for a regular string:

my_str = "Top (http://www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM\" \\l \"Top)"
puts "str before: #{my_str}" # => str before: Top (http://www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM" \l "Top)
my_str.gsub!(/#{URI::regexp}/, '')
puts "str after url sub: #{my_str}" # => str after url sub: Top (" \l "Top)

But yours might have some garbage (non-printable) characters in it. Take, for instance, a stray null character right before the first slash:

#                   vv - random null character
my_str = "Top (http:\0//www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM\" \\l \"Top)"
# looks the same vv
puts "str before: #{my_str}" # => str before: Top (http://www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM" \l "Top)
my_str.gsub!(/#{URI::regexp}/, '')
puts "str after url sub: #{my_str}" # => str after url sub: Top (//www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM" \l "Top)

Now, if you copy and paste that printed output from the website, the null character is lost along the way, so the pasted string will still work:

# I copied this from the output from the line below `looks the same vv`
my_str = 'Top (http://www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM" \l "Top)'
puts "str before: #{my_str}" # => str before: Top (http://www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM" \l "Top)
my_str.gsub!(/#{URI::regexp}/, '')
puts "str after url sub: #{my_str}" # => str after url sub: Top (" \l "Top)

So to us it would look like it works. You might try removing all non-printable characters first and see whether that works for you:

my_str = "Top (http:\0//www.lafayettefitness.org/Results/2011%20CHASING%20THE%20RAINBEAU%205K%20AGE%20GROU;5DP%20RESULTS.HTM\" \\l \"Top)"
my_str.gsub!(/[^[:print:]]/, '')
my_str.gsub!(/#{URI::regexp}/, '')
puts "str after url sub: #{my_str}" # => str after url sub: Top (" \l "Top)
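
The two steps above can be combined into one method. This is only a sketch: the name remove_urls and the restriction to http and https are assumptions, and it builds the pattern with URI::DEFAULT_PARSER.make_regexp, which is what URI::regexp calls internally.

```ruby
require 'uri'

# Assumption: only http and https URLs should be removed.
URL_PATTERN = URI::DEFAULT_PARSER.make_regexp(%w[http https])

# Hypothetical helper: strips non-printable characters, then removes URLs.
# It uses gsub rather than gsub! so it always returns a string;
# gsub! returns nil when nothing was replaced.
def remove_urls(text)
  text.gsub(/[^[:print:]]/, '').gsub(URL_PATTERN, '')
end

puts remove_urls("see http:\0//www.example.com/page.html now")
```

Returning a new string also leaves the caller's original untouched, which makes it easier to compare before and after while debugging.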

remove hostname and port from url using regular expression

In JavaScript you can use this code:

var URL = "http://localhost:7001/www.facebook.com";
var newURL = URL.replace (/^[a-z]{4,5}\:\/{2}[a-z]{1,}\:[0-9]{1,4}.(.*)/, '$1'); // http or https
alert (newURL);

Regards,
Victor
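
For comparison, here is a Ruby sketch of the same task that uses the standard uri library instead of a regex. The helper name is hypothetical, and dropping the leading slash is done only to mirror the result of the JavaScript snippet.

```ruby
require 'uri'

# Hypothetical helper: returns what follows the host and port.
def strip_host_and_port(url)
  uri = URI.parse(url)
  # Drop the leading "/" so the result matches the JavaScript output.
  rest = uri.path.to_s.sub(%r{\A/}, '')
  rest += "?#{uri.query}" if uri.query
  rest
end

puts strip_host_and_port("http://localhost:7001/www.facebook.com") # prints www.facebook.com
```

Unlike the regex, this does not care whether the host contains digits, dots or uppercase letters, or how many digits the port has.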

Ruby Regular expression to match a url

You can try this:

/https?:\/\/[\S]+/

The \S means any non-whitespace character.
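
A short sketch of how this pattern might be used with String#scan to pull every URL out of a piece of text. The sample text is made up for illustration.

```ruby
text = "Docs live at https://example.com/docs and http://example.org/a?b=1 for now"

# String#scan returns every non-overlapping match of the pattern.
urls = text.scan(/https?:\/\/[\S]+/)
p urls # prints ["https://example.com/docs", "http://example.org/a?b=1"]
```

Because \S matches anything that is not whitespace, punctuation that directly follows a URL (a trailing comma or closing parenthesis, say) would be captured as part of it.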

Getting parts of a URL (Regex)

A single regex to parse and break up a full URL, including query parameters and anchors, e.g.

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

RegEx positions:

url: RegExp['$&'],
protocol: RegExp.$2,
host: RegExp.$3,
path: RegExp.$4,
file: RegExp.$6,
query: RegExp.$7,
hash: RegExp.$8

You could then further parse the host ('.' delimited) quite easily.
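
As a sketch, here is the same pattern applied in Ruby, where the groups are read from a MatchData object rather than from RegExp.$n. The constant and variable names are assumptions of this example. One thing the run shows: (.*)? is greedy, so for this sample URL the fragment ends up inside the query group and group 8 stays empty.

```ruby
# The pattern from above, written with %r{} so the slashes need no escaping.
URL_REGEX = %r{^((http[s]?|ftp):/)?/?([^:/\s]+)((/\w+)*/)([\w\-.]+[^#?\s]+)(.*)?(#[\w\-]+)?$}

url = "https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash"
m = URL_REGEX.match(url)

parts = {
  protocol: m[2], # "https"
  host:     m[3], # "www.google.com"
  path:     m[4], # "/dir/1/2/"
  file:     m[6], # "search.html"
  query:    m[7], # "?arg=0-a&arg1=1-b&arg3-c#hash" (greedy, includes the fragment)
  hash:     m[8]  # nil, because group 7 already consumed "#hash"
}
p parts
```

If the fragment is needed separately, one option is to split group 7 on "#" afterwards.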

What I would do is use something like this:

/*
^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4

Then further parse 'the rest' to be as specific as possible. Doing it all in one regex is, well, a bit crazy.
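
A rough Ruby rendering of this four-group approach, using named captures so the parts do not depend on group numbers. The constant name, the capture names and the sample URL are assumptions, and \A and \z replace ^ and $ to anchor the whole string.

```ruby
# The four-group pattern from above with named captures.
SIMPLE_URL = %r{\A(?<proto>.*:)//(?<host>[A-Za-z0-9\-.]+)(?<port>:[0-9]+)?(?<rest>.*)\z}

m = SIMPLE_URL.match("http://localhost:3000/something?x=1")
puts m[:proto] # prints http:
puts m[:host]  # prints localhost
puts m[:port]  # prints :3000
puts m[:rest]  # prints /something?x=1
```

The rest group can then be handed to a second, narrower pattern for the path, query and fragment.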

Getting all links of a webpage using Ruby

Why don't you use groups in your pattern?
E.g.:

/http[s]?:\/\/(.+)/i

So the first group will already be the link you searched for.
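
A small sketch of reading that first group in Ruby. The sample line is made up for illustration.

```ruby
line = "See http://example.com/page for details"

# match returns a MatchData object; index 1 is the first capture group.
m = line.match(/http[s]?:\/\/(.+)/i)
puts m[1] # prints example.com/page for details
```

As the output shows, .+ runs to the end of the line, so when a link is followed by other text, a narrower class such as \S+ may be a better fit for the group.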

Simple regex to replace first part of URL

You could use a regex like this:

(https?://)(.*?)(/.*)

In the substitution, you can use the capturing groups and concatenate the strings you want to generate the needed URLs.

The idea of the regex is to capture the strings before and after the domain and use \1 + staticpages + \3.

If you want to change the protocol to ftp, you could play with the capturing group indexes and use this replacement string:

ftp://\2\3

So, you would have:

ftp://localhost:3000/something
ftp://www.domainname.com/something
ftp://domainname.com/something
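
In Ruby, this substitution could be sketched with String#sub. The input URLs below are assumptions chosen to reproduce the results listed above; the original only shows the outputs.

```ruby
PATTERN = %r{(https?://)(.*?)(/.*)}

urls = [
  "http://localhost:3000/something",
  "https://www.domainname.com/something",
  "http://domainname.com/something"
]

# Single-quoted replacement strings keep \2 and \3 as backreferences.
urls.each { |u| puts u.sub(PATTERN, 'ftp://\2\3') }

# Keep the protocol and the path, swap the domain for "staticpages".
puts urls.first.sub(PATTERN, '\1staticpages\3') # prints http://staticpages/something
```

With a double-quoted replacement string, the backslashes would need doubling ("ftp://\\2\\3").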

How to parse a URL and extract the required substring

I'd do it this way:

require 'uri'

uri = URI.parse('http://something.example.com/directory/')
uri.host.split('.').first # => "something"

URI is built into Ruby. It's not the most full-featured but it's plenty capable of doing this task for most URLs. If you have IRIs then look at Addressable::URI.


