Parsing Domain from a URL

Parsing domain from a URL

Check out parse_url():

$url = 'http://google.com/dhasjkdas/sadsdds/sdda/sdads.html';
$parse = parse_url($url);
echo $parse['host']; // prints 'google.com'

parse_url() doesn't handle badly mangled URLs very well, but it is fine if you generally expect decent URLs.

How to get domain name from URL

I once had to write such a regex for a company I worked for. The solution was this:

  • Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but lacks ac.uk, for example, so it is not really usable for this.
  • Join the list as in the example below. A warning: ordering is important! If org.uk appeared after uk, then example.org.uk would match org instead of example.

Example regex:

.*?([^\.]+\.(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__))$

This worked really well and also matched weird, unofficial top-levels like de.com and friends.
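The ordered-alternation idea can be sketched as follows. This is a minimal sketch in JavaScript with a tiny illustrative TLD list (not the full IANA list); sorting longest-first enforces the ordering requirement described above:

```javascript
// A tiny illustrative sample of TLDs -- NOT the full IANA list.
const tlds = ["uk", "com", "net", "org", "co.uk", "org.uk", "ac.uk"];

// Sort longest-first so "org.uk" is tried before "uk".
tlds.sort((a, b) => b.length - a.length);

// Escape the dots and join the list into one alternation.
const pattern = tlds.map(t => t.replace(/\./g, "\\.")).join("|");
const re = new RegExp(`([^.]+)\\.(${pattern})$`);

function registrableDomain(host) {
  const m = host.match(re);
  return m ? `${m[1]}.${m[2]}` : null;
}

console.log(registrableDomain("example.org.uk")); // "example.org.uk"
console.log(registrableDomain("www.example.com")); // "example.com"
```

The sort step is doing the manual-ordering work the answer warns about; with a full list you would regenerate the pattern whenever the TLD list changes.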

The upside:

  • Very fast if regex is optimally ordered

The downside of this solution is of course:

  • Handwritten regex which has to be updated manually if ccTLDs change or get added. Tedious job!
  • Very large regex so not very readable.

Get domain name from full URL

Check the code below; it should do the job fine.

<?php

function get_domain($url)
{
    $pieces = parse_url($url);
    $domain = isset($pieces['host']) ? $pieces['host'] : $pieces['path'];
    if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
        return $regs['domain'];
    }
    return false;
}

print get_domain("http://mail.somedomain.co.uk"); // outputs 'somedomain.co.uk'

?>
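One caveat worth knowing: the pattern caps the TLD portion at six characters ([a-z\.]{2,6}), so hosts on newer, longer TLDs fail to match. A quick sketch of the same pattern in JavaScript illustrates this:

```javascript
// Same pattern as the PHP answer above. The [a-z.]{2,6} part caps the
// TLD portion at six characters, so long new TLDs slip through.
const re = /([a-z0-9][a-z0-9-]{1,63}\.[a-z.]{2,6})$/i;

function domainFromPattern(host) {
  const m = host.match(re);
  return m ? m[1] : false;
}

console.log(domainFromPattern("mail.somedomain.co.uk")); // "somedomain.co.uk"
console.log(domainFromPattern("example.photography"));   // false
```

If you need such TLDs, widen the quantifier or (better) validate against a real suffix list.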

How do I retrieve the domain from a URL?

I think, since your examples include malformed URLs as well, you need to use a regular expression to extract the domain from the URL. Please find sample code below that gets the domain for the examples you shared:

package main

import (
    "fmt"
    "regexp"
)

func main() {

    // Extract the domain using FindStringSubmatch()
    m := regexp.MustCompile(`\.?([^.]*\.com)`)

    fmt.Println(m.FindStringSubmatch("https://www.example.com/some-random-url")[1])
    fmt.Println(m.FindStringSubmatch("www.example.com/some-random-url")[1])
    fmt.Println(m.FindStringSubmatch("example.com/some-random-url")[1])
    fmt.Println(m.FindStringSubmatch("www.example.com")[1])
    fmt.Println(m.FindStringSubmatch("subdomain.example.com")[1])

}

This covers the example cases above, including incompletely formed URLs, though as written it only handles .com domains. You can extend the regex if there is a URL that doesn't get parsed correctly.


Get domain name from given url

If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems -- its equals method does a DNS lookup which means code using it can be vulnerable to denial of service attacks when used with untrusted inputs.

"Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.

public static String getDomainName(String url) throws URISyntaxException {
    URI uri = new URI(url);
    String domain = uri.getHost();
    if (domain == null) return null; // no host component (e.g. relative or opaque URI)
    return domain.startsWith("www.") ? domain.substring(4) : domain;
}

should do what you want.


Though it seems to work fine, is there any better approach, or are there some edge cases that could fail?

Your code as written fails for the valid URLs:

  • httpfoo/bar -- a relative URL with a path component that starts with http.
  • HTTP://example.com/ -- the protocol is case-insensitive.
  • //example.com/ -- a protocol-relative URL with a host.
  • www/foo -- a relative URL with a path component that starts with www.
  • wwwexample.com -- a domain name that does not start with www. but does start with www.

Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.

If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:

Appendix B. Parsing a URI Reference with a Regular Expression


As the "first-match-wins" algorithm is identical to the "greedy"
disambiguation method used by POSIX regular expressions, it is
natural and commonplace to use a regular expression for parsing the
potential five components of a URI reference.

The following line is the regular expression for breaking-down a
well-formed URI reference into its components.

  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis).
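The Appendix B regex quoted above can be used directly. A small sketch in JavaScript, applying it to the RFC's own example URI (group numbers correspond to the reference points in the excerpt):

```javascript
// The RFC 3986 Appendix B regex, with the forward slashes escaped
// for a JavaScript regex literal.
const rfc3986 = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

function components(ref) {
  const m = ref.match(rfc3986);
  // Groups 2, 4, 5, 7, 9 are scheme, authority, path, query, fragment.
  return { scheme: m[2], authority: m[4], path: m[5], query: m[7], fragment: m[9] };
}

console.log(components("http://www.ics.uci.edu/pub/ietf/uri/#Related"));
// → scheme "http", authority "www.ics.uci.edu", path "/pub/ietf/uri/",
//   query undefined, fragment "Related"
```

The host is then inside the authority component (which may also carry userinfo and a port).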

Parsing Domain Name only from URL In PHP

There is only one half-way reliable way to do this, I think, and you'll need to create a class for it. Personally I use something like a namespace\Domain class extending namespace\URI - a Domain essentially being a subset of a URI - so technically I create two classes.

Your Domain class will probably need a static member to hold the list of valid TLDs, and this may as well live in the URI class, since you may want to reuse it in other sub-classes.

namespace My;

class URI {

    protected static $tldList;
    private static $_tldRepository = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';

    protected $uri;

    public function __construct($sURI = "") {
        if (!self::$tldList) {
            // static method to load the TLD list from Mozilla
            // and parse it into an array, which sets self::$tldList
            self::loadTLDList();
        }

        // if the URI has been passed in - set it
        if ($sURI) $this->setURI($sURI);
    }

    public function setURI($sURI) {
        $this->uri = $sURI; // needs validation and sanity checks of course
    }

    public function getURI() {
        return $this->uri;
    }

    // other methods ...

}

In reality, I keep a copy of the TLD list in a file on the server and only update it every 6 months, to avoid the overhead of fetching the full TLD list the first time a URI object is created on any page.

Now you may have a Domain sub-class that extends \My\URI and breaks the URI down into component parts. There might be a method to remove the TLD, based on the TLD list you've loaded into parent::$tldList from mxr.mozilla.org. Once you've taken out the valid TLD, whatever is just to the left of it (between the last . and the TLD) should be the domain; anything left of that would be sub-domains.

You can have methods to extract that data as required as well.
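The peel-off-the-suffix logic described above might look like this. This is a hypothetical sketch in JavaScript: the suffix list is a tiny sample (a real implementation would load the full Mozilla list), and the name splitHost is made up for illustration:

```javascript
// Tiny sample suffix list -- stand-in for the full Mozilla TLD list.
const suffixes = new Set(["uk", "co.uk", "com"]);

function splitHost(host) {
  const labels = host.split(".");
  let cut = labels.length; // index where the public suffix starts
  // Walk leftwards, keeping the longest suffix that appears in the list.
  for (let i = labels.length - 1; i > 0; i--) {
    if (suffixes.has(labels.slice(i).join("."))) cut = i;
  }
  return {
    subdomains: labels.slice(0, cut - 1).join("."), // everything left of the domain
    domain: labels[cut - 1],                        // label just left of the suffix
    suffix: labels.slice(cut).join("."),            // the matched TLD/suffix
  };
}

console.log(splitHost("www.example.co.uk"));
// → { subdomains: "www", domain: "example", suffix: "co.uk" }
```

The getter methods the answer mentions would then simply read these three parts.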

Extract hostname name from string

There is no need to parse the string; just pass your URL as an argument to the URL constructor:

const url = 'http://www.youtube.com/watch?v=ClkQA2Lb_iE';
const { hostname } = new URL(url);

console.assert(hostname === 'www.youtube.com');
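One caveat: the URL constructor throws on input without a scheme, so scheme-less strings like www.youtube.com/watch need a fallback. A hedged sketch (the http:// prefix retry is a heuristic, not bulletproof - e.g. "localhost:8080/x" parses "localhost" as a scheme rather than throwing):

```javascript
// new URL(...) throws a TypeError when the input has no scheme,
// so retry with an assumed http:// prefix.
function getHostname(input) {
  try {
    return new URL(input).hostname;
  } catch {
    return new URL("http://" + input).hostname;
  }
}

console.log(getHostname("http://www.youtube.com/watch?v=ClkQA2Lb_iE")); // "www.youtube.com"
console.log(getHostname("www.youtube.com/watch?v=ClkQA2Lb_iE"));        // "www.youtube.com"
```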

Parsing Domainname From URL In PHP

The domain is stored in $_SERVER['HTTP_HOST'].

EDIT: I believe this returns the whole host. To get just the domain name, you could do this:

// Add all your wanted subdomains that act as top-level domains, here (e.g. 'co.cc' or 'co.uk')
// As array key, use the last part ('cc' and 'uk' in the above examples) and the first part as sub-array elements for that key
$allowed_subdomains = array(
    'cc' => array(
        'co'
    ),
    'uk' => array(
        'co'
    )
);

$domain = $_SERVER['HTTP_HOST'];
$parts = explode('.', $domain);
$top_level = array_pop($parts);

// Take care of allowed subdomains
if (isset($allowed_subdomains[$top_level]))
{
    if (in_array(end($parts), $allowed_subdomains[$top_level]))
        $top_level = array_pop($parts).'.'.$top_level;
}

$top_level = array_pop($parts).'.'.$top_level;
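The same pop-labels-from-the-right idea, sketched in JavaScript for illustration (function and variable names are made up; the subdomain map mirrors the PHP array above):

```javascript
// Maps a TLD to second-level labels that act like TLDs (as in the PHP above).
const allowedSubdomains = { cc: ["co"], uk: ["co"] };

function topLevel(host) {
  const parts = host.split(".");
  let top = parts.pop(); // e.g. "uk"
  // If the next label to the left is an allowed pseudo-TLD, absorb it.
  if (allowedSubdomains[top] &&
      allowedSubdomains[top].includes(parts[parts.length - 1])) {
    top = parts.pop() + "." + top; // e.g. "co.uk"
  }
  // Prepend the registrable label itself.
  return parts.pop() + "." + top;
}

console.log(topLevel("www.example.co.uk")); // "example.co.uk"
console.log(topLevel("www.example.com"));   // "example.com"
```

Like the PHP version, this only handles the pseudo-TLDs you list explicitly; a full solution would use the public suffix list instead.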

