How to get domain name from URL
I once had to write such a regex for a company I worked for. The solution was this:
- Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but when I checked it lacked ac.uk, for example, so it was not really usable for this.
- Join the list like the example below. A warning: ordering is important! List longer suffixes first: if org.uk appeared after uk, example.org.uk could match org instead of example.
Example regex:
^(?:.*?\.)?([^.]+)\.(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$
This worked really well and also matched weird, unofficial top-levels like de.com and friends.
The upside:
- Very fast if regex is optimally ordered
The downside of this solution is of course:
- Handwritten regex which has to be updated manually if ccTLDs change or get added. Tedious job!
- Very large regex so not very readable.
Get domain name from given url
If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems: its equals method does a DNS lookup, which means code using it can be vulnerable to denial of service attacks when used with untrusted inputs.
"Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.
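import java.net.URI;
import java.net.URISyntaxException;

public static String getDomainName(String url) throws URISyntaxException {
    URI uri = new URI(url);
    String domain = uri.getHost(); // null for relative URIs such as "www/foo"
    if (domain == null) return null;
    return domain.startsWith("www.") ? domain.substring(4) : domain;
}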
should do what you want.
Though it seems to work fine, is there a better approach, or are there edge cases that could fail?
Your code as written fails for the valid URLs:
- httpfoo/bar: a relative URL with a path component that starts with http.
- HTTP://example.com/: the protocol is case-insensitive.
- //example.com/: a protocol-relative URL with a host.
- www/foo: a relative URL with a path component that starts with www.
- wwwexample.com: a domain name that does not start with www. but starts with www.
Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.
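The same advice holds outside Java. As a hedged illustration, the Python sketch below mirrors getDomainName using the standard library's urllib.parse.urlsplit; the function name get_domain_name is made up for this example. It returns None when the input has no host instead of guessing.

```python
from urllib.parse import urlsplit

def get_domain_name(url):
    """Return the host without a leading "www.", or None if there is no host."""
    host = urlsplit(url).hostname  # lower-cased; None for relative references
    if host is None:
        return None
    return host[4:] if host.startswith("www.") else host

print(get_domain_name("HTTP://www.Example.com/"))  # example.com
print(get_domain_name("//example.com/"))           # example.com
print(get_domain_name("www/foo"))                  # None
print(get_domain_name("http://wwwexample.com/"))   # wwwexample.com
```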
If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:
Appendix B. Parsing a URI Reference with a Regular Expression
As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.
The following line is the regular expression for breaking-down a well-formed URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis).
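To make the group numbering concrete, here is a small Python sketch that applies the Appendix B expression and names the five components. The function name split_uri and the dictionary keys are choices made for this example; the group-to-component mapping (2, 4, 5, 7, 9) follows the RFC.

```python
import re

# Regular expression quoted from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(ref):
    """Split a URI reference into its five components (None if absent)."""
    m = URI_RE.match(ref)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),  # may still contain userinfo and a port
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts["authority"])  # www.ics.uci.edu
print(parts["path"])       # /pub/ietf/uri/
```

Note that the expression only splits the reference; it does not validate it, and the authority still has to be separated into user info, host and port if only the host is wanted.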
Get The Current Domain Name With Javascript (Not the path, etc.)
How about:
window.location.hostname
The location object actually has a number of properties referring to different parts of the URL, such as protocol, host, hostname, port, pathname, search, and hash.
Get domain name from full URL
The code below should do the job.
<?php
function get_domain($url)
{
    $pieces = parse_url($url);
    // Inputs without a scheme (e.g. "example.com") end up in 'path'
    $domain = isset($pieces['host']) ? $pieces['host'] : (isset($pieces['path']) ? $pieces['path'] : '');
    // Heuristic: one label followed by a 2-6 character suffix of letters and dots
    if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
        return $regs['domain'];
    }
    return false;
}
print get_domain("http://mail.somedomain.co.uk"); // outputs 'somedomain.co.uk'
?>
Extract domain name from URL in Python
Use tldextract. Where urlparse only splits a URL into its syntactic parts, tldextract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL.
>>> import tldextract
>>> ext = tldextract.extract('http://forums.news.cnn.com/')
>>> ext
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> ext.domain
'cnn'
Extract hostname name from string
There is no need to parse the string yourself; just pass your URL as an argument to the URL constructor (it throws a TypeError if the string is not a valid absolute URL):
const url = 'http://www.youtube.com/watch?v=ClkQA2Lb_iE';
const { hostname } = new URL(url);
console.assert(hostname === 'www.youtube.com');
Extract main domain name from a given url
As suggested by BalusC and others, the most practical solution would be to get a list of TLDs (see this list), save them to a file, load them, and then determine which TLD a given URL string uses. From there you could construct the main domain name as follows:
String url = "zoyanailpolish.blogspot.com";
String tld = findTLD(url); // To be implemented, e.g. in a helper class
// Strip only the trailing "." + tld; replace() would remove every occurrence
String rest = url.endsWith("." + tld)
        ? url.substring(0, url.length() - tld.length() - 1)
        : url;
int pos = rest.lastIndexOf('.');
// pos == -1 means there is no subdomain, so substring(0) keeps the whole label
String mainDomain = rest.substring(pos + 1) + "." + tld;
The implementation details are left up to you.
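One possible shape for the missing lookup, sketched in Python under stated assumptions: the suffix set below is a tiny hypothetical sample standing in for the file-loaded TLD list, and find_suffix and main_domain are names invented for this example, not part of any library.

```python
# Hypothetical suffix set; in practice it would be loaded from the saved file.
SUFFIXES = {"com", "org", "net", "uk", "co.uk", "org.uk"}

def find_suffix(host, suffixes):
    """Return the longest suffix of `host` found in `suffixes`, or None."""
    labels = host.lower().split(".")
    # Try candidates from longest to shortest, always leaving at least
    # one label in front of the suffix.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return candidate
    return None

def main_domain(host, suffixes):
    """Return the label directly before the suffix plus the suffix, or None."""
    suffix = find_suffix(host, suffixes)
    if suffix is None:
        return None
    rest = host.lower()[: -(len(suffix) + 1)]  # drop "." + suffix from the end
    return rest.rsplit(".", 1)[-1] + "." + suffix

print(main_domain("zoyanailpolish.blogspot.com", SUFFIXES))  # blogspot.com
print(main_domain("www.example.co.uk", SUFFIXES))            # example.co.uk
```

Checking candidates longest first plays the same role as ordering the alternatives in a regex: co.uk wins over uk whenever both are in the set.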