How to extract top-level domain name (TLD) from URL
No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only like zap.co.uk).
You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly, like the UK's and Australia's. There's no way of divining that from just staring at the string without such extra semantic knowledge. (Of course the rules can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!)
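As a rough illustration of the "auxiliary table" approach, here is a minimal Python sketch. The table `KNOWN_SUFFIXES` and the function `split_suffix` are hypothetical names, and the table is a tiny hand-written excerpt that simply encodes the claims above (co.uk is a suffix, co.it is not). A real table would have to come from a maintained source such as the Public Suffix List.

```python
from urllib.parse import urlsplit

# Hypothetical, hand-written excerpt of a suffix table, for illustration only.
# A real table would be loaded from a maintained source.
KNOWN_SUFFIXES = {"com", "it", "uk", "au", "co.uk", "com.au"}


def split_suffix(url):
    """Return (registrable_domain, suffix) using the longest known suffix."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Walk from the longest candidate to the shortest, so "co.uk" wins over "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_SUFFIXES:
            # The registrable domain is the suffix plus one label to its left.
            registrable = ".".join(labels[i - 1:]) if i > 0 else None
            return registrable, candidate
    return None, None


print(split_suffix("http://www.zap.co.uk/page"))
print(split_suffix("http://zap.co.it/"))
```

With this table, zap.co.uk is reported as the registrable domain under the suffix co.uk, while zap.co.it is treated as a subdomain of co.it under the suffix it. The string alone never decides this; the table does.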
Extract top level domain from URL
You can try something like this:
((?<![^\/]\/)\b\w+\.\b\w{2,3}(?:\.\b\w{2})??)(?:$|\/)
Breaking Down the Pattern:
(?<![^\/]\/) - Ensures that the match is not preceded by a single slash (since /index.php looks like a domain), while still allowing it to be preceded by a double slash (as in https://).
\b\w+\. - Captures the main domain, ensuring that the entire string is a word by using a word boundary on the left and requiring a dot on the right. (Without the \b the pattern would skip only the i in /index.php and capture ndex.php, which is why the \b is required.)
\b\w{2,3} - Matches the top-level domain (.com).
(?:\.\b\w{2})? - Optional; matches the country-specific TLD if available.
(?:$|\/) - Requires that the entire match is followed by either the end of the string ($) or a forward slash (\/).
Alternative that uses lookahead instead of capture group:
(?<![^\/]\/)\b\w+\.\b\w{2,3}(?:\.\b\w{2})?(?=$|\/)
Essentially, you remove the capturing group and replace the non-capturing group at the end, (?:$|\/), with a positive lookahead, (?=$|\/).
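To see the lookahead variant in action, here is a minimal Python sketch. The names `DOMAIN_RE` and `extract_domain` are made up for this example, and the slashes are left unescaped because Python does not require escaping them. The pattern is otherwise the one shown above.

```python
import re

# Lookahead variant of the pattern above; "/" needs no escaping in Python.
DOMAIN_RE = re.compile(r"(?<![^/]/)\b\w+\.\b\w{2,3}(?:\.\b\w{2})?(?=$|/)")


def extract_domain(url):
    """Return the first domain-like match in url, or None if nothing matches."""
    match = DOMAIN_RE.search(url)
    return match.group(0) if match else None


for url in (
    "https://www.example.co.uk/path",
    "http://example.com/index.php",
    "https://www.example.com",
):
    print(url, "->", extract_domain(url))
```

Because the trailing slash is only asserted by the lookahead and never consumed, the whole match is the domain, so no capture group is needed.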
Get .tld from URL via PHP
Use the parse_url() function to get the host part of the URL, then explode it by . and take the last element of the resulting array.
Example below:
$url = 'http://www.example.com/site';
$parts = explode('.', parse_url($url, PHP_URL_HOST)); // end() expects a variable, not an expression
echo end($parts); // echoes "com"
Before that, it is worth checking that $url is an actual URL, for example with filter_var().
EDIT:
$url = 'http://' . $_SERVER['SERVER_NAME'];
$parts = explode('.', parse_url($url, PHP_URL_HOST)); // end() expects a variable, not an expression
echo end($parts); // echoes "com" when the server name ends in .com
Extract Top Level Domain from Domain name
Updated to incorporate Traxo's point about the . wildcard; I think my answer is a little fuller so I'll leave it up, but we've both essentially come to the same solution.
//set up test variables
$aTLDList = ['ag', 'asia', 'asia_sunrise', 'com', 'com.ag', 'org.hn'];
$sDomain = "badgers.co.uk"; // for example
//build the match
$reMatch = '/^.*?\.(' . str_replace('.', '\.', implode('|', $aTLDList)) . ')$/';
$sMatchedTLD = preg_match($reMatch, $sDomain) ?
preg_replace($reMatch, "$1", $sDomain) :
"";
Resorting to regular expressions may be overkill, but it makes for a concise example. This will give you either the matched TLD or an empty string in the $sMatchedTLD variable.
The trick is to make the first .* match ungreedy (.*?), otherwise badgers.com.ag will match ag rather than com.ag.
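The greedy versus ungreedy difference can be checked in isolation. Below is a minimal Python sketch of the same pattern, not the PHP code itself: `re.escape` stands in for the str_replace call, and the names `LAZY`, `GREEDY`, and `matched_tld` are invented for this example.

```python
import re

TLD_LIST = ["ag", "asia", "asia_sunrise", "com", "com.ag", "org.hn"]

# Escape each entry so the dot in "com.ag" is matched literally.
ALTERNATION = "|".join(re.escape(tld) for tld in TLD_LIST)

LAZY = re.compile(r"^.*?\.(" + ALTERNATION + r")$")   # ungreedy prefix
GREEDY = re.compile(r"^.*\.(" + ALTERNATION + r")$")  # greedy prefix


def matched_tld(pattern, domain):
    """Return the captured TLD, or an empty string when nothing matches."""
    match = pattern.match(domain)
    return match.group(1) if match else ""


print(matched_tld(LAZY, "badgers.com.ag"))    # com.ag
print(matched_tld(GREEDY, "badgers.com.ag"))  # ag
print(repr(matched_tld(LAZY, "badgers.co.uk")))  # '' because uk is not in the list
```

The ungreedy prefix stops at the first dot from which the rest of the string matches a listed TLD, so the longer com.ag is found before the pattern ever reaches the final dot.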
Regex to extract the top level domain from a URL
You have mixed up the terms here. A TLD (top-level domain) is the last segment of a domain name, that is, the part that follows the final dot (e.g. .com, .net, etc.).
What you're searching for is the second level domain (or SLD).
I've edited Daveo's answer to fit your question, so the match is returned in the first capture group:
(?:[-a-zA-Z0-9@:%_\+~.#=]{2,256}\.)?([-a-zA-Z0-9@:%_\+~#=]*)\.[a-z]{2,6}\b(?:[-a-zA-Z0-9@:%_\+.~#?&\/\/=]*)
Here is a demo: https://regex101.com/r/x2luiO/1
Explanation:
(?:[-a-zA-Z0-9@:%_\+~.#=]{2,256}\.)? - This first part matches everything before your SLD (subdomains).
([-a-zA-Z0-9@:%_\+~#=]*) - This is your capturing group (where the domain is returned).
\.[a-z]{2,6} - This matches the TLD (wrap it in a group if you also want to capture it).
\b(?:[-a-zA-Z0-9@:%_\+.~#?&\/\/=]*) - And this is the rest of the regex, which should match the port and/or the rest of the URL (/example/page/).
It's also worth pointing out that this regex will not give the expected result for a domain that combines a second-level label with a ccTLD (country code TLD), for example .co.uk and .co.it. Both are just the ending of a domain for commercial and general websites, yet in both cases the regex will return co as the SLD.
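Both behaviours can be reproduced with a minimal Python sketch. The names `SLD_RE` and `extract_sld` are made up for this example; the pattern is the one above with the unnecessary escapes dropped.

```python
import re

# Same pattern as above; "+" and "/" need no escaping inside a Python character class.
SLD_RE = re.compile(
    r"(?:[-a-zA-Z0-9@:%_+~.#=]{2,256}\.)?"  # optional subdomains
    r"([-a-zA-Z0-9@:%_+~#=]*)"              # group 1: the SLD
    r"\.[a-z]{2,6}\b"                       # the TLD
    r"(?:[-a-zA-Z0-9@:%_+.~#?&/=]*)"        # port, path, query
)


def extract_sld(url):
    """Return the text captured as the SLD, or None if the pattern fails."""
    match = SLD_RE.search(url)
    return match.group(1) if match else None


print(extract_sld("https://www.example.com/path"))  # example
print(extract_sld("www.example.co.uk"))             # co
```

The second call shows the limitation described above: the pattern has no notion of two-part endings, so co is captured in place of example.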