Function to Extract Domain Name from URL in R

Function to extract domain name from URL in R

I don't know of a function in a package that does this, and I don't think there's anything in the base install of R. Write a user-defined function and store it somewhere to source later, or make your own package with it.

x1 <- "http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r"
x2 <- "http://www.talkstats.com/"
x3 <- "www.google.com"

# strip the protocol and "www.", then keep everything before the first "/"
domain <- function(x) strsplit(gsub("http://|https://|www\\.", "", x), "/")[[c(1, 1)]]

domain(x3)
## [1] "google.com"

sapply(list(x1, x2, x3), domain)
## [1] "stackoverflow.com" "talkstats.com"     "google.com"

Return root domain from url in R

There are two tasks here. The first is parsing the URL to get the host name, which can be done with the httr package's parse_url function:

library(httr)

host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"

The second is extracting the organizational domain (or root domain, top private domain--whatever you want to call it). This can be done using the tldextract package (which is inspired by the Python package of the same name and uses Mozilla's public suffix list):

library(tldextract)

domain.info <- tldextract(host)
domain.info
#                       host subdomain   domain   tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk

tldextract returns a data frame, with a row for each domain you give it, but you can easily paste together the relevant parts:

paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"

How to get domain name from URL

I once had to write such a regex for a company I worked for. The solution was this:

  • Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but lacks ac.uk, for example, so it is not really usable for this.
  • Join the list as in the example below. A warning: ordering is important! If org.uk appeared after uk, then example.org.uk would match org instead of example.

Example regex:

([^.]+)\.(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$

This worked really well and also matched weird, unofficial top-levels like de.com and friends.
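For illustration, a rough R sketch of this approach (the suffix list here is deliberately truncated; a real one would need every ccTLD and gTLD, ordered as described above):

tld_regex <- "([^.]+)\\.(com|net|org|info|co\\.uk|org\\.uk|ac\\.uk|uk)$"

hosts <- c("www.example.com", "example.org.uk", "foo.de.com")
regmatches(hosts, regexpr(tld_regex, hosts))
## [1] "example.com"    "example.org.uk" "de.com"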

The upside:

  • Very fast if regex is optimally ordered

The downside of this solution is of course:

  • Handwritten regex which has to be updated manually if ccTLDs change or get added. Tedious job!
  • Very large regex so not very readable.

Extract domain name without suffix or subdomain

You're working with domain names, so you may want to use some tools that were designed to do so:

library(urltools)

# stringsAsFactors = FALSE keeps the sites as plain character strings
df <- data.frame(site = c("Google.com", "yahoo.in", "facebook.com", "badge.net"),
                 stringsAsFactors = FALSE)

suffix_extract(df$site)
##           host subdomain   domain suffix
## 1   Google.com      <NA>   google    com
## 2     yahoo.in      <NA>    yahoo     in
## 3 facebook.com      <NA> facebook    com
## 4    badge.net      <NA>    badge    net

In response to @Sotos:

urltools::suffix_extract('www.bankofcyprus.com')
##                   host subdomain       domain suffix
## 1 www.bankofcyprus.com       www bankofcyprus    com
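
If you want the registrable domain as a single string (as in the tldextract answer above), you can paste the pieces back together; a quick sketch using the same function:

parts <- urltools::suffix_extract(c("www.bankofcyprus.com", "yahoo.in"))
paste(parts$domain, parts$suffix, sep = ".")
## should give "bankofcyprus.com" "yahoo.in"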

How to extract domain suffix?

uses
  System.SysUtils;

var
  u: string;
  arr: TArray<string>;
begin
  try
    u := 'https://stackoverflow.com/questions/71166883/how-to-extract-domain-suffix';
    // drop the protocol
    arr := u.Split(['://'], TStringSplitOptions.ExcludeEmpty);
    u := arr[High(arr)]; // stackoverflow.com/questions/71166883/how-to-extract-domain-suffix
    // keep only the host part
    arr := u.Split(['/'], TStringSplitOptions.ExcludeEmpty);
    u := arr[0]; // stackoverflow.com
    // the suffix is the last dot-separated label
    arr := u.Split(['.'], TStringSplitOptions.ExcludeEmpty);
    u := arr[High(arr)]; // com
    Writeln('Top-Level-Domain: ', u);
    Readln;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.

Does R have any package for parsing out the parts of a URL?

Since parse_url() uses regular expressions anyway, we may as well reinvent the wheel and create a single regular expression replacement in order to build a sweet and fancy gsub call.

Let's see. A URL consists of a protocol, a "netloc" (which may include username, password, hostname and port components), and a remainder, which we happily strip away. Let's assume for now that there is no username, password, or port.

  • ^(?:(?:[[:alpha:]+.-]+)://)? matches the protocol header (copied from parse_url()); we strip it away if we find it
  • Also, a potential www. prefix is stripped away, but not captured: (?:www\\.)?
  • Anything up to the subsequent slash will be our fully qualified host name, which we capture: ([^/]+)
  • The rest we ignore: .*$

Now we plug together the regexes above, and the extraction of the hostname becomes:

PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
PREFIX_REGEX <- "(?:www\\.)?"
HOSTNAME_REGEX <- "([^/]+)"
REST_REGEX <- ".*$"
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)

Change host name regex to include (but not capture) the port:

HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"
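
For instance, rebuilding URL_REGEX with this hostname pattern also copes with an explicit port (a small sketch; the example URL is made up):

URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX,
                    "([^:/]+)(?::[0-9]+)?", REST_REGEX)
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)

domain.name("http://www.example.com:8080/index.html")
## should give "example.com"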

And so forth and so on, until we finally arrive at an RFC-compliant regular expression for parsing URLs. However, for home use, the above should suffice:

> domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
+               "http://test.com/?ex"))
[1] "test.server.com" "google.com"      "test.com"

Data Studio calculated field: how to extract domain from url

Adapted the Google Sheets formula from the question to Google Data Studio, using a calculated field:

TRIM(REGEXP_EXTRACT(REGEXP_REPLACE(REGEXP_REPLACE(URL, "https?://", ""), R"^(w{3}\.)?", ""), "([^/?]+)"))

An editable Google Data Studio report (with an embedded Google Sheets data source) and a GIF were provided to elaborate.


How to extract the domain from a URL

This should work:

local url = "foo.bar.google.com"
local domain = url:match("[%w%.]*%.(%w+%.%w+)")
print(domain)

Output: google.com

The pattern [%w%.]*%.(%w+%.%w+) captures the content after the second dot from the end, i.e. the last two dot-separated labels of the host name.


