Function to extract domain name from URL in R
I don't know of a function in a package that does this, and I don't think there's anything in the base install of R. Write a user-defined function and store it somewhere to source later, or make your own package with it.
x1 <- "http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r"
x2 <- "http://www.talkstats.com/"
x3 <- "www.google.com"
domain <- function(x) strsplit(gsub("http://|https://|www\\.", "", x), "/")[[c(1, 1)]]
domain(x3)
sapply(list(x1, x2, x3), domain)
## [1] "stackoverflow.com" "talkstats.com" "google.com"
Return root domain from url in R
There are two tasks here. The first is parsing the URL to get the host name, which can be done with the httr package's parse_url function:
library(httr)
host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"
The second is extracting the organizational domain (or root domain, top private domain, whatever you want to call it). This can be done using the tldextract package (which is inspired by the Python package of the same name and uses Mozilla's public suffix list):
library(tldextract)
domain.info <- tldextract(host)
domain.info
# host subdomain domain tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk
tldextract returns a data frame, with a row for each domain you give it, but you can easily paste together the relevant parts:
paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"
How to get domain name from URL
I once had to write such a regex for a company I worked for. The solution was this:
- Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but lacks ac.uk, for example, so it is not really usable for this.
- Join the list as in the example below. A warning: ordering is important! If org.uk appeared after uk, then example.org.uk would match org instead of example.
Example regex:
.*([^\.]+)(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$
This worked really well and also matched weird, unofficial top-levels like de.com and friends.
The upside:
- Very fast if regex is optimally ordered
The downside of this solution is of course:
- Handwritten regex which has to be updated manually if ccTLDs change or get added. Tedious job!
- Very large regex so not very readable.
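The ordered-alternation idea above can be sketched in a few lines. This is an illustrative Python sketch, not the original author's code; the suffix list here is a tiny hand-picked sample purely for demonstration (a real implementation would load the full IANA or public suffix list):

```python
import re

# Tiny sample suffix list for illustration only; a real list would come from
# IANA / the public suffix list. Ordering matters: multi-label suffixes such
# as "org.uk" must appear before "uk" in the alternation.
SUFFIXES = ["co.uk", "org.uk", "ac.uk", "uk", "com", "net", "org", "info"]
SUFFIX_RE = re.compile(
    r"([^.]+)\.(" + "|".join(re.escape(s) for s in SUFFIXES) + r")$"
)

def root_domain(host):
    """Return 'label.suffix' for a bare host name, or None if nothing matches."""
    m = SUFFIX_RE.search(host)
    return m.group(0) if m else None

print(root_domain("example.org.uk"))   # example.org.uk
print(root_domain("www.google.com"))   # google.com
```

Because the suffixes are escaped and ordered longest-first, example.org.uk correctly matches org.uk rather than the bare uk entry.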
Extract domain name without suffix or subdomain
You're working with domain names, so you may want to use some tools that were designed to do so:
library(urltools)
df <- data.frame(site=c("Google.com", "yahoo.in", "facebook.com", "badge.net"), stringsAsFactors=FALSE)
suffix_extract(df$site)
## host subdomain domain suffix
## 1 Google.com <NA> google com
## 2 yahoo.in <NA> yahoo in
## 3 facebook.com <NA> facebook com
## 4 badge.net <NA> badge net
In response to @Sotos:
urltools::suffix_extract('www.bankofcyprus.com')
## host subdomain domain suffix
## 1 www.bankofcyprus.com www bankofcyprus com
How to extract domain suffix?
uses
  System.SysUtils;

var
  u: string;
  arr: TArray<string>;
begin
  try
    u := 'https://stackoverflow.com/questions/71166883/how-to-extract-domain-suffix';
    arr := u.Split(['://'], TStringSplitOptions.ExcludeEmpty);
    u := arr[High(arr)]; // stackoverflow.com/questions/71166883/how-to-extract-domain-suffix
    arr := u.Split(['/'], TStringSplitOptions.ExcludeEmpty);
    u := arr[0]; // stackoverflow.com
    arr := u.Split(['.'], TStringSplitOptions.ExcludeEmpty);
    u := arr[High(arr)]; // com
    Writeln('Top-Level-Domain: ', u);
    Readln;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.
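The same split-by-separators technique translates to almost any language. Here is a hedged Python sketch of the approach above (split off the scheme, keep the host, take the last dot-separated label); note it returns only the final label, so it cannot distinguish multi-label suffixes like co.uk:

```python
def tld_by_splitting(url):
    # Mirror of the split-based approach above (an illustrative sketch):
    rest = url.split("://")[-1]   # drop "https://" if present
    host = rest.split("/")[0]     # keep everything before the first "/"
    return host.split(".")[-1]    # the last label is the suffix

print(tld_by_splitting("https://stackoverflow.com/questions/71166883"))  # com
```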
Does R have any package for parsing out the parts of a URL?
Since parse_url() uses regular expressions anyway, we may as well reinvent the wheel and create a single regular-expression replacement in order to build a sweet and fancy gsub call.
Let's see. A URL consists of a protocol, a "netloc" (which may include username, password, hostname, and port components), and a remainder which we happily strip away. Let's assume for now that there is no username, password, or port.
- ^(?:(?:[[:alpha:]+.-]+)://)? matches the protocol header (copied from parse_url()); we strip this away if we find it.
- A potential www. prefix is also stripped away, but not captured: (?:www\\.)?
- Anything up to the subsequent slash is our fully qualified host name, which we capture: ([^/]+)
- The rest we ignore: .*$
Now we plug together the regexes above, and the extraction of the hostname becomes:
PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
PREFIX_REGEX <- "(?:www\\.)?"
HOSTNAME_REGEX <- "([^/]+)"
REST_REGEX <- ".*$"
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)
Change host name regex to include (but not capture) the port:
HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"
And so forth and so on, until we finally arrive at an RFC-compliant regular expression for parsing URLs. However, for home use, the above should suffice:
> domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
"http://test.com/?ex"))
[1] "test.server.com" "google.com" "test.com"
Data Studio calculated field: how to extract domain from url
Adapted the respective Google Sheets formula in the question to Google Data Studio, using the Calculated Field:
TRIM(REGEXP_EXTRACT(REGEXP_REPLACE(REGEXP_REPLACE(URL, "https?://", ""), R"^(w{3}\.)?", ""), "([^/?]+)"))
How to extract the domain from a URL
This should work:
local url = "foo.bar.google.com"
local domain = url:match("[%w%.]*%.(%w+%.%w+)")
print(domain)
Output: google.com
The pattern [%w%.]*%.(%w+%.%w+) captures the content after the second dot from the end.
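The "last two dot-separated labels" idea behind that Lua pattern can be sketched in Python as well. This is an illustrative equivalent, not the original answer's code, and it is deliberately naive: it returns co.uk for example.co.uk, so it only works for single-label suffixes:

```python
def last_two_labels(host):
    # Take the final two dot-separated labels, like the Lua pattern above.
    # Naive: multi-label suffixes such as "co.uk" are not handled.
    return ".".join(host.split(".")[-2:])

print(last_two_labels("foo.bar.google.com"))  # google.com
```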