Extract Parent Domain Name from a List of URLs Through Bash Shell Scripting

Extract parent domain name from a list of URLs through Bash shell scripting

Using awk

awk -F \/ '{l=split($3,a,"."); print (a[l-1]=="com"?a[l-2] OFS:X) a[l-1] OFS a[l]}' OFS="." file|sort -u

contatoruy.in
dicadodia.com.br
doomyjupe.com
forterins.com
gaelsyaray.com
livrariacultura.com.br
maxxivrimoveis.com.br
meguiatramandai.com.br
prategama.com
quetxviii.com
smilecire.com
suleacatan.com
theirpoem.com
toneyvaws.com
visionwebmkt.com
woadsbevy.com
yournjuju.com
zonalrems.com
zrobimystrone.pl
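
The one-liner above is dense, so here is a commented sketch of the same idea, not the original author's exact code. The function name parent_domains is made up for illustration, and it additionally assumes a host with fewer than two labels should be skipped. Like the one-liner, it only treats "com.xx" as a three-label suffix, so "co.uk" and similar are not handled.

```bash
#!/usr/bin/env bash
# parent_domains: hypothetical helper; reads URLs from files or stdin.
parent_domains() {
    awk -F/ '
    {
        # With "/" as separator, $3 is the host in scheme://host/path
        n = split($3, label, ".")
        if (n < 2) next                      # assumption: skip bare hosts
        out = label[n-1] "." label[n]        # default: last two labels
        # For suffixes such as com.br, keep one more label
        if (label[n-1] == "com" && n >= 3) out = label[n-2] "." out
        print out
    }' "$@" | sort -u
}
```

Usage: parent_domains file, or pipe the URL list into parent_domains.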

Extract parent domain/subdomain name from a list of URLs through Bash shell scripting

You can use awk to strip a leading "www." from the host field:

awk -F/ '{sub(/^www\./,"",$3); print $3}' yourfile

Test:

$ awk -F/ '{sub(/^www\./,"",$3); print $3}' yourfile
example.com
example2.com
example3.com
subdomain.example4.com
subdomain.example5.com
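
If you would rather not spawn awk for a single URL, the same result can probably be had with plain parameter expansion. This is only a sketch: the name url_host is hypothetical, and stripping the userinfo and port is an extra assumption that the awk version does not make.

```bash
#!/usr/bin/env bash
# url_host: hypothetical helper that prints the host of one URL.
url_host() {
    local host=$1
    host=${host#*://}    # drop the scheme, if present
    host=${host%%/*}     # drop the path
    host=${host##*@}     # drop user:password@, if present (assumption)
    host=${host%%:*}     # drop the port, if present (assumption)
    host=${host#www.}    # drop a leading "www."
    printf '%s\n' "$host"
}
```

For a whole file: while IFS= read -r u; do url_host "$u"; done < yourfile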

How to retrieve main domain from random subdomain in bash

I'm not an expert on domain names. Based on https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains, with minor exceptions, a domain with a two-letter suffix (a country code) has a main domain of the form something.bb.cc, and for every other suffix (usually three letters) the main domain is something.ccc.

Using bash

domain=...
md=
p2='^(.*\.)?([^.]+\.[a-z]+\.[a-z][a-z])$'
p3='^(.*\.)?([^.]+\.(com|org|net|int|edu|gov|mil))$'
# Catch-all: capture the last two labels, not just the TLD
px='^(.*\.)?([^.]+\.[a-z]+)$'

# Two-letter country codes: keep three labels
if [[ "$domain" =~ $p2 ]]; then
    md=${BASH_REMATCH[2]}
# Three-letter legacy domains: keep two labels
elif [[ "$domain" =~ $p3 ]]; then
    md=${BASH_REMATCH[2]}
# Everything else: keep two labels
elif [[ "$domain" =~ $px ]]; then
    md=${BASH_REMATCH[2]}
fi
echo "$domain -> $md"

This could be extended to handle the few four-letter domains.

how to distinguish the domain from a subdomain

My solution to find the domain as registered with the registrar (it relies on GNU cut for --output-delimiter and GNU grep for -P):

wget https://raw.githubusercontent.com/gavingmiller/second-level-domains/master/SLDs.csv

DOMAIN="www.e-learning.go4progress.co.uk";
KEEPPARTS=2;
TWOLEVELS=$( /bin/echo "${DOMAIN}" | /usr/bin/rev | /usr/bin/cut -d "." --output-delimiter=".\\" -f 1-2 | /usr/bin/rev );
if  /bin/grep -P ",\.${TWOLEVELS}" SLDs.csv >/dev/null;  then
    KEEPPARTS=3;
fi
DOMAIN=$( /bin/echo "${DOMAIN}" | /usr/bin/rev | /usr/bin/cut -d "." -f "1-${KEEPPARTS}" | /usr/bin/rev );
echo "${DOMAIN}"

Thanks to https://github.com/gavingmiller/second-level-domains and https://github.com/medialize/URI.js/issues/17#issuecomment-3976617
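
For reuse, the same lookup can be sketched as a function that avoids rev, cut and grep -P. This is an illustration under assumptions, not a drop-in replacement: the name registered_domain is made up, the second argument is the path to a local copy of SLDs.csv, and each line of that file is assumed to look like ".uk,.co.uk" (possibly with a trailing carriage return).

```bash
#!/usr/bin/env bash
# registered_domain HOST SLD_FILE  (hypothetical helper)
registered_domain() {
    local host=$1 sld_file=$2
    local -a parts
    IFS=. read -r -a parts <<< "$host"      # split the host into labels
    local n=${#parts[@]}
    if (( n <= 2 )); then                   # nothing to trim
        printf '%s\n' "$host"
        return
    fi
    local two="${parts[n-2]}.${parts[n-1]}"
    # Exact match of ".<two>" against the second CSV column
    if awk -F, -v s=".${two}" '
            { sub(/\r$/, "", $2) }
            $2 == s { found = 1; exit }
            END { exit !found }' "$sld_file"
    then
        printf '%s\n' "${parts[n-3]}.${two}"   # known second-level suffix: keep 3 labels
    else
        printf '%s\n' "$two"                   # otherwise keep 2 labels
    fi
}
```

Usage: registered_domain www.e-learning.go4progress.co.uk SLDs.csv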

Foreach loop in bash

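Using grep, with the domains in domains.csv and the URLs in url.csv:

grep -F -f domains.csv url.csv

Test results (with the files named wordlist and urllist):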

$ cat wordlist 
github.com
youtube.com
facebook.com

$ cat urllist 
| URL                           |
| ------------------------------|
| http://github.com/name        |
| http://stackoverflow.com/name2|
| http://stackoverflow.com/name3|
| http://www.linkedin.com/name3 |

$ grep -F -f wordlist urllist 
| http://github.com/name        |

Extract filename and extension in Bash

First, get file name without the path:

filename=$(basename -- "$fullfile")
extension="${filename##*.}"
filename="${filename%.*}"

Alternatively, you can focus on the last '/' of the path instead of the '.' which should work even if you have unpredictable file extensions:

filename="${fullfile##*/}"

You may want to check the documentation:

On the web, in section "3.5.3 Shell Parameter Expansion" of the Bash Reference Manual
In the bash manpage, in the section called "Parameter Expansion"
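
Note that "${filename##*.}" returns the whole name when there is no dot, and treats a dotfile such as .bashrc as all extension. A sketch that guards against both cases could look like this; split_name is a hypothetical helper, and printing "name|extension" is just a choice made for the example.

```bash
#!/usr/bin/env bash
# split_name PATH  (hypothetical helper): prints "name|extension"
split_name() {
    local base=${1##*/}          # strip everything up to the last "/"
    local name=$base ext=
    # Only split when there is a dot after the first character,
    # so "README" and ".bashrc" keep an empty extension.
    if [[ ${base#.} == *.* ]]; then
        name=${base%.*}          # remove the shortest ".suffix"
        ext=${base##*.}          # keep only the text after the last dot
    fi
    printf '%s|%s\n' "$name" "$ext"
}
```

For example, split_name /tmp/archive.tar.gz prints archive.tar|gz, because only the last dot is treated as the separator.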
