What Is a Good Regular Expression to Match a URL

What is a good regex to match URLs in DataWeave?

I updated the regex you shared and replaced match with matches, since you want to validate the whole URL against the regex.

%dw 2.0
var myString = "https://www.mycompany.com"
output application/json
---
{
"match" : myString matches (/https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()\/@:%_\+.~#?&=]*)/)
}
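
The difference between searching for a URL and validating a whole string can be sketched outside DataWeave as well. The JavaScript below is a minimal illustration, not part of the original answer: it reuses the pattern above, and the helper names containsUrl and isUrl are made up for this example. Wrapping the pattern in ^(?: ... )$ forces it to cover the entire input, which is the behaviour matches gives you.

```javascript
// Same pattern as in the DataWeave script above.
const urlBody = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()\/@:%_\+.~#?&=]*)/;

// Anchored copy: the whole input has to be a URL (validation).
const urlWhole = new RegExp("^(?:" + urlBody.source + ")$");

// True if a URL occurs anywhere in the text.
function containsUrl(text) {
  return urlBody.test(text);
}

// True only if the text is a URL and nothing else.
function isUrl(text) {
  return urlWhole.test(text);
}

console.log(isUrl("https://www.mycompany.com"));                       // true
console.log(isUrl("see https://www.mycompany.com for details"));       // false
console.log(containsUrl("see https://www.mycompany.com for details")); // true
```

Keeping one unanchored pattern and deriving the anchored one from it avoids maintaining two copies of the same expression.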

Regex for website or url validation

Use the regex ^((https?|ftp|smtp):\/\/)?(www\.)?[a-z0-9]+\.[a-z]+(\/[a-zA-Z0-9#]+\/?)*$

This is a basic one I built just now. A Google search will turn up more.

Here:

  • ^ anchors the match at the start of the string
  • ((https?|ftp|smtp):\/\/)? optionally matches one of these schemes followed by ://
  • (www\.)? optionally matches www. (the dot is escaped so that it matches only a literal dot)
  • [a-z0-9]+\.[a-z]+ matches the domain name and the top-level domain; the snippet below widens this to (\.[a-z]{2,}){1,3}, which allows one to three dot-separated labels after the first
  • (\/[a-zA-Z0-9#]+\/?)* optionally matches path segments, each of which may end with a /
  • $ anchors the match at the end of the string

const urls = ["http://www.sample.com","https://www.sample.com/","https://www.sample.com#","http://www.sample.com/xyz","http://www.sample.com/#xyz","www.sample.com","www.sample.com/xyz/#/xyz","sample.com","sample.com?name=foo","http://www.sample.com#xyz","http://www.sample.c"];

// Optional scheme, optional www., host with one to three further labels, optional path or fragment, optional single query parameter.
const re = /^((https?|ftp|smtp):\/\/)?(www\.)?[a-z0-9]+(\.[a-z]{2,}){1,3}(#?\/?[a-zA-Z0-9#]+)*\/?(\?[a-zA-Z0-9-_]+=[a-zA-Z0-9-%]+&?)?$/;

// Print each candidate together with the test result.
urls.forEach(x => console.log(x + " => " + re.test(x)));

Regular expression to match URLs in Java

Try the following regex string instead. Your test was probably done in a case-sensitive manner. I have added the lowercase alphas as well as a proper string beginning placeholder.

String regex = "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

This works too:

String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

Note:

String regex = "<\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // matches <http://google.com>

String regex = "<^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // does not match <http://google.com>
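
The effect of ^ versus \b can be checked with a small script. This is a sketch in JavaScript rather than Java (for this pattern, ^ and \b behave the same way in both engines), and the variable names are illustrative only. The pattern body is copied from the strings above.

```javascript
// Shared pattern body, copied from the Java strings above.
const body = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

const anchored = new RegExp("^" + body);   // URL must begin at the start of the input
const bounded = new RegExp("\\b" + body);  // URL may begin at any word boundary

// The two bracketed variants from the note above.
const bracketedBoundary = new RegExp("<\\b" + body + ">");
const bracketedAnchor = new RegExp("<^" + body + ">");

console.log(anchored.test("http://google.com"));            // true
console.log(anchored.test("visit http://google.com"));      // false
console.log(bounded.test("visit http://google.com"));       // true
console.log(bracketedBoundary.test("<http://google.com>")); // true
console.log(bracketedAnchor.test("<http://google.com>"));   // false
```

The last pattern can never match: ^ asserts the start of the input, but by the time the engine reaches it the < has already been consumed.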

Regex to find a valid URL in an email body, regardless of newlines dividing it, that must contain a '?' character

Components of a URI

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment

Scheme

The scheme is the first component of a URL, such as http, which indicates that the URI uses the Hypertext Transfer Protocol. Examples of other schemes are https, ftp, file, and mailto.

Authority

In a URL the authority is also called the domain and may include a port number at the end separated by a colon.

In the following example, the authority is www.cambiaresearch.com

http://www.cambiaresearch.com

In the following example, the authority is www.cambiaresearch.com:81

https://www.cambiaresearch.com:81

In the following example there is no authority in the RFC 3986 sense: a mailto URI has no // after the scheme, so info@cambiaresearch.com is parsed as the path

mailto:info@cambiaresearch.com

Path

The path component of the URL identifies a specific file (or page) at a particular domain. The path is terminated by the end of the URL, by a question mark (?), which signals the beginning of the query string, or by a number sign (#), which signals the beginning of the fragment.

The path of the following URL is "/default.htm"

http://www.cambiaresearch.com/default.htm

The path of the following URL is "/snippets/csharp/regex/uri_regex.aspx"

http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx

Query

The query part of the URL is a way to send some information to the path or webpage that will handle the web request. The query begins with a question mark (?) and is terminated by the end of the URL or a number sign (#) which signifies the beginning of the fragment.

The query of the following URL is "?id=241"

http://www.cambiaresearch.com/default.htm?id=241

The query of the following URL is "?sourceid=navclient&ie=UTF-8&rls=GGLC,GGLC:1969-53,GGLC:en&q=uri+query"

http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=GGLC,GGLC:1969-53,GGLC:en&q=uri+query

Fragment

In a URL the fragment is used to specify a location within the current page. This is often used in a FAQ with a list of links at the top of the page linking to longer descriptions farther down in the page.

The fragment of the following URL is "contact"

http://www.cambiaresearch.com/default.htm#contact

The fragment of the following URL is "scheme"

http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx#scheme
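
The five components can be separated with the regular expression given in RFC 3986, Appendix B. The sketch below wraps it in a small JavaScript helper; the name splitUri and the shape of the returned object are choices made for this example, not something the sources above prescribe.

```javascript
// Component-splitting regex from RFC 3986, Appendix B.
// Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
const URI_PARTS = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Every part of the pattern is optional, so exec() always returns a match;
// components that are absent come back as undefined.
function splitUri(uri) {
  const m = URI_PARTS.exec(uri);
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

const parts = splitUri("foo://example.com:8042/over/there?name=ferret#nose");
console.log(parts.scheme);    // foo
console.log(parts.authority); // example.com:8042
console.log(parts.path);      // /over/there
console.log(parts.query);     // name=ferret
console.log(parts.fragment);  // nose
```

Note that groups 7 and 9 return the query and fragment without the leading ? and #. This expression only splits a string into components; it does not check that the result is a valid URL.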


Example: Regular Expressions for Parsing URIs and URLs

Simple way using [?] regex pattern:

// Requires: using System.Text.RegularExpressions;
public bool RegexUrlWithQuestionChar(string url)
{
    // URL pattern: optional scheme, dotted host, optional path and query
    string pattern = @"(http(s)?://)?([\w-]+\.)+[\w-]+(/[\w- ;,./?%&=]*)?";

    var regex = new Regex(pattern);
    var match = regex.Match(url);

    // Look for ? inside the matched URL only (Value is empty when nothing matched)
    return new Regex("[?]").IsMatch(match.Value);
}

if (RegexUrlWithQuestionChar("www.example.com.br/area?key=235fksf&rec=fsjgsg"))
{
    MessageBox.Show("Found"); // This branch runs
}
else
{
    MessageBox.Show("Not found");
}

if (RegexUrlWithQuestionChar("www.example.com.br/area"))
{
    MessageBox.Show("Found");
}
else
{
    MessageBox.Show("Not found"); // This branch runs
}
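
The same two-step idea (find the URL first, then look for ? inside the matched text only) can be sketched in JavaScript. This is an illustrative port, not the answer's code: the function name urlHasQuestionMark is hypothetical, and the hyphen in the path character class has been moved to the end so that it is unambiguously literal.

```javascript
// Optional scheme, dotted host, optional path/query (same structure as the C# pattern).
const URL_PATTERN = /(http(s)?:\/\/)?([\w-]+\.)+[\w-]+(\/[\w ;,.\/?%&=-]*)?/;

// True if the first URL found in the text contains a question mark.
function urlHasQuestionMark(text) {
  const match = URL_PATTERN.exec(text);
  if (match === null) {
    return false; // no URL at all
  }
  return match[0].includes("?");
}

console.log(urlHasQuestionMark("www.example.com.br/area?key=235fksf&rec=fsjgsg")); // true
console.log(urlHasQuestionMark("www.example.com.br/area"));                        // false
console.log(urlHasQuestionMark("Is www.example.com up?"));                         // false
```

The last call shows why the check runs on the matched URL rather than on the whole text: the question mark there belongs to the sentence, not to the URL.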

Credits:

urlregex.com

parsing-urls-with-regular-expressions-and-the-regex-object

www.dotnetperls.com/regex

How to write a custom regular expression for a URL (custom URL)

Your pattern has no anchors, and its initial subpattern, [http://a-zA-Z0-9]{1,20}, is a character class: it matches 1 to 20 characters, each of which may be h, t, p, :, /, a-z, A-Z, or 0-9, in any order. What you need is to match http:// as a literal sequence.

I suggest

^(https?:\/\/)?[a-zA-Z][a-zA-Z0-9]{0,19}\.constant\.[a-zA-Z]{1,5}$


Explanation:

  • ^ - start of string
  • (https?:\/\/)? - an optional sequence of http:// or https://
  • [a-zA-Z] - an ASCII letter
  • [a-zA-Z0-9]{0,19} - 0 to 19 alphanumeric characters (the length restriction can be adjusted by you)
  • \.constant\. - a constant substring .constant.
  • [a-zA-Z]{1,5} - 1 to 5 ASCII letters
  • $ - end of string.
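
A quick way to confirm the behaviour described above is to run the suggested pattern against a few inputs. The sketch below uses JavaScript; the sample strings are invented for the test, and the second regex reproduces the character-class mistake (with anchors added) to show what it actually accepts.

```javascript
// The suggested pattern: optional scheme, a letter, up to 19 alphanumerics,
// the literal .constant. and a 1 to 5 letter suffix.
const custom = /^(https?:\/\/)?[a-zA-Z][a-zA-Z0-9]{0,19}\.constant\.[a-zA-Z]{1,5}$/;

console.log(custom.test("http://abc.constant.com"));  // true
console.log(custom.test("abc123.constant.io"));       // true
console.log(custom.test("1abc.constant.com"));        // false: must start with a letter
console.log(custom.test("http://abc.other.com"));     // false: .constant. is missing
console.log(custom.test("http://abc.constant.com/x")); // false: $ allows no path

// The mistaken version: a character class accepts its characters in any order.
const mistaken = new RegExp("^[http://a-zA-Z0-9]{1,20}$");
console.log(mistaken.test("ptth")); // true
```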

Regex to match URL

$search  = "#^((?#
the scheme:
)(?:https?://)(?#
second level domains and beyond:
)(?:[\S]+\.)+((?#
top level domains:
)MUSEUM|TRAVEL|AERO|ARPA|ASIA|EDU|GOV|MIL|MOBI|(?#
)COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|(?#
)A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJLMNORSTVWYZ]|(?#
)C[ACDFGHIKLMNORUVXYZ]|D[EJKMOZ]|(?#
)E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|(?#
)H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|(?#
)K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDEFGHKLMNOPQRSTUVWXYZ]|(?#
)N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOSUW]|(?#
)S[ABCDEGHIJKLMNORTUVYZ]|T[CDFGHJKLMNOPRTVWZ]|(?#
)U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW])(?#
the path, can be there or not:
)(/[a-z0-9\._/~%\-\+&\#\?!=\(\)@]*)?)$#i";

I just cleaned it up a bit. This will match only HTTP(S) addresses that declare the http:// or https:// scheme and, as long as you copied all top-level domains correctly from IANA, only those with a standardized top-level domain (so it will not match http://localhost).

Finally, the pattern should end with the path part, which always starts with a / if it is present.

However, I'd suggest following Cerebrus's advice: if you're not sure about this, learn regexps in a gentler way and use proven patterns for complicated tasks.

Cheers,

By the way: Your regexp will also match something.r and something.h (between |TO| and |TR| in your example). I left them out in my version, as I guess it was a typo.

On re-reading the question: Change

  )(?:https?://)(?#

to

  )(?:https?://)?(?#

(note the extra ?) to match 'URLs' without the scheme.
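
The structure of that pattern (scheme, domain labels, top-level domain, optional path) can also be assembled from named pieces, which makes the optional-scheme change a one-character switch. The following is a rough JavaScript sketch under stated assumptions: buildUrlRegex and its schemeOptional option are hypothetical names, the TLD list is deliberately shortened for illustration, and the path character class is written with the hyphen last instead of escaped.

```javascript
// Abbreviated TLD list for illustration only; the PHP pattern above lists many more.
const TLDS = "COM|NET|ORG|INFO|DE|UK";

// Assemble the regex from the same parts as the commented PHP pattern.
function buildUrlRegex({ schemeOptional = false } = {}) {
  const scheme = "(?:https?://)" + (schemeOptional ? "?" : "");
  const labels = "(?:\\S+\\.)+";                // second-level domains and beyond
  const path = "(/[a-z0-9._/~%+&#?!=()@-]*)?";  // optional path, starts with /
  return new RegExp("^(" + scheme + labels + "(" + TLDS + ")" + path + ")$", "i");
}

const strict = buildUrlRegex();
const loose = buildUrlRegex({ schemeOptional: true });

console.log(strict.test("http://www.example.com/path?x=1")); // true
console.log(strict.test("http://localhost"));                // false
console.log(strict.test("www.example.org"));                 // false
console.log(loose.test("www.example.org"));                  // true
```

JavaScript has no (?# ... ) inline comments, so the explanatory notes move into ordinary code comments next to each piece.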


