Properly Matching an IDN URL
John Gruber, of Daring Fireball fame, had a post recently that detailed his quest for a good URL-recognizing regex string. What he came up with was this:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
Which apparently does OK with Unicode-containing URLs as well. You'd need to modify it slightly to get the rest of what you're looking for: the scheme, username, password, etc. Alan Storm wrote a piece explaining Gruber's regex pattern, which I definitely needed (regex is so write-once-have-no-clue-how-to-read-ever-again!).
How do I match Japanese characters using IDN regex?
A very slight modification has it working for me:
/(([\w-]+:\/\/?|[\w\d]+[.])?[^\s()<>]+[.](?:\([\w\d]+\)|([^`!()\[\]{};:'\".,<>?«»“”‘’\s]|\/)+))/
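For anyone who wants to try the modified pattern outside PHP, here is a minimal Python sketch. The pattern is the one above with the "/" delimiters and their escaping backslashes removed; the function name find_urls and the sample sentence are made up for illustration. The negated classes such as [^\s()<>] are what let the non-ASCII labels through.

```python
import re

# The modified pattern from above, minus the PHP "/" delimiters and the
# backslashes that were only there to escape those delimiters.
IDN_URL = re.compile(
    r"""(([\w-]+://?|[\w\d]+[.])?[^\s()<>]+[.](?:\([\w\d]+\)|([^`!()\[\]{};:'\".,<>?«»“”‘’\s]|/)+))"""
)


def find_urls(text):
    """Return every substring of `text` that the pattern treats as a URL."""
    # finditer + group(0) avoids the tuples that findall returns when a
    # pattern contains capturing groups.
    return [m.group(0) for m in IDN_URL.finditer(text)]


print(find_urls("Docs live at http://日本語.jp/path, I think"))
```

The trailing comma is left out of the match because the last character class excludes common punctuation.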
Splitting up an IDN URL in PHP
parse_url should work fine. Using PHP 5.3.4, I've been able to extract just the domain part:
print parse_url('http://äxämple.se/foobar', PHP_URL_HOST);
Maybe you'll need to tweak encodings:
print utf8_decode(parse_url('http://äxämple.se/foobar', PHP_URL_HOST));
The output I got is:
äxämple.se
Hope that helps!
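If you ever need the same split outside PHP, a rough Python equivalent might look like the sketch below. It is not part of the PHP answer: urlsplit and the built-in "idna" codec (which implements the older IDNA 2003 rules) are standard-library features, while the variable names are only illustrative.

```python
from urllib.parse import urlsplit

url = "http://äxämple.se/foobar"

# urlsplit keeps the host as Unicode text, so no decoding step is needed.
host = urlsplit(url).hostname
print(host)

# The built-in "idna" codec (IDNA 2003) gives the ASCII form used in DNS.
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)
```

The second value is the punycode ("xn--") form, which is what you would hand to a resolver or a whois lookup.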
Domain Name Regex Including IDN Characters in C#
Brief
Regex contains a character class that allows you to specify Unicode general categories: \p{}. The MSDN regex documentation contains the following:
\p{ name }
Matches any single character in the Unicode general
category or named block specified by name.
Also, as a sidenote, I noticed your regex contains an unescaped dot (.). In regex, the dot character has the special meaning of any character (except newline unless otherwise specified). You may need to change this to \. to ensure proper functionality.
Code
Editing your existing code to use Unicode character classes instead of only the ASCII letters, you should end up with the following:
^(?:[\p{L}\p{N}][\p{L}\p{N}_-]*\.)+[\p{L}\p{N}]{2,}$
Explanation
\p{L}: Represents the Unicode character class for any letter in any language/script.
\p{N}: Represents the Unicode character class for any number in any language/script (based on your character samples, you can probably keep 0-9, but I figured I would show you the general concept and give you slightly additional information).
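For comparison, here is a small Python sketch of the same idea, with the dot escaped as the sidenote above suggests. Python's built-in re module does not support \p{...}, so this is an approximation rather than a translation: [^\W_] (a word character that is not an underscore) stands in for [\p{L}\p{N}], and the name looks_like_domain is just illustrative.

```python
import re

# [^\W_] is "a word character that is not an underscore", which for str
# patterns is roughly [\p{L}\p{N}]; [\w-] additionally allows "_" and "-".
DOMAIN = re.compile(r"(?:[^\W_][\w-]*\.)+[^\W_]{2,}")


def looks_like_domain(name):
    """True if the whole string matches the Unicode-aware domain pattern."""
    return DOMAIN.fullmatch(name) is not None


for candidate in ["bücher.example", "日本語.jp", "nodots", "exa mple.com"]:
    print(candidate, looks_like_domain(candidate))
```

Using fullmatch plays the role of the ^ and $ anchors in the C# pattern.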
This site gives a quick and general overview of the most used Unicode categories.
\p{L} or \p{Letter}: any kind of letter from any language.
  \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
  \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
  \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
  \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
  \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
  \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
  \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
  \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
  \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
  \p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
  \p{Zl} or \p{Line_Separator}: line separator character U+2028.
  \p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
  \p{Sm} or \p{Math_Symbol}: any mathematical symbol.
  \p{Sc} or \p{Currency_Symbol}: any currency sign.
  \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
  \p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
  \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
  \p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
  \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
  \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
  \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
  \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
  \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
  \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
  \p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
  \p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
  \p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
  \p{Cf} or \p{Format}: invisible formatting indicator.
  \p{Co} or \p{Private_Use}: any code point reserved for private use.
  \p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
  \p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
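If you are unsure which category a given character falls into, you can look it up directly. The sketch below uses Python's standard unicodedata module, which returns the two-letter general category; the sample characters and the helper name in_category are just illustrative.

```python
import unicodedata

samples = ["a", "A", "日", "5", "Ⅷ", "²", "-", "_", " ", "€", "+"]

for ch in samples:
    # category() returns the two-letter general category, e.g. "Ll".
    print(repr(ch), unicodedata.category(ch))


def in_category(ch, prefix):
    """True if ch's general category starts with prefix ("L", "Nd", ...)."""
    return unicodedata.category(ch).startswith(prefix)
```

Passing a one-letter prefix such as "L" behaves like \p{L}, while a two-letter prefix such as "Nd" behaves like \p{Nd}.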
What is a regular expression which will match a valid domain name without a subdomain?
Well, it's a little sneakier than it looks (see comments), given your specific requirements:
/^[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}$/
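But note this will reject a lot of valid domains: names shorter than three characters before the dot (such as t.co), TLDs that contain digits or hyphens (such as punycode xn-- TLDs), and any IDN written in its Unicode form.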
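To see what the pattern accepts and rejects, here is a quick Python check. The pattern is copied unchanged apart from the "/" delimiters; the candidate names are my own examples, not from the question.

```python
import re

# The pattern from the answer, unchanged apart from dropping the "/" delimiters.
BARE_DOMAIN = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}$")

candidates = [
    "example.com",       # accepted
    "my-site.org",       # accepted: hyphen inside the label
    "sub.example.com",   # rejected: contains a subdomain
    "-example.com",      # rejected: label starts with a hyphen
    "t.co",              # rejected although valid: label shorter than 3 characters
    "example.xn--p1ai",  # rejected although valid: TLD has digits and hyphens
    "bücher.de",         # rejected although valid: non-ASCII (IDN) label
]

for name in candidates:
    print(f"{name}: {bool(BARE_DOMAIN.match(name))}")
```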
By what application is the domain name's length restricted, and can this be changed if it is a different DNS implementation?
This is the RFC document you need:
http://www.faqs.org/rfcs/rfc1035.html
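According to it, different DNS records each have their own limits. For names specifically, it caps each label at 63 octets and the whole name at 255 octets.
You probably shouldn't think in terms of a particular piece of software or differing practices; just look at the appropriate document (the RFC specification).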
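As a rough illustration of those limits, here is a small Python sketch. RFC 1035 limits a label to 63 octets and a whole name to 255 octets. The 253 figure below is my assumption for the dotted text form, since the wire format spends one length octet per label plus a terminating root octet. The function name is hypothetical, and the input is assumed to be already in ASCII (punycode) form.

```python
MAX_LABEL_LENGTH = 63   # RFC 1035: "labels 63 octets or less"
MAX_NAME_LENGTH = 253   # dotted text form of the 255-octet wire limit


def within_length_limits(ascii_name):
    """Check an ASCII (already punycode-encoded) domain name against the limits."""
    # A single trailing dot only marks the root and does not count.
    name = ascii_name[:-1] if ascii_name.endswith(".") else ascii_name
    if not name or len(name) > MAX_NAME_LENGTH:
        return False
    return all(1 <= len(label) <= MAX_LABEL_LENGTH for label in name.split("."))


print(within_length_limits("example.com"))
print(within_length_limits("a" * 64 + ".com"))
```

These limits come from the protocol itself, so they hold regardless of which DNS server software is in use.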
WHMCS IDN and punycode
In the end there was no answer from denic.de, WHMCS, or any other provider, so I made my own filter, connected by my own means to my own services, from which I query for domains. So there is a workaround: you need to check how WHMCS queries for domains, repoint the whois to your own server with your own script, query the original whois, and pass the answers back to WHMCS from your service.