What Characters Are Valid in a URL

Characters allowed in a URL

EDIT: As @Jukka K. Korpela correctly points out, RFC 1738 was updated by RFC 3986.
RFC 3986 expanded and clarified the characters valid for host. Unfortunately the grammar isn't easily copied and pasted, but I'll do my best.

In first-match order:

host        = IP-literal / IPv4address / reg-name

IP-literal = "[" ( IPv6address / IPvFuture ) "]"

IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"

ls32 = ( h16 ":" h16 ) / IPv4address
; least-significant 32 bits of address

h16 = 1*4HEXDIG
; 16 bits of address represented in hexadecimal

IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet

dec-octet = DIGIT ; 0-9
/ %x31-39 DIGIT ; 10-99
/ "1" 2DIGIT ; 100-199
/ "2" %x30-34 DIGIT ; 200-249
/ "25" %x30-35 ; 250-255

reg-name = *( unreserved / pct-encoded / sub-delims )

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
; this seems like a practical shortcut, most closely resembling the original answer

reserved = gen-delims / sub-delims

gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="

pct-encoded = "%" HEXDIG HEXDIG
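Part of this grammar translates fairly directly into regular expressions. The following is a minimal sketch in Python, not a full RFC 3986 validator: the names DEC_OCTET, IPV4_RE, REG_NAME_RE and classify_host are made up for this example, and the contents of an IP-literal (IPv6address / IPvFuture) are deliberately not checked.

```python
import re

# dec-octet, with the longer alternatives listed first.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4_RE = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME_RE = re.compile(r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2}|[!$&'()*+,;=])*")


def classify_host(host: str) -> str:
    """Return the first host alternative that matches (first-match order)."""
    if host.startswith("[") and host.endswith("]"):
        # Assumption: the bracketed contents are not validated in this sketch.
        return "IP-literal"
    if IPV4_RE.fullmatch(host):
        return "IPv4address"
    if REG_NAME_RE.fullmatch(host):
        return "reg-name"
    return "invalid"
```

Note that a string such as 256.1.1.1 fails the IPv4address rule but still matches reg-name, because digits and "." are unreserved characters.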

Original answer from RFC 1738 specification:

Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.

^ obsolete since 1998.

Valid characters for directory part of a URL (for short links)

A path segment (the parts in a path separated by /) in an absolute URI path can contain zero or more of pchar that is defined as follows:

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="

So it’s basically A–Z, a–z, 0–9, -, ., _, ~, !, $, &, ', (, ), *, +, ,, ;, =, :, @, as well as % that must be followed by two hexadecimal digits. Any other character/byte needs to be encoded using the percent-encoding.
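The pchar rule can be turned into a small encoder. This is a minimal sketch, assuming Python 3.7 or later (where urllib.parse.quote leaves "~" unescaped); PCHAR_EXTRA, LITERAL_PCHARS and encode_segment are names invented for this example.

```python
import string
from urllib.parse import quote

# Characters pchar allows literally besides the unreserved set:
# the sub-delims plus ":" and "@".
PCHAR_EXTRA = "!$&'()*+,;=:@"

# Every character that may appear literally in a path segment.
LITERAL_PCHARS = set(string.ascii_letters + string.digits + "-._~" + PCHAR_EXTRA)


def encode_segment(segment: str) -> str:
    """Percent-encode whatever pchar does not allow in one path segment.

    The input is treated as raw data, so "/" and "%" are encoded too.
    """
    return quote(segment, safe=PCHAR_EXTRA)
```

For instance, encode_segment("a b/c") gives "a%20b%2Fc", while "x;y=1,2" passes through unchanged.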

Although these are 79 characters in total that can be used in a path segment literally, some user agents do encode some of these characters as well (e.g. %7E instead of ~). That’s why many use just the 62 alphanumeric characters (i.e. A–Z, a–z, 0–9) or the Base 64 Encoding with URL and Filename Safe Alphabet (i.e. A–Z, a–z, 0–9, -, _).
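For short links, the URL-safe Base 64 alphabet is available in Python's standard library. A minimal sketch, where short_token and its num_bytes parameter are hypothetical names chosen for illustration:

```python
import base64
import secrets


def short_token(num_bytes: int = 6) -> str:
    """Return a random token made only of A-Z, a-z, 0-9, "-" and "_"."""
    raw = secrets.token_bytes(num_bytes)
    # urlsafe_b64encode uses "-" and "_" in place of "+" and "/";
    # any trailing "=" padding is stripped so it never needs escaping.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
```

With the default of 6 random bytes the token is 8 characters long. The standard library's secrets.token_urlsafe produces tokens from the same alphabet.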

What are the safe characters for making URLs?

To quote section 2.3 of RFC 3986:

Characters that are allowed in a URI, but do not have a reserved
purpose, are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

Note that RFC 3986 lists fewer unreserved punctuation marks than the older RFC 2396, which also treated !, *, ', ( and ) as unreserved.
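A quick way to check whether a string sticks to the unreserved set is a one-line regular expression. This is a sketch only; UNRESERVED_RE and is_unreserved are names made up for this example.

```python
import re

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED_RE = re.compile(r"[A-Za-z0-9\-._~]*")


def is_unreserved(text: str) -> bool:
    """True if every character of text is in the RFC 3986 unreserved set."""
    return UNRESERVED_RE.fullmatch(text) is not None
```

A string that passes this check never needs percent-encoding in any URI component.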

What's valid and what's not in a URI query?

That a character is reserved within a generic URL component doesn't mean it must be escaped when it appears within the component or within data in the component. Escaping is required only when two conditions hold: the character is defined as a delimiter within the generic or scheme-specific syntax, and the character appears within data.

The current standard for generic URIs is RFC 3986, which has this to say:

2.2. Reserved Characters

URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter [emphasis added], then the conflicting data must be percent-encoded before the URI is formed.

reserved = gen-delims / sub-delims

gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="

3.3. Path Component

[...]
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
[...]

3.4. Query Component

[...]
query = *( pchar / "/" / "?" )

Thus commas are explicitly allowed within query strings and only need to be escaped in data if a specific scheme defines the comma as a delimiter. The HTTP scheme doesn't use the comma or semicolon as a delimiter in query strings, so they don't need to be escaped. Whether browsers follow this standard is another matter.
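To illustrate, here is a minimal sketch of a query-value encoder that leaves commas and semicolons alone. It assumes the common key=value&key=value convention (with "+" read as a space by form decoders), which is why "&", "=" and "+" are still escaped inside data; QUERY_CHARS, VALUE_SAFE and encode_query_value are hypothetical names.

```python
from urllib.parse import quote

# query = *( pchar / "/" / "?" ): the sub-delims, ":" and "@", plus "/" and "?".
QUERY_CHARS = "!$&'()*+,;=:@/?"

# Assumption: "&", "=" and "+" act as delimiters in key=value queries,
# so they are removed from the safe set and get escaped inside values.
VALUE_SAFE = "".join(c for c in QUERY_CHARS if c not in "&=+")


def encode_query_value(value: str) -> str:
    """Percent-encode a query value, keeping "," and ";" literal."""
    return quote(value, safe=VALUE_SAFE)
```

So "a,b;c" is passed through as-is, while "x&y=z" becomes "x%26y%3Dz".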

Using CSV should work fine for string data; you just have to follow standard CSV conventions and either quote data or escape the commas with backslashes.

As for RFC 2396, it also allows for unescaped commas in HTTP query strings:

2.2. Reserved Characters

Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.

Since commas don't have a reserved purpose under the HTTP scheme, they don't have to be escaped in data. The note from § 2.3 about reserved characters being those that change semantics when percent-encoded applies only generally; characters may be percent-encoded without changing semantics for specific schemes and yet still be reserved.

What are the valid characters that can show up in a URL host?

Please see Restrictions on valid host names:

Hostnames are composed of series of
labels concatenated with dots, as are
all domain names. For example,
"en.wikipedia.org" is a hostname. Each
label must be between 1 and 63
characters long, and the entire
hostname has a maximum of 255
characters.

RFCs mandate that a hostname's labels
may contain only the ASCII letters 'a'
through 'z' (case-insensitive), the
digits '0' through '9', and the
hyphen. Hostname labels cannot begin
or end with a hyphen. No other
symbols, punctuation characters, or
blank spaces are permitted.
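The rules quoted above can be checked mechanically. A minimal sketch, using the 63-character label limit and 255-character total limit exactly as quoted; LABEL_RE and is_valid_hostname are names invented for this example, and a trailing dot is rejected here for simplicity.

```python
import re

# One label: 1 to 63 characters of letters, digits or hyphens,
# never starting or ending with a hyphen.
LABEL_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_hostname(hostname: str) -> bool:
    """Check length limits and per-label character rules."""
    if not 1 <= len(hostname) <= 255:
        return False
    # Empty labels (from "a..b" or a trailing dot) fail the label pattern.
    return all(LABEL_RE.fullmatch(label) for label in hostname.split("."))
```

For example, "en.wikipedia.org" passes, while "under_score.com" and "-bad.example" do not.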

Characters allowed in GET parameter

There are reserved characters, which have reserved meanings: the delimiters (:/?#[]@) and the subdelimiters (!$&'()*+,;=).

There is also a set of characters called unreserved characters — alphanumerics and -._~ — which are not to be encoded.

That means that anything outside the unreserved character set is supposed to be %-encoded when it does not carry special meaning (e.g. when passed as part of a GET parameter).
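In Python this conservative approach corresponds to calling urllib.parse.quote with an empty safe set. A minimal sketch, assuming Python 3.7 or later (where "~" counts as unreserved); build_query is a hypothetical helper name.

```python
from urllib.parse import quote, urlencode


def build_query(params: dict) -> str:
    """Build a query string, escaping everything except unreserved characters."""
    # quote_via=quote makes urlencode use %20 for spaces instead of "+";
    # urlencode's default safe set is empty, so reserved characters in
    # keys and values are all percent-encoded.
    return urlencode(params, quote_via=quote)
```

For example, build_query({"q": "a,b c", "tag": "x&y"}) returns "q=a%2Cb%20c&tag=x%26y".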

See also RFC3986: Uniform Resource Identifier (URI): Generic Syntax


