Characters allowed in a URL
EDIT: As @Jukka K. Korpela correctly points out, RFC 1738 was updated by RFC 3986.
RFC 3986 expands and clarifies the characters valid for host. Unfortunately, the grammar is not easily copied and pasted, but I'll do my best.
In first matched order:
host = IP-literal / IPv4address / reg-name
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"
ls32 = ( h16 ":" h16 ) / IPv4address
; least-significant 32 bits of address
h16 = 1*4HEXDIG
; 16 bits of address represented in hexadecimal
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
dec-octet = DIGIT ; 0-9
/ %x31-39 DIGIT ; 10-99
/ "1" 2DIGIT ; 100-199
/ "2" %x30-34 DIGIT ; 200-249
/ "25" %x30-35 ; 250-255
reg-name = *( unreserved / pct-encoded / sub-delims )
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" ; <-- this seems like a practical shortcut, and most closely resembles the original answer
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
pct-encoded = "%" HEXDIG HEXDIG
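As a rough illustration of the "first matched order" rule, the grammar above can be approximated with regular expressions. This is only a sketch: the names `classify_host`, `IPV4_RE`, and `REG_NAME_RE` are made up for this example, and the contents of an IP-literal are deliberately not validated.

```python
import re

# dec-octet = 0-9 / 10-99 / 100-199 / 200-249 / 250-255
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4_RE = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME_RE = re.compile(r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2}|[!$&'()*+,;=])*")


def classify_host(host: str) -> str:
    """Try the host alternatives in the order the grammar lists them."""
    if host.startswith("[") and host.endswith("]"):
        # Assumption: the bracketed IPv6address / IPvFuture body is not checked.
        return "IP-literal"
    if IPV4_RE.fullmatch(host):
        return "IPv4address"
    if REG_NAME_RE.fullmatch(host):
        return "reg-name"
    return "invalid"


print(classify_host("192.168.0.1"))   # IPv4address
print(classify_host("256.1.1.1"))     # reg-name (digits and dots are unreserved)
print(classify_host("exa mple.com"))  # invalid (space is not allowed)
```

Note that a string such as 256.1.1.1 fails the IPv4address rule but still matches reg-name, which is why the order of the alternatives matters.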
Original answer from RFC 1738 specification:
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
^ obsolete since 1998.
Valid characters for directory part of a URL (for short links)
A path segment (the parts in a path separated by /) in an absolute URI path can contain zero or more pchar, which is defined as follows:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So it's basically A–Z, a–z, 0–9, -, ., _, ~, !, $, &, ', (, ), *, +, ,, ;, =, :, @, as well as % that must be followed by two hexadecimal digits. Any other character/byte needs to be encoded using percent-encoding.
Although that makes 79 characters in total that can be used literally in a path segment, some user agents encode some of them anyway (e.g. %7E instead of ~). That's why many use just the 62 alphanumeric characters (i.e. A–Z, a–z, 0–9) or the Base 64 Encoding with URL and Filename Safe Alphabet (i.e. A–Z, a–z, 0–9, -, _).
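The character count and the Base 64 suggestion above can be checked with a short Python sketch. The helper names `is_valid_segment` and `short_token` are hypothetical, and using an 8-byte big-endian integer as the token payload is just one possible choice for a short link ID.

```python
import base64
import re
import string

# The characters a path segment may contain literally (pchar minus pct-encoded).
LITERAL_PCHARS = string.ascii_letters + string.digits + "-._~" + "!$&'()*+,;=" + ":@"

# segment = *pchar, where "%" is only valid when followed by two hex digits.
SEGMENT_RE = re.compile(rf"(?:[{re.escape(LITERAL_PCHARS)}]|%[0-9A-Fa-f]{{2}})*")


def is_valid_segment(segment: str) -> bool:
    """Return True if the string is a syntactically valid path segment."""
    return SEGMENT_RE.fullmatch(segment) is not None


def short_token(number: int) -> str:
    """Encode a non-negative integer with the URL-safe Base 64 alphabet."""
    raw = number.to_bytes(8, "big")
    # Strip the "=" padding so only A-Z, a-z, 0-9, "-" and "_" remain.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


print(len(LITERAL_PCHARS))          # 79
print(is_valid_segment("a~b:c@d"))  # True
print(is_valid_segment("a/b"))      # False: "/" separates segments
print(is_valid_segment("100%"))     # False: "%" needs two hex digits
print(is_valid_segment(short_token(123456789)))  # True
```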
What are the safe characters for making URLs?
To quote section 2.3 of RFC 3986:
Characters that are allowed in a URI, but do not have a reserved
purpose, are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
Note that RFC 3986 lists fewer unreserved punctuation marks than the older RFC 2396, which also treated !, *, ', (, and ) as unreserved.
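For a quick check of the unreserved set quoted from section 2.3, Python's `urllib.parse.quote` can be used. This sketch assumes Python 3.7 or later, where `quote` follows RFC 3986 and leaves the tilde alone; passing `safe=""` turns off the default exemption for "/".

```python
from urllib.parse import quote

# Unreserved characters pass through unchanged.
print(quote("AZaz09-._~", safe=""))  # AZaz09-._~

# Everything else is percent-encoded when no extra safe characters are given.
print(quote("a b/c?d", safe=""))     # a%20b%2Fc%3Fd
```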
What's valid and what's not in a URI query?
That a character is reserved within a generic URL component doesn't mean it must be escaped when it appears within the component or within data in the component. The character must also be defined as a delimiter within the generic or scheme-specific syntax and the appearance of the character must be within data.
The current standard for generic URIs is RFC 3986, which has this to say:
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter [emphasis added], then the conflicting data must be percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
3.3. Path Component
[...]
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
[...]
3.4 Query Component
[...]
query = *( pchar / "/" / "?" )
Thus commas are explicitly allowed within query strings and only need to be escaped in data if a specific scheme defines them as delimiters. The HTTP scheme doesn't use the comma or semicolon as a delimiter in query strings, so they don't need to be escaped. Whether browsers follow this standard is another matter.
Using CSV should work fine for string data; you just have to follow standard CSV conventions and either quote the data or escape the commas with backslashes.
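To make the CSV suggestion concrete, here is a minimal Python sketch that keeps commas literal in the query string and relies on CSV quoting for fields that themselves contain commas. The parameter name `row` and the helpers `encode_row` and `decode_row` are assumptions made for this example, not part of any standard.

```python
import csv
import io
from urllib.parse import parse_qs, urlencode


def encode_row(fields):
    """Serialize fields as one CSV line and place it in a query string."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(fields)
    # safe="," leaves commas unescaped; the CSV quotes are still encoded as %22.
    return urlencode({"row": buf.getvalue()}, safe=",")


def decode_row(query):
    """Reverse encode_row: percent-decode first, then parse the CSV line."""
    value = parse_qs(query)["row"][0]
    return next(csv.reader([value]))


query = encode_row(["a,b", "c"])
print(query)              # row=%22a,b%22,c
print(decode_row(query))  # ['a,b', 'c']
```

The comma inside the first field survives the round trip because the CSV layer quotes it, not because the URL layer escapes it.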
As for RFC 2396, it also allows for unescaped commas in HTTP query strings:
2.2. Reserved Characters
Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.
Since commas don't have a reserved purpose under the HTTP scheme, they don't have to be escaped in data. The note from § 2.3 about reserved characters being those that change semantics when percent-encoded applies only generally; characters may be percent-encoded without changing semantics for specific schemes and yet still be reserved.
What are the valid characters that can show up in a URL host?
Please see Restrictions on valid host names:
Hostnames are composed of series of
labels concatenated with dots, as are
all domain names. For example,
"en.wikipedia.org" is a hostname. Each
label must be between 1 and 63
characters long, and the entire
hostname has a maximum of 255
characters.
RFCs mandate that a hostname's labels
may contain only the ASCII letters 'a'
through 'z' (case-insensitive), the
digits '0' through '9', and the
hyphen. Hostname labels cannot begin
or end with a hyphen. No other
symbols, punctuation characters, or
blank spaces are permitted.
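The label rules quoted above translate fairly directly into a check. The following Python sketch uses a hypothetical `is_valid_hostname` helper and takes the 63- and 255-character limits from the quote as given; it does not handle a trailing dot or internationalized names.

```python
import re

# One label: 1 to 63 letters, digits, or hyphens, not starting or ending with "-".
LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(name: str) -> bool:
    """Check a dotted hostname against the label rules quoted above."""
    if not name or len(name) > 255:
        return False
    return all(LABEL_RE.fullmatch(label) for label in name.split("."))


print(is_valid_hostname("en.wikipedia.org"))     # True
print(is_valid_hostname("-bad.example"))         # False: label starts with "-"
print(is_valid_hostname("under_score.example"))  # False: "_" is not permitted
```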
Characters allowed in GET parameter
There are reserved characters, which have a reserved meaning: the delimiters :/?#[]@ and the subdelimiters !$&'()*+,;=
There is also a set of characters called unreserved characters (alphanumerics and -._~), which are not to be encoded.
That means that anything outside the unreserved set is supposed to be %-encoded when it does not have a special meaning (e.g. when passed as part of a GET parameter).
See also RFC3986: Uniform Resource Identifier (URI): Generic Syntax
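A small Python sketch shows why a parameter value has to be encoded when it contains characters that would otherwise act as delimiters. `build_query` is a hypothetical helper written for this example; in practice `urllib.parse.urlencode` does a similar job.

```python
from urllib.parse import parse_qsl, quote


def build_query(params):
    """Percent-encode everything outside the unreserved set in keys and values."""
    return "&".join(
        f"{quote(key, safe='')}={quote(value, safe='')}"
        for key, value in params.items()
    )


query = build_query({"q": "a&b=c", "lang": "en-US"})
print(query)             # q=a%26b%3Dc&lang=en-US
print(parse_qsl(query))  # [('q', 'a&b=c'), ('lang', 'en-US')]

# Without encoding, "&" and "=" inside the value are read as delimiters.
print(parse_qsl("q=a&b=c"))  # [('q', 'a'), ('b', 'c')]
```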