Do I Encode Ampersands in <a href...>?

Do I encode ampersands in <a href...>?

Yes, you should. HTML entities are parsed inside HTML attributes, and a stray & would create an ambiguity. That's why you should always write &amp; instead of just & inside all HTML attributes.

That said, only & and quotes need to be encoded. If you have special characters like é in your attribute, you don't need to encode those to satisfy the HTML parser.
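As a rough illustration of the rule above, here is a minimal sketch of an attribute escaper. The helper name escapeAttribute and the example URL are assumptions made for this example, not part of any library:

```javascript
// Hypothetical helper: escape only the characters that matter inside a
// quoted HTML attribute value. The ampersand is replaced first so that the
// entities produced by the later replacements are not escaped a second time.
function escapeAttribute(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// The non-ASCII character (é, written as \u00e9) passes through untouched.
const exampleUrl = "http://www.example.com/caf\u00e9.php?a=2&b=5";
const link = '<a href="' + escapeAttribute(exampleUrl) + '">example</a>';
console.log(link);
```

In real code you would normally rely on your template engine's built-in escaping rather than a hand-rolled function like this one.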

It used to be the case that URLs needed special treatment for non-ASCII characters like é. Under RFC 1738, you had to encode those using percent-escapes, which for é gives %C3%A9 (its UTF-8 bytes). However, RFC 1738 has been superseded by RFC 3986 (URIs, Uniform Resource Identifiers) and RFC 3987 (IRIs, Internationalized Resource Identifiers), on which the WHATWG based its work to define how browsers should behave when they see a URL with non-ASCII characters in it since HTML5. It's therefore now safe to include non-ASCII characters in URLs, percent-encoded or not.
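To see where %C3%A9 comes from, the following sketch uses the standard encodeURIComponent and URL APIs. The host example.com is just a placeholder:

```javascript
// encodeURIComponent percent-encodes the UTF-8 bytes of a non-ASCII character.
const encoded = encodeURIComponent("\u00e9"); // \u00e9 is é
console.log(encoded); // %C3%A9

// Decoding reverses the process.
const decoded = decodeURIComponent("%C3%A9");
console.log(decoded === "\u00e9"); // true

// The WHATWG URL parser accepts the raw character and percent-encodes it
// in the path on its own.
const parsed = new URL("http://example.com/caf\u00e9");
console.log(parsed.pathname); // /caf%C3%A9
```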

Using ampersand character in href attribute

The URL is a value in an HTML attribute, so the & character should be HTML encoded, most commonly using the HTML entity &amp;:

<a href="http://www.example.com/home.php?a=2&amp;b=5">example</a>

You can also use the numeric character reference &#38; instead of &amp;.

Should an ampersand be URL encoded in a query string?

From RFC 3986:

Reserved Characters

URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm.

...

  reserved    = gen-delims / sub-delims

  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI. URIs
that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent.
Percent-encoding a reserved character, or decoding a percent-encoded
octet that corresponds to a reserved character, will change how the
URI is interpreted by most applications.

...

URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.

So an & within a URL should be percent-encoded if it's part of a value and has no delimiting role.
Here's a simple PHP code fragment using the urlencode() function:

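<?php
// Percent-encode each value so a literal & in the data is not read as a delimiter.
$query_string = 'foo=' . urlencode($foo) . '&bar=' . urlencode($bar);
// HTML-escape the whole query string before placing it in the attribute.
echo '<a href="mycgi?' . htmlentities($query_string) . '">';
?>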
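The same two layers can be sketched in JavaScript. The helpers buildQuery and escapeAttribute, and the sample values, are assumptions made for illustration:

```javascript
// Layer 1: percent-encode each key and value so a literal "&" inside a
// value cannot be mistaken for the parameter delimiter.
function buildQuery(params) {
  return Object.entries(params)
    .map(([key, value]) => encodeURIComponent(key) + "=" + encodeURIComponent(value))
    .join("&");
}

// Layer 2: HTML-escape the finished query string before it goes into an
// attribute, so the delimiting "&" becomes "&amp;".
function escapeAttribute(value) {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

const query = buildQuery({ foo: "M&M", bar: "a b" });
console.log(query); // foo=M%26M&bar=a%20b

const anchor = '<a href="mycgi?' + escapeAttribute(query) + '">';
console.log(anchor); // <a href="mycgi?foo=M%26M&amp;bar=a%20b">
```

One difference from the PHP fragment: urlencode() encodes a space as +, whereas encodeURIComponent produces %20.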

Do ampersands still need to be encoded in URLs in HTML5?

It is true that one of the differences between HTML5 and HTML4, quoted from the W3C Differences Page, is:

The ampersand (&) may be left unescaped in more cases compared to HTML4.

In fact, the HTML5 spec goes to great lengths describing actual algorithms that determine what it means to consume (and interpret) characters.

In particular, in the section on tokenizing character references from Chapter 8 in the HTML5 spec, we see that when you are inside an attribute, and you see an ampersand character that is followed by:

  • a tab, LF, FF, space, <, &, EOF, or the additional allowed character (a " or ' if the attribute value is quoted or a > if not) ===> then the ampersand is just an ampersand, no worries;
  • a number sign ===> then the HTML5 tokenizer will go through the many steps to determine if it has a numeric character entity reference or not, but note in this case one is subject to parse errors (do read the spec)
  • any other character ===> the parser will try to find a named character reference, e.g., something like &trade;.

The last case is the one of interest to you since your example has:

<a href="somepage.html?x=1&y=2">...</a>

You have the character sequence

  • AMPERSAND
  • LATIN SMALL LETTER Y
  • EQUAL SIGN

Now here is the part from the HTML5 spec that is relevant in your case, because y is not a named entity reference:

If no match can be made, then no characters are consumed, and nothing is returned. In this case, if the characters after the U+0026 AMPERSAND character (&) consist of a sequence of one or more alphanumeric ASCII characters followed by a U+003B SEMICOLON character (;), then this is a parse error.

You don't have a semicolon there, so you don't have a parse error.

Now suppose you had, instead,

<a href="somepage.html?x=1&eacute=2">...</a>

which is different because &eacute; is a named entity reference in HTML. In this case, the following rule kicks in:

If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a "=" (U+003D) character, then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.

So there the = makes it an error, because legacy browsers might get confused.
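To make that case easier to spot, here is a deliberately simplified sketch that flags ampersands followed by a known name and then "=". It is not the spec's tokenizer: the LEGACY_NAMES list is a tiny assumed subset of the names HTML recognizes without a semicolon, and the function name is hypothetical:

```javascript
// ASSUMPTION: a small illustrative subset of the named references that HTML
// also recognizes without a trailing semicolon. The real list is much longer.
const LEGACY_NAMES = ["amp", "lt", "gt", "quot", "copy", "reg", "eacute"];

// Returns the ampersand sequences in an attribute value that match one of
// the names above and are immediately followed by "=", which is the case
// the quoted rule reports as a parse error.
function findAmbiguousAmpersands(attributeValue) {
  const hits = [];
  for (let i = 0; i < attributeValue.length; i++) {
    if (attributeValue[i] !== "&") continue;
    const rest = attributeValue.slice(i + 1);
    const name = LEGACY_NAMES.find((n) => rest.startsWith(n));
    if (name !== undefined && rest[name.length] === "=") {
      hits.push("&" + name + "=");
    }
  }
  return hits;
}

// No hit: "y" is not a named reference.
console.log(findAmbiguousAmpersands("somepage.html?x=1&y=2"));
// One hit: "&eacute=" matches a name and is followed by "=".
console.log(findAmbiguousAmpersands("somepage.html?x=1&eacute=2"));
// No hit: the ampersand is properly written as "&amp;".
console.log(findAmbiguousAmpersands("somepage.html?x=1&amp;y=2"));
```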

The HTML5 spec seems to go to great lengths to say "well, this ampersand is not beginning a character entity reference, so there's no reference here." Still, because you might run into URLs whose parameter names match named references (e.g., isin, part, sum, sub), which would result in parse errors, IMHO you're better off escaping them as &amp;. But of course, you only asked whether restrictions were relaxed in attributes, not what you should do, and it does appear that they have been.

It would be interesting to see what validators can do.

What other characters beside ampersand (&) should be encoded in HTML href/src attributes?

Other than standard URI encoding of the values, & is the only character related to HTML entities that you have to worry about simply because this is the character that begins every HTML entity. Take for example the following URL:

http://query.com/?q=foo&lt=bar&gt=baz

Even though there aren't trailing semicolons, since &lt; is the entity for < and &gt; is the entity for >, some old browsers would translate this URL to:

http://query.com/?q=foo<=bar>=baz

So you need to write & as &amp; to prevent this from occurring for links within an HTML-parsed document.

Escaping ampersand in URL

They need to be percent-encoded:

> encodeURIComponent('&')
"%26"

So in your case, the URL would look like:

http://www.mysite.com?candy_name=M%26M
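If you'd rather not build the string by hand, the standard URL and URLSearchParams APIs do the percent-encoding for you. A small sketch, reusing the example URL above (note that the parser adds a trailing slash after the host):

```javascript
// Let the URL API percent-encode the value.
const candyUrl = new URL("http://www.mysite.com");
candyUrl.searchParams.set("candy_name", "M&M");
console.log(candyUrl.href); // http://www.mysite.com/?candy_name=M%26M

// Parsing the encoded URL recovers the original value.
const roundTrip = new URL(candyUrl.href).searchParams.get("candy_name");
console.log(roundTrip); // M&M

// Without encoding, the "&" is read as a delimiter and the value is cut short.
const broken = new URL("http://www.mysite.com?candy_name=M&M");
console.log(broken.searchParams.get("candy_name")); // M
console.log(broken.searchParams.has("M")); // true
```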

