Why Do HTML Entity Names with Dec < 255 Not Require Semicolon

Why do HTML entity names with dec < 255 not require semicolon?

The reason is that historically the semicolon has been optional when an entity reference (or a character reference) is not immediately followed by a name character. So &pound? is OK since ? is not a name character (i.e., a character allowed in names), but &pound4 is not, since 4 is a name character, making pound4 the entity name (which is undefined in HTML, but might become defined some day). This rule is part of the SGML legacy in HTML, one of the few things where browsers actually applied specialties of SGML.
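The rule above can be sketched in a few lines of Python. This is only an illustration, not how any browser is implemented: the function name sgml_entity_name is hypothetical, and the set of name characters (ASCII letters, digits, "." and "-") is an assumption made for the example.

```python
import re

# Assumption for this sketch: a name starts with an ASCII letter and
# continues with ASCII letters, digits, "." or "-".
NAME = re.compile(r"[A-Za-z][A-Za-z0-9.\-]*")


def sgml_entity_name(text_after_ampersand: str) -> str:
    """Return the entity name an SGML-style parser would read after '&'."""
    match = NAME.match(text_after_ampersand)
    return match.group(0) if match else ""


print(sgml_entity_name("pound?"))  # pound   (? ends the name)
print(sgml_entity_name("pound4"))  # pound4  (4 is absorbed into the name)
```

Because the name is read greedily, the digit in the second call becomes part of the name, which is why the semicolon is needed there.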

It has, however, always been regarded as good practice to terminate entity references with a semicolon. XML, and hence XHTML, makes it formally mandatory.

This is why current browser practice allows semicolons to be omitted as in “classic” HTML, but only for the limited set of character references denoting ISO Latin 1 characters, i.e. characters with a Unicode number less than 256 in decimal (at most FF in hexadecimal). This was the original set of entity references, and therefore such references have been widely used without a semicolon. So the practice is a compromise: it encourages the recommended notation without invalidating the bulk of old pages, still less making browsers fail to render them properly.
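Python's standard library follows the HTML5 rules, so it can be used to inspect this limited set. The sketch below relies only on html.unescape() and the html.entities.html5 table; the helper legacy_names is a hypothetical name introduced for the example.

```python
from html import unescape
from html.entities import html5


def legacy_names() -> list[str]:
    """Names that the HTML5 table also lists without a trailing semicolon."""
    return sorted(name for name in html5 if not name.endswith(";"))


print("pound" in legacy_names())  # True
print("euro" in legacy_names())   # False

print(unescape("&pound4"))  # £4      (legacy name, decoded without semicolon)
print(unescape("&euro4"))   # &euro4  (not a legacy name, left untouched)
print(unescape("&euro;4"))  # €4
```

Note that html.unescape() decodes &pound4 even though 4 is a name character, so the HTML5 behaviour is more lenient here than the historical SGML rule.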

The HTML5 drafts have had various positions on this, but e.g. HTML5 CR from 6 August 2013 requires the semicolon in all cases even in HTML syntax. Lack of semicolon is defined as a parse error, which means that error handling is well-defined (the entity shall be recognized), but browsers may still stop parsing at first parse error!

Are There Downsides to Writing `&lt` Instead of `&lt;`

This is entirely up to how forgiving the browser/rendering engine wants to be, and is not a property of HTML. All entities must end in a semicolon, or you have invalid syntax. The WHATWG "HTML Living Standard" confusingly considers this semicolon to be part of the name, making it seem optional in the Developer Edition. But the full Standard text/W3C HTML5 draft is clearer: "The name must be one that is terminated by a U+003B SEMICOLON character (;)."

Historically the semicolon has been optional when a character entity is not immediately followed by a name character. For example, &pound? will work because ? is not a name character (i.e., a character allowed in names), but &pound4 will not because 4 is a name character, making pound4 the entity name, which is undefined. This rule is part of the SGML legacy in HTML, one of the few things where browsers actually applied specialties of SGML.

That being said, it has always been regarded as good practice to terminate entity references with a semicolon. XML, and hence XHTML, makes it mandatory.

This is why current browser practice allows semicolons to be omitted as in “classic” HTML, but only for the limited set of character references denoting ISO Latin 1 characters (characters with a Unicode number less than 256 in decimal, i.e. at most FF in hexadecimal). This was the original set of entity references, and therefore such references have been widely used without a semicolon. So the practice is a compromise: it encourages the specified notation without invalidating the bulk of old pages that don't conform, and without making browsers fail to render them properly.

The HTML5 drafts have had various positions on this, but HTML5 requires the semicolon in all cases even in HTML syntax. Lack of semicolon is defined as a parse error, which means that error handling is well-defined (the entity shall be recognized), but browsers may still stop parsing at first parse error.

According to the W3C Recommendation:

In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.

While the W3C Working Draft states:

The ampersand must be followed by one of the names given in §8.5 Named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).

Because the semicolon is required for W3C validation, and because it works in all browsers, you should use it. The minuscule amount of page size you save by omitting semicolons is not worth the risk of entities not displaying correctly in all browsers.
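If you want to find missing semicolons before a validator does, a rough check is easy to script. The following is a minimal sketch, not a validator: the function name unterminated_references is hypothetical, it only looks at named references (not numeric ones such as &#163), and it will also flag a bare ampersand followed by letters, as in "AT&T".

```python
import re

# "&" followed by a name that is NOT followed by ";" or more name characters.
# The lookahead excludes alphanumerics so the regex cannot backtrack into a
# shorter name such as "am" inside "&amp;".
UNTERMINATED = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(?![A-Za-z0-9;])")


def unterminated_references(markup: str) -> list[str]:
    """Return the names of named references that lack a closing semicolon."""
    return UNTERMINATED.findall(markup)


print(unterminated_references("&pound;5 &amp &lt; &copy2024"))
# ['amp', 'copy2024']
```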

Using lxml.html with broken html entities?

The HTML 5 standard has specified a specific subset of entities that can be parsed without the trailing semicolon present, because these entities were historically defined with the semicolon being optional.

The html.unescape() function explicitly supports those. Use that function as a second pass to clear up this issue:

>>> from html import unescape
>>> unescape("Kristj&aacuten V&iacutector")
'Kristján Víctor'

If you install html5lib then you can have lxml behave the same, via their lxml.html.html5parser module (which wraps html5lib's own html5lib.treebuilders.etree_lxml adapter):

>>> from lxml.html import html5parser as etree
>>> etree.fromstring("Kristj&aacuten V&iacutector").text
'Kristján Víctor'

PHP &curren string turns into weird symbol

The &curren at the start of &currentPage is being interpreted as the entity code for the currency symbol (¤). If you're building your GET URL, you can solve it in various ways:

  • Use urlencode() on your query values:

    $s = 'page.com?' . urlencode("a=1&currentPage=2");

  • Use the entity for & itself:

    'page.com?a=1&amp;currentPage=2'

  • Or use your variable at the beginning so no & is required:

    'page.com?currentPage=2&a=1'
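The same mechanism can be reproduced outside PHP. The sketch below uses Python's html and urllib.parse modules purely as an illustration: html.unescape() stands in for what an HTML parser does with the URL when it appears in page text, and html.escape() plays the role of writing & as its own entity.

```python
from html import escape, unescape
from urllib.parse import urlencode

# Build the query string from key/value pairs (values are percent-encoded).
url = "page.com?" + urlencode({"a": 1, "currentPage": 2})
print(url)            # page.com?a=1&currentPage=2

# Decoding the raw URL as HTML turns "&curren" into the currency sign.
print(unescape(url))  # page.com?a=1¤tPage=2

# Escaping the URL before embedding it in HTML survives the round trip.
safe = escape(url)
print(safe)           # page.com?a=1&amp;currentPage=2
print(unescape(safe) == url)  # True
```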

what characters are allowed in HTTP header values?

RFC 2616 is obsolete; the relevant part has been replaced by RFC 7230, which lists these changes:

The NUL octet is no longer allowed in comment and quoted-string text,
and handling of backslash-escaping in them has been clarified. The
quoted-pair rule no longer allows escaping control characters other
than HTAB. Non-US-ASCII content in header fields and the reason phrase
has been obsoleted and made opaque (the TEXT rule was removed).

(Section 3.2.6)

In essence, RFC 2616 defaulted to ISO-8859-1, and this was both insufficient and not interoperable anyway. Thus, RFC 7230 has deprecated non-ASCII octets in field values. The recommendation is to use an escaping mechanism on top of that (such as defined in RFC 8187, or plain URI-percent-encoding).
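As a rough illustration of that recommendation, the sketch below checks whether a value is plain ASCII and otherwise produces an RFC 8187 style extended value (charset, empty language tag, percent-encoded UTF-8). Both function names are hypothetical, and the ASCII check is a simplification of the full field-value grammar.

```python
from urllib.parse import quote


def is_plain_ascii_value(value: str) -> bool:
    """True if value contains only HTAB, SP and visible US-ASCII characters."""
    return all(ch == "\t" or " " <= ch <= "~" for ch in value)


def rfc8187_ext_value(value: str) -> str:
    """Encode value as UTF-8''<percent-encoded octets> (no language tag)."""
    # safe="" makes quote() escape everything except letters, digits
    # and "_.-~", all of which are allowed unescaped in this syntax.
    return "UTF-8''" + quote(value, safe="", encoding="utf-8")


print(is_plain_ascii_value("text/html; charset=utf-8"))  # True
print(is_plain_ascii_value("€ rates"))                   # False
print(rfc8187_ext_value("€ rates"))                      # UTF-8''%E2%82%AC%20rates
```

The encoded form would typically be sent in a parameter whose name ends in an asterisk, for example filename* in Content-Disposition.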

What characters are allowed in an email address?

See RFC 5322: Internet Message Format and, to a lesser extent, RFC 5321: Simple Mail Transfer Protocol.

RFC 822 also covers email addresses, but it deals mostly with their structure:

addr-spec   =  local-part "@" domain        ; global address
local-part  =  word *("." word)             ; uninterpreted
                                            ; case-preserved

domain      =  sub-domain *("." sub-domain)
sub-domain  =  domain-ref / domain-literal
domain-ref  =  atom                         ; symbolic reference

And as usual, Wikipedia has a decent article on email addresses:

The local-part of the email address may use any of these ASCII characters:

  • uppercase and lowercase Latin letters A to Z and a to z;
  • digits 0 to 9;
  • special characters !#$%&'*+-/=?^_`{|}~;
  • dot ., provided that it is not the first or last character unless quoted, and provided also that it does not appear consecutively unless quoted (e.g. John..Doe@example.com is not allowed but "John..Doe"@example.com is allowed);
  • space and "(),:;<>@[\] characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash);
  • comments are allowed with parentheses at either end of the local-part; e.g. john.smith(comment)@example.com and (comment)john.smith@example.com are both equivalent to john.smith@example.com.
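For the common unquoted form, the character rules in the list above translate into a short check. This is a minimal sketch under stated assumptions: the function name is hypothetical, and it deliberately ignores quoted strings, comments, international characters and length limits.

```python
import re

# Letters, digits and the special characters listed above (the "-" is placed
# last inside the character class so it is taken literally).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# One or more runs of allowed characters separated by single dots, so a dot
# can never be first, last, or doubled.
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_unquoted_local_part(local_part: str) -> bool:
    """Check an unquoted local-part against the ASCII rules listed above."""
    return DOT_ATOM.fullmatch(local_part) is not None


print(is_unquoted_local_part("john.smith"))  # True
print(is_unquoted_local_part("John..Doe"))   # False
print(is_unquoted_local_part(".john"))       # False
print(is_unquoted_local_part("a@b"))         # False
```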

In addition to ASCII characters, as of 2012 you can use international characters above U+007F, encoded as UTF-8 as described in the RFC 6532 spec and explained on Wikipedia. Note that as of 2019, these standards are still marked as Proposed, but are being rolled out slowly. The changes in this spec essentially added international characters as valid alphanumeric characters (atext) without affecting the rules on allowed & restricted special characters like !# and @:.

For validation, see Using a regular expression to validate an email address.

The domain part is defined as follows:

The Internet standards (Request for Comments) for protocols mandate that component hostname labels may contain only the ASCII letters a through z (in a case-insensitive manner), the digits 0 through 9, and the hyphen (-). The original specification of hostnames in RFC 952, mandated that labels could not start with a digit or with a hyphen, and must not end with a hyphen. However, a subsequent specification (RFC 1123) permitted hostname labels to start with digits. No other symbols, punctuation characters, or blank spaces are permitted.
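The hostname rules quoted above can be sketched as follows. The function name is hypothetical, and the 63-character limit per label is an extra assumption taken from the DNS specification (RFC 1035) rather than from the quoted text.

```python
import re

# A label starts and ends with a letter or digit (digits are allowed first
# per RFC 1123); hyphens may appear only in between.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_valid_hostname(hostname: str) -> bool:
    """Check every dot-separated label against the letter/digit/hyphen rule."""
    labels = hostname.split(".")
    return all(
        len(label) <= 63 and LABEL.fullmatch(label) is not None
        for label in labels
    )


print(is_valid_hostname("example.com"))          # True
print(is_valid_hostname("3com.example"))         # True
print(is_valid_hostname("-bad.example"))         # False
print(is_valid_hostname("under_score.example"))  # False
```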

Javascript parse float is ignoring the decimals after my comma

This is "by design". The parseFloat function only considers the part of the string up to the first character that is not a sign (+ or -), a digit, an exponent, or a decimal point. Once it sees the comma, it stops looking and only considers the "75" portion.

  • https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/parseFloat

To fix this, convert the commas to decimal points.

var fullcost = parseFloat($("#fullcost").text().replace(',', '.'));
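To see why the comma cuts the number short, the prefix-parsing rule can be emulated roughly in Python. This is only an approximation of parseFloat for illustration (it ignores Infinity, for instance); the function name and the sample value "75,99" are hypothetical.

```python
import math
import re

# Optional sign, digits with an optional fraction, optional exponent.
NUMERIC_PREFIX = re.compile(r"\s*([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)")


def parse_float_prefix(text: str) -> float:
    """Parse the longest numeric prefix of text, or return NaN if none."""
    match = NUMERIC_PREFIX.match(text)
    return float(match.group(1)) if match else math.nan


print(parse_float_prefix("75,99"))                    # 75.0 (stops at the comma)
print(parse_float_prefix("75,99".replace(",", ".")))  # 75.99
```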

