Why do HTML entity names with dec < 255 not require semicolon?
The reason is that historically the semicolon has been optional when an entity reference (or a character reference) is not immediately followed by a name character. So `&pound?` is OK since `?` is not a name character (i.e., a character allowed in names), but `&pound4` is not, since `4` is a name character, making `pound4` the entity name (which is undefined in HTML, but might become defined some day). This rule is part of SGML legacy in HTML, one of the few things where browsers actually applied specialties of SGML.
It has, however, always been regarded as good practice to terminate entity references by a semicolon. XML, and hence XHTML, makes it even formally mandatory.
This is why current browser practices allow omission of semicolons as in “classic” HTML, but only for the limited set of character references denoting ISO Latin 1 characters, i.e. characters with Unicode numbers up to 255 in decimal (FF in hexadecimal). This was the original set of entity references, and therefore such references have widely been used without semicolon. So the practices are a compromise: they encourage the recommended notation without invalidating a large body of old pages, still less making browsers fail to render them properly.
The HTML5 drafts have had various positions on this, but e.g. HTML5 CR from 6 August 2013 requires the semicolon in all cases even in HTML syntax. Lack of semicolon is defined as a parse error, which means that error handling is well-defined (the entity shall be recognized), but browsers may still stop parsing at first parse error!
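As a rough illustration of the browser practice described above, Python's `html.unescape()` applies the HTML5 named character reference rules for text content. This is only a sketch of decoding behavior, not of validity; the `decode_examples` helper and the sample strings are made up for the illustration.

```python
from html import unescape

def decode_examples():
    """Decode a few references with and without the trailing semicolon."""
    samples = [
        "&pound;4",  # properly terminated
        "&pound?",   # legacy Latin-1 name, semicolon omitted
        "&pound4",   # still decoded under the HTML5 rules, unlike in SGML
        "&euro; 5",  # newer name, properly terminated
        "&euro 5",   # newer name without semicolon: left untouched
    ]
    return {sample: unescape(sample) for sample in samples}

for source, decoded in decode_examples().items():
    print(f"{source!r} -> {decoded!r}")
```

Only the names from the original Latin-1 set are recognized without the semicolon here; `&euro` stays as literal text.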
Are There Downsides to Writing `&lt` Instead of `&lt;`
This is entirely up to how forgiving the browser/rendering engine wants to be, and is not a property of HTML. All entities must end in a semi-colon, or you have invalid syntax. The WHATWG "HTML Living Standard" confusingly considers this semi-colon to be part of the name, making it seem optional in the Developer Edition. But the full Standard text/W3C HTML5 draft is clearer: "The name must be one that is terminated by a U+003B SEMICOLON character (;)."
Historically the semicolon has been optional when a character entity is not immediately followed by a name character. For example, `&pound?` will work because `?` is not a name character (i.e., a character allowed in names), but `&pound4` will not because `4` is a name character, making `pound4` the entity name, which is undefined. This rule is part of SGML legacy in HTML, one of the few things where browsers actually applied specialties of SGML.
That being said, it has always been regarded as good practice to terminate entity references by a semicolon. XML, and hence XHTML, makes it mandatory.
This is why current browser practices allow omission of semicolons as in “classic” HTML, but only for the limited set of character references denoting ISO Latin 1 characters (characters with Unicode numbers up to 255 in decimal, or FF in hexadecimal). This was the original set of entity references, and therefore such references have widely been used without semicolon. So the practices are a compromise: they encourage the specified notation without invalidating a large body of old pages that don't conform, and without making browsers fail to render those pages properly.
The HTML5 drafts have had various positions on this, but HTML5 requires the semicolon in all cases even in HTML syntax. Lack of semicolon is defined as a parse error, which means that error handling is well-defined (the entity shall be recognized), but browsers may still stop parsing at first parse error.
According to the W3C Recommendation
In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
While the W3C Working Draft states
The ampersand must be followed by one of the names given in §8.5 Named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).
Because the semicolon is required for W3C validation, and because it works in all browsers, you should use it. The minuscule amount of page size you would save by omitting it is not worth the risk of entities not displaying correctly in all browsers.
Using lxml.html with broken html entities?
The HTML 5 standard has specified a specific subset of entities that can be parsed without the trailing semicolon present, because these entities were historically defined with the semicolon being optional.
The `html.unescape()` function explicitly supports those; use that function as a second pass to clear out this issue:
>>> from html import unescape
>>> unescape("Kristj&aacuten V&iacutector")
'Kristján Víctor'
If you install `html5lib` then you can have lxml behave the same, via their `lxml.html.html5parser` module (which wraps `html5lib`'s own `html5lib.treebuilders.etree_lxml` adapter):
>>> from lxml.html import html5parser as etree
>>> etree.fromstring("Kristj&aacuten V&iacutector").text
'Kristján Víctor'
PHP `&curren` string turns into weird symbol
That's the entity code for the currency symbol (`&curren`, ¤) being interpreted: the start of `&currentPage` matches it. If you're building your GET URL, you can solve it in various ways:
- Use `urlencode()` on your query values: `$s = 'page.com?' . urlencode("a=1&currentPage=2");`
- Use the entity for `&` itself: `'page.com?a=1&amp;currentPage=2'`
- Or use your variable at the beginning so no `&` is required before it: `'page.com?currentPage=2&a=1'`
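The same effect can be reproduced outside PHP. Here is a minimal Python sketch, assuming the URL ends up in HTML text content where HTML5 entity decoding applies; `build_link` is a hypothetical helper name, and `html.unescape()` does not model the special handling browsers apply inside attribute values.

```python
from html import escape, unescape
from urllib.parse import urlencode

def build_link(base, params):
    """Return (raw_url, html_safe_url) for a dict of query parameters."""
    raw = base + "?" + urlencode(params)  # percent-encodes each key and value
    return raw, escape(raw)               # & becomes &amp; for HTML output

raw, safe = build_link("page.com", {"a": 1, "currentPage": 2})
print(raw)             # the URL as it should reach the server
print(unescape(raw))   # what an HTML5 text decoder makes of the raw URL
print(safe)            # the form to write into the HTML source
print(unescape(safe))  # decodes back to the intended URL
```

Escaping the finished URL once, at the point where it is written into HTML, keeps the query string itself unchanged.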
What characters are allowed in HTTP header values?
RFC 2616 is obsolete; the relevant part has been replaced by RFC 7230.
The NUL octet is no longer allowed in comment and quoted-string text,
and handling of backslash-escaping in them has been clarified. The
quoted-pair rule no longer allows escaping control characters other
than HTAB. Non-US-ASCII content in header fields and the reason phrase
has been obsoleted and made opaque (the TEXT rule was removed).
(Section 3.2.6)
In essence, RFC 2616 defaulted to ISO-8859-1, and this was both insufficient and not interoperable anyway. Thus, RFC 7230 has deprecated non-ASCII octets in field values. The recommendation is to use an escaping mechanism on top of that (such as defined in RFC 8187, or plain URI-percent-encoding).
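As a sketch of the escaping approach, the following Python snippet checks whether a value stays within visible US-ASCII (plus space and tab) and, if not, produces the percent-encoded UTF-8 form that RFC 8187 defines for parameter values. The helper names are hypothetical, and the encoder deliberately escapes more characters than RFC 8187 strictly requires.

```python
from urllib.parse import quote

def is_ascii_field_value(value):
    """True if value uses only visible US-ASCII characters, spaces, and tabs."""
    return all(ch in " \t" or 0x21 <= ord(ch) <= 0x7E for ch in value)

def encode_ext_value(text):
    """RFC 8187 style: charset, empty language tag, percent-encoded UTF-8."""
    return "UTF-8''" + quote(text, safe="", encoding="utf-8")

title = "na\u00efve file.txt"  # contains a non-ASCII character
print(is_ascii_field_value(title))
print(encode_ext_value(title))
```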
What characters are allowed in an email address?
See RFC 5322: Internet Message Format and, to a lesser extent, RFC 5321: Simple Mail Transfer Protocol.
RFC 822 also covers email addresses, but it deals mostly with their structure:
addr-spec = local-part "@" domain ; global address
local-part = word *("." word) ; uninterpreted
; case-preserved
domain = sub-domain *("." sub-domain)
sub-domain = domain-ref / domain-literal
domain-ref = atom ; symbolic reference
And as usual, Wikipedia has a decent article on email addresses:
The local-part of the email address may use any of these ASCII characters:
- uppercase and lowercase Latin letters `A` to `Z` and `a` to `z`;
- digits `0` to `9`;
- special characters ``!#$%&'*+-/=?^_`{|}~``;
- dot `.`, provided that it is not the first or last character unless quoted, and provided also that it does not appear consecutively unless quoted (e.g. `John..Doe@example.com` is not allowed but `"John..Doe"@example.com` is allowed);
- space and `"(),:;<>@[\]` characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash);
- comments are allowed with parentheses at either end of the local-part; e.g. `john.smith(comment)@example.com` and `(comment)john.smith@example.com` are both equivalent to `john.smith@example.com`.
In addition to ASCII characters, as of 2012 you can use international characters above `U+007F`, encoded as UTF-8 as described in the RFC 6532 spec and explained on Wikipedia. Note that as of 2019, these standards are still marked as Proposed, but are being rolled out slowly. The changes in this spec essentially added international characters as valid alphanumeric characters (atext) without affecting the rules on allowed and restricted special characters like `!#` and `@:`.
For validation, see Using a regular expression to validate an email address.
The domain part is defined as follows:
The Internet standards (Request for Comments) for protocols mandate that component hostname labels may contain only the ASCII letters `a` through `z` (in a case-insensitive manner), the digits `0` through `9`, and the hyphen (`-`). The original specification of hostnames in RFC 952 mandated that labels could not start with a digit or with a hyphen, and must not end with a hyphen. However, a subsequent specification (RFC 1123) permitted hostname labels to start with digits. No other symbols, punctuation characters, or blank spaces are permitted.
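The unquoted rules above can be sketched as a small Python check. This is a deliberately narrow illustration, not a full RFC 5322 validator: it assumes an ASCII address with no quoted strings, comments, or domain literals, and `is_plain_address` is a hypothetical helper name.

```python
import re

# Characters allowed in an unquoted local-part, as listed above.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
# Runs of ATEXT separated by single dots: no leading, trailing, or doubled dot.
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")
# Hostname label: letters, digits, hyphens; no hyphen at either end.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")

def is_plain_address(address):
    """Check an unquoted ASCII address against the rules sketched above."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return False  # no "@" at all
    if not DOT_ATOM.fullmatch(local):
        return False
    return all(LABEL.fullmatch(label) for label in domain.split("."))

for candidate in ["john.smith@example.com", "John..Doe@example.com",
                  "john@3com.example", "john@-bad.example"]:
    print(candidate, is_plain_address(candidate))
```

Quoted forms such as `"John..Doe"@example.com` are valid but are rejected by this sketch, which is one reason real-world validation usually stays looser than the grammar.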
Javascript parse float is ignoring the decimals after my comma
This is "By Design". The `parseFloat` function only considers the part of the string up to the first character that is not a +, -, digit, exponent marker, or decimal point. Once it sees the comma it stops looking and only considers the "75" portion.
- https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/parseFloat
To fix this convert the commas to decimal points.
var fullcost = parseFloat($("#fullcost").text().replace(',', '.'));