HTML: Should I Encode Greater Than or Not? ( > &gt; )

HTML: Should I encode greater than or not? ( > &gt; )

Strictly speaking, to prevent HTML injection, you need only encode < as &lt;.

If user input is going to be put in an attribute, also encode " as &quot;.

If you're doing things right and using properly quoted attributes, you don't need to worry about >. However, if you're not certain of this you should encode it just for peace of mind - it won't do any harm.
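As a rough illustration of the advice above, here is a minimal Python sketch. Python and the helper names `escape_text` and `escape_attribute` are my own choices, not part of the original answer. It relies on the standard library's `html.escape`, which also encodes & and >, so it is stricter than the bare minimum described above.

```python
import html

def escape_text(value: str) -> str:
    """Escape user input that will be placed between tags."""
    # quote=False leaves " and ' untouched; &, < and > are still encoded.
    return html.escape(value, quote=False)

def escape_attribute(value: str) -> str:
    """Escape user input that will be placed inside a quoted attribute."""
    # quote=True (the default) additionally encodes " and '.
    return html.escape(value, quote=True)

user_input = '<script>alert("hi")</script>'
print(f"<p>{escape_text(user_input)}</p>")
print(f'<input value="{escape_attribute(user_input)}">')
```

Encoding > as well costs nothing here, which matches the "peace of mind" suggestion.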

What character encoding is &gt;?

This might answer your question. Basically it is HTML encoding for a few predefined characters.

Sequences like &gt; and &amp; are HTML entities; specifically, they are named HTML entities.

What do &lt; and &gt; stand for?

&lt; stands for the less-than sign: <
&gt; stands for the greater-than sign: >
&le; stands for the less-than or equals sign: ≤
&ge; stands for the greater-than or equals sign: ≥
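If you want to check what a named entity stands for, one quick way (assuming Python is at hand; this is not from the original answer) is to decode it with the standard library's `html.unescape`:

```python
import html

# The four named entities from the list above, as they appear in HTML source.
encoded = "&lt; &gt; &le; &ge;"

# html.unescape replaces each entity with the character it stands for.
decoded = html.unescape(encoded)
print(decoded)
```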

When should one use HTML entities?

You don't generally need to use HTML character entities if your editor supports Unicode. Entities can be useful when:

Your keyboard does not support the character you need to type. For example, many keyboards do not have em-dash or the copyright symbol.
Your editor does not support Unicode (very common some years ago, but probably not today).
You want to make it explicit in the source what is happening. For example, the &nbsp; code is clearer than the corresponding white space character.
You need to escape HTML special characters like <, &, or ".
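The point about explicitness can be seen in a short Python sketch (my own illustration, not part of the original answer): a non-breaking space looks like an ordinary space in source text, but it is a different character.

```python
import html

# Decode the entity into the actual non-breaking space character (U+00A0).
nbsp = html.unescape("&nbsp;")

# repr() makes the otherwise invisible difference visible.
print(repr(nbsp))

# It looks like a space in an editor, but it is not the ASCII space.
print(nbsp == " ")

# Entities for characters that many keyboards lack.
symbols = html.unescape("&copy; &mdash;")
print(symbols)
```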

Content type vs HTML encoding

The article is in fact correct. If you have proper UTF-8 encoded data, there is no reason to use HTML entities for special characters on normal web pages any more.

I say "on normal web pages", because there are highly exotic borderline scenarios where using entities is still the safest bet (e.g. when serving JavaScript code to an external page with unknown encoding). But for serving pages to a browser, this doesn't apply.

HTML and character encoding vs HTML Entity

It all depends on the character encoding of the document. If you're unsure whether you should use the regular character or the entity-encoded version, you could run your page through the W3C Validator.

Consider this code:

<html>
<head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8" />
  <title>Stuff</title>
</head>
<body>
 <p>©</p>
 <p>&copy;</p>
</body>
</html>

The document encoding is set to UTF-8 and when it's validated, it returns an error:

Sorry, I am unable to validate this document because on line 7 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
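One plausible cause of this error (an assumption on my part; the answer does not say how the file was saved) is that the file was saved in a single-byte encoding such as Latin-1 while declaring UTF-8. A small Python sketch shows why the literal © then fails while the entity does not:

```python
copyright_sign = "\u00a9"  # the © character

utf8_bytes = copyright_sign.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = copyright_sign.encode("latin-1")  # a single byte in Latin-1
entity_bytes = "&copy;".encode("ascii")          # the entity is plain ASCII

# A lone Latin-1 © byte is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(utf8_bytes, latin1_bytes, decodes_as_utf8)

# ASCII is a subset of UTF-8, so the entity survives either way.
print(entity_bytes.decode("utf-8"))
```

Saving the file as real UTF-8 fixes the literal character; the entity works regardless of how the file is saved.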

Should we HTML-encode special characters before storing them in the database?

Don't HTML-encode your characters before storage. You should store as pure a form of your data as possible. HTML encoding is needed because you are going to display the data on an HTML page, so do the encoding during the processing of the data to create the page. For example, suppose you decide you're also going to send the data in plain text emails. If you've HTML-encoded the data, now the HTML encoding is a barrier that you have to undo.

Choose a canonical form for your data, and store that. UTF-8 is wonderful, and your database supports it (assuming you've created all your tables properly). Just store UTF-8.
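To make the "store raw, encode on output" idea concrete, here is a minimal Python sketch. The function names and the sample comment are hypothetical, and Python stands in for whatever language builds your pages.

```python
import html

# The raw form, exactly as it would be stored in the database.
stored_comment = 'Tom & Jerry <3 "cartoons"'

def render_html(comment: str) -> str:
    """HTML-encode only at the moment the page is built."""
    return f"<p>{html.escape(comment)}</p>"

def render_plain_email(comment: str) -> str:
    """Plain text needs no HTML encoding, so the raw value is used as-is."""
    return f"New comment:\n{comment}\n"

print(render_html(stored_comment))
print(render_plain_email(stored_comment))
```

Had the comment been stored HTML-encoded, the email path would first have to decode it.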

Confused with html encoding

You have this confused. Character encoding is an attribute of YOUR systems. Your websites and your database are responsible for character encoding.

You have to decide what you will accept. I would say in general, the web has moved towards standardization on UTF-8. So if the websites that accept user input, your database, and all the connections involved are UTF-8, then you are in a position to accept input as UTF-8, and the character set and collation in the database should be configured to match.

At this point all your web pages should be HTML5, so the recommended HEAD section of your pages should at a minimum be this:

<!DOCTYPE html>
<html lang="en"> 
<head>
<meta charset="utf-8"/>

Next you have SQL injection. You specified PHP. If you are using mysqli or PDO (which is in my experience the better choice) AND you are binding every variable as a parameter of a prepared statement (bindParam or bindValue in PDO, bind_param in mysqli), there is NO ISSUE with SQL injection. That issue goes away, and the need for escaping input goes away, because the values are passed separately from the SQL text, so the statement can no longer get confused by them.
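The answer is about PHP, but the same idea can be sketched in Python with the standard `sqlite3` module as a stand-in (the `comments` table and the hostile string are hypothetical). The `?` placeholders play the role of bound parameters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")

# Input that would break a query built by string concatenation.
hostile = "'); DROP TABLE comments; --"

# The value is passed separately from the SQL text, so it is treated as data.
conn.execute("INSERT INTO comments (body) VALUES (?)", (hostile,))
conn.commit()

# The raw input comes back unchanged and the table is still there.
stored = conn.execute(
    "SELECT body FROM comments WHERE body = ?", (hostile,)
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
print(stored, row_count)
```

Note that nothing was escaped before storage; the value is stored exactly as submitted.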

Finally, you mentioned htmlpurifier. That exists so that people can try and avoid XSS and other exploits of that nature, that occur when you accept user input, and those people inject html & js.

That is always going to be a concern, depending on the nature of the system and what you do with that output, but as others suggested in comments, you can run sanitizers and filters on the output after you've retrieved it from the database. Sitting inside a php string variable there is no intrinsic danger, until you weaponize it by injecting it into a live html page you are serving.

In terms of finding bad actors and people trying to mess with your system, you are obviously much better off having stored the original input as submitted. Then as you come to understand the nature of these exploits, you can search through your database looking for specific things, which you won't be able to do if you sanitize first and store the result.

Should I still use html entities? Why?

If the encoding is set correctly (and the document is saved as UTF-8) you should be able to work with just the characters. From the W3C:

Using an encoding such as UTF-8 means that you can avoid the need for most escapes and just work with characters.

http://www.w3.org/International/questions/qa-escapes

However, you still need to use entities for special characters such as greater/less than.