What's HTML Character Code 8203

What's HTML character code 8203?

It's the Unicode Character 'ZERO WIDTH SPACE' (U+200B).

this character is intended for line break control; it has no width, but its presence between two characters does not prevent increased letter spacing in justification

As per the given code sample, the entity is entirely superfluous in this context. It must be inserted by some accident, most likely by a buggy editor trying to do smart things with whitespace or highlighting, or an enduser using a keyboard language wherein this character is natively been used, such as Arabic.

Why is ​ being injected into my HTML?

It is an issue in the source. The live example that you provided starts with the following bytes (i.e., they appear before <!DOCTYPE html>): 0xE2 0x80 0x8B. This can be seen e.g. using Rex Swain’s HTTP Viewer by selecting “Hex” under “Display Format”. Also note that validating the page with the W3C Markup Validator gives information that suggests that there is something very wrong at the start of the document, especially the message “Line 1, Column 1: Non-space characters found without seeing a doctype first.”

What happens in the validator and in the Chrome tools – as well as e.g. in Firebug – is that the bytes 0xE2 0x80 0x8B are taken as character data, which implicitly starts the body element (since character data cannot validly appear in the head element or before it), implying an empty head element before it.

The solution, of course, is to remove those bytes. Browsers usually ignore them, but you should not rely on such error handling, and the bytes prevent useful HTML validation. How you remove them, and how they got there in the first place, depends on your authoring environment.

Since the page is declared (in HTTP headers) as being UTF-8 encoded, those bytes represent the ZERO WIDTH SPACE (U+200B) character. It has no visible glyph and no width, so you won’t notice anything in the visual presentation even though browsers treat it as being data at the start of the body element. The notation is a character reference for it, presumably used by browser tools to indicate the presence of a normally invisible character.

It is possible that the software that produced the HTML document was meant to insert ZERO WIDTH NO-BREAK SPACE (U+FEFF) instead. That would have been valid, since by a special convention, UTF-8 encoded data may start with this character, also known as byte order mark (BOM) when appearing at the start of data. Using U+200B instead of U+FEFF sounds like an error that software is unlikely to make, but human beings may be mistaken that way if they think of the Unicode names of the characters.

Why am I getting this weird “​” line break?

Ok, so I found a solution from this thread.

The problem is that when you try to fetch the html of a javascript string, you might get zero width space.

Unicode has the following zero-width characters:

  • U+200B zero width space
  • U+200C zero width non-joiner Unicode code point
  • U+200D zero width joiner Unicode code point
  • U+FEFF zero width no-break space Unicode code point

So to fix my problem by I use regular expression to remove the unicode charecter:

var source = $("#guideListTemplate").html().replace(/[\u200B]/g, '');

JavaScript remove ZERO WIDTH SPACE (unicode 8203) from string

The number in a unicode escape should be in hex, and the hex for 8203 is 200B (which is indeed a Unicode zero-width space), so:

var b = a.replace(/\u200B/g,'');

Live Example:

var a = "o​m"; //the invisible character is between o and m
var b = a.replace(/\u200B/g,'');
console.log("a.length = " + a.length); // 3
console.log("a === 'om'? " + (a === 'om')); // false
console.log("b.length = " + b.length); // 2
console.log("b === 'om'? " + (b === 'om')); // true

Breaking space (opposite of non-breaking space)

is a regular space (by its numeric ASCII value).

Remove zero-width space characters from a JavaScript string

Unicode has the following zero-width characters:

  • U+200B zero width space
  • U+200C zero width non-joiner Unicode code point
  • U+200D zero width joiner Unicode code point
  • U+FEFF zero width no-break space Unicode code point

To remove them from a string in JavaScript, you can use a simple regular expression:

var userInput = 'a\u200Bb\u200Cc\u200Dd\uFEFFe';
console.log(userInput.length); // 9
var result = userInput.replace(/[\u200B-\u200D\uFEFF]/g, '');
console.log(result.length); // 5

Note that there are many more symbols that may not be visible. Some of ASCII’s control characters, for example.



Related Topics



Leave a reply



Submit