Remove Zero-Width Space Characters from a JavaScript String

Remove zero-width space characters from a JavaScript string

Unicode has the following zero-width characters:

  • U+200B zero width space
  • U+200C zero width non-joiner Unicode code point
  • U+200D zero width joiner Unicode code point
  • U+FEFF zero width no-break space Unicode code point

To remove them from a string in JavaScript, you can use a simple regular expression:

var userInput = 'a\u200Bb\u200Cc\u200Dd\uFEFFe';
console.log(userInput.length); // 9
var result = userInput.replace(/[\u200B-\u200D\uFEFF]/g, '');
console.log(result.length); // 5

Note that there are many more symbols that may not be visible. Some of ASCII’s control characters, for example.

JavaScript remove ZERO WIDTH SPACE (unicode 8203) from string

The number in a unicode escape should be in hex, and the hex for 8203 is 200B (which is indeed a Unicode zero-width space), so:

var b = a.replace(/\u200B/g,'');

Live Example:

var a = "o​m"; //the invisible character is between o and m
var b = a.replace(/\u200B/g,'');
console.log("a.length = " + a.length); // 3
console.log("a === 'om'? " + (a === 'om')); // false
console.log("b.length = " + b.length); // 2
console.log("b === 'om'? " + (b === 'om')); // true

How to remove zero-width space characters ‍ from the text

The string is the HTML character entity for the zero-width joiner. When a web browser sees it it will replace it with an actual zero-width joiner, but as far as Ruby is concerned it is just a 5 character string.

What you want to do is to specify the actual zero-width joiner character. It has the codepoint U+200D, so you can use it like this, using Ruby’s Unicode escape:

text.gsub("\u200D", "")

This should remove the zero-width joiner characters, rather than looking for the string which your original code was doing.

\u200b (Zero width space) characters in my JS code. Where did they come from?

Here's a stab in the dark.

My bet would be on Google Chrome Inspector. Searching through the Chromium source, I spotted the following block of code

    if (hasText)
attrSpanElement.appendChild(document.createTextNode("=\u200B\""));

if (linkify && (name === "src" || name === "href")) {
var rewrittenHref = WebInspector.resourceURLForRelatedNode(node, value);
value = value.replace(/([\/;:\)\]\}])/g, "$1\u200B");
attrSpanElement.appendChild(linkify(rewrittenHref, value, "webkit-html-attribute-value", node.nodeName().toLowerCase() === "a"));
} else {
value = value.replace(/([\/;:\)\]\}])/g, "$1\u200B");
var attrValueElement = attrSpanElement.createChild("span", "webkit-html-attribute-value");
attrValueElement.textContent = value;
}

It's quite possible that I'm simply barking up the wrong tree here, but it looks like zero-width spaces were being inserted (to handle soft text wrapping?) during the display of attributes. Perhaps the "Copy as HTML" function had not properly removed them?


Update

After fiddling with the Chrome element inspector, I'm almost convinced that's where your stray \u200b came from. Notice how the line can wrap not only at visible space but also after = or chars matched by /([\/;:\)\]\}])/ thanks to the inserted zero-width space.

chrome inspector screenshot

Unfortunately, I am unable to replicate your problem where they inadvertently get included into your clipboard (I used Chrome 13.0.782.112 on Win XP).

It would certainly be worth submitting a bug report should your be able to reproduce the behaviour.

How to remove \u200B (Zero Length Whitespace Unicode Character) from String in Java?

Finally, I am able to remove 'Zero Width Space' character by using 'Unicode Regex'.

String plainEmailBody = new String();
plainEmailBody = emailBodyStr.replaceAll("[\\p{Cf}]", "");

Reference to find the category of Unicode characters.

  1. Character class from Java.

Character class from Java lists all of these unicode categories.

Sample Image


  1. Website: http://www.fileformat.info/

Character category


  1. Website: http://www.regular-expressions.info/ => Unicode Regular Expressions

Unicode Regex for \u200B character

Note 1: As I received this string from Outlook Email Body - none of the approaches listed in my question was working.

My application is receiving a String from an external system
(Outlook), So I have no control over it.

Note 2: This SO answer helped me to know about Unicode Regular Expressions .

HTML - How to remove Zero-Width No Break Space

In my case, I opened up all files that I thought were be an impact into Notepad++, then on each file (in Notepad++), I changed the encoding to Encode in ANSI. One of the files had some hieroglyphics on top (line 1). Removed that and saved back to normal encoded state.

Zero-width space makes HTML link unclickable

Inspired by Nawed's answer and devdob's comments, I decided to add a suffix when the link text didn't contain any letters or numbers. That will avoid corrupting any useful Unicode text, but it should always provide a usable link.

Unfortunately, it's not smart about Unicode. Chinese words, for example, will always get the link suffix.

Some options for improvement:

  1. Find a list of all the invisible Unicode characters and look for them explicitly. I'm worried that new invisible Unicode characters could be added in the future.
  2. List more blocks of visible Unicode characters beyond Latin letters and numbers. This seems like the best option, but I think my application will be fine as long as it doesn't Unicode names.

function addLink($p, linkText) {  if ( ! /\w/u.test(linkText)) {    linkText += ' [link]';  }  $p.append($('<a>')    .text(linkText)    .attr('href', 'javascript:alert("clicked.")'));}var linkText1 = 'foo', linkText2 = '\u200b', linkText3 = '网站';addLink($('#ex1'), linkText1);addLink($('#ex2'), linkText2);addLink($('#ex3'), linkText3);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script><p id="ex1">Example 1 </p><p id="ex2">Example 2 </p><p id="ex3">Example 3 </p>

General method to trim non-printable characters in Clojure

I believe, what you are referring to are so-called non-printable characters. Based on this answer in Java, you could pass the #"\p{C}" regular expression as pattern to replace:

(defn remove-non-printable-characters [x]
(clojure.string/replace x #"\p{C}" ""))

However, this will remove line breaks, e.g. \n. So in order to keep those characters, we need a more complex regular expression:

(defn remove-non-printable-characters [x]
(clojure.string/replace x #"[\p{C}&&^(\S)]" ""))

This function will remove non-printable characters. Let's test it:

(= "sample" "​sample")
;; => false

(= (remove-non-printable-characters "sample")
(remove-non-printable-characters "​sample"))
;; => true

(remove-non-printable-characters "sam\nple")
;; => "sam\nple"

The \p{C} pattern is discussed here.



Related Topics



Leave a reply



Submit