Zero-Width Non-Breaking Space

The No-Break Space is very similar to a Word Joiner, just as it is very similar to a Space, but each has very different uses. All of these variations exist to represent spaces of different widths and functions.

  • U+00A0 No-Break Space: rendered like a normal space, but it prevents an automatic line break at its position.
  • U+2007 Figure Space: a space about as wide as a digit (0–9).
  • U+202F Narrow No-Break Space (NNBSP): used to separate a suffix from a word stem without indicating a word boundary. It is approximately one third the width of a normal space, though this may vary by font.
  • U+2060 Word Joiner: represented by no visible character, it prohibits a line break at its position.
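
Since these characters look alike (or are invisible), it can help to inspect them by code point. Here is a minimal Python sketch, assuming only the standard unicodedata module; the helper name describe is made up for illustration:

```python
import unicodedata

# Code points taken from the list above.
SPACE_LIKE = ["\u00A0", "\u2007", "\u202F", "\u2060"]

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME (general category)' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"

for ch in SPACE_LIKE:
    print(describe(ch))
```

The first three are reported as space separators (Zs), while the Word Joiner is a format character (Cf), which matches the idea that it has no width of its own.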

Other No-Break Characters

  • NON-BREAKING HYPHEN (U+2011)
  • FIGURE SPACE (U+2007)
  • NARROW NO-BREAK SPACE (U+202F)
  • TIBETAN MARK DELIMITER TSHEG BSTAR (U+0F0C)

W3C recommends using the Word-Joiner whenever you need to connect two characters or words so that they do not wrap. [1]

To get the same functionality formerly provided through the ZERO WIDTH NO-BREAK SPACE (U+FEFF), authors should use a WORD JOINER (U+2060) instead.

However, nowhere in the HTML4 Character Reference is Word-Joiner mentioned. [2]

In addition to these characters, the SOFT HYPHEN (U+00AD) can be used to provide line-break hints within words that user agents (UAs) might not have in their own hyphenation dictionaries.

The only characters that are explicitly discouraged are:

  • ZERO WIDTH NON-JOINER (U+200C): prevents ligation and cursive connections between characters that would otherwise ligate or join cursively.
  • ZERO WIDTH JOINER (U+200D): encourages ligation and cursive connections.


References:

  1. W3C Wiki: HTML Character Usage
  2. Character entity references in HTML 4


Further:

  1. Unicode.org Correction of Word_Break Property Value for U+00A0 NBSP
  2. Unicode v.3.2.0 Line Break Properties
  3. Unicode "Proposed" Line Breaking Properties
  4. Unicode v7 Complete Standards
  5. Unicode Explained by Jukka Korpela

Zero-width space vs zero-width non-joiner

A zero-width non-joiner is almost nonexistent. Its only purpose is to split things in two. For example, 123 followed by a zero-width non-joiner and then 456 is two numbers with nothing in between.

A zero-width space is a space character, just a very, very narrow one. For example, 123 followed by a zero-width space and then 456 is two numbers with a space character in between.
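
Both characters are invisible, so the two strings look identical on screen. A small Python sketch like the one below can make them visible. The function name reveal is hypothetical, and the check relies on Python's unicodedata reporting both characters as format characters (category Cf):

```python
import unicodedata

def reveal(text: str) -> str:
    """Replace every format character (category Cf) with a visible <U+XXXX> tag."""
    return "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )

with_zwnj = "123\u200C456"  # zero-width non-joiner between the digits
with_zwsp = "123\u200B456"  # zero-width space between the digits

# Each string has 7 characters even though only 6 are visible.
print(len(with_zwnj), reveal(with_zwnj))
print(len(with_zwsp), reveal(with_zwsp))
```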

How to make a hair space and thin space non-breaking?

U+2060 WORD JOINER is an invisible, zero-width character whose only effect is that it prevents line breaks at its position. Putting WORD JOINER after any whitespace character will therefore make it non-breaking. (Putting a WJ before the whitespace is not necessary because all spaces in Unicode automatically inhibit line-breaking immediately before them.)
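
If the text is generated by a script, the WJ can be added automatically. A rough Python sketch, assuming you only care about U+2009 THIN SPACE and U+200A HAIR SPACE (the function name make_non_breaking is made up for this example):

```python
import re

WORD_JOINER = "\u2060"

# U+2009 THIN SPACE or U+200A HAIR SPACE that is not already followed by a WJ.
NARROW_SPACE = re.compile("[\u2009\u200A](?!\u2060)")

def make_non_breaking(text: str) -> str:
    """Append a WORD JOINER after each thin or hair space that lacks one."""
    return NARROW_SPACE.sub(lambda m: m.group(0) + WORD_JOINER, text)

sample = make_non_breaking("hello\u200Aworld")
print([f"U+{ord(ch):04X}" for ch in sample[5:7]])
```

The negative lookahead keeps the function idempotent, so running it twice does not pile up joiners.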

UPD1 from the comments below:

The Unicode Standard absolutely requires WJ to retain its break-blocking property. However, there is a seven-year-old bug in Firefox's implementation of the Unicode line-breaking algorithm, so in Firefox the combination whitespace + WJ behaves just like the whitespace without WJ.

In Chrome and Safari WJ works according to the standard.

UPD2 from the discussion in Twitter (credit to CharlotteBuff and FakeUnicode):

  • It is possible to use U+034F COMBINING GRAPHEME JOINER in place of U+2060 WORD JOINER as a workaround for Firefox. It prevents line-breaking in Firefox, but not in Chrome and Safari. Also such use of U+034F CGJ is semantically dubious because CGJ has a different intended purpose. Nevertheless it is highly unlikely that U+034F CGJ can cause any kind of accessibility problems (with screenreaders and similar stuff).

  • It is also possible to use U+200D ZERO WIDTH JOINER to prevent line-breaking. ZWJ works fine in the "big three" (Chrome, Safari, Firefox). However, ZWJ can have undesirable effects on the appearance of the surrounding text, because its purpose is to induce cursive joining in places where it otherwise wouldn’t happen automatically (e.g. in Arabic: م + ث + ل = مثل). If there are fancy ligatures in a font of the text with ZWJ, ZWJ can (and probably will) cause them to be triggered and change the shape of symbols nearby.

Note: the Wiki denotes ZWJ as «May break: Yes», but it is mistaken.

So, here are all the options to be seen and checked in your browser:

<style>
/* A narrow box forces the browser to break the line wherever it is allowed to. */
div {
  outline: 1px solid red;
  width: 20px;
  margin: 10px;
}
</style>

<!-- U+00A0 NO-BREAK SPACE -->
<!-- just for example -->
<div>hello&#x00A0;world1</div>

<!-- U+200A HAIR SPACE + U+2060 WORD JOINER -->
<!-- works in Chrome and Safari, but not in Firefox -->
<div>hello&#x200A;&#x2060;world2</div>

<!-- U+200A HAIR SPACE + U+034F COMBINING GRAPHEME JOINER -->
<!-- works in Firefox, but 1) is semantically incorrect and 2) does not work in Chrome and Safari -->
<div>hello&#x200A;&#x034F;world3</div>

<!-- U+200D ZERO WIDTH JOINER + U+200A HAIR SPACE + U+200D ZERO WIDTH JOINER -->
<!-- works, even in Firefox -->
<div>hello&#x200D;&#x200A;&#x200D;world4</div>

<!-- U+200A HAIR SPACE + U+200D ZERO WIDTH JOINER -->
<!-- works, even in Firefox -->
<div>hello&#x200A;&#x200D;world5</div>

HTML - How to remove Zero-Width No Break Space

In my case, I opened every file that I thought might be affected in Notepad++, then for each file I changed the encoding with Encode in ANSI. One of the files showed some garbage characters at the top (line 1). I removed them and converted the file back to its normal encoding.
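
The same cleanup can be scripted. The sketch below assumes the stray characters on line 1 were a UTF-8 byte order mark (U+FEFF encoded as EF BB BF), which is the usual cause but is a guess here; the function name strip_utf8_bom and the file name page.html are made up for the demo:

```python
import codecs
import tempfile
from pathlib import Path

def strip_utf8_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from the file in place; return True if one was removed."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True

# Demo on a throwaway file that starts with a BOM.
demo = Path(tempfile.mkdtemp()) / "page.html"
demo.write_bytes(codecs.BOM_UTF8 + b"<html></html>")
removed = strip_utf8_bom(demo)
print(removed, demo.read_bytes())
```

Working on raw bytes avoids guessing the encoding of the rest of the file.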

How to remove zero-width space characters (&zwj;) from the text

The string &zwj; is the HTML character entity for the zero-width joiner. When a web browser sees it, it will replace it with an actual zero-width joiner, but as far as Ruby is concerned it is just a 5-character string.

What you want to do is to specify the actual zero-width joiner character. It has the codepoint U+200D, so you can use it like this, using Ruby’s Unicode escape:

text.gsub("\u200D", "")

This should remove the zero-width joiner characters, rather than looking for the string &zwj;, which is what your original code was doing.
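
For comparison, here is roughly the same idea in Python. This is only a sketch: the function name remove_zwj is made up, and it additionally drops the literal five-character entity in case the text is still HTML-encoded, which may or may not be what you want:

```python
ZWJ = "\u200D"        # the actual zero-width joiner character
ZWJ_ENTITY = "&zwj;"  # the 5-character HTML entity, as plain text

def remove_zwj(text: str) -> str:
    """Drop both real zero-width joiners and literal '&zwj;' entities."""
    return text.replace(ZWJ_ENTITY, "").replace(ZWJ, "")

print(remove_zwj("a\u200Db&zwj;c"))
```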

Cannot find ZERO WIDTH NO-BREAK SPACE when reading file

0xEF 0xBB 0xBF is the UTF-8 BOM, 0xFE 0xFF is the UTF-16BE BOM, and 0xFF 0xFE is the UTF-16LE BOM. If 0xFEFF exists at the front of your String, it means you created a UTF encoded text file with a BOM. A UTF-16 BOM could appear as-is as 0xFEFF, whereas a UTF-8 BOM would only appear as 0xFEFF if the BOM itself were being decoded from UTF-8 to UTF-16 (meaning the reader detected the BOM but did not skip it). In fact, it is known that Java does not handle UTF-8 BOMs (see bugs JDK-4508058 and JDK-6378911).

If you read the FileReader documentation, it says:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.

You need to read the file content using a reader that recognizes charsets, preferably one that will read the BOM for you and adjust itself internally as needed. In the worst case, you could just open the file yourself, read the first few bytes to detect whether a BOM is present, and then construct a reader using an appropriate charset to read the rest of the file. Here is an example using org.apache.commons.io.input.BOMInputStream that does exactly that:

(from https://stackoverflow.com/a/13988345/65863)

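import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

String defaultEncoding = "UTF-8";
// try-with-resources closes both streams even if reading fails
try (InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
     BOMInputStream bomInputStream = new BOMInputStream(inputStream)) {
    // This constructor detects a UTF-8 BOM and excludes it from the data that is read
    ByteOrderMark bom = bomInputStream.getBOM();
    // No BOM found: fall back to the default encoding
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);
    // use reader
}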
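
The manual approach described above (look at the first few bytes, pick a charset, skip the BOM) can also be sketched without any library. The Python version below is only an illustration: it covers just the three BOMs mentioned in this answer (so a UTF-32 file would be misdetected), and the function name decode_without_bom is made up:

```python
import codecs

# Only the BOMs discussed above; UTF-32 is deliberately out of scope.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def decode_without_bom(data: bytes, default: str = "utf-8") -> str:
    """Pick the charset from a leading BOM (if any), skip the BOM, decode the rest."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(default)

raw = codecs.BOM_UTF8 + "hello".encode("utf-8")
naive = raw.decode("utf-8")      # plain UTF-8 decoding keeps U+FEFF at the front
clean = decode_without_bom(raw)  # BOM detected and skipped
print(repr(naive), repr(clean))
```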

Replacing `Zero Width No-Break Space` with `space` in R

As I examined your text 'خيابان‌ مولوي‌نرسيده‌به‌قيام‌' on my terminal, I got:

>>> خيابان\U+200C مولوي\U+200Cنرسيده\U+200Cبه\U+200Cقيام\U+

After converting all these characters to hex in the Python shell, I got:

>>> binascii.unhexlify(binascii.hexlify(u"خيابان\U+200C مولوي\U+200C نرسيده\U+200C به\U+200C قيام\U+200C".encode('utf-16'))).decode('utf-16')
u'\u062e\u064a\u0627\u0628\u0627\u0646\u200c \u0645\u0648\u0644\u0648\u064a\u200c \u0646\u0631\u0633\u064a\u062f\u0647\u200c \u0628\u0647\u200c \u0642\u064a\u0627\u0645\u200c'

You will see that there is no \ufeff (ZERO WIDTH NO-BREAK SPACE) in the output of the program above. As further proof, here you will see that ǎ is easily matched, but no \x{feff} is found.

Thus, your problem is that there is no ZERO WIDTH NO-BREAK SPACE in your string. I guess the character that you want to replace is this one: \u200C (ZERO WIDTH NON-JOINER).
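
To double-check which invisible characters a string really contains before replacing anything, something like this Python 3 sketch may help. It reuses the escaped string from the output above; the helper name invisible_chars is made up, and collapsing repeated spaces after the replacement is my own assumption about the desired result:

```python
import unicodedata

# The string from the output above, written with escapes so nothing is hidden.
text = (
    "\u062e\u064a\u0627\u0628\u0627\u0646\u200c \u0645\u0648\u0644\u0648\u064a\u200c "
    "\u0646\u0631\u0633\u064a\u062f\u0647\u200c \u0628\u0647\u200c \u0642\u064a\u0627\u0645\u200c"
)

def invisible_chars(s: str) -> set:
    """Return the Unicode names of all format (Cf) characters found in s."""
    return {unicodedata.name(ch) for ch in s if unicodedata.category(ch) == "Cf"}

print(invisible_chars(text))

# Replace each ZWNJ with a space, then collapse runs of whitespace.
cleaned = " ".join(text.replace("\u200c", " ").split())
print(cleaned)
```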

The width of non breaking space ( 160 ) and normal space ( 32 ) are different

When we talk about a font’s spacing, or letter fit, we’re referring to the amount of space between the characters, which in turn gives the typeface its relative openness or tightness. A font’s spacing is initially determined by the manufacturer or designer and is somewhat size-dependent. Text designs tend to be spaced more openly than display faces. The reason? The smaller the point size, the more space is needed between letters to keep the characters legible. Conversely, as a typeface is set larger, a snugger fit between letters creates word-shapes that are easier to read.

Please check https://www.fonts.com/content/learning/fyti/using-type-tools/spacing-and-kerning-1


