Invisible Unicode and compound emoji

A lot goes on in Unicode that isn’t immediately visible to the reader. This article looks at some potentially useful code points that can only be seen by their effects:

ZWJ, the zero-width joiner U+200D
ZWNJ, the zero-width non-joiner U+200C
ZWS, the zero-width space U+200B
WJ, the word joiner U+2060
ZWNBSP or BOM, the zero-width non-breaking space or byte order mark U+FEFF.
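
Because none of these can be seen, one way to get a handle on them is to ask the Unicode database what it knows. The following is a minimal sketch in Python using the standard unicodedata module; the INVISIBLES table and the describe helper are names made up for this illustration.

```python
import unicodedata

# The five invisible code points discussed in this article.
# (INVISIBLES and describe are illustrative names, not a standard API.)
INVISIBLES = {
    "ZWJ": "\u200d",
    "ZWNJ": "\u200c",
    "ZWS": "\u200b",
    "WJ": "\u2060",
    "ZWNBSP/BOM": "\ufeff",
}

def describe(ch):
    """Return (code point, official name, general category) for one character."""
    code = f"U+{ord(ch):04X}"
    return (code, unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))

for label, ch in INVISIBLES.items():
    code, name, category = describe(ch)
    print(f"{label:11} {code}  {category}  {name}")
```

All five report the general category Cf, ‘format’, which is the database’s way of saying that they affect the layout of surrounding text rather than drawing anything themselves.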

ZWJ and compound emoji

Unicode U+200D is used to compound adjoining characters or emoji into a new character or emoji. Although this is widely used in some Indic and other scripts, as detailed in Wikipedia, it has also become popular for extending emoji. For example,
👩 + ZWJ + 🚀 = 👩‍🚀
being a ZWJ compound, ‘woman astronaut’, formed from U+1F469 (woman), U+200D, and U+1F680 (rocket). But your system, browser, and font may display that just as the two separate emoji instead.
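
If you can’t type these characters, you can still assemble them in code. This sketch builds that compound from its three code points in Python; zwj_join is a hypothetical helper written for this example, and whether the result is drawn as one glyph still depends on the font and renderer.

```python
ZWJ = "\u200d"  # zero-width joiner

def zwj_join(*parts):
    """Join the given characters with a ZWJ between each adjacent pair."""
    return ZWJ.join(parts)

woman = "\U0001F469"   # U+1F469
rocket = "\U0001F680"  # U+1F680

astronaut = zwj_join(woman, rocket)

# One visible emoji (font permitting), but three code points underneath.
print(len(astronaut), [f"U+{ord(ch):04X}" for ch in astronaut])
```

Printing astronaut itself in a terminal or notebook shows either the single compound or the two emoji side by side, which is a quick test of what your own system supports.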

If you paste that compound emoji into the Characters pane in macOS, it should display the compound character and tell you how it’s constructed, although, unless you have a heavily customised keyboard layout, you can’t enter any of those three constituent characters from your keyboard, let alone the entire compound.

There’s an extensive listing of most of the more popular ZWJ compounds here on Emojipedia, each with a link to explain its constituents in full, so you can have fun with them at home.

Another couple of examples illustrate how complex these can become:
‘man facepalming’ 🤦🏾‍♂️ is made up from U+1F926 + U+1F3FE + U+200D + U+2642 + U+FE0F, and
a man, a woman, and a girl (U+1F468 + U+200D + U+1F469 + U+200D + U+1F467) compound into a family with mother, father and daughter 👨‍👩‍👧.
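
Going the other way, a compound can be taken apart by walking through its code points. The sketch below does that in Python for the two examples; constituents is a name invented here, and it assumes the second code point of ‘man facepalming’ is the skin tone modifier U+1F3FE and that the family is man, ZWJ, woman, ZWJ, girl.

```python
import unicodedata

def constituents(text):
    """List each code point in a string with its official Unicode name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# Assumed code point sequences for the two compounds described above.
facepalm = "\U0001F926\U0001F3FE\u200d\u2642\ufe0f"
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"

for label, emoji in (("man facepalming", facepalm), ("family", family)):
    print(label)
    for code, name in constituents(emoji):
        print(f"  {code}  {name}")
```

The names returned depend on the version of the Unicode database shipped with your Python, so an older interpreter may report some of the newer emoji as unnamed.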

These ZWJ compounds are different from those emoji that can be modified for skin colour, known sometimes as the Fitzpatrick types, from the eponymous scale used to classify skin tone. Skin tone modifiers follow the base emoji directly, with no ZWJ between them. For example, these show the basic ‘girl’ emoji U+1F467 in the following variants:
👧 👧🏻 👧🏼 👧🏽 👧🏾 👧🏿
which are made by following ‘girl’ U+1F467 with no modifier, and then with each of the five modifiers U+1F3FB to U+1F3FF.
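
Those variants are simple to generate, since a modifier is just appended to the base emoji. This is a small Python sketch; skin_tone_variants and the constant names are made up for the illustration.

```python
GIRL = "\U0001F467"

# The five skin tone modifiers, U+1F3FB to U+1F3FF inclusive.
SKIN_TONE_MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]

def skin_tone_variants(base):
    """Return the unmodified emoji followed by its five skin tone variants."""
    return [base] + [base + modifier for modifier in SKIN_TONE_MODIFIERS]

variants = skin_tone_variants(GIRL)
for variant in variants:
    # Show the code points rather than the glyph, as rendering varies.
    print(" + ".join(f"U+{ord(ch):04X}" for ch in variant))
```

Not every emoji accepts a modifier, so applying this to an arbitrary base character may simply show the emoji followed by a coloured swatch.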

ZWNJ and ligatures

Some combinations of Unicode characters give rise to ligatures, compound characters that are typeset as one. Sometimes two characters that would normally be joined in this way have to be set next to one another but kept separate. In those cases, place the zero-width non-joiner ZWNJ U+200C between them to prevent ligature formation. This can appear in Persian and other scripts, and is illustrated in Wikipedia.
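
In code, that amounts to inserting U+200C at the point where the join should be suppressed. Here is a minimal Python sketch using an English example of my own choosing, ‘shelfful’, where a font might otherwise form an ‘ff’ ligature across the boundary between ‘shelf’ and ‘ful’; split_ligature is a hypothetical helper, and whether a renderer honours the ZWNJ depends on the font and layout engine.

```python
ZWNJ = "\u200c"  # zero-width non-joiner

def split_ligature(word, index):
    """Insert a ZWNJ before the character at index, keeping its neighbours apart."""
    return word[:index] + ZWNJ + word[index:]

separated = split_ligature("shelfful", 5)

# The text looks the same length on screen but has gained one code point.
print(len("shelfful"), len(separated))
```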

ZWS and line breaks

When typesetting text into lines it’s often useful to have indicators of where line breaks, or word wrapping, can be made without hyphenation. These can be inserted into Unicode text using the zero-width space ZWS, U+200B. Wikipedia explains them further, and gives markup equivalents as well, such as <wbr> in HTML.
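
A common use is in long paths or URLs, which otherwise give a layout engine nowhere to wrap. This Python sketch adds a ZWS after each slash; allow_breaks_after is a name invented for the example, and the choice of the slash as the break point is only an assumption about what suits the text.

```python
ZWS = "\u200b"  # zero-width space

def allow_breaks_after(text, chars="/"):
    """Add a ZWS after every occurrence of the given characters."""
    return "".join(ch + ZWS if ch in chars else ch for ch in text)

path = "/Library/Application Support/Example/Settings"
breakable = allow_breaks_after(path)

# Four slashes, so four invisible break opportunities have been added.
print(len(path), len(breakable))
```

Remember to strip the ZWS again before using such a string as a real path, as the extra characters are part of the text even though they can’t be seen.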

WJ and joining words

Just as the ZWS marks where line breaks can be made, the word joiner WJ, U+2060, does the exact opposite and invisibly shows where line breaks shouldn’t occur. In this role it replaces the zero-width no-break space ZWNBSP, U+FEFF, which is now intended for use only as the byte order mark BOM.
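
Older text may still use U+FEFF in the middle of a string to prevent a break. One way to bring it up to date is sketched below in Python: a U+FEFF at the very start is assumed to be a byte order mark and left alone, and any later one is swapped for the WJ. The function name modernise_zwnbsp is made up for this example.

```python
WJ = "\u2060"   # word joiner
BOM = "\ufeff"  # byte order mark, formerly also the zero-width no-break space

def modernise_zwnbsp(text):
    """Keep a leading U+FEFF as a BOM; replace any later U+FEFF with the WJ."""
    if text.startswith(BOM):
        return BOM + text[1:].replace(BOM, WJ)
    return text.replace(BOM, WJ)

legacy = "\ufeff100\ufeffkm"
updated = modernise_zwnbsp(legacy)
print([f"U+{ord(ch):04X}" for ch in updated if ch in (WJ, BOM)])
```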

ZWNBSP and byte order

It’s sometimes useful to preface a string of Unicode with an indicator that it’s Unicode, which also informs software of the byte order (endianness) used in the string, and the type of encoding used, such as UTF-16. To do this, the first character in the string is U+FEFF, formerly known as the zero-width no-break space ZWNBSP (a role now taken by the WJ instead). This is fully explained in Wikipedia.
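
Because U+FEFF encodes to a different byte pattern in each encoding, software can inspect the first few bytes to guess how to read the rest. The following Python sketch uses the BOM constants from the standard codecs module; detect_bom is a hypothetical helper, not a library function, and it can only report what the leading bytes suggest.

```python
import codecs

# UTF-32 must be checked before UTF-16, because the UTF-32 LE mark
# begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data):
    """Return the encoding implied by a leading BOM, or None if there isn't one."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None

for encoding in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    data = "\ufeffhello".encode(encoding)
    print(encoding, data[:4].hex(), detect_bom(data))
```

Data without a BOM returns None, and it’s then up to the software to fall back on some other way of deciding the encoding.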

Use in spoofing and exploits

Under ICANN rules, domain names aren’t allowed to contain characters that aren’t visible to the reader, so fears of being hijacked by spoofed domain names should never be realised. But I’m sure malicious minds can come up with other uses for them. Meanwhile we can always be creative with compound emoji.
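
If you’re concerned that a string might contain any of the invisible characters covered here, they’re easy to find in code even though they can’t be seen. This is a minimal Python sketch, with find_invisibles as a made-up name; it checks only for the five code points in this article, not for every invisible character in Unicode.

```python
# The five invisible code points covered in this article.
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def find_invisibles(text):
    """Return (position, code point) for each invisible character found."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if ch in INVISIBLE
    ]

# A hypothetical string that looks like 'example' but hides a ZWS.
suspect = "exam\u200bple"
print(find_invisibles(suspect))
```

Bear in mind that legitimate text, including every ZWJ compound emoji, contains these characters too, so finding one isn’t in itself a sign of anything malicious.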