A Width Check Broke a Kanji

A name went into a terminal table and came out broken. The surname was 𠮷田.

The first character is not the common 吉. It is 𠮷 (U+20BB7). This is a rare form used in real Japanese family names. The table truncated the cell to fit a column. Instead of a name, it printed a broken character. The kanji was split in half.

The bug lived in a one-line shortcut. Before truncating, the code checked whether a string was safe to cut by index: if its length equalled its display width, it took a fast path. That check failed because of how JavaScript measures strings.

A JavaScript string has three different lengths:

  • Code unit length: "𠮷".length is 2. This counts UTF-16 units.
  • Code point count: [..."𠮷"].length is 1. This counts Unicode code points.
  • Display width: 𠮷 takes 2 columns in a terminal.
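
The first two lengths can be checked directly in any JavaScript runtime. The sketch below is only an illustration. The variable names are made up, and display width is left out because core JavaScript has no built-in for it.

```javascript
// "\u{20BB7}" is 𠮷, written as an escape so file encoding does not matter.
const rare = "\u{20BB7}";

// Code unit length: UTF-16 units. This is what .length and slice() use.
const codeUnits = rare.length; // 2

// Code point count: the spread operator iterates by code point.
const codePoints = [...rare].length; // 1

// The two code units form a surrogate pair: a high half and a low half.
const unitsHex = [rare.charCodeAt(0), rare.charCodeAt(1)].map((u) =>
  u.toString(16)
); // ["d842", "dfb7"]

// For ASCII the two counts coincide.
const asciiUnits = "abc".length; // 3
const asciiPoints = [..."abc"].length; // 3

console.log({ codeUnits, codePoints, unitsHex, asciiUnits, asciiPoints });
```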

For plain ASCII text, these numbers are the same. "abc" has 3 units, 3 points, and 3 columns. Much code assumes this coincidence is a rule.

The character 𠮷 breaks that rule. It has 2 code units and 2 columns. The numbers match, but for different reasons. The code saw 2 equals 2 and used a fast path to cut the string by index.

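A cut by index can land inside a surrogate pair. In a string of two such characters, for example, cutting at index 3 keeps the first character whole and only the first half of the second. That leaves a lone surrogate behind, which terminals typically show as a broken box.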
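
The failure can be reproduced in a few lines. This is a sketch of the pattern, not the original code: roughWidth and naiveTruncate are hypothetical names, and the width estimate is a crude assumption that only covers a few CJK and emoji ranges.

```javascript
// Crude width estimate (assumption for illustration only):
// a few CJK and emoji ranges count as 2 columns, everything else as 1.
function roughWidth(str) {
  let width = 0;
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    const wide =
      (cp >= 0x2e80 && cp <= 0x9fff) || // CJK radicals, kana, ideographs
      (cp >= 0xf900 && cp <= 0xfaff) || // CJK compatibility ideographs
      (cp >= 0xff00 && cp <= 0xff60) || // fullwidth forms
      (cp >= 0x1f300 && cp <= 0x1faff) || // many emoji
      (cp >= 0x20000 && cp <= 0x3fffd); // supplementary ideographs
    width += wide ? 2 : 1;
  }
  return width;
}

// Hypothetical buggy truncation: if length equals width,
// assume one index is one column and cut by code unit.
function naiveTruncate(str, columns) {
  if (str.length === roughWidth(str)) {
    return str.slice(0, columns); // fast path, can split a surrogate pair
  }
  // Slow path: append whole code points while they still fit.
  let out = "";
  let used = 0;
  for (const ch of str) {
    const w = roughWidth(ch);
    if (used + w > columns) break;
    out += ch;
    used += w;
  }
  return out;
}

// Two copies of 𠮷: 4 code units and 4 columns, so the fast path is taken.
const brokenCut = naiveTruncate("\u{20BB7}\u{20BB7}", 3);
const lastUnitHex = brokenCut.charCodeAt(brokenCut.length - 1).toString(16);

// 漢字: 2 code units but 4 columns, so the slow path is taken.
const safeCut = naiveTruncate("\u6F22\u5B57", 3);

console.log(brokenCut.length, lastUnitHex, safeCut);
```

The last code unit of brokenCut is a high surrogate with no partner, which is the half character the terminal cannot draw.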

Common Japanese characters like 漢 are safe. They have 1 code unit and 2 columns. Since 1 does not equal 2, the code avoids the broken shortcut. The bug only hits characters that take two code units and two columns, such as rare kanji and many emoji.

To fix this, you must:

  • Guard the fast path to reject strings with high surrogates.
  • Trim by whole code points instead of code units.
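
The guard from the first step might look like the sketch below. canCutByIndex is a hypothetical name, and the display width is assumed to come from whatever width function the caller already uses. It only closes the surrogate case described here.

```javascript
// Matches a high surrogate code unit. Without the `u` flag the regex
// works per code unit, so it also matches inside a valid pair.
const HIGH_SURROGATE = /[\uD800-\uDBFF]/;

// Hypothetical guard: allow the index fast path only when the length
// matches the display width AND the string has no surrogate pairs.
function canCutByIndex(str, displayWidth) {
  return str.length === displayWidth && !HIGH_SURROGATE.test(str);
}

console.log(canCutByIndex("abc", 3)); // ASCII: fast path allowed
console.log(canCutByIndex("\u{20BB7}", 2)); // 𠮷: lengths match, but rejected
console.log(canCutByIndex("\u6F22", 2)); // 漢: lengths differ, rejected
```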

Using Array.from(str) fixes this because it iterates by code point. A surrogate pair stays together as one element, so a cut can never fall inside it.
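
A minimal sketch of trimming by whole code points follows. truncateToColumns and widthOf are illustrative names, and widthOf is a placeholder assumption that treats only CJK unified ideographs and supplementary ideographs as wide. A real table would plug in a proper width function.

```javascript
// Append whole code points until the next one would overflow the column budget.
function truncateToColumns(str, columns, widthOf) {
  let out = "";
  let used = 0;
  for (const ch of Array.from(str)) {
    const w = widthOf(ch);
    if (used + w > columns) break;
    out += ch;
    used += w;
  }
  return out;
}

// Placeholder width function (assumption, not a full width table).
const widthOf = (ch) => {
  const cp = ch.codePointAt(0);
  const wide =
    (cp >= 0x4e00 && cp <= 0x9fff) || (cp >= 0x20000 && cp <= 0x3fffd);
  return wide ? 2 : 1;
};

const surname = "\u{20BB7}\u7530"; // 𠮷田: 3 code units, 2 code points, 4 columns
const trimmed = truncateToColumns(surname, 3, widthOf);

console.log(Array.from(surname).length, trimmed === "\u{20BB7}");
```

With a budget of 3 columns the result is the whole first kanji. The second one does not fit and is dropped entirely rather than halved.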

The lesson is simple: never measure by one unit and cut by another. If you measure display width but cut by code unit index, you will break your users' data.

Test your code with a rare CJK character or an emoji. ASCII will not show you these errors. You must provide the input your code is afraid of.
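
One way to automate that check is to cut each test input at every possible position and look for lone surrogates in the result. The sketch below assumes a truncation function that takes a string and a count. findBrokenCuts and truncateByCodePoint are hypothetical helpers, not part of any library.

```javascript
// With the `u` flag the regex matches by code point, so a valid
// surrogate pair does not match. Only a lone surrogate does.
const LONE_SURROGATE = /[\uD800-\uDFFF]/u;

// Inputs: ASCII, common kanji (漢字), the rare surname (𠮷田), two emoji.
const inputs = ["abc", "\u6F22\u5B57", "\u{20BB7}\u7530", "\u{1F600}\u{1F600}"];

// Try every cut position and collect the ones that leave a broken character.
function findBrokenCuts(truncate) {
  const failures = [];
  for (const input of inputs) {
    for (let n = 0; n <= input.length; n++) {
      if (LONE_SURROGATE.test(truncate(input, n))) {
        failures.push([input, n]);
      }
    }
  }
  return failures;
}

// Cutting by code unit index versus cutting by code point.
const truncateByCodePoint = (s, n) => Array.from(s).slice(0, n).join("");
const sliceFailures = findBrokenCuts((s, n) => s.slice(0, n));
const codePointFailures = findBrokenCuts(truncateByCodePoint);

console.log(sliceFailures.length, codePointFailures.length);
```

The ASCII and common kanji inputs never fail under either function. Only the rare kanji and the emoji expose the cut by code unit.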

Source: https://dev.to/greymothjp/a-width-check-said-the-string-was-safe-to-cut-it-split-a-kanji-in-half-4hjk

Optional learning community: https://greymoth-jp.github.io/cjk-failure-corpus/