A Width Check Said the String Was Safe to Cut. It Split a Kanji in Half.
A name entered a terminal table and came out broken. The surname was 𠮷田.
The first character is not the common 吉. It is 𠮷 (U+20BB7). This is a rare form used in real Japanese family names. The table truncated the cell to fit a column. It left behind a broken character.
The bug lived in a single line of code. It was an optimization that decided a string was safe to cut by index.
A JavaScript string has three different lengths:
- Code units (.length): "𠮷".length is 2.
- Code points: [..."𠮷"].length is 1.
- Display width: 𠮷 takes up 2 columns.
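The three lengths can be printed side by side. This is a minimal sketch: charWidth, displayWidth, and threeLengths are hypothetical helpers, and the width table covers only a few ranges for illustration. Real code would more likely take widths from a library such as string-width.

```javascript
// Hypothetical helper: a tiny width table for illustration only,
// not a full East Asian Width implementation.
function charWidth(ch) {
  const cp = ch.codePointAt(0);
  const wide =
    (cp >= 0x3040 && cp <= 0x30ff) ||   // Hiragana, Katakana
    (cp >= 0x4e00 && cp <= 0x9fff) ||   // CJK Unified Ideographs
    (cp >= 0x1f300 && cp <= 0x1faff) || // common emoji blocks
    (cp >= 0x20000 && cp <= 0x3fffd);   // CJK Extension B and later
  return wide ? 2 : 1;
}

// Sum of column widths, iterating by code point (for...of does this).
function displayWidth(str) {
  let total = 0;
  for (const ch of str) total += charWidth(ch);
  return total;
}

function threeLengths(str) {
  return {
    codeUnits: str.length,       // UTF-16 code units
    codePoints: [...str].length, // Unicode code points
    columns: displayWidth(str),  // terminal columns (approximate)
  };
}

console.log(threeLengths("abc")); // { codeUnits: 3, codePoints: 3, columns: 3 }
console.log(threeLengths("漢"));  // { codeUnits: 1, codePoints: 1, columns: 2 }
console.log(threeLengths("𠮷"));  // { codeUnits: 2, codePoints: 1, columns: 2 }
```

Only the last line has code units equal to columns while the code point count differs.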
For plain ASCII text, all three numbers are the same. This coincidence makes code that mixes them look safe.
The character 𠮷 is the trap. It has 2 code units because it is a surrogate pair. It has 2 columns because it is a wide character. The numbers match (2 = 2), but for unrelated reasons, and the character is still a single code point.
The library cli-table3 used a fast path: if the code unit length equals the display width, cut the string by index. The assumption was that equal numbers mean every code unit is exactly one column wide.
This worked for years because common Japanese characters like 漢 have a length of 1 and a width of 2. They never hit the fast path.
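Among wide characters, the fast path only triggers for those outside the Basic Multilingual Plane, such as 𠮷 or most emoji. These characters have a length of 2 and a width of 2, so the check passes and the code treats each code unit as one column. Cutting by index can then land between the two halves of a surrogate pair. This leaves a lone surrogate behind. A lone surrogate is not a valid character, which is why the terminal shows a broken box.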
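The failure is easy to reproduce. The sketch below is a hypothetical reconstruction of the pattern, not cli-table3's actual source. naiveTruncate, charWidth, and displayWidth are made-up names, and the width table is a toy.

```javascript
// Hypothetical helper: a tiny width table for illustration only.
function charWidth(ch) {
  const cp = ch.codePointAt(0);
  const wide =
    (cp >= 0x3040 && cp <= 0x30ff) ||   // Hiragana, Katakana
    (cp >= 0x4e00 && cp <= 0x9fff) ||   // CJK Unified Ideographs
    (cp >= 0x1f300 && cp <= 0x1faff) || // common emoji blocks
    (cp >= 0x20000 && cp <= 0x3fffd);   // CJK Extension B and later
  return wide ? 2 : 1;
}

function displayWidth(str) {
  let total = 0;
  for (const ch of str) total += charWidth(ch);
  return total;
}

// Flawed on purpose: measures in columns, cuts in code units.
function naiveTruncate(str, maxColumns) {
  if (str.length === displayWidth(str)) {
    // Fast path: assumes one code unit per column.
    return str.slice(0, maxColumns);
  }
  // Slow path: walk code points and stop before exceeding the budget.
  let out = "";
  let used = 0;
  for (const ch of str) {
    const w = charWidth(ch);
    if (used + w > maxColumns) break;
    out += ch;
    used += w;
  }
  return out;
}

// "𠮷abc" has 5 code units and 5 columns, so it takes the fast path.
const broken = naiveTruncate("𠮷abc", 1);
console.log(broken.length);                     // 1
console.log(broken.charCodeAt(0).toString(16)); // d842 (lone high surrogate)

// "漢abc" has 4 code units and 5 columns, so it takes the slow path.
console.log(naiveTruncate("漢abc", 1) === "");  // true
```

The BMP kanji is handled correctly, which is how the fast path survived ordinary Japanese test data.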
To fix this, you must:
- Guard the fast path to exclude surrogate pairs.
- Trim by code points instead of code units.
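Using Array.from(str) helps because it iterates by code point, so a surrogate pair always stays together. It does not keep multi-code-point sequences intact, such as emoji joined with a zero-width joiner. Those need grapheme segmentation, for example with Intl.Segmenter.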
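Both fixes fit in one function. This is a sketch under assumptions, not the patch that shipped: safeTruncate, charWidth, and displayWidth are hypothetical names, and the width table is a toy. With a real width function, dropping the fast path entirely may be the simpler choice.

```javascript
// Hypothetical helper: a tiny width table for illustration only.
function charWidth(ch) {
  const cp = ch.codePointAt(0);
  const wide =
    (cp >= 0x3040 && cp <= 0x30ff) ||   // Hiragana, Katakana
    (cp >= 0x4e00 && cp <= 0x9fff) ||   // CJK Unified Ideographs
    (cp >= 0x1f300 && cp <= 0x1faff) || // common emoji blocks
    (cp >= 0x20000 && cp <= 0x3fffd);   // CJK Extension B and later
  return wide ? 2 : 1;
}

function displayWidth(str) {
  let total = 0;
  for (const ch of str) total += charWidth(ch);
  return total;
}

function safeTruncate(str, maxColumns) {
  // Fix 1: the fast path is only allowed when no surrogates are present.
  const hasSurrogates = /[\uD800-\uDFFF]/.test(str);
  if (!hasSurrogates && str.length === displayWidth(str)) {
    return str.slice(0, maxColumns);
  }
  // Fix 2: measure and cut in the same unit, one code point at a time.
  let out = "";
  let used = 0;
  for (const ch of Array.from(str)) {
    const w = charWidth(ch);
    if (used + w > maxColumns) break;
    out += ch;
    used += w;
  }
  return out;
}

console.log(safeTruncate("hello", 3));        // hel
console.log(safeTruncate("𠮷abc", 1) === ""); // true
console.log(safeTruncate("𠮷abc", 3));        // 𠮷a
console.log(safeTruncate("𠮷田", 2));          // 𠮷
```

When a wide character does not fit in the remaining columns, this version drops it whole instead of emitting half of it.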
The lesson is simple: Never measure by one unit and cut by another. If you measure display width or code points, you must cut using those same units.
Test your code with a CJK Extension B character such as 𠮷, or an emoji outside the Basic Multilingual Plane such as 😀. ASCII will never reveal this bug.
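A small regression check can catch this class of bug. The sketch below assumes a truncate function with the shape (string, limit) and reports every cut that leaves a lone surrogate. hasLoneSurrogate and findBrokenCuts are hypothetical names, and the limit here is a plain count rather than a column width, to keep the example short.

```javascript
// Returns true if the string contains an unpaired surrogate code unit.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair, skip the low surrogate
        continue;
      }
      return true; // high surrogate without a low surrogate
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) return true; // stray low surrogate
  }
  return false;
}

// Tries every limit from 0 to the string length and records broken results.
function findBrokenCuts(truncate, samples) {
  const failures = [];
  for (const input of samples) {
    for (let limit = 0; limit <= input.length; limit++) {
      if (hasLoneSurrogate(truncate(input, limit))) {
        failures.push({ input, limit });
      }
    }
  }
  return failures;
}

const byCodeUnit = (s, n) => s.slice(0, n);
const byCodePoint = (s, n) => Array.from(s).slice(0, n).join("");

const asciiSamples = ["Yoshida", "hello world"];
const astralSamples = ["𠮷田", "𠮷abc", "😀ok"];

console.log(findBrokenCuts(byCodeUnit, asciiSamples).length);   // 0
console.log(findBrokenCuts(byCodeUnit, astralSamples).length);  // 3
console.log(findBrokenCuts(byCodePoint, astralSamples).length); // 0
```

With ASCII samples, the code-unit version passes. It fails only once the samples contain a surrogate pair.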
Optional learning community: https://greymoth-jp.github.io/cjk-failure-corpus/
