A Width Check Said the String Was Safe to Cut. It Split a Kanji in Half.

AI-assisted draft.

GyaanSetu Editorial, 18 hours ago, 2 min read

A name entered a terminal table and came out broken. The surname was 𠮷田.

The first character is not the common 吉. It is 𠮷 (U+20BB7). This is a rare form used in real Japanese family names. The table truncated the cell to fit a column. It left behind a broken character.

The bug lived in a single line of code. It was an optimization that decided a string was safe to cut by index.

A JavaScript string has three different lengths:
• Code units (.length): "𠮷".length is 2.
• Code points: [..."𠮷"].length is 1.
• Display width: 𠮷 takes up 2 columns.

For plain ASCII text, these three numbers are all the same. This coincidence makes index-based code look safe.

The character 𠮷 breaks this assumption. It has 2 code units because it is stored as a surrogate pair. It has 2 columns because it is a wide character. The numbers match (2 = 2), but for unrelated reasons: one counts storage units, the other counts screen columns.
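The three measurements can be put side by side in a small sketch. The helper names roughWidth and measure are hypothetical, and roughWidth is a deliberately simplified stand-in for a real width library: it treats only a few Unicode ranges as wide, which is enough for the characters in this article but not a full East Asian Width implementation.

```javascript
// Simplified width estimate, for illustration only: a few East Asian and
// emoji ranges count as 2 columns, everything else as 1. (Hypothetical helper.)
function roughWidth(str) {
  let width = 0;
  for (const ch of str) {                  // iterates by code point
    const cp = ch.codePointAt(0);
    const wide =
      (cp >= 0x1100 && cp <= 0x115f) ||    // Hangul Jamo
      (cp >= 0x2e80 && cp <= 0xa4cf) ||    // CJK radicals through Yi (rough)
      (cp >= 0xac00 && cp <= 0xd7a3) ||    // Hangul syllables
      (cp >= 0xf900 && cp <= 0xfaff) ||    // CJK compatibility ideographs
      (cp >= 0xff00 && cp <= 0xff60) ||    // fullwidth forms
      (cp >= 0x1f300 && cp <= 0x1faff) ||  // common emoji blocks
      (cp >= 0x20000 && cp <= 0x3fffd);    // CJK Extension B and later planes
    width += wide ? 2 : 1;
  }
  return width;
}

// Report all three "lengths" of one string.
function measure(str) {
  return {
    codeUnits: str.length,               // UTF-16 code units
    codePoints: Array.from(str).length,  // Unicode code points
    columns: roughWidth(str),            // estimated terminal columns
  };
}

console.log(measure("abc")); // codeUnits 3, codePoints 3, columns 3
console.log(measure("漢"));  // codeUnits 1, codePoints 1, columns 2
console.log(measure("𠮷"));  // codeUnits 2, codePoints 1, columns 2
```

Only the last line shows code units and columns agreeing while the code point count disagrees, which is the case the rest of the article is about.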

The library cli-table3 used a fast path: If code unit length equals display width, then cut the string by index.

This worked for years because common Japanese characters like 漢 have a length of 1 and a width of 2. They never hit the fast path.

Beyond plain ASCII, the fast path only triggers for characters like 𠮷 or many emoji. These are stored as 2 code units and displayed 2 columns wide, so length equals width. The code then treats every code unit as one column and cuts by index, which can land between the two halves of a surrogate pair. That leaves a lone surrogate behind, which a terminal typically renders as a broken box or replacement character.
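The failure can be reproduced with a small sketch. This is a hypothetical reconstruction of the kind of optimization described above, not the source of cli-table3; truncateNaive and roughWidth are illustrative names, and roughWidth is a simplified width estimate rather than a real width library.

```javascript
// Rough column width: 2 for a few wide ranges, 1 otherwise (illustrative only).
function roughWidth(str) {
  let width = 0;
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    const wide =
      (cp >= 0x2e80 && cp <= 0xa4cf) ||    // CJK radicals through Yi (rough)
      (cp >= 0x1f300 && cp <= 0x1faff) ||  // common emoji blocks
      (cp >= 0x20000 && cp <= 0x3fffd);    // CJK Extension B and later planes
    width += wide ? 2 : 1;
  }
  return width;
}

// Hypothetical flawed truncation: measures in columns, cuts in code units.
function truncateNaive(str, columns) {
  if (str.length === roughWidth(str)) {
    // Fast path: assumes "length equals width" means one code unit per column.
    return str.slice(0, columns);
  }
  // Slow path: add whole code points while they still fit.
  let out = "";
  let used = 0;
  for (const ch of str) {
    const w = roughWidth(ch);
    if (used + w > columns) break;
    out += ch;
    used += w;
  }
  return out;
}

console.log(truncateNaive("abcdef", 3)); // "abc": fast path, correct
console.log(truncateNaive("漢字", 3));   // "漢": slow path, correct

// "𠮷𠮷" has 4 code units and 4 columns, so it takes the fast path.
const broken = truncateNaive("𠮷𠮷", 3);
console.log(broken === "\uD842\uDFB7\uD842"); // true: ends in a lone high surrogate
```

The common kanji never reach the index-based cut, so ordinary Japanese test data passes while 𠮷 does not.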

To fix this, you must:

• Guard the fast path to exclude surrogate pairs.
• Trim by code points instead of code units.

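Using Array.from(str) helps because it iterates by code point. This ensures you never cut a surrogate pair in half. Sequences built from several code points, such as emoji joined with a zero-width joiner, can still be split and need grapheme-aware handling.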
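Both parts of the fix can be sketched together. This is a minimal illustration under stated assumptions, not the patch that landed in any library: truncateSafe and roughWidth are hypothetical names, roughWidth is a simplified width estimate, and grapheme clusters are out of scope here.

```javascript
// Rough column width: 2 for a few wide ranges, 1 otherwise (illustrative only).
function roughWidth(str) {
  let width = 0;
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    const wide =
      (cp >= 0x2e80 && cp <= 0xa4cf) ||    // CJK radicals through Yi (rough)
      (cp >= 0x1f300 && cp <= 0x1faff) ||  // common emoji blocks
      (cp >= 0x20000 && cp <= 0x3fffd);    // CJK Extension B and later planes
    width += wide ? 2 : 1;
  }
  return width;
}

// Truncate to at most `columns` columns without splitting a surrogate pair.
function truncateSafe(str, columns) {
  const codePoints = Array.from(str);

  // Guarded fast path: index slicing is only used when no code point needs
  // two code units AND every character is one column wide.
  const noSurrogatePairs = codePoints.length === str.length;
  if (noSurrogatePairs && str.length === roughWidth(str)) {
    return str.slice(0, columns);
  }

  // Otherwise measure and cut in the same unit: whole code points.
  let out = "";
  let used = 0;
  for (const ch of codePoints) {
    const w = roughWidth(ch);
    if (used + w > columns) break;
    out += ch;
    used += w;
  }
  return out;
}

console.log(truncateSafe("abcdef", 3)); // "abc"
console.log(truncateSafe("𠮷𠮷", 3));   // "𠮷" (the second one does not fit)
console.log(truncateSafe("𠮷田", 2));   // "𠮷"
```

When the result has to respect user-perceived characters as well, the loop could iterate over segments from Intl.Segmenter instead of code points.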

The lesson is simple: Never measure by one unit and cut by another. If you measure display width or code points, you must cut using those same units.

Test your code with a CJK Extension B character or an emoji. ASCII will never reveal this bug.
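One way to turn that advice into a repeatable check is to scan truncation results for lone surrogates. The sketch below is an assumption about how such a test might look; hasLoneSurrogate, findBrokenCuts, and byIndex are hypothetical names, and byIndex is a deliberately naive cut used only to show the check firing.

```javascript
// True if the string contains a surrogate code unit without its partner.
function hasLoneSurrogate(str) {
  for (const ch of str) {
    // String iteration yields a valid pair as one element and a lone
    // surrogate as its own element, so a surrogate code point means trouble.
    const cp = ch.codePointAt(0);
    if (cp >= 0xd800 && cp <= 0xdfff) return true;
  }
  return false;
}

// Run a truncate function over every sample and column count,
// collecting the cases whose output contains a lone surrogate.
function findBrokenCuts(truncate, samples, maxColumns) {
  const failures = [];
  for (const input of samples) {
    for (let cols = 0; cols <= maxColumns; cols++) {
      if (hasLoneSurrogate(truncate(input, cols))) {
        failures.push({ input, cols });
      }
    }
  }
  return failures;
}

// Deliberately naive cut by code unit index.
const byIndex = (str, cols) => str.slice(0, cols);

// ASCII-only samples report nothing, even though the cut is unsafe.
console.log(findBrokenCuts(byIndex, ["abc", "hello"], 4).length); // 0

// A CJK Extension B character and an emoji expose the problem.
console.log(findBrokenCuts(byIndex, ["𠮷田", "😀😀"], 4));
```

The same samples can be fed to a real truncation function in place of byIndex; an empty result means no cut landed inside a surrogate pair for those inputs.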

Source: https://dev.to/greymothjp/a-width-check-said-the-string-was-safe-to-cut-it-split-a-kanji-in-half-4hjk

Optional learning community: https://greymoth-jp.github.io/cjk-failure-corpus/
