How I Added Typo-Tolerant Search With OpenSearch and CJK

Zero-result queries were killing our watch time.

For a year, TopVideoHub used SQLite FTS5 for search. It worked for exact matches. It failed when users made typos.

People searched for "demon slyer" instead of "demon slayer." They added stray spaces in Japanese or Korean titles. Because FTS5 matches exact tokens, a single typo meant zero results. Users would not search again. They just left.

I moved our title search to OpenSearch. Here is how I solved it for a multilingual audience.

The Problem With Standard Fuzziness

Most tutorials tell you to use "fuzziness" to fix typos. This works for English, but it fails for Chinese, Japanese, and Korean (CJK) text.

  • Edit distance is bad for CJK. A typo in CJK often means using the wrong character entirely.
  • Standard fuzziness creates semantic garbage. In CJK a single character often carries a whole word, so 火 (fire) and 水 (water) are one edit apart and fuzzy matching treats them as near-identical.
  • Tokenization is difficult. CJK languages do not use spaces between words.
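
To make the edit-distance point concrete, here is a minimal Python sketch of Levenshtein distance. It is an illustration only, not code from TopVideoHub, and the sample strings (including 鬼滅の刃, the Japanese title of Demon Slayer) are chosen for the example. It shows that an English typo and a swap between two unrelated CJK words both score as a single edit, and that whitespace splitting finds no word boundaries inside a CJK title.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance over Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete a character from a
                    curr[j - 1] + 1,     # insert a character into a
                    prev[j - 1] + cost,  # substitute (free if equal)
                )
            )
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # An English typo: one missing letter.
    print(levenshtein("demon slyer", "demon slayer"))  # 1
    # Two unrelated CJK words: also one edit.
    print(levenshtein("火", "水"))  # 1
    # No spaces inside the title, so whitespace splitting yields one token.
    print("鬼滅の刃".split())  # ['鬼滅の刃']
```

Both pairs score 1, which is why a fuzziness setting tuned for Latin typos would also accept the fire/water swap.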

The Solution: A Multi-Field Approach

I stopped treating all text the same. I created one logical title field that indexes three different ways:

  • Latin and Romaji: I used standard tokenization with fuzziness enabled. I set a prefix length of 1, so the first character must match exactly. The typo "demn" still matches "demon," but "lemon" does not.
  • CJK Text: I used a CJK bigram analyzer and an ICU normalizer. I turned fuzziness OFF. Instead, I required at least 70% of the query's bigrams to match (a minimum-match threshold).
  • Autocomplete: I used an edge-ngram field to power search-as-you-type results.
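
The three-way indexing above could look roughly like the following sketch, written as plain Python dictionaries that would be sent to OpenSearch as JSON. Everything named here is an assumption for illustration: the analyzer and sub-field names (title_cjk, title_autocomplete, autocomplete_edge), the n-gram sizes, and the "AUTO" fuzziness value are not from the original setup. The icu_normalizer character filter also assumes the analysis-icu plugin is installed.

```python
import json


def build_index_body() -> dict:
    """Index settings and mapping: one title field, indexed three ways."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical gram sizes for search-as-you-type.
                    "autocomplete_edge": {
                        "type": "edge_ngram",
                        "min_gram": 2,
                        "max_gram": 15,
                    }
                },
                "analyzer": {
                    "title_cjk": {
                        "type": "custom",
                        "char_filter": ["icu_normalizer"],  # needs analysis-icu
                        "tokenizer": "standard",
                        "filter": ["cjk_width", "lowercase", "cjk_bigram"],
                    },
                    "title_autocomplete": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "autocomplete_edge"],
                    },
                },
            }
        },
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "standard",  # Latin and Romaji path
                    "fields": {
                        "cjk": {"type": "text", "analyzer": "title_cjk"},
                        "autocomplete": {
                            "type": "text",
                            "analyzer": "title_autocomplete",
                            "search_analyzer": "standard",
                        },
                    },
                }
            }
        },
    }


def build_search_query(text: str) -> dict:
    """Fuzzy match on the Latin field, strict bigram match on the CJK field."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Typo tolerance, first character must match exactly.
                    {"match": {"title": {
                        "query": text, "fuzziness": "AUTO", "prefix_length": 1}}},
                    # No fuzziness: require 70% of the bigrams instead.
                    {"match": {"title.cjk": {
                        "query": text, "minimum_should_match": "70%"}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_search_query("demn slayer"), ensure_ascii=False, indent=2))
```

Keeping the two match clauses separate is the point of the design: the fuzzy clause never sees the bigram field, so CJK titles are matched by overlap rather than by edit distance. An autocomplete request would query title.autocomplete on its own.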

Architecture and Data Safety

I kept SQLite as the single source of truth. OpenSearch acts as a fast, rebuildable index.

  • I use PHP to push updates to OpenSearch in bulk batches.
  • I never run indexing during a user request.
  • I run a Python script to check for "drift." It compares the OpenSearch document count with the SQLite row count and flags any mismatch.
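
The batching and drift check can be sketched as follows. This is a hedged illustration in Python rather than the PHP indexer described above, and the videos table, its id and title columns, the batch size, and both function names are assumptions. The payload follows the newline-delimited format of the OpenSearch bulk API (an action line followed by a document line). The index count is injected as a callable so the sketch runs without a cluster; in practice it would come from the index's _count endpoint.

```python
import json
import sqlite3
from typing import Callable, Iterable, Iterator, Tuple


def bulk_payloads(
    rows: Iterable[Tuple[int, str]], index: str, batch_size: int = 500
) -> Iterator[str]:
    """Yield newline-delimited bulk request bodies, batch_size docs each."""
    lines = []
    docs_in_batch = 0
    for video_id, title in rows:
        lines.append(json.dumps({"index": {"_index": index, "_id": str(video_id)}}))
        lines.append(json.dumps({"title": title}, ensure_ascii=False))
        docs_in_batch += 1
        if docs_in_batch == batch_size:
            yield "\n".join(lines) + "\n"  # bulk bodies end with a newline
            lines, docs_in_batch = [], 0
    if lines:
        yield "\n".join(lines) + "\n"


def check_drift(conn: sqlite3.Connection, fetch_index_count: Callable[[], int]) -> int:
    """Return SQLite row count minus index document count (0 means in sync)."""
    sqlite_count = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
    return sqlite_count - fetch_index_count()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE videos (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO videos (id, title) VALUES (?, ?)",
        [(1, "Demon Slayer"), (2, "鬼滅の刃"), (3, "Sample Title")],
    )
    rows = conn.execute("SELECT id, title FROM videos ORDER BY id").fetchall()
    for payload in bulk_payloads(rows, index="videos", batch_size=2):
        print(payload, end="")
    # Pretend the index holds only 2 documents: one row has drifted.
    print("drift:", check_drift(conn, lambda: 2))
```

A nonzero result only says that the counts differ. Because the index is rebuildable from SQLite, the simplest response to drift is to reindex.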

The Results

The change was massive:

  • Zero-result queries dropped from 14% to under 3%.
  • Search sessions became longer because users found what they wanted immediately.
  • Latency remains low at around 40ms.

If you serve a multilingual audience, remember: typo tolerance and CJK matching are two different problems. You need two different solutions.

Source: https://dev.to/ahmet_gedik778845/how-i-added-typo-tolerant-video-title-search-with-opensearch-and-cjk-3e5d