Tokenization Under the Hood

📅3 hours ago⏱2 min read


You deploy a chatbot. An English query uses 42 tokens. A Spanish user sends a comparable query and it uses 103 tokens, about 2.5 times as many. Suddenly, your API costs jump 40%.

This happens when you treat tokenization as invisible plumbing. Most large language models rely on one of four subword tokenization approaches. The one your model uses determines vocabulary size, per-language efficiency, and your monthly bill.

Tokenization controls three critical things:

Inference cost. LLM APIs charge by token. A small vocabulary might break one word into 8 tokens. A large vocabulary handles it in 3. This difference costs real money at scale.
Vocabulary coverage. A vocabulary that covers a language poorly splits its text into more tokens. Longer sequences mean slower generation, higher costs, and less room in the context window.
Model behavior. If a tokenizer splits "cowboy" into ["cow", "boy"], the model learns differently than if it splits it into ["c", "owb", "oy"].
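
To make the inference-cost point concrete, here is a minimal sketch of the arithmetic. The function name monthly_token_cost, the traffic volume, and the price per million tokens are all hypothetical placeholders, not figures from any real provider.

```python
def monthly_token_cost(tokens_per_word: int, words_per_month: int,
                       usd_per_million_tokens: float) -> float:
    """Estimate monthly spend in USD from an average tokens-per-word rate."""
    total_tokens = tokens_per_word * words_per_month
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical workload: 100M words per month at $2.50 per million tokens.
WORDS_PER_MONTH = 100_000_000
PRICE = 2.50

small_vocab_cost = monthly_token_cost(8, WORDS_PER_MONTH, PRICE)  # 8 tokens per word
large_vocab_cost = monthly_token_cost(3, WORDS_PER_MONTH, PRICE)  # 3 tokens per word

print(f"small vocabulary: ${small_vocab_cost:,.2f}")
print(f"large vocabulary: ${large_vocab_cost:,.2f}")
```

The same traffic costs 8/3 as much under the small vocabulary, whatever the actual price is.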

Here is how the four main types work:

BPE (Byte Pair Encoding)

How it works: It starts with characters. It counts frequent adjacent pairs and merges them into new tokens. It repeats this until it reaches a target size.
Pros: Fast and deterministic.
Users: GPT-4o, Llama 3, Mistral.
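
The merge loop can be sketched in a few lines. This is a toy illustration on a word list, not the implementation used by any of the models above: the names train_bpe and merge_pair are made up for this example, and ties between equally frequent pairs are broken by comparing the pairs themselves, which is an arbitrary choice.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words: list, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start from characters: each distinct word becomes a tuple of single-character symbols.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; the pair itself breaks ties so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


print(train_bpe(["low", "low", "lower", "lowest"], num_merges=2))
```

Production tokenizers add a pre-tokenization step and often operate on bytes rather than characters, but the count-and-merge loop is the same idea.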

WordPiece

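How it works: Similar to BPE, but it scores candidate merges by likelihood instead of raw frequency. It picks the merge that most increases the probability of the training data, which favors pairs whose parts rarely appear apart.
Pros: Tends to create more linguistically meaningful tokens.
Users: BERT and its derivatives, such as DistilBERT and ELECTRA.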
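
A common way to describe the WordPiece criterion is the score count(ab) / (count(a) * count(b)). The sketch below computes that score on a toy corpus and compares it with plain pair frequency. The function name wordpiece_scores is hypothetical, the "##" continuation prefix used by BERT is left out for brevity, and the exact formula used in Google's original training code is an assumption here.

```python
from collections import Counter


def wordpiece_scores(words: list) -> tuple:
    """Return (scores, pair_freq) for all adjacent character pairs in `words`."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for word, freq in Counter(words).items():
        symbols = list(word)
        for s in symbols:
            symbol_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    # Likelihood-style score: pair count normalized by the counts of its two parts.
    scores = {
        (a, b): f / (symbol_freq[a] * symbol_freq[b])
        for (a, b), f in pair_freq.items()
    }
    return scores, pair_freq


corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
scores, pair_freq = wordpiece_scores(corpus)

most_frequent = max(pair_freq, key=pair_freq.get)  # what a BPE-style count would pick
best_scored = max(scores, key=scores.get)          # what the normalized score picks
print(most_frequent, best_scored)
```

On this corpus the most frequent pair is ("u", "g"), but the normalized score prefers ("g", "s"), because "s" never appears without a preceding "g".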

SentencePiece

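How it works: SentencePiece is a tokenizer library rather than a separate algorithm. It trains either BPE or Unigram models directly on raw text, treating the input as a stream of Unicode characters and encoding whitespace as an ordinary symbol ("▁"). It does not need a pre-tokenization step like splitting on spaces.
Pros: Well suited to multilingual support because it is language-agnostic, and detokenization is lossless.
Users: Llama 2, Gemma. (Llama 3 switched to a tiktoken-style byte-level BPE tokenizer.)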
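
The whitespace handling is easy to illustrate without the library. This is a simplified sketch, not SentencePiece's actual code: the helper names are invented, normalization is skipped, and text that already contains "▁" is not handled.

```python
WS = "\u2581"  # "▁", the meta symbol that stands in for a space


def escape_whitespace(text: str) -> str:
    """Mark every space, plus the start of the text, with the meta symbol."""
    return WS + text.replace(" ", WS)


def to_pieces(escaped: str) -> list:
    """Split into character-level pieces (the state before any merges are learned)."""
    return list(escaped)


def detokenize(pieces: list) -> str:
    """Invert the escaping: join the pieces, restore spaces, drop the leading one."""
    return "".join(pieces).replace(WS, " ")[1:]


for sample in ["Hello world", "こんにちは世界"]:
    pieces = to_pieces(escape_whitespace(sample))
    assert detokenize(pieces) == sample  # round trip is lossless
    print(pieces)
```

Because spaces are just another symbol, the same pipeline works for text with no spaces at all, and joining the pieces always recovers the original string.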

Unigram

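How it works: It starts with a huge candidate vocabulary and prunes it down using a unigram language model, repeatedly dropping the tokens whose removal hurts the likelihood of the training data least. At tokenization time it picks the most probable segmentation path.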
Pros: More consistent token-to-meaning mapping.
Users: T5, XLNet.
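
Picking the best segmentation path is a small dynamic program. The sketch below assumes a trained vocabulary is already available as a dict of token log-probabilities; the function name viterbi_segment and the toy probabilities are invented for illustration.

```python
import math


def viterbi_segment(text: str, log_probs: dict) -> list:
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for the prefix text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Toy vocabulary: two whole-word pieces plus single-character fallbacks.
toy_log_probs = {"cow": math.log(0.1), "boy": math.log(0.1)}
toy_log_probs.update({ch: math.log(0.05) for ch in "cowby"})

print(viterbi_segment("cowboy", toy_log_probs))
```

With these probabilities the path "cow" + "boy" scores 0.1 * 0.1, which beats any path through single characters, so the longer pieces win.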

Key Takeaways for Developers:

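Watch your language mix. Tokenizers that pre-split on spaces struggle with languages that do not separate words with spaces, such as Japanese, Chinese, or Thai. Languages that are rare in the tokenizer's training data, such as Hindi in many English-centric vocabularies, also break into many more tokens. For global products, prefer a tokenizer trained on multilingual raw text, for example with SentencePiece.
Pin your versions. Moving from cl100k_base to o200k_base changes your token counts. Always track which encoding you use in evaluations.
Benchmark correctly. Raw token counts are not comparable between model families, because each family defines a token differently. Normalize by character or byte count, for example tokens per byte, to keep comparisons accurate.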
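
A minimal sketch of a byte-normalized comparison follows. The helper names tokens_per_byte and compare are hypothetical, and the two "tokenizers" are toy stand-ins (whitespace splitting and character splitting); in practice you would pass each model's real encode function and record its pinned encoding name alongside the result.

```python
from typing import Callable, Dict, Sequence


def tokens_per_byte(text: str, tokenize: Callable[[str], Sequence]) -> float:
    """Tokens produced per UTF-8 byte of input; lower means better compression."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    return len(tokenize(text)) / n_bytes


def compare(text: str, tokenizers: Dict[str, Callable[[str], Sequence]]) -> Dict[str, float]:
    """Score several tokenizers on the same text, keyed by their recorded encoding name."""
    return {name: tokens_per_byte(text, fn) for name, fn in tokenizers.items()}


# Toy stand-ins for real encoders; the keys play the role of pinned encoding names.
toy_tokenizers = {
    "whitespace-v1": str.split,
    "characters-v1": list,
}

report = compare("hola mundo", toy_tokenizers)
for name, ratio in report.items():
    print(f"{name}: {ratio:.2f} tokens per byte")
```

Keying the report by encoding name makes a silent tokenizer change show up as a new row instead of an unexplained shift in the numbers.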

Understanding these tools helps you ship cost-effective products instead of surprising your finance department.

Source: https://dev.to/tech_nuggets/tokenization-under-the-hood-bpe-wordpiece-sentencepiece-and-unigram-compared-4ca5

Additional learning community: https://t.me/GyaanSetuAi
