Tokenization Under the Hood



You deploy a chatbot. A typical English query uses 42 tokens. A Spanish user sends a similar query and it uses 103 tokens, about 2.5 times as many. Suddenly, your API costs jump 40%.

This happens when you treat tokenization as invisible plumbing. Most large language models use one of four subword tokenization approaches. The tokenizer behind your model determines vocabulary size, per-language efficiency, and your monthly bill.

Tokenization controls three critical things:

Inference cost. LLM APIs charge by token. A small vocabulary might break one word into 8 tokens. A large vocabulary handles it in 3. This difference costs real money at scale.
Vocabulary coverage. A vocabulary that covers your languages poorly splits text into more pieces. Longer sequences mean slower generation and higher costs.
Model behavior. If a tokenizer splits "cowboy" into ["cow", "boy"], the model learns differently than if it splits it into ["c", "owb", "oy"].
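To make the cost point concrete, here is a minimal sketch of the arithmetic, using the 42 and 103 token figures from the opening scenario. It assumes cost scales linearly with token count. The traffic shares and the function name blended_cost_increase are illustrative assumptions, not figures from any real bill.

```python
def blended_cost_increase(base_tokens, other_tokens, other_share):
    """Relative increase in average tokens per query when `other_share`
    of traffic costs `other_tokens` instead of `base_tokens`."""
    if not 0.0 <= other_share <= 1.0:
        raise ValueError("other_share must be between 0 and 1")
    blended = (1 - other_share) * base_tokens + other_share * other_tokens
    return blended / base_tokens - 1


# Assumption: cost is proportional to token count.
ENGLISH_TOKENS = 42   # from the scenario
SPANISH_TOKENS = 103  # from the scenario

print(f"One Spanish query costs {SPANISH_TOKENS / ENGLISH_TOKENS:.2f}x an English one")

# Hypothetical traffic shares, chosen only to show the trend.
for share in (0.10, 0.25, 0.50):
    increase = blended_cost_increase(ENGLISH_TOKENS, SPANISH_TOKENS, share)
    print(f"{share:.0%} Spanish traffic -> {increase:.0%} more tokens per query")
```

Under these assumptions, a 40% jump does not need a majority of Spanish traffic. A bit more than a quarter of queries is enough.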

Here is how the four main types work:

BPE (Byte Pair Encoding)

How it works: It starts with individual characters (or bytes). It counts adjacent symbol pairs and merges the most frequent pair into a new token. It repeats this until the vocabulary reaches a target size.
Pros: Fast and deterministic.
Users: GPT-4o, Llama 3, Mistral.
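The merge loop can be sketched in a few lines. This is a toy trainer, not any production tokenizer: the corpus is made up, the function names are hypothetical, and ties are broken alphabetically as an assumption to keep the output deterministic.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol tuple with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    # Start from characters: each word is a tuple of single-character symbols.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the alphabetically first pair (assumption).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


# Made-up toy corpus.
toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(train_bpe(toy_words, 3))
```

The learned merge list is the tokenizer. Encoding new text replays the merges in the same order, which is why BPE is deterministic.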

WordPiece

How it works: Similar to BPE but uses likelihood instead of raw frequency. It picks merges that maximize the probability of the training data.
Pros: Creates more linguistically meaningful tokens.
Users: BERT, DistilBERT, ELECTRA.
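The difference from BPE shows up in how a merge is chosen. The sketch below uses a commonly described scoring rule, the pair count divided by the product of the two symbol counts. Treat that formula as an assumption: Google's original trainer was not released. The corpus and function names are hypothetical, and the "##" continuation marker that BERT uses is left out for brevity.

```python
from collections import Counter


def count_symbols_and_pairs(corpus):
    """corpus maps a tuple of symbols to its word frequency."""
    symbol_freq, pair_freq = Counter(), Counter()
    for symbols, freq in corpus.items():
        for s in symbols:
            symbol_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return symbol_freq, pair_freq


def best_bpe_pair(corpus):
    """BPE-style choice: the pair with the highest raw count."""
    _, pair_freq = count_symbols_and_pairs(corpus)
    return min(pair_freq, key=lambda p: (-pair_freq[p], p))


def best_wordpiece_pair(corpus):
    """WordPiece-style choice (assumed rule): count(ab) / (count(a) * count(b))."""
    symbol_freq, pair_freq = count_symbols_and_pairs(corpus)

    def score(pair):
        a, b = pair
        return pair_freq[pair] / (symbol_freq[a] * symbol_freq[b])

    return max(pair_freq, key=lambda p: (score(p), p))


# Made-up corpus: "q" and "u" are rare but always appear together.
toy_corpus = {("t", "h", "e"): 10, ("t", "o"): 8, ("q", "u"): 2}
print("BPE picks:      ", best_bpe_pair(toy_corpus))
print("WordPiece picks:", best_wordpiece_pair(toy_corpus))
```

Raw frequency favors common letters. The normalized score favors symbols that almost never appear apart, which is one reason WordPiece merges can look more linguistically meaningful.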

SentencePiece

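How it works: It treats input as a raw stream of Unicode characters and encodes whitespace as an ordinary symbol (▁). It does not need a pre-tokenization step like splitting on spaces. It is a library rather than a single algorithm: it trains either BPE or Unigram models.
Pros: Best for multilingual support because it is language-agnostic.
Users: Llama 2, Gemma, T5.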
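The whitespace trick is easy to see in isolation. This sketch does not use the sentencepiece library. It only mimics the "▁" marker convention, and the fixed-size chunking is a hypothetical stand-in for a trained model. It also skips the text normalization a real setup would apply.

```python
WS = "\u2581"  # "▁" (U+2581), the marker SentencePiece uses for whitespace


def encode(text, piece_len=3):
    """Turn spaces into ordinary symbols, then cut the text into pieces."""
    # A leading marker plus one marker per space: no separate space-splitting step.
    marked = WS + text.replace(" ", WS)
    # Hypothetical segmentation: fixed-size chunks instead of a learned vocabulary.
    return [marked[i:i + piece_len] for i in range(0, len(marked), piece_len)]


def decode(pieces):
    """Join pieces and turn markers back into spaces."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:]  # drop the space created by the leading marker


for sample in ("Hello world", "こんにちは世界"):
    pieces = encode(sample)
    print(pieces, "->", decode(pieces))
```

Because spaces survive as symbols, decoding is a plain string join. The same code path handles text with and without spaces.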

Unigram

How it works: It starts with a huge vocabulary and prunes it down using a probabilistic model. When encoding, it picks the most probable segmentation path.
Pros: More consistent token-to-meaning mapping.
Users: T5, XLNet (both through SentencePiece).
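Picking the best segmentation path is a small dynamic program. Here is a minimal Viterbi sketch over a hand-written vocabulary. The probabilities are invented and unnormalized, and the pruning step used during training is not shown.

```python
import math


def viterbi_segment(text, log_probs):
    """Return (pieces, score) for the highest-scoring segmentation of `text`."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that path
    best[0] = 0.0
    max_len = max(len(piece) for piece in log_probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]


# Hypothetical toy vocabulary with made-up probabilities.
toy_probs = {"c": 0.05, "o": 0.05, "w": 0.05, "b": 0.05, "y": 0.05,
             "cow": 0.1, "boy": 0.1, "owb": 0.001}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

print(viterbi_segment("cowboy", toy_log_probs))
```

With these numbers the split into "cow" and "boy" beats any path through "owb", because every extra low-probability piece multiplies the score down.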

Key Takeaways for Developers:

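Watch your language mix. Tokenizers that pre-split on spaces struggle with languages written without them, such as Japanese. Vocabularies trained mostly on English can also fragment scripts like Devanagari (Hindi) into many tokens. For global products, favor a language-agnostic tokenizer such as SentencePiece.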
Pin your versions. Moving from cl100k_base to o200k_base changes your token counts. Always track which encoding you use in evaluations.
Benchmark correctly. Do not compare raw token counts between different model families, because each family counts different units. Normalize by characters or bytes (for example, bytes per token) to stay accurate.
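The last two takeaways fit in one small harness: normalize by UTF-8 bytes and store the encoding name next to every number. The two tokenizers below are toy stand-ins, and their labels and the sample texts are assumptions for illustration. In practice you would pass a real encoder function under its real name.

```python
def bytes_per_token(text, tokenize):
    """UTF-8 bytes of `text` divided by the number of tokens `tokenize` returns."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer returned no tokens")
    return len(text.encode("utf-8")) / len(tokens)


def benchmark(samples, tokenizers):
    """Return one row per (encoding, sample), always tagged with the encoding name."""
    rows = []
    for encoding_name, tokenize in tokenizers.items():
        for label, text in samples.items():
            rows.append({
                "encoding": encoding_name,  # pin this in every evaluation record
                "sample": label,
                "bytes_per_token": round(bytes_per_token(text, tokenize), 3),
            })
    return rows


# Toy stand-ins for real tokenizers (hypothetical labels).
toy_tokenizers = {"toy-whitespace-v1": str.split, "toy-chars-v1": list}
toy_samples = {"en": "the quick brown fox", "es": "el zorro marrón"}

for row in benchmark(toy_samples, toy_tokenizers):
    print(row)
```

A higher bytes-per-token value means the tokenizer compresses that text better. Because the unit is bytes, the numbers stay comparable across model families and across encoding upgrades.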

Understand these tools and you can ship cost-efficient products instead of surprising your finance team.

Source: https://dev.to/tech_nuggets/tokenization-under-the-hood-bpe-wordpiece-sentencepiece-and-unigram-compared-4ca5

Optional learning community: https://t.me/GyaanSetuAi
