Tokenizer

Definition: A tokenizer is the preprocessing component that converts raw text into the discrete symbolic units (tokens) a large language model can read. Tokens are typically sub-word pieces: common words stay whole ("the"), rare words fragment ("Taskade" → "Task" + "ade"), and punctuation gets its own token. Every LLM has exactly one tokenizer, trained alongside or before the model, and the choice of tokenizer determines the vocabulary and how much text fits in a given context window.

Tokenization is invisible until it hurts. It is the reason your Japanese prompt costs three times more than the equivalent English, why emoji explode your token count, and why a 128K-context model sometimes chokes on a 100K-character document. Every LLM bill, every context window calculation, and every prompt engineering optimization runs through the tokenizer.

Why Tokenization Exists

Raw text is a stream of Unicode characters. Neural networks operate on discrete numeric IDs. Tokenization is the bridge. The question is only: at what granularity?

Character level (too fine): T a s k a d e. Word level (too coarse): Taskade. Sub-word (just right): Task + ade.

Character-level tokenization produces sequences 5-10x longer than needed. Word-level tokenization creates a vocabulary too large to train and fails on unseen words. Sub-word tokenization, the modern default, covers everything: common words stay whole, rare words fragment into reusable pieces, and no input ever produces an "unknown token."
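
The trade-off is easy to see with plain Python. This is a toy illustration, not a real tokenizer; the sub-word segmentation shown is a hypothetical example of what a BPE vocabulary might produce:

```python
sentence = "Taskade makes tokenization invisible"

char_tokens = list(sentence)       # character level: one token per character
word_tokens = sentence.split()     # word level: one token per whitespace word
# A hypothetical sub-word segmentation, roughly what a BPE vocab might yield:
subword_tokens = ["Task", "ade", " makes", " token", "ization", " invisible"]

print(len(char_tokens))     # 36 — far longer sequence than needed
print(len(word_tokens))     # 4  — compact, but "Taskade" would be out-of-vocabulary
print(len(subword_tokens))  # 6  — in between: full coverage, modest length
```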

The Three Dominant Algorithms

Byte-Pair Encoding (BPE). Starts with individual characters and iteratively merges the most frequent pair. "th" becomes a token because "t" and "h" appear together often. After enough merges, common words like "the" survive as single tokens; rare words stay fragmented. Used by GPT-2, GPT-3, GPT-4, GPT-5, Claude, and most OpenAI-family models.
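
The merge loop described above fits in a few lines. This is a simplified toy trainer in the spirit of the original BPE paper, not any production tokenizer; the corpus and merge count are illustrative:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a toy corpus: repeatedly merge the most
    frequent adjacent symbol pair (simplified sketch)."""
    # Start with each word as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = ["the", "the", "the", "then", "than", "that"]
print(bpe_merges(corpus, 3))  # [('t', 'h'), ('th', 'e'), ('th', 'a')]
```

The first merge is "t" + "h" because that pair occurs in every word; "the" then survives as a single token, exactly the behavior described above.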

WordPiece. Similar to BPE but chooses merges based on likelihood improvement rather than raw frequency. Used by BERT and its derivatives such as DistilBERT.

SentencePiece (Unigram). Trains on raw text without assuming whitespace delimits words, which is crucial for Japanese, Chinese, and Thai. Iteratively removes least-useful tokens from a large starting vocabulary. Used by Gemini, LLaMA, T5, Mistral, and most modern multilingual models.

Algorithm                 Approach                    Used By
BPE                       Merge most frequent pair    GPT-2 → GPT-5, Claude
WordPiece                 Merge based on likelihood   BERT, DistilBERT
SentencePiece (Unigram)   Prune low-utility tokens    Gemini, LLaMA, Mistral
Byte-level BPE            BPE on raw UTF-8 bytes      GPT-4, Claude (handles any Unicode)

Modern models often use byte-level BPE: the tokenizer operates on UTF-8 bytes instead of Unicode code points. This guarantees any input can be encoded, at the cost of fragmenting non-Latin scripts into many bytes.
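
A quick way to see why byte-level coverage is total: every string reduces to UTF-8 bytes in the 0-255 range, so a 256-symbol base alphabet already encodes anything. A minimal sketch:

```python
# Byte-level tokenization starts from raw UTF-8 bytes, so the base
# "alphabet" is just the 256 possible byte values — any input is covered.
for text in ["the", "☃", "こんにちは"]:
    b = text.encode("utf-8")
    print(text, "→", len(text), "code points,", len(b), "bytes:", list(b))

# "the"       → 3 code points, 3 bytes
# "☃"         → 1 code point, 3 bytes: [226, 152, 131] (0xE2, 0x98, 0x83)
# "こんにちは" → 5 code points, 15 bytes
```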

What One Token Looks Like

A token can be:

  • A whole common word: the, and, cat
  • A partial word: Task, ade, ing, ly
  • A single character: !, ?, a
  • A space-prefixed word piece: " cat" (the leading space is part of the token)
  • A single byte (in byte-BPE): 0xE2, 0x98, 0x83 (☃ takes three tokens)

The last case is why emoji are expensive. A single emoji like 🎯 is one Unicode code point but four UTF-8 bytes, and in byte-level BPE it can cost up to four tokens, several times the cost of a common ASCII word, which is usually a single token.

English:     "Hello, world"        →  3 tokens
Japanese:    "こんにちは、世界"      →  ~12 tokens  (4x more)
Emoji:       "🎯🚀✨"               →  ~9 tokens
Code:        "function foo() {}"   →  ~6 tokens
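
The gap above can be reproduced with plain Python, using UTF-8 byte counts as a rough proxy for byte-level token cost (learned merges compress the real counts, so this is an upper-bound sketch):

```python
# UTF-8 bytes per string: a rough proxy for byte-level token cost.
samples = {
    "English":  "Hello, world",
    "Japanese": "こんにちは、世界",
    "Emoji":    "🎯🚀✨",
}
for name, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{name}: {len(text)} chars → {n_bytes} UTF-8 bytes")

# English: 12 chars → 12 bytes
# Japanese: 8 chars → 24 bytes   (3 bytes per character)
# Emoji: 3 chars → 11 bytes      (up to 4 bytes per emoji)
```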

Token Count Drives Everything

The tokenizer is the pricing axis for every LLM in production. Every API bill, every context window limit, every cache key is measured in tokens, not characters or words.

Approximate token-to-character ratios (English)

  1 token  ≈  4 characters
  1 token  ≈  0.75 words
  1 page   ≈  500 tokens
  1 book   ≈  100,000 tokens

These ratios are rough averages. Code, non-Latin text, and formatted documents can all run 2-4x higher token counts than their word counts suggest. For anything cost-sensitive, measure the actual tokenizer output; do not estimate.
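
As a back-of-envelope sketch, the 4-characters-per-token rule from the table above as a helper (English-only and illustrative; real billing needs the real tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough English-only estimate: ~4 characters per token.
    For anything cost-sensitive, use the model's actual tokenizer instead."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 11
```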

Tokenization and Model Behavior

Tokenization does not just affect billing; it affects behavior:

Arithmetic fragility. Numbers like 1234567 tokenize differently than 1,234,567. Some tokenizers split 1234 into 12, 34. This breaks arithmetic unless the model has seen enough examples of each split pattern.

Trailing-space bias. A token that begins with a space (" cat") is different from one without ("cat"). Prompts that end with a trailing space can shift output distributions unpredictably.
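
GPT-2-style byte-level vocabularies make that leading space visible: the character Ġ stands in for a space in GPT-2's vocabulary files, so "Ġcat" and "cat" are entirely separate entries. A simplified sketch of that pre-tokenization step (the regex is a toy, not GPT-2's actual pattern):

```python
import re

def gpt2_style_pretokens(text):
    """Split text into space-prefixed word pieces, marking the leading
    space with 'Ġ' the way GPT-2's vocabulary does (simplified sketch)."""
    return [w.replace(" ", "Ġ") for w in re.findall(r" ?\w+|[^\w\s]", text)]

print(gpt2_style_pretokens("the cat sat"))  # ['the', 'Ġcat', 'Ġsat']
print(gpt2_style_pretokens("cat"))          # ['cat'] — no marker: a different token
```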

Vocabulary holes. If "Taskade" tokenizes into ["Task", "ade"] in training but ["T", "ask", "ade"] after a tokenizer update, every reference to the brand behaves differently. Tokenizers are frozen for the model's lifetime.

Cross-language tax. Non-English text costs 2-4x more tokens per semantic unit. A 32K-context model with a 32K-character Japanese document will run out of room.

Tokenization in Taskade

You never touch a tokenizer inside Taskade. The platform auto-routes between 11+ frontier models from OpenAI, Anthropic, and Google, each with its own tokenizer. The credit accounting handles token counts under the hood; what you see is credits-per-action, normalized across models.

When an agent reads a long Taskade project, the platform chunks and embeds at a token-aware boundary to stay inside the model's context window. When a Genesis app processes a large document, the same token-aware chunking applies. The tokenizer is infrastructure; the user experience is "paste anything, it works."

Frequently Asked Questions About Tokenizers

What is a tokenizer in AI?

A tokenizer is the component that converts raw text into the discrete tokens a large language model can read. Tokens are typically sub-word pieces: common words stay whole, rare words fragment into reusable parts, and punctuation gets its own token.

How many tokens is a word?

For English, roughly 1 word ≈ 1.3 tokens. A 1,000-word document is about 1,300 tokens. Non-English text, code, and emoji have higher ratios: Japanese is typically 3-4x more tokens per word than English.

Why do emoji cost so many tokens?

Modern LLMs use byte-level BPE tokenization. A single emoji is one Unicode code point but often three or four UTF-8 bytes, each costing a token. An emoji can easily be 3-4x the cost of a common ASCII word.

Does Taskade let me pick a tokenizer?

No. Taskade auto-routes between 11+ frontier models from OpenAI, Anthropic, and Google, each with its own tokenizer. Credit accounting normalizes token costs under the hood, so you see credits-per-action instead of raw tokens.

How do I count tokens accurately?

Use the model's official tokenizer (OpenAI's tiktoken, Anthropic's tokenizer endpoint, Google's Gemini tokenizer). Character counts and word counts are approximations; only the exact tokenizer gives the real number.
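
A minimal sketch of exact counting with tiktoken (the model name "gpt-4o" is an illustrative choice); it falls back to the rough 4-characters-per-token heuristic when the library is unavailable:

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Exact count via tiktoken when available; otherwise fall back to
    the rough English-only 4-characters-per-token heuristic."""
    try:
        import tiktoken  # pip install tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except Exception:  # tiktoken missing, offline, or unknown model
        return max(1, round(len(text) / 4))

print(count_tokens("Hello, world"))
```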
