Tokenizer

Definition: A tokenizer is the preprocessing component that converts raw text into the discrete symbolic units (tokens) a large language model can read. Tokens are typically sub-word pieces: common words stay whole ("the"), rare words fragment ("Taskade" → "Task" + "ade"), and punctuation gets its own token. Every LLM has exactly one tokenizer, trained alongside or before the model, and the choice of tokenizer determines the vocabulary and how much text fits in a given context window.

Tokenization is invisible until it hurts. It is the reason your Japanese prompt costs three times more than the equivalent English, why emoji explode your token count, and why a 128K-context model sometimes chokes on a 100K-character document. Every LLM bill, every context window calculation, and every prompt engineering optimization runs through the tokenizer.

Why Tokenization Exists

Raw text is a stream of Unicode characters. Neural networks operate on discrete numeric IDs. Tokenization is the bridge. The question is only: at what granularity?

Character level (too fine): T a s k a d e. Word level (too coarse): Taskade. Sub-word (just right): Task + ade.

Character-level tokenization produces sequences 5-10x longer than needed. Word-level tokenization creates a vocabulary too large to train and fails on unseen words. Sub-word tokenization, the modern default, covers everything: common words stay whole, rare words fragment into reusable pieces, and no input ever produces an "unknown token."
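
The trade-off is easy to see with plain Python. This is a toy illustration, not a real tokenizer; the sub-word segmentation shown is a hypothetical example of what a BPE vocabulary might produce:

```python
sentence = "Taskade makes tokenization invisible"

char_tokens = list(sentence)       # character level: one token per character
word_tokens = sentence.split()     # word level: one token per whitespace word
# A hypothetical sub-word segmentation, roughly what a BPE vocab might yield:
subword_tokens = ["Task", "ade", " makes", " token", "ization", " invisible"]

print(len(char_tokens))     # 36 — far longer sequence than needed
print(len(word_tokens))     # 4  — compact, but "Taskade" would be out-of-vocabulary
print(len(subword_tokens))  # 6  — in between: full coverage, modest length
```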

The Three Dominant Algorithms

Byte-Pair Encoding (BPE). Starts with individual characters and iteratively merges the most frequent pair. "th" becomes a token because "t" and "h" appear together often. After enough merges, common words like "the" survive as single tokens; rare words stay fragmented. Used by GPT-2, GPT-3, GPT-4, GPT-5, Claude, and most OpenAI-family models.
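
The merge loop described above fits in a few lines. This is a simplified toy trainer in the spirit of the original BPE paper, not any production tokenizer; the corpus and merge count are illustrative:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a toy corpus: repeatedly merge the most
    frequent adjacent symbol pair (simplified sketch)."""
    # Start with each word as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = ["the", "the", "the", "then", "than", "that"]
print(bpe_merges(corpus, 3))  # [('t', 'h'), ('th', 'e'), ('th', 'a')]
```

The first merge is "t" + "h" because that pair occurs in every word; "the" then survives as a single token, exactly the behavior described above.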

WordPiece. Similar to BPE but chooses merges based on likelihood improvement rather than raw frequency. Used by BERT and its derivatives such as DistilBERT.

SentencePiece (Unigram). Trains on raw text without assuming whitespace delimits words, which is crucial for Japanese, Chinese, and Thai. Iteratively removes least-useful tokens from a large starting vocabulary. Used by Gemini, LLaMA, T5, Mistral, and most modern multilingual models.

Algorithm                 Approach                    Used By
BPE                       Merge most frequent pair    GPT-2 → GPT-5, Claude
WordPiece                 Merge based on likelihood   BERT, DistilBERT
SentencePiece (Unigram)   Prune low-utility tokens    Gemini, LLaMA, Mistral
Byte-level BPE            BPE on raw UTF-8 bytes      GPT-4, Claude (handles any Unicode)

Modern models often use byte-level BPE: the tokenizer operates on UTF-8 bytes instead of Unicode code points. This guarantees any input can be encoded, at the cost of fragmenting non-Latin scripts into many bytes.
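
A quick way to see why byte-level coverage is total: every string reduces to UTF-8 bytes in the 0-255 range, so a 256-symbol base alphabet already encodes anything. A minimal sketch:

```python
# Byte-level tokenization starts from raw UTF-8 bytes, so the base
# "alphabet" is just the 256 possible byte values — any input is covered.
for text in ["the", "☃", "こんにちは"]:
    b = text.encode("utf-8")
    print(text, "→", len(text), "code points,", len(b), "bytes:", list(b))

# "the"       → 3 code points, 3 bytes
# "☃"         → 1 code point, 3 bytes: [226, 152, 131] (0xE2, 0x98, 0x83)
# "こんにちは" → 5 code points, 15 bytes
```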

What One Token Looks Like

A token can be:

  • A whole common word: the, and, cat
  • A partial word: Task, ade, ing, ly
  • A single character: !, ?, a
  • A space-prefixed word piece: " cat" (the leading space is part of the token)
  • A single byte (in byte-BPE): 0xE2, 0x98, 0x83 (☃ takes three tokens)

The last case is why emoji are expensive. A single emoji like 🎯 is one Unicode code point but four UTF-8 bytes, and in byte-level BPE it can cost up to four tokens, several times the cost of a common ASCII word, which is usually a single token.

English:     "Hello, world"        →  3 tokens
Japanese:    "こんにちは、世界"      →  ~12 tokens  (4x more)
Emoji:       "🎯🚀✨"               →  ~9 tokens
Code:        "function foo() {}"   →  ~6 tokens
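
The gap above can be reproduced with plain Python, using UTF-8 byte counts as a rough proxy for byte-level token cost (learned merges compress the real counts, so this is an upper-bound sketch):

```python
# UTF-8 bytes per string: a rough proxy for byte-level token cost.
samples = {
    "English":  "Hello, world",
    "Japanese": "こんにちは、世界",
    "Emoji":    "🎯🚀✨",
}
for name, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{name}: {len(text)} chars → {n_bytes} UTF-8 bytes")

# English: 12 chars → 12 bytes
# Japanese: 8 chars → 24 bytes   (3 bytes per character)
# Emoji: 3 chars → 11 bytes      (up to 4 bytes per emoji)
```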

Token Count Drives Everything

The tokenizer is the pricing axis for every LLM in production. Every API bill, every context window limit, every cache key is measured in tokens, not characters or words.

Approximate token-to-character ratios (English)

  1 token  ≈  4 characters
  1 token  ≈  0.75 words
  1 page   ≈  500 tokens
  1 book   ≈  100,000 tokens

These ratios are rough averages. Code, non-Latin text, and formatted documents can all run 2-4x higher token counts than their word counts suggest. For anything cost-sensitive, measure the actual tokenizer output; do not estimate.
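
As a back-of-envelope sketch, the 4-characters-per-token rule from the table above as a helper (English-only and illustrative; real billing needs the real tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough English-only estimate: ~4 characters per token.
    For anything cost-sensitive, use the model's actual tokenizer instead."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 11
```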

Tokenization and Model Behavior

Tokenization does not just affect billing; it affects behavior:

Arithmetic fragility. Numbers like 1234567 tokenize differently than 1,234,567. Some tokenizers split 1234 into 12, 34. This breaks arithmetic unless the model has seen enough examples of each split pattern.

Trailing-space bias. A token that begins with a space (" cat") is different from one without ("cat"). Prompts that end with a trailing space can shift output distributions unpredictably.
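
GPT-2-style byte-level vocabularies make that leading space visible: the character Ġ stands in for a space in GPT-2's vocabulary files, so "Ġcat" and "cat" are entirely separate entries. A simplified sketch of that pre-tokenization step (the regex is a toy, not GPT-2's actual pattern):

```python
import re

def gpt2_style_pretokens(text):
    """Split text into space-prefixed word pieces, marking the leading
    space with 'Ġ' the way GPT-2's vocabulary does (simplified sketch)."""
    return [w.replace(" ", "Ġ") for w in re.findall(r" ?\w+|[^\w\s]", text)]

print(gpt2_style_pretokens("the cat sat"))  # ['the', 'Ġcat', 'Ġsat']
print(gpt2_style_pretokens("cat"))          # ['cat'] — no marker: a different token
```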

Vocabulary holes. If "Taskade" tokenizes into ["Task", "ade"] in training but ["T", "ask", "ade"] after a tokenizer update, every reference to the brand behaves differently. Tokenizers are frozen for the model's lifetime.

Cross-language tax. Non-English text costs 2-4x more tokens per semantic unit. A 32K-context model with a 32K-character Japanese document will run out of room.

Tokenization in Taskade

You never touch a tokenizer inside Taskade. The platform auto-routes between 11+ frontier models from OpenAI, Anthropic, and Google, each with its own tokenizer. The credit accounting handles token counts under the hood; what you see is credits-per-action, normalized across models.

When an agent reads a long Taskade project, the platform chunks and embeds at a token-aware boundary to stay inside the model's context window. When a Genesis app processes a large document, the same token-aware chunking applies. The tokenizer is infrastructure; the user experience is "paste anything, it works."

Frequently Asked Questions About Tokenizers

What is a tokenizer in AI?

A tokenizer is the component that converts raw text into the discrete tokens a large language model can read. Tokens are typically sub-word pieces: common words stay whole, rare words fragment into reusable parts, and punctuation gets its own token.

How many tokens is a word?

For English, roughly 1 word ≈ 1.3 tokens. A 1,000-word document is about 1,300 tokens. Non-English text, code, and emoji have higher ratios: Japanese is typically 3-4x more tokens per word than English.

Why do emoji cost so many tokens?

Modern LLMs use byte-level BPE tokenization. A single emoji is one Unicode code point but often three or four UTF-8 bytes, each costing a token. An emoji can easily be 3-4x the cost of a common ASCII word.

Does Taskade let me pick a tokenizer?

No. Taskade auto-routes between 11+ frontier models from OpenAI, Anthropic, and Google, each with its own tokenizer. Credit accounting normalizes token costs under the hood, so you see credits-per-action instead of raw tokens.

How do I count tokens accurately?

Use the model's official tokenizer (OpenAI's tiktoken, Anthropic's tokenizer endpoint, Google's Gemini tokenizer). Character counts and word counts are approximations; only the exact tokenizer gives the real number.
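
A minimal sketch of exact counting with tiktoken (the model name "gpt-4o" is an illustrative choice); it falls back to the rough 4-characters-per-token heuristic when the library is unavailable:

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Exact count via tiktoken when available; otherwise fall back to
    the rough English-only 4-characters-per-token heuristic."""
    try:
        import tiktoken  # pip install tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except Exception:  # tiktoken missing, offline, or unknown model
        return max(1, round(len(text) / 4))

print(count_tokens("Hello, world"))
```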
