Inference

Definition: Inference is the runtime phase of a large language model, the act of using a trained model to produce output from a new input. Training builds the model's weights; inference spends those weights to generate tokens one at a time. Every ChatGPT reply, every Claude thought, every Taskade agent response, every Genesis app build is an inference call somewhere.

Inference economics decide what AI products can ship. Training a frontier model is a multi-hundred-million-dollar event that happens once. Inference costs show up every single time a user sends a message. The cost curve of a successful AI product is dominated by inference, not training.

The Two Phases of LLM Inference

[Diagram: Prefill (one pass, thousands of input tokens processed with parallel attention, KV cache populated) followed by Decode (output tokens 1, 2, 3, ... generated one at a time).]

Prefill. The input prompt (thousands of tokens) is processed in parallel in a single forward pass through the transformer. The attention keys and values are cached for every layer and position (the KV cache). Prefill is compute-heavy and fast because it can use the full GPU in parallel.

Decode. Output tokens are generated one at a time. Each new token needs the full KV cache from all prior positions plus the new token's contribution. Decode is memory-bandwidth-bound, not compute-bound, and runs far below GPU peak. Generating a 1,000-token response runs the model 1,000 separate times.

This split is why a short prompt with a long response is expensive per token, while a long prompt with a short response is cheap per token. The economics are asymmetric.
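The two phases can be sketched as a toy loop. This is illustrative only: `prefill`, `decode_step`, and the arithmetic "model" are hypothetical stand-ins, not a real transformer, but the control flow (one parallel pass, then one sequential pass per output token) mirrors real inference:

```python
def prefill(prompt_tokens):
    """Phase 1: one parallel pass over the whole prompt; builds the KV cache."""
    return list(prompt_tokens)          # all positions processed at once

def decode_step(kv_cache):
    """Phase 2: one sequential step; reads the full cache, emits one token."""
    next_token = sum(kv_cache) % 100    # placeholder for a forward pass
    kv_cache.append(next_token)         # cache grows by one position
    return next_token

def generate(prompt_tokens, max_new_tokens):
    kv_cache = prefill(prompt_tokens)   # runs once
    output = []
    for _ in range(max_new_tokens):     # runs once per output token
        output.append(decode_step(kv_cache))
    return output

tokens = generate([1, 2, 3], max_new_tokens=4)
```

Note that the decode loop cannot be parallelized: each step depends on the token the previous step produced, which is exactly the autoregressive bottleneck described above.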

The KV Cache

The KV cache is the single most important optimization in LLM inference. Without it, every new token would require re-processing the entire sequence. With it, only the new token's contribution is computed.

Without KV cache: O(n²) per token   →  10x slower
With KV cache:    O(n) per token    →  standard

The KV cache grows linearly with sequence length. A 128K-context model with a batch of 10 concurrent conversations needs gigabytes of VRAM just for caches. This is why context-window extensions get harder the longer they stretch and why prefix caching (reusing KV entries for common system prompts) matters so much in production.
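A back-of-envelope sizing makes the "gigabytes of VRAM" claim concrete. The shapes below are assumed for illustration (roughly Llama-70B-like: 80 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16), not any vendor's published configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """K and V tensors, per layer, per head, per position, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Per token: ~320 KiB of cache per sequence.
per_token = kv_cache_bytes(80, 8, 128, seq_len=1, batch=1)

# Ten concurrent conversations at a full 128K context:
full = kv_cache_bytes(80, 8, 128, seq_len=128 * 1024, batch=10)
print(full / 2**30)  # 400.0 GiB of cache alone
```

The linear growth in `seq_len` is visible in the formula, and so is why prefix caching pays off: any shared system-prompt prefix contributes identical K/V entries to every request that reuses it.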

Anthropic's prompt caching feature, OpenAI's cached input tokens, and Google's context caching all expose this to developers: pay once for a prefix, reuse it across thousands of requests.

Batching

A GPU running a single conversation wastes ~95% of its compute. Inference servers pack many conversations together (batching) to amortize the memory-bandwidth bottleneck.

Static batching. Wait until N requests arrive, then run them together. Simple, but adds latency.

Continuous batching (iteration-level batching). Requests enter and leave the batch at every decode step. The throughput gain is typically 3–10x with no extra latency. This is the default in vLLM, TGI, and most production LLM servers.

Speculative decoding. A small, cheap "draft" model proposes the next few tokens; the big model verifies them in one prefill-style pass. When the draft is accepted, you get 2–4 tokens for the cost of one. Used by every major frontier vendor in 2026.
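The draft-and-verify loop of speculative decoding can be sketched with toy deterministic "models" (both next-token rules below are invented for illustration; real systems verify sampled tokens probabilistically, e.g. via rejection sampling):

```python
def target_next(seq):
    """The big model's next token (toy deterministic rule)."""
    return (seq[-1] * 2) % 10

def draft_next(seq):
    """A cheap draft model that usually agrees with the target."""
    return (seq[-1] * 2) % 10 if seq[-1] % 3 else 0

def speculative_step(seq, k=4):
    """Draft k tokens ahead, verify against the target, keep the agreed prefix."""
    draft, ctx = [], list(seq)
    for _ in range(k):                  # cheap: k calls to the draft model
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(seq)
    for t in draft:                     # target checks all k in "one pass"
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # On mismatch (or full acceptance) the target still yields one token.
    accepted.append(target_next(ctx))
    return accepted                     # 1..k+1 tokens per target pass
```

Each call to `speculative_step` costs one target-model pass but can emit several tokens, which is where the 2–4x speedup comes from when the draft's acceptance rate is high.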

Latency Metrics That Matter

Two metrics define inference UX:

Time to First Token (TTFT). Seconds from request to the first streamed token. Dominated by prefill and queuing. Long prompts increase TTFT linearly.

Tokens Per Second (TPS). Throughput of the decode phase. On current accelerators, frontier models run at 50–200 TPS per request. Streaming TPS below 40 feels sluggish; above 100 feels instant.

Together these define the perceived speed:

Response time = TTFT + (output_tokens / TPS)

Example:
  prompt:         4K tokens
  TTFT:           0.6 s
  output:         500 tokens
  TPS:            80
  total latency:  0.6 + 500/80 = 6.85 s
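The formula runs directly as code; this tiny helper (a hypothetical name, just restating the equation above) reproduces the worked example:

```python
def response_time(ttft_s, output_tokens, tps):
    """Perceived latency: time to first token plus decode time."""
    return ttft_s + output_tokens / tps

print(response_time(0.6, 500, 80))   # ≈ 6.85 s, matching the example above
```

The split also shows which lever to pull: a long prompt hurts TTFT (prefill), while a long response hurts the `output_tokens / TPS` term (decode).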

The Cost Side

Inference cost scales with:

Factor               Effect on cost
-------------------  --------------------------------
Input tokens         Linear (prefill compute)
Output tokens        Linear (decode time + memory)
Context window size  Linear (KV cache memory)
Model parameters     Roughly linear (compute)
Batch size           Inverse (more sharing = cheaper)
Cache hit rate       Inverse (fewer fresh prefills)

Frontier models in 2026 charge roughly 3–10x more per output token than per input token, because decode is the expensive half. Long-output workloads like code generation or Genesis app builds feel the output tax sharply.
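A per-request cost estimate ties the table together. The prices and the 90% cache discount below are hypothetical placeholders chosen only to show the shape of the math (here a 5x output premium), not any vendor's actual rates:

```python
def request_cost(input_tokens, output_tokens, cached_tokens=0,
                 in_price=3.0, out_price=15.0, cache_discount=0.9):
    """Dollar cost of one request; prices are $ per 1M tokens (assumed values)."""
    fresh = input_tokens - cached_tokens
    return (fresh * in_price
            + cached_tokens * in_price * (1 - cache_discount)
            + output_tokens * out_price) / 1_000_000

# 4K-token prompt, 500-token response: output is ~38% of the dollar cost
# despite being ~11% of the tokens.
base = request_cost(4000, 500)
# Same request with a 3K-token cached system prompt is noticeably cheaper.
warm = request_cost(4000, 500, cached_tokens=3000)
```

With these assumed numbers, most of the savings from a cache hit land on the prefill side, which is exactly why prompt caching is priced as an input-token discount.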

Inference Optimizations in Practice

Modern inference stacks layer several of these optimizations:

  1. Continuous batching: pack requests together
  2. Paged attention (PagedAttention): virtual-memory-style KV cache layout that reduces fragmentation
  3. Speculative decoding: draft + verify for a 2–4x speedup
  4. FlashAttention / FlashDecoding: memory-efficient attention kernels
  5. Quantization: 8-bit or 4-bit weights, 2–4x memory saved at 1–2% quality loss
  6. KV cache sharing: prefix caching across requests
  7. Constrained decoding: force outputs to match a grammar or schema

The combination turns a 70B-parameter model from "too expensive" into "cost-effective" at production scale.
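To see why quantization (item 5) moves the needle for a 70B-parameter model, a back-of-envelope weight-memory calculation (ignoring activations and KV cache, which add on top):

```python
def weight_bytes(params, bits):
    """Memory for model weights alone at a given precision."""
    return params * bits // 8

p70b = 70_000_000_000
print(weight_bytes(p70b, 16) / 1e9)  # 140.0 GB at fp16: multiple GPUs required
print(weight_bytes(p70b, 4) / 1e9)   # 35.0 GB at 4-bit: fits far cheaper hardware
```

Halving or quartering weight memory also cuts the bytes read per decode step, so quantization helps the memory-bandwidth-bound decode phase, not just capacity.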

Inference in Taskade

You never manage inference inside Taskade. The platform auto-routes between 11+ frontier models from OpenAI, Anthropic, and Google, dispatching requests to whichever model fits the task and the user's plan. All the optimizations above (prompt caching, continuous batching, speculative decoding) happen inside the upstream providers and the Taskade inference gateway.

What you see is credits. A credit absorbs the full inference cost (input + output + cache behavior) normalized across models, so switching from one frontier model to another does not require you to re-architect anything. Business and Max plans have different credit bursts precisely because the underlying inference profile differs โ€” Business favors request burst, Max favors credit burst for long-running Genesis builds.

Frequently Asked Questions About Inference

What is inference in AI?

Inference is the runtime phase of a large language model: using a trained model to produce output from a new input. Training builds the weights; inference spends them to generate tokens one at a time.

Why is LLM output slow?

LLM decode is memory-bandwidth-bound, not compute-bound. Each new token requires reading the full KV cache, one token at a time. Modern systems use continuous batching and speculative decoding to lift throughput, but the sequential nature of autoregressive generation is fundamental.

What is the KV cache?

The KV cache stores the attention keys and values for every token position already processed, so new tokens do not require re-processing the entire sequence. It turns inference from O(nยฒ) per token to O(n) per token.

Why are output tokens more expensive than input tokens?

Output tokens come from the decode phase, which runs the model once per token and cannot batch nearly as efficiently as prefill. Input tokens come from prefill, which can amortize across a whole sequence in one parallel pass. The asymmetry is typically 3โ€“10x.

Does Taskade manage inference for me?

Yes. Taskade's inference gateway auto-routes across 11+ frontier models from OpenAI, Anthropic, and Google, applies prompt caching, batching, and speculative decoding transparently, and charges in normalized credits so you do not have to think about per-model pricing.
