Inference

Definition: Inference is the runtime phase of a large language model, the act of using a trained model to produce output from a new input. Training builds the model's weights; inference spends those weights to generate tokens one at a time. Every ChatGPT reply, every Claude thought, every Taskade agent response, every Genesis app build is an inference call somewhere.

Inference economics decide what AI products can ship. Training a frontier model is a multi-hundred-million-dollar event that happens once. Inference costs show up every single time a user sends a message. The cost curve of a successful AI product is dominated by inference, not training.

The Two Phases of LLM Inference

[Diagram: Prefill (one pass, thousands of input tokens processed with parallel attention, KV cache populated) followed by Decode (output tokens 1, 2, 3, ... generated one at a time).]

Prefill. The input prompt (thousands of tokens) is processed in parallel in a single forward pass through the transformer. The attention keys and values are cached for every layer and position (the KV cache). Prefill is compute-heavy and fast because it can use the full GPU in parallel.

Decode. Output tokens are generated one at a time. Each new token needs the full KV cache from all prior positions plus the new token's contribution. Decode is memory-bandwidth-bound, not compute-bound, and runs far below GPU peak. Generating a 1,000-token response runs the model 1,000 separate times.

This split is why a short prompt with a long response is expensive per token, while a long prompt with a short response is cheap per token. The economics are asymmetric.
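The two phases can be sketched as a toy loop. This is illustrative only: `prefill`, `decode_step`, and the arithmetic "model" are hypothetical stand-ins, not a real transformer, but the control flow (one parallel pass, then one sequential pass per output token) mirrors real inference:

```python
def prefill(prompt_tokens):
    """Phase 1: one parallel pass over the whole prompt; builds the KV cache."""
    return list(prompt_tokens)          # all positions processed at once

def decode_step(kv_cache):
    """Phase 2: one sequential step; reads the full cache, emits one token."""
    next_token = sum(kv_cache) % 100    # placeholder for a forward pass
    kv_cache.append(next_token)         # cache grows by one position
    return next_token

def generate(prompt_tokens, max_new_tokens):
    kv_cache = prefill(prompt_tokens)   # runs once
    output = []
    for _ in range(max_new_tokens):     # runs once per output token
        output.append(decode_step(kv_cache))
    return output

tokens = generate([1, 2, 3], max_new_tokens=4)
```

Note that the decode loop cannot be parallelized: each step depends on the token the previous step produced, which is exactly the autoregressive bottleneck described above.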

The KV Cache

The KV cache is the single most important optimization in LLM inference. Without it, every new token would require re-processing the entire sequence. With it, only the new token's contribution is computed.

Without KV cache: O(n²) per token   →  10x slower
With KV cache:    O(n) per token    →  standard

The KV cache grows linearly with sequence length. A 128K-context model with a batch of 10 concurrent conversations needs gigabytes of VRAM just for caches. This is why context-window extensions get harder the longer they stretch and why prefix caching (reusing KV entries for common system prompts) matters so much in production.
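A back-of-envelope sizing makes the "gigabytes of VRAM" claim concrete. The shapes below are assumed for illustration (roughly Llama-70B-like: 80 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16), not any vendor's published configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """K and V tensors, per layer, per head, per position, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Per token: ~320 KiB of cache per sequence.
per_token = kv_cache_bytes(80, 8, 128, seq_len=1, batch=1)

# Ten concurrent conversations at a full 128K context:
full = kv_cache_bytes(80, 8, 128, seq_len=128 * 1024, batch=10)
print(full / 2**30)  # 400.0 GiB of cache alone
```

The linear growth in `seq_len` is visible in the formula, and so is why prefix caching pays off: any shared system-prompt prefix contributes identical K/V entries to every request that reuses it.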

Anthropic's prompt caching feature, OpenAI's cached input tokens, and Google's context caching all expose this to developers: pay once for a prefix, reuse it across thousands of requests.

Batching

A GPU running a single conversation wastes ~95% of its compute. Inference servers pack many conversations together (batching) to amortize the memory-bandwidth bottleneck.

Static batching. Wait until N requests arrive, then run them together. Simple, but adds latency.

Continuous batching (iteration-level batching). Requests enter and leave the batch at every decode step. The throughput gain is typically 3–10x with no extra latency. This is the default in vLLM, TGI, and most production LLM servers.

Speculative decoding. A small, cheap "draft" model proposes the next few tokens; the big model verifies them in one prefill-style pass. When the draft is accepted, you get 2–4 tokens for the cost of one. Used by every major frontier vendor in 2026.
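The draft-and-verify loop of speculative decoding can be sketched with toy deterministic "models" (both next-token rules below are invented for illustration; real systems verify sampled tokens probabilistically, e.g. via rejection sampling):

```python
def target_next(seq):
    """The big model's next token (toy deterministic rule)."""
    return (seq[-1] * 2) % 10

def draft_next(seq):
    """A cheap draft model that usually agrees with the target."""
    return (seq[-1] * 2) % 10 if seq[-1] % 3 else 0

def speculative_step(seq, k=4):
    """Draft k tokens ahead, verify against the target, keep the agreed prefix."""
    draft, ctx = [], list(seq)
    for _ in range(k):                  # cheap: k calls to the draft model
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(seq)
    for t in draft:                     # target checks all k in "one pass"
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # On mismatch (or full acceptance) the target still yields one token.
    accepted.append(target_next(ctx))
    return accepted                     # 1..k+1 tokens per target pass
```

Each call to `speculative_step` costs one target-model pass but can emit several tokens, which is where the 2–4x speedup comes from when the draft's acceptance rate is high.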

Latency Metrics That Matter

Two metrics define inference UX:

Time to First Token (TTFT). Seconds from request to the first streamed token. Dominated by prefill and queuing. Long prompts increase TTFT linearly.

Tokens Per Second (TPS). Throughput of the decode phase. On current accelerators, frontier models run at 50–200 TPS per request. Streaming TPS below 40 feels sluggish; above 100 feels instant.

Together these define the perceived speed:

Response time = TTFT + (output_tokens / TPS)

Example:
  prompt:         4K tokens
  TTFT:           0.6 s
  output:         500 tokens
  TPS:            80
  total latency:  0.6 + 500/80 = 6.85 s
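The formula runs directly as code; this tiny helper (a hypothetical name, just restating the equation above) reproduces the worked example:

```python
def response_time(ttft_s, output_tokens, tps):
    """Perceived latency: time to first token plus decode time."""
    return ttft_s + output_tokens / tps

print(response_time(0.6, 500, 80))   # ≈ 6.85 s, matching the example above
```

The split also shows which lever to pull: a long prompt hurts TTFT (prefill), while a long response hurts the `output_tokens / TPS` term (decode).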

The Cost Side

Inference cost scales with:

Factor               Effect on cost
-------------------  --------------------------------
Input tokens         Linear (prefill compute)
Output tokens        Linear (decode time + memory)
Context window size  Linear (KV cache memory)
Model parameters     Roughly linear (compute)
Batch size           Inverse (more sharing = cheaper)
Cache hit rate       Inverse (fewer fresh prefills)

Frontier models in 2026 charge roughly 3–10x more per output token than per input token, because decode is the expensive half. Long-output workloads like code generation or Genesis app builds feel the output tax sharply.
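A per-request cost estimate ties the table together. The prices and the 90% cache discount below are hypothetical placeholders chosen only to show the shape of the math (here a 5x output premium), not any vendor's actual rates:

```python
def request_cost(input_tokens, output_tokens, cached_tokens=0,
                 in_price=3.0, out_price=15.0, cache_discount=0.9):
    """Dollar cost of one request; prices are $ per 1M tokens (assumed values)."""
    fresh = input_tokens - cached_tokens
    return (fresh * in_price
            + cached_tokens * in_price * (1 - cache_discount)
            + output_tokens * out_price) / 1_000_000

# 4K-token prompt, 500-token response: output is ~38% of the dollar cost
# despite being ~11% of the tokens.
base = request_cost(4000, 500)
# Same request with a 3K-token cached system prompt is noticeably cheaper.
warm = request_cost(4000, 500, cached_tokens=3000)
```

With these assumed numbers, most of the savings from a cache hit land on the prefill side, which is exactly why prompt caching is priced as an input-token discount.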

Inference Optimizations in Practice

Modern inference stacks layer several of these optimizations:

  1. Continuous batching: pack requests together
  2. Paged attention (PagedAttention): virtual-memory-style KV cache layout that reduces fragmentation
  3. Speculative decoding: draft + verify for a 2–4x speedup
  4. FlashAttention / FlashDecoding: memory-efficient attention kernels
  5. Quantization: 8-bit or 4-bit weights, 2–4x memory saved at 1–2% quality loss
  6. KV cache sharing: prefix caching across requests
  7. Constrained decoding: force outputs to match a grammar or schema

The combination turns a 70B-parameter model from "too expensive" into "cost-effective" at production scale.
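To see why quantization (item 5) moves the needle for a 70B-parameter model, a back-of-envelope weight-memory calculation (ignoring activations and KV cache, which add on top):

```python
def weight_bytes(params, bits):
    """Memory for model weights alone at a given precision."""
    return params * bits // 8

p70b = 70_000_000_000
print(weight_bytes(p70b, 16) / 1e9)  # 140.0 GB at fp16: multiple GPUs required
print(weight_bytes(p70b, 4) / 1e9)   # 35.0 GB at 4-bit: fits far cheaper hardware
```

Halving or quartering weight memory also cuts the bytes read per decode step, so quantization helps the memory-bandwidth-bound decode phase, not just capacity.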

Inference in Taskade

You never manage inference inside Taskade. The platform auto-routes between 11+ frontier models from OpenAI, Anthropic, and Google, dispatching requests to whichever model fits the task and the user's plan. All the optimizations above (prompt caching, continuous batching, speculative decoding) happen inside the upstream providers and the Taskade inference gateway.

What you see is credits. A credit absorbs the full inference cost (input + output + cache behavior) normalized across models, so switching from one frontier model to another does not require you to re-architect anything. Business and Max plans have different credit bursts precisely because the underlying inference profile differs โ€” Business favors request burst, Max favors credit burst for long-running Genesis builds.

Frequently Asked Questions About Inference

What is inference in AI?

Inference is the runtime phase of a large language model: using a trained model to produce output from a new input. Training builds the weights; inference spends them to generate tokens one at a time.

Why is LLM output slow?

LLM decode is memory-bandwidth-bound, not compute-bound. Each new token requires reading the full KV cache, one token at a time. Modern systems use continuous batching and speculative decoding to lift throughput, but the sequential nature of autoregressive generation is fundamental.

What is the KV cache?

The KV cache stores the attention keys and values for every token position already processed, so new tokens do not require re-processing the entire sequence. It turns inference from O(nยฒ) per token to O(n) per token.

Why are output tokens more expensive than input tokens?

Output tokens come from the decode phase, which runs the model once per token and cannot batch nearly as efficiently as prefill. Input tokens come from prefill, which can amortize across a whole sequence in one parallel pass. The asymmetry is typically 3โ€“10x.

Does Taskade manage inference for me?

Yes. Taskade's inference gateway auto-routes across 11+ frontier models from OpenAI, Anthropic, and Google, applies prompt caching, batching, and speculative decoding transparently, and charges in normalized credits so you do not have to think about per-model pricing.
