
AI World Models Explained: History, JEPA, Inference Scaling & the Race to Goal-Directed AI (2026)

June 7, 2026 · Updated July 26, 2026 · 28 min read · John Xie · AI · #ai-agents #world-models #inference-scaling


In 2026, Yann LeCun, fresh from more than a decade at Meta, raised $1.03 billion for a single purpose: training world models. Not language models. Not image generators. World models: AI systems that learn the dynamics of the world itself and predict what happens next when you take an action.

That bet is either the future of AI or a $1B detour. Understanding why LeCun made it, and what world models actually are, is one of the most important questions in AI right now.

This is the complete guide.

TL;DR: World models predict "given this state + this action, what comes next?", enabling AI agents to plan, adapt, and reason about the future rather than merely react to the present. From Richard Sutton's 1990 definition through JEPA, V-JEPA 2, and the $1B AMI Labs raise, world models are reshaping robotics, autonomous vehicles, and knowledge-work AI. Inference-time scaling means that the faster a model can run, the more it can think within a fixed time budget. Taskade Genesis implements this loop in your workspace; try a live agent app →

🗺️ World Models at a Glance (2026)

Concept	One-Line Definition
World model	Neural network predicting next state given current state + action
JEPA	Predict next latent embedding (not raw pixels) to avoid wasting capacity
SIGG regularizer	Enforce Gaussian latent distribution to prevent representation collapse
V-JEPA 2	Meta's video world model: 1M hrs training data, 80% zero-shot robotics success
DMPC	Diffusion Model Predictive Control — factorized world model for novel dynamics
Inference-time scaling	Better answers by computing longer, not training bigger
Speculative decoding	Small model drafts tokens; big model verifies in parallel — 2-3× speedup
SSD	Speculative Speculative Decoding — drafting + verification in parallel → 300 tok/s
Model-free	Observation → policy → action (no explicit future prediction)
Model-based	Observation → world model → imagined futures → plan → action
Workspace DNA	Memory + Intelligence + Execution — the world model loop for knowledge work

🤔 What Is a World Model?

A world model is a neural network that answers one question: given the current state of a system and an action I'm about to take, what will the state look like next?

Formally: f(observation, action) → next observation

That might sound like a small step beyond a standard neural network. It isn't. A model that can predict the consequences of actions has, by construction, an internal model of the world's rules. It understands physics. It understands causality. It can plan.
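To make the signature concrete, here is a minimal sketch. It assumes a toy point mass whose state is (position, velocity); the hand-written dynamics, the `dt` value, and the function names are invented for illustration and stand in for what would normally be a trained neural network.

```python
from typing import List, Tuple

State = Tuple[float, float]  # (position, velocity): a hypothetical toy state


def world_model(state: State, action: float, dt: float = 0.1) -> State:
    """f(observation, action) -> next observation for a toy point mass.

    The dynamics are hand-written here; in a learned world model this
    function would be a network fit to observed transitions.
    """
    position, velocity = state
    velocity = velocity + action * dt   # the action acts as an acceleration
    position = position + velocity * dt
    return (position, velocity)


def rollout(state: State, actions: List[float]) -> List[State]:
    """Chain single-step predictions into an imagined trajectory."""
    trajectory = [state]
    for action in actions:
        state = world_model(state, action)
        trajectory.append(state)
    return trajectory


imagined = rollout((0.0, 0.0), [1.0, 1.0, 0.0])
```

Everything later in this guide (rollouts, planning, surprise) is built by calling a function of this shape repeatedly.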

The definition is not new. In 1990, Richard Sutton, the father of reinforcement learning, described it at a NIPS workshop:

"A black box that takes as input its situation and the action it is going to execute and outputs a prediction of its immediate next situation."

That sentence is a complete specification of a modern world model, written 36 years ago. What changed is scale, data, and compute. The neural networks of 1990 could barely fit a toy environment. Today's world models train on a million hours of video.

Three Capabilities World Models Enable

Build a good world model and three things become possible that aren't possible without one:

Imagined rollouts let an agent mentally simulate "if I do X, Y, Z, what happens?", playing out futures faster than they can occur in the real world. This is the same cognitive machinery that lets you imagine catching a ball before you move.

Model predictive control (MPC) uses those imagined rollouts to score action sequences and pick the best one. No reward function is needed at training time, just a world model and an objective supplied at test time.

Surprise quantification measures how well the world model predicted what actually happened. High surprise = out-of-distribution input = time to slow down, ask for human oversight, or switch strategies. This is one of the most underrated capabilities in safety-critical AI.
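Surprise can be sketched as nothing more than prediction error compared against a threshold. The mean-squared-error metric and the `threshold` value below are assumptions picked for illustration; a real system would calibrate both on held-out data.

```python
def surprise(predicted, observed):
    """Mean squared error between the predicted and the observed next state."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


def should_escalate(predicted, observed, threshold=0.05):
    """Flag a step as likely out-of-distribution when surprise is too high."""
    return surprise(predicted, observed) > threshold


# The model expected the state [1.0, 0.0]; compare two possible outcomes.
routine = should_escalate([1.0, 0.0], [1.02, 0.01])   # small prediction error
anomaly = should_escalate([1.0, 0.0], [1.6, 0.4])     # large prediction error
```

When `should_escalate` fires, the agent can slow down, hand control to a human, or switch strategies.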

📜 The Complete History of World Models (1990–2026)

The 1990s–2013: The Concept Era

Richard Sutton's 1990 description is the origin point, but the idea was already in the air. Kenneth Craik's 1943 theory of mental models proposed that the brain builds small-scale models of reality for anticipation and planning. Sutton formalized this for reinforcement learning: an agent that can predict the next state of the world can plan without trying every action for real.

The Dyna architecture (Sutton, 1991) was the first practical implementation, using a world model to generate synthetic "imagination" data for policy training alongside real experience. The gap between concept and capability was vast; Dyna worked on tiny tabular environments.

The 2012 ImageNet moment (AlexNet, from Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton) changed the landscape. Deep neural networks could suddenly recognize objects in natural images far more accurately than earlier methods. The building blocks for learning a world model from high-dimensional inputs were in place.

2015–2019: The Deep RL Era

Google Brain's PlaNet (2019) was the first model to plan entirely in a compact learned latent space, predicting the future not in pixel space but in an abstract representation. The key innovation: a Recurrent State Space Model (RSSM) that maintained uncertainty estimates across time.

Danijar Hafner's DreamerV1 (2019) built on PlaNet with an actor-critic policy trained entirely in imagination. The agent learned visual control tasks without heavy interaction with the real environment. "World Models" (Ha & Schmidhuber, 2018) had popularized the idea with a vision model (V), memory module (M), and controller (C); Dreamer made it state-of-the-art.

2020–2023: Scale and Architecture

DreamerV2 (2020) introduced discrete latent representations using categorical variables, a counterintuitive choice that improved stability and enabled the model to learn sharper, more distinct world states. DreamerV3 (2022) achieved something remarkable: a single set of hyperparameters that worked across 7 completely different domains (including Atari, continuous control, 3D navigation, and Minecraft) without any domain-specific tuning. This was the first hint that a universal world model was plausible.

Yann LeCun published his JEPA (Joint Embedding Predictive Architecture) paper in 2022, arguing that the field had been wasting enormous capacity by having models predict in pixel space (what exactly does the next video frame look like?) rather than in abstract representation space (what is the essence of the change?). The paper also introduced a philosophical argument: intelligence requires building models of the world, not just pattern-matching on text.

Google DeepMind's Genie (2023) demonstrated something new: a world model trained purely on internet gameplay video, with no action labels, could infer the latent action space and enable interactive control of novel environments. Text-to-playable-world. One frame per second, but the concept was there.

2024–2026: The Billion-Dollar Era

The release of V-JEPA (2024) showed that video JEPA models could learn powerful representations for physical reasoning. But the decisive moment was when Google DeepMind released Genie 2 in late 2025, generating 3D interactive worlds at 24 FPS from a single image, with consistent physics. Suddenly world models weren't a research curiosity; they were production-grade.

Yann LeCun left Meta to co-found AMI Labs, which raised $1.03 billion in 2026 specifically to build general-purpose world models. Meta's V-JEPA 2, released in June 2025, had already shown what the approach could do: trained on 1 million hours of internet video and fine-tuned on 62 hours of robot interaction data, it achieved ~80% success on zero-shot robotic manipulation tasks.

The world model race was fully underway.

🧩 How World Models Actually Work: The Architecture

The Core Loop

Every world model implements some version of the same loop: observe the current state, imagine candidate action sequences H steps ahead, score the imagined futures against an objective, execute the best first action, re-observe, and repeat. This is Model Predictive Control (MPC) with a receding horizon.
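The receding-horizon loop can be sketched with the simplest possible planner, random shooting. The one-dimensional `world_model`, the action range, and the `horizon` and `num_candidates` values are all hypothetical stand-ins; this is not any particular system's planner.

```python
import random


def world_model(state: float, action: float) -> float:
    """Toy stand-in for learned dynamics: the action shifts the state."""
    return state + action


def plan_first_action(state, goal, rng, horizon=3, num_candidates=300):
    """Random-shooting MPC: imagine candidate futures, keep the best first action."""
    best_cost, best_first_action = float("inf"), 0.0
    for _ in range(num_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        imagined, cost = state, 0.0
        for action in actions:
            imagined = world_model(imagined, action)  # predict, do not act
            cost += abs(goal - imagined)              # objective on imagined states
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action


def run_mpc(start, goal, steps=12, seed=0):
    """Receding horizon: execute only the first action, re-observe, replan."""
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        action = plan_first_action(state, goal, rng)
        state = world_model(state, action)  # a real agent would step the environment here
    return state


final_state = run_mpc(start=0.0, goal=3.0)
```

Replanning at every step is what keeps small model errors from compounding: the plan is always anchored to the latest real observation.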

The Representation Problem

The hardest part isn't the prediction. It's the representation. World models must learn two things simultaneously:

A compact representation of the high-dimensional input (image, video, sensor array)
Dynamics: how that representation changes under actions

When you optimize both jointly, the training landscape has a devastating attractor: representation collapse. The model discovers that mapping every input to the same embedding makes prediction trivially easy (next state = current state = zero), drives the loss to zero, and is completely useless.
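A toy illustration of why the loss alone cannot detect collapse. The scalar encoders and predictors below are hypothetical stand-ins for networks; the point is only that a constant encoder scores as well as an informative one on prediction loss while carrying no information.

```python
def prediction_loss(encoder, predictor, transitions):
    """Mean squared latent prediction error over (obs, action, next_obs) triples."""
    total = 0.0
    for obs, action, next_obs in transitions:
        predicted = predictor(encoder(obs), action)
        total += (predicted - encoder(next_obs)) ** 2
    return total / len(transitions)


def latent_variance(encoder, observations):
    """Spread of the latents; zero means every input maps to the same point."""
    latents = [encoder(obs) for obs in observations]
    mean = sum(latents) / len(latents)
    return sum((z - mean) ** 2 for z in latents) / len(latents)


# Toy environment: the next observation is the observation plus the action.
transitions = [(0.0, 1.0, 1.0), (1.0, 2.0, 3.0), (3.0, -1.0, 2.0)]
observations = [obs for obs, _, _ in transitions]

healthy_loss = prediction_loss(lambda obs: obs, lambda z, a: z + a, transitions)
collapsed_loss = prediction_loss(lambda obs: 0.0, lambda z, a: z, transitions)
healthy_var = latent_variance(lambda obs: obs, observations)
collapsed_var = latent_variance(lambda obs: 0.0, observations)
```

Both losses are zero, so the optimizer has no reason to prefer the healthy encoder. Only a second signal, such as the latent variance, separates the two.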

The field has converged on three families of collapse-prevention strategies:

Approach	Mechanism	Examples	Trade-off
Explicit heuristics	Enforce statistical properties in latent space	VICReg, BYOL, SimSiam, SIGG	Architectural complexity or extra hyperparameters
Foundation bootstrapping	Pre-train representation, then add dynamics	V-JEPA, Genie, DMPC	Depends on quality of base model
Privileged supervision	Use labels/rewards not available at inference	Dreamer (reward signal), DINO	Requires expensive labeled data

🔬 JEPA: Predict the Idea, Not the Pixels

JEPA is LeCun's answer to the representation problem. The key insight: predicting in latent space is fundamentally different from predicting in pixel space.

When a model predicts what the next video frame will look like pixel-by-pixel, it burns enormous capacity on texture, lighting, and background: details that are irrelevant to understanding what's happening. Take a marble rolling on a table: the dynamics are simple, but a pixel-space prediction must reproduce every shadow and every reflection.

JEPA says: learn a good encoder, predict only in the encoder's latent space.

Why the Target Encoder Matters

The target encoder, an exponential moving average (EMA) of the main encoder, provides stable learning targets that don't collapse. If both the encoder and predictor could freely adjust, they'd converge to the trivial solution together. The EMA encoder moves slowly, giving the predictor a stable target to chase.
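The EMA update itself is one line. In this sketch the weights are plain lists of floats and the `momentum` of 0.99 is an assumed value for illustration, not a setting taken from any JEPA paper.

```python
def ema_update(target_weights, online_weights, momentum=0.99):
    """Move each target weight a small step toward the online encoder's weight."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_weights, online_weights)]


online = [1.0, -2.0]   # pretend the online encoder just jumped to these weights
target = [0.0, 0.0]
history = []
for _ in range(100):
    target = ema_update(target, online)
    history.append(target[0])
```

After one update the target has covered only 1% of the gap, and after 100 updates about 63%. That lag is what gives the predictor a slowly moving target instead of one it can drag toward a trivial solution.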

V-JEPA 2 adds 3D Rotary Position Embeddings (3D-RoPE) to handle the temporal dimension at billion-parameter scale, because standard positional encodings destabilize training at this size. The model turns a 64-frame video into 8,192 spatio-temporal patches, each with a 1,024-dimensional embedding.

SIGG: One Regularizer to Rule Them All

The Lay World Model (from LeCun's group at NYU) simplifies collapse prevention to a single differentiable term: SIGG (Sketching, Isotropic, Gaussian), which appears in the group's papers as SIGReg (Sketched Isotropic Gaussian Regularization).

The idea: if you take many 1D projections (sketches) through the batch of latent embeddings, and each projection looks Gaussian, then the joint distribution must be approximately isotropic Gaussian, which means the latent space is "healthy" (spread out, non-degenerate).

SIGG is cheap to compute, requires one hyperparameter, and achieves comparable stability to momentum encoders and EMA tricks without the architectural complexity. The Isaac Ward presentation at YC Paper Club framed this as "one elegant regularization term", and the empirical results back it up.
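The sketching idea can be illustrated in a few lines. One loud assumption: this sketch swaps in a crude moment-matching penalty (projected mean near 0, projected variance near 1) in place of whatever Gaussianity statistic the actual regularizer uses, and `num_sketches`, the batch size, and the embedding dimension are arbitrary.

```python
import math
import random


def random_unit_vector(dim, rng):
    """A random direction on the unit sphere (one 'sketch' direction)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sketch_penalty(embeddings, num_sketches=32, seed=0):
    """Average deviation of 1D projections from mean 0 and variance 1."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    penalty = 0.0
    for _ in range(num_sketches):
        direction = random_unit_vector(dim, rng)
        projected = [sum(d * e for d, e in zip(direction, emb)) for emb in embeddings]
        mean = sum(projected) / len(projected)
        var = sum((p - mean) ** 2 for p in projected) / len(projected)
        penalty += mean ** 2 + (var - 1.0) ** 2
    return penalty / num_sketches


data_rng = random.Random(1)
healthy = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(512)]
collapsed = [[0.5] * 8 for _ in range(512)]   # every input maps to the same point

healthy_penalty = sketch_penalty(healthy)
collapsed_penalty = sketch_penalty(collapsed)
```

The collapsed batch has zero variance along every direction, so its penalty stays near 1 or above, while the spread-out batch scores close to 0. Added to the prediction loss, a term like this makes collapse expensive.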

⚖️ Model-Free vs. Model-Based: The Live Industry Battle

This isn't a settled question. In 2026, both paradigms are deployed at scale and the debate is genuinely alive.

The Case for Model-Free

Model-free approaches are simpler, faster to iterate, and surprisingly capable. The GPT family, the Claude family, and Llama are all model-free at inference time (though reasoning models like o3 add test-time compute that partially simulates planning). There is growing empirical evidence that model-free networks implicitly learn world models in their weights. But these internal models are obfuscated, not interpretable, and not explicitly leveraged for planning.

Model-free agents are brittle on out-of-distribution inputs: the same model that writes production code can make elementary errors on slight variations. This "jaggedness" problem is a consistent finding.

The Case for Model-Based

The decisive advantage of model-based approaches is factorization. Stannis' DMPC work (Google DeepMind) demonstrated this cleanly: when a robot encounters novel dynamics (a broken ankle joint), a factorized model (a frozen action proposal plus a retrained dynamics model) recovers most of its performance after retraining only the dynamics model on a small play dataset. A joint model has to retrain everything.

The second advantage: arbitrary reward functions at test time. A world model learned on locomotion data can optimize for completely novel objectives (jumping patterns never seen in training) by swapping the test-time reward function. This is a powerful generalization property.
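Swapping the objective at test time can be sketched with an exhaustive planner over a tiny action set. The integer-position `world_model`, the `ACTIONS` tuple, and both reward functions are hypothetical; the point is that the same dynamics model serves two objectives it was never "trained" for.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical discrete action set


def world_model(state: int, action: int) -> int:
    """Toy dynamics: the action shifts a 1D position."""
    return state + action


def imagine(start, actions):
    """Roll an action sequence through the world model."""
    trajectory = [start]
    for action in actions:
        trajectory.append(world_model(trajectory[-1], action))
    return trajectory


def plan(start, reward, horizon=4):
    """Score every action sequence under whichever reward is supplied."""
    return max(product(ACTIONS, repeat=horizon),
               key=lambda actions: reward(imagine(start, actions)))


def reward_go_far(trajectory):
    """Objective 1: end as far to the right as possible."""
    return trajectory[-1]


def reward_hop_and_return(trajectory):
    """Objective 2: reach a high point, then come back to the origin."""
    return max(trajectory) - abs(trajectory[-1])


far_plan = plan(0, reward_go_far)
hop_plan = plan(0, reward_hop_and_return)
```

Nothing about `world_model` changed between the two calls. Only the scoring function did, and the plans differ accordingly.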

Property	Model-Free	Model-Based
Training simplicity	✅ Simple end-to-end	⚠️ Complex co-learning
Inference speed	✅ Fast (single forward pass)	⚠️ Slower (planning rollouts)
Novel dynamics adaptation	❌ Full retrain needed	✅ Retrain only dynamics model
Novel reward adaptation	❌ Reward must be in training	✅ Any reward at test time
Modeling error quantification	❌ Opaque	✅ Explicit uncertainty
Data efficiency	⚠️ Needs large datasets	✅ Better sample efficiency
Biological precedent	⚠️ Unclear	✅ Human cognition uses WMs

⚡ Inference-Time Scaling: When Speed = Intelligence

This is where world models and modern LLM infrastructure intersect in a way most people haven't fully processed.

The standard assumption is that inference is an implementation detail. You train the model, then you run it. Cost and latency are engineering concerns. But there's a more fundamental framing: if a model's performance scales with how much compute it uses at inference time, then tokens per second equals peak intelligence.

This is not hypothetical. OpenAI's o1, o3, and o4-mini series, Google's Gemini 2.0 Flash Thinking, and DeepSeek-R1 all improve dramatically when given more time to think. More thinking = more tokens = more compute. The chain of thought is the work.

For world models, this compounding is even stronger: each step of a planning rollout is an inference call. A world model planning 50 steps ahead makes 50× the inference calls of a single-step model. Make inference faster, and you get deeper planning for the same cost.

Speculative Decoding: The 2-3× Speedup

Transformers have a deep asymmetry: they can verify a token sequence's probability in one parallel forward pass, but they can only generate tokens one at a time. Speculative decoding exploits this:

A small draft model auto-regressively generates N candidate tokens (N sequential passes on a small model)
The large target model runs one forward pass over all N tokens to compute their probabilities
Tokens that the target "would plausibly have generated" are accepted; the rest are rejected
At the first rejection, the target samples a replacement token from its already-computed distribution; if every draft token is accepted, it samples one bonus token for free

The result: you get the output quality of the large model at roughly the cost of the small model for accepted tokens. Typical speedups: 2-3×.
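The accept/reject step is the part that is easy to get wrong, so here is a sketch of the standard speculative sampling rule: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized positive part of p − q. The three-token vocabulary and the context-independent toy models are assumptions made purely for illustration.

```python
import random


def sample(dist, rng):
    """Draw one token from a {token: probability} dictionary."""
    r, cumulative = rng.random(), 0.0
    for token, prob in dist.items():
        cumulative += prob
        if r < cumulative:
            return token
    return token  # guard against floating-point round-off


def residual(target, draft):
    """Normalized max(0, p - q): where to resample after a rejection."""
    raw = {t: max(0.0, target[t] - draft[t]) for t in target}
    total = sum(raw.values())
    return {t: v / total for t, v in raw.items()}


def speculative_round(prefix, draft_model, target_model, n, rng):
    """One draft-then-verify round; returns between 1 and n + 1 new tokens."""
    # Step 1: the small model drafts n tokens sequentially.
    context, drafted = list(prefix), []
    for _ in range(n):
        token = sample(draft_model(context), rng)
        drafted.append(token)
        context.append(token)
    # Steps 2-3: the target scores every drafted position. A real transformer
    # does this in one parallel forward pass; this loop is only illustrative.
    context, output = list(prefix), []
    for token in drafted:
        p, q = target_model(context), draft_model(context)
        if rng.random() < min(1.0, p[token] / q[token]):
            output.append(token)                        # accepted draft token
            context.append(token)
        else:
            output.append(sample(residual(p, q), rng))  # replacement token
            return output
    # Step 4: every draft was accepted, so the target adds one bonus token.
    output.append(sample(target_model(context), rng))
    return output


# Hypothetical toy models over a three-token vocabulary.
def draft_model(context):
    return {"a": 0.3, "b": 0.5, "c": 0.2}


def target_model(context):
    return {"a": 0.6, "b": 0.3, "c": 0.1}


tokens = speculative_round([], draft_model, target_model, n=4, rng=random.Random(0))
```

The acceptance rule is what preserves quality: the emitted tokens follow the target model's distribution exactly, so the draft model affects only speed, never the output distribution.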

Speculative Speculative Decoding (SSD): Hiding the Drafting Latency

Presented at the first YC Paper Club by Tanishq Kumar (Stanford), SSD removes the remaining bottleneck: the sequential dependency between rounds. In vanilla speculative decoding, round t+1 can't start until round t's verification is known.

SSD runs drafting and verification on separate hardware simultaneously: while the target model verifies round t, the draft model predicts the likely verification outcomes and pre-drafts round t+1 for them, caching the results.

Cache hit rate: 80-90%. When the pre-drafted outcome matches the actual verification result 80-90% of the time, the drafting latency is almost fully hidden. The net result: 300 tokens/second for Llama 3 70B on 4× H100s, 2× faster than SGLang with standard speculative decoding, winning on both latency and throughput.
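A back-of-the-envelope way to see why the cache hit rate matters so much. This is a toy latency model, not the paper's analysis: it assumes drafting takes no longer than verification (so it fits entirely inside the overlap), and the millisecond figures are invented.

```python
def expected_round_ms(draft_ms, verify_ms, cache_hit_rate):
    """Expected wall-clock time per round in a toy overlap model.

    On a cache hit the next draft was already produced while the target was
    verifying, so only verification sits on the critical path. On a miss the
    draft has to be redone before the next verification can start.
    """
    return verify_ms + (1.0 - cache_hit_rate) * draft_ms


# Hypothetical timings, chosen only to make the arithmetic easy to follow.
vanilla_ms = expected_round_ms(draft_ms=20.0, verify_ms=30.0, cache_hit_rate=0.0)
overlapped_ms = expected_round_ms(draft_ms=20.0, verify_ms=30.0, cache_hit_rate=0.9)
speedup = vanilla_ms / overlapped_ms
```

With these made-up numbers, a 90% hit rate takes a round from 50 ms to about 32 ms: the drafting cost has not disappeared, but it is paid only on the 10% of rounds where the guess was wrong.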

This is inference as capability, not inference as cost. Picture an entire data center working on the Riemann Hypothesis: the speed of thinking is the intelligence ceiling.

🤖 World Models for Robotics: The 2026 Deployment Picture

V-JEPA 2: From Internet Video to Robotic Manipulation

Meta's V-JEPA 2 is the clearest demonstration that internet-scale video pretraining transfers to physical manipulation:

Training Stage	Data	What's Learned
Stage 1 (pretraining)	VideoMix22M — 1M+ hours of internet video	Physical intuitions: gravity, occlusion, object permanence, cause-effect
Stage 2 (fine-tuning)	62 hours of Droid robot dataset (unlabeled)	Action-conditioned dynamics in a specific robot's physical space

The action-conditioned version (V-JEPA 2-AC) enables zero-shot model predictive control. Given a goal image, it defines an energy function as the L1 distance between the predicted latent and the goal latent. The Cross-Entropy Method (CEM) optimizes over candidate action sequences to minimize this energy:

Sample K action sequences from a proposal distribution
Roll each through the world model (H steps ahead)
Score each trajectory: lower energy = closer to goal
Update the proposal distribution toward the top-performing sequences
Execute only the first action; re-observe; repeat
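The five steps above can be sketched as a small CEM planner with an L1 energy. The additive `world_model`, the two-dimensional latent, and every hyperparameter (`horizon`, `samples`, `elites`, `iterations`) are hypothetical choices for illustration, not V-JEPA 2-AC's actual settings.

```python
import random
import statistics


def world_model(latent, action):
    """Toy latent dynamics: the action is added to the latent."""
    return [z + a for z, a in zip(latent, action)]


def energy(latent, goal_latent):
    """L1 distance between a predicted latent and the goal latent."""
    return sum(abs(z - g) for z, g in zip(latent, goal_latent))


def rollout_energy(latent, actions, goal_latent):
    """Roll an action sequence through the model and score the final latent."""
    for action in actions:
        latent = world_model(latent, action)
    return energy(latent, goal_latent)


def cem_plan(latent, goal_latent, horizon=3, samples=128, elites=16,
             iterations=8, seed=0):
    """Cross-Entropy Method over action sequences; returns the mean sequence."""
    rng = random.Random(seed)
    dim = len(latent)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iterations):
        scored = []
        for _ in range(samples):                      # step 1: sample K sequences
            seq = [[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)]
                   for t in range(horizon)]
            scored.append((rollout_energy(latent, seq, goal_latent), seq))  # steps 2-3
        scored.sort(key=lambda pair: pair[0])
        top = [seq for _, seq in scored[:elites]]
        for t in range(horizon):                      # step 4: refit the proposal
            for d in range(dim):
                values = [seq[t][d] for seq in top]
                mean[t][d] = statistics.mean(values)
                std[t][d] = statistics.pstdev(values) + 1e-6
    return mean


start, goal = [0.0, 0.0], [2.0, -1.0]
planned = cem_plan(start, goal)
first_action = planned[0]                             # step 5: execute only this one
final_energy = rollout_energy(start, planned, goal)
```

In a receding-horizon controller, `first_action` is sent to the robot, a new observation is encoded, and `cem_plan` runs again from the new latent.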

This achieves ~80% success on cup-lifting and placement tasks zero-shot: no task-specific training, no reward shaping, no demonstration data. The world model's understanding of physics is doing all the work.

DMPC: Factorized World Models for Novel Conditions

Google DeepMind's Diffusion Model Predictive Control uses diffusion models for both the action proposal and the dynamics model. The choice of diffusion is deliberate: diffusion models capture multi-modal distributions naturally, which is exactly what robot behavior looks like (there are many valid ways to achieve a goal).

The factorized architecture's killer application:

When the environment's physics change (a broken joint, a slippery surface, an attached payload), only the dynamics model needs updating. Ten minutes of play data in the new environment is enough to recover performance. No full retrain. No new demonstrations. The action space is still the same; only the consequences changed.
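The factorization can be illustrated with a deliberately tiny stand-in. DMPC uses diffusion models for both components; this sketch replaces them with a fixed candidate list (the frozen proposal) and a one-parameter linear dynamics model refit by least squares, and the play data is invented.

```python
def fit_gain(play_data):
    """Least-squares fit of next_state - state = gain * action."""
    numerator = sum(a * (s_next - s) for s, a, s_next in play_data)
    denominator = sum(a * a for _, a, _ in play_data)
    return numerator / denominator


def propose_actions():
    """Frozen action proposal: the same candidates before and after the damage."""
    return [i / 10.0 for i in range(-20, 21)]


def plan(state, goal, gain):
    """Pick the proposed action whose predicted outcome lands closest to the goal."""
    return min(propose_actions(), key=lambda a: abs(goal - (state + gain * a)))


def damaged_robot_step(state, action):
    """Ground truth after the damage: every action has half its former effect."""
    return state + 0.5 * action


# (state, action, next_state) triples collected by playing in the damaged robot.
play_data = [(0.0, 1.0, 0.5), (0.5, -2.0, -0.5), (-0.5, 1.5, 0.25)]

stale_gain = 1.0                    # dynamics model learned before the damage
new_gain = fit_gain(play_data)      # only the dynamics model is refit

stale_result = damaged_robot_step(0.0, plan(0.0, 1.0, stale_gain))
adapted_result = damaged_robot_step(0.0, plan(0.0, 1.0, new_gain))
```

With the stale model the robot stops halfway to the goal; after refitting one parameter on three transitions it reaches it, and `propose_actions` was never touched.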

🏢 The 2026 World Model Competitive Landscape

Company	System	Key Strength	2026 Status
Meta / AMI Labs	V-JEPA 2	Video pretraining + zero-shot robotics	V-JEPA 2 released June 2025; AMI Labs raised $1.03B (reported March 2026)
Google DeepMind	Genie 2, DMPC	3D world generation (24fps); robot control	Genie 2 deployed; DMPC published
Nvidia	Alpamayo	Physical AI for AV rare scenarios	Uber & Mobileye robotaxis planned 2026
World Labs	Marble (spatial intelligence WM)	3D spatial reasoning	Reportedly raised $1B in early 2026; Marble shipped
Runway	Gen-3 Alpha Turbo	Creative video world model	Production deployment
Wayve	GAIA-1	Autonomous driving WM	UK deployment
Waymo	Internal WM	AV simulation + planning	250K+ weekly paid rides (2026)

🧭 Two Camps: "Compress to Understand" vs "Render to Predict"

The clearest way to understand the 2026 field is that it split into two philosophies that rarely get diagrammed side by side. One camp predicts in abstract representation space (compress the world to its meaning, then predict what matters); the other generates pixels (render a plausible future frame by frame). Both are "world models" — they just disagree about where the modeling should happen.

The distinction matters because it predicts what each model is good for. "Render to predict" systems (Sora, Genie) excel at generating interactive, watchable worlds. "Compress to understand" systems (JEPA, Dreamer) don't waste capacity drawing every pixel — they predict only the abstract outcomes that matter for planning, which is why LeCun bets they'll scale further toward reasoning and robotics. As Saining Xie, co-founder of LeCun's world-model lab, frames the point: "The goal isn't to generate pretty videos — it's to understand the world."

🗂️ A Functional Taxonomy: Renderer, Simulator, Planner

In June 2026, Fei-Fei Li proposed a sharper way to sort world models by what they're for, and it slots the whole 2026 landscape cleanly into three functions:

Function	What it does	2026 examples
Renderer	Generates realistic worlds you can look at and move through	Genie 3, Sora 2
Simulator	Produces a persistent, editable 3D environment	Marble (World Labs)
Planner	Predicts consequences of actions to choose the best one	V-JEPA 2, Dreamer

A renderer makes a world visible; a simulator makes it editable; a planner makes it actionable. The most valuable systems for embodied AI live in that third column — which is exactly where the JEPA camp is aiming.

🔴 The People and Money: LeCun's AMI Labs and Saining Xie

The world-model bet is now a billion-dollar one. In March 2026, AMI Labs (Advanced Machine Intelligence) — co-founded by Turing laureate Yann LeCun with Saining Xie as chief science officer — reportedly raised a $1.03 billion seed to build world models as a general capability engine, one of the largest seed rounds in AI history. Separately, Fei-Fei Li's World Labs raised $1 billion in early 2026 and shipped Marble, its first commercial world model.

Saining Xie is worth knowing here: a computer-vision scientist who left an NYU professorship to co-found AMI, he co-authored DiT (the Diffusion Transformer), the backbone later adopted by OpenAI's Sora, along with ConvNeXt. His thesis is deliberately contrarian: that learned representation is the single most important thing in AI, and that today's language models "lack a grounded world model." He relays LeCun's blunt version of the argument:

"Everyone is just using a crutch — the language model itself. You can walk, but you can't run, can't participate in the Olympics, because the leg of visual representation is still not good enough."

— Yann LeCun, as told by Saining Xie

Two Turing Laureates, Opposite Bets

Here's the connect-the-dots twist. Yann LeCun and Yoshua Bengio shared the 2018 Turing Award (with Geoffrey Hinton) — and both are now building world models, for opposite reasons. LeCun builds world models as a capability engine (make AI smarter and more grounded). Bengio builds a world model as a safety brake — his Scientist AI is a non-agentic predictor designed to veto the harmful actions of autonomous agents. Same core idea — model the world — opposite purpose. It's one of the most illuminating splits in AI right now, and it maps neatly onto the broader tension between racing to capability and building AI guardrails.

World-model refresh, 2026: this section adds the two-camps framing, Fei-Fei Li's renderer/simulator/planner taxonomy, and the AMI Labs / World Labs funding picture. Funding figures come from public reporting and move fast — verify before quoting.

🧬 World Models for Knowledge Work: Workspace DNA

World models aren't only for robots. The same loop (observe state, predict consequences of actions, plan, execute, update) applies to any dynamic system, including your workspace.

Taskade's Workspace DNA implements this loop in the knowledge-work domain, with an identical structure:

Memory = the workspace state. Every project, document, agent instruction, and completed task is a state representation of your organization's current situation.
Intelligence = the world model. When you ask EVE "what should the team focus on this week?", it's predicting the optimal action given the current workspace state.
Execution = the action applied to the environment. Automations trigger, agents run, outputs write back to Memory, updating the state.

Each time an agent completes a task, the workspace becomes a better model of how your work actually functions. Memory feeds Intelligence. Intelligence triggers Execution. Execution creates Memory. That's the self-reinforcing world model loop, running in your team's workspace.

The Genesis agents you build are action-conditioned predictors: trained on your specific documents, they learn what outcomes different actions produce in your context. That's not just prompt-stuffing. It's domain-specific world modeling. Browse the community gallery to clone live examples and see the loop in action.

Clone a live Taskade Genesis agent app → and see Workspace DNA in action in under five minutes.

🔭 Where World Models Are Heading

The Three Frontier Questions

1. How much video data is enough?

V-JEPA 2 used 1 million hours of internet video and achieved 80% zero-shot success on robotic manipulation. Genie 2 generated 3D worlds from single images. The pattern: more video → better physical intuitions. The question is whether internet video covers the long tail of physical scenarios needed for general embodied AI.

2. Can language and world models be unified?

V-JEPA 2 already combines a language model for goal specification with a video world model for physical grounding. The next step: a single architecture that handles both, with the language model querying the world model's planning capabilities rather than generating text that describes plans.

3. Is inference-time planning the missing piece?

The SSD paper's framing is provocative: inference speed equals intelligence ceiling. For world models, this compounds, because each MPC rollout step is an inference call. The robot that can run planning 2× faster can look 2× further ahead in the same time budget. SSD-style parallelization of world model inference may be as important as the model architecture itself.

The Capability Convergence

The convergence of better world models (JEPA, DreamerV3), faster inference (speculative decoding, SSD), and richer training data (internet video, robot play) is pointing toward agents that don't just respond to prompts but plan ahead, adapt to novel conditions, and maintain coherent goals over extended time horizons.

LeCun's $1B bet isn't a gamble on a long shot. It's a bet on a specific architectural direction, JEPA, which has already demonstrated strong empirical results in robotics and video understanding. The question isn't whether world models will matter. It's which architecture, which training regime, and which application domain will be first to demonstrate general-purpose goal-directed AI.

📊 Quick Reference: Key Papers and Systems

Year	Paper / System	Authors	Key Contribution
1990	World Models (concept)	Richard Sutton	Original definition: state + action → next state
1991	Dyna	Richard Sutton	First RL + imagination hybrid
2018	World Models	Ha & Schmidhuber	V+M+C architecture; latent rollouts in RL
2019	PlaNet	Hafner et al. (Google Brain)	Planning in latent space with RSSM
2019	DreamerV1	Hafner et al.	Actor-critic entirely in imagination
2022	DreamerV3	Hafner et al.	Universal hyperparameters across 7 domains
2022	JEPA	Yann LeCun	Predict next latent, not next pixel
2023	Genie	Google DeepMind	Video-to-interactive-world, no action labels
2024	V-JEPA	Meta	Video JEPA representations for physical reasoning
2024	DMPC	Google DeepMind (Stannis et al.)	Diffusion for action proposal + dynamics
2025	Lay World Model	LeCun's group (Isaac Ward et al.)	SIGG regularizer — one term, no collapse
2025	Genie 2	Google DeepMind	24fps 3D world generation from one image
2025	SSD	Tanishq Kumar (Stanford), Tri Dao, Avner May	Parallel draft + verify → 300 tok/s
2025	V-JEPA 2	Meta	1M hrs video + robot FT → 80% zero-shot
2026	AMI Labs	Yann LeCun	$1.03B to scale JEPA to general WMs

🚀 Getting Started with World Models in Your Workflow

You don't need a $1B lab to benefit from world-model-style AI. Taskade Genesis brings the core loop (observe, predict, act, update) to any team workflow:

Build a Genesis agent trained on your project data, documentation, and past decisions. This is your Memory layer, the workspace state representation.
Ask EVE to plan a work sequence or evaluate options. This is the Intelligence layer, the world model reasoning over state to predict best actions.
Set up automations that execute on agent recommendations and write results back to projects. This is the Execution layer, actions updating the world model's training data.
Watch the workspace improve as the agent accumulates evidence about what works in your specific context. Memory feeds Intelligence. Intelligence triggers Execution. Execution creates Memory.

The loop runs every day. The workspace gets smarter every week.

Start building with Taskade Genesis → | Browse live agent apps → | See the agents platform →


Sources: Richard Sutton (1990 NIPS workshop); Ha & Schmidhuber, "World Models" (2018); Hafner et al., "DreamerV3" (2022); Yann LeCun, "A Path Towards Autonomous Machine Intelligence" (2022); Google DeepMind Genie paper (2023); Stannis et al., "DMPC" (Google DeepMind, ~2024); Isaac Ward et al., "Lay World Model" (NYU/LeCun group, 2025); Tanishq Kumar et al., "Speculative Speculative Decoding" (Stanford, 2025); Meta FAIR, "V-JEPA 2" (June 2025); AMI Labs funding announcement (March 2026); YC Paper Club Session 1, Woodside CA (2026).

Image: Yann LeCun, 2018, photo by Jérémy Barande (École polytechnique / Institut DATAIA), Wikimedia Commons, CC BY-SA 2.0.

Frequently Asked Questions

What is a world model in AI?

A world model is a neural network that predicts how a system's state will change given an action: f(observation, action) → next observation. Unlike language models that predict the next word, world models predict the next state of a physical or abstract environment. The concept dates to Richard Sutton's 1990 NIPS workshop paper. Modern world models power robotics (V-JEPA 2), game simulation (Genie), and AI agent planning, including Taskade Genesis, where the Workspace DNA loop (Memory → Intelligence → Execution) functions as a world model for collaborative work.

What is JEPA and how is it different from other world models?

JEPA (Joint Embedding Predictive Architecture) is Yann LeCun's world model framework that predicts the next latent embedding rather than the next pixel. Standard autoencoders reconstruct the input; diffusion models generate full images; JEPA instead learns a compact abstract representation and predicts how that representation changes under an action. This avoids spending model capacity on irrelevant texture details. LeCun's company AMI Labs raised $1.03B in early 2026 to scale JEPA to general world modeling for robotics and embodied AI.

What is inference-time scaling and why does it matter?

Inference-time scaling (also called test-time compute) means allocating more compute at the moment of inference (letting the model think longer, try more candidate solutions, or run more rollouts) rather than training a bigger model. OpenAI's o1/o3 series, Google's Gemini 2.0 Flash Thinking, and DeepSeek-R1 all use inference-time scaling. The key insight: if a model's performance scales with how much it thinks, then tokens per second equals peak intelligence. Speculative decoding compounds this, and Speculative Speculative Decoding (SSD) goes further by parallelizing drafting and verification to reach roughly 300 tokens/sec on Llama 3 70B.

What is speculative decoding?

Speculative decoding uses a small draft model to propose multiple tokens, which a large target model verifies in a single parallel forward pass. Because transformers can verify token probabilities for a whole sequence at once (but must generate them one by one), this exchange of extra compute for lower latency achieves 2-3x speedups. Speculative Speculative Decoding (SSD), presented at the first YC Paper Club, extends this by parallelizing the drafting and verification phases across separate hardware, achieving 300 tokens/sec for Llama 3 70B on 4 H100s with an 80-90% cache hit rate.

What is the difference between model-free and model-based AI?

Model-free AI maps observations directly to actions through a neural network with no explicit representation of future states (e.g., standard policy gradient RL, end-to-end transformers). Model-based AI trains a world model and uses it to plan, imagining what will happen before acting. Model-based approaches can quantify modeling error, adapt to novel dynamics by updating just the world model (not the policy), and use arbitrary reward functions at test time. The tradeoff: model-based needs an explicit action-proposal mechanism and is more complex to train stably.

What is representation collapse in world model training?

Representation collapse is the failure mode where a world model maps every input to the same (or very similar) latent embedding. The loss goes to zero, because predicting the next state is trivially easy if every state looks the same, but the model is useless. It occurs because co-learning representation and dynamics creates degenerate attractors. Solutions include stop-gradient tricks (SimSiam), EMA teacher encoders (BYOL, short for Bootstrap Your Own Latent), VQ-VAE codebooks, variance-covariance regularizers (VICReg), and the SIGG regularizer (Lay World Model), which enforces a Gaussian distribution in latent space using cheap 1D projections.
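
A toy example makes the failure concrete. The sketch below uses scalar "latents" and hand-written encoders and predictors (all hypothetical, nothing is trained) to show that a collapsed encoder reaches the same zero prediction loss as a useful one, and that only the spread of the embeddings tells them apart.

```python
import statistics

states = [0.0, 1.0, 2.0, 3.0, 4.0]
transitions = [(s, s + 1.0) for s in states]   # toy dynamics: next = current + 1

def latent_loss(encode, predict):
    """Mean squared error between predicted and actual next latent."""
    return statistics.fmean(
        [(predict(encode(s)) - encode(s_next)) ** 2 for s, s_next in transitions]
    )

def latent_variance(encode):
    """Spread of the embeddings over the dataset."""
    return statistics.pvariance([encode(s) for s in states])

# Healthy: the latent keeps the state, and the predictor learns the +1 step.
healthy_loss = latent_loss(lambda s: s, lambda z: z + 1.0)
healthy_var = latent_variance(lambda s: s)

# Collapsed: every state maps to 0, so "predict the same thing" is perfect.
collapsed_loss = latent_loss(lambda s: 0.0, lambda z: z)
collapsed_var = latent_variance(lambda s: 0.0)

print(healthy_loss, healthy_var)      # 0.0 2.0
print(collapsed_loss, collapsed_var)  # 0.0 0.0
```

The prediction loss alone cannot tell the two apart. The remedies listed above all add a term or mechanism that penalizes the zero-variance case.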

What is the SIGG regularizer in the Lay World Model?

SIGG stands for Sketching, Isotropic, Gaussian. The Lay World Model (from Yann LeCun's group) adds a single loss term that checks whether the batch of latent embeddings looks like an isotropic Gaussian. It does this by taking many 1D projections (sketches) of the high-dimensional latent space and checking that each projection follows a Gaussian distribution. If they all do, the joint distribution must be approximately Gaussian, meaning the latent space is healthy and non-collapsed. This is computationally cheap and requires only one hyperparameter, compared to the architectural complexity of momentum encoders or codebooks.
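
The sketch-and-check idea can be illustrated roughly as follows. This is not the paper's loss: the per-projection score is a crude moment check (mean near 0, variance near 1) assumed here for brevity, and every name is hypothetical.

```python
import math
import random

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sketch_penalty(batch, n_projections, rng):
    """Average deviation of random 1D projections from a unit Gaussian.

    Assumption: only the first two moments are checked, which is weaker
    than a real Gaussianity test.
    """
    dim = len(batch[0])
    total = 0.0
    for _ in range(n_projections):
        u = random_unit_vector(dim, rng)
        proj = [sum(a * b for a, b in zip(z, u)) for z in batch]  # 1D sketch
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        total += mean ** 2 + (var - 1.0) ** 2
    return total / n_projections

rng = random.Random(0)
healthy = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(512)]
collapsed = [[0.5] * 8 for _ in range(512)]     # every input -> same embedding

healthy_penalty = sketch_penalty(healthy, 32, rng)
collapsed_penalty = sketch_penalty(collapsed, 32, rng)
print(f"healthy: {healthy_penalty:.4f}, collapsed: {collapsed_penalty:.4f}")
```

Each projection costs one dot product per embedding, which is why the check stays cheap in high dimensions. The collapsed batch has zero variance along every direction, so its penalty is large.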

What is V-JEPA 2 and what can it do?

V-JEPA 2 is Meta's second-generation video JEPA world model, trained on 1 million hours of internet video (VideoMix22M) and fine-tuned on 62 hours of robot interaction data. It uses Vision Transformers with 3D Rotary Position Embeddings. The action-conditioned version (V-JEPA 2-AC) enables Model Predictive Control for robotics: given a goal image, it defines an energy function in latent space and uses the Cross-Entropy Method to optimize action sequences. In zero-shot generalization tests, it achieved ~80% success on cup-lifting and placement tasks without task-specific training.
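
The planning step can be sketched with a toy Cross-Entropy Method loop. The assumptions are loud ones: `toy_dynamics` is an additive stand-in for the learned action-conditioned predictor, the energy is a squared distance to the goal latent, and latents are 2D lists. None of this is V-JEPA 2's actual code or API.

```python
import random

def rollout(z, actions, dynamics):
    """Apply a sequence of actions through the (assumed) latent dynamics."""
    for a in actions:
        z = dynamics(z, a)
    return z

def energy(z, z_goal):
    """Distance between a predicted latent and the goal latent (assumed squared L2)."""
    return sum((x - g) ** 2 for x, g in zip(z, z_goal))

def cem_plan(z0, z_goal, dynamics, horizon=3, samples=64, elites=8, iters=8, rng=None):
    rng = rng or random.Random(0)
    dim = len(z0)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        plans = [
            [[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)] for t in range(horizon)]
            for _ in range(samples)
        ]
        # Keep the lowest-energy plans and refit the Gaussian to them.
        plans.sort(key=lambda p: energy(rollout(z0, p, dynamics), z_goal))
        best = plans[:elites]
        for t in range(horizon):
            for d in range(dim):
                vals = [p[t][d] for p in best]
                m = sum(vals) / elites
                mean[t][d] = m
                std[t][d] = (sum((v - m) ** 2 for v in vals) / elites) ** 0.5 + 1e-3
    return mean

toy_dynamics = lambda z, a: [x + u for x, u in zip(z, a)]
z_start, z_goal = [0.0, 0.0], [1.0, -0.5]
plan = cem_plan(z_start, z_goal, toy_dynamics)
start_energy = energy(z_start, z_goal)
final_energy = energy(rollout(z_start, plan, toy_dynamics), z_goal)
print(f"energy before: {start_energy:.3f}, after planning: {final_energy:.3f}")
```

In a receding-horizon controller, only the first action of the plan would be executed before replanning from the new observation.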

What is Diffusion Model Predictive Control (DMPC)?

DMPC (from Google DeepMind) uses diffusion models for both the multi-step action proposal and the multi-step dynamics model in a Model Predictive Control framework. The key advantage: factorizing action proposal and dynamics model separately allows adapting to novel dynamics (a robot with a broken ankle, new physics) by retraining only the dynamics model, keeping the action proposal frozen. Multi-step diffusion dynamics also reduces compounding error compared to single-step rollouts. Empirically, stronger modeling through diffusion simplifies the planner: a naive sample-based planner outperforms prior complex planning algorithms.
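
The benefit of the factorization can be shown with a naive sample-and-score planner. In this sketch, plain Python functions replace both diffusion models: `propose` is a frozen uniform sampler and the two dynamics functions are hand-written. All names and numbers are hypothetical, and only the structure (proposal and dynamics as separate, swappable parts) reflects the description above.

```python
import random

def propose(rng, horizon):
    """Frozen action proposal: stays the same when the robot changes."""
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]

def rollout(x, actions, dynamics):
    for a in actions:
        x = dynamics(x, a)
    return x

def plan(x0, goal, dynamics, n_samples=256, horizon=4, seed=0):
    """Naive planner: sample action sequences, score each with the dynamics model."""
    rng = random.Random(seed)
    candidates = [propose(rng, horizon) for _ in range(n_samples)]
    return min(candidates, key=lambda acts: abs(rollout(x0, acts, dynamics) - goal))

nominal = lambda x, a: x + a          # dynamics model learned on the healthy robot
damaged = lambda x, a: x + 0.5 * a    # true dynamics after damage (half the thrust)

goal = 1.0
stale_plan = plan(0.0, goal, nominal)      # planner still trusts the old model
adapted_plan = plan(0.0, goal, damaged)    # only the dynamics model was swapped

# Both plans are executed on the damaged robot.
stale_error = abs(rollout(0.0, stale_plan, damaged) - goal)
adapted_error = abs(rollout(0.0, adapted_plan, damaged) - goal)
print(f"stale model: {stale_error:.3f}, updated model: {adapted_error:.3f}")
```

Both planners draw the same candidate sequences from the same proposal. The only difference is which dynamics model scores them, which is the part that gets retrained.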

How is Taskade's Workspace DNA related to world models?

Taskade's Workspace DNA, Memory + Intelligence + Execution, implements the world model loop in the knowledge work domain. Memory captures workspace state (project history, documents, agent knowledge). Intelligence predicts what action to take next (EVE and Genesis agents reason over that state). Execution applies the action and updates Memory, creating a self-improving loop. This mirrors the observe-predict-act-update cycle of a world model, applied to team workflows instead of physical environments. Every Genesis agent you train makes the workspace's internal model of your work more accurate.

Who are the key players in the world model race in 2026?

The 2026 world model race includes: Meta (V-JEPA 2, video world model for robotics, 1M hours training data), AMI Labs/Yann LeCun ($1.03B raise to scale JEPA), Google DeepMind (Genie 2, text-to-playable-world at 24 FPS; DMPC for robot control), Nvidia (Alpamayo, physical AI for rare autonomous vehicle scenarios), World Labs (spatial intelligence startup, $500M+ raise), and Runway (creative video world models). Every major AI lab now has a world model research program.

Can world models replace large language models?

World models and LLMs serve different purposes. LLMs excel at language understanding, generation, and knowledge retrieval. World models excel at predicting physical state transitions, enabling planning under novel conditions, and grounding reasoning in physical constraints. The frontier is hybrid architectures: LLMs provide language understanding and goal specification, while world models provide physical grounding and planning. V-JEPA 2 already uses an LLM for language alignment on top of the video world model. The future is likely multimodal systems that combine both.