
Mixture of Experts
Definition: Mixture of Experts (MoE) is a neural network architecture that divides a model into multiple specialized sub-networks ("experts") and uses a gating mechanism to route each input to only the most relevant experts. This allows models to be much larger in total parameters while keeping computational cost manageable: only a fraction of the model activates for any given input.
MoE has become a foundational architecture for frontier AI models. GPT-4 is widely reported to use MoE, Mistral's Mixtral models popularized open-source MoE, and Google's Switch Transformer demonstrated trillion-parameter MoE models as early as 2021.
Why MoE Matters
The core insight of MoE is that not every parameter needs to activate for every input. A question about cooking doesn't need the same neural pathways as a question about quantum physics. By routing inputs to specialized experts, MoE achieves:
- Larger models, lower cost: A 1.8T-parameter MoE model might only activate 280B parameters per query, dramatically reducing inference cost compared to a dense model of the same total size
- Better specialization: Individual experts can develop deep expertise in specific domains or tasks
- Faster training: Sparse activation allows training larger models on the same hardware budget
- Scalability: MoE scales more efficiently than dense transformers as model size increases
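The cost gap can be made concrete with back-of-the-envelope arithmetic, using the reported (unconfirmed) frontier-scale figures from the first bullet:

```python
# Illustrative arithmetic only; the 1.8T / 280B figures are reported
# estimates for a frontier MoE model, not confirmed specifications.
total_params = 1.8e12    # total parameters across all experts
active_params = 280e9    # parameters actually activated per token

active_fraction = active_params / total_params
print(f"Active fraction per token: {active_fraction:.1%}")  # → 15.6%
```

Per-token compute scales with the active parameters, so this model pays roughly one-sixth the inference cost of a dense model with the same total parameter count.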
How MoE Works
The Architecture
A standard MoE layer replaces the feed-forward network in a transformer block with multiple parallel expert networks and a gating (router) mechanism:
- Input arrives: A token embedding enters the MoE layer
- Router decides: A lightweight gating network scores each expert's relevance to this input
- Top-K selection: Only the top K experts (typically 2) are activated for this input
- Expert processing: The selected experts process the input independently
- Weighted combination: Expert outputs are combined using the router's weights to produce the final output
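The five steps above can be sketched in a few lines of plain Python. This is a deliberately toy version: real MoE layers operate on vectors with learned router matrices, whereas here the "token" is a single number and each "expert" is a trivial function. The names and numbers are invented for illustration.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, experts, router_weights, top_k=2):
    """Minimal MoE forward pass for a single scalar token.

    experts        -- list of callables (stand-ins for feed-forward nets)
    router_weights -- one weight per expert (stand-in for a learned router)
    """
    # 1-2. Router scores each expert's relevance to this input.
    scores = [w * x for w in router_weights]
    # 3. Top-K selection: keep only the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Renormalize the selected scores into combination weights.
    gate = softmax([scores[i] for i in top])
    # 4-5. Run only the chosen experts and combine their outputs.
    return sum(g * experts[i](x) for g, i in zip(gate, top))

# Four toy "experts", each a trivial function of the input.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2, lambda x: -x]
router_weights = [0.1, 0.9, 0.5, -0.2]

y = moe_layer(3.0, experts, router_weights)
```

Note that the experts outside the top-k are never called, which is where the compute savings come from.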
Key Components
Experts: Each expert is a standard feed-forward neural network. In a model with 16 experts and top-2 routing, only 2 experts activate per token, meaning only ~12.5% of expert parameters are used per inference step.
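The ~12.5% figure is simply the top-k ratio; a quick sanity check, assuming all experts are the same size:

```python
num_experts = 16
top_k = 2

# Fraction of expert parameters touched per token when all experts
# are equally sized: 2 of 16 experts.
expert_fraction = top_k / num_experts
print(f"{expert_fraction:.1%}")  # → 12.5%
```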
Router/Gating Network: A small learned network that decides which experts handle each input. The router is trained jointly with the experts through a combination of the main task loss and load-balancing auxiliary losses.
Load Balancing: Without intervention, routers tend to route most inputs to a few popular experts while others sit idle. Auxiliary loss functions and capacity constraints ensure balanced utilization across all experts.
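One common formulation of such an auxiliary loss comes from the Switch Transformer paper: for N experts, the loss is N · Σᵢ fᵢ·Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the mean router probability assigned to expert i. A minimal sketch with made-up numbers (not any real model's routing statistics):

```python
def load_balancing_loss(token_fractions, mean_router_probs):
    """Switch-Transformer-style auxiliary loss: N * sum_i(f_i * P_i).

    token_fractions   -- fraction of tokens routed to each expert (f_i)
    mean_router_probs -- mean router probability per expert (P_i)
    """
    n = len(token_fractions)
    return n * sum(f * p for f, p in zip(token_fractions, mean_router_probs))

# Perfectly balanced routing over 4 experts: loss hits its minimum of 1.0.
balanced = load_balancing_loss([0.25] * 4, [0.25] * 4)

# Collapsed routing (one expert gets nearly everything): loss is much higher,
# so gradient descent is pushed back toward balanced expert utilization.
collapsed = load_balancing_loss([1.0, 0.0, 0.0, 0.0], [0.97, 0.01, 0.01, 0.01])
```

Because the loss is minimized under a uniform routing distribution, adding a small multiple of it to the main task loss nudges the router to spread tokens across experts.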
MoE vs. Dense Models
| Property | Dense Model | MoE Model |
|---|---|---|
| Active parameters | All (100%) | Subset (~12-25%) |
| Total parameters | Limited by compute | Can be much larger |
| Inference speed | Proportional to total size | Proportional to active size |
| Specialization | All parameters serve all inputs | Experts specialize by input type |
| Memory | Lower (smaller total) | Higher (full model in memory) |
| Training efficiency | Straightforward | Requires load balancing |
| Examples | Llama 3.1 405B | Mixtral 8x22B, GPT-4 (reported) |
Notable MoE Models
| Model | Total Params | Active Params | Experts | Released |
|---|---|---|---|---|
| GPT-4 (reported) | ~1.8T | ~280B | 16 experts, top-2 | 2023 |
| Mixtral 8x7B | 46.7B | 12.9B | 8 experts, top-2 | 2023 |
| Mixtral 8x22B | 141B | 39B | 8 experts, top-2 | 2024 |
| Switch Transformer | 1.6T | ~1.6B | 2048 experts, top-1 | 2021 |
| DeepSeek-V3 | 671B | 37B | 256 experts, top-8 | 2024 |
MoE and the Future of AI
MoE is becoming the default architecture for frontier models because it decouples total model capacity from per-token inference cost: parameters can grow without compute growing in proportion. Key trends:
- Sparse-by-default: New frontier models increasingly adopt MoE rather than dense architectures
- Fine-grained experts: Moving from 8-16 large experts to hundreds of smaller, more specialized experts (as in DeepSeek-V3 with 256 experts)
- Open-source MoE: Mixtral and DeepSeek have made MoE accessible to the open-source community
- Hardware optimization: GPU manufacturers are optimizing hardware for the sparse computation patterns common in MoE
Further Reading:
- What Is a Transformer? (the base architecture that MoE extends)
- What Is an LLM? (how MoE powers the largest language models)
Related Terms/Concepts
- Transformer: The base architecture. MoE replaces dense feed-forward layers in transformer blocks with sparse expert layers
- Large Language Models: The primary application of MoE architecture, enabling larger and more capable models
- Deep Learning: The broader field of neural network research that MoE architectures advance
- Fine-Tuning: Adapting pre-trained MoE models to specific tasks, which can selectively update individual experts
Frequently Asked Questions About Mixture of Experts
What is Mixture of Experts in AI?
Mixture of Experts (MoE) is a neural network architecture that splits a model into multiple specialized sub-networks (experts) and uses a router to activate only the most relevant experts for each input. This makes models larger and more capable while keeping inference costs manageable.
Why is MoE more efficient than dense models?
In a dense model, every parameter activates for every input. In MoE, only a fraction of experts activate per input (typically 2 out of 8-16). A 141B-parameter MoE model might only use ~39B parameters per query, delivering quality comparable to a much larger dense model at a fraction of the compute cost.
Does GPT-4 use Mixture of Experts?
While OpenAI has not officially confirmed the architecture, credible reports indicate GPT-4 uses a Mixture of Experts architecture with approximately 1.8 trillion total parameters and 16 experts, routing to 2 experts per token.
What is the difference between MoE and model ensembles?
Model ensembles run multiple complete models and combine their outputs. MoE uses a single model with shared components and specialized expert layers โ only activating the relevant experts per input. MoE is more parameter-efficient and is trained end-to-end as one model.