Mixture of Experts

Definition: Mixture of Experts (MoE) is a neural network architecture that divides a model into multiple specialized sub-networks ("experts") and uses a gating mechanism to route each input to only the most relevant experts. This allows models to be much larger in total parameters while keeping computational cost manageable – only a fraction of the model activates for any given input.

MoE has become a foundational architecture for frontier AI models. GPT-4 is widely reported to use MoE, Mistral's Mixtral models popularized open-source MoE, and Google's Switch Transformer demonstrated trillion-parameter MoE models as early as 2021.

Why MoE Matters

The core insight of MoE is that not every parameter needs to activate for every input. A question about cooking doesn't need the same neural pathways as a question about quantum physics. By routing inputs to specialized experts, MoE achieves:

  • Larger models, lower cost – A 1.8T-parameter MoE model might only activate 280B parameters per query, dramatically reducing inference cost compared to a dense model of the same total size
  • Better specialization – Individual experts can develop deep expertise in specific domains or tasks
  • Faster training – Sparse activation allows training larger models on the same hardware budget
  • Scalability – MoE scales more efficiently than dense transformers as model size increases

How MoE Works

The Architecture

A standard MoE layer replaces the feed-forward network in a transformer block with multiple parallel expert networks and a gating (router) mechanism:

  1. Input arrives – A token embedding enters the MoE layer
  2. Router decides – A lightweight gating network scores each expert's relevance to this input
  3. Top-K selection – Only the top K experts (typically 2) are activated for this input
  4. Expert processing – The selected experts process the input independently
  5. Weighted combination – Expert outputs are combined using the router's weights to produce the final output
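The five steps above can be sketched as a toy top-K MoE layer in NumPy. The sizes, random weights, and single-hidden-layer ReLU experts are illustrative assumptions for this sketch, not any particular model's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 4, 2  # toy sizes; real models are far larger

# Router: a single linear layer that scores each expert for a token.
W_router = rng.standard_normal((d_model, n_experts))
# Experts: each is a small feed-forward network (one hidden layer here).
experts = [
    (rng.standard_normal((d_model, 16)), rng.standard_normal((16, d_model)))
    for _ in range(n_experts)
]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token embedding through its top-k experts."""
    logits = x @ W_router                        # step 2: router scores each expert
    top = np.argsort(logits)[-top_k:]            # step 3: keep only the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                     # softmax over the selected experts
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        w1, w2 = experts[i]
        out += w * (np.maximum(x @ w1, 0) @ w2)  # steps 4-5: weighted expert outputs
    return out

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (8,)
```

Note that only 2 of the 4 expert networks run for this token; the other experts' parameters are untouched, which is where the compute savings come from.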

Key Components

Experts – Each expert is a standard feed-forward neural network. In a model with 16 experts and top-2 routing, only 2 experts activate per token – meaning only ~12.5% of expert parameters are used per inference step.

Router/Gating Network – A small learned network that decides which experts handle each input. The router is trained jointly with the experts through a combination of the main task loss and load-balancing auxiliary losses.

Load Balancing – Without intervention, routers tend to route most inputs to a few popular experts while others sit idle. Auxiliary loss functions and capacity constraints ensure balanced utilization across all experts.
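As a concrete illustration of such an auxiliary loss, here is a minimal sketch of the load-balancing term popularized by the Switch Transformer: N · Σᵢ fᵢPᵢ, where fᵢ is the fraction of tokens actually routed to expert i and Pᵢ is the mean router probability for expert i. The function name and toy inputs are assumptions of this sketch:

```python
import numpy as np

def load_balancing_loss(router_logits: np.ndarray) -> float:
    """Switch-Transformer-style auxiliary loss: n_experts * sum(f_i * P_i).

    router_logits: (n_tokens, n_experts) raw scores from the gating network.
    f_i = fraction of tokens whose top-1 expert is i (hard load),
    P_i = mean router probability assigned to expert i (soft load).
    The loss reaches its minimum of 1.0 when routing is uniform.
    """
    n_tokens, n_experts = router_logits.shape
    probs = np.exp(router_logits)
    probs /= probs.sum(axis=1, keepdims=True)           # softmax per token
    top1 = probs.argmax(axis=1)
    f = np.bincount(top1, minlength=n_experts) / n_tokens
    P = probs.mean(axis=0)
    return n_experts * float((f * P).sum())

balanced = np.eye(4) * 5.0  # each token strongly prefers a different expert
print(round(load_balancing_loss(balanced), 4))  # 1.0
```

Because the loss grows when a few experts absorb most tokens, adding it (with a small coefficient) to the main task loss pushes the router toward spreading load across all experts.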

MoE vs. Dense Models

| Property            | Dense Model                     | MoE Model                        |
|---------------------|---------------------------------|----------------------------------|
| Active parameters   | All (100%)                      | Subset (~12–25%)                 |
| Total parameters    | Limited by compute              | Can be much larger               |
| Inference speed     | Proportional to total size      | Proportional to active size      |
| Specialization      | All parameters serve all inputs | Experts specialize by input type |
| Memory              | Lower (smaller total)           | Higher (full model in memory)    |
| Training efficiency | Straightforward                 | Requires load balancing          |
| Examples            | Llama 3.1 405B                  | Mixtral 8x22B, GPT-4 (reported)  |

Notable MoE Models

| Model              | Total Params | Active Params | Experts            | Released |
|--------------------|--------------|---------------|--------------------|----------|
| GPT-4 (reported)   | ~1.8T        | ~280B         | 16 experts, top-2  | 2023     |
| Mixtral 8x7B       | 46.7B        | 12.9B         | 8 experts, top-2   | 2023     |
| Mixtral 8x22B      | 176B         | 44B           | 8 experts, top-2   | 2024     |
| Switch Transformer | 1.6T         | ~1.6B         | 2048 experts, top-1| 2021     |
| DeepSeek-V3        | 671B         | 37B           | 256 experts, top-8 | 2024     |

MoE and the Future of AI

MoE is becoming the default architecture for frontier models because it breaks the linear relationship between model capability and inference cost. Key trends:

  • Sparse-by-default – New frontier models increasingly adopt MoE rather than dense architectures
  • Fine-grained experts – Moving from 8-16 large experts to hundreds of smaller, more specialized experts (as in DeepSeek-V3 with 256 experts)
  • Open-source MoE – Mixtral and DeepSeek have made MoE accessible to the open-source community
  • Hardware optimization – GPU manufacturers are optimizing hardware for sparse computation patterns common in MoE

Further Reading:

  • Transformer: The base architecture. MoE replaces dense feed-forward layers in transformer blocks with sparse expert layers
  • Large Language Models: The primary application of MoE architecture, enabling larger and more capable models
  • Deep Learning: The broader field of neural network research that MoE architectures advance
  • Fine-Tuning: Adapting pre-trained MoE models to specific tasks, which can selectively update individual experts

Frequently Asked Questions About Mixture of Experts

What is Mixture of Experts in AI?

Mixture of Experts (MoE) is a neural network architecture that splits a model into multiple specialized sub-networks (experts) and uses a router to activate only the most relevant experts for each input. This makes models larger and more capable while keeping inference costs manageable.

Why is MoE more efficient than dense models?

In a dense model, every parameter activates for every input. In MoE, only a fraction of experts activate per input (typically 2 out of 8-16). A 176B-parameter MoE model might only use 44B parameters per query – delivering quality comparable to a much larger dense model at a fraction of the compute cost.
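Using the parameter counts from the Notable MoE Models table above, the active fraction per token works out as a quick back-of-the-envelope calculation (not a benchmark; real compute cost also depends on shared layers and hardware):

```python
# Parameter counts in billions, taken from the Notable MoE Models table above.
models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "Mixtral 8x22B": (176.0, 44.0),
    "DeepSeek-V3": (671.0, 37.0),
}
for name, (total_b, active_b) in models.items():
    # Fraction of total parameters that participate in one forward pass.
    print(f"{name}: {active_b / total_b:.1%} active per token")
```

This prints roughly 27.6%, 25.0%, and 5.5% respectively, showing how fine-grained expert designs like DeepSeek-V3 push the active fraction far below the 12-25% typical of 8-expert models.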

Does GPT-4 use Mixture of Experts?

While OpenAI has not officially confirmed the architecture, credible reports indicate GPT-4 uses a Mixture of Experts architecture with approximately 1.8 trillion total parameters and 16 experts, routing to 2 experts per token.

What is the difference between MoE and model ensembles?

Model ensembles run multiple complete models and combine their outputs. MoE uses a single model with shared components and specialized expert layers โ€” only activating the relevant experts per input. MoE is more parameter-efficient and is trained end-to-end as one model.