The Mixture-of-Experts ML Approach
2025-06-13 14:59 UTC by Enes Zvornicanin

1. Introduction

In this tutorial, we’ll introduce Mixture-of-Experts (MoE) models, a neural network architecture that divides computation among many specialized sub-networks we call experts.

An MoE decides which experts should process each input. Only the selected experts are activated for a given input example, whereas the rest remain inactive. This technique lets an MoE model hold billions of parameters but use only a fraction per input, significantly reducing computation.

In this tutorial, we’ll explain how MoEs work, compare them with standard dense models, and discuss when to use MoEs. Also, we’ll briefly mention some examples.

2. What Is a Mixture-of-Experts Model?

A Mixture-of-Experts (MoE) model is a special neural network architecture that consists of three main components:

  • Experts
  • Gating network (router)
  • Sparse activation

(Figure: Mixture-of-Experts architecture)

Experts are independent sub-networks (feed-forward layers), each specialized to handle specific data or tasks. For example, one expert might specialize in recognizing textures, while another might identify edges. This separation allows the overall model to tackle problems more efficiently, as each expert focuses only on the type of data or task it specializes in.

The gating network (router) determines which input data is sent to which expert. It inspects each input, assigns a weight (affinity score) to every expert, and uses those weights to decide which experts receive the data. Several routing techniques exist; for instance, top-k routing selects the k experts with the highest scores and sends the input only to them.

Sparse activation refers to the fact that, for a given input, only the selected experts actually run while the rest stay idle. This is what keeps the per-input computation low even when the total number of parameters is very large.
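To make these components concrete, here's a minimal sketch in PyTorch (our choice for illustration; the approach itself is framework-agnostic). The Expert and Router classes, their names, and their sizes are hypothetical stand-ins: each expert is a small feed-forward network, and the router is a single linear layer producing one affinity score per expert.

```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    """A single expert: a small feed-forward sub-network."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class Router(nn.Module):
    """The gating network: one affinity score per expert for each input."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.w_g = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_g(x)  # shape: (..., n_experts)

# Tiny usage example: 4 token embeddings of dimension 16, 8 experts
expert = Expert(d_model=16, d_hidden=64)
router = Router(d_model=16, n_experts=8)
x = torch.randn(4, 16)
scores = router(x)          # (4, 8): one affinity score per expert and token
```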

3. Gating Mechanism and Output Calculation

3.1. Gating Network (Router)

The gating network determines which experts handle each input. Usually, it’s a small neural network that takes the input (e.g., a token embedding) and outputs a score for each expert.

For example, a common design is:

(1)   G(x) = \mathrm{softmax}(W_{g}x),

where

  • x is a d-dimensional input
  • W_{g} is the gating network’s weight matrix of shape n \times d, and n is the number of experts.

This equation projects the input into an n-dimensional score vector (one score per expert). Then, the model keeps the top k scores and sets the rest to zero. As we can see, this approach implements token-to-expert routing, where each token selects which experts to route to based on its gating scores.
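Here's a hedged sketch of this top-k, token-to-expert routing: it applies equation (1) and keeps only the k largest gating scores per token. The function name, tensor shapes, and the renormalization of the kept scores are illustrative assumptions rather than a fixed recipe.

```python
import torch
import torch.nn.functional as F

def top_k_gating(x: torch.Tensor, w_g: torch.Tensor, k: int):
    """x: (n_tokens, d) inputs, w_g: (n_experts, d) gating weights."""
    scores = F.softmax(x @ w_g.t(), dim=-1)           # G(x), shape (n_tokens, n_experts)
    topk_scores, topk_idx = torch.topk(scores, k, dim=-1)
    gates = torch.zeros_like(scores).scatter_(-1, topk_idx, topk_scores)
    gates = gates / gates.sum(dim=-1, keepdim=True)   # optional: renormalize the kept scores
    return gates, topk_idx

# Example: 4 tokens of dimension 8, 6 experts, each token routed to its 2 best experts
x = torch.randn(4, 8)
w_g = torch.randn(6, 8)
gates, chosen = top_k_gating(x, w_g, k=2)
```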

The alternative is expert-to-token routing, where experts select which tokens they want to process. Here, a gating score is computed for every expert-token pair, and each expert picks the top k tokens with the highest scores. This approach is also called expert choice routing.
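For contrast, the following minimal sketch illustrates expert choice routing under the same assumed shapes: the affinity matrix is read per expert, and each expert picks its top k tokens (its capacity), so a token may be selected by several experts or by none.

```python
import torch
import torch.nn.functional as F

def expert_choice_routing(x: torch.Tensor, w_g: torch.Tensor, capacity: int):
    """x: (n_tokens, d), w_g: (n_experts, d). Each expert picks `capacity` tokens."""
    affinity = F.softmax(x @ w_g.t(), dim=0)   # normalize over tokens, shape (n_tokens, n_experts)
    # For every expert (column), take the `capacity` tokens with the highest affinity
    weights, token_idx = torch.topk(affinity.t(), capacity, dim=-1)  # both (n_experts, capacity)
    return weights, token_idx

x = torch.randn(10, 8)        # 10 tokens of dimension 8
w_g = torch.randn(4, 8)       # 4 experts
weights, token_idx = expert_choice_routing(x, w_g, capacity=3)
```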

Lastly, there are more sophisticated routing algorithms that combine both token-to-expert and expert-to-token logic. For example, BASE Layers (Balanced Assignment of Sparse Experts) jointly consider the preferences of all tokens and the capacity constraints of all experts, aiming for a globally balanced and efficient routing strategy.

3.2. Computing the Output

After routing, the chosen expert sub-networks compute their outputs while the other experts do nothing, which reduces the cost of the forward pass. The model then combines the outputs of the selected experts into the final output, for example as a weighted sum using the gating scores. In practice, adding a small amount of randomness (noise) to the scores during training can also help balance the load among experts, preventing the router from sending all inputs to the same expert every time.
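Putting the pieces together, the toy layer below combines the selected experts' outputs as a weighted sum of their gating scores and, during training, adds a little Gaussian noise to the raw scores. The expert sizes, noise scale, and looping strategy are simplifications for readability, not an optimized implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy MoE layer: route each token to its top-k experts and mix their outputs."""
    def __init__(self, d_model: int, n_experts: int, k: int, noise_std: float = 0.1):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model)) for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k, self.noise_std = k, noise_std

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (n_tokens, d_model)
        scores = self.gate(x)                               # (n_tokens, n_experts)
        if self.training:                                   # noise helps balance the load
            scores = scores + self.noise_std * torch.randn_like(scores)
        probs = F.softmax(scores, dim=-1)
        topk_p, topk_i = torch.topk(probs, self.k, dim=-1)  # keep only the k best experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                          # weighted sum of selected experts
            idx, w = topk_i[:, slot], topk_p[:, slot:slot + 1]
            for e, expert in enumerate(self.experts):       # run expert e only on its tokens
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out

moe = SparseMoE(d_model=16, n_experts=8, k=2)
y = moe(torch.randn(32, 16))                                # 32 tokens
```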

Importantly, this design means the MoE’s total number of parameters can be huge (sum of all experts), but the active parameters per input remain low. In this way, MoEs learn to determine which experts will most effectively process a given input.

4. MoEs vs. Dense Models

Traditional dense neural networks apply all their parameters to every input, so parameter count and per-input compute grow together. In an MoE, the total parameter count can grow much faster than the per-input computation, because each input uses only a subset of the parameters. In practice, the active parameter count (the sum of parameters in the chosen experts) is what drives the computational cost.
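A quick back-of-the-envelope calculation makes the distinction concrete; the expert size and counts below are made-up round numbers, not figures from any particular model.

```python
# Hypothetical MoE layer: 64 experts of 100M parameters each, top-2 routing
params_per_expert = 100_000_000
n_experts, k = 64, 2

total_params = n_experts * params_per_expert    # what must be stored: 6.4B
active_params = k * params_per_expert           # what one input touches: 0.2B

print(f"total:  {total_params / 1e9:.1f}B parameters")
print(f"active: {active_params / 1e9:.1f}B parameters per input")
```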

Because only a few experts run, MoEs often reach a given level of accuracy with less training and inference compute than a comparable dense model. Empirically, sparse MoEs have shown a better “speed-quality” trade-off than dense networks of similar resource usage. Conversely, a dense model has more straightforward computation (no routing overhead) and might be preferable in small-scale applications for which a simpler model suffices.

Scaling MoE models in distributed settings is significantly easier than scaling dense models. In an MoE, each expert can live on a different device, and since only a small subset of experts is active per input, each device processes only a fraction of the data. This reduces communication and memory overhead, so MoEs can scale to hundreds of billions of parameters efficiently across multiple GPUs or TPUs. In contrast, dense models load and use all their parameters for every input, often requiring full model replication or complex model-parallel strategies.

Conceptually, MoE is like a team where only the relevant specialists work on a task. By routing an input to only a few experts, the MoE avoids computing with the entire network, saving computation. Consequently, adding more experts increases the total capacity, but because most experts are idle when processing input, the computation per example doesn’t significantly increase.

5. MoEs vs. Ensemble Models

MoE and ensemble models both use multiple sub-models to improve performance, but differ in how they operate. Ensemble models usually run all sub-models in parallel and combine their predictions, often through averaging or voting. In contrast, MoE models use a gating mechanism to dynamically select and activate only a few experts per input, enabling conditional computation and making them more efficient, especially at scale.

While ensembles aim for stability and generalization by aggregating multiple outputs, MoEs focus on specialization and scalability by routing inputs to the most suitable experts.
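The contrast can be sketched in a few lines: an ensemble evaluates every member and averages, while an MoE evaluates only the members its router selects. The helper functions below assume a single unbatched input and simple linear stand-ins for the sub-models; they are meant only to highlight that difference.

```python
import torch

def ensemble_predict(models, x):
    """Ensemble: every sub-model runs, predictions are averaged."""
    return torch.stack([m(x) for m in models]).mean(dim=0)

def moe_predict(experts, router, x, k=2):
    """MoE: only the k experts chosen by the router run."""
    scores = torch.softmax(router(x), dim=-1)
    topk_scores, topk_idx = torch.topk(scores, k)
    weights = (topk_scores / topk_scores.sum()).tolist()
    return sum(w * experts[i](x) for w, i in zip(weights, topk_idx.tolist()))

# Tiny usage example with linear stand-ins for the sub-models
d, n = 8, 4
experts = [torch.nn.Linear(d, d) for _ in range(n)]
router = torch.nn.Linear(d, n)
x = torch.randn(d)
y_ens = ensemble_predict(experts, x)      # all 4 sub-models run
y_moe = moe_predict(experts, router, x)   # only 2 experts run
```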

(Figure: Mixture of Experts vs. ensemble models)

6. When to Use These Models?

MoE models are very beneficial when dealing with diverse data or when a large problem can be broken into smaller pieces. For instance, in multi-modal problems, when data comes from different sources or represents various types of information, MoEs can dedicate experts to each modality, leading to better results.

Because MoE models can grow very large without becoming proportionally slower or more expensive, and because their sparsity makes them easy to distribute across many devices, they are a natural fit for large language models (LLMs). Leading LLMs have begun to adopt MoEs to reach the trillion-parameter scale. By routing tokens to a few experts, they avoid making inference prohibitively slow or expensive.

7. Drawbacks

However, MoEs also have drawbacks. They require a sophisticated distributed setup to implement expert parallelism and routing.

All experts must be stored in memory, even if not all are used in each forward pass, increasing memory usage.

The routing network introduces more hyperparameters and possible instability. In this context, instability means that the model might have unpredictable or undesirable training behavior due to the routing network’s sensitivity to its hyperparameters.

A common challenge is that the routing network can favor only a few experts. This means other experts don’t get enough training examples to learn effectively. To ensure the entire model works at its best and uses all its capacity, the router must ensure each expert is trained. Successfully tuning the hyperparameters for these complex networks and keeping them stable during training is a true art.
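One common mitigation, popularized by the Switch Transformer, is an auxiliary load-balancing loss that rewards the router for spreading tokens evenly across experts. The sketch below follows that general formulation with assumed tensor shapes; in practice, it is added to the main training loss with a small coefficient.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, expert_idx: torch.Tensor) -> torch.Tensor:
    """router_logits: (n_tokens, n_experts); expert_idx: (n_tokens,) chosen expert per token.

    Switch-Transformer-style auxiliary loss:
    n_experts * sum_i (fraction of tokens sent to expert i) * (mean router prob for expert i).
    """
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens each expert actually received
    f = torch.bincount(expert_idx, minlength=n_experts).float() / n_tokens
    # P_i: average routing probability assigned to each expert
    p = probs.mean(dim=0)
    return n_experts * torch.sum(f * p)

logits = torch.randn(128, 8)                # 128 tokens, 8 experts
chosen = logits.argmax(dim=-1)              # top-1 choice per token
aux = load_balancing_loss(logits, chosen)   # equals 1.0 for perfectly balanced routing
```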

If we have lots of data and hardware and need extreme capacity, MoE can pay off. However, if we have hardware constraints or a smaller problem, a conventional dense model may be preferable.

8. Notable MoE Models

MoE has become a popular technique, especially in large-scale deep learning. Its ability to increase model capacity while managing computational costs has led to the development of several popular and influential architectures.

Here are some of the most notable MoE models:

Model             Total parameters   Number of experts   k (active experts)
DeepSeekMoE 16B   16 Billion         64 + 2 shared       6
DeepSeek-V2       236 Billion        160 + 2 shared      6
DeepSeek-V2.5     236 Billion        160 + 2 shared      6
Grok-1            314 Billion        8                   2
Mixtral 8x7B      46.7 Billion       8                   2
Mixtral 8x22B     141 Billion        8                   2

Shared experts are those that are always active. Besides the models in the table, it’s widely reported in the AI community that recent GPT models, including GPT-4 and successors such as GPT-4o, incorporate MoE principles to achieve their scale and performance; the exact details are not public since these models are proprietary.
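To illustrate how shared experts fit in, here's a minimal sketch (with hypothetical function and module names) in the spirit of the DeepSeek-style design from the table: the shared experts run for every token, while the routed experts run only when selected.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_with_shared_experts(x, shared_experts, routed_experts, gate, k=2):
    """x: (d,) one token. Shared experts always run; routed experts only if selected."""
    out = sum(e(x) for e in shared_experts)                  # always-active part
    probs = F.softmax(gate(x), dim=-1)                       # affinity for routed experts
    topk_p, topk_i = torch.topk(probs, k)
    out = out + sum(p * routed_experts[i](x)                 # sparse, routed part
                    for p, i in zip(topk_p, topk_i.tolist()))
    return out

# Tiny usage example: 2 shared and 8 routed experts of dimension 16
d = 16
ffn = lambda: nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
shared = [ffn() for _ in range(2)]
routed = [ffn() for _ in range(8)]
gate = nn.Linear(d, 8, bias=False)
y = moe_with_shared_experts(torch.randn(d), shared, routed, gate, k=2)
```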

While MoE adoption is most visible in NLP, its principles are also being applied and explored in computer vision. This line of research often focuses on applying MoE layers to Vision Transformer (ViT) backbones or other vision architectures.

9. Conclusion

In this article, we explored the Mixture of Experts (MoE) approach and its advantages in building scalable, efficient machine learning models.

MoEs enable conditional computation, activating only a subset of the model at a time, which reduces resource usage while increasing model capacity. This makes them particularly well-suited for large-scale distributed systems. As the demand for bigger and smarter models grows, MoE stands out as a practical solution for scaling.
