MoE @ MOOS
What powered OpenAI's breakthrough might also work for MOOS
Unveiling the Power of Mixture-of-Experts (MoE)
On July 11th, 2023, a Reddit thread exposed the architecture behind OpenAI's GPT-4. It was only live for a couple of hours, but it clearly outlined the magic potion.
Everyone knows they used transformers (the T in GPT), brute-force scaling (the 'large' in LLM), and a human-feedback loop during training. What only a few people know, and what remains a closely guarded secret, is how they pulled it off. Access to all the text in the world and unlimited computing power is not enough to create a jump in model performance.
It seems they used a Mixture-of-Experts (MoE) architecture, and that this created the real breakthrough. Or, more specifically, reportedly a model built from multiple expert networks of roughly 220B parameters each (reports mention 8 to 16 experts), selected by a routing mechanism.
Why? Because this architecture outperforms earlier AI architectures in predictive power on multi-dimensional problems where the available data is limited relative to the complexity of the phenomenon. This is relevant for text or image generation (the G in GPT), but also for many other classes of AI challenges.
One of them is signal and low-resolution image interpretation at MOOS.
MoE? A Brief Look Under the Hood
A Mixture-of-Experts (MoE) architecture is a neural-network design that combines the strengths of specialized sub-networks, or "experts," to improve overall performance. In this architecture, the input data is distributed among different expert modules, each specializing in specific patterns or tasks. A gating network then determines how much each expert contributes to the final output.
Benefits of a Mixture-of-Experts architecture include:
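To make the idea concrete, here is a minimal sketch of an MoE layer in Python/NumPy. The single-linear-layer experts, the softmax gating, and all dimensions are illustrative assumptions for this post, not OpenAI's (or MOOS's) actual implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    """Toy Mixture-of-Experts layer: a gating network weighs the outputs
    of several specialized experts (single linear layers, for brevity)."""

    def __init__(self, dim_in, dim_out, n_experts, rng=None):
        rng = rng or np.random.default_rng(0)
        # One weight matrix per expert (the "specialists").
        self.experts = [rng.normal(0, 0.1, (dim_in, dim_out))
                        for _ in range(n_experts)]
        # Gating network: maps the input to one score per expert.
        self.gate = rng.normal(0, 0.1, (dim_in, n_experts))

    def forward(self, x):
        # Gating weights: how much each expert contributes for this input.
        g = softmax(x @ self.gate)                               # (batch, n_experts)
        # Each expert processes the same input independently.
        outs = np.stack([x @ w for w in self.experts], axis=1)   # (batch, n_experts, dim_out)
        # Final output is the gate-weighted combination of expert outputs.
        return np.einsum("be,beo->bo", g, outs)

# Usage: 3 experts on a 16-dimensional input.
layer = MoELayer(dim_in=16, dim_out=4, n_experts=3)
y = layer.forward(np.random.default_rng(1).normal(size=(2, 16)))
print(y.shape)  # (2, 4)
```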
- Specialization: Experts focus on specific aspects of the input data, allowing the model to handle a wide range of tasks effectively.
- Efficiency: By assigning different parts of the input to specialized experts, the model can process information more efficiently and make better use of computational resources.
- Adaptability: The gating mechanism enables the model to dynamically adjust the contribution of each expert based on the characteristics of the input data, enhancing adaptability to various tasks or changing conditions.
- Improved Performance: The combination of specialized experts often leads to enhanced overall performance compared to a single, general-purpose model, especially in complex tasks with diverse patterns.
- Robustness: MoE architectures can be more robust to variations in input data, as different experts can handle different aspects of the input, reducing the model's sensitivity to specific types of noise or uncertainties.
In summary, a Mixture-of-Experts architecture leverages specialization and dynamic routing to achieve improved efficiency, adaptability, and overall performance across a range of tasks.
MoE as a Game Changer for MOOS
At MOOS, we have adopted the MoE architecture for the AI engine that interprets signals from our sensors. The MOOS paper-based sensors produce a matrix of output signals that corresponds loosely to the force on each node. The result is a low-resolution, blurry image of the products on a MOOS sensor, an image that also shows considerable variance and disturbance over time.
Interpreting this fuzzy, unstable low-resolution image is not an easy task. Eyeballing it or using a simple algorithmic approach can already tell reliably whether a sensor is full or empty, but recognizing products and counting them with any reliability is much harder, especially when different products need to be identified on the same sensor. This is a great ML task, but not all ML techniques perform equally well on it.
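As a rough illustration of what such a signal matrix looks like in code, here is a small sketch. The 8 x 24 node grid, the baseline correction, and all values below are hypothetical, not the actual MOOS sensor layout or calibration.

```python
import numpy as np

# Hypothetical example: a shelf sensor with an 8 x 24 grid of force nodes.
# The raw readout arrives as a flat vector of node signals; reshaping it
# gives the low-resolution "image" that the AI engine interprets.
N_ROWS, N_COLS = 8, 24

def to_low_res_image(raw_readout, baseline):
    """Turn a flat sensor readout into a baseline-corrected 2D force map.

    raw_readout : 1D array of length N_ROWS * N_COLS (node signals)
    baseline    : per-node offset measured on an empty sensor
    """
    frame = np.array(raw_readout, dtype=float).reshape(N_ROWS, N_COLS)
    frame -= baseline.reshape(N_ROWS, N_COLS)   # remove per-node drift/offset
    return np.clip(frame, 0.0, None)            # negative noise -> zero force

# Simulated noisy readout with two products pressing on the sensor.
rng = np.random.default_rng(0)
baseline = rng.normal(0.05, 0.01, N_ROWS * N_COLS)
readout = baseline + rng.normal(0, 0.02, N_ROWS * N_COLS)
readout[2 * N_COLS + 3] += 1.0     # product pressing on node (2, 3)
readout[5 * N_COLS + 15] += 0.8    # product pressing on node (5, 15)

image = to_low_res_image(readout, baseline)
print(image.shape, image.max().round(2))
```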
Breaking the problem down into "experts" that each address a sub-problem and then work together does a far better job than one big ML model, even if you have all the training data you could wish for. At MOOS, we have recognized this and created multiple experts, each with its own task. For instance, one expert can focus on recognizing a known product in a certain surface area (e.g., one track or retail facing in a retail shelf), while other experts focus on completely different problems, such as recognizing different products, shifted products, or fallen products. Different types of products can also trigger the need for different experts.
Adding experts can, of course, increase complexity and processing cost. We are very much aware that when a simple approach is just as good, it is preferred. Luckily, the MoE architecture is flexible and smart enough to route a problem to a simple solution and only add complexity when it is needed. As such, MoE gives MOOS a flexible architecture that solves the simple and the current, but is also future-proof for more advanced or more challenging asks in certain situations and usage settings.
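A hedged sketch of what such per-task experts could look like in Python: the expert functions, thresholds, and registry below are hypothetical illustrations, not the actual MOOS experts.

```python
from typing import Callable, Dict, List
import numpy as np

# Hypothetical expert registry: each expert takes the low-res force image
# of one shelf region and returns a partial interpretation.
Expert = Callable[[np.ndarray], dict]

def count_known_product(region: np.ndarray) -> dict:
    """Expert for one retail facing: count items of a known product by
    dividing total force by the product's expected per-item force."""
    EXPECTED_FORCE_PER_ITEM = 0.9          # illustrative calibration value
    count = int(round(region.sum() / EXPECTED_FORCE_PER_ITEM))
    return {"count": max(count, 0)}

def detect_fallen_product(region: np.ndarray) -> dict:
    """Expert for anomalies: a fallen product spreads force over many nodes."""
    active_nodes = (region > 0.1).sum()
    return {"fallen": bool(active_nodes > 12)}   # illustrative threshold

EXPERTS: Dict[str, Expert] = {
    "count_known_product": count_known_product,
    "detect_fallen_product": detect_fallen_product,
}

def interpret_region(region: np.ndarray, tasks: List[str]) -> dict:
    """Combine the outputs of the experts assigned to this region."""
    result = {}
    for task in tasks:
        result.update(EXPERTS[task](region))
    return result

region = np.zeros((8, 6))
region[2:4, 1:3] = 0.45                      # one product footprint
print(interpret_region(region, ["count_known_product", "detect_fallen_product"]))
```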
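A minimal sketch of such cheap-first routing, assuming illustrative force thresholds rather than MOOS's actual gating logic:

```python
import numpy as np

# Illustrative thresholds; the real routing logic is not spelled out here.
EMPTY_THRESHOLD = 0.05   # total force below this -> treat the sensor as empty
SIMPLE_THRESHOLD = 1.0   # low total force -> a single generic expert suffices

def route(image: np.ndarray) -> str:
    """Decide how much machinery an incoming sensor frame needs.

    Cheap checks handle the easy cases; only ambiguous frames are sent
    to the full mixture of specialized experts."""
    total_force = image.sum()
    if total_force < EMPTY_THRESHOLD:
        return "empty_shelf"          # no experts needed at all
    if total_force < SIMPLE_THRESHOLD:
        return "single_expert"        # one generic counting expert
    return "full_moe"                 # gate across all specialized experts

print(route(np.zeros((8, 24))))         # empty_shelf
print(route(np.full((8, 24), 0.004)))   # single_expert
print(route(np.full((8, 24), 0.02)))    # full_moe
```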
Building & Leveraging MoE
Our journey with MoE is promising. The architecture already shows superior performance in current MOOS applications, and there is still huge potential to tap going forward. This includes broader generalization across product types, increasing sensitivity in discerning different types of products, and more flexibility for different placements on a sensor. All of these topics are on the roadmap, and we are building them on the back of our expanding set of client applications and use cases.
For instance, one client has asked us for quicker recognition of a product being picked up from a shelf, so that a message can be communicated (a cross-sell or a promo announcement). No problem: such a request can be added as a new expert.
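As a hedged illustration of how such a request could slot in as one more expert (the detector and threshold below are hypothetical, not the client deployment): a pickup-event expert only needs to compare consecutive sensor frames and can be registered alongside the existing experts.

```python
import numpy as np

def detect_pickup(prev_frame: np.ndarray, curr_frame: np.ndarray) -> bool:
    """Hypothetical pickup expert: a sudden localized drop in force between
    two consecutive frames suggests a product was just lifted off the shelf."""
    PICKUP_DROP = 0.5                     # illustrative force-drop threshold
    drop = prev_frame - curr_frame
    return bool(drop.max() > PICKUP_DROP)

prev_frame = np.zeros((8, 24)); prev_frame[3, 7] = 0.9   # product present
curr_frame = np.zeros((8, 24))                            # product lifted
print(detect_pickup(prev_frame, curr_frame))              # True
```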
At MOOS, we’re proud to leverage state-of-the-art AI architectures and are excited by the potential.