Diverse Expert Ensembles: embarrassingly parallel LLMs from diverse experts

This is an academic paper that finds benefits from heterogeneity (different model sizes and numbers of training steps) when training embarrassingly parallel ensembles of expert models. We find that diverse models, trained for varying periods of time on diverse compute sources, perform better when merged than homogeneous models trained with identical settings.

Diverse Expert Ensembles

Mixture-of-Experts models continue to gain traction, owing to their computational efficiency and performance. Yet, despite their inherent modularity, we still train them the same way we train dense, monolithic networks: within large data centers, orchestrated by single organizations, and using homogeneous compute and identical expert hyperparameters.

We’ve long held at Gensyn that AGI will take the form of an open ecosystem of interconnected models - similar to the internet itself - rather than a monolithic model from one company, and, further, that model diversity will strengthen this ecosystem rather than weaken it.

Today we’re excited to announce our initial findings in support of this thesis.

Introducing Heterogeneous Domain Expert Ensemble (HDEE)  

HDEE is a framework for creating Diverse Expert Ensembles, i.e. heterogeneous MoE models trained in an embarrassingly parallel fashion. 

HDEE builds on the embarrassingly parallel training method of Branch-Train-Merge (BTM) from Li et al., 2022, which shows that “it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs”. Our work demonstrates that extending BTM to use heterogeneous model sizes and training parameters (i.e. supporting heterogeneous compute capabilities) actually increases the performance of the resulting merged model.
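
To make the setup concrete, here is a minimal sketch of the embarrassingly parallel training pattern. The names, model sizes, and step counts below are illustrative assumptions, not the HDEE code or the paper's settings; the point is that each expert trains on its own data shard with its own configuration and never synchronizes gradients with the others.

```python
# Minimal sketch of BTM-style embarrassingly parallel training with
# heterogeneous expert configurations. Names, sizes, and step counts are
# illustrative assumptions, not the actual HDEE implementation.
from dataclasses import dataclass

@dataclass
class ExpertConfig:
    domain: str       # data subset this expert specializes in
    n_params: int     # model size, chosen per domain / per device
    train_steps: int  # training budget, chosen per domain / per device

def train_expert(cfg: ExpertConfig) -> dict:
    """Train one expert entirely on its own domain shard.

    No gradient synchronization with other experts is needed, so each
    call can run on a different machine or compute provider.
    """
    # ... load the domain shard, build a model with ~cfg.n_params parameters,
    # ... run cfg.train_steps optimizer steps, save a checkpoint.
    return {"domain": cfg.domain, "checkpoint": f"expert-{cfg.domain}.pt"}

configs = [
    ExpertConfig("news",    n_params=125_000_000, train_steps=20_000),
    ExpertConfig("code",    n_params=350_000_000, train_steps=40_000),
    ExpertConfig("medical", n_params=350_000_000, train_steps=60_000),
]

# In practice each config would be dispatched as an independent job;
# a simple loop stands in for that here.
experts = [train_expert(cfg) for cfg in configs]
```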

Specifically, instead of training all experts with identical configurations, HDEE tailors each expert based on its data domain and/or compute capabilities. For simpler domains, a smaller model (or one trained with fewer iterations) is used; for more challenging domains, larger models and extended training are applied. 
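
A hypothetical tailoring rule might look like the following sketch; the difficulty labels, model sizes, and step counts are illustrative assumptions rather than values from the paper, and in practice the choice can also reflect the compute available to whoever trains that expert.

```python
# Hypothetical per-domain tailoring rule. Difficulty labels, model sizes,
# and step counts are illustrative assumptions, not values from the paper.
DOMAIN_DIFFICULTY = {"news": "easy", "legal": "hard", "code": "hard"}

def expert_config(domain: str) -> dict:
    if DOMAIN_DIFFICULTY[domain] == "easy":
        # Simpler domain: a smaller expert and/or fewer training steps.
        return {"domain": domain, "n_params": 125_000_000, "train_steps": 20_000}
    # Harder domain: a larger expert and/or extended training.
    return {"domain": domain, "n_params": 350_000_000, "train_steps": 40_000}

for d in DOMAIN_DIFFICULTY:
    print(expert_config(d))
```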

This tailoring leaves each expert better matched to its domain, resulting in a more capable ensemble. Overall, the heterogeneous ensemble achieves the best perplexity in 20 of the 21 evaluated domains compared with the homogeneous baseline, while using an equivalent compute budget.

Looking Ahead

HDEE provides an early glimpse of an open model ecosystem. Independent developers can train models on diverse hardware, using configurations that suit their specific data and areas of expertise. HDEE and similar methods can then combine those models into meta-models that route each query along the best path, in many ways resembling the internet itself.
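
As a rough illustration of that routing idea, the sketch below weights each expert's next-token distribution by an estimate of how relevant its domain is to the query. The weighting scheme and toy numbers here are illustrative assumptions, not the exact method used in the paper.

```python
# Illustrative ensembling of per-expert predictions, weighted by domain
# relevance. Weights and probabilities are toy values for demonstration.
import numpy as np

def ensemble_next_token_probs(expert_probs: np.ndarray,
                              domain_weights: np.ndarray) -> np.ndarray:
    """Weighted mixture of per-expert next-token distributions.

    expert_probs:   (n_experts, vocab_size) distribution from each expert
    domain_weights: (n_experts,) relevance of each expert to the query,
                    assumed to sum to 1.
    """
    return domain_weights @ expert_probs  # shape: (vocab_size,)

# Toy example: three experts over a 4-token vocabulary.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.20, 0.50, 0.20, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
weights = np.array([0.6, 0.3, 0.1])
print(ensemble_next_token_probs(probs, weights))
```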

Gensyn is building the underlying infrastructure to enable this, allowing developers to train, verify, and merge their models over all of the ML-capable hardware in the world.

To learn more, you can read the full paper here.

HDEE is fully open source and we encourage the research community to build on top of the code.