Introducing SegMoE: Segmind Mixture of Diffusion Experts

Today, we're excited to announce the world's first open-source Mixture of Experts (MoE) framework for Stable Diffusion. Mixture of Experts models are a class of sparse deep learning models in which certain dense layers are replaced by sparse layers composed of several expert sub-layers and a gate layer. The gate routes each input to only a few of the experts, saving computation and decreasing inference time. This architecture lets a model reach a large total parameter count while keeping inference cost low.
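
To make the routing idea concrete, here is a minimal, illustrative PyTorch sketch of a sparse MoE feed-forward layer with top-k gating. This is not the SegMoE implementation; the class and parameter names are ours, chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFeedForward(nn.Module):
    """Toy sparse MoE layer: a gate routes each token to its top-k expert MLPs."""

    def __init__(self, dim, hidden_dim, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts, bias=False)  # router / gate layer
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (batch, seq_len, dim) -> flatten to a list of tokens for routing
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.gate(tokens)                           # (num_tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (indices == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_idx.numel() == 0:
                continue  # this expert received no tokens, so no compute is spent on it
            out[token_idx] += weights[token_idx, slot, None] * expert(tokens[token_idx])
        return out.reshape_as(x)
```

Only the experts that the gate selects for a given token run at all, which is how a sparse model keeps inference cost far below that of a dense model with the same total parameter count.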

Taking inspiration from the Mixtral model, we decided to create a framework similar to mergekit for combining Stable Diffusion models, which increases the combined model's knowledge significantly while keeping inference time close to that of a single Stable Diffusion model. Our framework, SegMoE, combines Stable Diffusion 1.5 and Stable Diffusion XL models into a single Mixture of Experts style model, boosting prompt adherence and versatility. In the future, we aim to add support for more models as well as for training SegMoE models, which could further improve quality and provide a new SOTA model for text-to-image generation.

What is SegMoE?

SegMoE models follow the same architecture as Stable Diffusion. Like Mixtral 8x7b, a SegMoE model combines multiple models in one. This works by replacing some feed-forward layers with a sparse MoE layer, which contains a router network that selects the experts best suited to process each token. You can use the segmoe package to create your own MoE models! The process takes just a few minutes. For further information, please visit the GitHub repository. We took inspiration from the popular library mergekit when designing segmoe, and we thank the contributors of mergekit for such a useful library.
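
As a rough sketch of the workflow: you describe a base model and a list of experts in a YAML config, then hand the config to the segmoe pipeline, which assembles and saves the merged model. The config field names follow the segmoe README at the time of writing and the model IDs are only examples; treat both as assumptions and check the GitHub repository for the authoritative schema.

```python
from segmoe import SegMoEPipeline

# Example config for a 2x1 model (2 experts, 1 active per token).
# Field names and values are illustrative; see the segmoe repository for the exact schema.
config = """\
base_model: stabilityai/stable-diffusion-xl-base-1.0
num_experts: 2
moe_layers: all            # which layers to turn into MoE layers: "ff", "attn", or "all"
num_experts_per_tok: 1     # experts active per token at inference time
experts:
  - source_model: stabilityai/stable-diffusion-xl-base-1.0
    positive_prompt: "cinematic lighting, concept art, vivid colors"
    negative_prompt: "worst quality, sketch, watermark"
  - source_model: SG161222/RealVisXL_V3.0
    positive_prompt: "photorealistic, ultra detailed, portrait"
    negative_prompt: "blurry, low quality, cartoon"
"""

with open("config.yaml", "w") as f:
    f.write(config)

# Assemble the Mixture of Experts model from the config and save it for later use.
pipeline = SegMoEPipeline("config.yaml", device="cuda")
pipeline.save_pretrained("segmoe_v0")
```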

About the name

SegMoE models are named SegMoE-AxB, where A is the number of expert models merged together and B is the number of experts active during the generation of each image. Depending on the configuration, only some layers of the model (the feed-forward blocks, the attention blocks, or all of them) are replicated; the rest of the parameters are the same as in a regular Stable Diffusion model. For more details about how MoEs work, please refer to the "Mixture of Experts Explained" post.

Inference

We release 3 merges on the Hub, which can be loaded as shown in the snippet after this list:

  1. SegMoE 2x1 has two expert models.
  2. SegMoE 4x2 has four expert models.
  3. SegMoE SD 4x2 has four Stable Diffusion 1.5 expert models.
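
Below is a hedged loading sketch using the segmoe package with one of the released checkpoints. The repository ID segmind/SegMoE-4x2-v0 and the call signature are assumptions based on the project's documentation, so consult the GitHub repository if they have changed.

```python
from segmoe import SegMoEPipeline

# Load one of the released checkpoints from the Hub (here, the SDXL-based 4x2 merge).
pipeline = SegMoEPipeline("segmind/SegMoE-4x2-v0", device="cuda")

prompt = "cosmic canvas, orange city background, painting of a chubby cat"
negative_prompt = "nsfw, bad quality, worse quality"

image = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=25,
    guidance_scale=7.5,
).images[0]
image.save("image.png")
```

The same pattern should work for the other two merges by swapping in their repository IDs.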

Samples

Images generated using SegMoE 4x2:

Images generated using SegMoE 2x1:

Images generated using SegMoE SD 4x2:

Disclaimers and ongoing work

  • Slower Speed: If the number of experts per token is larger than 1, the MoE performs computation across several expert models. This makes it slower than a single SD 1.5 or SDXL model.
  • High VRAM usage: MoEs run inference very quickly but still need a large amount of VRAM (and hence an expensive GPU). This makes it challenging to use them in local setups, but they are great for deployments with multiple GPUs. As a reference point, SegMoE-4x2 requires 24GB of VRAM in half-precision.

SegMoE is fully integrated within the Hugging Face ecosystem and comes with diffusers support.

Conclusion

We built SegMoE to provide the community with a new tool that can potentially create SOTA diffusion models with ease, simply by combining pretrained models while keeping inference times low. We're excited to see what you can build with it!