Mixture of Experts - How Modern LLMs Scale Efficiently
Author: Venkata Sudhakar
ShopMax India's AI team evaluates LLM providers for its services. Understanding how the Mixture of Experts (MoE) architecture works helps the team make better decisions about model selection and cost. MoE is used in models such as Mixtral 8x7B and, reportedly, GPT-4, enabling much larger model capacity at a fraction of the inference cost of a comparably sized dense model.
In a standard dense transformer, every parameter is used for every token. MoE replaces each feed-forward layer with N expert sub-networks and a router that activates only K experts per token (typically K=2). A 47B-parameter MoE model like Mixtral 8x7B activates only about 13B parameters per token, so it runs at roughly the speed of a 13B dense model while achieving quality closer to that of a 47B one.
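The top-K routing step can be sketched in a few lines. This is a toy illustration with made-up dimensions and a random router matrix, not Mixtral's actual router:

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, TOP_K, D_MODEL = 8, 2, 16  # illustrative sizes, not Mixtral's real dims

def top_k_route(x, w_router, k=TOP_K):
    """Route one token: pick the top-k experts and softmax-normalize their scores."""
    logits = x @ w_router                    # one score per expert, shape (N_EXPERTS,)
    top = np.argsort(logits)[-k:][::-1]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    return top, weights / weights.sum()

w_router = rng.normal(size=(D_MODEL, N_EXPERTS))
token = rng.normal(size=D_MODEL)
experts, weights = top_k_route(token, w_router)
# Only the chosen experts' feed-forward networks run for this token;
# their outputs are combined using these normalized weights.
print("selected experts:", experts, "weights:", weights)
```

With K=2 of 8 experts selected, only a quarter of the expert parameters are touched per token, which is where the dense-vs-MoE cost gap comes from.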
The example below demonstrates loading Mixtral 8x7B via Hugging Face Transformers with 4-bit quantization and running a product warranty query for ShopMax India.
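A sketch of that setup, assuming the `mistralai/Mixtral-8x7B-Instruct-v0.1` checkpoint, the `bitsandbytes` library, and a CUDA GPU with roughly 24 GB of VRAM. The warranty answer comes from the prompt framing here, not from a real ShopMax knowledge base:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# 4-bit weight quantization cuts VRAM needs roughly 4x versus fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)

prompt = (
    "[INST] You are a customer support assistant for ShopMax India. "
    "What warranty do you offer on electronics? [/INST]"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=120, do_sample=False)

# Print only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
print("Active parameters per token: ~13B of 47B total (2 of 8 experts)")
```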
It produces output along these lines:
ShopMax India provides a standard 1-year manufacturer warranty on all
electronics, with an option to purchase an extended 2-year warranty at
checkout. Warranty claims can be initiated online or at any service
centre in Mumbai, Delhi, Bangalore, Hyderabad, and Chennai.
Active parameters per token: ~13B of 47B total (2 of 8 experts)
MoE models require more total GPU memory than their active-parameter count suggests, since all experts must fit in VRAM even though only two are active per token. Use 4-bit quantization to reduce memory usage when running locally. For ShopMax India's API deployments, hosted MoE models from Mistral AI or Together AI are more cost-effective than self-hosting, since providers amortize the memory cost across many concurrent users.
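The memory trade-off is easy to estimate. A back-of-envelope sketch for weight storage only (activations and KV cache add more in practice; the 47B figure is Mixtral's approximate total parameter count):

```python
TOTAL_PARAMS = 47e9  # approximate total parameters in Mixtral 8x7B

def weight_vram_gb(params, bits_per_param):
    """VRAM needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_vram_gb(TOTAL_PARAMS, bits):.0f} GB")
```

At fp16 the weights alone need roughly 94 GB, which is why 4-bit quantization (around 24 GB) is the practical floor for single-GPU local runs, and why hosted serving spreads that fixed memory cost over many users.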