Mixture of Experts - How Modern LLMs Scale Efficiently
Author: Venkata Sudhakar
ShopMax India's AI team evaluates LLM providers for its services. Understanding how the Mixture of Experts (MoE) architecture works helps the team make better decisions about model selection and cost. MoE is used in models such as Mixtral 8x7B and, reportedly, GPT-4, enabling much larger model capacity at a fraction of the inference cost of a comparably sized dense model.
In a standard dense transformer, every parameter is used for every token. MoE replaces each feed-forward layer with N expert sub-networks and a router that activates only K experts per token (typically K=2). A 47B-parameter MoE model like Mixtral 8x7B activates only about 13B parameters per token, so it runs at roughly the speed of a 13B dense model while achieving quality closer to that of a 47B one.
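The top-K routing step can be sketched in a few lines. This is a toy illustration with made-up dimensions and a random router matrix, not Mixtral's actual router:

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, TOP_K, D_MODEL = 8, 2, 16  # illustrative sizes, not Mixtral's real dims

def top_k_route(x, w_router, k=TOP_K):
    """Route one token: pick the top-k experts and softmax-normalize their scores."""
    logits = x @ w_router                    # one score per expert, shape (N_EXPERTS,)
    top = np.argsort(logits)[-k:][::-1]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    return top, weights / weights.sum()

w_router = rng.normal(size=(D_MODEL, N_EXPERTS))
token = rng.normal(size=D_MODEL)
experts, weights = top_k_route(token, w_router)
# Only the chosen experts' feed-forward networks run for this token;
# their outputs are combined using these normalized weights.
print("selected experts:", experts, "weights:", weights)
```

With K=2 of 8 experts selected, only a quarter of the expert parameters are touched per token, which is where the dense-vs-MoE cost gap comes from.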
The example below demonstrates loading Mixtral 8x7B via Hugging Face Transformers with 4-bit quantization and running a product warranty query for ShopMax India.
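A sketch of that setup, assuming the `mistralai/Mixtral-8x7B-Instruct-v0.1` checkpoint, the `bitsandbytes` library, and a CUDA GPU with roughly 24 GB of VRAM. The warranty answer comes from the prompt framing here, not from a real ShopMax knowledge base:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# 4-bit weight quantization cuts VRAM needs roughly 4x versus fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)

prompt = (
    "[INST] You are a customer support assistant for ShopMax India. "
    "What warranty do you offer on electronics? [/INST]"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=120, do_sample=False)

# Print only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
print("Active parameters per token: ~13B of 47B total (2 of 8 experts)")
```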
It produces output along these lines:
ShopMax India provides a standard 1-year manufacturer warranty on all
electronics, with an option to purchase an extended 2-year warranty at
checkout. Warranty claims can be initiated online or at any service
centre in Mumbai, Delhi, Bangalore, Hyderabad, and Chennai.
Active parameters per token: ~13B of 47B total (2 of 8 experts)
MoE models require more total GPU memory than their active-parameter count suggests, since all experts must fit in VRAM even though only two are active per token. Use 4-bit quantization to reduce memory usage when running locally. For ShopMax India's API deployments, hosted MoE models from Mistral AI or Together AI are more cost-effective than self-hosting, since providers amortize the memory cost across many concurrent users.
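The memory trade-off is easy to estimate. A back-of-envelope sketch for weight storage only (activations and KV cache add more in practice; the 47B figure is Mixtral's approximate total parameter count):

```python
TOTAL_PARAMS = 47e9  # approximate total parameters in Mixtral 8x7B

def weight_vram_gb(params, bits_per_param):
    """VRAM needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_vram_gb(TOTAL_PARAMS, bits):.0f} GB")
```

At fp16 the weights alone need roughly 94 GB, which is why 4-bit quantization (around 24 GB) is the practical floor for single-GPU local runs, and why hosted serving spreads that fixed memory cost over many users.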