Mixture of Experts

Models that activate only a subset of their parameters per token, so capacity grows without a proportional increase in compute.

Mixture of Experts · Renmin University of China

Manifold Power Iteration: A Better Router for MoE Models

MPI redesigns MoE routers by aligning router rows with expert weight directions. On an 11B MoE model, average benchmark accuracy rises from 40.92 to 42.76 with only a 0.2% training slowdown.

Mixture of Experts · National University of Singapore

dMoE: Block-Level Expert Routing for Diffusion LLMs

dMoE aligns token-level MoE routing with block-parallel decoding in diffusion LLMs. On LLaDA2.0-mini it cuts unique experts per block from 69.5 to 14.6, keeps 99.11% accuracy, and frees 76-80% of expert memory.

Code Generation · JetBrains

Mellum 2: A 12B MoE Code Model Running at 2.5B Compute

Mellum 2 is JetBrains' open-weight 12B Mixture-of-Experts code model that activates only 2.5B parameters per token, matching dense 4B-14B baselines on software tasks at a fraction of the per-token compute.

Fine-Tuning & Adaptation · Mind Lab

MinT: Infrastructure for Training and Serving Millions of LoRA LLMs

MinT keeps one frontier base model resident and swaps only LoRA adapters, cutting the model-handoff step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while addressing million-scale adapter catalogs.

Open Models · Mistral AI

Mixtral of Experts: The 47B Sparse MoE That Runs Like a 13B Model

Mixtral 8x7B routes each token to 2 of 8 experts per layer, so it holds 47B parameters but uses only ~13B per token. It matches or beats Llama 2 70B and GPT-3.5, and is released under Apache 2.0.

Multimodal Models · SenseTime

SenseNova-U1: One Model for Multimodal Understanding and Generation

SenseNova-U1 puts image understanding and image generation in one network with shared attention. Its A3B variant hits 80.55 on MMMU and 0.91 on GenEval: a single model that reads and draws.

Mixture of Experts · Google Research

Switch Transformer: One Expert Per Token, Up to a Trillion Parameters

Switch Transformer simplifies Mixture-of-Experts by routing each token to a single expert, speeding up T5 pretraining by up to 7x at fixed compute and scaling to 1.6 trillion parameters with bfloat16 training.
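
The routing idea that runs through the entries above (2 of 8 experts per token in Mixtral, a single expert per token in Switch Transformer) can be illustrated with a minimal sketch. It is not taken from any of the listed papers: the function names, the toy experts, and the router logits are all hypothetical. It assumes the gate weights are a softmax over the selected logits only; Switch Transformer instead scales its single expert by the probability from the softmax over all experts, so the k=1 case here is only an approximation of that design.

```python
import math

def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    (Assumption of this sketch, not a claim about any specific model's code.)
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest router logits, highest first.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, logits, experts, k):
    """Weighted sum of the outputs of the k selected experts for one token."""
    return sum(w * experts[i](x) for i, w in top_k_route(logits, k))

# Eight toy "experts" that just scale their input (stand-ins for real FFN blocks).
experts = [lambda x, s=s: s * x for s in range(1, 9)]
# Made-up router logits for a single token.
router_logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.5, 1.5]

top2 = top_k_route(router_logits, k=2)  # 2 of 8 experts, as in the Mixtral entry
top1 = top_k_route(router_logits, k=1)  # a single expert, as in the Switch entry
y = moe_forward(3.0, router_logits, experts, k=2)
```

Only the selected experts are ever called, which is why the per-token compute tracks the number of active experts rather than the total parameter count.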