Mixture of Experts
Sparsely activating subsets of parameters so model capacity grows without proportional compute.
Mixture of Experts · Renmin University of China
MPI redesigns MoE routers by aligning router rows with expert weight directions. On 11B MoE, average benchmark accuracy rises from 40.92 to 42.76 with only 0.2% training slowdown.
Mixture of Experts · National University of Singapore
dMoE aligns token-level MoE routing with block-parallel decoding in diffusion LLMs. On LLaDA2.0-mini it cuts unique experts per block from 69.5 to 14.6, keeps 99.11% accuracy, and frees 76-80% of expert memory.
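To make the "unique experts per block" metric concrete, here is a minimal sketch of how it could be counted. The function name, the toy routing tables, and the block size are illustrative assumptions, not dMoE's actual code.

```python
def unique_experts_per_block(routing, block_size):
    """Count distinct experts touched by each block of consecutive tokens.

    routing: list of per-token expert-id lists (hypothetical router output).
    """
    counts = []
    for start in range(0, len(routing), block_size):
        block = routing[start:start + block_size]
        touched = set()
        for token_experts in block:
            touched.update(token_experts)
        counts.append(len(touched))
    return counts


# Toy example: one block of 4 tokens, each routed to 2 experts.
scattered = [[0, 1], [2, 3], [4, 5], [6, 7]]   # every token picks different experts
aligned = [[0, 1], [0, 1], [1, 2], [0, 2]]     # tokens in the block share experts

scattered_counts = unique_experts_per_block(scattered, 4)
aligned_counts = unique_experts_per_block(aligned, 4)
print(scattered_counts, aligned_counts)

# Relative reduction implied by the reported figures (69.5 -> 14.6).
reduction = 1 - 14.6 / 69.5
print(f"{reduction:.1%}")
```

Fewer distinct experts per block means fewer expert weights have to be resident while that block is decoded. The reported drop from 69.5 to 14.6 works out to a reduction of roughly 79%, which falls within the reported 76-80% memory saving, although the summary does not say the two figures are derived from each other.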
Code Generation · JetBrains
Mellum 2 is JetBrains' open-weight 12B Mixture-of-Experts code model that activates only 2.5B parameters per token, matching dense 4B-14B baselines on software tasks at a fraction of the per-token compute.
Fine-Tuning & Adaptation · Mind Lab
MinT keeps one frontier base model resident and swaps only LoRA adapters, cutting the model-handoff step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while addressing million-scale adapter catalogs.
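The idea of keeping the base resident while swapping adapters can be sketched with the standard LoRA update, y = Wx + scale · B(Ax). The class name `ResidentBase`, the `catalog` dictionary, and all numbers below are hypothetical and say nothing about how MinT itself is implemented.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class ResidentBase:
    """Hypothetical holder for one frozen weight matrix plus a swappable adapter."""

    def __init__(self, weight):
        self.weight = weight      # stays loaded for the whole session
        self.adapter = None       # (A, B, scale) or None

    def swap_adapter(self, adapter):
        # Only the small low-rank factors change hands; the base is untouched.
        self.adapter = adapter

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.adapter is not None:
            a, b, scale = self.adapter
            delta = matvec(b, matvec(a, x))   # B (A x), a rank-r update
            y = [yi + scale * di for yi, di in zip(y, delta)]
        return y


base = ResidentBase([[1.0, 0.0], [0.0, 1.0]])
catalog = {
    "task_a": ([[1.0, 1.0]], [[1.0], [0.0]], 0.5),    # A is 1x2, B is 2x1
    "task_b": ([[1.0, -1.0]], [[0.0], [1.0]], 2.0),
}

x = [2.0, 4.0]
base.swap_adapter(catalog["task_a"])
out_a = base.forward(x)
base.swap_adapter(catalog["task_b"])
out_b = base.forward(x)
print(out_a, out_b)
```

Because each adapter is only a pair of low-rank factors, switching tasks moves far less data than reloading a full model, which is the intuition behind the reported handoff speedups.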
Open Models · Mistral AI
Mixtral 8x7B routes each token to 2 of 8 experts per layer, so it holds 47B parameters but uses only ~13B per token. It matches or beats Llama 2 70B and GPT-3.5 and is released under Apache 2.0.
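Routing a token to 2 of 8 experts can be sketched as follows. This is a toy illustration, not Mixtral's code: the experts are simple scaling functions, the router logits are made up, and normalizing the gate weights with a softmax over the two selected logits is an assumption of this sketch.

```python
import math


def top_k_route(logits, k=2):
    """Return [(expert_index, gate_weight)] for the k largest router logits.

    Gate weights are a softmax over the selected logits only.
    """
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts; the rest never run."""
    out = [0.0] * len(x)
    for idx, gate in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


# Eight toy experts: expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * t for t in v] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]   # hypothetical router output

routes = top_k_route(logits, k=2)
y = moe_layer([1.0, 2.0], logits, experts)
print(routes, y)
```

Only the two selected experts are evaluated for the token, which is why per-token compute tracks the active parameter count rather than the total.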
Multimodal Models · SenseTime
SenseNova-U1 puts image understanding and image generation in one network with shared attention. Its A3B variant scores 80.55 on MMMU and 0.91 on GenEval, so a single model both reads and draws.
Mixture of Experts · Google Research
Switch Transformer simplifies Mixture-of-Experts by routing each token to a single expert, reaching up to 7x faster pretraining than T5 at fixed compute and scaling to 1.6 trillion parameters with bfloat16 training.
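Single-expert routing can be pictured as a dispatch step that groups tokens by their one chosen expert. The sketch below is illustrative only: `switch_dispatch` and the logits are hypothetical, and using the softmax probability of the chosen expert as the gate value is an assumption of this example.

```python
import math


def switch_dispatch(token_logits):
    """Assign each token to its single highest-scoring expert.

    token_logits: one list of router logits per token (hypothetical values).
    Returns {expert_index: [(token_index, gate_probability), ...]}.
    """
    buckets = {}
    for t, logits in enumerate(token_logits):
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        # On ties, max() keeps the first (lowest) expert index.
        best = max(range(len(logits)), key=lambda i: logits[i])
        buckets.setdefault(best, []).append((t, exps[best] / total))
    return buckets


token_logits = [
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
buckets = switch_dispatch(token_logits)
for expert, items in sorted(buckets.items()):
    print(expert, [t for t, _ in items])
```

Each token appears in exactly one bucket, so adding experts grows the parameter count without increasing the work done per token.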