source: docs/moe.md

Mixture-of-Experts — more capacity per byte of weight

MoE replaces a transformer block’s single dense MLP with N “expert” MLPs plus a learned router that picks the top-K experts for each token. The result is a model with roughly N× the MLP parameter capacity of the dense baseline, at a per-token FLOP cost that scales with K rather than N once the sparse dispatch kernel lands.
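The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not TinyGPT's implementation: the function names are hypothetical, the "experts" are toy scaling functions standing in for MLPs, and renormalising the top-K weights so they sum to 1 is an assumption (the source does not say how the selected weights are normalised).

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_route(router_logits, k):
    """Return [(expert_index, weight), ...] for one token.

    Softmax over all experts, keep the k most probable, then renormalise
    the kept weights so they sum to 1 (an assumption, see lead-in).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of the k selected experts for token x."""
    out = [0.0] * len(x)
    for idx, w in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts: each multiplies its input by a constant (stand-ins for MLPs).
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, 0.3, 1.5]          # hypothetical router output for one token
routes = top_k_route(logits, k=2)      # experts 1 and 3 win
y = moe_forward([1.0, -1.0], experts, logits, k=2)
```

Only the two selected experts are evaluated here, which is the behaviour a sparse dispatch kernel would provide.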

References: Fedus et al., 2021 (Switch Transformer); Jiang et al., 2024 (Mixtral of Experts).


Why MoE matters at our scale

The user is shipping on a single 48 GB M-series Mac. The hard wall is “can the model fit in RAM”, not “how many FLOPs per token.” A 2B-parameter MoE (8 experts × 256M dense) loads as ~4 GB at bf16, but activates only ~500M params per token (two of the eight experts). So the capacity available locally is larger than what a dense model with the same per-token FLOPs could deliver. That’s the qualitative leverage MoE buys.
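The arithmetic behind those figures can be checked with a short sketch. The helper name is hypothetical, `top_k=2` is an assumption inferred from the ~500M active figure, and the count covers expert weights only (attention, embeddings, and the router are ignored).

```python
BYTES_PER_PARAM_BF16 = 2  # bf16 stores each weight in 2 bytes

def moe_footprint(n_experts, params_per_expert, top_k,
                  bytes_per_param=BYTES_PER_PARAM_BF16):
    """Expert-only parameter and memory estimate for an MoE model."""
    total = n_experts * params_per_expert     # everything that must sit in RAM
    active = top_k * params_per_expert        # what one token actually touches
    return {
        "total_params": total,
        "active_params": active,
        "weight_bytes": total * bytes_per_param,
    }

fp = moe_footprint(n_experts=8, params_per_expert=256_000_000, top_k=2)
summary = (f"{fp['total_params'] / 1e9:.2f} B params, "
           f"{fp['weight_bytes'] / 1e9:.1f} GB at bf16, "
           f"{fp['active_params'] / 1e6:.0f} M active per token")
print(summary)
```

On these assumptions the weights take roughly a tenth of a 48 GB machine while each token pays for about a quarter of the parameters.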

The compute saving — the famous “5× cheaper than the dense equivalent” claim — requires a real sparse scatter-gather kernel. Our first cut runs every expert on every token, weighted by the router, so per-token FLOPs are higher than dense. The architecture is correct; the perf knob is a follow-up.
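A back-of-the-envelope FLOP count shows why dense dispatch is slower than the dense baseline and what a sparse kernel would recover. This is a sketch under stated assumptions: the layer sizes are hypothetical, each expert is taken to be the same shape as the dense MLP, and biases and activation functions are ignored.

```python
def mlp_flops(d_model, d_hidden):
    # Two matmuls (d_model -> d_hidden -> d_model), 2 FLOPs per multiply-add.
    return 2 * (2 * d_model * d_hidden)

def moe_flops(d_model, d_hidden, n_experts, top_k, sparse):
    """Per-token FLOPs for an MoE block.

    sparse=False models running every expert on every token;
    sparse=True models a kernel that runs only the top_k selected experts.
    """
    router = 2 * d_model * n_experts          # one small matmul for routing logits
    experts_run = top_k if sparse else n_experts
    return router + experts_run * mlp_flops(d_model, d_hidden)

d_model, d_hidden = 256, 1024                 # hypothetical sizes
dense = mlp_flops(d_model, d_hidden)
dense_dispatch = moe_flops(d_model, d_hidden, n_experts=4, top_k=2, sparse=False)
sparse_dispatch = moe_flops(d_model, d_hidden, n_experts=4, top_k=2, sparse=True)

print(f"dense MLP:        {dense:>9,}")
print(f"MoE, all experts: {dense_dispatch:>9,}  ({dense_dispatch / dense:.2f}x dense)")
print(f"MoE, top-2 only:  {sparse_dispatch:>9,}  ({sparse_dispatch / dense:.2f}x dense)")
```

The router cost is negligible in both cases; the difference comes almost entirely from how many experts are evaluated per token.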

What’s wired today

tinygpt train --moe-experts N --moe-topk K --moe-aux-weight F
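The `--moe-aux-weight` flag scales an auxiliary load-balancing term. This page does not spell out the formula, so the sketch below uses the Switch Transformer formulation (N · Σ f_i · P_i) as an assumption; the function and argument names are hypothetical, and TinyGPT's actual loss may differ.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Switch-Transformer-style auxiliary loss for one batch.

    router_probs: per-token list of n_experts router probabilities.
    assignments:  per-token list of the expert indices chosen for that token.
    Returns N * sum_i f_i * P_i, which equals 1.0 under perfectly uniform routing.
    """
    n_tokens = len(router_probs)
    n_slots = sum(len(chosen) for chosen in assignments)

    # f_i: fraction of routing slots dispatched to expert i.
    f = [0.0] * n_experts
    for chosen in assignments:
        for e in chosen:
            f[e] += 1.0 / n_slots

    # P_i: mean router probability assigned to expert i.
    p = [sum(tok[e] for tok in router_probs) / n_tokens for e in range(n_experts)]

    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing: every expert gets an equal share.
balanced = load_balance_loss(
    [[0.25] * 4] * 4, [[0, 1], [2, 3], [0, 1], [2, 3]], n_experts=4)

# Collapsed routing: every token goes to expert 0.
collapsed = load_balance_loss(
    [[0.7, 0.1, 0.1, 0.1]] * 4, [[0], [0], [0], [0]], n_experts=4)

aux_weight = 0.01                      # hypothetical value for --moe-aux-weight
penalty = aux_weight * collapsed       # term that would be added to the LM loss
```

The term is smallest when tokens are spread evenly and grows as the router collapses onto a few experts, which is the pressure the weight is meant to control.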

The MoE block adds:

Save/load works: MoE models serialise to .tinygpt with extended manifest entries (blocks.N.moe.router.weight, blocks.N.moe.experts.E.fc_in.weight, etc.) and nExperts/moeTopK/loadBalanceWeight in the JSON header. Resume restores the same router + expert layout. The standard sample, eval, and inspect paths read these new entries via the existing header → ModelConfig flow.
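To make the naming pattern concrete, the following sketch enumerates the two manifest entries named above for a given layout. It is illustrative only: the helper is hypothetical, zero-based indices for N and E are an assumption, and the entries covered by "etc." are deliberately left out rather than guessed.

```python
def moe_manifest_keys(n_blocks, n_experts):
    """List the router and expert fc_in entries for each MoE block.

    Covers only the two patterns named in the text; any further
    per-expert tensors are not enumerated here.
    """
    keys = []
    for block in range(n_blocks):
        keys.append(f"blocks.{block}.moe.router.weight")
        for expert in range(n_experts):
            keys.append(f"blocks.{block}.moe.experts.{expert}.fc_in.weight")
    return keys

keys = moe_manifest_keys(n_blocks=2, n_experts=4)
for key in keys[:3]:
    print(key)
```

A check like this can be handy when inspecting a checkpoint to confirm that every block carries one router entry and one entry per expert.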

Smoke result

On a 200 KB corpus, tiny preset, 30 steps, byte-level:

| Config | Params | Loss (init → 30) | step/s |
|---|---|---|---|
| Dense MLP | 842 K | 6.09 → 1.76 | 55.6 |
| MoE 4 experts top-2 | 2.42 M | 5.95 → 1.68 | 29.3 |

The MoE has 2.88× the parameters and reaches a marginally lower loss (1.68 vs 1.76) in the same step count, despite running at roughly half the per-step throughput because every expert runs on every token. The real test is longer training on real data, where the benefit of the extra parameter capacity is expected to grow meaningfully.

What’s NOT shipped yet

Hyperparameter notes

Where to look in the code