source: docs/PLAN.md

TinyGPT — master plan

Last verified against codebase: 2026-06-06 (eval-pipeline + serve fix + elf PRDs landed; product framing clarified to “Mac platform for building/upgrading specialists”)

Sources merged: docs/roadmap/* · docs/progress.md · docs/backlog.md · docs/feature_audit_2026_05_31.md · docs/roadmap/recent_research.md (paper catalogue → §4)

Product framing (clarified 2026-06-06): TinyGPT is a Mac platform for individuals to build and upgrade specialist models for their specific tasks — bring data, pick a local teacher, ship a fast/cheap specialist. Distillation + LoRA + QLoRA + constrained decoding are the toolkit. Local teacher = no API spend. Comprehensive multimodal roadmap (text/code/vision/voice/image-gen) under disciplined “one canonical best per slot” principle. Canonical strategy doc: docs/sessions/2026-06-06-mac-specialist-platform.md — covers Tier 1-4 backlog, multi-model architectures (phone-a-friend / cascade / LoRA hot-swap / etc.), structured-output formats beyond JSON (incl. Protobuf / SQL / GraphQL via grammar), and flagship example apps (browser agent, per-language code specialist, voice command, etc.).

Three sections: shipped, skipped, TODO. Every claim verified against the code. The first audit caught Lion/Sophia/Muon/PEFT-bundle/gradient-clipping; the second caught YOCO + GPTQ-reader + token-elim (dropped under value-add filter); the third caught embedding RMSNorm, cosine warmup, layer-wise LR decay, DeepNorm, BPE-dropout, Real CI; all shipped, all previously marked ⬜.

Status legend

| Mark | Meaning |
| --- | --- |
| ✅ | shipped, verified against code today |
| 🟡 | partial / in-session-only / verified-with-caveat |
| ⬜ | TODO, in active backlog |
| ⏸ | deferred, would build but waiting on external trigger |
| ❌ | skipped, intentionally not built (better alternative exists) |
| 🚧 | blocked, would build but cannot right now (hardware / upstream / budget) |

1. SHIPPED

Mac runtime + CLI

Audit baseline: every CLI smoke-tested on M5 Pro 2026-05-31. See feature_audit_2026_05_31.md for the full smoke trace. 30+ subcommands all green.

Mac training + post-training

Training stability (verified 2026-06-02 — these were all marked ⬜ in older docs)

PEFT bundle

All in native-mac/Sources/TinyGPTModel/PeftVariants.swift, all gated through tinygpt sft:

Inference + sampling

Quantization + compression

Optimizers

Architecture variants

Tokenization

Interpretability tools (browser playground)

Browser / Web track

WebGPU kernels (in webgpu/train*.wgsl)

Numerics-gate framework — every fast path (f16-storage, f16-compute, coop-matrix, subgroup) carries its own gate that compares against a f32 reference with a magnitude-aware tolerance. Gate-fail → silent fallback, zero regression risk. See docs/precision.md.
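A minimal sketch of what a magnitude-aware gate could look like, assuming the tolerance grows with the size of each f32 reference value and that a failed gate simply selects the reference path. The function names, `relTol`, and `absTol` are hypothetical; the actual gates and thresholds are documented in docs/precision.md.

```typescript
// Hypothetical sketch of a magnitude-aware numerics gate.
// Names and default thresholds are illustrative assumptions, not the project's values.
interface GateResult {
  pass: boolean;
  maxScaledErr: number; // worst |fast - ref| / (absTol + relTol * |ref|)
}

function numericsGate(
  fast: number[],
  ref: number[],
  relTol: number = 1e-2,
  absTol: number = 1e-3
): GateResult {
  if (fast.length !== ref.length) {
    return { pass: false, maxScaledErr: Infinity };
  }
  let maxScaledErr = 0;
  for (let i = 0; i < ref.length; i++) {
    // NaN or Inf in the fast path is an automatic failure.
    if (!isFinite(fast[i])) {
      return { pass: false, maxScaledErr: Infinity };
    }
    // Allowed error scales with the magnitude of the reference value.
    const allowed = absTol + relTol * Math.abs(ref[i]);
    const scaled = Math.abs(fast[i] - ref[i]) / allowed;
    if (scaled > maxScaledErr) maxScaledErr = scaled;
  }
  return { pass: maxScaledErr <= 1, maxScaledErr: maxScaledErr };
}

// Use the fast path only when its gate passes; otherwise fall back to the f32 reference path.
function selectPath<T>(gate: GateResult, fastPath: T, referencePath: T): T {
  return gate.pass ? fastPath : referencePath;
}
```

Scaling the tolerance by `|ref|` keeps large activations from failing on harmless rounding while still catching gross errors on small values.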

Datasets + data pipelines

Tooling + infra

Headline metrics (Mac, M5 Pro / 48 GB)

| Metric | Value | Target | Headroom |
| --- | --- | --- | --- |
| TTFT (warm) | 5.8 ms p99 | < 50 ms | ✅ 10× under |
| ITL p99 | 4.9 ms | < 30 ms | ✅ 6× under |
| Decode tok/s | 293 (mega-pilot 960M) → 696 (huge 221M) | > 50 tok/s | ✅ 6× over |
| Cold start TTFT | 24 ms (1B) | < 50 ms | ✅ 2× under |
| Training Huge | 42 ms/step | (baseline) | |
| Speedup vs browser | 17.2× | (baseline) | |
| Largest model | 960 M params (1.1 GB) | | |

Recent product surfaces (Wave 2.6, shipped 2026-05-31)

Learning artifacts (docs)


2. SKIPPED

❌ Superseded by better alternatives

❌ Dropped after audit (real cost, no payoff at our scale)

⏸ Deferred (waiting on external trigger)

| Item | Trigger | Why deferred |
| --- | --- | --- |
| cider W8A8 adoption | a 3B+ specialist ships | At ≤ 1B, Mac already 10× under realtime; cider’s prefill win is immaterial |
| ANE + GPU heterogeneous routing | Apple ships Stateful Models API (rumored late 2026) | Research-grade; current path uses private ANEMLL APIs |
| WebGPU subgroup matmul redesign | browser focus returns | Current gate fails (1415% mean_rel); fallback works |
| Vision encoder (ViT → tinygpt decoder) | vision-specialist demand becomes concrete | 2-week research-grade work; not critical-path |
| Audio I/O (Speech.framework + AVSpeechSynthesizer) | voice-mode demo becomes priority | Not in scope for Wave 3 |
| Async tool-call dispatch | parallel-tool specialist ships | LM dominates 5-100× over subprocess at current scales |
| ScreenCaptureKit raw image (CGS-init fix) | vision specialist needs raw bytes | AX tree sufficient for tool-calling specialists |
| Public launch (HF + writeup + HN) | ≥ 1 specialist beats a fair baseline | Nothing to launch yet |
| Phase 7 browser perf (subgroups / coop-matrix / WebNN) | post-HN v2 push | Current 12.1× lift is the launch story |

🚧 Blocked by hardware

🚧 Blocked by upstream library state

| Item | Blocker | Workaround |
| --- | --- | --- |
| QLoRA real-quantized base + LoRA | MLX-Swift quantized arrays don’t autograd through | Manual fake-quant in fwd (pedagogical, no memory win) |
| Sparse MoE hard routing | MLX-Swift no scatter_add | Soft (dense) routing shipped |
| Mixture-of-Depths hard top-K | same | Soft sigmoid gate shipped |
| Fast BPE encoding | swift-transformers single-threaded; 2 GB corpus = ~30 min | Rust-backed encoder via FFI (future) |
| Native int4 / int8 WebGPU matmul | spec doesn’t yet have quantized matmul extensions | Wait for subgroup / coop-matrix extensions |
| GGUF safetensors reader | not yet written | Could write (~2 days); AWQ + GPTQ readers already ship |

🚧 Blocked by budget


3. TODO

ROI-ordered. Sourced from backlog.md (the living list, last sort 2026-05-31).

Tier A — DO NEXT (north-star aligned; specialists)

Until A1 lands, every optimization is theoretical. Until Tier E (eval pipelines) lands, every specialist is unmeasurable — A1’s “ship” criterion implicitly requires E1 + E3 wired before any score can be published. Sequencing: Tier D (data) + Tier E (evals) → A1 specialist → Tier B follow-ups.

Tier D — DATA (gaps blocking specialists)

Pulled today: hermes-fc.jsonl, ultrafeedback.jsonl, MetaMathQA, alpaca-cleaned, orca_dpo_pairs, FineWeb-Edu (50K-row sample via parquet decoder). Blocked / missing for the planned specialists:

Tier E — EVAL PIPELINES (wire harnesses → automate scores)

Source code for BFCL / τ-bench / lm-eval-harness is already on disk under ~/.cache/tinygpt/datasets/_external/. Pulling source ≠ usable evaluator. Each item below is the wiring work — a tinygpt eval-<name> subcommand that takes a model path, runs the harness via subprocess, parses the score JSON, returns a clean number. Until these land, “did the specialist learn anything?” has no automated answer.
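The "parse the score JSON, return a clean number" step can be sketched as below. This is only an illustration under stated assumptions: the `{ results: { task: { metric: number } } }` shape and the `extractScore` name are hypothetical stand-ins, not the real output format of BFCL, τ-bench, or lm-eval-harness, and the subprocess call itself is left out.

```typescript
// Hypothetical harness output shape: { "results": { "<task>": { "<metric>": number } } }.
// Real harnesses emit different JSON; adapt the lookup per harness.
interface HarnessOutput {
  results: { [task: string]: { [metric: string]: number } };
}

// Parse a harness's stdout and return one clean score, or throw with a clear message.
function extractScore(stdout: string, task: string, metric: string): number {
  let parsed: HarnessOutput;
  try {
    parsed = JSON.parse(stdout) as HarnessOutput;
  } catch (e) {
    throw new Error("harness did not emit valid JSON");
  }
  const taskResults = parsed && parsed.results ? parsed.results[task] : undefined;
  if (!taskResults) {
    throw new Error("no results for task " + task);
  }
  const score = taskResults[metric];
  if (typeof score !== "number" || !isFinite(score)) {
    throw new Error("metric " + metric + " missing or not a finite number");
  }
  return score;
}
```

Failing loudly on a missing metric matters here: a silent default of 0 would read as "the specialist learned nothing" rather than "the wiring is broken".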

Architectural constraint (decided 2026-06-05): every E* item MUST emit structured JSONL conforming to a shared eval schema (E0). That makes two critical comparisons possible:

  1. Cross-model: TinyGPT vs SmolLM2 vs Qwen3 vs Phi-mini on the same task — without this, “we trained a model” doesn’t answer “is it any good?”
  2. Cross-checkpoint (training dynamics): every save-history checkpoint scored against the same task → see WHEN a capability emerges. Pairs with B13 interp-on-checkpoints — interp explains WHY features appeared, eval confirms IF they’re useful.

Both fall out for free if E0 + E8 are designed in, not retrofitted.
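To make the "fall out for free" claim concrete, here is a sketch of how both comparisons reduce to grouping rows of one shared schema. The `EvalRow` fields are hypothetical; the real E0 schema is not specified in this document.

```typescript
// Hypothetical E0 row; field names are illustrative assumptions.
interface EvalRow {
  model: string;  // a TinyGPT run or a baseline name
  step: number;   // training step of the checkpoint (0 for external baselines)
  task: string;
  metric: string;
  score: number;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseEvalJsonl(text: string): EvalRow[] {
  const rows: EvalRow[] = [];
  const lines = text.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line.length === 0) continue;
    rows.push(JSON.parse(line) as EvalRow);
  }
  return rows;
}

// Cross-model view: best score per model on one task.
function bestByModel(rows: EvalRow[], task: string): { [model: string]: number } {
  const best: { [model: string]: number } = {};
  for (let i = 0; i < rows.length; i++) {
    const r = rows[i];
    if (r.task !== task) continue;
    if (!(r.model in best) || r.score > best[r.model]) best[r.model] = r.score;
  }
  return best;
}

// Cross-checkpoint view: one model's scores on one task, ordered by training step.
function scoreOverSteps(rows: EvalRow[], model: string, task: string): EvalRow[] {
  return rows
    .filter(function (r) { return r.model === model && r.task === task; })
    .sort(function (a, b) { return a.step - b.step; });
}
```

Because every harness writes the same row shape, neither view needs harness-specific code, which is the argument for designing the schema in rather than retrofitting it.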

Browser viewers shipped 2026-06-05

| Page | Role |
| --- | --- |
| /eval-leaderboard.astro | drag-drop E0 JSONL → 3-view comparison (by step / model / task) |
| /sae-timeline.astro | drag-drop B13 SAE timeline JSONL → MSE-over-step + L0-over-step charts |

Rust performance tools shipped 2026-06-05

| Crate | Role |
| --- | --- |
| scripts/parquet-decoder/ | replaces python3 scripts/parquet_to_txt.py; static binary, no pyarrow |
| scripts/hf-downloader/ | parallel HF shard fetches with progress + retry + resume |
| scripts/humaneval-sandbox/ | E5 supporting sandbox runner (Rust + macOS sandbox-exec) |

Eval — runbook artifacts shipped 2026-06-05

| Script | Role |
| --- | --- |
| scripts/score-checkpoint.sh | one .tinygpt → E0 JSONL row(s) via lm-eval |
| scripts/score-run.sh | every history checkpoint of a run + SmolLM2 baseline → JSONL + 3-view summary |
| scripts/sae-run.sh | SAE-per-checkpoint sweep → JSONL timeline (B13 v2 input) |
| scripts/score-baselines.sh | 5 HF baselines (SmolLM2-135/360M, Qwen3-0.6B, TinyLlama, Phi-3-mini) on the same task set |

Total Tier E: ~6-8 focused days. Do E0 first (schema is everyone’s dependency), then E3 (highest harness leverage), then E1 (A1 ship-blocker), then E8 (multi-checkpoint), then the rest as nightly arcs.

Tier B — NEXT QUARTER (multi-specialist + product)

Pretrain + runtime quality (added 2026-06-04 — “good product” lens, not launch optics):

Competitor-aware additions (added 2026-06-04 — surfaced by web sweep, not Jan-2026 cutoff knowledge):

Castform-inspired training-pipeline trio (added 2026-06-13 from docs/learn/castform-rl-finetune.md):

Market-landscape positioning (added 2026-06-13 — see docs/sessions/2026-06-13-market-landscape-mac-first.md):

The competitive scan found the whole field monetizes the cost a Mac-first tool zeroes out (cloud GPU rent / trace ingestion) and is consolidating into infra + frontier-lab acquirers. Three whitespaces: Mac-first training as a product (B6 + B31), eval+interp+local fused (already shipped — the moat), and academic agent benchmarks as a local CI gate (B32). These two items reframe shipped infra as product surfaces.

External-leaderboard arc (added 2026-06-05 — first public competitive submission target):

Tier C — POLISH (mostly shipped this session)

Tier 5 — RESEARCH FRONTIER (2026 stretch goals)

Pauses the “training at 2024 fundamentals” cadence; deliverable is a paper-shaped artifact + reproducible code + a scaling-curve point, NOT a polished UX feature.

5.6 TTS toy — detailed scoping

What carries over from current TinyGPT:

| Piece | Reuse |
| --- | --- |
| Transformer decoder, KV cache, sampling, MTP heads (for K-codebook prediction) | direct |
| Training loop (tinygpt train) + PEFT bundle for downstream fine-tunes | direct |
| CrossAttention.swift (currently used for YOCO) | adapt to text-encoder K/V source for conditioning |

New code surface (~2 weeks of focused engineering + 3-7 days training):

| Piece | Effort |
| --- | --- |
| EnCodec encode/decode integration (Swift port of the HF EnCodec weights) | ~3-5 days |
| Text → conditioning surface (text encoder + cross-attention into decoder, OR text-as-prefix-tokens) | ~2-3 days |
| Audio data pipeline (LJSpeech / LibriTTS pre-tokenization to codec ids) | ~2-3 days |
| Eval (WER via Whisper transcription, MOS estimator) | ~2 days |
| First training run on LJSpeech single-speaker → intelligible speech | 2-4 days wall-clock |

Realistic outcome at this scale: the smallest published audio-token GPT (MusicGen-small) is ~300M parameters; trained from scratch on LJSpeech, expect recognizable but not natural-sounding speech. The publishable artifact has the same shape as 5.3: “smallest from-scratch ___ on consumer hardware.”

Why ordered after specialist + VL:

5.7 Specialized explainer-video model — Lamina-like track

Reference product: Lamina Labs’ Simi positions itself as an AI explainer studio: prompt or document in, whiteboard-style educational video out for students, course creators, customer training, and teams. The public lesson is not “train a giant cinematic video model”; it is “make a narrow video system that explains accurately, quickly, and consistently.”

The TinyGPT version should start as a structured explainer compiler:

source document / prompt
  -> lesson script
  -> storyboard scenes
  -> visual DSL (objects, labels, arrows, equations, timeline)
  -> deterministic renderer (SVG/canvas/Remotion/Manim-style)
  -> captions + voiceover + MP4

What we would need:

| Piece | Build | Why |
| --- | --- | --- |
| Scene/storyboard schema | JSON DSL for concepts, equations, diagrams, timings, camera/stroke actions | Gives the model a constrained target instead of free-form pixels |
| Renderer | Start with SVG/canvas frames; later Remotion/Manim export | Deterministic, debuggable, cheap to render |
| Visual-planner specialist | SFT/LoRA model: prompt/doc → storyboard DSL | This is the first “specialized video model” worth training |
| Asset/diagram library | Shapes, arrows, axes, code blocks, graph layouts, simple physics/math primitives | Explainers need reusable semantic primitives more than photorealism |
| Data pipeline | Pair open lessons/transcripts/docs with generated or human-edited storyboards | The scarce asset is supervised storyboard data |
| Eval set | Held-out concepts with rubric: factual correctness, visual grounding, pacing, label consistency, equation validity | Prevents “pretty but wrong” videos |
| Editing loop | User can regenerate one scene, lock script, lock diagrams, export MP4 | Real workflows need partial repair, not one-shot magic |
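As an illustration of the scene/storyboard schema row, a constrained target could look like the sketch below. Every type and field name here is a hypothetical assumption; no DSL has been fixed in this plan. The validator shows the kind of structural check that a deterministic renderer would want before drawing anything.

```typescript
// Hypothetical storyboard DSL; all names are illustrative assumptions.
type VisualKind = "object" | "label" | "arrow" | "equation";

interface Visual {
  id: string;
  kind: VisualKind;
  text?: string; // label text or equation source
  from?: string; // arrows only: id of the source visual
  to?: string;   // arrows only: id of the target visual
}

interface Scene {
  id: string;
  narration: string;
  durationSec: number;
  visuals: Visual[];
}

// Return human-readable problems; an empty list means the storyboard is structurally valid.
function validateStoryboard(scenes: Scene[]): string[] {
  const problems: string[] = [];
  for (let s = 0; s < scenes.length; s++) {
    const scene = scenes[s];
    if (!(scene.durationSec > 0)) {
      problems.push(scene.id + ": duration must be positive");
    }
    // Collect the ids that exist in this scene.
    const ids: { [id: string]: boolean } = {};
    for (let v = 0; v < scene.visuals.length; v++) {
      ids[scene.visuals[v].id] = true;
    }
    for (let v = 0; v < scene.visuals.length; v++) {
      const visual = scene.visuals[v];
      if (visual.kind !== "arrow") continue;
      // Arrows must connect two visuals that exist in the same scene.
      if (!visual.from || !visual.to || !ids[visual.from] || !ids[visual.to]) {
        problems.push(scene.id + ": arrow " + visual.id + " has a dangling endpoint");
      }
    }
  }
  return problems;
}
```

A schema this small is enough for step 1 of the model ladder: any text model can be asked to emit it, and invalid output is rejected before rendering.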

Model ladder:

  1. No learned video model: use a strong text model or cloud model to produce the DSL; render deterministically. This validates product and schema fast.
  2. Tiny visual-planner specialist: fine-tune tinygpt/HF-loaded base on prompt/doc → storyboard DSL. This is the first trainable model.
  3. Visual critic/evaluator: model scores whether scene frames match the script and flags bad labels, missing objects, impossible diagrams.
  4. Optional diffusion/image/video model: only for decorative assets or scene backgrounds after the deterministic explainer path works.

Good first eval tasks:

| Eval | Metric |
| --- | --- |
| Concept-to-storyboard | JSON validity + human/LLM rubric on lesson coverage |
| Equation/diagram correctness | Symbol/label exactness, graph/axis consistency |
| Script-to-scene grounding | Every narrated claim maps to an on-screen object/action |
| Pacing | Scene duration fits narration without overcrowding |
| Editability | Regenerate one scene without changing locked scenes |
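Two of these metrics, grounding and pacing, are mechanical enough to sketch. The structures below are hypothetical (a claim is assumed to list the ids of the visuals it refers to), and the 2.5 words-per-second speaking rate is an illustrative default, not a measured value.

```typescript
// Hypothetical eval helpers; the claim/visual structure and speaking rate are assumptions.
interface NarratedClaim {
  text: string;
  visualIds: string[]; // on-screen objects/actions this claim points at
}

interface EvalScene {
  durationSec: number;
  visualIds: string[];
  claims: NarratedClaim[];
}

// Script-to-scene grounding: fraction of claims that reference at least one visual in the scene.
function groundingRate(scene: EvalScene): number {
  if (scene.claims.length === 0) return 1;
  let grounded = 0;
  for (let c = 0; c < scene.claims.length; c++) {
    const refs = scene.claims[c].visualIds;
    let hit = false;
    for (let r = 0; r < refs.length; r++) {
      if (scene.visualIds.indexOf(refs[r]) >= 0) hit = true;
    }
    if (hit) grounded++;
  }
  return grounded / scene.claims.length;
}

// Pacing: does the narration fit the scene duration at an assumed speaking rate?
function narrationFits(scene: EvalScene, wordsPerSec: number = 2.5): boolean {
  let words = 0;
  for (let c = 0; c < scene.claims.length; c++) {
    const tokens = scene.claims[c].text.split(/\s+/);
    for (let t = 0; t < tokens.length; t++) {
      if (tokens[t].length > 0) words++;
    }
  }
  return words / wordsPerSec <= scene.durationSec;
}
```

A grounding rate below 1 flags narration that talks about something not on screen, which is the "pretty but wrong" failure the eval set is meant to catch.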

Why this is plausible for TinyGPT:

Why it stays behind the current specialist track:

Unshipped techniques — after applying the value-add filter (and re-auditing)

Most items in the original roadmap-categories list either ship today (third-audit corrections below) or were dropped under the user’s “don’t list a technique unless it adds genuinely new capability” filter. What’s left:

Genuinely new value-adds, not yet built

After this session’s batch closes, the non-training surface IS exhausted — every capability item under your value-add filter has shipped. Only niche residue remains:

After these: training-dependent (specialist Wave 3, Mini-Llama+ANE, Tier 5 modality arcs) or upstream-blocked (sparse MoE hard routing on scatter_add, real QLoRA on quantized-gradient flow).

Shipped this session (third → fourth audit pass corrections):

Stale ⬜ markers caught + corrected this session — now ✅:

| Item | Where it ships |
| --- | --- |
| Embedding RMSNorm | --embedding-rmsnorm flag, RMSNorm module on token-embed |
| DeepNorm | --deep-norm flag, cfg.useDeepNorm/deepNormAlpha/deepNormBeta |
| Layer-wise LR decay | cfg.lrLayerDecay |
| Cosine warmup | --lr-schedule cosine --warmup 500 (the curated default) |
| BPE-dropout | BPEDropout.swift |
| Real CI | .github/workflows/ci.yml + deploy.yml |
| Persistent tokenized cache | TokenCache.swift wired into Train+Eval+Distill+Finetune |
| Linear probes | tinygpt linear-probe (this session, 6dbe15c) |
| YOCO cross-layer KV | --yoco flag, CrossAttention.swift, docs/yoco_results.md |
| GPTQ safetensors reader | GPTQReader.swift (72 tensors quantised in 31s) |
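For readers unfamiliar with the cosine-warmup row, the schedule is commonly linear warmup followed by cosine decay. The sketch below assumes that form; only the 500-step warmup comes from the table, while `peakLr`, `minLr`, `totalSteps`, and the function name are hypothetical, and TinyGPT's exact curve may differ.

```typescript
// Assumed form: linear warmup to peakLr, then cosine decay to minLr by totalSteps.
// This is a generic sketch, not a transcription of TinyGPT's scheduler.
function cosineWarmupLr(
  step: number,
  warmupSteps: number,
  totalSteps: number,
  peakLr: number,
  minLr: number = 0
): number {
  if (step < warmupSteps) {
    // Linear ramp that reaches peakLr on the last warmup step.
    return (peakLr * (step + 1)) / warmupSteps;
  }
  const decaySteps = Math.max(1, totalSteps - warmupSteps);
  const progress = Math.min(1, (step - warmupSteps) / decaySteps);
  return minLr + 0.5 * (peakLr - minLr) * (1 + Math.cos(Math.PI * progress));
}
```

With `warmupSteps = 500`, the rate climbs linearly for the first 500 steps, sits at the peak at step 500, and reaches `minLr` at `totalSteps`.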

Dropped under value-add filter (duplicate / inferior / niche):

| Dropped | Why |
| --- | --- |
| ReLoRA | GaLore already gives “full fine-tune at LoRA memory cost” |
| Prefix tuning / soft prompts | LoRA covers the practical case |
| IPO | DPO with high β covers tiny-pair regularization |
| Token elimination | StreamingLLM + KIVI cover positional + per-entry-bits axes |
| Tree decoding | Speculative decode (vanilla + Medusa + EAGLE-2) covers the niche |
| Curriculum learning | Modest gains, scale-dependent; needs a difficulty metric we don’t have |
| Self-instruct / Evol-instruct | Magpie subsumes (uses model’s own distribution, no seed needed) |
| Hard example mining / Importance sampling | Marginal at our scale |
| Data quality filtering | PPL-filtering needs a ref model; basic dedup covers most of the value |
| BigBird / Longformer sparse attention | Only matters past ctx=8192 (we don’t train at that length) |
| Linear attention (Performer / Linformer / Reformer) | Quality usually worse than flash attention |
| Hybrid attention/SSM (Jamba, Samba) | Different family; side-project |
| Pre-norm vs post-norm toggle | Config knob, not a feature |
| Tiktoken adoption | swift-transformers handles BPE-family tokenizers already |
| Subword regularization | Marginal vs BPE-dropout |
| Train own BPE on corpus | Modest gain (~5% PPL); blocked on Rust-FFI for speed |
| TinyGPT-as-library API | User explicitly deferred until specialists beat a baseline |

Queued findings — ANE routing + Mac-vs-browser sampling

Triggered by the question “how do we get to 170× instead of 17×?” The 17.2× number is at Huge training — small bandwidth-bound model where kernel-launch overhead dominates. Several legitimate paths to a much larger ratio; each is queued with its honest cost.

1. Browser sampling tok/s harness — CHEAP, ~30 min

Closes a real missing measurement. We have Mac sampling tok/s (293-696 by model size) but no analogous browser-side number. The playground worker generates via GpuModel.generate already; we just don’t time it.

What: in browser/src/worker.ts, log per-token wall-clock in the generate loop, post a sampling_perf message, display tok/s next to the playground output.

Expected ratio: Mac-vs-browser sampling probably 30-80× at Huge based on shape priors (Mac is much less kernel-launch-overhead sensitive during decode than during training). That alone changes the headline from “17× training” to “30-80× sampling, 17× training.”

Why queued: tiny work, just hasn’t been done. No blockers.
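The timing described in the "What" step above could be sketched as follows. Only the `sampling_perf` message name comes from the text; the `nextToken` callback (standing in for one decode step of GpuModel.generate), the `post` callback (standing in for the worker's postMessage), and the message fields are hypothetical. The clock is injectable so the logic can be exercised outside a browser.

```typescript
// Hypothetical message shape; the real worker protocol may differ.
interface SamplingPerfMessage {
  type: "sampling_perf";
  tokens: number;
  elapsedMs: number;
  tokPerSec: number;
  perTokenMs: number[];
}

// Time a generate loop token by token and report tokens/second.
// `now` defaults to Date.now; performance.now() would give finer resolution in a real worker.
function timedGenerate(
  nextToken: () => number,
  maxTokens: number,
  post: (msg: SamplingPerfMessage) => void,
  now: () => number = Date.now
): number[] {
  const out: number[] = [];
  const perTokenMs: number[] = [];
  let elapsedMs = 0;
  for (let i = 0; i < maxTokens; i++) {
    const t0 = now();
    out.push(nextToken());
    const dt = now() - t0;
    perTokenMs.push(dt);
    elapsedMs += dt;
  }
  post({
    type: "sampling_perf",
    tokens: out.length,
    elapsedMs: elapsedMs,
    // Guard against a zero-length interval on coarse clocks.
    tokPerSec: elapsedMs > 0 ? (out.length * 1000) / elapsedMs : 0,
    perTokenMs: perTokenMs
  });
  return out;
}
```

Keeping the per-token samples, not just the total, makes it possible to separate the first token (prefill-heavy) from steady-state decode when comparing against the Mac numbers.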

2. ANE-routed inference via Mini-Llama TinyGPT — MEDIUM, 1-2 weeks

Apple Neural Engine routes only when the graph hits its preferred shapes. The published numbers (ANEMLL, perf-quest memory) are 2-3× sampling over the same model on Apple GPU when ANE engages cleanly, not 100×+ end-to-end. The big win is the combined ratio: bigger ANE-friendly model × ANE-routing × already-unfit-for-browser size.

Why TinyGPT doesn’t route today

ANE prefers head_dim ∈ {64, 128}, tensor dims multiples of 64, fp16, RoPE-style attention, bias-free linears, RMSNorm. Our Huge default is the opposite of all of these:

| Dimension | TinyGPT Huge | Llama 3.1 8B | ANE impact |
| --- | --- | --- | --- |
| head_dim | 32 | 128 | falls off ANE matrix engine |
| d_model | 256 | 4,096 | tiny matmuls under-utilize ANE tiles |
| vocab | 256 (byte) | 128,256 (BPE) | LM-head matmul too small to matter |
| Norm | LayerNorm | RMSNorm | RMSNorm has better ANE op coverage |
| Positional | learned absolute | RoPE | ANE’s fused-attention paths assume RoPE |
| MLP activation | GELU | SwiGLU | SwiGLU is the ANE-tuned default |
| Linear bias | yes | no | bias-free fuses cleaner into matmul-add |
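The preference list above can be read as a checklist. The sketch below encodes it as a predicate; the `ArchConfig` shape is hypothetical (it is not TinyGPT's ModelConfig), and passing the check says nothing about whether CoreML will actually place ops on the ANE, which only profiling can confirm.

```typescript
// Hypothetical config shape for illustration; not TinyGPT's actual ModelConfig.
interface ArchConfig {
  dModel: number;
  nHeads: number;
  vocabSize: number;
  useRoPE: boolean;
  useRMSNorm: boolean;
  linearBias: boolean;
  fp16: boolean;
}

// Return the reasons a config misses the ANE-preferred shapes listed above (empty = no known blockers).
function aneBlockers(cfg: ArchConfig): string[] {
  const reasons: string[] = [];
  const headDim = cfg.dModel / cfg.nHeads;
  if (headDim !== 64 && headDim !== 128) reasons.push("head_dim " + headDim + " not in {64, 128}");
  if (cfg.dModel % 64 !== 0) reasons.push("d_model not a multiple of 64");
  if (cfg.vocabSize % 64 !== 0) reasons.push("vocab not a multiple of 64");
  if (!cfg.fp16) reasons.push("not fp16");
  if (!cfg.useRoPE) reasons.push("no RoPE");
  if (!cfg.useRMSNorm) reasons.push("LayerNorm instead of RMSNorm");
  if (cfg.linearBias) reasons.push("linear layers carry bias");
  return reasons;
}
```

Fed the Huge column of the table (d_model 256 with head_dim 32, learned positions, LayerNorm, biased linears), this returns several blockers at once, which is why a new preset is cheaper than patching Huge.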

What to build

A new ModelConfig preset — mini-llama — using only existing config flags (every one of the above is already a knob):

ModelConfig(
    vocabSize: 32768,        // small BPE, multiple of 64
    contextLength: 2048,
    nLayers: 24,
    nHeads: 16,              // head_dim = 128
    nKvHeads: 4,             // GQA
    dModel: 2048,
    dMlp: 8192,
    useRoPE: true,
    useRMSNorm: true,
    useSwiGLU: true,
    tieEmbeddings: false
)
// ~600M params; scale down to (1280, 16) for ~200M first cut
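A back-of-envelope parameter count is a useful sanity check on the preset before committing training time. The sketch below assumes standard Llama-style accounting (bias-free Q/K/V/O with GQA, a three-matrix SwiGLU MLP of width `dMlp`, two RMSNorms per layer, untied embeddings); TinyGPT's own counter may define `dMlp` or the MLP differently, so treat the result as an estimate under those assumptions.

```typescript
// Parameter estimate under assumed Llama-style accounting; not TinyGPT's own counter.
interface CountConfig {
  vocabSize: number;
  nLayers: number;
  nHeads: number;
  nKvHeads: number;
  dModel: number;
  dMlp: number;
  tieEmbeddings: boolean;
}

function estimateParams(c: CountConfig): number {
  const headDim = c.dModel / c.nHeads;
  const kvDim = c.nKvHeads * headDim;
  const attn = 2 * c.dModel * c.dModel + 2 * c.dModel * kvDim; // Q and O, plus K and V (GQA)
  const mlp = 3 * c.dModel * c.dMlp;                           // gate, up, down
  const norms = 2 * c.dModel;                                  // two RMSNorm weights per layer
  const embed = c.vocabSize * c.dModel;
  const lmHead = c.tieEmbeddings ? 0 : embed;
  return c.nLayers * (attn + mlp + norms) + embed + lmHead + c.dModel; // + final norm
}

const miniLlama: CountConfig = {
  vocabSize: 32768, nLayers: 24, nHeads: 16, nKvHeads: 4,
  dModel: 2048, dMlp: 8192, tieEmbeddings: false
};
const totalParams = estimateParams(miniLlama); // 1,593,935,872 under these assumptions
```

Under these assumptions the preset as written comes out near 1.6B rather than ~600M, with the MLP term dominating. If ~600M is the intended size, the layer count or MLP width would need to shrink, so it is worth re-deriving the figure with TinyGPT's own parameter counter before training.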

Plus tinygpt to-coreml exporter (~1-2 days): maps our transformer ops to CoreML’s op set, produces a .mlpackage that Instruments can profile to see whether ANE actually engages.

Realistic speedup expectations

| Path | Realistic tok/s |
| --- | --- |
| Current Huge on Mac GPU | 293-696 |
| Mini-Llama (~600M) on Mac GPU | ~150-400 |
| Mini-Llama on Mac ANE if it routes | ~400-1200 (~2-3× over its own GPU) |
| Mini-Llama in browser | ~5-20 (probably can’t load; 600M near browser ceiling) |
| Mac-ANE vs browser ratio | 30-200× depending on routing cleanliness |

Probability analysis

Test 1 (ANEMLL works on Llama 3.1 on your machine) → confirms the environment but NOT that our model routes. Independent reasons it could still fail:

ANEMLL on Llama works?
├─ No  → done, environment broken
└─ Yes → environment confirmed
         └─ Build tinygpt to-coreml exporter
            └─ Convert + profile Mini-Llama
               ├─ All ops on ANE     → 🎉 ~30-50% chance, you win
               ├─ Partial split      → 🟡 ~40% chance, measure if net speedup
               └─ Nothing on ANE     → 😐 ~10-20% chance, GPU is the ceiling

Cost-benefit (honest)

| Item | Cost | Outcome regardless of ANE result |
| --- | --- | --- |
| Train Mini-Llama (200-600M) | 3-7 days mostly-background | Real Llama-architecture gallery model. Useful independently. |
| tinygpt to-coreml exporter | 1-2 days focused | Reusable for any future model. Useful independently. |
| Profile + iterate | 1-3 days unpredictable | Empirical learning either way. |

Total: 1-2 weeks calendar; dominated by training wall-clock.

Why queued

Apple’s actual ANE landscape (for posterity)

There is no Apple-sanctioned “private beta” for ANE inference; that phrasing was loose. The real options are the three above.


4. Research absorbed — paper × verdict

External-paper catalogue (was docs/roadmap/recent_research.md, now archived at docs/archive/recent_research.md). Each row: technique → one-line source → verdict pointing at where it lives in this codebase, or why it doesn’t.

4.1 Implemented (techniques we ship)

Alignment / preference

| Technique | Source | Where it lives |
| --- | --- | --- |
| DPO | Rafailov et al., NeurIPS 2023 | tinygpt dpo |
| KTO | Ethayarajh et al., 2024 | tinygpt dpo --variant kto |
| ORPO | Hong et al., 2024 | tinygpt dpo --variant orpo |
| SimPO | Meng et al., 2024 | tinygpt dpo --variant simpo |
| NEFTune | Jain et al., NeurIPS 2023 | --neftune |

PEFT

All in native-mac/Sources/TinyGPTModel/PeftVariants.swift, surfaced via tinygpt sft.

| Technique | Source | Where it lives |
| --- | --- | --- |
| DoRA | Liu et al., 2024 | default in sft |
| GaLore | Zhao et al., 2024 | Optimizers.swift |
| LoftQ | Li et al., ICLR 2024 | PeftVariants.swift |
| VeRA | Kopiczko et al., ICLR 2024 | PeftVariants.swift |
| PISSA | Meng et al., 2024 | PeftVariants.swift |
| LoRA+ | Hayou et al., ICML 2024 | PeftVariants.swift |
| rsLoRA | Kalajdzievski, 2023 | PeftVariants.swift |

Quantization

| Technique | Source | Where it lives |
| --- | --- | --- |
| GPTQ | Frantar et al., ICLR 2023 | tinygpt gptq + GPTQReader.swift |
| AWQ | Lin et al., MLSys 2024 | AWQ safetensors reader |
| HQQ | Badri & Shaji, 2024 | tinygpt hqq |
| KIVI | Liu et al., 2024 | KV cache quantization path |

Inference / efficiency

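| Technique | Source | Where it lives |
| --- | --- | --- |
| Speculative decoding | Leviathan et al., ICML 2023 | tinygpt train-heads --type medusa\|eagle + decode loop |
| Medusa | Cai et al., 2024 | same path, head type |
| EAGLE-2 | Li et al., 2024 | same path, head type |
| StreamingLLM | Xiao et al., ICLR 2024 | attention-sink path |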
StreamingLLMXiao et al., ICLR 2024attention-sink path

Architecture variants

| Technique | Source | Where it lives |
| --- | --- | --- |
| MTP | Gloeckle et al., ICML 2024 | Train.swift, docs/mtp.md |
| Differential Transformer | Microsoft 2024 | DifferentialAttention.swift, --diff-attn |
| Mixture of Depths | Raposo et al., 2024 | soft sigmoid gate (hard top-K upstream-blocked) |
| LASER | Sharma et al., ICLR 2024 | tinygpt laser |

Optimizers

| Technique | Source | Where it lives |
| --- | --- | --- |
| Sophia | Liu et al., 2023 | Optimizers.swift |
| Lion | Chen et al., NeurIPS 2023 | Optimizers.swift |
| Muon | Jordan, 2024 | Optimizers.swift |
| GaLore | (see PEFT) | Optimizers.swift |

Distillation

| Technique | Source | Where it lives |
| --- | --- | --- |
| Soft-targets distillation | Hinton et al., 2015 | tinygpt distill |

Synthetic data

| Technique | Source | Where it lives |
| --- | --- | --- |
| Magpie | Xu et al., ICLR 2025 | tinygpt magpie |
| TinyStories | Eldan & Li, 2023 | dataset source |

Test-time compute

| Technique | Source | Where it lives |
| --- | --- | --- |
| Best-of-N | Snell et al., 2024 | tinygpt bon --scan |

Evolution Strategies

| Technique | Source | Where it lives |
| --- | --- | --- |
| ES at scale | Qiu et al., Sept 2025 | tinygpt es, docs/evolution_strategies.md |

4.2 Cannot — blocked, parked, or skipped

🚧 Blocked by hardware

| Technique | Source | Why parked |
| --- | --- | --- |
| BitNet b1.58 | Ma et al., 2024 | Ternary from-scratch needs 100B+ tokens to validate; not differentiating at <1B params on our hardware. Park; revisit if a clear gallery-model use case appears. |
| FP4 training (NVFP4 / Quartet) | Wang Jan 2025 · Quartet II Jan 2026 | Apple M-series has no native FP4 ops |
| FP8 training | | Needs H100 / Blackwell |

🚧 Blocked upstream

| Technique | Source | Why parked |
| --- | --- | --- |
| Hard sparse MoE routing | DeepSeek-V3 family | MLX-Swift no scatter_add; soft (dense) routing ships |
| Real QLoRA | Dettmers et al., 2023 | MLX-Swift quantized arrays don’t autograd through; manual fake-quant shipped (pedagogical, no memory win) |

❌ Skipped — different family / not worth the seat

| Technique | Source | Why skipped |
| --- | --- | --- |
| Mamba / Mamba-2 | Gu & Dao, 2023/2024 | Linear-time SSM, different family; better as side-project |

❌ Dropped — value-add filter (subsumed by what ships)

| Technique | Source | Subsumed by |
| --- | --- | --- |
| IPO | Azar et al., 2023 | DPO with high β regularizes equivalently |
| CPO | Xu et al., 2024 | DPO + BC term marginal over SimPO at our scale |
| Self-Instruct | Wang et al., 2023 | Magpie (model’s own distribution; no seed needed) |
| Evol-Instruct | Xu et al., 2024 (WizardLM) | Magpie subsumes |
| MiniPLM | Gu et al., NeurIPS 2024 | Distill-for-pretraining; needs a teacher-student pair we don’t have |
| Distillation with Training Wheels | Feb 2025 | cloud-escalate already provides the analogous “student asks teacher” deployment shape |
| DEITA | Liu et al., 2024 | Instruction-data quality framework; only matters once SFT corpus > 1M samples |

4.3 Planned — queued for a future training run

| Item | Source | Where in §3 |
| --- | --- | --- |
| GRPO / DAPO (RLVR pipeline) | DeepSeek-R1, Jan 2025 · DAPO, March 2025 | Tier 5 §5.1: Reasoning training on a 22M model. GRPO = mental model; DAPO = implementation. |
| Reasoning-trace distillation | DeepSeek-R1-Distill series, OpenThoughts | Tier 5 §5.1: SFT-on-traces is the first half of §5.1 before RLVR |
| Snell test-time-compute scaling experiment | Snell et al., 2024 | Tier 5 §5.2: bon shipped; the scaling-curve experiment at 22M matches Snell methodology |
| Vision-language toy | LLaVA family | Tier 5 §5.3 |
| Diffusion LM micro | (multiple) | Tier 5 §5.4 |
| Real sparse MoE kernels | DeepSeek-V3 style | Tier 5 §5.5 (also upstream-blocked on scatter_add) |
| TTS toy | VALL-E / MusicGen family | Tier 5 §5.6 |

Small additions, no current owner — append when a slot opens:

| Item | Source | Effort |
| --- | --- | --- |
| LISA optimizer | Pan et al., 2024 | ~1 day; layerwise importance sampling, drop-in alongside Sophia/Muon |
| MiniLLM KL variants | Gu et al., ICLR 2024 | ~1-2 days; reverse-KL / skew-KL switches on top of existing tinygpt distill |
| Distilling Step-by-Step | Hsieh et al., ACL 2023 | ~1-2 days; rationale-distillation recipe on top of tinygpt distill |
| DoReMi data-mixture optimization | Xie et al., NeurIPS 2023 | Park until ≥3 distinct domains are mixed at non-trivial scale |
| Quality classifier (FineWeb-Edu-style) | Penedo et al., 2024 (FineWeb / FineWeb-Edu) | §3 B10: ~2 days; tiny fastText scorer + top-X% filter |
| WSD schedule (warmup-stable-decay) | MiniCPM, Hu et al., 2024 · SmolLM blog | §3 B11: ~half-day; decay phase doubles as annealing |
| Interp-on-checkpoints methodology | Pythia, Biderman et al., 2023 · OLMo, Groeneveld et al., 2024 | §3 B13: 1-2 days infra + ongoing analysis; replay SAE / MEMIT across the checkpoint timeline |
| Speculative decoding | Leviathan et al., ICML 2023 · Chen et al., 2023 | §3 B14: 2-3 days; Mini-Llama draft for Mega; numerics gate required |
| Layer-wise LR decay (SFT) | ULMFiT, Howard & Ruder, 2018 | §3 B15: ~half-day flag add on existing optimizer |
| M5 GPU Neural Accelerator prefill benchmark | Apple ML Research, 2026 | §3 B16: ~half-day; verify the claimed 3.5× M5-vs-M4 prefill speedup is materializing on our path |
| SAE Lens interop / Neuronpedia format export | decoderesearch/SAELens | §3 B17: ~2 days for format-export option; compare-and-decide before building |
| nanochat-style --depth single-knob HP derivation | karpathy/nanochat | §3 B18: ~1 day; one knob auto-derives width / heads / LR / batch / steps; UX win |
| Group-SAE (layer-group SAE training) | Wang et al., 2024 | §3 B19: 2-3 days; trains SAEs once per layer-group instead of per-layer; cuts SAE training cost |
| Learnable cross-stream attention (modded-nanogpt speedrun trick) | KellerJordan/modded-nanogpt | §3 B20: read-and-evaluate; speedrun-specific, not yet a paper |
| ScaleDown extractive context compression SLM | ScaleDown blog · Challenge leaderboard · scaledown.ai | §3 B25: 3-5 days; token-level relevance head + sentence aggregation; submit to public leaderboard as a “specialist trained on a Mac” proof-point |
| Micro-AutoMixer for specialist data mixes | Poolside Laguna deep dive · RegMix/DoReMi-style mixture search | §3 B21: small proxy-run version of Poolside’s automixing; optimize specialist ratios before full training |
| Token-preserving agent trajectory recorder | Poolside Laguna deep dive | §3 B22: preserve token IDs through rollout → training so agent traces cannot drift through retokenization |
| Agent eval protocol hardening | Poolside Laguna deep dive | §3 B23: repeated pass@1, fixed step/resource/sampling budgets, and explicit infra-patch notes |
| Muon large-scale re-benchmark | Poolside Laguna deep dive · Jordan, 2024 | §3 B24: only revisit if large/proxy matmul-dominated runs amortize Newton-Schulz overhead |

4.4 Reference reads (no verdict — context only)

For mental-model framing, not techniques to implement:

2026 small-model peers (for positioning, not adoption): SmolLM3-3B · Qwen3.5-0.8B · Phi-4-mini-instruct · Gemma-3n-E2B-IT · Gemma-4-12B Unified (encoder-free multimodal, 256K ctx, MLX variants exist). Implication: the niche is “browser-trainable + every byte of training code is here,” not “perf-competitive with Phi-4.”

Direct from-scratch peers (full pipeline, not just pretrain):

Tools worth knowing:

Apple Silicon ecosystem (direct peers on our platform):

Interpretability ecosystem (overlap with our interp lab):

Proprietary / out of scope: OpenAI o1 / o3 (closed-weights; reframed the field around test-time compute, no adoptable artifact). DeepSeek-V3 (671B-MoE, scale-blocked; informs MTP + MoE design). Qwen3 (model family, not a technique).

4.5 Coverage cutoff

The catalogue was hand-curated up to assistant knowledge cutoff January 2026 plus best-effort web-search additions for Feb-May 2026 (coverage spottier there). Today is 2026-06-04.

2026-06-04 web sweep folded in — five surfaces were checked (Apple Silicon training, nanoGPT successors, Mac inference runtimes, interpretability libraries, fine-tune frameworks). Results: nanochat

Future papers append row-by-row into §4.1 / §4.2 / §4.3.


Appendix — index of source docs absorbed by this file

This doc replaces the multi-file roadmap split. The source docs are kept for context but should be treated as historical; edit this file, not them.

| Old doc | What it covered | Status |
| --- | --- | --- |
| docs/roadmap/index.md | TOC for the multi-file split | Superseded; point at this file |
| docs/roadmap/tier1.md / tier2.md / tier3.md | ROI-tiered technique inventory | Absorbed; markers refreshed |
| docs/roadmap/tier4_skip.md | Intentionally-not-built items | Absorbed into §2 |
| docs/roadmap/tier5_frontier_2026.md | 2026 research frontier | Absorbed into §3 Tier 5 |
| docs/roadmap/categories.md | Orthogonal technique taxonomy (had stale markers) | Absorbed; refreshed against code |
| docs/roadmap/blockers.md | What we can’t build + Phase 9/10 status appendix | Absorbed into §2 + §1 |
| docs/roadmap/phased_plan.md | 7-week sequential plan | Mostly shipped; remainder in §3 |
| docs/roadmap/recommended_order.md | Top-10 next | Superseded by Tier A/B ordering in §3 |
| docs/roadmap/honest_summary.md | “CAN / CAN’T / SHOULDN’T” framing | Absorbed |
| docs/progress.md | Mac+Web shipped dashboard | Absorbed into §1 |
| docs/backlog.md | ROI-ordered “what’s left” (Tier A/B/C/D) | Absorbed into §3 |
| docs/feature_audit_2026_05_31.md | CLI smoke audit | Cross-referenced; was the verification baseline |
| docs/roadmap/recent_research.md | Paper catalogue (2024-2026) | Absorbed into §4; archived at docs/archive/recent_research.md |

Still canonical (deep dives, not absorbed): docs/roadmap/datasets.md, docs/roadmap/north_star_refined.md, and the per-technique docs (distillation.md, interpretability.md, moe.md, mtp.md, lora_guide.md, precision.md, memory_tradeoffs.md, perf_quest.md, decision_log.md). Those don’t duplicate planning — they explain how shipped pieces work.