TinyGPT — master plan

Last verified against codebase: 2026-06-06 (eval-pipeline + serve fix + elf PRDs landed; product framing clarified to “Mac platform for building/upgrading specialists”)
Sources merged: docs/roadmap/* · docs/progress.md · docs/backlog.md · docs/feature_audit_2026_05_31.md · docs/roadmap/recent_research.md (paper catalogue → §4)

Product framing (clarified 2026-06-06): TinyGPT is a Mac platform for individuals to build and upgrade specialist models for their specific tasks — bring data, pick a local teacher, ship a fast/cheap specialist. Distillation + LoRA + QLoRA + constrained decoding are the toolkit. Local teacher = no API spend. Comprehensive multimodal roadmap (text/code/vision/voice/image-gen) under disciplined “one canonical best per slot” principle. Canonical strategy doc: docs/sessions/2026-06-06-mac-specialist-platform.md — covers Tier 1-4 backlog, multi-model architectures (phone-a-friend / cascade / LoRA hot-swap / etc.), structured-output formats beyond JSON (incl. Protobuf / SQL / GraphQL via grammar), and flagship example apps (browser agent, per-language code specialist, voice command, etc.).

Three sections: shipped, skipped, TODO. Every claim verified against the code. The first audit caught Lion/Sophia/Muon/PEFT-bundle/gradient-clipping; the second caught YOCO + GPTQ-reader + token-elim (dropped under value-add filter); the third caught embedding RMSNorm, cosine warmup, layer-wise LR decay, DeepNorm, BPE-dropout, and Real CI, all shipped and all previously marked ⬜.

Status legend

Mark	Meaning
✅	shipped — verified against code today
🟡	partial / in-session-only / verified-with-caveat
⬜	TODO — in active backlog
⏸	deferred — would build but waiting on external trigger
❌	skipped — intentionally not built (better alternative exists)
🚧	blocked — would build but cannot right now (hardware / upstream / budget)

1. SHIPPED

Mac runtime + CLI

Audit baseline: every CLI smoke-tested on M5 Pro 2026-05-31. See feature_audit_2026_05_31.md for the full smoke trace. 30+ subcommands all green.

✅ Cold-start bundle (mmap + lazy embed + async load + compile cache) — 24 ms in-process TTFT on 1B
✅ KV cache (GQA + in-place + persistent across sessions)
✅ Pausable training (cooperative SIGINT + atomic save + --resume)
✅ Cross-process GPU lock (~/.cache/tinygpt/gpu.lock)
✅ CF R2 cloud save/load pipeline (push / pull / list / setup; zero egress)
✅ tinygpt serve — OpenAI + Ollama surfaces on the same socket
✅ tinygpt agent — multi-turn + tool dispatch + persistent KV + --cloud-escalate
✅ JSON-mode constrained generation (FSM token masking)
✅ Cloud API client (Anthropic + OpenAI via curl) + SSE streaming + cancellation
✅ Continue.dev / Ollama-compat provider (/api/tags, /api/version, /api/show, /api/chat, /api/generate)
✅ tinygpt escalate (direct cloud-API call)
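
The JSON-mode item above relies on FSM token masking: at each step, tokens the grammar does not allow are masked out before sampling. A minimal sketch of that idea, assuming a toy character-level vocabulary and a hypothetical hand-written FSM (VOCAB, FSM, and both function names are illustrative and not taken from the TinyGPT codebase):

```python
import math

# Toy character-level vocabulary (assumption: real tokenizers emit multi-character tokens).
VOCAB = ['{', '}', '"', ':', 'a', 'b', '0', '1', 'x']

# Hypothetical FSM accepting strings shaped like {"<a|b>+":<0|1>+}.
# Maps state -> {allowed token: next state}.
FSM = {
    "start":    {'{': "open"},
    "open":     {'"': "key"},
    "key":      {'a': "key_body", 'b': "key_body"},
    "key_body": {'a': "key_body", 'b': "key_body", '"': "colon"},
    "colon":    {':': "value"},
    "value":    {'0': "digits", '1': "digits"},
    "digits":   {'0': "digits", '1': "digits", '}': "done"},
}


def mask_logits(logits, state):
    """Set the logit of every token the FSM forbids in `state` to -inf."""
    allowed = FSM[state]
    return [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]


def constrained_greedy(logits_fn, max_steps=32):
    """Greedy decode where each step only considers FSM-legal tokens."""
    state, out = "start", []
    for _ in range(max_steps):
        if state == "done":
            break
        masked = mask_logits(logits_fn(out), state)
        idx = max(range(len(VOCAB)), key=lambda i: masked[i])
        tok = VOCAB[idx]
        out.append(tok)
        state = FSM[state][tok]
    return "".join(out), state


def biased_model(prefix):
    """Stand-in model that always prefers the illegal token 'x'."""
    return [0.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.2, 0.3, 5.0]


text, final_state = constrained_greedy(biased_model)
print(text, final_state)
```

Even though the stand-in model ranks the illegal token highest at every step, the mask forces a well-formed object, which is the property the constrained-generation path is meant to guarantee.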

Mac training + post-training

✅ Pretrain (tinygpt train): 42 ms/step on the Huge preset on M5 Pro, 17.2× faster than the browser path
✅ Finetune (tinygpt finetune)
✅ SFT (tinygpt sft) — DoRA default + every PEFT variant
✅ DPO / SimPO / KTO / ORPO (all in tinygpt dpo via flags)
✅ Knowledge distillation (tinygpt distill) — KL teacher → student
✅ Speculative-decoding head training (tinygpt train-heads --type medusa|eagle)
✅ Evolution Strategies trainer (tinygpt es)
✅ Tuned-lens trainer (tinygpt tuned-lens)
✅ Mini-router trainer (tinygpt train-extractor)
✅ Magpie synthetic-instruction generator (tinygpt magpie)
✅ Sequence packing for SFT
✅ NEFTune (noisy embeddings) — --neftune-alpha in sft + dpo (matches the paper’s “Noisy Embeddings Improve Instruction Finetuning” scope; not in the pretrain path)
✅ Gradient clipping (--grad-clip F, default 1.0, on train + sft + dpo)
✅ z-loss auxiliary (--z-loss-weight F)
✅ Embedding tying (tieEmbeddings config flag)
✅ Document-level shuffling (implicit via batch sampler)
✅ Gradient checkpointing (CustomFunction VJP workaround for missing mlx_checkpoint)
✅ QAT (in-training, --qat)
✅ Persistent tokenized cache (TokenCache.swift; wired into Train + Eval + Distill + Finetune)

Training stability (verified 2026-06-02 — these were all marked ⬜ in older docs)

✅ Embedding RMSNorm (--embedding-rmsnorm / cfg.useEmbeddingRMSNorm)
✅ DeepNorm residual scaling (--deep-norm / cfg.useDeepNorm + deepNormAlpha/Beta)
✅ Layer-wise LR decay (cfg.lrLayerDecay)
✅ Cosine warmup (--lr-schedule cosine --warmup 500 — the curated default)
✅ BPE-dropout (BPEDropout.swift — per-merge skip during encoding for regularization)
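
The cosine-warmup item above can be sketched as a pure function of the step. This is a minimal illustration, assuming linear warmup followed by a half-cosine decay to a floor; the function name, the min_lr parameter, and the exact warmup formula are assumptions, not the behaviour of --lr-schedule cosine as implemented.

```python
import math


def cosine_with_warmup(step, total_steps, peak_lr, warmup_steps=500, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 499, 500, 750, 1000):
    print(s, round(cosine_with_warmup(s, total_steps=1000, peak_lr=3e-4), 8))
```

The warmup length of 500 mirrors the --warmup 500 default quoted above; every other number in the usage loop is a placeholder.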

PEFT bundle

All in native-mac/Sources/TinyGPTModel/PeftVariants.swift, all gated through tinygpt sft:

✅ LoRA · Multi-LoRA composition · LoRA+ (different LR for A/B)
🟡 DoRA (in-session only; on-disk format pending, see Tier C)
✅ VeRA · LoftQ · AdaLoRA · RsLoRA · PISSA · LoRA-FA · LayerDrop

Inference + sampling

✅ KV cache + flash-attention forward (MLXFast.scaledDotProductAttention) + backward
✅ Quantized inference (int4 / int8 via MLXNN.quantize)
✅ Speculative decoding (vanilla + Medusa + EAGLE-2 heads)
✅ Prefix / prompt caching
✅ Streaming-LLM attention sink
✅ KV cache quantization (KIVI)
✅ Multi-Token Prediction (MTP) inference path
✅ Multi-Query Attention (free via nKvHeads: 1)
✅ Sliding window attention (--sliding-window N)
✅ ALiBi position bias (--alibi)
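
The vanilla speculative-decoding item above follows a draft-then-verify loop. A minimal greedy sketch, assuming both models are plain functions from a token list to the next token (target_next, draft_next, and the toy models are hypothetical stand-ins, and a real implementation would verify all drafted positions in one batched target forward pass rather than a loop):

```python
def greedy_decode(target_next, prompt, n_new):
    """Reference: plain greedy decoding with the target model only."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_greedy(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (sequence, verification rounds)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2. The target checks the proposal position by position.
        rounds += 1
        accepted = []
        for tok in proposal:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction replaces the miss
                break
        else:
            # Every drafted token matched, so the target contributes a bonus token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], rounds


def toy_target(seq):
    """Toy target: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq):
    """Toy draft: agrees with the target except right after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


out, rounds = speculative_greedy(toy_target, toy_draft, [2], n_new=12)
assert out == greedy_decode(toy_target, [2], 12)
print(out, rounds)
```

The assertion captures the design point: under greedy acceptance the output is identical to target-only decoding, and the draft only changes how many target rounds are needed.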

Quantization + compression

✅ HQQ (tinygpt hqq — int4 q-then-dq, 0.087 rel error)
✅ GPTQ (tinygpt gptq — from-scratch int4 quant of own model, 0.102 rel error)
✅ AWQ safetensors reader (loads HF AWQ-quantized models)
✅ GPTQ safetensors reader (GPTQReader.swift — loads HF GPTQ-format models; tested 72 tensors quantised in 31s)
✅ GGUF reader (GGUFReader.swift + tinygpt gguf-inspect) — parses v2/v3 header + metadata + tensor inventory; dequantises F32 / F16 / Q4_0 / Q8_0 tensors to fp32. K-quants (Q4_K / Q6_K / etc.) slot into the same switch when needed.
✅ SmoothQuant (in-training)
✅ Pruning — unstructured (tinygpt prune-unstructured) + structured (tinygpt prune-structured)
✅ LASER selective rank reduction (tinygpt laser)

Optimizers

✅ AdamW · Lion · Sophia · Muon · Adafactor (all in Optimizers.swift)
✅ GaLore (gradient low-rank projection)

Architecture variants

✅ Standard transformer (RoPE + RMSNorm + SwiGLU + GQA)
✅ Sliding window · ALiBi · Multi-Token Prediction · MQA · GQA
✅ MoE (dense routing — sparse hard routing blocked, see §2)
✅ Mixture of Depths (soft sigmoid gate — hard routing blocked, see §2)
✅ Differential attention (--diff-attn)
✅ YOCO cross-layer KV sharing (--yoco) — CrossAttention.swift module, second-half blocks reuse first-half K/V. See docs/yoco_results.md. (Was marked “designed only” in older audit — actually shipped.)

Tokenization

✅ Byte-level (vocab=256) — from-scratch path
✅ HF BPE / SentencePiece via swift-transformers

Interpretability tools (browser playground)

✅ Logit lens (button + worker route)
✅ Tuned lens (Mac CLI trainer + .lenses sidecar + browser upload)
✅ Attention heatmap (“Watch the model think” panel)
✅ Per-layer ablation (“Ablate & sample” button)
✅ Activation patching — both variants (zero + donor-swap, shipped 2026-06-02 in 17021bc)
✅ Linear probes (tinygpt linear-probe) — train Linear(d_model → C) on per-layer hidden states + label data; .lp sidecar format. Detects whether a layer represents an arbitrary external property (Alain & Bengio 2016).
✅ ROME — surgical fact editing (tinygpt rome). Rank-1 update to one MLP’s W_out, identity-Hessian first cut. Verified on shakespeare.tinygpt: --target X --layer 11 --scale 10 flipped sampled next-token to X. Covariance-based ROME is the follow-up.
✅ MEMIT — batched fact editing (tinygpt memit). Rank-K least-squares ΔW = R(KᵀK + λI)⁻¹Kᵀ via hand-rolled Gauss-Jordan inverse on the small N×N system. Verified math: per-fact residual ~1e-4 at scale=1 (machine noise — least-squares is exact). Single-layer visibility-in-sampling tradeoff documented; multi-layer MEMIT (distribute update across 5-7 mid-network layers) is the next-cut.
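
The MEMIT formula above, ΔW = R(KᵀK + λI)⁻¹Kᵀ, can be checked on a tiny example. This is a pure-Python sketch, assuming K stores one key per column (d_in × N) and R one residual per column (d_out × N); the helper names are illustrative, and the Gauss-Jordan routine is a generic textbook version rather than the one in the codebase.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def gauss_jordan_inverse(A):
    """Invert a small square matrix with partial pivoting."""
    n = len(A)
    # Augment A with the identity: [A | I] is reduced to [I | A^-1].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def memit_delta(K, R, lam=1e-6):
    """Rank-N update dW such that dW @ K is approximately R."""
    Kt = transpose(K)
    gram = matmul(Kt, K)              # N x N, small when few facts are edited
    for i in range(len(gram)):
        gram[i][i] += lam             # ridge term keeps the system invertible
    return matmul(matmul(R, gauss_jordan_inverse(gram)), Kt)


K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # two keys in a 3-dim space
R = [[1.0, 2.0], [3.0, 4.0]]               # desired output change per key
dW = memit_delta(K, R)
residual = max(abs(o - r) for orow, rrow in zip(matmul(dW, K), R)
               for o, r in zip(orow, rrow))
print(residual)
```

Only the N×N Gram matrix is inverted, which is why a hand-rolled inverse stays cheap when the number of edited facts is small.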

Browser / Web track

✅ Landing page + /playground route
✅ WebGPU training pipeline (Huge / Mega presets via capability detection)
✅ Browser BPE scorer + gallery model loader
✅ Browser-side benchmark runner (“Run benchmark on your loaded model”)
✅ Doc consolidation — every doc visible at /docs/[slug]
✅ WASM SIMD (-msimd128) — measured 1.6×
✅ Multi-threaded WASM (pthreads + SAB) — measured ~2×
🟡 Memory64 module (tinygpt64.{js,wasm}): partial. Works under Node; browser blocked at d_model ≥ 256 (ABI bug, task #66)
✅ Speedup curve vs WASM SIMD: Small 2.6× / Medium 6.8× / Large 9.3× / XL 12.1×
✅ WebNN active probe (webnn_probe.ts, builds a tiny MLGraph and verifies it computes, drives the +WebNN (gpu/npu) pill state — 2026-06-02 in 86433c3). Full transformer-as-MLGraph follow-up unblocked.

WebGPU kernels (in `webgpu/train*.wgsl`)

✅ Naive scalar matmul
✅ Blocked 4×4 matmul (matmul_blocked_vec4)
✅ Layer-norm subgroup variant (gated on gpuFeatures.subgroups)
✅ Cross-entropy subgroup variant
✅ Bias-grad subgroup variant
✅ FA2 forward in WGSL (flash attention in browser)
✅ f16-storage matmul (gated by verifyF16Storage)
✅ f16-compute matmul forward + backward (train_f16_compute.wgsl, gated by verifyShaderF16Compute — 2026-06-02 in 1ddf6ba / 2cdedac)
✅ Coop-matrix matmul (train_coopmat.wgsl, gated by verifyCoopMatrix — 2026-06-02 in 86433c3)
✅ OPFS persistence
✅ Patch kernels (patch_zero + patch_replace — 2026-06-02)
🟡 Subgroup matmul kernel (matmul_sg / matmul_abt_sg): kernel is in tree, but its numerics gate currently fails on M5 Pro, so runs fall back to vec4

Numerics-gate framework — every fast path (f16-storage, f16-compute, coop-matrix, subgroup) carries its own gate that compares against a f32 reference with a magnitude-aware tolerance. Gate-fail → silent fallback, zero regression risk. See docs/precision.md.
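
The gate-then-fallback pattern described above can be sketched as follows. This is an illustration only: the mean-relative-error formula, the tolerance values, and the names passes_gate and pick_kernel are assumptions, and docs/precision.md remains the source for the actual tolerance definition.

```python
def passes_gate(candidate, reference, rel_tol=1e-2, abs_floor=1e-6):
    """Magnitude-aware check: mean abs error scaled by the reference's mean magnitude."""
    n = len(reference)
    scale = max(sum(abs(r) for r in reference) / n, abs_floor)
    mean_err = sum(abs(c - r) for c, r in zip(candidate, reference)) / n
    return mean_err / scale <= rel_tol


def pick_kernel(fast_kernel, reference_kernel, probe_input, **gate_kwargs):
    """Return the fast kernel only if it matches the f32 reference on a probe."""
    try:
        fast_out = fast_kernel(probe_input)
    except Exception:
        return reference_kernel       # a kernel that fails to run counts as a gate failure
    if passes_gate(fast_out, reference_kernel(probe_input), **gate_kwargs):
        return fast_kernel
    return reference_kernel           # silent fallback, no behaviour change


def reference(xs):
    return [2.0 * x for x in xs]


def slightly_off(xs):
    return [2.0 * x * (1.0 + 1e-4) for x in xs]


def badly_off(xs):
    return [30.0 * x for x in xs]


probe = [0.5, -1.0, 3.0, 1000.0]
print(pick_kernel(slightly_off, reference, probe) is slightly_off)
print(pick_kernel(badly_off, reference, probe) is reference)
```

Scaling the error by the reference magnitude keeps one tolerance usable across small and large activations, which is the point of a magnitude-aware gate.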

Datasets + data pipelines

✅ tinygpt list-datasets — 22 curated entries (tool-calling / debugger / code / math / reasoning)
✅ tinygpt download-dataset (canonical hf://datasets/owner/name form)
✅ HF Datasets / Hub integration (hf-load, hf-inspect)
✅ GitHub data fetcher (fetch-github — issue→PR pairs)
✅ Magpie synthetic instruction generator
✅ Extractor-data pipeline (extractor-data — BFCL/τ-bench → {query, tool} pairs)
✅ Indic eval pipeline (eval-indic — MILU MCQ + IndicGenBench-XQuAD, smoke-validated)

Tooling + infra

✅ XCTest harness + swiftformat + lint CI (Mac)
✅ tinygpt inspect / validate (round-trip byte-compare verified on 110 MB model)
✅ tinygpt bench (TTFT/ITL/decode tok/s/peak RSS) + tinygpt bench-train
✅ tinygpt eval / score-bench (loss + benchmark scorers)
✅ tinygpt compare (side-by-side base vs LoRA-adapted)
✅ tinygpt debug-* (dtypes / load / logits / loss / names helpers)
✅ tinygpt screen tree (AX tree readout — focused-window JSON)
✅ lm-evaluation-harness MLX adapter

Headline metrics (Mac, M5 Pro / 48 GB)

Metric	Value	Target	Headroom
TTFT (warm)	5.8 ms p99	< 50 ms	✅ ~9× under
ITL p99	4.9 ms	< 30 ms	✅ 6× under
Decode tok/s	293 (mega-pilot 960M) → 696 (huge 221M)	> 50 tok/s	✅ 6× over
Cold start TTFT	24 ms (1B)	< 50 ms	✅ 2× under
Training Huge	42 ms/step	(baseline)	—
Speedup vs browser	17.2×	(baseline)	—
Largest model	960 M params (1.1 GB)	—	—

Learning artifacts (docs)

✅ docs/decision_log.md — every architectural decision logged
✅ Research bundles in docs/research/ (inference + quality benchmarks, kernel audit, mac decode baseline, wave-4 landscape, Indic evals)
✅ Session retrospectives (e.g., session_2026_05_31.md)
✅ Per-technique deep-dives (distillation.md, evolution_strategies.md, moe.md, mtp.md, lora_guide.md, interpretability.md, etc.)

2. SKIPPED

❌ Superseded by better alternatives

fp16 mixed-precision training — bf16 strictly better, shipped
ZeRO / FSDP / pipeline parallelism — multi-device only
State space models (Mamba / RWKV) — different architecture; ~2-3 week port; better as side-project
PagedAttention / continuous batching — multi-user inference only
Tree attention / lookahead decoding — marginal over speculative
Adapter modules (Houlsby / Pfeiffer) — LoRA’s older cousin, superseded
BitFit — train biases only; quality is poor
IA³ — element-wise scaling; superseded by LoRA family
Hyena / long-conv — different architecture
fp8 training — needs H100 / Blackwell hardware

❌ Dropped after audit (real cost, no payoff at our scale)

Flash Attention Metal kernel — MLXFast SDPA already fused (docs/research/wave_2_5_kernel_audit.md §1)
Int4 packed matmul Metal kernel — MLX quantized_matmul already hand-tuned (§3)
General SWE-bench leaderboard chase — Sonnet 4.6 dominates regardless of wrapper; play local-first / on-device game instead
Tinker cloud fine-tune as differentiator — use if needed; not a project differentiator (budget-ruled-out for solo)
Hooking into Apple App Intents — no public API for third-party LLMs to replace Apple’s FM

⏸ Deferred (waiting on external trigger)

Item	Trigger	Why deferred
cider W8A8 adoption	a 3B+ specialist ships	At ≤ 1B, Mac already 10× under realtime; cider’s prefill win is immaterial
ANE + GPU heterogeneous routing	Apple ships Stateful Models API (rumored late 2026)	Research-grade; current path uses private ANEMLL APIs
WebGPU subgroup matmul redesign	browser focus returns	Current gate fails (1415% mean_rel); fallback works
Vision encoder (ViT → tinygpt decoder)	vision-specialist demand becomes concrete	2-week research-grade work; not critical-path
Audio I/O (Speech.framework + AVSpeechSynthesizer)	voice-mode demo becomes priority	Not in scope for Wave 3
Async tool-call dispatch	parallel-tool specialist ships	LM inference time dominates subprocess time by 5-100× at current scales
ScreenCaptureKit raw image (CGS-init fix)	vision specialist needs raw bytes	AX tree sufficient for tool-calling specialists
Public launch (HF + writeup + HN)	≥ 1 specialist beats a fair baseline	Nothing to launch yet
Phase 7 browser perf (subgroups / coop-matrix / WebNN)	post-HN v2 push	Current 12.1× lift is the launch story

🚧 Blocked by hardware

Distributed training (ZeRO, FSDP, pipeline-parallel) — single device only
Native FP4 training — Apple M-series lacks FP4 tensor ops
Native FP8 training — same
Hardware-accelerated MoE routing — Apple silicon has no sparse-routing ops
ANE training acceleration — ANE is inference-only

🚧 Blocked by upstream library state

Item	Blocker	Workaround
QLoRA real-quantized base + LoRA	MLX-Swift quantized arrays don’t autograd through	Manual fake-quant in fwd (pedagogical, no memory win)
Sparse MoE hard routing	MLX-Swift no `scatter_add`	Soft (dense) routing shipped
Mixture-of-Depths hard top-K	same	Soft sigmoid gate shipped
Fast BPE encoding	swift-transformers single-threaded; 2 GB corpus = ~30 min	Rust-backed encoder via FFI (future)
Native int4 / int8 WebGPU matmul	spec doesn’t yet have quantized matmul extensions	Wait for subgroup / coop-matrix extensions
GGUF K-quant tensors (Q4_K / Q6_K / etc.)	dequantisers not yet written	GGUF reader itself already ships (§1: F32 / F16 / Q4_0 / Q8_0); K-quants slot into the same switch when needed

🚧 Blocked by budget

Synthetic SFT via frontier API ($1-10K) — use open-weights teacher via Magpie instead
Multi-TB dataset downloads — stream subsets (the HF importer does this)
Strong local judge for Constitutional AI / RLAIF — no 70B+ runs usable on a Mac
Public RLHF / PPO pipeline — 5× the code of DPO + 10× iteration; DPO covers 80-90% of the lift

3. TODO

ROI-ordered. Sourced from backlog.md (the living list, last sort 2026-05-31).

Tier A — DO NEXT (north-star aligned; specialists)

Until A1 lands, every optimization is theoretical. Until Tier E (eval pipelines) lands, every specialist is unmeasurable — A1’s “ship” criterion implicitly requires E1 + E3 wired before any score can be published. Sequencing: Tier D (data) + Tier E (evals) → A1 specialist → Tier B follow-ups.

⬜ A1. Train first specialist end-to-end (tool-caller) — 3-5 days execution + GPU hours. Validates north-star thesis.
✅ A2. Pull foundational datasets (DONE 2026-06-17) — all on disk: xlam-function-calling-60k, hermes-fc, function-calling-chatml, SWE-bench_Verified, alpaca-cleaned, orca_dpo_pairs, MetaMathQA, ultrafeedback, the-stack-smol (8 langs), python_code_instructions_18k_alpaca, all under ~/.cache/tinygpt/datasets/. Inventory: docs/dataset-inventory.md.
⬜ A3. Fetch GitHub issue→PR corpus for debugger — ~1 day with GITHUB_TOKEN
✅ A4. Pull BFCL + τ-bench via extractor-data (DONE): sources at ~/.cache/tinygpt/datasets/_external/{gorilla-bfcl,tau-bench}/; the eval wiring is Tier E work, not Tier A
✅ A5. Pull Indic eval datasets (MILU + IndicGenBench-XQuAD) (DONE): MILU is lm-eval-harness-based, source at _external/MILU/; wiring → E3
⬜ A6. Dataset inventory doc — ~30 min after A2-A5
⬜ A7. Real-data MILU baseline on flagship-huge-v5 — ~2 hr; depends on A5 + E3

Tier D — DATA (gaps blocking specialists)

Already pulled: hermes-fc.jsonl, ultrafeedback.jsonl, MetaMathQA, alpaca-cleaned, orca_dpo_pairs, FineWeb-Edu (50K-row sample via parquet decoder). The items below were blocked or missing for the planned specialists; all are now on disk:

✅ D1. xlam-function-calling-60k (DONE 2026-06-17) — ~/.cache/tinygpt/datasets/Salesforce/xlam-function-calling-60k/xlam_function_calling_60k.json (91.7 MB, ~60K rows). Required both HF_TOKEN and per-account license click-through at the dataset page.
✅ D2. function-calling-chatml + SWE-bench_Verified (DONE) — ~/.cache/tinygpt/datasets/Locutusque/function-calling-chatml/ (102 MB parquet) + princeton-nlp/SWE-bench_Verified/ (2 MB).
✅ D3. MS-MARCO + Natural Questions subset (DONE 2026-06-17) — microsoft/ms_marco/v1.1/ (3 shards, 207 MB: test+train+val) + google-research-datasets/natural_questions/default/ (2 train shards of 287, 375 MB — subset bounded for B25 training data; full corpus is multi-GB).
✅ D4. the-stack-smol + python_code_instructions_18k_alpaca (DONE 2026-06-17) — python alpaca: iamtarun/python_code_instructions_18k_alpaca/ (10.8 MB parquet, 18 612 rows decoded). the-stack-smol: 8 languages pulled (c 117 MB, c++ 147 MB, go 107 MB, java 67 MB, javascript 130 MB, python 83 MB, rust 132 MB, typescript 69 MB ≈ 850 MB total). Required license click-through at the dataset page.
✅ D5. GSM8K + MATH + HumanEval + MBPP eval splits (DONE 2026-06-17) — openai/gsm8k/main/ (test+train parquet) + HuggingFaceH4/MATH-500/test.jsonl (500 rows, the canonical math eval set) + openai/openai_humaneval/ + google-research-datasets/mbpp/full/ (prompt+test+train+val).

Tier E — EVAL PIPELINES (wire harnesses → automate scores)

Source code for BFCL / τ-bench / lm-eval-harness is already on disk under ~/.cache/tinygpt/datasets/_external/. Pulling source ≠ usable evaluator. Each item below is the wiring work — a tinygpt eval-<name> subcommand that takes a model path, runs the harness via subprocess, parses the score JSON, returns a clean number. Until these land, “did the specialist learn anything?” has no automated answer.

Architectural constraint (decided 2026-06-05): every E* item MUST emit structured JSONL conforming to a shared eval schema (E0). That makes two critical comparisons possible:

Cross-model: TinyGPT vs SmolLM2 vs Qwen3 vs Phi-mini on the same task — without this, “we trained a model” doesn’t answer “is it any good?”
Cross-checkpoint (training dynamics): every save-history checkpoint scored against the same task → see WHEN a capability emerges. Pairs with B13 interp-on-checkpoints — interp explains WHY features appeared, eval confirms IF they’re useful.

Both fall out for free if E0 + E8 are designed in, not retrofitted.
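
The two comparisons above reduce to grouping rows of one JSONL file by different keys. A minimal sketch, assuming each row carries model, step, task, and value fields (these field names, the model names, and the sample numbers are placeholders for illustration, not the actual E0 schema or measured scores):

```python
import json
from collections import defaultdict

# Placeholder rows; the values are invented purely to exercise the grouping.
SAMPLE = """
{"model": "model-a", "step": 1000, "task": "task-x", "value": 0.10}
{"model": "model-a", "step": 2000, "task": "task-x", "value": 0.25}
{"model": "model-a", "step": 2000, "task": "task-y", "value": 0.40}
{"model": "model-b", "step": 0, "task": "task-x", "value": 0.30}
"""


def load_rows(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def pivot(rows, by):
    """Group rows by one field, mirroring a by-step / by-model / by-task view."""
    table = defaultdict(list)
    for row in rows:
        table[row[by]].append(row)
    return dict(table)


def emergence_curve(rows, model, task):
    """Cross-checkpoint view: (step, value) pairs for one model on one task."""
    points = [(r["step"], r["value"]) for r in rows
              if r["model"] == model and r["task"] == task]
    return sorted(points)


rows = load_rows(SAMPLE)
print(sorted(pivot(rows, "model")))                   # cross-model grouping
print(emergence_curve(rows, "model-a", "task-x"))     # cross-checkpoint curve
```

Because every evaluator writes the same row shape, both views come from the same file with no per-harness parsing.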

✅ E0. Shared eval JSONL schema + tinygpt eval-compare (SHIPPED 2026-06-05) — Sources/TinyGPT/EvalCompare.swift. Codable Row with snake_case JSON. Three view modes: --by step / --by model / --by task. Sample artifact at docs/artifacts/emergence-smoke-2026-06-05.jsonl.
✅ E1. tinygpt eval-bfcl <model> (SHIPPED 2026-06-05) — Sources/TinyGPT/EvalBFCL.swift. Boots tinygpt serve, invokes bfcl_eval._llm_response_generation + bfcl_eval.eval_checker.eval_runner via subprocess with OpenAI-compatible base URL. Default 10 BFCL categories. PRD: docs/prds/E1-bfcl-eval.md. Unblocks A1.
✅ E2. tinygpt eval-tau-bench <model> (SHIPPED 2026-06-05) — Sources/TinyGPT/EvalTauBench.swift. Retail + airline envs. Configurable user simulator model. PRD: docs/prds/E2-tau-bench-eval.md.
✅ E3. tinygpt run-lm-eval <model> (SHIPPED 2026-06-05) — Sources/TinyGPT/RunLmEval.swift. Two modes: --hf-model <id> (baseline scoring via HF transformers) and --tinygpt-model <ckpt> (boots tinygpt serve + routes lm-eval via local-completions for our actual forward pass). tinygpt serve learned scoreLogprobs for echo+logprobs requests. Smoke-tested cross-checkpoint + cross-model emergence sweep.
⬜ E4. tinygpt eval-gsm8k <model> — standalone scorer. Parse model’s final numeric answer, compare to gold. Tiny — covered by E3 if lm-eval-harness lands, but a standalone fallback gets you a number in ~half-day if E3 slips. May be unnecessary — E3 via local-completions should handle gsm8k; will validate on first post-N02 sweep.
✅ E5. tinygpt eval-humaneval <model> + sandbox (SHIPPED 2026-06-05) — Sources/TinyGPT/EvalHumanEval.swift + Rust crate at scripts/humaneval-sandbox/ (macOS sandbox-exec policy at macos-sandbox.sb). HumanEval + MBPP suites. PRD: docs/prds/E5-humaneval-sandbox.md.
⬜ E6. tinygpt eval-scaledown <model> — clone ScaleBench, wire to TinyGPT-loaded model, run. Prereq for B25 submission. ~half-day after E1’s subprocess pattern is the template. See docs/recipes/b25-scaledown.md for the training-side plan.
✅ E7. tinygpt judge <out.jsonl> --judge-model <model> (SHIPPED 2026-06-05) — Sources/TinyGPT/JudgeShim.swift. Two modes: pairwise (chosen-vs-rejected) and rate (1-10 score). PRD: docs/prds/E7-judge-shim.md.
✅ E8. Train-time eval hook + dashboard plot (SHIPPED 2026-06-05) — --eval-every N --eval-tasks csv --eval-limit N flags in tinygpt train. Spawns background run-lm-eval per checkpoint, appends to <out-stem>-evals.jsonl. Non-blocking; skips if previous eval still in flight. PRD: docs/prds/E8-train-time-eval-hook.md. Post-training equivalent: scripts/score-run.sh.
❌ E9. Prompt-tiering A/B on the planner unhappy suite — RAN 2026-06-13 against google/gemma-3-12b on the n=130 drill. Hypothesis refuted. Compact (name-only) action index made every dim worse: ambig 11/40 → 7/40 (-10pp), oos 51/60 → 47/60 (-6.7pp), destructive 24/30 → 23/30 (-3.3pp). The failure-pattern diff is mostly silent (one ambig pattern shrank 12→11; one new ⚠ pattern in B) — meaning compact mode doesn’t introduce new failure modes, it just makes existing intent-mismatch confusions worse. Interpretation: at our scale (12B + 12-action surface), Gemma needs the schemas to confidently route non-action intents; the schemas are evidence for what kinds of requests pace can do, which sharpens both “this is an action” and “this is NOT an action” judgments. Steal #2 from the Shortcut essay is not portable to Pace as stated. v11 stays the default; v11-compact retained at grammars/pace-system-prompt-v11-compact.txt for re-test against larger catalogs (e.g. App Intents-class surfaces) where the schema budget shifts the tradeoff. Updated docs/learn/agent-context-hierarchy.md Steal #2 with this verdict.
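
The percentage-point deltas quoted for E9 follow directly from the raw counts. A short worked check using only the numbers stated above (the function name pp_delta is illustrative):

```python
from fractions import Fraction


def pp_delta(before, after, n):
    """Change in pass rate, in percentage points, for counts out of n."""
    return float(Fraction(after - before, n) * 100)


# (passes with v11, passes with v11-compact, suite size) per dimension, from the E9 entry.
results = {
    "ambig": (11, 7, 40),
    "oos": (51, 47, 60),
    "destructive": (24, 23, 30),
}

for dim, (before, after, n) in results.items():
    print(f"{dim}: {before}/{n} -> {after}/{n} ({pp_delta(before, after, n):+.1f}pp)")
```

The three suite sizes sum to the n=130 drill, and each delta matches the rounded figure in the entry.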

Browser viewers shipped 2026-06-05

Page	Role
`/eval-leaderboard.astro`	drag-drop E0 JSONL → 3-view comparison (by step / model / task)
`/sae-timeline.astro`	drag-drop B13 SAE timeline JSONL → MSE-over-step + L0-over-step charts

Rust performance tools shipped 2026-06-05

Crate	Role
`scripts/parquet-decoder/`	replaces `python3 scripts/parquet_to_txt.py`; static binary, no pyarrow
`scripts/hf-downloader/`	parallel HF shard fetches with progress + retry + resume
`scripts/humaneval-sandbox/`	E5 supporting sandbox runner (Rust + macOS sandbox-exec)

Eval — runbook artifacts shipped 2026-06-05

Script	Role
`scripts/score-checkpoint.sh`	one `.tinygpt` → E0 JSONL row(s) via lm-eval
`scripts/score-run.sh`	every history checkpoint of a run + SmolLM2 baseline → JSONL + 3-view summary
`scripts/sae-run.sh`	SAE-per-checkpoint sweep → JSONL timeline (B13 v2 input)
`scripts/score-baselines.sh`	5 HF baselines (SmolLM2-135/360M, Qwen3-0.6B, TinyLlama, Phi-3-mini) on the same task set

Total Tier E: ~6-8 focused days. Do E0 first (schema is everyone’s dependency), then E3 (highest harness leverage), then E1 (A1 ship-blocker), then E8 (multi-checkpoint), then the rest as nightly arcs.

Tier B — NEXT QUARTER (multi-specialist + product)

⬜ B1. Second specialist (shell or SQL) — 3-5 days; depends on A1
⬜ B2. Mini-router on real BFCL data — ~half day after A4
⬜ B2b. Bake-off — classifier-head router vs pure-GPT-with-FSM — settles whether architectural deviation is justified
⬜ B3. FSM constraint-injection from router prediction — ~3 days; depends on B2
⬜ B4. Tool-call eval harness (subprocess refactor for BFCL/τ-bench) — ~half day
⬜ B5. Cloud-escalation training signal ({"defer_to_cloud": true}) — ~1 week
⬜ B6. Mac app demo — ~1 week; depends on A1
⬜ B7. Specialist routing model — 1-2 weeks; depends on B1
⬜ B8. Multilingual specialist (Sarvam-Edge / Airavata base) — 1-2 weeks; depends on A7
⬜ B9. Energy J/token measurement (needs sudo for powermetrics) — ~1 day

Pretrain + runtime quality (added 2026-06-04 — “good product” lens, not launch optics):

⬜ B10. Quality classifier on pretrain data (FineWeb-Edu-style) — tiny fastText classifier on educational-quality labels, score corpus, keep top X%. Highest direct quality lift per dev-day. ~2 days. See §4.3.
✅ B11. WSD schedule (warmup-stable-decay) (SHIPPED) — --schedule wsd --decay-steps N in Sources/TinyGPT/Train.swift. Linear warmup → stable plateau → linear decay. Replaces cosine; the decay phase IS the annealing knob.
✅ B12. Loss-spike recovery + replay (SHIPPED) — spike detector on by default; --no-spike-detect opts out. Grad-norm tracker triggers auto-rollback + LR drop. In Sources/TinyGPT/Train.swift.
🟡 B13. Interp-on-checkpoints (partial — --save-every N shipped in tinygpt train; multi-checkpoint replay tooling still pending) — replay SAE / MEMIT / tinygpt patch across the multi-checkpoint timeline. Checkpoint emission is in code; the analysis-side batch driver is the open part. See §4.3.
✅ B14. Speculative decoding (Mini-Llama draft for Mega) (SHIPPED): Sources/TinyGPT/SpeculativeDecode.swift implements a simplified version of Leviathan et al. 2023. Greedy speculative decoding; the speedup is roughly K× on benign branches, where the target accepts the drafted tokens.
✅ B15. Layer-wise LR decay for SFT (SHIPPED) — --lr-layer-decay F flag in tinygpt train. Each block’s gradient is multiplied by factor^(L - 1 - i) so deeper layers get the full LR. Sources/TinyGPTModel/Trainer.swift exposes lrLayerDecay as graph-pure scalar multiply per leaf; smart default 0.85 in certain training modes.
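
Two of the schedules above are simple enough to write down: the WSD shape from B11 (linear warmup, stable plateau, linear decay) and the per-block multiplier factor^(L - 1 - i) from B15. A minimal sketch; the function names, the min_lr parameter, and the exact warmup formula are assumptions, not the code in Train.swift or Trainer.swift.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-stable-decay: ramp up, hold at peak, then decay linearly."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr                      # stable plateau
    frac = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * frac


def layer_multipliers(n_layers, factor):
    """Per-block scale factor^(L - 1 - i); the last block (i = L - 1) gets 1.0."""
    return [factor ** (n_layers - 1 - i) for i in range(n_layers)]


print([wsd_lr(s, 100, 1.0, warmup_steps=10, decay_steps=20) for s in (0, 50, 90, 100)])
print(layer_multipliers(4, 0.5))
```

With the 0.85 default quoted in B15, the multipliers shrink geometrically toward the embedding side while the final block keeps the full learning rate.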

Competitor-aware additions (added 2026-06-04 — surfaced by web sweep, not Jan-2026 cutoff knowledge):

⬜ B16. M5 Neural Accelerator prefill benchmark + bump — verify the claimed 3.5×–4× M5-vs-M4 prefill speedup is materializing on TinyGPT’s MLX path. Current pin: mlx-swift 0.31.3 on macOS 26.5 / M5 Pro (well past the 26.2 floor). Bump to latest (0.31.4) and benchmark. ~half-day. Free win if it’s already on; bump is reversible. See §4.3.
✅ B17. SAE Lens interop / Neuronpedia format export (SHIPPED, option c) — tinygpt sae-to-saelens <in.sae> --out <dir> converts to SAELens (decoderesearch/SAELens) on-disk layout; Neuronpedia consumes the same. Sources/TinyGPT/SaeToSaelens.swift.
✅ B18. nanochat-style --depth single-knob HP derivation (SHIPPED) — --depth N in Sources/TinyGPT/Train.swift derives the GPT-2-shaped width / heads / LR / batch / steps from one knob.
✅ B19. Group-SAE (layer-group SAE training) (SHIPPED) — tinygpt sae --layer-group A,B,C trains ONE SAE on the union of residuals across the listed layers (mutually exclusive with --layer). Provenance round-trips through SAELens export via tinygpt_is_group_sae=true metadata key.
⏸ B20. Investigate learnable cross-stream attention (EVALUATED 2026-06-17; verdict: skip, revisit on scale or paper). Full read-and-evaluate write-up at docs/research/cross-stream-attention-evaluation.md. Gain is ~3–6% wall-clock at speedrun scale on FineWeb; it is not visible at our scales, interacts non-trivially with B14 spec-decode + GaLore + DoRA TGLA + ANE M8, and is not yet a paper. Revisit if (a) TinyGPT ships a from-scratch ≥50M FineWeb-class run, (b) the speedrun PR gets a formal ablation write-up, or (c) the interaction-surface items land formally so the boilerplate is paid.
⬜ B21. Micro-AutoMixer for specialist data mixes — Poolside-style data mixture optimization, scaled down: train 6-12 proxy runs across code/math/tool/web ratios, score on fixed capability evals, fit a simple surrogate, then propose the next mix. Do this before expensive specialist training so data ratios stop being hand-wavy. ~2-3 days plus small proxy runs. See §4.3.
✅ B22. Token-preserving agent trajectory recorder (SHIPPED + verified 2026-06-17) — tinygpt agent --trajectory-dir <dir> writes one .atraj JSON file per rollout with per-step role / decoded content / raw token IDs (input_ids for fed text, output_ids for sampled assistant text) / structured tool-call args / structured tool results / rewards. Format + reader API: Sources/TinyGPTModel/AgentTrajectory.swift. Threading: Sources/TinyGPT/AgentLoop.swift (recorder hooks at every turn boundary; finishTrajectory(summary:) flushes on session end). CLI: Sources/TinyGPT/Agent.swift (--trajectory-dir, --trajectory-task). 3 unit tests (roundtrip byte-equality of input_ids/output_ids, recorder lifecycle, empty-trajectory + auto-mkdir) — all pass. Docs: docs/agent_runtime.md §“Token-preserving trajectories (B22)”. Unblocks B29.
🟡 B23. Agent eval protocol hardening — statistical-reporting + budget-metadata slice shipped 2026-06-18: tinygpt eval-gate --passes K now gates on K-pass means and preserves per-trial scores, stdev, stderr, and 95% CI in gate-result.json; --budget evals/sample-budget.json attaches fixed max steps, sandbox resources, sampling params, seed, and infra patches under the report’s "protocol" block. Swift eval rows emitted via EvalHarnessSupport.appendRow now attach the same protocol metadata when --budget / TINYGPT_EVAL_BUDGET is present, with eval-gate forwarding TINYGPT_EVAL_PASSES. Remaining: Pace unhappy Python rows, future SWE-mini/Terminal-mini rows, eval-compare error-bar rendering, plus actual sandbox/resource enforcement. PRD: docs/prds/B23-agent-eval-protocol.md.
⬜ B24. Muon re-benchmark at 1B+ or skip — Poolside reports Muon giving a large-step efficiency win at scale with distributed overhead below 1%; TinyGPT’s current Muon smoke loses badly at small scale. Do not promote it until a ≥1B-ish run or a proxy matmul-dominated benchmark shows the overhead is amortized. ~half-day once a large run exists.
🟡 B26. Server-side deferred tools in tinygpt serve (scaffolding shipped 2026-06-13; BFCL parity gate pending) — tinygpt serve --tool-mode {full,deferred} ships. full is default and byte-for-byte identical to today. deferred swaps in ServeToolsSpec.compactSystemPrompt() (one-line-per-tool index + get_tool_info(name) contract) and compactGrammarSpec() (verb enum extended with get_tool_info). Non-streaming /v1/chat/completions intercepts verb=get_tool_info, appends a synthetic tool result with the schema, and re-prompts (cap=3 hops). Streaming + Ollama emit the meta-tool verbatim — documented in the PRD, not a bug. PRD: docs/prds/B26-deferred-tools.md. Unit tests: DeferredToolsTests.swift. tinygpt eval-bfcl now passes --tools / --tool-mode through to its managed server, and a one-sample demo-model full-vs-deferred smoke completed 2026-06-19. Default flips on after: BFCL avg of --tool-mode deferred within ±2pp of --tool-mode full on the real specialist run, with ≤2 get_tool_info round-trips per sample.
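
The K-pass statistics that B23 stores in gate-result.json (per-trial scores, mean, stdev, stderr, 95% CI) can be sketched as below. This assumes the sample standard deviation and a normal-approximation interval (z = 1.96); which estimator eval-gate actually uses is not stated here, and the dictionary keys are illustrative.

```python
import math
import statistics


def k_pass_summary(scores, z=1.96):
    """Summarise K repeated eval passes of the same model on the same task."""
    k = len(scores)
    mean = statistics.fmean(scores)
    stdev = statistics.stdev(scores) if k > 1 else 0.0   # sample stdev (n - 1)
    stderr = stdev / math.sqrt(k) if k > 1 else 0.0
    return {
        "passes": k,
        "scores": list(scores),                          # per-trial scores are preserved
        "mean": mean,
        "stdev": stdev,
        "stderr": stderr,
        "ci95": [mean - z * stderr, mean + z * stderr],
    }


def gate(scores, threshold):
    """Gate on the K-pass mean rather than on any single run."""
    return k_pass_summary(scores)["mean"] >= threshold


summary = k_pass_summary([0.6, 0.7, 0.8])                # placeholder scores
print(summary)
print(gate([0.6, 0.7, 0.8], threshold=0.65))
```

Gating on the mean with the interval attached makes it visible when a pass or fail sits inside run-to-run noise.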

Castform-inspired training-pipeline trio (added 2026-06-13 from docs/learn/castform-rl-finetune.md):

🟡 B28. Composite reward framework with named dimensions (scaffolding shipped 2026-06-13) — CompositeReward + RewardDimension + CompositeRewardBuilder in native-mac/Sources/TinyGPTModel/CompositeReward.swift (6 unit tests, all passing). Castform-pattern (docs/learn/castform-rl-finetune.md §1). Training-loop integrations (DPO --reward-fn, ES, GRPO 5.1) are the remaining work; viewer (C10) gains per-dim curves. PRD: docs/prds/B28-composite-reward-framework.md.
🟡 B29. Trace-to-training-data pipeline (V1 shipped 2026-06-17 — --mode sft + tool-echo drop + exact dedup + MinHash near-dedup) — tinygpt traces-to-data <atraj-dir> --task <t> --out <jsonl> consumes B22 .atraj rollouts (Sources/TinyGPT/TracesToData.swift). Smoke: evals/traces-to-data-smoke.sh + 5-trajectory fixture, asserts post-filter row counts + per-stage filter stats + --no-tool-echo-drop / --minhash-threshold 0.6 / --dry-run / --judge-model (reserved + rejected). Recipe: docs/recipes/from-traces.md. V2 follow-ups: wire tinygpt judge (E7) as a subprocess for the LLM-pivot judge step; add --mode dpo (reward-source or judge-margin); external observability ingest (Braintrust / Langfuse). PRD: docs/prds/B29-trace-to-training-data.md.
🟡 B31. Unified model gallery + project-level model pins (scaffolding shipped 2026-06-13; first specialist package registered 2026-06-19) — extends browser/src/gallery-schema.ts with a kind discriminator (browser-bin / mac-tinygpt / mac-adapter / mac-gguf / mac-safetensors-hf) so one published manifest covers browser + Mac models. New tinygpt.project.json per-project pin file (package.json-style). Swift mirrors + 11 unit tests pass in this PR (GalleryManifest.swift, ProjectManifest.swift). specialists/qwen3-4b-file-ops-distilled now provides the first TinyGPT-built package: model card, prompt, eval report, artifact lock, and MLX validation helper for the fused file-ops distilled 4B. tinygpt pull + tinygpt validate CLI extensions + browser UI filter remain. PRD: docs/prds/B31-gallery-and-project-pins.md. The trace-loop dividend: project pins flip the Castform asymmetry — pinning + serving locally means the project owner naturally accumulates .atraj traces (B22) that B29 turns into training data. The substrate-refinement cycle closes here.
✅ B30. Prompt reasoning-depth classifier (shipped + verified 2026-06-17) — tinygpt reasoning-classify --train|--score|--filter labels prompts as {single-hop, multi-hop, comparison, other}. Bag-of-trigram softmax-4 (the FineWeb-Edu shape extended to multiclass), TGFR on-disk format. Files: Sources/TinyGPT/ReasoningClassify.swift, subcommand wired in TinyGPT.swift. Smoke: evals/reasoning-classifier-smoke.sh + evals/reasoning-classifier-fixtures/{train,heldout}.jsonl. Smoke result: macro-F1 1.000 on the 32-row held-out (well above PRD’s 0.5 bar); score + filter modes verified. Recipe: docs/recipes/balanced-training-mix.md. BagOfNgramClassifier shared utility deferred — V1 duplicates the tokenize/hash/ngram block from QualityClassifier. PRD: docs/prds/B30-prompt-reasoning-classifier.md.
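B28's idea of a composite reward with named dimensions can be pictured with a short Python sketch. The class names echo the Swift types named above, but the fields, the weight normalization, and the two example dimensions are assumptions made for illustration, not the shipped API.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class RewardDimension:
    name: str                      # label used for per-dimension curves
    weight: float                  # relative importance (assumed non-negative)
    score: Callable[[str], float]  # completion -> score, nominally in [0, 1]


class CompositeReward:
    def __init__(self, dims: List[RewardDimension]):
        if not dims:
            raise ValueError("at least one dimension is required")
        names = [d.name for d in dims]
        if len(set(names)) != len(names):
            raise ValueError("dimension names must be unique")
        if sum(d.weight for d in dims) <= 0:
            raise ValueError("weights must sum to a positive value")
        self.dims = list(dims)

    def evaluate(self, completion: str) -> Tuple[float, Dict[str, float]]:
        """Return (weighted mean, per-dimension scores)."""
        per_dim = {d.name: d.score(completion) for d in self.dims}
        total_weight = sum(d.weight for d in self.dims)
        total = sum(d.weight * per_dim[d.name] for d in self.dims) / total_weight
        return total, per_dim


def is_valid_json(text: str) -> float:
    try:
        json.loads(text)
        return 1.0
    except ValueError:
        return 0.0


reward = CompositeReward([
    RewardDimension("valid_json", 2.0, is_valid_json),
    RewardDimension("brevity", 1.0, lambda t: 1.0 if len(t) <= 20 else 0.0),
])
total, per_dim = reward.evaluate('{"a": 1}')
```

Keeping the per-dimension dictionary alongside the scalar is what lets a training viewer plot one curve per named dimension instead of a single opaque reward.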

Market-landscape positioning (added 2026-06-13 — see docs/sessions/2026-06-13-market-landscape-mac-first.md):

The competitive scan found the whole field monetizes the cost a Mac-first tool zeroes out (cloud GPU rent / trace ingestion) and is consolidating into infra + frontier-lab acquirers. Three whitespaces: Mac-first training as a product (B6 + B31), eval+interp+local fused (already shipped — the moat), and academic agent benchmarks as a local CI gate (B32). These two items reframe shipped infra as product surfaces.

🟡 B32. tinygpt eval as a CI / pre-commit gate (shipped 2026-06-13; K-pass stats + budget metadata added 2026-06-18; live multi-suite GPU run pending a self-hosted runner) — tinygpt eval-gate runs declared suites vs a baseline and exits non-zero on regression. Pure gate logic in TinyGPTModel/EvalGate.swift (direction heuristic, pp thresholds, per-suite override, missing-baseline handling, K-pass mean + stdev/stderr/95% CI + optional protocol budget) with unit tests; CLI orchestration in Sources/TinyGPT/EvalGate.swift (--candidate no-GPU path, --update-baseline, --passes, --budget, gate-result.json). Spec lives in eval-gate.json or the tinygpt.project.json eval block (B31 schema add). GitHub Action .github/actions/tinygpt-eval-gate/, recipe docs/recipes/eval-gate.md, smoke evals/eval-gate-smoke.sh (asserts exit 0 match, exit 1 regression, repeated-run stats, and budget metadata). Flips to ✅ once a real specialist’s suites run end-to-end through the gate on a self-hosted Mac runner. PRD: docs/prds/B32-eval-ci-gate.md.
⬜ B35. Local-agent vertical PoC — code reviewer on a Mac — future-looking kill-or-validate experiment surfaced 2026-06-17 by the Vercel Eve launch. Eve validates the agent-platform thesis but is cloud-bound; tinygpt already owns ~70% of a local-agent stack (B22 + B26 + B28 + B29 + B30 + B32 + QLoRA + serve) and can target the wedge Eve doesn’t address: zero-cloud, specialist-distilled agents that run entirely on the user’s Mac. PRD: docs/prds/B35-local-agent-vertical-poc.md. Kill criterion: 4-week timebox, ≥5pp lift over zero-shot open baseline on the chosen code-review eval, else publish the negative result and keep tinygpt narrow as a model factory.
🟡 B33. tinygpt quickstart — data → trained specialist in one command — CLI wizard: inspect data → auto-pick base from gallery → infer recipe → train → eval vs base → drop into chat. The CLI sibling of B6’s GUI Factory tab; closes the gap between “MLX-LM can technically do this” and “a non-ML-engineer actually does it.” PRD: docs/prds/B33-laptop-finetune-onboarding.md. Status: decision core (RecipeResolver, pure + unit-tested) + CLI (Quickstart.swift + dispatch) + --dry-run plan & project.json emission + evals/quickstart-smoke.sh + docs/quickstart.md shipped. Live train→sample path wired (orchestrates sft/sample); user runs it on a Mac. Follow-ups: from-scratch raw-text path, auto-pull of bare gallery ids, quantitative eval-vs-base delta.
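The K-pass statistics and regression gate that B32 describes can be sketched in a few lines. This is an illustrative Python version under stated assumptions: scores are fractions in [0, 1], the 95% CI uses a normal approximation (1.96 × stderr), and the function and field names are hypothetical rather than the layout of gate-result.json.

```python
import math
import statistics


def k_pass_stats(scores):
    """Summarize K repeated eval passes of one suite."""
    k = len(scores)
    if k == 0:
        raise ValueError("need at least one pass")
    mean = statistics.fmean(scores)
    stdev = statistics.stdev(scores) if k > 1 else 0.0  # sample stdev
    stderr = stdev / math.sqrt(k)
    half = 1.96 * stderr  # assumption: normal-approximation 95% interval
    return {
        "k": k,
        "trials": list(scores),  # per-trial scores are preserved
        "mean": mean,
        "stdev": stdev,
        "stderr": stderr,
        "ci95": (mean - half, mean + half),
    }


def gate(candidate_scores, baseline_mean, threshold_pp=1.0, higher_is_better=True):
    """Exit code 1 if the K-pass mean regresses by more than threshold_pp points."""
    stats = k_pass_stats(candidate_scores)
    delta_pp = (stats["mean"] - baseline_mean) * 100.0
    if not higher_is_better:
        delta_pp = -delta_pp  # e.g. perplexity-style metrics
    return {"stats": stats, "delta_pp": delta_pp, "exit_code": 1 if delta_pp < -threshold_pp else 0}


ok = gate([0.80, 0.82, 0.78], baseline_mean=0.80)
bad = gate([0.80, 0.82, 0.78], baseline_mean=0.85)
```

Gating on the mean rather than a single run is what keeps one noisy pass from failing CI; the interval is reported so a reader can judge whether a small delta is within run-to-run noise.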

External-leaderboard arc (added 2026-06-05 — first public competitive submission target):

⬜ B27. Mac SLM agentic leaderboard v0 (scaffolding shipped 2026-06-13) — one publication-shape artifact at docs/research/mac_slm_leaderboard_v0.md cross-cutting BFCL + τ-bench + Pace unhappy-paths + decode tok/s + peak RSS. scripts/eval_slm_full.sh <model-id> <tag> runs all four suites against one LM Studio model; scripts/build_slm_leaderboard.py --manifest … rebuilds the table. Composite = accuracy × speed × cost, citing score_formula.py (DRY — no re-derivation in the doc). Status flips to ✅ when ≥2 models land on the board so the table actually compares something. First-model target: gemma-3-12b-it (Run 5 in docs/research/mac_decode_baseline_m5pro.md).
⬜ B25. ScaleDown Challenge specialist — extractive context compression — train a task-specific SLM that takes (query, long_context) and returns the subset of sentences relevant to the query. Token-level relevance classifier head on the residual stream → sentence-level aggregation → threshold-keep. Training data: MS-MARCO + Natural Questions + similar (query, doc, answer) triplets with teacher-labeled per-sentence relevance scores; teacher can be a local Qwen/SmolLM. Eval via ScaleBench (their open-source harness, downstream F1/EM after compression). Submit to the ScaleDown Challenge leaderboard. ~3-5 days end-to-end: dataset pulls (~1 hr, reuse tinygpt download-dataset), teacher-labeling pipeline (~half-day), classification-head module in Sources/TinyGPTModel/ (~half-day), new tinygpt compress subcommand with token-level BCE loss (~1 day), ScaleBench integration + submission (~half-day). Pairs naturally with A1 (different domain, same A-track shape) and gives TinyGPT a public proof-point — “competitive task SLM trained from scratch on a Mac” — with an external scoreboard. See §4.3.
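The last two stages of the B25 pipeline (sentence-level aggregation of token scores, then threshold-keep) can be sketched independently of the classifier head. This is a minimal Python illustration: the token scores are assumed to arrive as per-sentence lists of probabilities, and the function name, the mean/max choice, and the 0.5 default are assumptions rather than the planned `tinygpt compress` behaviour.

```python
def compress(sentences, token_scores, threshold=0.5, agg="mean"):
    """Keep the sentences whose aggregated token relevance clears the threshold.

    sentences:    list of sentence strings, in document order
    token_scores: one list of per-token relevance probabilities per sentence
    agg:          "mean" or "max" over the tokens of each sentence
    """
    if len(sentences) != len(token_scores):
        raise ValueError("sentences and token_scores must align")
    if agg not in ("mean", "max"):
        raise ValueError("agg must be 'mean' or 'max'")
    kept = []
    for sentence, scores in zip(sentences, token_scores):
        if not scores:
            continue  # an empty sentence carries no evidence
        value = max(scores) if agg == "max" else sum(scores) / len(scores)
        if value >= threshold:
            kept.append(sentence)  # original order is preserved
    return kept


doc = ["Paris is the capital of France.", "The weather was mild.", "France is in Europe."]
scores = [[0.9, 0.8], [0.1, 0.2], [0.4, 0.9]]
kept = compress(doc, scores, threshold=0.5)
```

Mean aggregation favours uniformly relevant sentences while max keeps any sentence with one strongly relevant token; which works better for downstream F1/EM is an empirical question for the ScaleBench runs.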

Tier C — POLISH (mostly shipped this session)

✅ C1. CLI cosmetic fixes — 27 subcommands now exit(0) on --help; bench-train --help shows correct name. Shipped 2026-06-02 in 49dead5.
✅ C2. Roll up pre-switch CLI shims into main switch — 17 shims absorbed; TinyGPT.swift -170 LoC. Shipped in 49dead5.
✅ C3. DoRA on-disk adapter format (SHIPPED 2026-06-09) — TGLA v2 (magic TGLA, version 2) adds optional per-entry [out] magnitude vector after loraB; v1 readers ignore it; v2 readers autodetect. See Sources/TinyGPTModel/LoraIO.swift header. Memory: [[project_dora_fix_shipped_2026_06_09]].
✅ C4. Tool-call extractor: BPE tokenizer support (SHIPPED 2026-06-17) — tinygpt train-extractor --tokenizer <hf-dir> routes encode() through HFTokenizer instead of UTF-8 bytes; the tokenizer path is persisted in the checkpoint header (tokenizerSource). ToolRouterLoader.load surfaces it on cfg.tokenizerSource; Agent.swift autoloads the same tokenizer when the router declares one; AgentLoop.RouterHook gains an optional tokenizer field and predictWithRouter uses it when set. Byte-level remains the default + the fallback on tokenizer-load failure. Train-time warning fires when --tokenizer is set but --vocab-size is left at 256, and again if any encoded token ID lands outside [0, vocab-size).
⬜ C5. Decode jitter under thermal load — ~1 day (needs sustained workload measurement)
✅ C6. ChatML template inline-system split — splitChatmlSystem helper + 6 unit tests. Shipped in 49dead5.
✅ C7. Save+reload XCTest for LoRA adapters — roundtrip + arch-mismatch coverage. Shipped in 49dead5.
✅ C8. Install-path discipline — ~/.cache/tinygpt/ for adapters + corpus discovery; off /tmp. Shipped in 49dead5.
✅ C9. Determinism harness (SHIPPED 2026-06-17) — --seed N now seeds both MLXRandom AND BatchRng (Splitmix64-backed host generator) in Sources/TinyGPTModel/BatchRng.swift. All 7 corpus-sampler Int.random(in:) call sites were swapped to BatchRng.randomInt(in:) across Trainer.swift, SFTCorpus.swift, PreferenceCorpus.swift. Two runs with the same seed now produce an identical batch sequence (modulo the prefetcher-scheduling caveat; see docs/determinism.md). 5 unit tests pin the contract: same-seed determinism, different-seed divergence, reset-then-reseed reproducibility, range bounds, Splitmix64 bit-pattern.
✅ C10. Training-run dashboard (SHIPPED) — --log-jsonl <path> in tinygpt train emits append-only JSONL via Sources/TinyGPT/TrainLog.swift; consumed by browser/src/pages/training-dashboard.astro for live charts.
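The seeded host generator behind C9 can be illustrated with the standard SplitMix64 recurrence. This Python sketch is not the Swift `BatchRng`: the class and method names are hypothetical, and using rejection sampling for the bounded draw is an assumption about how an unbiased `randomInt(in:)` could be built.

```python
MASK64 = (1 << 64) - 1


class SplitMix64:
    """SplitMix64: a 64-bit counter advanced by a fixed odd constant, then mixed."""

    def __init__(self, seed):
        self.state = seed & MASK64

    def next_u64(self):
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK64
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        return z ^ (z >> 31)

    def random_int(self, lo, hi):
        """Uniform integer in the closed range [lo, hi], via rejection sampling."""
        if lo > hi:
            raise ValueError("empty range")
        span = hi - lo + 1
        limit = (1 << 64) - ((1 << 64) % span)  # largest multiple of span <= 2**64
        while True:
            x = self.next_u64()
            if x < limit:  # reject the biased tail
                return lo + x % span


def batch_offsets(seed, corpus_len, batch_size):
    """Sample batch start offsets the way a seeded corpus sampler might."""
    rng = SplitMix64(seed)
    return [rng.random_int(0, corpus_len - 1) for _ in range(batch_size)]


run_a = batch_offsets(seed=42, corpus_len=10_000, batch_size=8)
run_b = batch_offsets(seed=42, corpus_len=10_000, batch_size=8)
```

Because all of the generator's state is a single integer, reseeding is a plain assignment, which is what makes a reset-then-reseed contract straightforward to test.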

Tier 5 — RESEARCH FRONTIER (2026 stretch goals)

Pauses the “training at 2024 fundamentals” cadence; deliverable is a paper-shaped artifact + reproducible code + a scaling-curve point, NOT a polished UX feature.

⬜ 5.1 Reasoning training on a 22M model — 5-7 days; expected outcome is the negative result (CoT below emergence). Publishable.
⬜ 5.2 Test-time compute scaling — 3-5 days; quality-vs-FLOPs plot at 22M-scale matching Snell et al. methodology. Most cleanly publishable.
⬜ 5.3 Vision-language toy — ~2 weeks; ViT + projector + LLaVA-style. Smallest from-scratch VL model on consumer hardware.
⬜ 5.4 Diffusion LM micro-implementation — 1-2 weeks; new paradigm via masked denoising loss.
⬜ 5.5 Real sparse MoE kernels — 2-3 weeks; custom Metal kernel + measure FLOP reduction.
⬜ 5.6 TTS toy (text-to-speech via audio-token GPT) — ~2-4 weeks; integrate EnCodec, train an autoregressive decoder over discrete audio tokens (VALL-E / MusicGen shape). The transformer side already exists in TinyGPT; the new pieces are codec integration, text→audio conditioning, vocoder decode, and an audio data pipeline. Scoping note (2026-06-03): comes AFTER the Wave 3 specialist track (A1-B8) AND after 5.3 vision-language toy — both higher-priority research arcs ahead of it.
⬜ 5.7 Specialized explainer-video model — ~3-6 weeks for a Lamina-like toy: document/prompt → script → storyboard DSL → deterministic whiteboard/diagram render. This is NOT a Sora/Runway competitor; the first useful version is a specialized visual-planning model plus renderer. Scoping note (2026-06-04): comes after A1-B8 and after 5.3 VL, because it needs both specialist training discipline and the text↔visual bridge.

5.6 TTS toy — detailed scoping

What carries over from current TinyGPT:

Piece	Reuse
Transformer decoder, KV cache, sampling, MTP heads (for K-codebook prediction)	direct
Training loop (`tinygpt train`) + PEFT bundle for downstream fine-tunes	direct
`CrossAttention.swift` (currently used for YOCO)	adapt to text-encoder K/V source for conditioning

New code surface (~2 weeks of focused engineering + 3-7 days training):

Piece	Effort
EnCodec encode/decode integration (Swift port of the HF EnCodec weights)	~3-5 days
Text → conditioning surface (text encoder + cross-attention into decoder, OR text-as-prefix-tokens)	~2-3 days
Audio data pipeline (LJSpeech / LibriTTS pre-tokenization to codec ids)	~2-3 days
Eval (WER via Whisper transcription, MOS estimator)	~2 days
First training run on LJSpeech single-speaker → intelligible speech	2-4 days wall-clock

Realistic outcome at this scale: the smallest published audio-token GPT (MusicGen-small) is ~300M; training from scratch on LJSpeech should yield recognizable but not natural-sounding speech. The publishable artifact has the same shape as 5.3: “smallest from-scratch ___ on consumer hardware.”

Why ordered after specialist + VL:

Specialist track validates the north-star thesis (Wave 3 work the project is actually about). Until at least one specialist beats a baseline, modality experiments are noise on top of unproven foundations.
5.3 vision-language toy is ahead because (a) it’s the older Tier-5 item and (b) it stress-tests the same “external pretrained encoder + cross-attention into our decoder” pattern that TTS would reuse. Shipping VL first means TTS inherits a validated pattern instead of a speculative one.

5.7 Specialized explainer-video model — Lamina-like track

Reference product: Lamina Labs’ Simi positions itself as an AI explainer studio: prompt or document in, whiteboard-style educational video out for students, course creators, customer training, and teams. The public lesson is not “train a giant cinematic video model”; it is “make a narrow video system that explains accurately, quickly, and consistently.”

The TinyGPT version should start as a structured explainer compiler:

source document / prompt
  -> lesson script
  -> storyboard scenes
  -> visual DSL (objects, labels, arrows, equations, timeline)
  -> deterministic renderer (SVG/canvas/Remotion/Manim-style)
  -> captions + voiceover + MP4

What we would need:

Piece	Build	Why
Scene/storyboard schema	JSON DSL for concepts, equations, diagrams, timings, camera/stroke actions	Gives the model a constrained target instead of free-form pixels
Renderer	Start with SVG/canvas frames; later Remotion/Manim export	Deterministic, debuggable, cheap to render
Visual-planner specialist	SFT/LoRA model: prompt/doc → storyboard DSL	This is the first “specialized video model” worth training
Asset/diagram library	Shapes, arrows, axes, code blocks, graph layouts, simple physics/math primitives	Explainers need reusable semantic primitives more than photorealism
Data pipeline	Pair open lessons/transcripts/docs with generated or human-edited storyboards	The scarce asset is supervised storyboard data
Eval set	Held-out concepts with rubric: factual correctness, visual grounding, pacing, label consistency, equation validity	Prevents “pretty but wrong” videos
Editing loop	User can regenerate one scene, lock script, lock diagrams, export MP4	Real workflows need partial repair, not one-shot magic

Model ladder:

1. No learned video model: use a strong text model or cloud model to produce the DSL; render deterministically. This validates product and schema fast.
2. Tiny visual-planner specialist: fine-tune tinygpt/HF-loaded base on prompt/doc → storyboard DSL. This is the first trainable model.
3. Visual critic/evaluator: model scores whether scene frames match the script and flags bad labels, missing objects, impossible diagrams.
4. Optional diffusion/image/video model: only for decorative assets or scene backgrounds after the deterministic explainer path works.

Good first eval tasks:

Eval	Metric
Concept-to-storyboard	JSON validity + human/LLM rubric on lesson coverage
Equation/diagram correctness	Symbol/label exactness, graph/axis consistency
Script-to-scene grounding	Every narrated claim maps to an on-screen object/action
Pacing	Scene duration fits narration without overcrowding
Editability	Regenerate one scene without changing locked scenes

Why this is plausible for TinyGPT:

The project already has specialist SFT/LoRA, structured output, constrained generation, eval harnesses, and renderer-friendly web/native surfaces.
A storyboard DSL is text. TinyGPT can train on that before any pixel generation exists.
Deterministic rendering avoids the hardest part of video generation: long-horizon visual consistency.

Why it stays behind the current specialist track:

It is a new modality product, not a training-foundation prerequisite.
Data is the bottleneck. We need hundreds to thousands of good storyboard pairs before model training is meaningful.
The first marketable version is mostly pipeline + UX, not raw model research. Build it only after the first text/tool specialist proves the project can beat a baseline.

Unshipped techniques — after applying the value-add filter (and re-auditing)

Most items in the original roadmap-categories list either ship today (third-audit corrections below) or were dropped under the user’s “don’t list a technique unless it adds genuinely new capability” filter. What’s left:

Genuinely new value-adds, not yet built

After this session’s batch closes, the non-training surface IS exhausted — every capability item under your value-add filter has shipped. Only niche residue remains:

⬜ Sample packing (cross-source) — niche, doesn’t change capability at our scale
⬜ Vocab trimming — niche, only matters for embedded-deployment

After these: training-dependent (specialist Wave 3, Mini-Llama+ANE, Tier 5 modality arcs) or upstream-blocked (sparse MoE hard routing on scatter_add, real QLoRA on quantized-gradient flow).

Shipped this session (third → fourth audit pass corrections):

✅ Linear probes (tinygpt linear-probe)
✅ Deduplication (tinygpt dedupe, line + doc modes)
✅ ROME (tinygpt rome, identity-Hessian first cut)
✅ MEMIT (tinygpt memit, single-layer least-squares, exact per-fact residual at scale=1)
✅ Multi-layer MEMIT (--layers SPEC, residual partitioned across N layers; 8-14% per-layer rel vs 41-72% single-layer)
✅ MEMIT --layer-weighting key-norm (data-driven proxy for Meng 2023’s causal-trace influence)
✅ GGUF reader (GGUFReader.swift + tinygpt gguf-inspect — F32/F16/Q4_0/Q8_0/Q4_K/Q5_K/Q6_K/Q8_K)
✅ GGUF model loader validator (tinygpt gguf-load — metadata parse, tensor-name mapping, shape validation against TinyGPT-HF op tree)
✅ Best-of-N + Snell-style scaling curve (tinygpt bon --scan)
✅ bon --verifier corpus-ppl (corpus-anchored PPL as scoring signal — distinct from self-likelihood)
✅ Sparse autoencoders (tinygpt sae — Bricken et al. 2023; encoder + decoder + L1, .sae sidecar)
✅ SAE feature explorer (tinygpt sae-explore — load .sae, scan corpus, surface top-K activating windows per feature)
✅ Activation patching CLI (tinygpt patch — Mac CLI for zero + donor-swap; reuses shipped forwardWithPatch)
✅ Causal trace CLI (tinygpt causal-trace — Meng et al. 2022 per-layer fact localization)
✅ MinHash near-duplicate dedup (tinygpt dedupe --near-dup — catches paraphrased boilerplate that exact-SHA misses)
✅ GGUF tokenizer + config extractor (tinygpt gguf-extract — writes tokenizer.json + config.json + manifest, the missing piece between gguf-load and runnable model)
✅ to-coreml conversion bridge (tinygpt to-coreml — generates a tailored Python conversion script for the user’s coremltools install; now end-to-end runnable via safetensors hop)
✅ Safetensors writer (TinyGPTModel/SafetensorsWriter.swift — HF-compatible binary format; shared foundation)
✅ tinygpt to-safetensors — converts .tinygpt → model.safetensors with HF Llama tensor names (or --keep-names for native). Verified 196 tensors / 38.4 MB / valid HF format on the shakespeare gallery model.
✅ tinygpt export-mlx — packages .tinygpt students, .lora / .tgla adapters, and existing HF dirs as MLX-friendly safetensors directories with config/tokenizer sidecars plus mlx_load.py. This is the interop path for users who want to take TinyGPT fine-tune/distill artifacts into Python MLX or MLX-Swift.
✅ gguf-extract materializes weights to safetensors — output directory is now a complete HuggingFace model bundle loadable via transformers.AutoModelForCausalLM.from_pretrained(). Verified on a 21-tensor llama-shape GGUF: tokenizer.json + tokenizer_config.json + config.json + model.safetensors all populated.
✅ to-coreml safetensors bridge — Python script no longer stubbed; loads weights via safetensors.torch.load_file() with full HF Llama → TinyGPT name-map. py_compile clean.
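The MinHash near-duplicate dedup listed above can be sketched as follows, for readers who want the mechanics behind "catches paraphrased boilerplate that exact-SHA misses". This is a generic Python illustration, not the `tinygpt dedupe --near-dup` implementation: word-trigram shingles, 64 salted BLAKE2b hashes, and the helper names are all assumptions.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams; short texts collapse to a single shingle."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items, num_perm=64):
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    if not items:
        raise ValueError("cannot sign an empty shingle set")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


base = ("all rights reserved this page may not be copied stored or shared "
        "without written permission from the original site owner today")
near = base.replace("today", "tomorrow")
other = "gradient clipping keeps the update norm bounded during early warmup steps"

sim_near = estimated_jaccard(minhash_signature(shingles(base)), minhash_signature(shingles(near)))
sim_other = estimated_jaccard(minhash_signature(shingles(base)), minhash_signature(shingles(other)))
```

Two documents would then be treated as near-duplicates when the estimate clears a chosen threshold; the B29 smoke test earlier in this plan exercises `--minhash-threshold 0.6`.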

Stale ⬜ markers caught + corrected this session — now ✅:

Item	Where it ships
Embedding RMSNorm	`--embedding-rmsnorm` flag, `RMSNorm` module on token-embed
DeepNorm	`--deep-norm` flag, `cfg.useDeepNorm`/`deepNormAlpha`/`deepNormBeta`
Layer-wise LR decay	`cfg.lrLayerDecay`
Cosine warmup	`--lr-schedule cosine --warmup 500` (the curated default)
BPE-dropout	`BPEDropout.swift`
Real CI	`.github/workflows/ci.yml` + `deploy.yml`
Persistent tokenized cache	`TokenCache.swift` wired into Train+Eval+Distill+Finetune
Linear probes	`tinygpt linear-probe` (this session, `6dbe15c`)
YOCO cross-layer KV	`--yoco` flag, `CrossAttention.swift`, `docs/yoco_results.md`
GPTQ safetensors reader	`GPTQReader.swift` (72 tensors quantised in 31s)

Dropped under value-add filter (duplicate / inferior / niche):

Dropped	Why
ReLoRA	GaLore already gives “full fine-tune at LoRA memory cost”
Prefix tuning / soft prompts	LoRA covers the practical case
IPO	DPO with high β covers tiny-pair regularization
Token elimination	StreamingLLM + KIVI cover positional + per-entry-bits axes
Tree decoding	Speculative decode (vanilla + Medusa + EAGLE-2) covers the niche
Curriculum learning	Modest gains, scale-dependent; needs a difficulty metric we don’t have
Self-instruct / Evol-instruct	Magpie subsumes (uses model’s own distribution, no seed needed)
Hard example mining / Importance sampling	Marginal at our scale
Data quality filtering	PPL-filtering needs a ref model; basic dedup covers most of the value
BigBird / Longformer sparse attention	Only matters past ctx=8192 (we don’t train at that length)
Linear attention (Performer / Linformer / Reformer)	Quality usually worse than flash attention
Hybrid attention/SSM (Jamba, Samba)	Different family; side-project
Pre-norm vs post-norm toggle	Config knob, not a feature
Tiktoken adoption	swift-transformers handles BPE-family tokenizers already
Subword regularization	Marginal vs BPE-dropout
Train own BPE on corpus	Modest gain (~5% PPL); blocked on Rust-FFI for speed
TinyGPT-as-library API	User explicitly deferred until specialists beat a baseline

Queued findings — ANE routing + Mac-vs-browser sampling

Triggered by the question “how do we get to 170× instead of 17×?” The 17.2× figure is the Mac-vs-browser ratio for training the Huge preset, a small, bandwidth-bound model where kernel-launch overhead dominates. There are several legitimate paths to a much larger ratio; each is queued below with its honest cost.

1. Browser sampling tok/s harness — CHEAP, ~30 min

Closes a real missing measurement. We have Mac sampling tok/s (293-696 by model size) but no analogous browser-side number. The playground worker generates via GpuModel.generate already; we just don’t time it.

What: in browser/src/worker.ts, log per-token wall-clock in the generate loop, post a sampling_perf message, display tok/s next to the playground output.

Expected ratio: Mac-vs-browser sampling probably 30-80× at Huge based on shape priors (Mac is much less kernel-launch-overhead sensitive during decode than during training). That alone changes the headline from “17× training” to “30-80× sampling, 17× training.”

Why queued: tiny work, just hasn’t been done. No blockers.

2. ANE-routed inference via Mini-Llama TinyGPT — MEDIUM, 1-2 weeks

Apple Neural Engine routes only when the graph hits its preferred shapes. The published numbers (ANEMLL, perf-quest memory) are 2-3× sampling over the same model on Apple GPU when ANE engages cleanly, not 100×+ end-to-end. The big win is the combined ratio: bigger ANE-friendly model × ANE-routing × already-unfit-for-browser size.

Why TinyGPT doesn’t route today

ANE prefers head_dim ∈ {64, 128}, tensor dims multiples of 64, fp16, RoPE-style attention, bias-free linears, RMSNorm. Our Huge default is the opposite of all of these:

Dimension	TinyGPT Huge	Llama 3.1 8B	ANE impact
`head_dim`	32	128	falls off ANE matrix engine
`d_model`	256	4,096	tiny matmuls under-utilize ANE tiles
`vocab`	256 (byte)	128,256 (BPE)	LM-head matmul too small to matter
Norm	LayerNorm	RMSNorm	RMSNorm has better ANE op coverage
Positional	learned absolute	RoPE	ANE’s fused-attention paths assume RoPE
MLP activation	GELU	SwiGLU	SwiGLU is the ANE-tuned default
Linear bias	yes	no	bias-free fuses cleaner into matmul-add

What to build

A new ModelConfig preset — mini-llama — using only existing config flags (every one of the above is already a knob):

ModelConfig(
    vocabSize: 32768,        // small BPE, multiple of 64
    contextLength: 2048,
    nLayers: 24,
    nHeads: 16,              // head_dim = 128
    nKvHeads: 4,             // GQA
    dModel: 2048,
    dMlp: 8192,
    useRoPE: true,
    useRMSNorm: true,
    useSwiGLU: true,
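    tieEmbeddings: false
)
// Target: ~600M params. Caveat: counted with the standard Llama layout (Q/K/V/O projections,
// three SwiGLU matrices, untied embeddings), these exact values come to roughly 1.6B, so
// dModel / dMlp / nLayers need trimming to hit the target.
// Scale down to (1280, 16) for the ~200M first cut.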

Plus tinygpt to-coreml exporter (~1-2 days): maps our transformer ops to CoreML’s op set, produces a .mlpackage that Instruments can profile to see whether ANE actually engages.
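A quick parameter count is a useful sanity check on the mini-llama preset before any training time is spent. The sketch below assumes the standard Llama layout (bias-free Q/K/V/O projections with GQA, a three-matrix SwiGLU MLP, two RMSNorm weight vectors per layer plus a final norm); whether TinyGPT's `dMlp` and norm layout match this exactly is an assumption, and `llama_param_count` is a hypothetical helper. Under these assumptions the listed values count to about 1.59B, well above the ~600M label.

```python
def llama_param_count(vocab, d_model, n_layers, n_heads, n_kv_heads, d_mlp,
                      tie_embeddings=False):
    """Count weights for a bias-free Llama-style decoder (assumed layout)."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    head_dim = d_model // n_heads
    q_and_o = 2 * d_model * (n_heads * head_dim)     # query + output projections
    k_and_v = 2 * d_model * (n_kv_heads * head_dim)  # GQA shrinks K and V
    mlp = 3 * d_model * d_mlp                        # gate, up, down (SwiGLU)
    norms = 2 * d_model                              # two RMSNorm weights per layer
    per_layer = q_and_o + k_and_v + mlp + norms
    embed = vocab * d_model
    lm_head = 0 if tie_embeddings else vocab * d_model
    final_norm = d_model
    return n_layers * per_layer + embed + lm_head + final_norm


# The preset values quoted in the plan.
total = llama_param_count(vocab=32768, d_model=2048, n_layers=24,
                          n_heads=16, n_kv_heads=4, d_mlp=8192)
head_dim = 2048 // 16  # 128, inside the ANE-preferred set {64, 128}
# total == 1_593_935_872 under these assumptions
```

The MLP term dominates (about 50M of roughly 61M weights per layer), so `dMlp` and `nLayers` are the first knobs to reduce when aiming at a smaller target.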

Realistic speedup expectations

Path	Realistic tok/s
Current Huge on Mac GPU	293-696
Mini-Llama (~600M) on Mac GPU	~150-400
Mini-Llama on Mac ANE if it routes	~400-1200 (~2-3× over its own GPU)
Mini-Llama in browser	~5-20 (probably can’t load; 600M near browser ceiling)
Mac-ANE vs browser ratio	30-200× depending on routing cleanliness

Probability analysis

Test 1 (ANEMLL works on Llama 3.1 on your machine) → confirms the environment but NOT that our model routes. Independent reasons it could still fail:

ANEMLL on Llama works?
├─ No  → done, environment broken
└─ Yes → environment confirmed
         └─ Build tinygpt to-coreml exporter
            └─ Convert + profile Mini-Llama
               ├─ All ops on ANE     → 🎉 ~30-50% chance, you win
               ├─ Partial split      → 🟡 ~40% chance, measure if net speedup
               └─ Nothing on ANE     → 😐 ~10-20% chance, GPU is the ceiling

Cost-benefit (honest)

Item	Cost	Outcome regardless of ANE result
Train Mini-Llama (200-600M)	3-7 days mostly-background	Real Llama-architecture gallery model. Useful independently.
`tinygpt to-coreml` exporter	1-2 days focused	Reusable for any future model. Useful independently.
Profile + iterate	1-3 days unpredictable	Empirical learning either way.

Total: 1-2 weeks calendar; dominated by training wall-clock.

Why queued

Requires lifting the current “no training” goal constraint
The trained Mini-Llama IS a valuable artifact independent of ANE, so the conditional EV is positive — but only if you’re willing to train.
Doesn’t deliver 10× on Mac-alone (realistic 2-3× ANE-over-GPU); delivers 30-200× only via the Mac-ANE-vs-browser combined ratio.
The cheaper browser-sampling-benchmark (item 1 above) is a prerequisite to even know the current sampling ratio — should do that first.

Apple’s actual ANE landscape (for posterity)

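CoreML (public) — convert to .mlpackage; Apple’s runtime decides per-op CPU/GPU/ANE dispatch. The heuristics are opaque. MLModelConfiguration.computeUnits can restrict the candidate devices (for example, .cpuAndNeuralEngine excludes the GPU), but there is no way to force a given op onto the ANE.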
ANEMLL (community, github.com/Anemll/Anemll) — uses private CoreML internals to coerce more ops to ANE. Works on macOS Sequoia. Historically breaks on every macOS update. Hand-tuned for Llama-family.
“Stateful Models API” (rumored late 2026) — would make ANE routing first-class. Not shipped.

There is no Apple-sanctioned “private beta” for ANE inference; that phrasing was loose. The real options are the three above.

4. Research absorbed — paper × verdict

External-paper catalogue (was docs/roadmap/recent_research.md, now archived at docs/archive/recent_research.md). Each row: technique → one-line source → verdict pointing at where it lives in this codebase, or why it doesn’t.

4.1 Implemented (techniques we ship)

Alignment / preference

Technique	Source	Where it lives
DPO	Rafailov et al., NeurIPS 2023	`tinygpt dpo`
KTO	Ethayarajh et al., 2024	`tinygpt dpo --variant kto`
ORPO	Hong et al., 2024	`tinygpt dpo --variant orpo`
SimPO	Meng et al., 2024	`tinygpt dpo --variant simpo`
NEFTune	Jain et al., NeurIPS 2023	`--neftune`

PEFT

Mostly in native-mac/Sources/TinyGPTModel/PeftVariants.swift (GaLore lives in Optimizers.swift), surfaced via tinygpt sft.

Technique	Source	Where it lives
DoRA	Liu et al., 2024	default in `sft`
GaLore	Zhao et al., 2024	`Optimizers.swift`
LoftQ	Li et al., ICLR 2024	`PeftVariants.swift`
VeRA	Kopiczko et al., ICLR 2024	`PeftVariants.swift`
PISSA	Meng et al., 2024	`PeftVariants.swift`
LoRA+	Hayou et al., ICML 2024	`PeftVariants.swift`
rsLoRA	Kalajdzievski, 2023	`PeftVariants.swift`

Quantization

Technique	Source	Where it lives
GPTQ	Frantar et al., ICLR 2023	`tinygpt gptq` + `GPTQReader.swift`
AWQ	Lin et al., MLSys 2024	AWQ safetensors reader
HQQ	Badri & Shaji, 2024	`tinygpt hqq`
KIVI	Liu et al., 2024	KV cache quantization path

Inference / efficiency

Technique	Source	Where it lives
Speculative decoding	Leviathan et al., ICML 2023	`tinygpt train-heads --type medusa|eagle` + decode loop
Medusa	Cai et al., 2024	same path, head type
EAGLE-2	Li et al., 2024	same path, head type
StreamingLLM	Xiao et al., ICLR 2024	attention-sink path

Architecture variants

Technique	Source	Where it lives
MTP	Gloeckle et al., ICML 2024	`Train.swift`, `docs/mtp.md`
Differential Transformer	Microsoft 2024	`DifferentialAttention.swift`, `--diff-attn`
Mixture of Depths	Raposo et al., 2024	soft sigmoid gate (hard top-K upstream-blocked)
LASER	Sharma et al., ICLR 2024	`tinygpt laser`

Optimizers

Technique	Source	Where it lives
Sophia	Liu et al., 2023	`Optimizers.swift`
Lion	Chen et al., NeurIPS 2023	`Optimizers.swift`
Muon	Jordan, 2024	`Optimizers.swift`
GaLore	(see PEFT)	`Optimizers.swift`

Distillation

Technique	Source	Where it lives
Soft-targets distillation	Hinton et al., 2015	`tinygpt distill`

Synthetic data

Technique	Source	Where it lives
Magpie	Xu et al., ICLR 2025	`tinygpt magpie`
TinyStories	Eldan & Li, 2023	dataset source

Test-time compute

Technique	Source	Where it lives
Best-of-N	Snell et al., 2024	`tinygpt bon --scan`

Evolution Strategies

Technique	Source	Where it lives
ES at scale	Qiu et al., Sept 2025	`tinygpt es`, `docs/evolution_strategies.md`

4.2 Cannot — blocked, parked, or skipped

🚧 Blocked by hardware

Technique	Source	Why parked
BitNet b1.58	Ma et al., 2024	Ternary from-scratch needs 100B+ tokens to validate; not differentiating at <1B params on our hardware. Park; revisit if a clear gallery-model use case appears.
FP4 training (NVFP4 / Quartet)	Wang Jan 2025 · Quartet II Jan 2026	Apple M-series has no native FP4 ops
FP8 training	—	Needs H100 / Blackwell

🚧 Blocked upstream

Technique	Source	Why parked
Hard sparse MoE routing	DeepSeek-V3 family	MLX-Swift lacks `scatter_add`; soft (dense) routing ships
Real QLoRA	Dettmers et al., 2023	MLX-Swift quantized arrays don’t autograd through; manual fake-quant shipped (pedagogical, no memory win)

❌ Skipped — different family / not worth the seat

Technique	Source	Why skipped
Mamba / Mamba-2	Gu & Dao, 2023/2024	Linear-time SSM, different family; better as side-project

❌ Dropped — value-add filter (subsumed by what ships)

Technique	Source	Subsumed by
IPO	Azar et al., 2023	DPO with high β regularizes equivalently
CPO	Xu et al., 2024	DPO + BC term marginal over SimPO at our scale
Self-Instruct	Wang et al., 2023	Magpie (model’s own distribution; no seed needed)
Evol-Instruct	Xu et al., 2024 (WizardLM)	Magpie subsumes
MiniPLM	Gu et al., NeurIPS 2024	Distill-for-pretraining — needs a teacher-student pair we don’t have
Distillation with Training Wheels	Feb 2025	`cloud-escalate` already provides the analogous “student asks teacher” deployment shape
DEITA	Liu et al., 2024	Instruction-data quality framework — only matters once SFT corpus > 1M samples

4.3 Planned — queued for a future training run

Item	Source	Where in §3
GRPO / DAPO (RLVR pipeline)	DeepSeek-R1, Jan 2025 · DAPO, March 2025	Tier 5 §5.1 — Reasoning training on a 22M model. GRPO = mental model; DAPO = implementation.
Reasoning-trace distillation	DeepSeek-R1-Distill series, OpenThoughts	Tier 5 §5.1 — SFT-on-traces is the first half of §5.1 before RLVR
Snell test-time-compute scaling experiment	Snell et al., 2024	Tier 5 §5.2 — `bon` shipped; the scaling-curve experiment at 22M matches Snell methodology
Vision-language toy	LLaVA family	Tier 5 §5.3
Diffusion LM micro	(multiple)	Tier 5 §5.4
Real sparse MoE kernels	DeepSeek-V3 style	Tier 5 §5.5 (also upstream-blocked on `scatter_add`)
TTS toy	VALL-E / MusicGen family	Tier 5 §5.6
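
The "GRPO = mental model" note refers to scoring each sampled completion against its own group rather than against a learned value function. A minimal sketch of that group-relative advantage follows, together with a DAPO-style check that skips groups carrying no signal. Function names and the `eps` constant are assumptions for illustration; clipping, KL terms, and token-level loss weighting are deliberately left out.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def has_learning_signal(rewards):
    """False when every completion got the same reward (all advantages are zero).

    Dynamic sampling in the DAPO style drops such groups and resamples.
    """
    return max(rewards) != min(rewards)

if __name__ == "__main__":
    # Verifiable reward: 1.0 if the final answer checks out, else 0.0.
    group = [1.0, 0.0, 0.0, 0.0]
    print(group_relative_advantages(group))
    print(has_learning_signal(group), has_learning_signal([0.0, 0.0, 0.0, 0.0]))
```

With binary verifiable rewards, a group where every sample is right or every sample is wrong contributes nothing to the gradient, which matters at small model sizes where many prompts start out all-wrong.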

Small additions, no current owner — append when a slot opens:

Item	Source	Effort
LISA optimizer	Pan et al., 2024	~1 day; layerwise importance sampling, drop-in alongside Sophia/Muon
MiniLLM KL variants	Gu et al., ICLR 2024	~1-2 days; reverse-KL / skew-KL switches on top of existing `tinygpt distill`
Distilling Step-by-Step	Hsieh et al., ACL 2023	~1-2 days; rationale-distillation recipe on top of `tinygpt distill`
DoReMi data-mixture optimization	Xie et al., NeurIPS 2023	Park until ≥3 distinct domains are mixed at non-trivial scale
Quality classifier (FineWeb-Edu-style)	Penedo et al., 2024 — FineWeb / FineWeb-Edu	§3 B10 — ~2 days; tiny fastText scorer + top-X% filter
WSD schedule (warmup-stable-decay)	MiniCPM, Hu et al., 2024 · SmolLM blog	§3 B11 — ~half-day; decay phase doubles as annealing
Interp-on-checkpoints methodology	Pythia, Biderman et al., 2023 · OLMo, Groeneveld et al., 2024	§3 B13 — 1-2 days infra + ongoing analysis; replay SAE / MEMIT across the checkpoint timeline
Speculative decoding	Leviathan et al., ICML 2023 · Chen et al., 2023	§3 B14 — 2-3 days; Mini-Llama draft for Mega; numerics gate required
Layer-wise LR decay (SFT)	ULMFiT, Howard & Ruder, 2018	§3 B15 — ~half-day flag add on existing optimizer
M5 GPU Neural Accelerator prefill benchmark	Apple ML Research, 2026	§3 B16 — ~half-day; verify the claimed 3.5× M5-vs-M4 prefill speedup is materializing on our path
SAE Lens interop / Neuronpedia format export	decoderesearch/SAELens	§3 B17 — ~2 days for format-export option; compare-and-decide before building
nanochat-style `--depth` single-knob HP derivation	karpathy/nanochat	§3 B18 — ~1 day; one knob auto-derives width / heads / LR / batch / steps; UX win
Group-SAE (layer-group SAE training)	Wang et al., 2024	§3 B19 — 2-3 days; trains SAEs once per layer-group instead of per-layer; cuts SAE training cost
Learnable cross-stream attention (modded-nanogpt speedrun trick)	KellerJordan/modded-nanogpt	§3 B20 — read-and-evaluate; speedrun-specific, not yet a paper
ScaleDown extractive context compression SLM	ScaleDown blog · Challenge leaderboard · scaledown.ai	§3 B25 — 3-5 days; token-level relevance head + sentence aggregation; submit to public leaderboard as a “specialist trained on a Mac” proof-point
Micro-AutoMixer for specialist data mixes	Poolside Laguna deep dive · RegMix/DoReMi-style mixture search	§3 B21 — small proxy-run version of Poolside’s automixing; optimize specialist ratios before full training
Token-preserving agent trajectory recorder	Poolside Laguna deep dive	§3 B22 — preserve token IDs through rollout → training so agent traces cannot drift through retokenization
Agent eval protocol hardening	Poolside Laguna deep dive	§3 B23 — repeated pass@1, fixed step/resource/sampling budgets, and explicit infra-patch notes
Muon large-scale re-benchmark	Poolside Laguna deep dive · Jordan, 2024	§3 B24 — only revisit if large/proxy matmul-dominated runs amortize Newton-Schulz overhead
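
Of the small additions above, the WSD schedule is the easiest to pin down in code. The sketch below assumes linear warmup and a linear decay to `min_lr`; the decay shape, the argument names, and the function name `wsd_lr` are illustrative choices, not the planned B11 interface.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-stable-decay learning rate for a 0-indexed training step."""
    if step < warmup_steps:
        # Linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: linear ramp from peak_lr down to min_lr on the final step.
    progress = min(1.0, (step - decay_start + 1) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

if __name__ == "__main__":
    for s in (0, 9, 50, 79, 80, 99):
        print(s, wsd_lr(s, total_steps=100, peak_lr=1e-3, warmup_steps=10, decay_steps=20))
```

Because the stable phase does not depend on `total_steps`, a run can be extended by resuming from a stable-phase checkpoint and scheduling the decay later, and the decay phase is where annealing data would be mixed in.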

4.4 Reference reads (no verdict — context only)

For mental-model framing, not techniques to implement:

State of GPT (Karpathy, 2023) — pretrain → SFT → RM → PPO; we skip RM/PPO for DPO
Tulu 3 (Lambert et al., 2024) — open RLVR recipe; informs §5.1
SmolLM blog (HF, 2024) — 135M/360M/1.7B small-model recipe
HuggingFace Alignment Handbook (repo) — reference SFT/DPO recipes at 7B
Survey of LLMs (Zhao et al., arXiv 2303.18223) — broad survey, continuously updated
On-Policy Distillation Survey (April 2026) — confirms distillation dominates for shipping small models

2026 small-model peers (for positioning, not adoption): SmolLM3-3B · Qwen3.5-0.8B · Phi-4-mini-instruct · Gemma-3n-E2B-IT · Gemma-4-12B Unified (encoder-free multimodal, 256K ctx, MLX variants exist). Implication: the niche is “browser-trainable + every byte of training code is here,” not “perf-competitive with Phi-4.”

Direct from-scratch peers (full pipeline, not just pretrain):

karpathy/nanochat — tokenizer → pretrain → SFT → RL → CLI/web chat in one repo. $48/2h on 8×H100. Apple Silicon mode exists via `runs/runcpu.sh` (degraded scale). No interpretability story. Single `--depth` knob auto-derives all HPs. Closest head-on competitor; differentiation = Mac-first + interp lab.
KellerJordan/modded-nanogpt — speedrun fork; April 2026 record 1.35 min to GPT-2 quality on 8×H100. Playbook: Muon (we have) · FA3 · FP8 head (HW-blocked) · learnable cross-stream attention · MTP (queued).
Poolside Laguna XS.2 / M.1 deep dive — agentic coding models with open XS.2 weights, strong SWE/Terminal benchmark protocol, quality+diversity data curation, synthetic data throughout pretraining, automixed data ratios, Muon at scale, and async agent RL. Steal the workflow discipline, not the scale: data-mix proxy sweeps, token-preserved agent traces, repeated eval protocol, and Muon only after large-scale re-benchmark.

Tools worth knowing:

Unsloth — Triton-kernel fine-tune framework; not Mac/MLX but study for technique transfer. Feb 2026: 12× faster MoE training + embedding model support + ultra-long-context RL.
Axolotl — config-driven multi-GPU production fine-tuner; multimodal support landed 2026
LLaMA-Factory — web-UI fine-tuner (LlamaBoard); zero-config entry point
TorchTune — Meta’s PyTorch-native fine-tuner; ~20-24% speedup via PyTorch 2.5 compile
Argilla Distilabel — Python pipeline for synthetic SFT/DPO (wraps Magpie/DEITA)

Apple Silicon ecosystem (direct peers on our platform):

mlx-lm — Apple’s official MLX inference + LoRA / DoRA / QLoRA / full fine-tune + OpenAI-compatible server. Direct overlap with our SFT/DPO LoRA path; differentiation = pretrain + interp + GGUF/CoreML export.
Ollama + MLX backend (v0.19, March 2026) — prefill 1154→1810 tok/s, decode 58→112 tok/s on Apple Silicon. Direct competition for our GGUF runner.
exo-explore/exo — multi-Mac P2P distributed inference. JACCL collectives over RDMA-on-Thunderbolt-5 on macOS 26.2 → 1.8×/3.2× speedup on 2/4 devices. Out of single-machine scope, but the infra is new.

Interpretability ecosystem (overlap with our interp lab):

SAELens — established SAE training/analysis library; integrates with TransformerLens + HF + nnsight + Neuronpedia. Our SAE may be reinventing it; B17 task = compare + decide on interop format.
TransformerLens · nnsight (NDIF) — PyTorch interp infra; complementary to SAELens. We have native Swift/MLX equivalents.

Proprietary / out of scope: OpenAI o1 / o3 (closed-weights; reframed the field around test-time compute, no adoptable artifact). DeepSeek-V3 (671B-MoE, scale-blocked; informs MTP + MoE design). Qwen3 (model family, not a technique).

4.5 Coverage cutoff

The catalogue was hand-curated up to the assistant’s knowledge cutoff (January 2026), plus best-effort web-search additions for Feb-May 2026 (coverage is spottier there). Today is 2026-06-04.

2026-06-04 web sweep folded in: five surfaces were checked (Apple Silicon training, nanoGPT successors, Mac inference runtimes, interpretability libraries, fine-tune frameworks). Results: nanochat + modded-nanogpt added as direct from-scratch peers; mlx-lm + Ollama-MLX + EXO added as Apple Silicon ecosystem peers; SAELens added as interp peer; B16-B20 queued in §3 from surfaced gaps; Unsloth Feb-2026 release notes folded into tools row. Coverage of Feb-Jun 2026 papers is now meaningfully better but still not exhaustive.

Future papers append row-by-row into §4.1 / §4.2 / §4.3.

Appendix — index of source docs absorbed by this file

This doc replaces the multi-file roadmap split. The source docs are kept for context but should be treated as historical; edit this file, not them.

Old doc	What it covered	Status
`docs/roadmap/index.md`	TOC for the multi-file split	Superseded — point at this file
`docs/roadmap/tier1.md` / `tier2.md` / `tier3.md`	ROI-tiered technique inventory	Absorbed; markers refreshed
`docs/roadmap/tier4_skip.md`	Intentionally-not-built items	Absorbed into §2
`docs/roadmap/tier5_frontier_2026.md`	2026 research frontier	Absorbed into §3 Tier 5
`docs/roadmap/categories.md`	Orthogonal technique taxonomy (had stale markers)	Absorbed; refreshed against code
`docs/roadmap/blockers.md`	What we can’t build + Phase 9/10 status appendix	Absorbed into §2 + §1
`docs/roadmap/phased_plan.md`	7-week sequential plan	Mostly shipped; remainder in §3
`docs/roadmap/recommended_order.md`	Top-10 next	Superseded by Tier A/B ordering in §3
`docs/roadmap/honest_summary.md`	“CAN / CAN’T / SHOULDN’T” framing	Absorbed
`docs/progress.md`	Mac+Web shipped dashboard	Absorbed into §1
`docs/backlog.md`	ROI-ordered “what’s left” (Tier A/B/C/D)	Absorbed into §3
`docs/feature_audit_2026_05_31.md`	CLI smoke audit	Cross-referenced; was the verification baseline
`docs/roadmap/recent_research.md`	Paper catalogue (2024-2026)	Absorbed into §4; archived at `docs/archive/recent_research.md`

Still canonical (deep dives, not absorbed): `docs/roadmap/datasets.md`, `docs/roadmap/north_star_refined.md`, and the per-technique docs (`distillation.md`, `interpretability.md`, `moe.md`, `mtp.md`, `lora_guide.md`, `precision.md`, `memory_tradeoffs.md`, `perf_quest.md`, `decision_log.md`). Those don’t duplicate planning; they explain how shipped pieces work.
