TinyGPT — master plan
Last verified against codebase: 2026-06-06 (eval-pipeline + serve fix + elf PRDs landed; product framing clarified to “Mac platform for building/upgrading specialists”)
Sources merged: docs/roadmap/* · docs/progress.md · docs/backlog.md · docs/feature_audit_2026_05_31.md · docs/roadmap/recent_research.md (paper catalogue → §4)
Product framing (clarified 2026-06-06): TinyGPT is a Mac platform for individuals to build and upgrade specialist models for their specific tasks — bring data, pick a local teacher, ship a fast/cheap specialist. Distillation + LoRA + QLoRA + constrained decoding are the toolkit. Local teacher = no API spend. Comprehensive multimodal roadmap (text/code/vision/voice/image-gen) under disciplined “one canonical best per slot” principle. Canonical strategy doc: docs/sessions/2026-06-06-mac-specialist-platform.md — covers Tier 1-4 backlog, multi-model architectures (phone-a-friend / cascade / LoRA hot-swap / etc.), structured-output formats beyond JSON (incl. Protobuf / SQL / GraphQL via grammar), and flagship example apps (browser agent, per-language code specialist, voice command, etc.).
Three sections: shipped, skipped, TODO. Every claim verified against the code. The first audit caught Lion/Sophia/Muon/PEFT-bundle/gradient-clipping; the second caught YOCO + GPTQ-reader + token-elim (dropped under value-add filter); the third caught embedding RMSNorm, cosine warmup, layer-wise LR decay, DeepNorm, BPE-dropout, Real CI (all shipped, all previously marked ⬜).
Status legend
| Mark | Meaning |
|---|---|
| ✅ | shipped — verified against code today |
| 🟡 | partial / in-session-only / verified-with-caveat |
| ⬜ | TODO — in active backlog |
| ⏸ | deferred — would build but waiting on external trigger |
| ❌ | skipped — intentionally not built (better alternative exists) |
| 🚧 | blocked — would build but cannot right now (hardware / upstream / budget) |
1. SHIPPED
Mac runtime + CLI
Audit baseline: every CLI smoke-tested on M5 Pro 2026-05-31. See
feature_audit_2026_05_31.md for the full smoke trace. 30+ subcommands all green.
- ✅ Cold-start bundle (mmap + lazy embed + async load + compile cache) — 24 ms in-process TTFT on 1B
- ✅ KV cache (GQA + in-place + persistent across sessions)
- ✅ Pausable training (cooperative SIGINT + atomic save + `--resume`)
- ✅ Cross-process GPU lock (`~/.cache/tinygpt/gpu.lock`)
- ✅ CF R2 cloud save/load pipeline (push / pull / list / setup; zero egress)
- ✅ `tinygpt serve`: OpenAI + Ollama surfaces on the same socket
- ✅ `tinygpt agent`: multi-turn + tool dispatch + persistent KV + `--cloud-escalate`
- ✅ JSON-mode constrained generation (FSM token masking)
- ✅ Cloud API client (Anthropic + OpenAI via curl) + SSE streaming + cancellation
- ✅ Continue.dev / Ollama-compat provider (`/api/tags`, `/api/version`, `/api/show`, `/api/chat`, `/api/generate`)
- ✅ `tinygpt escalate` (direct cloud-API call)
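The FSM token masking behind JSON-mode generation can be illustrated with a toy example. This is a minimal sketch, not TinyGPT's implementation: the seven-token vocabulary, the `FSM` table, and the `biased_model` stand-in are all hypothetical, and the grammar only accepts the single shape `{"k":<digit>}`.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["{", "}", '"k"', ":", "0", "1", "hello"]

# Hypothetical FSM: state -> {allowed token: next state}.
FSM = {
    0: {"{": 1},
    1: {'"k"': 2},
    2: {":": 3},
    3: {"0": 4, "1": 4},
    4: {"}": 5},
}
ACCEPT = 5


def mask_logits(logits, state):
    """Set every token the FSM does not allow in `state` to -inf."""
    allowed = FSM[state]
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def constrained_greedy(logit_fn, max_steps=16):
    """Greedy decode, but only over tokens the FSM permits at each step."""
    state, out = 0, []
    for _ in range(max_steps):
        if state == ACCEPT:
            break
        masked = mask_logits(logit_fn(out), state)
        idx = max(range(len(VOCAB)), key=lambda i: masked[i])
        tok = VOCAB[idx]
        out.append(tok)
        state = FSM[state][tok]
    return "".join(out)


def biased_model(prefix):
    """Stand-in model that always prefers the off-grammar token "hello"."""
    return [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 9.0]
```

Even though the stand-in model always ranks `hello` highest, the mask forces the output to stay inside the grammar. The same idea extends to other structured formats by swapping the transition table.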
Mac training + post-training
- ✅ Pretrain (`tinygpt train`): 42 ms/step Huge on M5 Pro, 17.2× browser
- ✅ Finetune (`tinygpt finetune`)
- ✅ SFT (`tinygpt sft`): DoRA default + every PEFT variant
- ✅ DPO / SimPO / KTO / ORPO (all in `tinygpt dpo` via flags)
- ✅ Knowledge distillation (`tinygpt distill`): KL teacher → student
- ✅ Speculative-decoding head training (`tinygpt train-heads --type medusa|eagle`)
- ✅ Evolution Strategies trainer (`tinygpt es`)
- ✅ Tuned-lens trainer (`tinygpt tuned-lens`)
- ✅ Mini-router trainer (`tinygpt train-extractor`)
- ✅ Magpie synthetic-instruction generator (`tinygpt magpie`)
- ✅ Sequence packing for SFT
- ✅ NEFTune (noisy embeddings): `--neftune-alpha` in `sft` + `dpo` (matches the paper’s “Noisy Embeddings Improve Instruction Finetuning” scope; not in the pretrain path)
- ✅ Gradient clipping (`--grad-clip F`, default 1.0, on train + sft + dpo)
- ✅ z-loss auxiliary (`--z-loss-weight F`)
- ✅ Embedding tying (`tieEmbeddings` config flag)
- ✅ Document-level shuffling (implicit via batch sampler)
- ✅ Gradient checkpointing (CustomFunction VJP workaround for missing `mlx_checkpoint`)
- ✅ QAT (in-training, `--qat`)
- ✅ Persistent tokenized cache (TokenCache.swift; wired into Train + Eval + Distill + Finetune)
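The "KL teacher → student" objective in the distillation entry above can be sketched for a single token position. This is an illustration under stated assumptions, not the `tinygpt distill` code: the function names are hypothetical, the direction is assumed to be KL(teacher ‖ student), and the temperature-squared scaling follows the common Hinton-style convention rather than anything confirmed by this plan.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one position's logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor (an assumed convention) keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions diverge; a full trainer would average it over every position in the batch.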
Training stability (verified 2026-06-02 — these were all marked ⬜ in older docs)
- ✅ Embedding RMSNorm (`--embedding-rmsnorm` / `cfg.useEmbeddingRMSNorm`)
- ✅ DeepNorm residual scaling (`--deep-norm` / `cfg.useDeepNorm` + `deepNormAlpha/Beta`)
- ✅ Layer-wise LR decay (`cfg.lrLayerDecay`)
- ✅ Cosine warmup (`--lr-schedule cosine --warmup 500`, the curated default)
- ✅ BPE-dropout (`BPEDropout.swift`: per-merge skip during encoding for regularization)
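For reference, a cosine schedule with linear warmup, the shape behind `--lr-schedule cosine --warmup 500`, can be written in a few lines. This is a generic sketch: the function name, the `min_lr` floor, and the exact warmup ramp are assumptions and may differ from what the trainer does.

```python
import math


def cosine_with_warmup(step, total_steps, peak_lr, warmup_steps=500, min_lr=0.0):
    """Linear ramp to peak_lr over warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp reaches peak_lr on the last warmup step (assumed convention).
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With `total_steps=10_000` the rate peaks at step 499, sits at half of `peak_lr` midway through the decay phase, and reaches `min_lr` at the final step.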
PEFT bundle
All in native-mac/Sources/TinyGPTModel/PeftVariants.swift, all gated through tinygpt sft:
- ✅ LoRA · Multi-LoRA composition · LoRA+ (different LR for A/B)
- ✅ DoRA (in-session; on-disk format pending — see Tier C)
- ✅ VeRA · LoftQ · AdaLoRA · RsLoRA · PISSA · LoRA-FA · LayerDrop
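Most of the variants above modify the same basic LoRA forward pass, so it is worth spelling out. The sketch below uses plain Python lists and hypothetical function names; it shows the standard formulation `y = W x + (alpha / r) * B (A x)` and is not taken from `PeftVariants.swift`.

```python
def matvec(M, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base projection plus a scaled low-rank update.

    W: d_out x d_in (frozen), A: rank x d_in, B: d_out x rank (trainable).
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]
```

With `B` initialised to zeros the adapted layer reproduces the base layer exactly, which is why training starts from the pretrained behaviour. RsLoRA changes only the scale (to `alpha / sqrt(rank)`), and LoRA+ changes only the learning rates applied to `A` and `B`.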
Inference + sampling
- ✅ KV cache + flash-attention forward (`MLXFast.scaledDotProductAttention`) + backward
- ✅ Quantized inference (int4 / int8 via `MLXNN.quantize`)
- ✅ Speculative decoding (vanilla + Medusa + EAGLE-2 heads)
- ✅ Prefix / prompt caching
- ✅ Streaming-LLM attention sink
- ✅ KV cache quantization (KIVI)
- ✅ Multi-Token Prediction (MTP) inference path
- ✅ Multi-Query Attention (free via `nKvHeads: 1`)
- ✅ Sliding window attention (`--sliding-window N`)
- ✅ ALiBi position bias (`--alibi`)
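The vanilla speculative decoding entry above is easiest to understand in its greedy form: a cheap draft model proposes a few tokens, the target model checks them, and the longest agreeing prefix is kept. The sketch below uses toy next-token functions over small integers; every name is hypothetical and it makes no claim about how TinyGPT batches the verification step.

```python
def greedy_decode(next_token, prompt, n_new):
    """Plain greedy decoding with one model call per token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (sequence, verification rounds)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        # 1) The draft proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # 2) The target verifies. A real implementation scores every
        #    proposed position in one batched forward pass.
        rounds += 1
        for i, tok in enumerate(proposal):
            expected = target_next(seq + proposal[:i])
            if tok != expected:
                # First mismatch: keep the agreed prefix plus the target's token.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            # Everything accepted: the same pass also yields one bonus token.
            seq.extend(proposal)
            if len(seq) < goal:
                seq.append(target_next(seq))
    return seq, rounds


def toy_target(seq):
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq):
    # Agrees with the target except after token 5, where it guesses wrong.
    return 3 if seq[-1] == 5 else toy_target(seq)
```

The key property is that the output is identical to target-only greedy decoding; the draft only changes how many target passes are needed, so the speedup depends on how often the draft agrees.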
Quantization + compression
- ✅ HQQ (`tinygpt hqq`: int4 q-then-dq, 0.087 rel error)
- ✅ GPTQ (`tinygpt gptq`: from-scratch int4 quant of own model, 0.102 rel error)
- ✅ AWQ safetensors reader (loads HF AWQ-quantized models)
- ✅ GPTQ safetensors reader (`GPTQReader.swift`: loads HF GPTQ-format models; tested 72 tensors quantised in 31s)
- ✅ GGUF reader (`GGUFReader.swift` + `tinygpt gguf-inspect`): parses v2/v3 header + metadata + tensor inventory; dequantises F32 / F16 / Q4_0 / Q8_0 tensors to fp32. K-quants (Q4_K / Q6_K / etc.) slot into the same switch when needed.
- ✅ SmoothQuant (in-training)
- ✅ Pruning: unstructured (`tinygpt prune-unstructured`) + structured (`tinygpt prune-structured`)
- ✅ LASER selective rank reduction (`tinygpt laser`)
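The "q-then-dq, rel error" figures quoted for HQQ and GPTQ measure how far a weight tensor moves after a quantize/dequantize round trip. A minimal sketch of that measurement follows, assuming plain round-to-nearest group-wise int4 and a Frobenius-norm ratio; HQQ and GPTQ both optimise the rounding further, and the exact error definition used by the CLI is not stated here, so treat the names and formula as assumptions.

```python
import math
import random


def quantize_dequantize_int4(values, group_size=64):
    """Asymmetric round-to-nearest int4 per group, then map back to floats."""
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0  # 16 levels; avoid divide-by-zero on flat groups
        for x in group:
            q = max(0, min(15, round((x - lo) / scale)))
            out.append(q * scale + lo)
    return out


def relative_error(ref, approx):
    """||ref - approx|| / ||ref|| (assumed definition of "rel error")."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, approx)))
    den = math.sqrt(sum(a * a for a in ref))
    return num / den


def demo_weights(n=1024, seed=0):
    """Synthetic Gaussian weights, for illustration only."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]
```

Calling `relative_error(w, quantize_dequantize_int4(w))` on `demo_weights()` gives a baseline to compare smarter quantizers against; smaller groups lower the error at the cost of storing more scales.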
Optimizers
- ✅ AdamW · Lion · Sophia · Muon · Adafactor (all in `Optimizers.swift`)
- ✅ GaLore (gradient low-rank projection)
Architecture variants
- ✅ Standard transformer (RoPE + RMSNorm + SwiGLU + GQA)
- ✅ Sliding window · ALiBi · Multi-Token Prediction · MQA · GQA
- ✅ MoE (dense routing — sparse hard routing blocked, see §2)
- ✅ Mixture of Depths (soft sigmoid gate — hard routing blocked, see §2)
- ✅ Differential attention (`--diff-attn`)
- ✅ YOCO cross-layer KV sharing (`--yoco`): CrossAttention.swift module, second-half blocks reuse first-half K/V. See `docs/yoco_results.md`. (Was marked “designed only” in older audit; actually shipped.)
Tokenization
- ✅ Byte-level (vocab=256) — from-scratch path
- ✅ HF BPE / SentencePiece via swift-transformers
Interpretability tools (browser playground)
- ✅ Logit lens (button + worker route)
- ✅ Tuned lens (Mac CLI trainer + `.lenses` sidecar + browser upload)
- ✅ Attention heatmap (“Watch the model think” panel)
- ✅ Per-layer ablation (“Ablate & sample” button)
- ✅ Activation patching: both variants (zero + donor-swap, shipped 2026-06-02 in `17021bc`)
- ✅ Linear probes (`tinygpt linear-probe`): train Linear(d_model → C) on per-layer hidden states + label data; `.lp` sidecar format. Detects whether a layer represents an arbitrary external property (Alain & Bengio 2016).
- ✅ ROME, surgical fact editing (`tinygpt rome`). Rank-1 update to one MLP’s W_out, identity-Hessian first cut. Verified on shakespeare.tinygpt: `--target X --layer 11 --scale 10` flipped sampled next-token to X. Covariance-based ROME is the follow-up.
- ✅ MEMIT, batched fact editing (`tinygpt memit`). Rank-K least-squares ΔW = R(KᵀK + λI)⁻¹Kᵀ via hand-rolled Gauss-Jordan inverse on the small N×N system. Verified math: per-fact residual ~1e-4 at scale=1 (machine noise; least-squares is exact). Single-layer visibility-in-sampling tradeoff documented; multi-layer MEMIT (distribute update across 5-7 mid-network layers) is the next-cut.
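The MEMIT update ΔW = R(KᵀK + λI)⁻¹Kᵀ can be checked numerically on a tiny system. The sketch below assumes K is d × N with one key per column and R is d_out × N with one target residual per column, which is the layout that makes KᵀK the small N×N matrix mentioned above; the helper names are hypothetical and this is not the `tinygpt memit` source.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def gauss_jordan_inverse(M):
    """Invert a small square matrix with partial pivoting."""
    n = len(M)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def memit_delta(K, R, lam=1e-6):
    """Delta W = R (K^T K + lam I)^-1 K^T, shape d_out x d."""
    Kt = transpose(K)
    gram = matmul(Kt, K)  # N x N, cheap to invert when few facts are edited
    for i in range(len(gram)):
        gram[i][i] += lam
    return matmul(matmul(R, gauss_jordan_inverse(gram)), Kt)
```

Applying the result back to the keys, `matmul(memit_delta(K, R), K)`, should reproduce `R` up to an error controlled by λ, which is the per-fact residual the entry above refers to.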
Browser / Web track
- ✅ Landing page + `/playground` route
- ✅ WebGPU training pipeline (Huge / Mega presets via capability detection)
- ✅ Browser BPE scorer + gallery model loader
- ✅ Browser-side benchmark runner (“Run benchmark on your loaded model”)
- ✅ Doc consolidation: every doc visible at `/docs/[slug]`
- ✅ WASM SIMD (`-msimd128`): measured 1.6×
- ✅ Multi-threaded WASM (pthreads + SAB): measured ~2×
- ✅ Memory64 module (`tinygpt64.{js,wasm}`), partial: Node ok, browser blocked at d_model ≥ 256 (ABI bug, task #66)
- ✅ Speedup curve vs WASM SIMD: Small 2.6× / Medium 6.8× / Large 9.3× / XL 12.1×
- ✅ WebNN active probe (`webnn_probe.ts`, builds a tiny MLGraph and verifies it computes, drives the `+WebNN (gpu/npu)` pill state; 2026-06-02 in `86433c3`). Full transformer-as-MLGraph follow-up unblocked.
WebGPU kernels (in webgpu/train*.wgsl)
- ✅ Naive scalar matmul
- ✅ Blocked 4×4 matmul (`matmul_blocked_vec4`)
- ✅ Layer-norm subgroup variant (gated on `gpuFeatures.subgroups`)
- ✅ Cross-entropy subgroup variant
- ✅ Bias-grad subgroup variant
- ✅ FA2 forward in WGSL (flash attention in browser)
- ✅ f16-storage matmul (gated by `verifyF16Storage`)
- ✅ f16-compute matmul forward + backward (`train_f16_compute.wgsl`, gated by `verifyShaderF16Compute`; 2026-06-02 in `1ddf6ba` / `2cdedac`)
- ✅ Coop-matrix matmul (`train_coopmat.wgsl`, gated by `verifyCoopMatrix`; 2026-06-02 in `86433c3`)
- ✅ OPFS persistence
- ✅ Patch kernels (`patch_zero` + `patch_replace`; 2026-06-02)
- ✅ Subgroup matmul kernel (`matmul_sg` / `matmul_abt_sg`; gate currently fails on M5 Pro, falls back to vec4)
Numerics-gate framework: every fast path (f16-storage, f16-compute,
coop-matrix, subgroup) carries its own gate that compares against an f32
reference with a magnitude-aware tolerance. A failed gate triggers a silent
fallback to the slower path, so a broken fast path cannot regress results. See docs/precision.md.
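The gate-then-fallback pattern can be sketched independently of WebGPU. Everything below is hypothetical: the function names, the mean-relative-error formula, and the 1e-2 tolerance are stand-ins, and the real tolerances live in docs/precision.md. The point is only the control flow: probe the fast path once against a reference, and keep the reference unless the fast path demonstrably agrees.

```python
def mean_rel_error(reference, candidate, eps=1e-6):
    """Mean absolute error scaled by the reference's mean magnitude."""
    scale = sum(abs(r) for r in reference) / len(reference)
    diff = sum(abs(r - c) for r, c in zip(reference, candidate)) / len(reference)
    return diff / (scale + eps)


def pick_kernel(reference_fn, fast_fn, probe_input, tolerance=1e-2):
    """Return fast_fn only if it matches reference_fn on the probe input."""
    ref = reference_fn(probe_input)
    try:
        got = fast_fn(probe_input)
    except Exception:
        return reference_fn  # fast path unavailable: fall back silently
    if len(got) != len(ref):
        return reference_fn
    # `not (err <= tol)` also rejects NaN, which `err > tol` would let through.
    if not (mean_rel_error(ref, got) <= tolerance):
        return reference_fn
    return fast_fn
```

Scaling the error by the reference magnitude is what makes the check "magnitude-aware": the same tolerance works whether the probe outputs are near 1e-3 or near 1e3.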
Datasets + data pipelines
- ✅ `tinygpt list-datasets`: 22 curated entries (tool-calling / debugger / code / math / reasoning)
- ✅ `tinygpt download-dataset` (canonical `hf://datasets/owner/name` form)
- ✅ HF Datasets / Hub integration (`hf-load`, `hf-inspect`)
- ✅ GitHub data fetcher (`fetch-github`: issue→PR pairs)
- ✅ Magpie synthetic instruction generator
- ✅ Extractor-data pipeline (`extractor-data`: BFCL/τ-bench → `{query, tool}` pairs)
- ✅ Indic eval pipeline (`eval-indic`: MILU MCQ + IndicGenBench-XQuAD, smoke-validated)
Tooling + infra
- ✅ XCTest harness + swiftformat + lint CI (Mac)
- ✅ `tinygpt inspect` / `validate` (round-trip byte-compare verified on 110 MB model)
- ✅ `tinygpt bench` (TTFT/ITL/decode tok/s/peak RSS) + `tinygpt bench-train`
- ✅ `tinygpt eval` / `score-bench` (loss + benchmark scorers)
- ✅ `tinygpt compare` (side-by-side base vs LoRA-adapted)
- ✅ `tinygpt debug-*` (dtypes / load / logits / loss / names helpers)
- ✅ `tinygpt screen tree` (AX tree readout, focused-window JSON)
- ✅ lm-evaluation-harness MLX adapter
Headline metrics (Mac, M5 Pro / 48 GB)
| Metric | Value | Target | Headroom |
|---|---|---|---|
| TTFT (warm) | 5.8 ms p99 | < 50 ms | ✅ 10× under |
| ITL p99 | 4.9 ms | < 30 ms | ✅ 6× under |
| Decode tok/s | 293 (mega-pilot 960M) → 696 (huge 221M) | > 50 tok/s | ✅ 6× over |
| Cold start TTFT | 24 ms (1B) | < 50 ms | ✅ 2× under |
| Training Huge | 42 ms/step | (baseline) | — |
| Speedup vs browser | 17.2× | (baseline) | — |
| Largest model | 960 M params (1.1 GB) | — | — |
Recent product surfaces (Wave 2.6, shipped 2026-05-31)
- ✅ Cloud-escalate wired into AgentLoop
- ✅ Continue.dev / Ollama-compat provider
- ✅ Tool-call extractor (mini-router) scaffold — ToolRouterModel + CLI pipeline
- ✅ ScreenCaptureKit + macOS Accessibility scaffold — AX tree works end-to-end from CLI
Learning artifacts (docs)
- ✅ `docs/decision_log.md`: every architectural decision logged
- ✅ Research bundles in `docs/research/` (inference + quality benchmarks, kernel audit, mac decode baseline, wave-4 landscape, Indic evals)
- ✅ Session retrospectives (e.g., `session_2026_05_31.md`)
- ✅ Per-technique deep-dives (`distillation.md`, `evolution_strategies.md`, `moe.md`, `mtp.md`, `lora_guide.md`, `interpretability.md`, etc.)
2. SKIPPED
❌ Superseded by better alternatives
- fp16 mixed-precision training — bf16 strictly better, shipped
- ZeRO / FSDP / pipeline parallelism — multi-device only
- State space models (Mamba / RWKV) — different architecture; ~2-3 week port; better as side-project
- PagedAttention / continuous batching — multi-user inference only
- Tree attention / lookahead decoding — marginal over speculative
- Adapter modules (Houlsby / Pfeiffer) — LoRA’s older cousin, superseded
- BitFit — train biases only; quality is poor
- IA³ — element-wise scaling; superseded by LoRA family
- Hyena / long-conv — different architecture
- fp8 training — needs H100 / Blackwell hardware
❌ Dropped after audit (real cost, no payoff at our scale)
- Flash Attention Metal kernel: MLXFast SDPA already fused (`docs/research/wave_2_5_kernel_audit.md` §1)
- Int4 packed matmul Metal kernel: MLX `quantized_matmul` already hand-tuned (§3)
- General SWE-bench leaderboard chase: Sonnet 4.6 dominates regardless of wrapper; play local-first / on-device game instead
- Tinker cloud fine-tune as differentiator: use if needed; not a project differentiator (budget-ruled-out for solo)
- Hooking into Apple App Intents: no public API for third-party LLMs to replace Apple’s FM
⏸ Deferred (waiting on external trigger)
| Item | Trigger | Why deferred |
|---|---|---|
| cider W8A8 adoption | a 3B+ specialist ships | At ≤ 1B, Mac already 10× under realtime; cider’s prefill win is immaterial |
| ANE + GPU heterogeneous routing | Apple ships Stateful Models API (rumored late 2026) | Research-grade; current path uses private ANEMLL APIs |
| WebGPU subgroup matmul redesign | browser focus returns | Current gate fails (1415% mean_rel); fallback works |
| Vision encoder (ViT → tinygpt decoder) | vision-specialist demand becomes concrete | 2-week research-grade work; not critical-path |
| Audio I/O (Speech.framework + AVSpeechSynthesizer) | voice-mode demo becomes priority | Not in scope for Wave 3 |
| Async tool-call dispatch | parallel-tool specialist ships | LM dominates 5-100× over subprocess at current scales |
| ScreenCaptureKit raw image (CGS-init fix) | vision specialist needs raw bytes | AX tree sufficient for tool-calling specialists |
| Public launch (HF + writeup + HN) | ≥ 1 specialist beats a fair baseline | Nothing to launch yet |
| Phase 7 browser perf (subgroups / coop-matrix / WebNN) | post-HN v2 push | Current 12.1× lift is the launch story |
🚧 Blocked by hardware
- Distributed training (ZeRO, FSDP, pipeline-parallel) — single device only
- Native FP4 training — Apple M-series lacks FP4 tensor ops
- Native FP8 training — same
- Hardware-accelerated MoE routing — Apple silicon has no sparse-routing ops
- ANE training acceleration — ANE is inference-only
🚧 Blocked by upstream library state
| Item | Blocker | Workaround |
|---|---|---|
| QLoRA real-quantized base + LoRA | MLX-Swift quantized arrays don’t autograd through | Manual fake-quant in fwd (pedagogical, no memory win) |
| Sparse MoE hard routing | MLX-Swift no scatter_add | Soft (dense) routing shipped |
| Mixture-of-Depths hard top-K | same | Soft sigmoid gate shipped |
| Fast BPE encoding | swift-transformers single-threaded; 2 GB corpus = ~30 min | Rust-backed encoder via FFI (future) |
| Native int4 / int8 WebGPU matmul | spec doesn’t yet have quantized matmul extensions | Wait for subgroup / coop-matrix extensions |
| GGUF K-quant tensors (Q4_K / Q6_K / etc.) | dequantisers not yet written | `GGUFReader.swift` already handles F32 / F16 / Q4_0 / Q8_0; AWQ + GPTQ readers already ship |
🚧 Blocked by budget
- Synthetic SFT via frontier API ($1-10K) — use open-weights teacher via Magpie instead
- Multi-TB dataset downloads — stream subsets (the HF importer does this)
- Strong local judge for Constitutional AI / RLAIF — no 70B+ runs usable on a Mac
- Public RLHF / PPO pipeline — 5× the code of DPO + 10× iteration; DPO covers 80-90% of the lift
3. TODO
ROI-ordered. Sourced from backlog.md (the living list, last sort 2026-05-31).
Tier A — DO NEXT (north-star aligned; specialists)
Until A1 lands, every optimization is theoretical. Until Tier E (eval pipelines) lands, every specialist is unmeasurable — A1’s “ship” criterion implicitly requires E1 + E3 wired before any score can be published. Sequencing: Tier D (data) + Tier E (evals) → A1 specialist → Tier B follow-ups.
- ⬜ A1. Train first specialist end-to-end (tool-caller) — 3-5 days execution + GPU hours. Validates north-star thesis.
- ✅ A2. Pull foundational datasets (DONE 2026-06-17). All on disk: xlam-function-calling-60k, hermes-fc, function-calling-chatml, SWE-bench_Verified, alpaca-cleaned, orca_dpo_pairs, MetaMathQA, ultrafeedback, the-stack-smol (8 langs), python_code_instructions_18k_alpaca, all under `~/.cache/tinygpt/datasets/`. Inventory: `docs/dataset-inventory.md`.
- ⬜ A3. Fetch GitHub issue→PR corpus for debugger: ~1 day with `GITHUB_TOKEN`
- ✅ A4. Pull BFCL + τ-bench via extractor-data: ~30 min (DONE; sources at `~/.cache/tinygpt/datasets/_external/{gorilla-bfcl,tau-bench}/`; wiring is Tier E, not Tier A)
- ✅ A5. Pull Indic eval datasets (MILU + IndicGenBench-XQuAD): ~30 min (DONE; MILU is lm-eval-harness, source at `_external/MILU/`; wiring → E3)
- ⬜ A6. Dataset inventory doc: ~30 min after A2-A5
- ⬜ A7. Real-data MILU baseline on flagship-huge-v5: ~2 hr; depends on A5 + E3
Tier D — DATA (gaps blocking specialists)
Pulled today: hermes-fc.jsonl, ultrafeedback.jsonl, MetaMathQA, alpaca-cleaned, orca_dpo_pairs, FineWeb-Edu (50K-row sample via parquet decoder). Blocked / missing for the planned specialists:
- ✅ D1. xlam-function-calling-60k (DONE 2026-06-17): `~/.cache/tinygpt/datasets/Salesforce/xlam-function-calling-60k/xlam_function_calling_60k.json` (91.7 MB, ~60K rows). Required both `HF_TOKEN` and per-account license click-through at the dataset page.
- ✅ D2. function-calling-chatml + SWE-bench_Verified (DONE): `~/.cache/tinygpt/datasets/Locutusque/function-calling-chatml/` (102 MB parquet) + `princeton-nlp/SWE-bench_Verified/` (2 MB).
- ✅ D3. MS-MARCO + Natural Questions subset (DONE 2026-06-17): `microsoft/ms_marco/v1.1/` (3 shards, 207 MB: test+train+val) + `google-research-datasets/natural_questions/default/` (2 train shards of 287, 375 MB; subset bounded for B25 training data; full corpus is multi-GB).
- ✅ D4. the-stack-smol + python_code_instructions_18k_alpaca (DONE 2026-06-17). Python alpaca: `iamtarun/python_code_instructions_18k_alpaca/` (10.8 MB parquet, 18 612 rows decoded). the-stack-smol: 8 languages pulled (c 117 MB, c++ 147 MB, go 107 MB, java 67 MB, javascript 130 MB, python 83 MB, rust 132 MB, typescript 69 MB ≈ 850 MB total). Required license click-through at the dataset page.
- ✅ D5. GSM8K + MATH + HumanEval + MBPP eval splits (DONE 2026-06-17): `openai/gsm8k/main/` (test+train parquet) + `HuggingFaceH4/MATH-500/test.jsonl` (500 rows, the canonical math eval set) + `openai/openai_humaneval/` + `google-research-datasets/mbpp/full/` (prompt+test+train+val).
Tier E — EVAL PIPELINES (wire harnesses → automate scores)
Source code for BFCL / τ-bench / lm-eval-harness is already on disk under
~/.cache/tinygpt/datasets/_external/. Pulling source ≠ usable evaluator.
Each item below is the wiring work — a tinygpt eval-<name> subcommand that
takes a model path, runs the harness via subprocess, parses the score JSON,
returns a clean number. Until these land, “did the specialist learn anything?”
has no automated answer.
Architectural constraint (decided 2026-06-05): every E* item MUST emit structured JSONL conforming to a shared eval schema (E0). That makes two critical comparisons possible:
- Cross-model: TinyGPT vs SmolLM2 vs Qwen3 vs Phi-mini on the same task — without this, “we trained a model” doesn’t answer “is it any good?”
- Cross-checkpoint (training dynamics): every save-history checkpoint scored against the same task → see WHEN a capability emerges. Pairs with B13 interp-on-checkpoints — interp explains WHY features appeared, eval confirms IF they’re useful.
Both fall out for free if E0 + E8 are designed in, not retrofitted.
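To make the two comparisons concrete, here is a sketch of how rows in a shared JSONL file support both views. The field names (`model`, `step`, `task`, `score`) and the sample rows are invented for illustration; the actual schema is defined in `Sources/TinyGPT/EvalCompare.swift` and may differ.

```python
import json
from collections import defaultdict

# Invented placeholder rows, not real results.
SAMPLE_JSONL = """\
{"model": "model-a", "step": 1000, "task": "task-x", "score": 0.10}
{"model": "model-a", "step": 2000, "task": "task-x", "score": 0.25}
{"model": "model-b", "step": null, "task": "task-x", "score": 0.40}
"""


def load_rows(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def pivot(rows, by):
    """Group rows by one field, e.g. 'model', 'task', or 'step'."""
    table = defaultdict(list)
    for row in rows:
        table[row[by]].append(row)
    return dict(table)


def emergence_curve(rows, model, task):
    """Cross-checkpoint view: (step, score) pairs for one model and task."""
    points = [(r["step"], r["score"]) for r in rows
              if r["model"] == model and r["task"] == task and r["step"] is not None]
    return sorted(points)
```

Because every harness writes the same row shape, the cross-model view is just `pivot(rows, "model")` and the training-dynamics view is `emergence_curve(...)`; neither needs harness-specific parsing.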
- ✅ E0. Shared eval JSONL schema + `tinygpt eval-compare` (SHIPPED 2026-06-05): `Sources/TinyGPT/EvalCompare.swift`. Codable `Row` with snake_case JSON. Three view modes: `--by step` / `--by model` / `--by task`. Sample artifact at `docs/artifacts/emergence-smoke-2026-06-05.jsonl`.
- ✅ E1. `tinygpt eval-bfcl <model>` (SHIPPED 2026-06-05): `Sources/TinyGPT/EvalBFCL.swift`. Boots `tinygpt serve`, invokes `bfcl_eval._llm_response_generation` + `bfcl_eval.eval_checker.eval_runner` via subprocess with OpenAI-compatible base URL. Default 10 BFCL categories. PRD: `docs/prds/E1-bfcl-eval.md`. Unblocks A1.
- ✅ E2. `tinygpt eval-tau-bench <model>` (SHIPPED 2026-06-05): `Sources/TinyGPT/EvalTauBench.swift`. Retail + airline envs. Configurable user simulator model. PRD: `docs/prds/E2-tau-bench-eval.md`.
- ✅ E3. `tinygpt run-lm-eval <model>` (SHIPPED 2026-06-05): `Sources/TinyGPT/RunLmEval.swift`. Two modes: `--hf-model <id>` (baseline scoring via HF transformers) and `--tinygpt-model <ckpt>` (boots `tinygpt serve` + routes lm-eval via `local-completions` for our actual forward pass). `tinygpt serve` learned `scoreLogprobs` for echo+logprobs requests. Smoke-tested cross-checkpoint + cross-model emergence sweep.
- ⬜ E4. `tinygpt eval-gsm8k <model>`: standalone scorer. Parse model’s final numeric answer, compare to gold. Tiny: covered by E3 if lm-eval-harness lands, but a standalone fallback gets you a number in ~half-day if E3 slips. May be unnecessary, since E3 via local-completions should handle gsm8k; will validate on first post-N02 sweep.
- ✅ E5. `tinygpt eval-humaneval <model>` + sandbox (SHIPPED 2026-06-05): `Sources/TinyGPT/EvalHumanEval.swift` + Rust crate at `scripts/humaneval-sandbox/` (macOS sandbox-exec policy at `macos-sandbox.sb`). HumanEval + MBPP suites. PRD: `docs/prds/E5-humaneval-sandbox.md`.
- ⬜ E6. `tinygpt eval-scaledown <model>`: clone ScaleBench, wire to TinyGPT-loaded model, run. Prereq for B25 submission. ~half-day after E1’s subprocess pattern is the template. See `docs/recipes/b25-scaledown.md` for the training-side plan.
- ✅ E7. `tinygpt judge <out.jsonl> --judge-model <model>` (SHIPPED 2026-06-05): `Sources/TinyGPT/JudgeShim.swift`. Two modes: `pairwise` (chosen-vs-rejected) and `rate` (1-10 score). PRD: `docs/prds/E7-judge-shim.md`.
- ✅ E8. Train-time eval hook + dashboard plot (SHIPPED 2026-06-05): `--eval-every N --eval-tasks csv --eval-limit N` flags in `tinygpt train`. Spawns background `run-lm-eval` per checkpoint, appends to `<out-stem>-evals.jsonl`. Non-blocking; skips if previous eval still in flight. PRD: `docs/prds/E8-train-time-eval-hook.md`. Post-training equivalent: `scripts/score-run.sh`.
- ❌ E9. Prompt-tiering A/B on the planner unhappy suite: RAN 2026-06-13 against google/gemma-3-12b on the n=130 drill. Hypothesis refuted. Compact (name-only) action index made every dim worse: ambig 11/40 → 7/40 (-10pp), oos 51/60 → 47/60 (-6.7pp), destructive 24/30 → 23/30 (-3.3pp). The failure-pattern diff is mostly silent (one ambig pattern shrank 12→11; one new ⚠ pattern in B), meaning compact mode doesn’t introduce new failure modes, it just makes existing intent-mismatch confusions worse. Interpretation: at our scale (12B + 12-action surface), Gemma needs the schemas to confidently route non-action intents; the schemas are evidence for what kinds of requests pace can do, which sharpens both “this is an action” and “this is NOT an action” judgments. Steal #2 from the Shortcut essay is not portable to Pace as stated. v11 stays the default; v11-compact retained at `grammars/pace-system-prompt-v11-compact.txt` for re-test against larger catalogs (e.g. App Intents-class surfaces) where the schema budget shifts the tradeoff. Updated `docs/learn/agent-context-hierarchy.md` Steal #2 with this verdict.
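The standalone scorer described in E4 (parse the final numeric answer, compare to gold) is small enough to sketch in full. The function names and regular expression are hypothetical; the one assumption taken from the dataset itself is that GSM8K gold answers end with `#### <number>`.

```python
import re

# Optional sign, digits with optional thousands separators, optional decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text):
    """Return the last number mentioned in `text`, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gold_answer(gsm8k_answer):
    """GSM8K gold solutions put the final answer after '####'."""
    return final_number(gsm8k_answer.split("####")[-1])


def exact_match(prediction, gold, tol=1e-6):
    """True when the model's last number equals the gold answer."""
    p, g = final_number(prediction), gold_answer(gold)
    return p is not None and g is not None and abs(p - g) <= tol


def accuracy(pairs):
    """Fraction of (prediction, gold) pairs that match."""
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)
```

Taking the last number is a deliberately crude heuristic: it rewards models that end with their answer and can misfire on trailing commentary, which is one reason the harness-based path in E3 is preferable when it works.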
Browser viewers shipped 2026-06-05
| Page | Role |
|---|---|
| `/eval-leaderboard.astro` | drag-drop E0 JSONL → 3-view comparison (by step / model / task) |
| `/sae-timeline.astro` | drag-drop B13 SAE timeline JSONL → MSE-over-step + L0-over-step charts |
Rust performance tools shipped 2026-06-05
| Crate | Role |
|---|---|
| `scripts/parquet-decoder/` | replaces `python3 scripts/parquet_to_txt.py`; static binary, no pyarrow |
| `scripts/hf-downloader/` | parallel HF shard fetches with progress + retry + resume |
| `scripts/humaneval-sandbox/` | E5 supporting sandbox runner (Rust + macOS sandbox-exec) |
Eval — runbook artifacts shipped 2026-06-05
| Script | Role |
|---|---|
| `scripts/score-checkpoint.sh` | one `.tinygpt` → E0 JSONL row(s) via lm-eval |
| `scripts/score-run.sh` | every history checkpoint of a run + SmolLM2 baseline → JSONL + 3-view summary |
| `scripts/sae-run.sh` | SAE-per-checkpoint sweep → JSONL timeline (B13 v2 input) |
| `scripts/score-baselines.sh` | 5 HF baselines (SmolLM2-135/360M, Qwen3-0.6B, TinyLlama, Phi-3-mini) on the same task set |
Total Tier E: ~6-8 focused days. Do E0 first (schema is everyone’s dependency), then E3 (highest harness leverage), then E1 (A1 ship-blocker), then E8 (multi-checkpoint), then the rest as nightly arcs.
Tier B — NEXT QUARTER (multi-specialist + product)
- ⬜ B1. Second specialist (shell or SQL) — 3-5 days; depends on A1
- ⬜ B2. Mini-router on real BFCL data — ~half day after A4
- ⬜ B2b. Bake-off — classifier-head router vs pure-GPT-with-FSM — settles whether architectural deviation is justified
- ⬜ B3. FSM constraint-injection from router prediction — ~3 days; depends on B2
- ⬜ B4. Tool-call eval harness (subprocess refactor for BFCL/τ-bench) — ~half day
- ⬜ B5. Cloud-escalation training signal (`{"defer_to_cloud": true}`): ~1 week
- ⬜ B6. Mac app demo: ~1 week; depends on A1
- ⬜ B7. Specialist routing model: 1-2 weeks; depends on B1
- ⬜ B8. Multilingual specialist (Sarvam-Edge / Airavata base): 1-2 weeks; depends on A7
- ⬜ B9. Energy J/token measurement (needs sudo for `powermetrics`): ~1 day
Pretrain + runtime quality (added 2026-06-04 — “good product” lens, not launch optics):
- ⬜ B10. Quality classifier on pretrain data (FineWeb-Edu-style) — tiny fastText classifier on educational-quality labels, score corpus, keep top X%. Highest direct quality lift per dev-day. ~2 days. See §4.3.
- ✅ B11. WSD schedule (warmup-stable-decay) (SHIPPED): `--schedule wsd --decay-steps N` in `Sources/TinyGPT/Train.swift`. Linear warmup → stable plateau → linear decay. Replaces cosine; the decay phase IS the annealing knob.
- ✅ B12. Loss-spike recovery + replay (SHIPPED): spike detector on by default; `--no-spike-detect` opts out. Grad-norm tracker triggers auto-rollback + LR drop. In `Sources/TinyGPT/Train.swift`.
- 🟡 B13. Interp-on-checkpoints (partial: `--save-every N` shipped in `tinygpt train`; multi-checkpoint replay tooling still pending). Replay SAE / MEMIT / `tinygpt patch` across the multi-checkpoint timeline. Checkpoint emission is in code; the analysis-side batch driver is the open part. See §4.3.
- ✅ B14. Speculative decoding (Mini-Llama draft for Mega) (SHIPPED): `Sources/TinyGPT/SpeculativeDecode.swift` implements Leviathan et al. 2023 (simplified). Greedy speculative; speedup is K-ish on benign branches.
- ✅ B15. Layer-wise LR decay for SFT (SHIPPED): `--lr-layer-decay F` flag in `tinygpt train`. Each block’s gradient is multiplied by `factor^(L - 1 - i)` so deeper layers get the full LR. `Sources/TinyGPTModel/Trainer.swift` exposes `lrLayerDecay` as graph-pure scalar multiply per leaf; smart default 0.85 in certain training modes.
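The WSD shape from B11 (linear warmup → stable plateau → linear decay) is simple to write down. The sketch below is a generic version with hypothetical parameter names; how `--decay-steps N` interacts with the total step count in `Train.swift` is assumed here to be "the last N steps".

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-stable-decay: ramp up, hold at peak_lr, then ramp down linearly."""
    decay_start = total_steps - decay_steps  # assumption: decay covers the final steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress
```

Unlike cosine, the plateau does not depend on knowing the final step count in advance: a run can be extended at `peak_lr` and annealed later by choosing when the decay phase begins.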
Competitor-aware additions (added 2026-06-04 — surfaced by web sweep, not Jan-2026 cutoff knowledge):
- ⬜ B16. M5 Neural Accelerator prefill benchmark + bump: verify the claimed 3.5×–4× M5-vs-M4 prefill speedup is materializing on TinyGPT’s MLX path. Current pin: `mlx-swift 0.31.3` on macOS 26.5 / M5 Pro (well past the 26.2 floor). Bump to latest (0.31.4) and benchmark. ~half-day. Free win if it’s already on; bump is reversible. See §4.3.
- ✅ B17. SAE Lens interop / Neuronpedia format export (SHIPPED, option c): `tinygpt sae-to-saelens <in.sae> --out <dir>` converts to SAELens (decoderesearch/SAELens) on-disk layout; Neuronpedia consumes the same. `Sources/TinyGPT/SaeToSaelens.swift`.
- ✅ B18. nanochat-style `--depth` single-knob HP derivation (SHIPPED): `--depth N` in `Sources/TinyGPT/Train.swift` derives the GPT-2-shaped width / heads / LR / batch / steps from one knob.
- ✅ B19. Group-SAE (layer-group SAE training) (SHIPPED): `tinygpt sae --layer-group A,B,C` trains ONE SAE on the union of residuals across the listed layers (mutually exclusive with `--layer`). Provenance round-trips through SAELens export via `tinygpt_is_group_sae=true` metadata key.
- ⬜ B20. Investigate learnable cross-stream attention (EVALUATED 2026-06-17; verdict: skip; revisit on scale or paper). Full read-and-evaluate write-up at `docs/research/cross-stream-attention-evaluation.md`. Gain is ~3–6% wall-clock at speedrun scale on FineWeb; not visible at our scales, interacts non-trivially with B14 spec-decode + GaLore + DoRA TGLA + ANE M8, and not yet a paper. Revisit if (a) TinyGPT ships a from-scratch ≥50M FineWeb-class run, (b) the speedrun PR gets a formal ablation write-up, or (c) the interaction-surface items land formally so the boilerplate is paid.
- ⬜ B21. Micro-AutoMixer for specialist data mixes: Poolside-style data mixture optimization, scaled down: train 6-12 proxy runs across code/math/tool/web ratios, score on fixed capability evals, fit a simple surrogate, then propose the next mix. Do this before expensive specialist training so data ratios stop being hand-wavy. ~2-3 days plus small proxy runs. See §4.3.
- ✅ B22. Token-preserving agent trajectory recorder (SHIPPED + verified 2026-06-17): `tinygpt agent --trajectory-dir <dir>` writes one `.atraj` JSON file per rollout with per-step role / decoded content / raw token IDs (`input_ids` for fed text, `output_ids` for sampled assistant text) / structured tool-call args / structured tool results / rewards. Format + reader API: `Sources/TinyGPTModel/AgentTrajectory.swift`. Threading: `Sources/TinyGPT/AgentLoop.swift` (recorder hooks at every turn boundary; `finishTrajectory(summary:)` flushes on session end). CLI: `Sources/TinyGPT/Agent.swift` (`--trajectory-dir`, `--trajectory-task`). 3 unit tests (roundtrip byte-equality of input_ids/output_ids, recorder lifecycle, empty-trajectory + auto-mkdir), all pass. Docs: `docs/agent_runtime.md` §“Token-preserving trajectories (B22)”. Unblocks B29.
- 🟡 B23. Agent eval protocol hardening (statistical-reporting + budget-metadata slice shipped 2026-06-18): `tinygpt eval-gate --passes K` now gates on K-pass means and preserves per-trial scores, stdev, stderr, and 95% CI in `gate-result.json`; `--budget evals/sample-budget.json` attaches fixed max steps, sandbox resources, sampling params, seed, and infra patches under the report’s `"protocol"` block. Swift eval rows emitted via `EvalHarnessSupport.appendRow` now attach the same protocol metadata when `--budget` / `TINYGPT_EVAL_BUDGET` is present, with `eval-gate` forwarding `TINYGPT_EVAL_PASSES`. Remaining: Pace unhappy Python rows, future SWE-mini/Terminal-mini rows, eval-compare error-bar rendering, plus actual sandbox/resource enforcement. PRD: `docs/prds/B23-agent-eval-protocol.md`.
- ⬜ B24. Muon re-benchmark at 1B+ or skip: Poolside reports Muon giving a large-step efficiency win at scale with distributed overhead below 1%; TinyGPT’s current Muon smoke loses badly at small scale. Do not promote it until a ≥1B-ish run or a proxy matmul-dominated benchmark shows the overhead is amortized. ~half-day once a large run exists.
- 🟡 B26. Server-side deferred tools in `tinygpt serve` (scaffolding shipped 2026-06-13; BFCL parity gate pending): `tinygpt serve --tool-mode {full,deferred}` ships. `full` is default and byte-for-byte identical to today. `deferred` swaps in `ServeToolsSpec.compactSystemPrompt()` (one-line-per-tool index + `get_tool_info(name)` contract) and `compactGrammarSpec()` (verb enum extended with `get_tool_info`). Non-streaming `/v1/chat/completions` intercepts `verb=get_tool_info`, appends a synthetic tool result with the schema, and re-prompts (cap=3 hops). Streaming + Ollama emit the meta-tool verbatim, which is documented in the PRD and not a bug. PRD: `docs/prds/B26-deferred-tools.md`. Unit tests: `DeferredToolsTests.swift`. `tinygpt eval-bfcl` now passes `--tools` / `--tool-mode` through to its managed server, and a one-sample demo-model full-vs-deferred smoke completed 2026-06-19. Default flips on after: BFCL avg of `--tool-mode deferred` within ±2pp of `--tool-mode full` on the real specialist run, with ≤2 `get_tool_info` round-trips per sample.
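The K-pass statistics that B23 records (per-trial scores, mean, stdev, stderr, 95% CI) can be computed as below. This is a sketch under assumptions: the dictionary keys are hypothetical rather than the real `gate-result.json` layout, and the interval uses a normal approximation (z = 1.96), which is optimistic for small K where a t-interval would be wider.

```python
import math
import statistics


def summarize_passes(scores, z=1.96):
    """Mean, sample stdev, standard error, and a normal-approx 95% CI."""
    k = len(scores)
    mean = statistics.fmean(scores)
    stdev = statistics.stdev(scores) if k > 1 else 0.0
    stderr = stdev / math.sqrt(k)
    return {
        "passes": k,
        "scores": list(scores),  # keep per-trial values, not just aggregates
        "mean": mean,
        "stdev": stdev,
        "stderr": stderr,
        "ci95": [mean - z * stderr, mean + z * stderr],
    }


def gate(scores, threshold):
    """Pass the gate when the K-pass mean clears the threshold."""
    return summarize_passes(scores)["mean"] >= threshold
```

Gating on the mean of K passes rather than a single run makes a lucky or unlucky sample less likely to flip the decision, and keeping the per-trial scores lets a later viewer draw error bars.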
Castform-inspired training-pipeline trio (added 2026-06-13 from docs/learn/castform-rl-finetune.md):
- 🟡 B28. Composite reward framework with named dimensions (scaffolding shipped 2026-06-13) — CompositeReward + RewardDimension + CompositeRewardBuilder in native-mac/Sources/TinyGPTModel/CompositeReward.swift (6 unit tests, all passing). Castform-pattern (docs/learn/castform-rl-finetune.md §1). Training-loop integrations (DPO --reward-fn, ES, GRPO 5.1) are the remaining work; viewer (C10) gains per-dim curves. PRD: docs/prds/B28-composite-reward-framework.md.
- 🟡 B29. Trace-to-training-data pipeline (V1 shipped 2026-06-17 — --mode sft + tool-echo drop + exact dedup + MinHash near-dedup) — tinygpt traces-to-data <atraj-dir> --task <t> --out <jsonl> consumes B22 .atraj rollouts (Sources/TinyGPT/TracesToData.swift). Smoke: evals/traces-to-data-smoke.sh + 5-trajectory fixture, asserts post-filter row counts + per-stage filter stats + --no-tool-echo-drop / --minhash-threshold 0.6 / --dry-run / --judge-model (reserved + rejected). Recipe: docs/recipes/from-traces.md. V2 follow-ups: wire tinygpt judge (E7) as a subprocess for the LLM-pivot judge step; add --mode dpo (reward-source or judge-margin); external observability ingest (Braintrust / Langfuse). PRD: docs/prds/B29-trace-to-training-data.md.
- 🟡 B31. Unified model gallery + project-level model pins (scaffolding shipped 2026-06-13; first specialist package registered 2026-06-19) — extends browser/src/gallery-schema.ts with a kind discriminator (browser-bin / mac-tinygpt / mac-adapter / mac-gguf / mac-safetensors-hf) so one published manifest covers browser + Mac models. New tinygpt.project.json per-project pin file (package.json-style). Swift mirrors + 11 unit tests pass in this PR (GalleryManifest.swift, ProjectManifest.swift). specialists/qwen3-4b-file-ops-distilled now provides the first TinyGPT-built package: model card, prompt, eval report, artifact lock, and MLX validation helper for the fused file-ops distilled 4B. tinygpt pull + tinygpt validate CLI extensions + browser UI filter remain. PRD: docs/prds/B31-gallery-and-project-pins.md. The trace-loop dividend: project pins flip the Castform asymmetry — pinning + serving locally means the project owner naturally accumulates .atraj traces (B22) that B29 turns into training data. The substrate-refinement cycle closes here.
- ✅ B30. Prompt reasoning-depth classifier (shipped + verified 2026-06-17) — tinygpt reasoning-classify --train|--score|--filter labels prompts as {single-hop, multi-hop, comparison, other}. Bag-of-trigram softmax-4 (the FineWeb-Edu shape extended to multiclass), TGFR on-disk format. Files: Sources/TinyGPT/ReasoningClassify.swift, subcommand wired in TinyGPT.swift. Smoke: evals/reasoning-classifier-smoke.sh + evals/reasoning-classifier-fixtures/{train,heldout}.jsonl. Smoke result: macro-F1 1.000 on the 32-row held-out (well above PRD’s 0.5 bar); score + filter modes verified. Recipe: docs/recipes/balanced-training-mix.md. BagOfNgramClassifier shared utility deferred — V1 duplicates the tokenize/hash/ngram block from QualityClassifier. PRD: docs/prds/B30-prompt-reasoning-classifier.md.
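To make "composite reward with named dimensions" (B28) concrete, here is a small Python sketch of the general idea: several named scorers, each with a weight, combined into one scalar while the per-dimension values stay available for plotting. It does not mirror the Swift API in CompositeReward.swift. `NamedDimension`, `composite_reward`, and the weighted-mean combination rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class NamedDimension:
    name: str                      # label used for per-dimension curves
    weight: float                  # relative importance, expected >= 0
    score: Callable[[str], float]  # maps a rollout's text to [0, 1]


def composite_reward(dims: List[NamedDimension], rollout: str) -> Tuple[float, Dict[str, float]]:
    """Return (weighted mean, per-dimension scores) for one rollout."""
    total_weight = sum(d.weight for d in dims)
    if total_weight <= 0:
        raise ValueError("at least one dimension needs a positive weight")
    # Clamp each score so one misbehaving scorer cannot dominate the total.
    per_dim = {d.name: min(1.0, max(0.0, d.score(rollout))) for d in dims}
    total = sum(d.weight * per_dim[d.name] for d in dims) / total_weight
    return total, per_dim
```

Keeping the per-dimension dictionary next to the scalar is what would let a viewer draw one curve per dimension instead of a single opaque reward line.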
Market-landscape positioning (added 2026-06-13 — see docs/sessions/2026-06-13-market-landscape-mac-first.md):
The competitive scan found the whole field monetizes the cost a Mac-first tool zeroes out (cloud GPU rent / trace ingestion) and is consolidating into infra + frontier-lab acquirers. Three whitespaces: Mac-first training as a product (B6 + B31), eval+interp+local fused (already shipped — the moat), and academic agent benchmarks as a local CI gate (B32). These two items reframe shipped infra as product surfaces.
- 🟡 B32. tinygpt eval as a CI / pre-commit gate (shipped 2026-06-13; K-pass stats + budget metadata added 2026-06-18; live multi-suite GPU run pending a self-hosted runner) — tinygpt eval-gate runs declared suites vs a baseline and exits non-zero on regression. Pure gate logic in TinyGPTModel/EvalGate.swift (direction heuristic, pp thresholds, per-suite override, missing-baseline handling, K-pass mean + stdev/stderr/95% CI + optional protocol budget) with unit tests; CLI orchestration in Sources/TinyGPT/EvalGate.swift (--candidate no-GPU path, --update-baseline, --passes, --budget, gate-result.json). Spec lives in eval-gate.json or the tinygpt.project.json eval block (B31 schema add). GitHub Action .github/actions/tinygpt-eval-gate/, recipe docs/recipes/eval-gate.md, smoke evals/eval-gate-smoke.sh (asserts exit 0 match, exit 1 regression, repeated-run stats, and budget metadata). Flips to ✅ once a real specialist’s suites run end-to-end through the gate on a self-hosted Mac runner. PRD: docs/prds/B32-eval-ci-gate.md.
- ⬜ B35. Local-agent vertical PoC — code reviewer on a Mac — future-looking kill-or-validate experiment surfaced 2026-06-17 by the Vercel Eve launch. Eve validates the agent-platform thesis but is cloud-bound; tinygpt already owns ~70% of a local-agent stack (B22 + B26 + B28 + B29 + B30 + B32 + QLoRA + serve) and can target the wedge Eve doesn’t address: zero-cloud, specialist-distilled agents that run entirely on the user’s Mac. PRD: docs/prds/B35-local-agent-vertical-poc.md. Kill criterion: 4-week timebox, ≥5pp lift over zero-shot open baseline on the chosen code-review eval, else publish the negative result and keep tinygpt narrow as a model factory.
- 🟡 B33. tinygpt quickstart — data → trained specialist in one command — CLI wizard: inspect data → auto-pick base from gallery → infer recipe → train → eval vs base → drop into chat. The CLI sibling of B6’s GUI Factory tab; closes the gap between “MLX-LM can technically do this” and “a non-ML-engineer actually does it.” PRD: docs/prds/B33-laptop-finetune-onboarding.md. Status: decision core (RecipeResolver, pure + unit-tested) + CLI (Quickstart.swift + dispatch) + --dry-run plan & project.json emission + evals/quickstart-smoke.sh + docs/quickstart.md shipped. Live train→sample path wired (orchestrates sft / sample); user runs it on a Mac. Follow-ups: from-scratch raw-text path, auto-pull of bare gallery ids, quantitative eval-vs-base delta.
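The K-pass statistics and regression exit code that B23 and B32 describe can be illustrated with a few lines of Python. This is a sketch under stated assumptions rather than the logic in EvalGate.swift: the function names, the normal-approximation interval (z = 1.96), the 1 pp default threshold, and the convention that scores are fractions in [0, 1] are all illustrative choices.

```python
import math
import statistics
from typing import Dict, List


def k_pass_summary(scores: List[float]) -> Dict[str, float]:
    """Mean, sample stdev, stderr and a normal-approximation 95% CI."""
    k = len(scores)
    if k == 0:
        raise ValueError("need at least one pass")
    mean = statistics.fmean(scores)
    stdev = statistics.stdev(scores) if k > 1 else 0.0
    stderr = stdev / math.sqrt(k)
    half = 1.96 * stderr  # assumption: z-interval; a t-interval is wider for small K
    return {"mean": mean, "stdev": stdev, "stderr": stderr,
            "ci95_low": mean - half, "ci95_high": mean + half}


def gate_exit_code(candidate: List[float], baseline_mean: float,
                   threshold_pp: float = 1.0, higher_is_better: bool = True) -> int:
    """Return 0 when the K-pass mean is within threshold_pp of baseline, else 1."""
    mean = k_pass_summary(candidate)["mean"]
    delta_pp = (mean - baseline_mean) * 100.0  # fractions -> percentage points
    if not higher_is_better:
        delta_pp = -delta_pp  # for metrics such as loss or latency
    return 0 if delta_pp >= -threshold_pp else 1
```

Gating on the mean of K passes instead of a single run is what keeps one noisy sample from failing (or passing) a commit on its own.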
External-leaderboard arc (added 2026-06-05 — first public competitive submission target):
- ⬜ B27. Mac SLM agentic leaderboard v0 (scaffolding shipped 2026-06-13) — one publication-shape artifact at docs/research/mac_slm_leaderboard_v0.md cross-cutting BFCL + τ-bench + Pace unhappy-paths + decode tok/s + peak RSS. scripts/eval_slm_full.sh <model-id> <tag> runs all four suites against one LM Studio model; scripts/build_slm_leaderboard.py --manifest … rebuilds the table. Composite = accuracy × speed × cost, citing score_formula.py (DRY — no re-derivation in the doc). Status flips to ✅ when ≥2 models land on the board so the table actually compares something. First-model target: gemma-3-12b-it (Run 5 in docs/research/mac_decode_baseline_m5pro.md).
- ⬜ B25. ScaleDown Challenge specialist — extractive context compression — train a task-specific SLM that takes (query, long_context) and returns the subset of sentences relevant to the query. Token-level relevance classifier head on the residual stream → sentence-level aggregation → threshold-keep. Training data: MS-MARCO + Natural Questions + similar (query, doc, answer) triplets with teacher-labeled per-sentence relevance scores; teacher can be a local Qwen/SmolLM. Eval via ScaleBench (their open-source harness, downstream F1/EM after compression). Submit to the ScaleDown Challenge leaderboard. ~3-5 days end-to-end: dataset pulls (~1 hr, reuse tinygpt download-dataset), teacher-labeling pipeline (~half-day), classification-head module in Sources/TinyGPTModel/ (~half-day), new tinygpt compress subcommand with token-level BCE loss (~1 day), ScaleBench integration + submission (~half-day). Pairs naturally with A1 (different domain, same A-track shape) and gives TinyGPT a public proof-point — “competitive task SLM trained from scratch on a Mac” — with an external scoreboard. See §4.3.
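The B25 inference path (token-level relevance → sentence-level aggregation → threshold-keep) can be sketched independently of the model that produces the scores. In the Python below, `keep_relevant_sentences` is a hypothetical name, and both mean pooling and the 0.5 threshold are assumptions. The plan above does not specify the aggregation rule.

```python
from typing import List, Sequence


def keep_relevant_sentences(
    sentences: Sequence[str],
    token_scores: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> List[str]:
    """Aggregate per-token relevance to sentence level, then threshold-keep.

    token_scores[i] holds one relevance probability per token of sentences[i].
    Mean pooling is an assumption; max pooling is a common alternative.
    """
    if len(sentences) != len(token_scores):
        raise ValueError("one score list per sentence is required")
    kept = []
    for sentence, scores in zip(sentences, token_scores):
        if not scores:
            continue  # empty sentence: nothing to score
        sentence_score = sum(scores) / len(scores)
        if sentence_score >= threshold:
            kept.append(sentence)  # original order is preserved
    return kept
```

Because the output is a subset of the input sentences in their original order, the compressed context stays extractive and can be fed straight to a downstream F1/EM evaluation.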
Tier C — POLISH (mostly shipped this session)
- ✅ C1. CLI cosmetic fixes — 27 subcommands now exit(0) on --help; bench-train --help shows correct name. Shipped 2026-06-02 in 49dead5.
- ✅ C2. Roll up pre-switch CLI shims into main switch — 17 shims absorbed; TinyGPT.swift -170 LoC. Shipped in 49dead5.
- ✅ C3. DoRA on-disk adapter format (SHIPPED 2026-06-09) — TGLA v2 (magic TGLA, version 2) adds optional per-entry [out] magnitude vector after loraB; v1 readers ignore it; v2 readers autodetect. See Sources/TinyGPTModel/LoraIO.swift header. Memory: [[project_dora_fix_shipped_2026_06_09]].
- ✅ C4. Tool-call extractor: BPE tokenizer support (SHIPPED 2026-06-17) — tinygpt train-extractor --tokenizer <hf-dir> routes encode() through HFTokenizer instead of UTF-8 bytes; the tokenizer path is persisted in the checkpoint header (tokenizerSource). ToolRouterLoader.load surfaces it on cfg.tokenizerSource; Agent.swift autoloads the same tokenizer when the router declares one; AgentLoop.RouterHook gains an optional tokenizer field and predictWithRouter uses it when set. Byte-level remains the default + the fallback on tokenizer-load failure. Train-time warning fires when --tokenizer is set but --vocab-size is left at 256, and again if any encoded token ID lands outside [0, vocab-size).
- ⬜ C5. Decode jitter under thermal load — ~1 day (needs sustained workload measurement)
- ✅ C6. ChatML template inline-system split — splitChatmlSystem helper + 6 unit tests. Shipped in 49dead5.
- ✅ C7. Save+reload XCTest for LoRA adapters — roundtrip + arch-mismatch coverage. Shipped in 49dead5.
- ✅ C8. Install-path discipline — ~/.cache/tinygpt/ for adapters + corpus discovery; off /tmp. Shipped in 49dead5.
- ✅ C9. Determinism harness (SHIPPED 2026-06-17) — --seed N now seeds both MLXRandom AND BatchRng (Splitmix64-backed host generator) in Sources/TinyGPTModel/BatchRng.swift. All 7 corpus-sampler Int.random(in:) call sites swapped to BatchRng.randomInt(in:) across Trainer.swift, SFTCorpus.swift, PreferenceCorpus.swift. Two runs with the same seed now produce identical batch sequence (modulo prefetcher scheduling caveat — see docs/determinism.md). 5 unit tests pin the contract: same-seed determinism, different-seed divergence, reset-then-reseed reproducibility, range bounds, Splitmix64 bit-pattern.
- ✅ C10. Training-run dashboard (SHIPPED) — --log-jsonl <path> in tinygpt train emits append-only JSONL via Sources/TinyGPT/TrainLog.swift; consumed by browser/src/pages/training-dashboard.astro for live charts.
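For readers unfamiliar with the generator behind C9, the following Python sketch shows the standard SplitMix64 update and the determinism properties the unit tests pin down. It is not a port of BatchRng.swift: the class and method names are illustrative, and the modulo-based range mapping is a simplification that ignores modulo bias.

```python
MASK64 = (1 << 64) - 1  # Python ints are unbounded, so wrap to 64 bits by hand


class SplitMix64:
    """SplitMix64: add a fixed odd constant to the state, then mix the bits."""

    def __init__(self, seed: int) -> None:
        self.state = seed & MASK64

    def next_u64(self) -> int:
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK64
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        return z ^ (z >> 31)

    def random_int(self, low: int, high: int) -> int:
        """Integer in the closed range [low, high]; modulo bias is ignored here."""
        if low > high:
            raise ValueError("low must be <= high")
        return low + self.next_u64() % (high - low + 1)


# Same seed, same sequence: the property a seeded batch sampler relies on.
a, b = SplitMix64(42), SplitMix64(42)
same_seed_matches = [a.next_u64() for _ in range(5)] == [b.next_u64() for _ in range(5)]
```

A host-side generator like this matters because a framework seed alone does not control corpus-sampling calls that go through the system RNG.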
Tier 5 — RESEARCH FRONTIER (2026 stretch goals)
Pauses the “training at 2024 fundamentals” cadence; deliverable is a paper-shaped artifact + reproducible code + a scaling-curve point, NOT a polished UX feature.
- ⬜ 5.1 Reasoning training on a 22M model — 5-7 days; expected outcome is the negative result (CoT below emergence). Publishable.
- ⬜ 5.2 Test-time compute scaling — 3-5 days; quality-vs-FLOPs plot at 22M-scale matching Snell et al. methodology. Most cleanly publishable.
- ⬜ 5.3 Vision-language toy — ~2 weeks; ViT + projector + LLaVA-style. Smallest from-scratch VL model on consumer hardware.
- ⬜ 5.4 Diffusion LM micro-implementation — 1-2 weeks; new paradigm via masked denoising loss.
- ⬜ 5.5 Real sparse MoE kernels — 2-3 weeks; custom Metal kernel + measure FLOP reduction.
- ⬜ 5.6 TTS toy (text-to-speech via audio-token GPT) — ~2-4 weeks; integrate EnCodec, train an autoregressive decoder over discrete audio tokens (VALL-E / MusicGen shape). The transformer side already exists in TinyGPT; the new pieces are codec integration, text→audio conditioning, vocoder decode, and an audio data pipeline. Scoping note (2026-06-03): comes AFTER the Wave 3 specialist track (A1-B8) AND after 5.3 vision-language toy — both higher-priority research arcs ahead of it.
- ⬜ 5.7 Specialized explainer-video model — ~3-6 weeks for a Lamina-like toy: document/prompt → script → storyboard DSL → deterministic whiteboard/diagram render. This is NOT a Sora/Runway competitor; the first useful version is a specialized visual-planning model plus renderer. Scoping note (2026-06-04): comes after A1-B8 and after 5.3 VL, because it needs both specialist training discipline and the text↔visual bridge.
5.6 TTS toy — detailed scoping
What carries over from current TinyGPT:
| Piece | Reuse |
|---|---|
| Transformer decoder, KV cache, sampling, MTP heads (for K-codebook prediction) | direct |
| Training loop (tinygpt train) + PEFT bundle for downstream fine-tunes | direct |
| CrossAttention.swift (currently used for YOCO) | adapt to text-encoder K/V source for conditioning |
New code surface (~2 weeks of focused engineering + 3-7 days training):
| Piece | Effort |
|---|---|
| EnCodec encode/decode integration (Swift port of the HF EnCodec weights) | ~3-5 days |
| Text → conditioning surface (text encoder + cross-attention into decoder, OR text-as-prefix-tokens) | ~2-3 days |
| Audio data pipeline (LJSpeech / LibriTTS pre-tokenization to codec ids) | ~2-3 days |
| Eval (WER via Whisper transcription, MOS estimator) | ~2 days |
| First training run on LJSpeech single-speaker → intelligible speech | 2-4 days wall-clock |
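The WER half of the eval row is a standard metric and easy to pin down: transcribe the synthesized audio, then compute the word-level edit distance against the input text. The sketch below covers only the scoring step. The transcription call is out of scope, and lowercasing plus whitespace tokenization is an assumed normalization.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

For example, the reference "the cat sat on the mat" against the transcript "the cat sit on mat" has one substitution and one deletion, so the WER is 2/6.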
Realistic outcome at this scale: smallest-published audio-token GPT (MusicGen-small) is ~300M; from-scratch on LJSpeech you get recognizable but not natural-sounding speech. The publishable artifact is the same shape as 5.3 — “smallest from-scratch ___ on consumer hardware.”
Why ordered after specialist + VL:
- Specialist track validates the north-star thesis (Wave 3 work the project is actually about). Until at least one specialist beats a baseline, modality experiments are noise on top of unproven foundations.
- 5.3 vision-language toy is ahead because (a) it’s the older Tier-5 item and (b) it stress-tests the same “external pretrained encoder + cross-attention into our decoder” pattern that TTS would reuse. Shipping VL first means TTS inherits a validated pattern instead of a speculative one.
5.7 Specialized explainer-video model — Lamina-like track
Reference product: Lamina Labs’ Simi positions itself as an AI explainer studio: prompt or document in, whiteboard-style educational video out for students, course creators, customer training, and teams. The public lesson is not “train a giant cinematic video model”; it is “make a narrow video system that explains accurately, quickly, and consistently.”
The TinyGPT version should start as a structured explainer compiler:
source document / prompt
-> lesson script
-> storyboard scenes
-> visual DSL (objects, labels, arrows, equations, timeline)
-> deterministic renderer (SVG/canvas/Remotion/Manim-style)
-> captions + voiceover + MP4
What we would need:
| Piece | Build | Why |
|---|---|---|
| Scene/storyboard schema | JSON DSL for concepts, equations, diagrams, timings, camera/stroke actions | Gives the model a constrained target instead of free-form pixels |
| Renderer | Start with SVG/canvas frames; later Remotion/Manim export | Deterministic, debuggable, cheap to render |
| Visual-planner specialist | SFT/LoRA model: prompt/doc → storyboard DSL | This is the first “specialized video model” worth training |
| Asset/diagram library | Shapes, arrows, axes, code blocks, graph layouts, simple physics/math primitives | Explainers need reusable semantic primitives more than photorealism |
| Data pipeline | Pair open lessons/transcripts/docs with generated or human-edited storyboards | The scarce asset is supervised storyboard data |
| Eval set | Held-out concepts with rubric: factual correctness, visual grounding, pacing, label consistency, equation validity | Prevents “pretty but wrong” videos |
| Editing loop | User can regenerate one scene, lock script, lock diagrams, export MP4 | Real workflows need partial repair, not one-shot magic |
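To illustrate what "a constrained target instead of free-form pixels" could mean in practice, here is a sketch of a storyboard validator. Everything about the schema is hypothetical: the field names (`scenes`, `narration`, `duration_s`, `objects`, `kind`) and the primitive set are placeholders, since the DSL has not been designed yet.

```python
import json
from typing import List

ALLOWED_KINDS = {"label", "arrow", "equation", "shape"}  # hypothetical primitive set


def validate_storyboard(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the storyboard passed."""
    try:
        board = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    scenes = board.get("scenes") if isinstance(board, dict) else None
    if not isinstance(scenes, list) or not scenes:
        return ["storyboard needs a non-empty 'scenes' list"]
    problems: List[str] = []
    for i, scene in enumerate(scenes):
        if not isinstance(scene, dict):
            problems.append(f"scene {i}: must be an object")
            continue
        if not scene.get("narration"):
            problems.append(f"scene {i}: missing narration")
        duration = scene.get("duration_s")
        if not isinstance(duration, (int, float)) or duration <= 0:
            problems.append(f"scene {i}: duration_s must be a positive number")
        objects = scene.get("objects")
        if not isinstance(objects, list) or not objects:
            problems.append(f"scene {i}: needs at least one on-screen object")
            continue
        for obj in objects:
            kind = obj.get("kind") if isinstance(obj, dict) else None
            if kind not in ALLOWED_KINDS:
                problems.append(f"scene {i}: unknown object kind {kind!r}")
    return problems
```

A check of this shape would cover only the "JSON validity" part of the concept-to-storyboard eval; factual correctness and pacing still need a rubric.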
Model ladder:
- No learned video model: use a strong text model or cloud model to produce the DSL; render deterministically. This validates product and schema fast.
- Tiny visual-planner specialist: fine-tune tinygpt/HF-loaded base on prompt/doc → storyboard DSL. This is the first trainable model.
- Visual critic/evaluator: model scores whether scene frames match the script and flags bad labels, missing objects, impossible diagrams.
- Optional diffusion/image/video model: only for decorative assets or scene backgrounds after the deterministic explainer path works.
Good first eval tasks:
| Eval | Metric |
|---|---|
| Concept-to-storyboard | JSON validity + human/LLM rubric on lesson coverage |
| Equation/diagram correctness | Symbol/label exactness, graph/axis consistency |
| Script-to-scene grounding | Every narrated claim maps to an on-screen object/action |
| Pacing | Scene duration fits narration without overcrowding |
| Editability | Regenerate one scene without changing locked scenes |
Why this is plausible for TinyGPT:
- The project already has specialist SFT/LoRA, structured output, constrained generation, eval harnesses, and renderer-friendly web/native surfaces.
- A storyboard DSL is text. TinyGPT can train on that before any pixel generation exists.
- Deterministic rendering avoids the hardest part of video generation: long-horizon visual consistency.
Why it stays behind the current specialist track:
- It is a new modality product, not a training-foundation prerequisite.
- Data is the bottleneck. We need hundreds to thousands of good storyboard pairs before model training is meaningful.
- The first marketable version is mostly pipeline + UX, not raw model research. Build it only after the first text/tool specialist proves the project can beat a baseline.
Unshipped techniques — after applying the value-add filter (and re-auditing)
Most items in the original roadmap-categories list either ship today (third-audit corrections below) or were dropped under the user’s “don’t list a technique unless it adds genuinely new capability” filter. What’s left:
Genuinely new value-adds, not yet built
After this session’s batch closes, the non-training surface IS exhausted — every capability item under your value-add filter has shipped. Only niche residue remains:
- ⬜ Sample packing (cross-source) — niche, doesn’t change capability at our scale
- ⬜ Vocab trimming — niche, only matters for embedded-deployment
After these: training-dependent (specialist Wave 3, Mini-Llama+ANE, Tier 5 modality arcs) or upstream-blocked (sparse MoE hard routing on scatter_add, real QLoRA on quantized-gradient flow).
Shipped this session (third → fourth audit pass corrections):
- ✅ Linear probes (tinygpt linear-probe)
- ✅ Deduplication (tinygpt dedupe, line + doc modes)
- ✅ ROME (tinygpt rome, identity-Hessian first cut)
- ✅ MEMIT (tinygpt memit, single-layer least-squares, exact per-fact residual at scale=1)
- ✅ Multi-layer MEMIT (--layers SPEC, residual partitioned across N layers; 8-14% per-layer rel vs 41-72% single-layer)
- ✅ MEMIT --layer-weighting key-norm (data-driven proxy for Meng 2023’s causal-trace influence)
- ✅ GGUF reader (GGUFReader.swift + tinygpt gguf-inspect — F32/F16/Q4_0/Q8_0/Q4_K/Q5_K/Q6_K/Q8_K)
- ✅ GGUF model loader validator (tinygpt gguf-load — metadata parse, tensor-name mapping, shape validation against TinyGPT-HF op tree)
- ✅ Best-of-N + Snell-style scaling curve (tinygpt bon --scan)
- ✅ bon --verifier corpus-ppl (corpus-anchored PPL as scoring signal — distinct from self-likelihood)
- ✅ Sparse autoencoders (tinygpt sae — Bricken et al. 2023; encoder + decoder + L1, .sae sidecar)
- ✅ SAE feature explorer (tinygpt sae-explore — load .sae, scan corpus, surface top-K activating windows per feature)
- ✅ Activation patching CLI (tinygpt patch — Mac CLI for zero + donor-swap; reuses shipped forwardWithPatch)
- ✅ Causal trace CLI (tinygpt causal-trace — Meng et al. 2022 per-layer fact localization)
- ✅ MinHash near-duplicate dedup (tinygpt dedupe --near-dup — catches paraphrased boilerplate that exact-SHA misses)
- ✅ GGUF tokenizer + config extractor (tinygpt gguf-extract — writes tokenizer.json + config.json + manifest, the missing piece between gguf-load and runnable model)
- ✅ to-coreml conversion bridge (tinygpt to-coreml — generates a tailored Python conversion script for the user’s coremltools install; now end-to-end runnable via safetensors hop)
- ✅ Safetensors writer (TinyGPTModel/SafetensorsWriter.swift — HF-compatible binary format; shared foundation)
- ✅ tinygpt to-safetensors — converts .tinygpt → model.safetensors with HF Llama tensor names (or --keep-names for native). Verified 196 tensors / 38.4 MB / valid HF format on the shakespeare gallery model.
- ✅ tinygpt export-mlx — packages .tinygpt students, .lora / .tgla adapters, and existing HF dirs as MLX-friendly safetensors directories with config/tokenizer sidecars plus mlx_load.py. This is the interop path for users who want to take TinyGPT fine-tune/distill artifacts into Python MLX or MLX-Swift.
- ✅ gguf-extract materializes weights to safetensors — output directory is now a complete HuggingFace model bundle loadable via transformers.AutoModelForCausalLM.from_pretrained(). Verified on a 21-tensor llama-shape GGUF: tokenizer.json + tokenizer_config.json + config.json + model.safetensors all populated.
- ✅ to-coreml safetensors bridge — Python script no longer stubbed; loads weights via safetensors.torch.load_file() with full HF Llama → TinyGPT name-map. py_compile clean.
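The difference between exact-hash dedup and MinHash near-dup detection is easiest to see in code. The sketch below is a generic MinHash over word trigrams, not the algorithm in tinygpt dedupe: the shingle size, the 64 hash functions, and the use of seeded BLAKE2b as the hash family are assumptions for illustration.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Overlapping word n-grams; short texts fall back to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items: Set[str], num_hashes: int = 64) -> List[int]:
    """Keep the minimum hash value per seeded hash function."""
    if not items:
        raise ValueError("cannot sign an empty shingle set")
    signature = []
    for seed in range(num_hashes):
        prefix = seed.to_bytes(4, "little")  # the seed selects the hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(prefix + s.encode("utf-8"), digest_size=8).digest(),
                "little")
            for s in items))
    return signature


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Two documents that differ by one appended word share almost all shingles, so their estimated similarity stays high even though their SHA digests differ completely. A threshold such as the 0.6 used in the B29 smoke test then decides whether to drop one of them.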
Stale ⬜ markers caught + corrected this session — now ✅:
| Item | Where it ships |
|---|---|
| Embedding RMSNorm | --embedding-rmsnorm flag, RMSNorm module on token-embed |
| DeepNorm | --deep-norm flag, cfg.useDeepNorm/deepNormAlpha/deepNormBeta |
| Layer-wise LR decay | cfg.lrLayerDecay |
| Cosine warmup | --lr-schedule cosine --warmup 500 (the curated default) |
| BPE-dropout | BPEDropout.swift |
| Real CI | .github/workflows/ci.yml + deploy.yml |
| Persistent tokenized cache | TokenCache.swift wired into Train+Eval+Distill+Finetune |
| Linear probes | tinygpt linear-probe (this session, 6dbe15c) |
| YOCO cross-layer KV | --yoco flag, CrossAttention.swift, docs/yoco_results.md |
| GPTQ safetensors reader | GPTQReader.swift (72 tensors quantised in 31s) |
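Two of the rows above are schedules that reduce to short formulas. The sketch below shows one common formulation of cosine decay with linear warmup and of layer-wise LR decay. Function names, the `min_lr` floor, and the convention that the top layer keeps the base rate are assumptions. The actual behaviour behind --lr-schedule cosine and cfg.lrLayerDecay may differ in detail.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_steps: int = 500, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def layer_lr(base_lr: float, layer: int, n_layers: int, decay: float) -> float:
    """Layer-wise decay: the top layer gets base_lr, each layer below is scaled by decay."""
    if not 0 <= layer < n_layers:
        raise ValueError("layer index out of range")
    return base_lr * decay ** (n_layers - 1 - layer)
```

With a 500-step warmup the rate reaches its peak at step 499 and is halfway down the cosine at the midpoint of the remaining steps.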
Dropped under value-add filter (duplicate / inferior / niche):
| Dropped | Why |
|---|---|
| ReLoRA | GaLore already gives “full fine-tune at LoRA memory cost” |
| Prefix tuning / soft prompts | LoRA covers the practical case |
| IPO | DPO with high β covers tiny-pair regularization |
| Token elimination | StreamingLLM + KIVI cover positional + per-entry-bits axes |
| Tree decoding | Speculative decode (vanilla + Medusa + EAGLE-2) covers the niche |
| Curriculum learning | Modest gains, scale-dependent; needs a difficulty metric we don’t have |
| Self-instruct / Evol-instruct | Magpie subsumes (uses model’s own distribution, no seed needed) |
| Hard example mining / Importance sampling | Marginal at our scale |
| Data quality filtering | PPL-filtering needs a ref model; basic dedup covers most of the value |
| BigBird / Longformer sparse attention | Only matters past ctx=8192 (we don’t train at that length) |
| Linear attention (Performer / Linformer / Reformer) | Quality usually worse than flash attention |
| Hybrid attention/SSM (Jamba, Samba) | Different family; side-project |
| Pre-norm vs post-norm toggle | Config knob, not a feature |
| Tiktoken adoption | swift-transformers handles BPE-family tokenizers already |
| Subword regularization | Marginal vs BPE-dropout |
| Train own BPE on corpus | Modest gain (~5% PPL); blocked on Rust-FFI for speed |
| TinyGPT-as-library API | User explicitly deferred until specialists beat a baseline |
Queued findings — ANE routing + Mac-vs-browser sampling
Triggered by the question “how do we get to 170× instead of 17×?” The 17.2× number is at Huge training — small bandwidth-bound model where kernel-launch overhead dominates. Several legitimate paths to a much larger ratio; each is queued with its honest cost.
1. Browser sampling tok/s harness — CHEAP, ~30 min
Closes a real missing measurement. We have Mac sampling tok/s
(293-696 by model size) but no analogous browser-side number. The
playground worker generates via GpuModel.generate already; we just
don’t time it.
What: in browser/src/worker.ts, log per-token wall-clock in the
generate loop, post a sampling_perf message, display tok/s next to
the playground output.
Expected ratio: Mac-vs-browser sampling probably 30-80× at Huge based on shape priors (Mac is much less kernel-launch-overhead sensitive during decode than during training). That alone changes the headline from “17× training” to “30-80× sampling, 17× training.”
Why queued: tiny work, just hasn’t been done. No blockers.
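The measurement itself is small enough to sketch. The version below is in Python for illustration only. The real change would live in browser/src/worker.ts, and `measure_sampling` plus its result keys are hypothetical. It separates time-to-first-token from steady-state decode throughput, which is the number comparable to the Mac-side tok/s.

```python
import time
from typing import Callable, Dict, Iterable


def measure_sampling(generate_tokens: Callable[[], Iterable[int]],
                     clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Time a token stream; report time-to-first-token and decode tok/s."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in generate_tokens():
        last = clock()  # timestamp taken as each token arrives
        if first is None:
            first = last
        count += 1
    if count == 0:
        raise ValueError("generator produced no tokens")
    decode_time = last - first  # excludes prefill / first-token latency
    decode_tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {"tokens": count, "ttft_s": first - start, "decode_tok_per_s": decode_tps}
```

Excluding the first token keeps prompt processing out of the decode figure, so a long prompt does not drag down the reported tok/s.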
2. ANE-routed inference via Mini-Llama TinyGPT — MEDIUM, 1-2 weeks
Apple Neural Engine routes only when the graph hits its preferred shapes. The published numbers (ANEMLL, perf-quest memory) are 2-3× sampling over the same model on Apple GPU when ANE engages cleanly, not 100×+ end-to-end. The big win is the combined ratio: bigger ANE-friendly model × ANE-routing × already-unfit-for-browser size.
Why TinyGPT doesn’t route today
ANE prefers head_dim ∈ {64, 128}, tensor dims multiples of 64,
fp16, RoPE-style attention, bias-free linears, RMSNorm. Our Huge
default is the opposite of all of these:
| Dimension | TinyGPT Huge | Llama 3.1 8B | ANE impact |
|---|---|---|---|
| head_dim | 32 | 128 | falls off ANE matrix engine |
| d_model | 256 | 4,096 | tiny matmuls under-utilize ANE tiles |
| vocab | 256 (byte) | 128,256 (BPE) | LM-head matmul too small to matter |
| Norm | LayerNorm | RMSNorm | RMSNorm has better ANE op coverage |
| Positional | learned absolute | RoPE | ANE’s fused-attention paths assume RoPE |
| MLP activation | GELU | SwiGLU | SwiGLU is the ANE-tuned default |
| Linear bias | yes | no | bias-free fuses cleaner into matmul-add |
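The preferences in the table can be turned into a quick pre-flight checklist. The sketch below encodes only the rules stated in this section (head_dim of 64 or 128, dimensions in multiples of 64, RMSNorm, RoPE, SwiGLU, bias-free linears). It is a heuristic, not Apple's dispatch logic, and the dictionary keys are hypothetical. The Huge head count of 8 is derived from d_model 256 / head_dim 32.

```python
from typing import Dict, List


def ane_shape_issues(cfg: Dict[str, object]) -> List[str]:
    """List the ways a config departs from the ANE-friendly shape described above."""
    issues: List[str] = []
    d_model, n_heads = int(cfg["d_model"]), int(cfg["n_heads"])
    if d_model % n_heads != 0:
        issues.append("d_model is not divisible by n_heads")
    elif d_model // n_heads not in (64, 128):
        issues.append(f"head_dim {d_model // n_heads} is not 64 or 128")
    for key in ("d_model", "vocab"):
        if int(cfg[key]) % 64 != 0:
            issues.append(f"{key} is not a multiple of 64")
    if cfg.get("norm") != "rmsnorm":
        issues.append("norm is not RMSNorm")
    if cfg.get("positional") != "rope":
        issues.append("positional encoding is not RoPE")
    if cfg.get("activation") != "swiglu":
        issues.append("MLP activation is not SwiGLU")
    if cfg.get("linear_bias", False):
        issues.append("linear layers carry a bias")
    return issues


huge = {"d_model": 256, "n_heads": 8, "vocab": 256, "norm": "layernorm",
        "positional": "learned", "activation": "gelu", "linear_bias": True}
mini_llama = {"d_model": 2048, "n_heads": 16, "vocab": 32768, "norm": "rmsnorm",
              "positional": "rope", "activation": "swiglu", "linear_bias": False}
```

Passing this checklist says nothing about whether CoreML actually routes to ANE; only profiling the converted model can confirm that.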
What to build
A new ModelConfig preset — mini-llama — using only existing config
flags (every one of the above is already a knob):
    ModelConfig(
        vocabSize: 32768,      // small BPE, multiple of 64
        contextLength: 2048,
        nLayers: 24,
        nHeads: 16,            // head_dim = dModel / nHeads = 128
        nKvHeads: 4,           // GQA: 4 query heads share each KV head
        dModel: 2048,
        dMlp: 8192,
        useRoPE: true,
        useRMSNorm: true,
        useSwiGLU: true,
        tieEmbeddings: false
    )
    // ~600M params; scale down to (1280, 16) for ~200M first cut
Plus tinygpt to-coreml exporter (~1-2 days): maps our transformer
ops to CoreML’s op set, produces a .mlpackage that Instruments can
profile to see whether ANE actually engages.
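Since the size of the preset drives both training wall-clock and whether a browser could load it, it is worth counting parameters explicitly. The sketch below uses the usual Llama-style accounting, and every structural choice in it is an assumption: bias-free linears, GQA with K/V projected to nKvHeads × head_dim, a three-matrix SwiGLU MLP, two RMSNorm weights per layer plus a final norm, and untied input/output embeddings. The real ModelConfig may count differently.

```python
def llama_param_count(vocab: int, d_model: int, n_layers: int, n_heads: int,
                      n_kv_heads: int, d_mlp: int, tie_embeddings: bool = False) -> int:
    """Parameter count for a bias-free Llama-style decoder (assumed layout)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Wq, Wo + Wk, Wv
    mlp = 3 * d_model * d_mlp                            # gate, up, down (SwiGLU)
    norms = 2 * d_model                                  # two RMSNorm weights per layer
    embed = vocab * d_model
    head = 0 if tie_embeddings else vocab * d_model      # separate LM head when untied
    return n_layers * (attn + mlp + norms) + embed + head + d_model  # + final norm


mini_llama_params = llama_param_count(
    vocab=32768, d_model=2048, n_layers=24, n_heads=16, n_kv_heads=4, d_mlp=8192)
```

Under these assumptions the preset as written counts to about 1.59B parameters, well above the ~600M quoted in the config comment. The SwiGLU MLP alone contributes roughly 50M per layer. The intended dMlp or layer count is therefore worth re-checking against the real ModelConfig before committing training time.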
Realistic speedup expectations
| Path | Realistic tok/s |
|---|---|
| Current Huge on Mac GPU | 293-696 |
| Mini-Llama (~600M) on Mac GPU | ~150-400 |
| Mini-Llama on Mac ANE if it routes | ~400-1200 (~2-3× over its own GPU) |
| Mini-Llama in browser | ~5-20 (probably can’t load; 600M near browser ceiling) |
| Mac-ANE vs browser ratio | 30-200× depending on routing cleanliness |
Probability analysis
Test 1 (ANEMLL works on Llama 3.1 on your machine) → confirms the environment but NOT that our model routes. Independent reasons it could still fail:
    ANEMLL on Llama works?
    ├─ No → done, environment broken
    └─ Yes → environment confirmed
       └─ Build tinygpt to-coreml exporter
          └─ Convert + profile Mini-Llama
             ├─ All ops on ANE → 🎉 ~30-50% chance, you win
             ├─ Partial split → 🟡 ~40% chance, measure if net speedup
             └─ Nothing on ANE → 😐 ~10-20% chance, GPU is the ceiling
Cost-benefit (honest)
| Item | Cost | Outcome regardless of ANE result |
|---|---|---|
| Train Mini-Llama (200-600M) | 3-7 days mostly-background | Real Llama-architecture gallery model. Useful independently. |
| tinygpt to-coreml exporter | 1-2 days focused | Reusable for any future model. Useful independently. |
| Profile + iterate | 1-3 days unpredictable | Empirical learning either way. |
Total: 1-2 weeks calendar; dominated by training wall-clock.
Why queued
- Requires lifting the current “no training” goal constraint
- The trained Mini-Llama IS a valuable artifact independent of ANE, so the conditional EV is positive — but only if you’re willing to train.
- Doesn’t deliver 10× on Mac-alone (realistic 2-3× ANE-over-GPU); delivers 30-200× only via the Mac-ANE-vs-browser combined ratio.
- The cheaper browser-sampling-benchmark (item 1 above) is a prerequisite to even know the current sampling ratio — should do that first.
Apple’s actual ANE landscape (for posterity)
- CoreML (public) — convert to .mlpackage, Apple’s runtime decides per-op CPU/GPU/ANE dispatch. Heuristics are opaque. No way to force ANE.
- ANEMLL (community, github.com/Anemll/Anemll) — uses private CoreML internals to coerce more ops to ANE. Works on macOS Sequoia. Historically breaks on every macOS update. Hand-tuned for Llama-family.
- “Stateful Models API” (rumored late 2026) — would make ANE routing first-class. Not shipped.
There is no Apple-sanctioned “private beta” for ANE inference; that phrasing was loose. The real options are the three above.
4. Research absorbed — paper × verdict
External-paper catalogue (was docs/roadmap/recent_research.md, now
archived at docs/archive/recent_research.md). Each row: technique →
one-line source → verdict pointing at where it lives in this codebase,
or why it doesn’t.
4.1 Implemented (techniques we ship)
Alignment / preference
| Technique | Source | Where it lives |
|---|---|---|
| DPO | Rafailov et al., NeurIPS 2023 | tinygpt dpo |
| KTO | Ethayarajh et al., 2024 | tinygpt dpo --variant kto |
| ORPO | Hong et al., 2024 | tinygpt dpo --variant orpo |
| SimPO | Meng et al., 2024 | tinygpt dpo --variant simpo |
| NEFTune | Jain et al., NeurIPS 2023 | --neftune |
PEFT
All in native-mac/Sources/TinyGPTModel/PeftVariants.swift, surfaced via tinygpt sft.
| Technique | Source | Where it lives |
|---|---|---|
| DoRA | Liu et al., 2024 | default in sft |
| GaLore | Zhao et al., 2024 | Optimizers.swift |
| LoftQ | Li et al., ICLR 2024 | PeftVariants.swift |
| VeRA | Kopiczko et al., ICLR 2024 | PeftVariants.swift |
| PISSA | Meng et al., 2024 | PeftVariants.swift |
| LoRA+ | Hayou et al., ICML 2024 | PeftVariants.swift |
| rsLoRA | Kalajdzievski, 2023 | PeftVariants.swift |
Quantization
| Technique | Source | Where it lives |
|---|---|---|
| GPTQ | Frantar et al., ICLR 2023 | tinygpt gptq + GPTQReader.swift |
| AWQ | Lin et al., MLSys 2024 | AWQ safetensors reader |
| HQQ | Badri & Shaji, 2024 | tinygpt hqq |
| KIVI | Liu et al., 2024 | KV cache quantization path |
Inference / efficiency
| Technique | Source | Where it lives |
|---|---|---|
| Speculative decoding | Leviathan et al., ICML 2023 | tinygpt train-heads --type medusa\|eagle + decode loop |
| Medusa | Cai et al., 2024 | same path, head type |
| EAGLE-2 | Li et al., 2024 | same path, head type |
| StreamingLLM | Xiao et al., ICLR 2024 | attention-sink path |
Architecture variants
| Technique | Source | Where it lives |
|---|---|---|
| MTP | Gloeckle et al., ICML 2024 | Train.swift, docs/mtp.md |
| Differential Transformer | Microsoft 2024 | DifferentialAttention.swift, --diff-attn |
| Mixture of Depths | Raposo et al., 2024 | soft sigmoid gate (hard top-K upstream-blocked) |
| LASER | Sharma et al., ICLR 2024 | tinygpt laser |
Optimizers
| Technique | Source | Where it lives |
|---|---|---|
| Sophia | Liu et al., 2023 | Optimizers.swift |
| Lion | Chen et al., NeurIPS 2023 | Optimizers.swift |
| Muon | Jordan, 2024 | Optimizers.swift |
| GaLore | (see PEFT) | Optimizers.swift |
Distillation
| Technique | Source | Where it lives |
|---|---|---|
| Soft-targets distillation | Hinton et al., 2015 | tinygpt distill |
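Because distillation is the core of the toolkit, a compact statement of the soft-target objective may help. The sketch below computes KL(teacher ‖ student) on temperature-softened distributions and scales it by T², which matches the soft-target cross-entropy of Hinton et al. up to a constant that does not depend on the student. It is plain Python for one token position, not the loss inside tinygpt distill. The function names and the default T = 2 are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits: Sequence[float],
                     teacher_logits: Sequence[float],
                     temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, multiplied by T^2."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher need the same vocabulary size")
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2  # T^2 keeps gradient scale comparable across T
```

A higher temperature flattens the teacher distribution, exposing the relative ranking of wrong answers that a hard label would hide.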
Synthetic data
| Technique | Source | Where it lives |
|---|---|---|
| Magpie | Xu et al., ICLR 2025 | tinygpt magpie |
| TinyStories | Eldan & Li, 2023 | dataset source |
Test-time compute
| Technique | Source | Where it lives |
|---|---|---|
| Best-of-N | Snell et al., 2024 | tinygpt bon --scan |
Evolution Strategies
| Technique | Source | Where it lives |
|---|---|---|
| ES at scale | Qiu et al., Sept 2025 | tinygpt es, docs/evolution_strategies.md |
4.2 Cannot — blocked, parked, or skipped
🚧 Blocked by hardware
| Technique | Source | Why parked |
|---|---|---|
| BitNet b1.58 | Ma et al., 2024 | Ternary from-scratch needs 100B+ tokens to validate; not differentiating at <1B params on our hardware. Park; revisit if a clear gallery-model use case appears. |
| FP4 training (NVFP4 / Quartet) | Wang Jan 2025 · Quartet II Jan 2026 | Apple M-series has no native FP4 ops |
| FP8 training | — | Needs H100 / Blackwell |
🚧 Blocked upstream
| Technique | Source | Why parked |
|---|---|---|
| Hard sparse MoE routing | DeepSeek-V3 family | MLX-Swift has no scatter_add; soft (dense) routing ships |
| Real QLoRA | Dettmers et al., 2023 | MLX-Swift quantized arrays don’t autograd through; manual fake-quant shipped (pedagogical, no memory win) |
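
The "no memory win" caveat on the QLoRA row follows from what fake quantization does: weights are rounded to a low-bit grid and immediately converted back to floats, so the values carry quantization error but are still stored at full width. The Python sketch below illustrates this with plain symmetric absmax int4. That scheme is an assumption for illustration: real QLoRA uses NF4 with double quantization, and the scheme used by the shipped fake-quant is not specified here.

```python
def fake_quant_int4(weights):
    """Quantize to symmetric integer codes in [-7, 7], then dequantize to floats."""
    absmax = max((abs(w) for w in weights), default=0.0)
    if absmax == 0.0:
        return [0.0 for _ in weights]
    scale = absmax / 7.0  # one scale for the whole block (assumed block = whole list)
    out = []
    for w in weights:
        code = max(-7, min(7, round(w / scale)))  # the integer a real kernel would store
        out.append(code * scale)                  # stored as a float here: no memory saved
    return out
```

The round trip reproduces the numerical effect of 4-bit weights during training while leaving the memory footprint unchanged.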
❌ Skipped — different family / not worth the seat
| Technique | Source | Why skipped |
|---|---|---|
| Mamba / Mamba-2 | Gu & Dao, 2023/2024 | Linear-time SSM, different family; better as side-project |
❌ Dropped — value-add filter (subsumed by what ships)
| Technique | Source | Subsumed by |
|---|---|---|
| IPO | Azar et al., 2023 | DPO with high β regularizes equivalently |
| CPO | Xu et al., 2024 | DPO + BC term marginal over SimPO at our scale |
| Self-Instruct | Wang et al., 2023 | Magpie (model’s own distribution; no seed needed) |
| Evol-Instruct | Xu et al., 2024 (WizardLM) | Magpie subsumes |
| MiniPLM | Gu et al., NeurIPS 2024 | Distill-for-pretraining — needs a teacher-student pair we don’t have |
| Distillation with Training Wheels | Feb 2025 | cloud-escalate already provides the analogous “student asks teacher” deployment shape |
| DEITA | Liu et al., 2024 | Instruction-data quality framework — only matters once SFT corpus > 1M samples |
4.3 Planned — queued for a future training run
| Item | Source | Where in §3 |
|---|---|---|
| GRPO / DAPO (RLVR pipeline) | DeepSeek-R1, Jan 2025 · DAPO, March 2025 | Tier 5 §5.1 — Reasoning training on a 22M model. GRPO = mental model; DAPO = implementation. |
| Reasoning-trace distillation | DeepSeek-R1-Distill series, OpenThoughts | Tier 5 §5.1 — SFT-on-traces is the first half of §5.1 before RLVR |
| Snell test-time-compute scaling experiment | Snell et al., 2024 | Tier 5 §5.2 — bon shipped; the scaling-curve experiment at 22M matches Snell methodology |
| Vision-language toy | LLaVA family | Tier 5 §5.3 |
| Diffusion LM micro | (multiple) | Tier 5 §5.4 |
| Real sparse MoE kernels | DeepSeek-V3 style | Tier 5 §5.5 (also upstream-blocked on scatter_add) |
| TTS toy | VALL-E / MusicGen family | Tier 5 §5.6 |
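
The "GRPO = mental model" note in the first row can be made concrete: GRPO drops the learned value function and instead normalizes each sampled completion's reward against the other completions for the same prompt. The Python sketch below covers only that group-relative advantage; the function name, the `eps` value, and the use of population standard deviation are assumptions, and DAPO's additional changes are not shown.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each completion relative to its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every completion earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

A group whose completions are all right or all wrong yields zero advantages, so it contributes no gradient; with verifiable rewards on a small model, that makes prompt difficulty selection matter.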
Small additions, no current owner — append when a slot opens:
| Item | Source | Effort |
|---|---|---|
| LISA optimizer | Pan et al., 2024 | ~1 day; layerwise importance sampling, drop-in alongside Sophia/Muon |
| MiniLLM KL variants | Gu et al., ICLR 2024 | ~1-2 days; reverse-KL / skew-KL switches on top of existing tinygpt distill |
| Distilling Step-by-Step | Hsieh et al., ACL 2023 | ~1-2 days; rationale-distillation recipe on top of tinygpt distill |
| DoReMi data-mixture optimization | Xie et al., NeurIPS 2023 | Park until ≥3 distinct domains are mixed at non-trivial scale |
| Quality classifier (FineWeb-Edu-style) | Penedo et al., 2024 — FineWeb / FineWeb-Edu | §3 B10 — ~2 days; tiny fastText scorer + top-X% filter |
| WSD schedule (warmup-stable-decay) | MiniCPM, Hu et al., 2024 · SmolLM blog | §3 B11 — ~half-day; decay phase doubles as annealing |
| Interp-on-checkpoints methodology | Pythia, Biderman et al., 2023 · OLMo, Groeneveld et al., 2024 | §3 B13 — 1-2 days infra + ongoing analysis; replay SAE / MEMIT across the checkpoint timeline |
| Speculative decoding | Leviathan et al., ICML 2023 · Chen et al., 2023 | §3 B14 — 2-3 days; Mini-Llama draft for Mega; numerics gate required |
| Layer-wise LR decay (SFT) | ULMFiT, Howard & Ruder, 2018 | §3 B15 — ~half-day flag add on existing optimizer |
| M5 GPU Neural Accelerator prefill benchmark | Apple ML Research, 2026 | §3 B16 — ~half-day; verify the claimed 3.5× M5-vs-M4 prefill speedup is materializing on our path |
| SAE Lens interop / Neuronpedia format export | decoderesearch/SAELens | §3 B17 — ~2 days for format-export option; compare-and-decide before building |
| nanochat-style --depth single-knob HP derivation | karpathy/nanochat | §3 B18 — ~1 day; one knob auto-derives width / heads / LR / batch / steps; UX win |
| Group-SAE (layer-group SAE training) | Wang et al., 2024 | §3 B19 — 2-3 days; trains SAEs once per layer-group instead of per-layer; cuts SAE training cost |
| Learnable cross-stream attention (modded-nanogpt speedrun trick) | KellerJordan/modded-nanogpt | §3 B20 — read-and-evaluate; speedrun-specific, not yet a paper |
| ScaleDown extractive context compression SLM | ScaleDown blog · Challenge leaderboard · scaledown.ai | §3 B25 — 3-5 days; token-level relevance head + sentence aggregation; submit to public leaderboard as a “specialist trained on a Mac” proof-point |
| Micro-AutoMixer for specialist data mixes | Poolside Laguna deep dive · RegMix/DoReMi-style mixture search | §3 B21 — small proxy-run version of Poolside’s automixing; optimize specialist ratios before full training |
| Token-preserving agent trajectory recorder | Poolside Laguna deep dive | §3 B22 — preserve token IDs through rollout → training so agent traces cannot drift through retokenization |
| Agent eval protocol hardening | Poolside Laguna deep dive | §3 B23 — repeated pass@1, fixed step/resource/sampling budgets, and explicit infra-patch notes |
| Muon large-scale re-benchmark | Poolside Laguna deep dive · Jordan, 2024 | §3 B24 — only revisit if large/proxy matmul-dominated runs amortize Newton-Schulz overhead |
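
For the WSD schedule row (§3 B11) in the table above, the schedule has three phases: a short warmup, a long plateau at the peak learning rate, and a final decay that doubles as annealing. The Python sketch below is one possible shape under stated assumptions: linear warmup and linear decay are choices made for illustration (other decay shapes are in use), and `wsd_lr` and its parameters are hypothetical rather than an existing TinyGPT flag.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-stable-decay learning rate for a 0-indexed step."""
    if step < warmup_steps:
        # Linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr  # stable phase: constant at the peak
    # Linear decay from peak_lr to min_lr over the final decay_steps.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress
```

Because the stable phase does not depend on `total_steps`, a run can be extended by continuing at the peak rate and only committing to the decay when a checkpoint is wanted.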
4.4 Reference reads (no verdict — context only)
For mental-model framing, not techniques to implement:
- State of GPT (Karpathy, 2023) — pretrain → SFT → RM → PPO; we skip RM/PPO for DPO
- Tulu 3 (Lambert et al., 2024) — open RLVR recipe; informs §5.1
- SmolLM blog (HF, 2024) — 135M/360M/1.7B small-model recipe
- HuggingFace Alignment Handbook (repo) — reference SFT/DPO recipes at 7B
- Survey of LLMs (Zhao et al., arXiv 2303.18223) — broad survey, continuously updated
- On-Policy Distillation Survey (April 2026) — confirms distillation dominates for shipping small models
2026 small-model peers (for positioning, not adoption): SmolLM3-3B · Qwen3.5-0.8B · Phi-4-mini-instruct · Gemma-3n-E2B-IT · Gemma-4-12B Unified (encoder-free multimodal, 256K ctx, MLX variants exist). Implication: the niche is “browser-trainable + every byte of training code is here,” not “perf-competitive with Phi-4.”
Direct from-scratch peers (full pipeline, not just pretrain):
- karpathy/nanochat — tokenizer → pretrain → SFT → RL → CLI/web chat in one repo. $48/2h on 8×H100. Apple Silicon mode exists via runs/runcpu.sh (degraded scale). No interpretability story. Single --depth knob auto-derives all HPs. Closest head-on competitor; differentiation = Mac-first + interp lab.
- KellerJordan/modded-nanogpt — speedrun fork; April 2026 record 1.35 min to GPT-2 quality on 8×H100. Playbook: Muon (we have) · FA3 · FP8 head (HW-blocked) · learnable cross-stream attention · MTP (queued).
- Poolside Laguna XS.2 / M.1 deep dive — agentic coding models with open XS.2 weights, strong SWE/Terminal benchmark protocol, quality+diversity data curation, synthetic data throughout pretraining, automixed data ratios, Muon at scale, and async agent RL. Steal the workflow discipline, not the scale: data-mix proxy sweeps, token-preserved agent traces, repeated eval protocol, and Muon only after large-scale re-benchmark.
Tools worth knowing:
- Unsloth — Triton-kernel fine-tune framework; not Mac/MLX but study for technique transfer. Feb 2026: 12× faster MoE training + embedding model support + ultra-long-context RL.
- Axolotl — config-driven multi-GPU production fine-tuner; multimodal support landed 2026
- LLaMA-Factory — web-UI fine-tuner (LlamaBoard); zero-config entry point
- TorchTune — Meta’s PyTorch-native fine-tuner; ~20-24% speedup via PyTorch 2.5 compile
- Argilla Distilabel — Python pipeline for synthetic SFT/DPO (wraps Magpie/DEITA)
Apple Silicon ecosystem (direct peers on our platform):
- mlx-lm — Apple’s official MLX inference + LoRA / DoRA / QLoRA / full fine-tune + OpenAI-compatible server. Direct overlap with our SFT/DPO LoRA path; differentiation = pretrain + interp + GGUF/CoreML export.
- Ollama + MLX backend (v0.19, March 2026) — prefill 1154→1810 tok/s, decode 58→112 tok/s on Apple Silicon. Direct competition for our GGUF runner.
- exo-explore/exo — multi-Mac P2P distributed inference. JACCL collectives over RDMA-on-Thunderbolt-5 on macOS 26.2 → 1.8×/3.2× speedup on 2/4 devices. Out of single-machine scope, but the infra is new.
Interpretability ecosystem (overlap with our interp lab):
- SAELens — established SAE training/analysis library; integrates with TransformerLens + HF + nnsight + Neuronpedia. Our SAE may be reinventing it; B17 task = compare + decide on interop format.
- TransformerLens · nnsight (NDIF) — PyTorch interp infra; complementary to SAELens. We have native Swift/MLX equivalents.
Proprietary / out of scope: OpenAI o1 / o3 (closed-weights; reframed the field around test-time compute, no adoptable artifact). DeepSeek-V3 (671B-MoE, scale-blocked; informs MTP + MoE design). Qwen3 (model family, not a technique).
4.5 Coverage cutoff
The catalogue was hand-curated up to assistant knowledge cutoff January 2026 plus best-effort web-search additions for Feb-May 2026 (coverage spottier there). Today is 2026-06-04.
2026-06-04 web sweep folded in — five surfaces were checked (Apple Silicon training, nanoGPT successors, Mac inference runtimes, interpretability libraries, fine-tune frameworks). Results: nanochat + modded-nanogpt added as direct from-scratch peers; mlx-lm + Ollama-MLX + EXO added as Apple Silicon ecosystem peers; SAELens added as interp peer; B16-B20 queued in §3 from surfaced gaps; Unsloth Feb-2026 release notes folded into tools row. Coverage of Feb-Jun 2026 papers is now meaningfully better but still not exhaustive.
Future papers append row-by-row into §4.1 / §4.2 / §4.3.
Appendix — index of source docs absorbed by this file
This doc replaces the multi-file roadmap split. The source docs are kept for context but should be treated as historical; edit this file, not them.
| Old doc | What it covered | Status |
|---|---|---|
| docs/roadmap/index.md | TOC for the multi-file split | Superseded — point at this file |
| docs/roadmap/tier1.md / tier2.md / tier3.md | ROI-tiered technique inventory | Absorbed; markers refreshed |
| docs/roadmap/tier4_skip.md | Intentionally-not-built items | Absorbed into §2 |
| docs/roadmap/tier5_frontier_2026.md | 2026 research frontier | Absorbed into §3 Tier 5 |
| docs/roadmap/categories.md | Orthogonal technique taxonomy (had stale markers) | Absorbed; refreshed against code |
| docs/roadmap/blockers.md | What we can’t build + Phase 9/10 status appendix | Absorbed into §2 + §1 |
| docs/roadmap/phased_plan.md | 7-week sequential plan | Mostly shipped; remainder in §3 |
| docs/roadmap/recommended_order.md | Top-10 next | Superseded by Tier A/B ordering in §3 |
| docs/roadmap/honest_summary.md | “CAN / CAN’T / SHOULDN’T” framing | Absorbed |
| docs/progress.md | Mac+Web shipped dashboard | Absorbed into §1 |
| docs/backlog.md | ROI-ordered “what’s left” (Tier A/B/C/D) | Absorbed into §3 |
| docs/feature_audit_2026_05_31.md | CLI smoke audit | Cross-referenced; was the verification baseline |
| docs/roadmap/recent_research.md | Paper catalogue (2024-2026) | Absorbed into §4; archived at docs/archive/recent_research.md |
Still canonical (deep dives, not absorbed): docs/roadmap/datasets.md,
docs/roadmap/north_star_refined.md, and the per-technique docs
(distillation.md, interpretability.md,
moe.md, mtp.md, lora_guide.md, precision.md, memory_tradeoffs.md,
perf_quest.md, decision_log.md). Those don’t duplicate planning — they
explain how shipped pieces work.