An open-source LLM factory for one Mac.
Build routed specialists that earn their keep —
on a laptop you already own.
TinyGPT is a Swift codebase + 30+ CLI subcommands for the full
post‑training stack on Apple Silicon: train, distill from a local
teacher, fine‑tune with LoRA / DoRA / QLoRA, gate with real evals
(BFCL, τ‑bench, lm‑eval), serve OpenAI‑compatible,
and open the model up with sparse autoencoders, MEMIT, activation
patching. Same model also trains and runs in a browser tab via
hand‑written WebGPU kernels. The first packaged specialist is
honest by design: frontier‑level on file‑ops, routed only because
it regresses outside that domain.
frontier gate 100%
file ops route armed
oos guard caveated
qwen3-4b-file-ops-distilled HF artifact -> MLX adapter -> OpenAI-compatible socket
7.5 GB 4B params BFCL gated MIT code
distill eval serve route-only
First package: qwen3‑4B file‑ops specialist · 7.5 GB local HF/MLX artifact ·
routed specialist, not a general planner.
The strongest measured claim
A 4B routed file‑ops specialist, locally distilled, matches the
frontier on its multi‑turn hard gate.
BFCL multi‑turn hard gate, n=12 GorillaFileSystem agentic tasks,
task‑completion rate. Distillation: ~99 frontier trajectories ×
LoRA SFT, single Mac, no cloud train. Cheap routed specialist > bigger general
on one domain is the project thesis — not "4B > 12B in general."
The same file‑ops specialist regresses on out‑of‑domain breadth
(60 → 42%), so TinyGPT packages it with the caveat instead of
pretending it is a general planner.
Full writeup, including the honest tradeoffs →
Measured, on one Mac.
Numbers below are from a stock M5 Pro / 48 GB. Reproducible via
tinygpt bench and the linked artefacts. Where a number
applies only to a specific preset or build, the preset is named.
- Decode throughput
- 696tok/s
Huge preset, 221M params. 293 tok/s on the 960M Mega pilot.
- TTFT warm
- 5.8msp99
Cold start: 24 ms on a 1B model. ITL p99: 4.9 ms.
- Training step
- 42ms
Huge preset, 17.2× the same model in‑browser via WebGPU.
- WebGPU vs WASM SIMD
- 12.1×
At d_model=256. Scales 2.6×→12.1× as d_model grows 96→256.
- ANE M8 decode
- 17tok/s
28‑block Qwen3 chain on the Apple Neural Engine, layer‑chunked Core ML.
- Largest fit
- 960M
Params trainable end‑to‑end on unified memory; 473M in a browser tab via Memory64.
The loop the runtime closes.
Each surface emits the input the next one needs. The agent runtime
records token‑preserving trajectories; those trajectories become
SFT data; that data trains the next specialist; the eval gate decides
whether it ships.
- 01 Data tinygpt download‑dataset 22 curated sets. HF streaming, Magpie synth, GitHub issue→PR, MS‑MARCO, the‑stack‑smol.
- 02 Distill tinygpt distill KL + NLL teacher→student. Local teacher (Qwen3, Gemma, DeepSeek via API or Codex CLI). Rejection‑sample on a checker.
- 03 Train / SFT / DPO tinygpt sft · dpo · es Full PEFT bundle: LoRA, LoRA+, DoRA, VeRA, LoftQ, AdaLoRA, RsLoRA, PISSA. DPO/SimPO/KTO/ORPO in one trainer.
- 04 Eval gate tinygpt eval‑gate BFCL, τ‑bench, lm‑eval (MLX adapter), HumanEval+sandbox, judge shim. Exit‑non‑zero on regression.
- 05 Serve tinygpt serve OpenAI + Ollama surfaces on the same socket. Continue.dev / any OpenAI client plugs in unchanged.
- 06 Agent tinygpt agent Multi‑turn, tool dispatch, persistent KV, FSM‑constrained JSON, optional
‑‑cloud‑escalate. - 07 Traces → data tinygpt traces‑to‑data Every rollout is a token‑preserving
.atraj. Filter, dedupe (MinHash), emit ChatML SFT JSONL. Then loop to step 02.
What ships, audited against the code.
PLAN.md is the canonical shipped/skipped/TODO ledger. Each pillar
below is one subset; clicking through gets you to specific files
and the papers each technique cites.
Train Every modern pre/post‑train, one trainer.
- Pretrain — byte‑level or BPE; WSD schedule, gradient checkpointing, spike recovery
- SFT — response‑only masking, ChatML / Alpaca / Llama / plain
- Preference — DPO, SimPO, KTO, ORPO (one trainer, flags)
- PEFT — LoRA, LoRA+, DoRA (TGLA v2 on disk), VeRA, LoftQ, AdaLoRA, RsLoRA, PISSA, LoRA‑FA
- Optimisers — AdamW, Lion, Sophia, Muon, Adafactor, GaLore
- Architecture — RoPE+GQA, sliding window, ALiBi, MoE, MoD, differential attn, YOCO, MTP
Distill Frontier teacher → small student.
- KL + NLL mix loss with temperature and α (Hinton 2015)
- Rejection‑sampled trajectories — keep only what a checker passes
- Gold‑clone fallback — on verifiable tasks the gold IS the trajectory; teacher‑free reproduces the same model
- Trace replay — render in the student's own chat template via
render_sft_from_traj.py - Headline result — 4B at frontier‑parity, documented honestly
- Shared schema — every eval emits the same JSONL row (E0)
- tinygpt eval‑bfcl — 10 BFCL categories, OpenAI‑compat shim into
tinygpt serve - tinygpt eval‑tau‑bench — retail + airline, configurable user simulator
- tinygpt run‑lm‑eval — MLX adapter for
lm‑evaluation‑harness, two modes - tinygpt eval‑humaneval — Rust +
sandbox‑exec code execution - tinygpt eval‑gate — CI‑grade regression gate, exits non‑zero
Serve · Agent OpenAI‑compatible. Locally.
- tinygpt serve — OpenAI and Ollama on the same socket; Continue.dev provider compat
- tinygpt agent — multi‑turn loop, tool dispatch, persistent KV, FSM JSON
- ‑‑cloud‑escalate — defer to Anthropic / OpenAI only when the local model wants to
- Speculative decoding — vanilla, Medusa, EAGLE‑2 (trainable heads)
- StreamingLLM sink, KV‑cache quant (KIVI), prefix caching
- Token‑preserving traces —
.atraj rollouts feed next round of SFT
Interp Open the model up.
- Logit lens · tuned lens (trainable per‑layer probes)
- Activation patching — zero + donor‑swap, on (layer, position)
- Per‑layer ablation · attention heatmap
- Linear probes — trainable per‑layer classifiers,
.lp sidecar - ROME · MEMIT — rank‑1 and rank‑K fact editing, multi‑layer with key‑norm weighting
- Sparse autoencoders —
tinygpt sae, group‑SAE, SAELens / Neuronpedia export
In a browser tab The same model, no install.
Hand‑written WGSL kernels train a GPT‑2 in your tab.
Blocked 4×4 matmul (5.18× kernel speedup at 2048³), FA2
forward + backward in WGSL, Memory64 lifts the 4 GB tab
heap so a 473M‑param model allocates cleanly.
End‑to‑end parity vs. the WASM reference: ≤ 2.5%
loss drift.
Open the playground → · See the speedup curve →
Run it.
Two paths. The Mac path is the primary one; the browser path is for
when you want everything in‑tab, no toolchain, no install.
# build the CLI
cd native‑mac
export DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer
xcodebuild -scheme tinygpt -destination 'platform=macOS,arch=arm64' \
-derivedDataPath .xcode-build build
# one‑command quickstart: data → specialist → chat
tinygpt quickstart --data my.jsonl
# or distill from a local teacher
tinygpt distill --teacher qwen3-4b --student huge \
--data ./traces --out ./student.tinygpt
# then serve it OpenAI‑compatible
tinygpt serve --model ./student.tinygpt --port 8080
Full build instructions →
The playground builds a transformer in WebAssembly + WebGPU,
trains it in a Web Worker so the UI never freezes, and lets you
watch the loss curve live as the model picks up structure.
Every interpretability surface above is wired in —
attention heatmap, logit lens, ablation, patching — under
one "Inspect & evaluate" panel.
- Parity‑tested against the WASM reference to ≤ 2.5% loss drift
- OPFS checkpoint persistence — a run survives a tab refresh
- Capability detection picks the right preset for your machine
Open the playground →
Honest scope.
What this is
- A single‑developer project, shipping in public, MIT.
- Mac‑first — M‑series, unified memory, MLX‑Swift.
- A factory for specialists, not a general assistant.
- An OpenAI‑compatible runtime any client already speaks.
- A research substrate for interp, eval, and distillation.
What this isn’t (yet)
-
A general‑skill specialist. The frontier‑parity 4B
dropped on out‑of‑domain BFCL (60 → 42%) —
real catastrophic forgetting. Mixed‑backend distillation
is the documented fix.
- Multi‑GPU or distributed training. One device, single Mac.
- A cloud product. Nothing leaves the laptop unless
‑‑cloud‑escalate is set. - An enterprise platform. No SSO, no tenancy, no SLA.
- A finished story. The roadmap shows what’s shipped and what’s next.
Read the journey.
Decision logs and per‑technique explainers; each ties back to
the paper it cites and the file where it lives.
Tool‑calling: frontier‑parity at 4B The headline result, the distillation recipe, and the honest tradeoffs PLAN.md — shipped · skipped · TODO The canonical ledger, audited against the code Knowledge distillation KL + NLL mix loss, temperature, α, rejection sampling Three phases of training Pretrain → SFT → DPO with paste‑ready commands Interpretability tools Logit lens, tuned lens, ablation, patching, SAE, ROME, MEMIT Agent runtime Multi‑turn loop, tool dispatch, traces, cloud escalation Eval leaderboard viewer Drag‑drop a JSONL; compare by step / model / task Lessons from the build The bugs and surprises that were worth more than the kernels