An open-source LLM factory for one Mac.

Build routed specialists that earn their keep — on a laptop you already own.

TinyGPT is a Swift codebase + 30+ CLI subcommands for the full post‑training stack on Apple Silicon: train, distill from a local teacher, fine‑tune with LoRA / DoRA / QLoRA, gate with real evals (BFCL, τ‑bench, lm‑eval), serve OpenAI‑compatible, and open the model up with sparse autoencoders, MEMIT, activation patching. Same model also trains and runs in a browser tab via hand‑written WebGPU kernels. The first packaged specialist is honest by design: frontier‑level on file‑ops, routed only because it regresses outside that domain.

First package: qwen3‑4B file‑ops specialist · 7.5 GB local HF/MLX artifact · routed specialist, not a general planner.

The strongest measured claim

A 4B routed file‑ops specialist, locally distilled, matches the frontier on its multi‑turn hard gate.

Qwen3‑4B file‑ops specialist
  • DeepSeek‑V4‑pro frontier 100%
  • Qwen3‑4B file‑ops specialist tinygpt 100%
  • Gemma‑4‑12B‑qat 83%
  • Qwen3‑4B (stock + plan prompt) 75%
  • Gemma‑3‑12B 33%

BFCL multi‑turn hard gate, n=12 GorillaFileSystem agentic tasks, task‑completion rate. Distillation: ~99 frontier trajectories × LoRA SFT, single Mac, no cloud train. Cheap routed specialist > bigger general on one domain is the project thesis — not "4B > 12B in general." The same file‑ops specialist regresses on out‑of‑domain breadth (60 → 42%), so TinyGPT packages it with the caveat instead of pretending it is a general planner. Full writeup, including the honest tradeoffs →

Measured, on one Mac.

Numbers below are from a stock M5 Pro / 48 GB. Reproducible via tinygpt bench and the linked artefacts. Where a number applies only to a specific preset or build, the preset is named.

Decode throughput
696tok/s

Huge preset, 221M params. 293 tok/s on the 960M Mega pilot.

TTFT warm
5.8msp99

Cold start: 24 ms on a 1B model. ITL p99: 4.9 ms.

Training step
42ms

Huge preset, 17.2× the same model in‑browser via WebGPU.

WebGPU vs WASM SIMD
12.1×

At d_model=256. Scales 2.6×→12.1× as d_model grows 96→256.

ANE M8 decode
17tok/s

28‑block Qwen3 chain on the Apple Neural Engine, layer‑chunked Core ML.

Largest fit
960M

Params trainable end‑to‑end on unified memory; 473M in a browser tab via Memory64.

The loop the runtime closes.

Each surface emits the input the next one needs. The agent runtime records token‑preserving trajectories; those trajectories become SFT data; that data trains the next specialist; the eval gate decides whether it ships.

  1. Data tinygpt download‑dataset 22 curated sets. HF streaming, Magpie synth, GitHub issue→PR, MS‑MARCO, the‑stack‑smol.
  2. Distill tinygpt distill KL + NLL teacher→student. Local teacher (Qwen3, Gemma, DeepSeek via API or Codex CLI). Rejection‑sample on a checker.
  3. Train / SFT / DPO tinygpt sft · dpo · es Full PEFT bundle: LoRA, LoRA+, DoRA, VeRA, LoftQ, AdaLoRA, RsLoRA, PISSA. DPO/SimPO/KTO/ORPO in one trainer.
  4. Eval gate tinygpt eval‑gate BFCL, τ‑bench, lm‑eval (MLX adapter), HumanEval+sandbox, judge shim. Exit‑non‑zero on regression.
  5. Serve tinygpt serve OpenAI + Ollama surfaces on the same socket. Continue.dev / any OpenAI client plugs in unchanged.
  6. Agent tinygpt agent Multi‑turn, tool dispatch, persistent KV, FSM‑constrained JSON, optional ‑‑cloud‑escalate.
  7. Traces → data tinygpt traces‑to‑data Every rollout is a token‑preserving .atraj. Filter, dedupe (MinHash), emit ChatML SFT JSONL. Then loop to step 02.

What ships, audited against the code.

PLAN.md is the canonical shipped/skipped/TODO ledger. Each pillar below is one subset; clicking through gets you to specific files and the papers each technique cites.

Train

Every modern pre/post‑train, one trainer.

  • Pretrain — byte‑level or BPE; WSD schedule, gradient checkpointing, spike recovery
  • SFT — response‑only masking, ChatML / Alpaca / Llama / plain
  • Preference — DPO, SimPO, KTO, ORPO (one trainer, flags)
  • PEFT — LoRA, LoRA+, DoRA (TGLA v2 on disk), VeRA, LoftQ, AdaLoRA, RsLoRA, PISSA, LoRA‑FA
  • Optimisers — AdamW, Lion, Sophia, Muon, Adafactor, GaLore
  • Architecture — RoPE+GQA, sliding window, ALiBi, MoE, MoD, differential attn, YOCO, MTP
Distill

Frontier teacher → small student.

  • KL + NLL mix loss with temperature and α (Hinton 2015)
  • Rejection‑sampled trajectories — keep only what a checker passes
  • Gold‑clone fallback — on verifiable tasks the gold IS the trajectory; teacher‑free reproduces the same model
  • Trace replay — render in the student's own chat template via render_sft_from_traj.py
  • Headline result — 4B at frontier‑parity, documented honestly
Eval

Where the moat is.

  • Shared schema — every eval emits the same JSONL row (E0)
  • tinygpt eval‑bfcl — 10 BFCL categories, OpenAI‑compat shim into tinygpt serve
  • tinygpt eval‑tau‑bench — retail + airline, configurable user simulator
  • tinygpt run‑lm‑eval — MLX adapter for lm‑evaluation‑harness, two modes
  • tinygpt eval‑humaneval — Rust + sandbox‑exec code execution
  • tinygpt eval‑gate — CI‑grade regression gate, exits non‑zero
Serve · Agent

OpenAI‑compatible. Locally.

  • tinygpt serve — OpenAI and Ollama on the same socket; Continue.dev provider compat
  • tinygpt agent — multi‑turn loop, tool dispatch, persistent KV, FSM JSON
  • ‑‑cloud‑escalate — defer to Anthropic / OpenAI only when the local model wants to
  • Speculative decoding — vanilla, Medusa, EAGLE‑2 (trainable heads)
  • StreamingLLM sink, KV‑cache quant (KIVI), prefix caching
  • Token‑preserving traces.atraj rollouts feed next round of SFT
Interp

Open the model up.

  • Logit lens · tuned lens (trainable per‑layer probes)
  • Activation patching — zero + donor‑swap, on (layer, position)
  • Per‑layer ablation · attention heatmap
  • Linear probes — trainable per‑layer classifiers, .lp sidecar
  • ROME · MEMIT — rank‑1 and rank‑K fact editing, multi‑layer with key‑norm weighting
  • Sparse autoencoderstinygpt sae, group‑SAE, SAELens / Neuronpedia export
In a browser tab

The same model, no install.

Hand‑written WGSL kernels train a GPT‑2 in your tab. Blocked 4×4 matmul (5.18× kernel speedup at 2048³), FA2 forward + backward in WGSL, Memory64 lifts the 4 GB tab heap so a 473M‑param model allocates cleanly. End‑to‑end parity vs. the WASM reference: ≤ 2.5% loss drift.

Open the playground → See the speedup curve →

Run it.

Two paths. The Mac path is the primary one; the browser path is for when you want everything in‑tab, no toolchain, no install.

On a Mac

macOS 14+ · Xcode · Apple Silicon · MLX‑Swift

# build the CLI
cd native‑mac
export DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer
xcodebuild -scheme tinygpt -destination 'platform=macOS,arch=arm64' \
  -derivedDataPath .xcode-build build

# one‑command quickstart: data → specialist → chat
tinygpt quickstart --data my.jsonl

# or distill from a local teacher
tinygpt distill --teacher qwen3-4b --student huge \
  --data ./traces --out ./student.tinygpt

# then serve it OpenAI‑compatible
tinygpt serve --model ./student.tinygpt --port 8080

Full build instructions →

In a browser tab

WebGPU · Chrome 113+ / Safari 18+ · no install

The playground builds a transformer in WebAssembly + WebGPU, trains it in a Web Worker so the UI never freezes, and lets you watch the loss curve live as the model picks up structure. Every interpretability surface above is wired in — attention heatmap, logit lens, ablation, patching — under one "Inspect & evaluate" panel.

  • Parity‑tested against the WASM reference to ≤ 2.5% loss drift
  • OPFS checkpoint persistence — a run survives a tab refresh
  • Capability detection picks the right preset for your machine

Open the playground →

Honest scope.

What this is

  • A single‑developer project, shipping in public, MIT.
  • Mac‑first — M‑series, unified memory, MLX‑Swift.
  • A factory for specialists, not a general assistant.
  • An OpenAI‑compatible runtime any client already speaks.
  • A research substrate for interp, eval, and distillation.

What this isn’t (yet)

  • A general‑skill specialist. The frontier‑parity 4B dropped on out‑of‑domain BFCL (60 → 42%) — real catastrophic forgetting. Mixed‑backend distillation is the documented fix.
  • Multi‑GPU or distributed training. One device, single Mac.
  • A cloud product. Nothing leaves the laptop unless ‑‑cloud‑escalate is set.
  • An enterprise platform. No SSO, no tenancy, no SLA.
  • A finished story. The roadmap shows what’s shipped and what’s next.

Read the journey.

Decision logs and per‑technique explainers; each ties back to the paper it cites and the file where it lives.