An open-source LLM factory for one Mac.

Build routed specialists that earn their keep — on a laptop you already own.

TinyGPT is a Swift codebase + 30+ CLI subcommands for the full post‑training stack on Apple Silicon: train, distill from a local teacher, fine‑tune with LoRA / DoRA / QLoRA, gate with real evals (BFCL, τ‑bench, lm‑eval), serve OpenAI‑compatible, and open the model up with sparse autoencoders, MEMIT, activation patching. Same model also trains and runs in a browser tab via hand‑written WebGPU kernels. The first packaged specialist is honest by design: frontier‑level on file‑ops, routed only because it regresses outside that domain.

First package: qwen3‑4B file‑ops specialist · 7.5 GB local HF/MLX artifact · routed specialist, not a general planner.

The strongest measured claim

A 4B routed file‑ops specialist, locally distilled, matches the frontier on its multi‑turn hard gate.

Qwen3‑4B file‑ops specialist
  • DeepSeek‑V4‑pro frontier 100%
  • Qwen3‑4B file‑ops specialist tinygpt 100%
  • Gemma‑4‑12B‑qat 83%
  • Qwen3‑4B (stock + plan prompt) 75%
  • Gemma‑3‑12B 33%

BFCL multi‑turn hard gate, n=12 GorillaFileSystem agentic tasks, task‑completion rate. Distillation: ~99 frontier trajectories × LoRA SFT, single Mac, no cloud train. Cheap routed specialist > bigger general on one domain is the project thesis — not "4B > 12B in general." The same file‑ops specialist regresses on out‑of‑domain breadth (60 → 42%), so TinyGPT packages it with the caveat instead of pretending it is a general planner. Full writeup, including the honest tradeoffs →

Measured, on one Mac.

Numbers below are from a stock M5 Pro / 48 GB. Reproducible via tinygpt bench and the linked artefacts. Where a number applies only to a specific preset or build, the preset is named.

Decode throughput
696tok/s

Huge preset, 221M params. 293 tok/s on the 960M Mega pilot.

TTFT warm
5.8msp99

Cold start: 24 ms on a 1B model. ITL p99: 4.9 ms.

Training step
42ms

Huge preset, 17.2× the same model in‑browser via WebGPU.

WebGPU vs WASM SIMD
12.1×

At d_model=256. Scales 2.6×→12.1× as d_model grows 96→256.

ANE M8 decode
17tok/s

28‑block Qwen3 chain on the Apple Neural Engine, layer‑chunked Core ML.

Largest fit
960M

Params trainable end‑to‑end on unified memory; 473M in a browser tab via Memory64.

Spec‑decode speedup
1.4×

serve ‑‑draft‑model, Qwen3 0.6B draft → 4B target. Lossless greedy; content‑dependent, 1.0–1.4×.

The cheapest token is the one you don’t rent.

Cloud serverless $0.20/ 1M tokens

A 4B model on Fireworks serverless (4B–16B tier). Every prompt and completion leaves the device.

The same model, on your Mac $0marginal

After a one‑time download: no metering, runs offline, and the data never leaves the laptop.

Cloud rate: Fireworks serverless pricing, 4B–16B tier, June 2026. Local marginal cost excludes electricity.

The loop the runtime closes.

Each surface emits the input the next one needs. The agent runtime records token‑preserving trajectories; those trajectories become SFT data; that data trains the next specialist; the eval gate decides whether it ships.

  1. Data tinygpt download‑dataset 22 curated sets. HF streaming, Magpie synth, GitHub issue→PR, MS‑MARCO, the‑stack‑smol.
  2. Distill tinygpt distill KL + NLL teacher→student. Local teacher (Qwen3, Gemma, DeepSeek via API or Codex CLI). Rejection‑sample on a checker.
  3. Train / SFT / DPO tinygpt sft · dpo · es Full PEFT bundle: LoRA, LoRA+, DoRA, VeRA, LoftQ, AdaLoRA, RsLoRA, PISSA. DPO/SimPO/KTO/ORPO in one trainer.
  4. Eval gate tinygpt eval‑gate BFCL, τ‑bench, lm‑eval (MLX adapter), HumanEval+sandbox, judge shim. Exit‑non‑zero on regression.
  5. Serve tinygpt serve OpenAI + Ollama surfaces on the same socket. Continue.dev / any OpenAI client plugs in unchanged.
  6. Agent tinygpt agent Multi‑turn, tool dispatch, persistent KV, FSM‑constrained JSON, optional ‑‑cloud‑escalate.
  7. Traces → data tinygpt traces‑to‑data Every rollout is a token‑preserving .atraj. Filter, dedupe (MinHash), emit ChatML SFT JSONL. Then loop to step 02.

What ships, audited against the code.

PLAN.md is the canonical shipped/skipped/TODO ledger. Each pillar below is one subset; clicking through gets you to specific files and the papers each technique cites.

Train

Every modern pre/post‑train, one trainer.

  • Pretrain — byte‑level or BPE; WSD schedule, gradient checkpointing, spike recovery
  • SFT — response‑only masking, ChatML / Alpaca / Llama / plain
  • Preference — DPO, SimPO, KTO, ORPO (one trainer, flags)
  • PEFT — LoRA, LoRA+, DoRA (TGLA v2 on disk), VeRA, LoftQ, AdaLoRA, RsLoRA, PISSA, LoRA‑FA
  • Optimisers — AdamW, Lion, Sophia, Muon, Adafactor, GaLore
  • Architecture — RoPE+GQA, sliding window, ALiBi, MoE, MoD, differential attn, YOCO, MTP
Distill

Frontier teacher → small student.

  • KL + NLL mix loss with temperature and α (Hinton 2015)
  • Rejection‑sampled trajectories — keep only what a checker passes
  • Gold‑clone fallback — on verifiable tasks the gold IS the trajectory; teacher‑free reproduces the same model
  • Trace replay — render in the student's own chat template via render_sft_from_traj.py
  • Headline result — 4B at frontier‑parity, documented honestly
Eval

Where the moat is.

  • Shared schema — every eval emits the same JSONL row (E0)
  • tinygpt eval‑bfcl — 10 BFCL categories, OpenAI‑compat shim into tinygpt serve
  • tinygpt eval‑tau‑bench — retail + airline, configurable user simulator
  • tinygpt run‑lm‑eval — MLX adapter for lm‑evaluation‑harness, two modes
  • tinygpt eval‑humaneval — Rust + sandbox‑exec code execution
  • tinygpt eval‑gate — CI‑grade regression gate, exits non‑zero
Serve · Agent

OpenAI‑compatible. Locally.

  • tinygpt serve — OpenAI and Ollama on the same socket; Continue.dev provider compat
  • tinygpt agent — multi‑turn loop, tool dispatch, persistent KV, FSM JSON
  • ‑‑cloud‑escalate — defer to Anthropic / OpenAI only when the local model wants to
  • Speculative decoding in serve‑‑draft‑model draft+verify, ~1.4× decode, lossless greedy; vanilla/Medusa/EAGLE‑2 heads in the CLI
  • StreamingLLM sink, KV‑cache quant (KIVI), prefix caching
  • Token‑preserving traces.atraj rollouts feed next round of SFT
Interp

Open the model up.

  • Logit lens · tuned lens (trainable per‑layer probes)
  • Activation patching — zero + donor‑swap, on (layer, position)
  • Per‑layer ablation · attention heatmap
  • Linear probes — trainable per‑layer classifiers, .lp sidecar
  • ROME · MEMIT — rank‑1 and rank‑K fact editing, multi‑layer with key‑norm weighting
  • Sparse autoencoderstinygpt sae, group‑SAE, SAELens / Neuronpedia export
In a browser tab

The same model, no install.

Hand‑written WGSL kernels train a GPT‑2 in your tab. Blocked 4×4 matmul (5.18× kernel speedup at 2048³), FA2 forward + backward in WGSL, Memory64 lifts the 4 GB tab heap so a 473M‑param model allocates cleanly. End‑to‑end parity vs. the WASM reference: ≤ 2.5% loss drift.

Open the playground → See the speedup curve →

Run it.

Two paths. The Mac path is the primary one; the browser path is for when you want everything in‑tab, no toolchain, no install.

On a Mac

macOS 14+ · Xcode · Apple Silicon · MLX‑Swift

# build the CLI
cd native‑mac
export DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer
xcodebuild -scheme tinygpt -destination 'platform=macOS,arch=arm64' \
  -derivedDataPath .xcode-build build

# one‑command quickstart: data → specialist → chat
tinygpt quickstart --data my.jsonl

# or distill from a local teacher
tinygpt distill --teacher qwen3-4b --student huge \
  --data ./traces --out ./student.tinygpt

# then serve it OpenAI‑compatible
tinygpt serve --model ./student.tinygpt --port 8080

Full build instructions →

In a browser tab

WebGPU · Chrome 113+ / Safari 18+ · no install

The playground builds a transformer in WebAssembly + WebGPU, trains it in a Web Worker so the UI never freezes, and lets you watch the loss curve live as the model picks up structure. Every interpretability surface above is wired in — attention heatmap, logit lens, ablation, patching — under one "Inspect & evaluate" panel.

  • Parity‑tested against the WASM reference to ≤ 2.5% loss drift
  • OPFS checkpoint persistence — a run survives a tab refresh
  • Capability detection picks the right preset for your machine

Open the playground →

Honest scope.

What this is

  • A single‑developer project, shipping in public, MIT.
  • Mac‑first — M‑series, unified memory, MLX‑Swift.
  • A factory for specialists, not a general assistant.
  • An OpenAI‑compatible runtime any client already speaks.
  • A research substrate for interp, eval, and distillation.

What this isn’t (yet)

  • A general‑skill specialist. The frontier‑parity 4B dropped on out‑of‑domain BFCL (60 → 42%) — real catastrophic forgetting. Mixed‑backend distillation is the documented fix.
  • Multi‑GPU or distributed training. One device, single Mac.
  • A cloud product. Nothing leaves the laptop unless ‑‑cloud‑escalate is set.
  • An enterprise platform. No SSO, no tenancy, no SLA.
  • A finished story. The roadmap shows what’s shipped and what’s next.

Read the journey.

Decision logs and per‑technique explainers; each ties back to the paper it cites and the file where it lives.