TinyGPT

in-browser playground

Train a small GPT, the kind of model behind ChatGPT but with ~0.8 million parameters instead of a trillion, from scratch, right here in this tab. No server, no install. Watch the loss curve fall, then ask it to write you a sentence.

Train

Corpus
picks load automatically · or paste your own text below
Hyperparameters — preset above sets these. Click to edit individually.
Model
Estimated run time
Advanced

Want to go further?
4 Resources: curated rabbit holes (3Blue1Brown, Karpathy, papers, and this repo's own docs)

Start here (visual + intuitive)

Go deeper (code + papers)

This project's own write-ups

Related projects

5 Diagnostics & how to go faster: WebGPU matmul benchmark, the Python CLI for 10M+ models, and what makes training fast

Making it faster — what works, in order of impact

  1. Train locally with Python (50–100×). The python_ref/train.py path uses PyTorch with CUDA / Apple MPS and runs on every core. A 10M model trains in ~24 s / 1k steps on an M5 Pro — comfortable iteration speed.
  2. WebGPU backend (expected ≈3–10× on real hardware, not yet measured). The shaders are correct end-to-end (24 of 24 kernels parity-checked), but the real-GPU speedup is unmeasured here because the project's CI ran on a software adapter (SwiftShader); see docs/notes.md §10. Try it on your machine and watch tokens/sec.
  3. WASM SIMD (1.6× over scalar WASM). Already on if your browser supports it (the green pill at top confirms). The matmul inner loop processes four floats per instruction instead of one.
  4. Bigger batch (sub-linear). Step time grows more slowly than batch size: better cache utilisation per step and fewer kernel dispatches per token. Memory grows linearly in batch × ctx, but because the matmul cost dominates, a larger batch is nearly free up to your RAM limit.
  5. Smaller model (linear). Throughput scales roughly as 1/params for transformer training, because the matmul work grows as params × batch × ctx. Drop d_model from 96 to 48 and per-step time falls to about a quarter, since the matmul weights scale as d_model².
  6. Multi-threaded WASM (4–8×, not implemented here). Would need SharedArrayBuffer and worker threads. Still open.
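The d_model² claim in item 5 can be checked with a rough weight count. This is a minimal sketch, not the project's own accounting: it assumes a standard transformer block with four d_model × d_model attention projections and an MLP with 4× expansion, and it ignores embeddings, biases and layer norms. The function name `matmul_params` and the depth of 4 layers are made up for illustration.

```python
def matmul_params(d_model: int, n_layer: int, mlp_ratio: int = 4) -> int:
    """Weights in the per-layer matmuls (assumed standard block layout)."""
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up-projection and down-projection
    return n_layer * (attention + mlp)

n_layer = 4  # assumed depth, for illustration only
big = matmul_params(96, n_layer)
small = matmul_params(48, n_layer)
ratio = big / small

print(f"d_model=96: {big:,} matmul weights")
print(f"d_model=48: {small:,} matmul weights")
print(f"ratio: {ratio:.1f}x")
```

Under these assumptions, halving d_model cuts the matmul weights by a factor of four regardless of depth, which is where the "per-step time quarters" estimate comes from. Embedding tables scale only linearly in d_model, so the real-world ratio for a very small model will be somewhat below 4.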

Full speed write-up: docs/performance.md. The journey of every lever (shipped, blocked, open, and why) is covered in the performance journey write-up.

The WebGPU matmul benchmark runs the same matmul on the WebGPU compute kernel and the WASM kernel, checks that they agree, and reports the speedup. Needs Chrome / Edge 113+.
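The shape of such a parity benchmark (run two kernels on the same inputs, compare the outputs, then report the timing ratio) can be sketched in plain Python. This is an illustration only: the two functions below are stand-ins for the WASM and WebGPU kernels, not the project's code, and the 1e-9 tolerance and 48 × 48 size are arbitrary choices.

```python
import random
import time

def matmul_naive(a, b):
    """Reference kernel: plain triple loop over lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out

def matmul_transposed(a, b):
    """Candidate kernel: transpose b once, then take row-by-row dot products."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def max_abs_diff(x, y):
    """Largest element-wise difference between two matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))

def timed(fn, a, b):
    """Run fn(a, b) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(a, b)
    return result, time.perf_counter() - start

random.seed(0)
size = 48  # arbitrary size for the sketch
a = [[random.uniform(-1, 1) for _ in range(size)] for _ in range(size)]
b = [[random.uniform(-1, 1) for _ in range(size)] for _ in range(size)]

ref, t_ref = timed(matmul_naive, a, b)
cand, t_cand = timed(matmul_transposed, a, b)

diff = max_abs_diff(ref, cand)
agree = diff < 1e-9  # assumed tolerance; float32 GPU kernels would need a looser one
print(f"max abs diff: {diff:.2e} (agree: {agree})")
print(f"speedup: {t_ref / t_cand:.2f}x")
```

Checking agreement before reporting a speedup matters: a faster kernel that produces different numbers is not a valid comparison. A single timed run is noisy, so a real benchmark would typically warm up first and average several runs.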


Train larger models locally

In-browser training is single-threaded WASM, comfortable up to ~1M params. For 5–25M+ params (a few minutes on a laptop), run the Python reference, which uses your GPU (Apple MPS / CUDA).

git clone https://github.com/sarthakagrawal927/tinygpt
cd tinygpt
python -m venv python_ref/.venv && source python_ref/.venv/bin/activate
pip install -r python_ref/requirements.txt

# measure how fast your machine trains:
python python_ref/bench.py

# train a ~10.8M model on your own text:
python python_ref/train.py --model-config configs/model.small.json \
    --data your-text.txt --out checkpoints/run

# generate from it:
python python_ref/sample.py --checkpoint checkpoints/run --prompt "Once "

Keyboard shortcuts

  ?                  Show this sheet
  Esc                Close any popover / dialog
  ⌘ / Ctrl + Enter   Start training
  ⌘ / Ctrl + G       Generate from the model
  T                  Take the tour
  S                  Share this setup
  P                  Pause / resume training
  1–5                Pick a size preset (Tiny → XL)

Welcome to TinyGPT

A complete transformer that trains from scratch — right here, in this tab, with no server. Load the pretrained Shakespeare model to see it work in one click, or train your own from scratch (a few minutes on the small preset, ~15 minutes on the larger ones). Every layer was written by hand. Want a 90-second tour first?