Model guide — building TinyGPT from scratch

Phase 1–2. Build a tiny GPT-style causal language model. First goal is correctness, not impressive output.

Exact numbers live in configs/model.byte-tinygpt-v0.json and configs/training.json — this doc explains them.

1. What you are building

A tiny GPT-style causal language model:

input tokens
  → token embeddings
  → position embeddings
  → transformer blocks
  → final layernorm
  → logits over vocabulary
  → next-token prediction loss

For v0, use a byte-level tokenizer: vocab_size = 256. Every byte is a token. This avoids all BPE / tokenizer complexity.
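A minimal sketch of the byte-level tokenizer described above, in plain Python. The function names `encode` and `decode` are illustrative assumptions, not names taken from the project.

```python
def encode(text: str) -> list[int]:
    """Turn text into byte tokens: one token per UTF-8 byte, each in 0..255."""
    return list(text.encode("utf-8"))


def decode(tokens: list[int]) -> str:
    """Turn byte tokens back into text; invalid UTF-8 becomes U+FFFD."""
    return bytes(tokens).decode("utf-8", errors="replace")


tokens = encode("Hello")
print(tokens)          # [72, 101, 108, 108, 111]
print(decode(tokens))  # Hello
```

Non-ASCII characters take more than one token (for example, "é" encodes to two bytes), and a sampled sequence can end in the middle of a character, which is why the sketch decodes with `errors="replace"`.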

2. MVP model spec

{
  "model_name": "byte-tinygpt-v0",
  "vocab_size": 256,
  "context_length": 128,
  "n_layers": 4,
  "n_heads": 4,
  "d_model": 128,
  "d_mlp": 512,
  "dropout": 0.0,
  "tie_embeddings": true,
  "dtype": "float32"
}

Expected size of the reference config above: roughly 0.8M parameters. Intentionally small. The browser playground exposes a preset table from 360k (Small) to ~470M (Behemoth via Memory64), backed by the same architecture.
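The "roughly 0.8M" figure can be checked by hand. The sketch below counts parameters for the reference config, assuming every Linear layer has a bias and every LayerNorm has a learned scale and shift; the guide does not state either, so both are assumptions, and `count_params` is a hypothetical helper name.

```python
def count_params(cfg: dict, bias: bool = True) -> int:
    """Count trainable parameters of the GPT described in this guide."""
    d, d_mlp = cfg["d_model"], cfg["d_mlp"]
    b = 1 if bias else 0

    embeddings = cfg["vocab_size"] * d + cfg["context_length"] * d
    layernorm = 2 * d                                    # scale + shift
    attention = 4 * (d * d + b * d)                      # Wq, Wk, Wv, Wo
    mlp = (d * d_mlp + b * d_mlp) + (d_mlp * d + b * d)  # two Linear layers
    block = 2 * layernorm + attention + mlp

    total = embeddings + cfg["n_layers"] * block + layernorm  # last term: final layernorm
    if not cfg["tie_embeddings"]:
        total += d * cfg["vocab_size"]                   # separate output projection
    return total


cfg = {"vocab_size": 256, "context_length": 128, "n_layers": 4,
       "d_model": 128, "d_mlp": 512, "tie_embeddings": True}
print(count_params(cfg))              # 842496
print(count_params(cfg, bias=False))  # 837888
```

Either way the total lands near 0.84M, and almost all of it (about 0.79M) sits in the four transformer blocks rather than in the embeddings.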

Why float32 everywhere? All training (Python, WASM, WebGPU) uses float32 for numeric stability: gradients on tiny models are unforgiving, and lower precision used up the loss-drift budget faster than it bought speed. f16 lives in the project as an inference-only path, gated behind the end-to-end parity tests (see the “f16-packed storage” entry in the README’s “Negative results” section for what didn’t pan out and why).

3. Data requirements

Plain text only. See data/README.md for good/bad sources and dataset sizes. Byte-level: 1 byte = 1 token, so a 1 MB file ≈ 1 million tokens.

Stage	Size	Purpose
Smoke test	1–10 KB	Check loss decreases
Overfit test	10–100 KB	Prove gradients are correct
Demo dataset	500 KB–5 MB	Realistic browser demo
Stress test	10–100 MB	Later only

4. Dataset pipeline

raw text → UTF-8 bytes → integer token array → train/val split
         → random batch sampler → (x, y) pairs

tokens = [72, 101, 108, 108, 111, ...]
x = tokens[i : i + context_length]
y = tokens[i + 1 : i + context_length + 1]
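The slicing above can be turned into a small sampler. This is a sketch under two assumptions the guide does not spell out: the split is contiguous (first 90% train, last 10% val), and window starts are drawn uniformly at random. `split_tokens` and this `get_batch` signature are illustrative.

```python
import random


def split_tokens(tokens, train_split=0.9):
    """Contiguous split: the first part for training, the rest for validation."""
    n_train = int(len(tokens) * train_split)
    return tokens[:n_train], tokens[n_train:]


def get_batch(tokens, batch_size, context_length, rng):
    """Sample random windows; y is x shifted one position to the left."""
    xs, ys = [], []
    max_start = len(tokens) - context_length - 1   # last start that leaves room for y
    for _ in range(batch_size):
        i = rng.randint(0, max_start)              # inclusive on both ends
        xs.append(tokens[i : i + context_length])
        ys.append(tokens[i + 1 : i + context_length + 1])
    return xs, ys


tokens = list(("Hello, world! " * 40).encode("utf-8"))
train, val = split_tokens(tokens)
x, y = get_batch(train, batch_size=4, context_length=16, rng=random.Random(42))
```

Passing an explicit `random.Random(seed)` keeps batch order reproducible, which matters when comparing a from-scratch implementation against the reference.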

Split 90% train / 10% val. Write a dataset manifest — the hash is what makes checkpoint resume reproducible:

{
  "dataset_id": "sha256_of_raw_bytes",
  "name": "my_blog_posts.txt",
  "raw_bytes": 1249301,
  "token_count": 1249301,
  "tokenizer": "byte-v1",
  "train_split": 0.9,
  "val_split": 0.1,
  "seed": 42
}
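A manifest like the one above could be produced as follows. This is a sketch: `build_manifest` is a hypothetical helper, and the field values simply mirror the example JSON.

```python
import hashlib
import json


def build_manifest(raw: bytes, name: str, train_split: float = 0.9, seed: int = 42) -> dict:
    """Describe a dataset so a resumed run can check it is training on the same bytes."""
    return {
        "dataset_id": hashlib.sha256(raw).hexdigest(),
        "name": name,
        "raw_bytes": len(raw),
        "token_count": len(raw),                    # byte-level: one token per byte
        "tokenizer": "byte-v1",
        "train_split": train_split,
        "val_split": round(1.0 - train_split, 6),   # avoid 0.09999999999999998
        "seed": seed,
    }


raw = "some plain text\n".encode("utf-8")
manifest = build_manifest(raw, "example.txt")
print(json.dumps(manifest, indent=2))
```

On resume, recomputing the SHA-256 of the raw bytes and comparing it with `dataset_id` is a cheap way to detect that the text file changed since the checkpoint was written.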

5. Architecture details

Embeddings

token_embedding:    [vocab_size, d_model]
position_embedding: [context_length, d_model]
x = token_embedding[token_ids] + position_embedding[position_ids]

Transformer block — use pre-LayerNorm

x = x + attention(layernorm(x))
x = x + mlp(layernorm(x))

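Pre-LayerNorm is easier to train than post-LayerNorm: the residual path stays an unnormalized identity connection, so gradients reach early layers directly and training is usually stable without a long learning-rate warmup.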
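The two residual updates can be sketched for a single position's vector. This is a simplification: real attention mixes information across positions, the LayerNorm here has no learned scale or shift, and the sublayers are passed in as plain functions. `pre_ln_block` is an illustrative name.

```python
import math


def layernorm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_block(x, attention, mlp):
    """x = x + attention(layernorm(x)); x = x + mlp(layernorm(x))."""
    x = [a + b for a, b in zip(x, attention(layernorm(x)))]
    x = [a + b for a, b in zip(x, mlp(layernorm(x)))]
    return x


# Stand-in sublayers; a real block would use attention and an MLP.
zero = lambda h: [0.0] * len(h)
half = lambda h: [0.5 * v for v in h]

x = [1.0, 2.0, 3.0, 4.0]
print(pre_ln_block(x, zero, zero))   # [1.0, 2.0, 3.0, 4.0]
```

With sublayers that output zero, the block is an exact identity: the input is never normalized on the residual path, only on the branch that feeds each sublayer.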

Causal self-attention

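q = x @ Wq;  k = x @ Wk;  v = x @ Wv      # each [B, T, C], then split to [B, H, T, head_dim]
scores = q @ k.T / sqrt(head_dim)          # k.T = transpose of the last two axes; scores are [B, H, T, T]
scores = causal_mask(scores)               # set scores[t, s] = -inf for s > t
attn   = softmax(scores)                   # over the last axis (keys)
out    = attn @ v                          # [B, H, T, head_dim], heads then merged back to [B, T, C]
out    = out @ Wo                          # [B, T, C]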

Shapes (B batch, T seq, C d_model, H heads, head_dim = C / H):

B = 16   T = 128   C = 128   H = 4   head_dim = 32
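For checking a from-scratch implementation, it can help to have a slow but obviously correct version of the attention core. The sketch below handles one head of one sequence on plain lists and leaves out the Wq/Wk/Wv/Wo projections; a batched multi-head version performs the same computation for every (batch, head) pair. Rather than filling masked scores with -inf, it only ever looks at positions 0..t, which gives the same result.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head causal attention; q, k, v are [T][head_dim] lists."""
    T, head_dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(head_dim)
    out, weights = [], []
    for t in range(T):
        # Position t may only attend to positions 0..t.
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s])) for s in range(t + 1)]
        attn = softmax(scores)
        weights.append(attn + [0.0] * (T - t - 1))   # masked positions get weight 0
        out.append([sum(attn[s] * v[s][d] for s in range(t + 1))
                    for d in range(len(v[0]))])
    return out, weights


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = q
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = causal_attention(q, k, v)
print(out[0])   # [1.0, 2.0]
```

The first position can only see itself, so its output equals `v[0]` exactly. That makes a handy first assertion when porting attention to another language.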

MLP

Linear(d_model → 4 * d_model)  →  GELU  →  Linear(4 * d_model → d_model)

For d_model = 128: 128 → 512 → 128.
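A sketch of the MLP on a single vector, using the exact (erf-based) GELU. The guide does not say whether the exact form or the tanh approximation is used, so the choice here is an assumption, as are the helper names and the `[in_features][out_features]` weight layout.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """y = x @ weight + bias, with weight stored as [in_features][out_features]."""
    return [sum(xi * w_row[j] for xi, w_row in zip(x, weight)) + bias[j]
            for j in range(len(bias))]


def mlp(x, w1, b1, w2, b2):
    """Linear(d_model -> d_mlp) -> GELU -> Linear(d_mlp -> d_model)."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


# Toy sizes: d_model = 2, d_mlp = 8 (same 4x ratio as 128 -> 512 -> 128).
w1 = [[0.1] * 8, [-0.1] * 8]
b1 = [0.0] * 8
w2 = [[0.1, 0.2] for _ in range(8)]
b2 = [0.0, 0.0]
y = mlp([1.0, 2.0], w1, b1, w2, b2)
```

Whichever GELU variant is picked, the reference and the from-scratch implementation need to use the same one, or parity tests will show a small but persistent mismatch.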

Output head — tied embeddings

x = final_layernorm(x)
logits = x @ token_embedding.T
output_projection_weight = token_embedding_weight

Tied embeddings save vocab_size × d_model = 256 × 128 = 32,768 parameters and usually improve tiny models.
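Weight tying means the same table is read by row on the way in and used as a projection matrix on the way out. A minimal sketch on plain lists, with `d_model` shrunk to 8 for readability and position embeddings and the final layernorm left out; the function names and the 0.02 init scale are illustrative assumptions.

```python
import random

vocab_size, d_model = 256, 8      # d_model shrunk so the example stays small
rng = random.Random(0)
token_embedding = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                   for _ in range(vocab_size)]


def embed(token_id: int) -> list[float]:
    """Input side: look up one row of the table."""
    return token_embedding[token_id]


def logits(x: list[float]) -> list[float]:
    """Output side: x @ token_embedding.T, reusing the same table."""
    return [sum(a * b for a, b in zip(x, row)) for row in token_embedding]


scores = logits(embed(72))        # one score per vocabulary entry
```

In a backward pass the shared table receives gradient from both uses, so a from-scratch implementation has to accumulate the embedding-lookup gradient and the output-projection gradient into the same buffer.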

6. Loss function

Next-token cross-entropy. For a 256-byte vocab:

initial_loss ≈ ln(256) ≈ 5.545

Condition	Expected
Random model	loss near 5.545
Repeated tiny dataset	loss falls fast
Loss does not fall	bug in model / backprop / data
Loss becomes NaN	learning rate, softmax, grad explosion, bad init
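The expected starting loss follows directly from the definition: a model that assigns equal probability to all 256 bytes pays -ln(1/256) = ln(256) per token. A sketch of the per-position loss, using the log-sum-exp trick to avoid overflow (this `cross_entropy` takes one position's logits; a training loss would average it over all B × T positions):

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


uniform = [0.0] * 256                 # what an untrained model roughly produces
print(cross_entropy(uniform, 72))     # about 5.545, i.e. ln(256)

confident = [0.0] * 256
confident[72] = 20.0                  # strongly prefers the correct byte
print(cross_entropy(confident, 72))   # close to 0
```

Subtracting the maximum before exponentiating is one of the usual guards against the "softmax" cause of NaN listed in the table above.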

7. Training config

{
  "batch_size": 16,
  "learning_rate": 0.0003,
  "optimizer": "adamw",
  "betas": [0.9, 0.95],
  "eps": 1e-8,
  "weight_decay": 0.1,
  "grad_clip": 1.0,
  "max_steps": 10000,
  "eval_interval": 100,
  "sample_interval": 500,
  "checkpoint_interval": 500,
  "seed": 42
}

If the loss is unstable, lower the learning rate from 0.0003 to 0.0001.
If the loss falls too slowly on tiny data, raise it from 0.0003 to 0.001, but only after verifying gradients.
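Since the from-scratch path needs its own optimizer, here is a sketch of one AdamW update on flat lists, using the hyperparameters from the config above as defaults. It applies weight decay to every parameter; the guide does not say whether biases, LayerNorm weights, or embeddings are excluded, so that is an assumption. `adamw_step` is an illustrative name.

```python
import math


def adamw_step(params, grads, m, v, step, lr=3e-4, betas=(0.9, 0.95),
               eps=1e-8, weight_decay=0.1):
    """One in-place AdamW update on flat lists; step counts from 1."""
    b1, b2 = betas
    for i, g in enumerate(grads):
        params[i] -= lr * weight_decay * params[i]   # decoupled weight decay
        m[i] = b1 * m[i] + (1.0 - b1) * g            # first-moment estimate
        v[i] = b2 * v[i] + (1.0 - b2) * g * g        # second-moment estimate
        m_hat = m[i] / (1.0 - b1 ** step)            # bias correction
        v_hat = v[i] / (1.0 - b2 ** step)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


params, grads = [1.0, -2.0], [0.5, -0.5]
m, v = [0.0, 0.0], [0.0, 0.0]
adamw_step(params, grads, m, v, step=1)
print(params)   # each parameter moved by roughly lr, plus a small decay term
```

On the first step the bias-corrected ratio is close to ±1, so each parameter moves by about the learning rate regardless of gradient magnitude. The moment buffers `m` and `v` and the step counter are optimizer state and need to be saved in checkpoints alongside the weights.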

8. Training loop

for step in range(max_steps):
    # Forward pass on one random training batch
    x, y = get_batch("train")
    logits = model.forward(x)
    loss = cross_entropy(logits, y)

    # Backward pass: clear old gradients, backprop, clip, then update
    model.zero_grad()
    loss.backward()
    clip_grad_norm(model.parameters(), grad_clip)
    optimizer.step()

    # Periodic evaluation, sampling, and checkpointing
    if step % eval_interval == 0:
        val_loss = evaluate()
    if step % sample_interval == 0:
        sample_text = generate(prompt)
    if step % checkpoint_interval == 0:
        save_checkpoint()
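Gradient clipping in the loop above is by global norm: all gradients are treated as one long vector and rescaled together. A sketch on lists of lists, where each inner list stands for one parameter tensor's gradient; unlike the pseudocode, this version takes the gradients directly instead of parameter objects, and the small 1e-6 in the denominator is an assumption borrowed from common practice.

```python
import math


def clip_grad_norm(grads: list[list[float]], max_norm: float = 1.0) -> float:
    """Scale all gradients in place so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        for tensor in grads:
            for i in range(len(tensor)):
                tensor[i] *= scale
    return total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]       # global norm = 5
print(clip_grad_norm(grads, 1.0))      # 5.0
print(grads)                           # roughly [[0.6, 0.0], [0.0, 0.8]]
```

Returning the pre-clip norm is useful for logging: a norm that keeps growing from step to step is an early warning before the loss turns into NaN.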

In the browser this becomes: Web Worker → get batch → WASM/WebGPU forward → backward → optimizer step → post progress to UI. See browser_notes.md.

9. Implementation order

Step 1 — Python / PyTorch reference (do this first)

Deliverables: model.py, dataset.py, train.py, sample.py. Goal: train the reference 0.8M-param config on 100 KB of text; loss decreases; sampling works; checkpoint reloads. Use Karpathy’s nanoGPT as a structural reference — not something to copy blindly.

Step 2 — tiny model from scratch

Reimplement in TypeScript / C++ / Rust. For browser learning: a TypeScript reference plus a C++/Rust WASM backend. Do not write a general autograd engine — you only need backprop for: Linear, Embedding, LayerNorm, GELU, Softmax, Attention, CrossEntropy, AdamW.

Steps 3–6 — WASM, Web Worker, checkpointing, WebGPU

See browser_notes.md.

10. Required tests

The full list and rationale are in ../tests/README.md. The most important one:

Can it overfit a tiny repeated dataset? If not, scaling is pointless.
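The shape of such a test can be shown with a stand-in model. The sketch below trains a 256 × 256 bigram logit table (not the transformer from this guide) with full-batch gradient descent on a short repeated string and records the loss at every step. For the real test, the model and optimizer would be swapped in and the same two assertions kept: the loss starts near ln(256) and falls far below it.

```python
import math


def overfit_check(data: bytes, steps: int = 100, lr: float = 1.0) -> list[float]:
    """Full-batch gradient descent on a bigram logit table; returns the loss per step."""
    pairs = list(zip(data[:-1], data[1:]))            # (previous byte, next byte)
    table = [[0.0] * 256 for _ in range(256)]         # zero logits = uniform prediction
    losses = []
    for _ in range(steps):
        loss = 0.0
        grad = {}                                     # row index -> gradient row
        for prev, nxt in pairs:
            row = table[prev]
            m = max(row)
            exps = [math.exp(v - m) for v in row]
            total = sum(exps)
            loss -= math.log(exps[nxt] / total)
            g = grad.setdefault(prev, [0.0] * 256)
            for j in range(256):
                g[j] += exps[j] / total               # softmax probability
            g[nxt] -= 1.0                             # minus the one-hot target
        for prev, g in grad.items():
            for j in range(256):
                table[prev][j] -= lr * g[j] / len(pairs)
        losses.append(loss / len(pairs))
    return losses


losses = overfit_check(b"abc" * 10)
print(round(losses[0], 3), "->", round(losses[-1], 3))
```

If the first recorded loss is not close to ln(256), the problem is usually in initialization or the loss itself; if it starts there but does not fall on data this repetitive, the gradients or the optimizer step are the first suspects.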

References

nanoGPT — minimal GPT training/finetuning repo: https://github.com/karpathy/nanoGPT
build-nanogpt — step-by-step construction: https://github.com/karpathy/build-nanogpt