
Model guide — building TinyGPT from scratch

Phase 1–2. Build a tiny GPT-style causal language model. First goal is correctness, not impressive output.

Exact numbers live in configs/model.byte-tinygpt-v0.json and configs/training.json — this doc explains them.


1. What you are building

A tiny GPT-style causal language model:

input tokens
  → token embeddings
  → position embeddings
  → transformer blocks
  → final layernorm
  → logits over vocabulary
  → next-token prediction loss

For v0, use a byte-level tokenizer: vocab_size = 256. Every byte is a token. This avoids all BPE / tokenizer complexity.
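A minimal sketch of the byte-level tokenizer described above. The function names `encode` and `decode` are illustrative, not taken from the project. Decoding uses `errors="replace"` on the assumption that sampled bytes will not always form valid UTF-8.

```python
def encode(text: str) -> list:
    """Turn text into byte-level token ids, each in the range 0..255."""
    return list(text.encode("utf-8"))


def decode(token_ids: list) -> str:
    """Turn token ids back into text; invalid UTF-8 becomes U+FFFD."""
    return bytes(token_ids).decode("utf-8", errors="replace")


tokens = encode("Hello")
print(tokens)          # [72, 101, 108, 108, 111]
print(decode(tokens))  # Hello
```

Non-ASCII characters take more than one token: `encode("é")` yields two ids, because "é" is two bytes in UTF-8.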


2. MVP model spec

{
  "model_name": "byte-tinygpt-v0",
  "vocab_size": 256,
  "context_length": 128,
  "n_layers": 4,
  "n_heads": 4,
  "d_model": 128,
  "d_mlp": 512,
  "dropout": 0.0,
  "tie_embeddings": true,
  "dtype": "float32"
}

Expected size of the reference config above: roughly 0.8M parameters. Intentionally small. The browser playground exposes a preset table from 360k (Small) to ~470M (Behemoth via Memory64), backed by the same architecture.
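The "roughly 0.8M" figure can be checked by hand. The sketch below counts parameters for the reference config, assuming every Linear layer has a bias and every LayerNorm has a scale and a shift. Both are assumptions; the actual implementation may differ. `count_parameters` is a hypothetical helper, not a project function.

```python
def count_parameters(vocab_size, context_length, n_layers, d_model, d_mlp,
                     tie_embeddings=True, bias=True):
    """Count trainable parameters of a GPT-style model with learned positions."""
    b = 1 if bias else 0
    token_emb = vocab_size * d_model
    pos_emb = context_length * d_model
    layernorm = 2 * d_model                              # scale + shift
    attention = 4 * (d_model * d_model + b * d_model)    # Wq, Wk, Wv, Wo
    mlp = (d_model * d_mlp + b * d_mlp) + (d_mlp * d_model + b * d_model)
    block = 2 * layernorm + attention + mlp
    # With tied embeddings the output head reuses token_emb, so it adds nothing.
    head = 0 if tie_embeddings else vocab_size * d_model
    return token_emb + pos_emb + n_layers * block + layernorm + head


total = count_parameters(vocab_size=256, context_length=128, n_layers=4,
                         d_model=128, d_mlp=512)
print(total)  # 842496
```

Note that n_heads does not appear: splitting d_model across heads changes the shapes inside attention, not the number of weights.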

Why float32 everywhere? All training (Python, WASM, WebGPU) uses float32 for numeric stability: gradients on tiny models are unforgiving, and lower precision cost more in loss drift than it bought in speed. f16 lives in the project as an inference-only path, gated behind the end-to-end parity tests (see the “f16-packed storage” entry in the README’s “Negative results” section for what didn’t pan out and why).


3. Data requirements

Plain text only. See data/README.md for good/bad sources and dataset sizes. With a byte-level tokenizer, 1 byte = 1 token, so a 1 MB file is about 1 million tokens.

| Stage | Size | Purpose |
| --- | --- | --- |
| Smoke test | 1–10 KB | Check loss decreases |
| Overfit test | 10–100 KB | Prove gradients are correct |
| Demo dataset | 500 KB–5 MB | Realistic browser demo |
| Stress test | 10–100 MB | Later only |

4. Dataset pipeline

raw text → UTF-8 bytes → integer token array → train/val split
         → random batch sampler → (x, y) pairs

tokens = [72, 101, 108, 108, 111, ...]
x = tokens[i : i + context_length]
y = tokens[i + 1 : i + context_length + 1]
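The slicing above can be wrapped into a split and a random batch sampler. This is a dependency-free sketch. `split_tokens` and the `get_batch` signature are illustrative, and a contiguous split (first 90% for training) is an assumption rather than something the project specifies.

```python
import random


def split_tokens(tokens, train_fraction=0.9):
    """Split one token array into contiguous train and validation parts."""
    cut = int(len(tokens) * train_fraction)
    return tokens[:cut], tokens[cut:]


def get_batch(tokens, batch_size, context_length, rng):
    """Sample random windows; each y is its x shifted one token to the right."""
    xs, ys = [], []
    for _ in range(batch_size):
        # The last valid start must leave room for context_length + 1 tokens.
        i = rng.randrange(len(tokens) - context_length)
        xs.append(tokens[i : i + context_length])
        ys.append(tokens[i + 1 : i + context_length + 1])
    return xs, ys


data = list(("hello world. " * 40).encode("utf-8"))
train, val = split_tokens(data)
rng = random.Random(42)  # seeded so batches are reproducible
x, y = get_batch(train, batch_size=4, context_length=16, rng=rng)
```

Passing an explicit seeded `random.Random` keeps sampling reproducible, which matters when resuming from a checkpoint.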

Split 90% train / 10% val. Write a dataset manifest — the hash is what makes checkpoint resume reproducible:

{
  "dataset_id": "sha256_of_raw_bytes",
  "name": "my_blog_posts.txt",
  "raw_bytes": 1249301,
  "token_count": 1249301,
  "tokenizer": "byte-v1",
  "train_split": 0.9,
  "val_split": 0.1,
  "seed": 42
}
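A manifest like the one above could be produced as follows. This is a sketch: `build_manifest` is a hypothetical helper, and using the hex SHA-256 digest of the raw bytes as `dataset_id` is an assumption based on the placeholder value in the example.

```python
import hashlib
import json


def build_manifest(raw: bytes, name: str, train_split: float = 0.9,
                   seed: int = 42) -> dict:
    """Describe a dataset so a resumed run can verify it is the same data."""
    return {
        "dataset_id": hashlib.sha256(raw).hexdigest(),
        "name": name,
        "raw_bytes": len(raw),
        "token_count": len(raw),  # byte-level: one token per byte
        "tokenizer": "byte-v1",
        "train_split": train_split,
        "val_split": round(1.0 - train_split, 6),
        "seed": seed,
    }


manifest = build_manifest(b"hello world", "example.txt")
print(json.dumps(manifest, indent=2))
```

On resume, recomputing the hash and comparing it with the stored `dataset_id` catches a silently edited or replaced text file.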

5. Architecture details

Embeddings

token_embedding:    [vocab_size, d_model]
position_embedding: [context_length, d_model]
x = token_embedding[token_ids] + position_embedding[position_ids]

Transformer block — use pre-LayerNorm

x = x + attention(layernorm(x))
x = x + mlp(layernorm(x))

Pre-LayerNorm is generally easier to train than post-LayerNorm: the residual path is never normalized, so gradients reach the early layers directly.
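The two residual lines can be sketched in plain Python on lists of floats. The `attention` and `mlp` arguments are placeholders passed in by the caller, and `eps=1e-5` is an assumed default, not a value from the project configs.

```python
import math


def layernorm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (xi - mean) * inv_std + b for xi, g, b in zip(x, gamma, beta)]


def add(a, b):
    """Element-wise sum of two vectors (the residual connection)."""
    return [ai + bi for ai, bi in zip(a, b)]


def pre_ln_block(xs, attention, mlp, ln1, ln2):
    """xs: T token vectors. attention mixes tokens; mlp acts on each token alone."""
    attn_out = attention([ln1(x) for x in xs])
    xs = [add(x, a) for x, a in zip(xs, attn_out)]
    return [add(x, mlp(ln2(x))) for x in xs]
```

The normalization is applied only to the input of each sublayer. The running `xs` that carries the residual stream is never normalized inside the block, which is why a final layernorm is needed before the output head.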

Causal self-attention

q = x @ Wq;  k = x @ Wk;  v = x @ Wv
scores = q @ k.T / sqrt(head_dim)
scores = causal_mask(scores)
attn   = softmax(scores)
out    = attn @ v
out    = out @ Wo

Shapes (B batch, T seq, C d_model, H heads, head_dim = C / H):

B = 16   T = 128   C = 128   H = 4   head_dim = 32
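For reference, here is a dependency-free sketch of the core of one attention head on one sequence. It deliberately leaves out the Wq, Wk, Wv and Wo projections, batching, and the split into H heads. The causal mask is applied by scoring only positions 0..t, which gives the same result as setting future scores to negative infinity before the softmax.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_head(q, k, v):
    """Single-head causal attention. q, k, v: T rows of head_dim floats each."""
    T, head_dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for t in range(T):
        # Causal mask: position t may only attend to positions 0..t.
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s]))
                  for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[s][d] for s, w in enumerate(weights))
                    for d in range(head_dim)])
    return out
```

A useful sanity check: the first output row must equal `v[0]`, because position 0 can only attend to itself.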

MLP

Linear(d_model → 4 * d_model)  →  GELU  →  Linear(4 * d_model → d_model)

For d_model = 128: 128 → 512 → 128.
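A per-token sketch of the MLP in plain Python. The guide does not say whether the project uses the exact GELU or the tanh approximation, so both are shown. The choice of the exact form inside `mlp` is an assumption, and whichever you pick must match between the reference and the from-scratch port.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU; close to the exact form but not identical."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def linear(x, weight, bias):
    """weight: out_features rows of in_features floats."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp(x, w_in, b_in, w_out, b_out):
    """Linear(d_model -> d_mlp), GELU, Linear(d_mlp -> d_model) for one token."""
    hidden = [gelu(h) for h in linear(x, w_in, b_in)]
    return linear(hidden, w_out, b_out)
```

The two GELU variants differ by a small amount at moderate inputs, which is enough to fail a strict parity test between two implementations.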

Output head — tied embeddings

x = final_layernorm(x)
logits = x @ token_embedding.T
output_projection_weight = token_embedding_weight

Tied embeddings remove the separate vocab_size × d_model output matrix (256 × 128 = 32,768 parameters in the reference config) and usually improve tiny models.


6. Loss function

Next-token cross-entropy. For a 256-byte vocab:

initial_loss ≈ ln(256) ≈ 5.545

| Condition | Expected |
| --- | --- |
| Random model | loss near 5.545 |
| Repeated tiny dataset | loss falls fast |
| Loss does not fall | bug in model / backprop / data |
| Loss becomes NaN | learning rate too high, unstable softmax, gradient explosion, bad init |
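The starting value is easy to verify: a model that assigns equal logits to all 256 bytes has a cross-entropy of exactly ln(256), whatever the target is. A minimal sketch for one position, using the log-sum-exp form to avoid overflow:

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy in nats of one logit vector against one target id."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


uniform_logits = [0.0] * 256
loss = cross_entropy(uniform_logits, target=65)
print(round(loss, 3))  # 5.545
```

If a freshly initialized model reports a first-step loss far from this value, suspect the initialization scale or the loss code before anything else.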

7. Training config

{
  "batch_size": 16,
  "learning_rate": 0.0003,
  "optimizer": "adamw",
  "betas": [0.9, 0.95],
  "eps": 1e-8,
  "weight_decay": 0.1,
  "grad_clip": 1.0,
  "max_steps": 10000,
  "eval_interval": 100,
  "sample_interval": 500,
  "checkpoint_interval": 500,
  "seed": 42
}
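To show how these values are used, here is a sketch of global-norm gradient clipping and one AdamW update over flat lists of floats. It follows the standard decoupled-weight-decay formulation. Whether the project applies weight decay to every parameter or excludes some (for example LayerNorm weights and biases) is not stated here, so treat the uniform decay as an assumption.

```python
import math


def clip_grad_norm(grads, max_norm):
    """Scale gradients so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


def adamw_step(params, grads, m, v, step, lr=3e-4, betas=(0.9, 0.95),
               eps=1e-8, weight_decay=0.1):
    """One in-place AdamW update; step counts from 1 for bias correction."""
    b1, b2 = betas
    for i, g in enumerate(grads):
        m[i] = b1 * m[i] + (1 - b1) * g          # first-moment estimate
        v[i] = b2 * v[i] + (1 - b2) * g * g      # second-moment estimate
        m_hat = m[i] / (1 - b1 ** step)
        v_hat = v[i] / (1 - b2 ** step)
        params[i] -= lr * weight_decay * params[i]   # decoupled weight decay
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


params, m, v = [1.0], [0.0], [0.0]
grads, norm = clip_grad_norm([0.5], max_norm=1.0)
adamw_step(params, grads, m, v, step=1)
```

On the first step the bias-corrected update has magnitude close to `lr` regardless of the gradient's scale, so a parameter moves by about 0.0003 plus the decay term.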

8. Training loop

for step in range(max_steps):
    x, y = get_batch("train")
    logits = model.forward(x)
    loss = cross_entropy(logits, y)

    model.zero_grad()
    loss.backward()
    clip_grad_norm(model.parameters(), 1.0)
    optimizer.step()

    if step % eval_interval == 0:        val_loss = evaluate()
    if step % sample_interval == 0:      sample_text = generate(prompt)
    if step % checkpoint_interval == 0:  save_checkpoint()

In the browser this becomes: Web Worker → get batch → WASM/WebGPU forward → backward → optimizer step → post progress to UI. See browser_notes.md.


9. Implementation order

Step 1 — Python / PyTorch reference (do this first)

Deliverables: model.py, dataset.py, train.py, sample.py. Goal: train the reference 0.8M-param config on 100 KB of text; loss decreases; sampling works; checkpoint reloads. Use Karpathy’s nanoGPT as a structural reference — not something to copy blindly.

Step 2 — tiny model from scratch

Reimplement in TypeScript / C++ / Rust. For browser learning: a TypeScript reference plus a C++/Rust WASM backend. Do not write a general autograd engine. You only need hand-written backward passes for Linear, Embedding, LayerNorm, GELU, Softmax, Attention, and CrossEntropy, plus an AdamW update step.

Steps 3–6 — WASM, Web Worker, checkpointing, WebGPU

See browser_notes.md.


10. Required tests

The full list and rationale are in ../tests/README.md. The most important one:

Can it overfit a tiny repeated dataset? If not, scaling is pointless.

