source: docs/evolution_strategies.md

Evolution Strategies — gradient-free training

ES is a finite-difference approach to optimisation: at every step, we sample K random perturbations of the current parameters, evaluate each perturbed model on a shared batch, and update the parameters along the reward-weighted average noise direction.

Useful when:

Reference: Salimans et al., 2017, “Evolution Strategies as a Scalable Alternative to Reinforcement Learning” (arXiv:1703.03864).


Command

tinygpt es <model.tinygpt> --corpus <text> \
    --steps 200 --population 40 --sigma 0.02 --lr 0.01 \
    --out es-trained.tinygpt

Note: ES is byte-level-only in this first cut, and operates on from-scratch models (the model’s parameters are saved through the existing .tinygpt manifest). The starting checkpoint can be a fresh-train output or any prior-saved model.

The algorithm

Per ES step:

  1. Snapshot base parameters w (the current model state).
  2. Sample a shared batch — same data for every population member.
  3. For each of K/2 pairs, draw noise ε ~ N(0, I) shaped like w, evaluate L_+(ε) = loss(w + σε) and L_-(ε) = loss(w - σε).
  4. Reward = -loss (higher is better).
  5. Centre the rewards by subtracting the mean across all K samples. Standard variance-reduction trick.
  6. Estimate the gradient via the antithetic estimator: dir = Σ_pairs ((R_+ - R_-) / 2) · ε
  7. Apply the step: w ← w + (lr / (K · σ)) · dir
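
The seven steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not TinyGPT's actual implementation: `es_step` and `quadratic_loss` are hypothetical names, the parameters are a flat list of floats rather than a real model, and the toy quadratic stands in for "model loss on the shared batch".

```python
import random


def es_step(w, loss_fn, population=40, sigma=0.02, lr=0.01, rng=random):
    """One antithetic ES update on a flat list of floats (illustrative only)."""
    assert population % 2 == 0, "antithetic pairs need an even population"
    n_pairs = population // 2

    # Step 3: one noise vector per pair, evaluated at both mirrored points.
    noises, r_plus, r_minus = [], [], []
    for _ in range(n_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in w]
        w_plus = [wi + sigma * ei for wi, ei in zip(w, eps)]
        w_minus = [wi - sigma * ei for wi, ei in zip(w, eps)]
        noises.append(eps)
        # Step 4: reward is the negated loss.
        r_plus.append(-loss_fn(w_plus))
        r_minus.append(-loss_fn(w_minus))

    # Step 5: centre rewards on the mean over all K evaluations.
    # The mean cancels in (R_+ - R_-) below; it is kept to mirror the list.
    mean_r = (sum(r_plus) + sum(r_minus)) / population
    r_plus = [r - mean_r for r in r_plus]
    r_minus = [r - mean_r for r in r_minus]

    # Step 6: dir = sum over pairs of ((R_+ - R_-) / 2) * eps
    direction = [0.0] * len(w)
    for eps, rp, rm in zip(noises, r_plus, r_minus):
        coeff = (rp - rm) / 2.0
        for i, ei in enumerate(eps):
            direction[i] += coeff * ei

    # Step 7: w <- w + (lr / (K * sigma)) * dir
    scale = lr / (population * sigma)
    return [wi + scale * di for wi, di in zip(w, direction)]


def quadratic_loss(w):
    # Hypothetical toy objective with its minimum at w = [3, 3, 3, 3].
    return sum((wi - 3.0) ** 2 for wi in w)


rng = random.Random(0)  # fixed seed so the run is repeatable
w = [0.0, 0.0, 0.0, 0.0]
start_loss = quadratic_loss(w)
for _ in range(200):
    w = es_step(w, quadratic_loss, population=40, sigma=0.02, lr=0.1, rng=rng)
end_loss = quadratic_loss(w)
```

With K/2 pairs and the 1/(K · σ) normalisation, the expected update works out to roughly (lr / 2) times the negative gradient to first order in σ, so the `lr` passed here behaves like half a conventional gradient-descent learning rate.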

The antithetic pairing, evaluating both +ε and -ε for each random vector, cuts the gradient-estimate variance roughly in half for the same K samples vs. one-sided estimation. Salimans et al., 2017 use the same mirrored-sampling trick.
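
One way to see what the antithetic pairing removes is a second-order Taylor expansion of the loss around w. This is a sketch under the assumption that the loss is smooth enough to expand; g (the gradient at w) and H (the Hessian at w) are symbols introduced here for illustration and do not appear elsewhere in this document.

```latex
\begin{aligned}
L(w \pm \sigma\varepsilon)
  &= L(w) \pm \sigma\, g^\top \varepsilon
     + \tfrac{\sigma^2}{2}\, \varepsilon^\top H \varepsilon
     + O(\sigma^3), \\
\frac{R_+ - R_-}{2}
  &= -\frac{L(w + \sigma\varepsilon) - L(w - \sigma\varepsilon)}{2}
   = -\sigma\, g^\top \varepsilon + O(\sigma^3).
\end{aligned}
```

The constant term L(w) and the curvature term appear with the same sign in both evaluations, so they cancel in the difference. A one-sided estimate keeps both, and they contribute noise without carrying any directional information.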

Hyperparameter notes

What ES is NOT

Where to look