Session Retrospective: posttrainllm, the Long Working Session

This document captures a single long working session on posttrainllm — what was believed coming in, what the contradictions turned out to be, what got shipped, and what the transferable lessons were. The session ranged across performance measurement, hyperparameter archaeology, training data, host-specific ABI bugs, demo polish, generation throughput, and the gap between Node-WASM and browser-WASM. It produced visible changes to README.md, BLOG.md, the Astro pages under browser/src/pages/, browser/src/types.ts, browser/src/main.ts, browser/src/tour.ts, browser/src/explainers.ts, docs/archive/status.md, docs/deploy.md, a new docs/archive/lessons.md, a freshly trained browser/public/demo.tinygpt, the default browser/public/shakespeare.txt corpus, and a new Node reproducer at tests/test_wasm64_xl_node.mjs. Most of the deliverables were small. The findings behind them were not.

1. The speedup number that turned into a curve

Coming into the session, the project’s headline was a single number: “9.7x end-to-end speedup over the Python reference.” That number had earned its place — it had been measured, it was reproducible, and it appeared in the README, the blog post, the devlog, the roadmap, and the dedicated browser/src/pages/speedup.astro page. It was the project’s identity.

The contradiction surfaced when the measurement was actually re-run across model sizes instead of just the Medium preset. The scratch measurement scripts that now live at browser/measure_xl.mjs, browser/measure_mega.mjs, browser/measure_behemoth.mjs, and browser/measure_curve.mjs walked Small / Medium / Large / XL and produced a clean monotonic curve: roughly 2.6x at Small, 6.8x at Medium, 9.3x at Large, and 12.1x at XL. The 9.7x number was real, but it was the value of a function evaluated at one point. Quoting it as the headline simultaneously undersold the largest models (where the win is much bigger) and oversold the smallest (where the win is modest).

The fix was to publish the function, not a point on it. README.md, BLOG.md, browser/src/pages/devlog.astro, browser/src/pages/roadmap.astro, browser/src/pages/speedup.astro, and docs/archive/status.md were all reworked so the speedup is shown as a curve indexed by model size, with the Medium point called out for continuity with prior writing. docs/performance.md already framed perf this way for individual kernels; the user-facing surfaces now match.
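As a minimal sketch of what “publish the function” means in code, the helper below turns per-preset timings into a full speedup curve rather than a single headline value. The PresetTiming shape and both function names are hypothetical; they are not taken from the measure_*.mjs scripts, and the sketch assumes each preset has been timed once on each side.

```typescript
// Hypothetical shape: one timing pair per preset, in milliseconds per training step.
interface PresetTiming {
  preset: string;      // e.g. "Small", "Medium", "Large", "XL"
  referenceMs: number; // Python reference, ms per step
  wasmMs: number;      // WASM build, ms per step
}

interface SpeedupPoint {
  preset: string;
  speedup: number; // referenceMs / wasmMs
}

// Turn raw timings into the whole curve instead of one headline number.
function speedupCurve(timings: PresetTiming[]): SpeedupPoint[] {
  return timings.map((t) => {
    if (t.wasmMs <= 0 || t.referenceMs <= 0) {
      throw new Error(`non-positive timing for preset ${t.preset}`);
    }
    return { preset: t.preset, speedup: t.referenceMs / t.wasmMs };
  });
}

// Render one line per preset, formatted like "Small: 2.0x".
function formatCurve(points: SpeedupPoint[]): string {
  return points.map((p) => `${p.preset}: ${p.speedup.toFixed(1)}x`).join("\n");
}
```

Keeping the preset name attached to every point means any quoted number carries the model size it was measured at.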

The transferable lesson is uncomfortable: if a performance number is a function of a tuneable, publishing one point on it is a marketing artifact, not a measurement. The curve is the truthful representation, and the curve is also more interesting — the slope, not the intercept, is what the project’s optimisations are buying. See docs/performance.md and docs/archive/lessons.md for the longer-form version of this argument.

2. The learning rate that was 10x too hot for months

The pre-session belief was that the loss floor on real corpora (around 2.45 nats on TinyShakespeare-class data, regardless of model size) was a modelling ceiling — proof that the model architecture or the optimiser was missing some piece the Python reference had. Two days of the session were spent suspecting GPU kernels: chasing FlashAttention-2 numerics, re-running parity tests on matmul tiles, auditing the softmax in attention, looking for an off-by-one in the layernorm gradient. Every one of those came back clean. The kernel parity tests against the Python reference all passed.

The contradiction surfaced almost incidentally, while reading the defaults wired into the UI. browser/src/types.ts:35 had the browser default learningRate set to 3e-3. The HTML input on the playground page at browser/src/pages/index.astro:2621 carried value="0.003". The Python reference defaults to 3e-4. The browser was training every model the visitor saw at ten times the reference learning rate. With Adam-class optimisers on small transformer stacks, 3e-3 is firmly in the “loss can descend for a while and then plateau on a noise floor it cannot climb out of” regime. The 2.45 floor was not a ceiling. It was the noise.

The fix was a one-character edit in types.ts and a one-token edit in index.astro, both setting the default to 3e-4. After that change, a 5000-step run on the Huge preset (the same one used to bake the new shipped demo model) converged to a training loss of 1.30 — well inside the range where samples read as legible Shakespeare instead of mangled n-gram soup. The kernels had been fine all along.

The transferable lesson is the one this session paid the most for. Kernel parity tests catch wrong math. Nothing caught wrong hyperparameters because no test asserted that the defaults the user sees match the defaults the reference uses. The reference path is the oracle not just for kernel outputs but for the entire default configuration surface — learning rate, beta1, beta2, weight decay, warmup steps, init scale, dropout. A test that walks browser/src/types.ts and the input value attributes in index.astro and asserts equality with the Python reference defaults would have caught this on day one. That test now belongs on the roadmap. The longer write-up of this is in docs/archive/lessons.md.
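The defaults-parity test described above could look roughly like the sketch below. It is an assumption-laden outline, not the project's test: the Defaults shape, the function names, and the input id used in the usage note are all hypothetical, and the regex stands in for a real HTML parser.

```typescript
// Hypothetical flat view of a default configuration; keys are illustrative.
type Defaults = Record<string, number>;

interface Mismatch {
  key: string;
  actual: number | undefined;
  expected: number;
}

// Compare every reference default against the value another surface exposes.
function diffDefaults(actual: Defaults, reference: Defaults): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const key of Object.keys(reference)) {
    const expected = reference[key];
    const got = actual[key];
    if (got === undefined || got !== expected) {
      mismatches.push({ key, actual: got, expected });
    }
  }
  return mismatches;
}

// Pull the numeric value="..." attribute off an <input id="..."> in raw HTML.
// A regex is enough for a sketch; a real test would likely use an HTML parser.
function inputDefault(html: string, id: string): number | undefined {
  const tag = html.match(new RegExp(`<input\\b[^>]*\\bid="${id}"[^>]*>`));
  if (!tag) return undefined;
  const value = tag[0].match(/\bvalue="([^"]*)"/);
  return value ? Number(value[1]) : undefined;
}
```

A test built on this would read the defaults object from types.ts, read each input with inputDefault (for example with a hypothetical id such as "lr"), and fail whenever diffDefaults returns a non-empty list against the reference values.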

3. The training corpus was 863 bytes

The pre-session belief was that the playground’s “train from scratch” path was running on Shakespeare, because that is what the surrounding copy promised and because every screenshot in BLOG.md and the README implied it. The default <textarea id="corpus"> inside browser/src/pages/index.astro is what actually feeds the trainer when the user clicks Train without uploading their own data.

The contradiction surfaced when training a Huge-preset model on the default corpus produced suspiciously low loss numbers very quickly. The corpus turned out to be an 863-byte inline meta-explainer paragraph — the kind of thing that explains what posttrainllm is to a first-time visitor — sitting where actual training text should have been. A 9.6M-parameter model on 863 bytes is not learning a language; it is memorising one short paragraph and then thrashing on the cycle. Any “training curve” produced under this configuration was a measurement of how fast the model could overfit one paragraph.

The fix had two parts. First, the actual TinyShakespeare corpus (about 1.1 MB) was added at browser/public/shakespeare.txt and also archived under data/examples/shakespeare.txt. Second, the playground’s init code in browser/src/main.ts was changed to fetch /shakespeare.txt on page load and populate the textarea with it, so the default training data the user sees is now the same data the screenshots and the demo model were trained on. The inline explainer text moved out of the textarea and into surrounding prose where it belonged.

The transferable lesson is that the default training data is part of the demo’s promise. A playground that ships with an 863-byte corpus is making a claim about the experience the visitor will have, and that claim was wrong. Default data, like default hyperparameters, is part of the contract.
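One way to make that contract checkable is a small guard that runs before training starts and flags a corpus that is implausibly small for the chosen model. The sketch below is illustrative only: checkCorpus, the CorpusReport shape, and the minChars floor are hypothetical, and the floor is an arbitrary placeholder rather than a value from the project.

```typescript
interface CorpusReport {
  chars: number;         // corpus length in UTF-16 code units (roughly bytes for ASCII text)
  charsPerParam: number; // how much text there is per trainable parameter
  tooSmall: boolean;     // true when the corpus is below the chosen floor
}

// minChars is an arbitrary illustrative floor, not a tuned project value.
function checkCorpus(text: string, paramCount: number, minChars = 100000): CorpusReport {
  if (paramCount <= 0) {
    throw new RangeError("paramCount must be positive");
  }
  const chars = text.length;
  return {
    chars,
    charsPerParam: chars / paramCount,
    tooSmall: chars < minChars,
  };
}
```

Under this sketch the 863-byte explainer paragraph would be flagged before the first step, while the roughly 1.1 MB Shakespeare corpus would pass.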

4. The Memory64 module had never actually run in Node

The pre-session belief was that “Node passes XL” was strong evidence the kernels were correct at the largest preset, because the test suite was running an XL-class workload end-to-end in Node-WASM and reporting green. That claim had been repeated in several places (the devlog, internal notes, and informally during debugging) as a reason to look elsewhere when browser-side XL behaviour was suspicious.

The contradiction surfaced when the loader logic for the 64-bit pthread+Memory64 build was traced carefully. tests/bench_wasm.mjs — the Node harness that the “Node passes XL” claim was based on — loads tinygpt.js, the 32-bit single-threaded build. The 64-bit pthread+Memory64 build (tinygpt64.js plus tinygpt64.wasm) is the one the browser actually uses for XL and above. The 64-bit module had never been loaded from Node by any existing test. The evidence supporting “the kernels are fine at XL” was evidence about a different binary.

A reproducer was built at tests/test_wasm64_xl_node.mjs that loads tinygpt64.js from Node and tries to run a forward pass. It fails immediately at the JS↔WASM bridge, before any kernel runs. _malloc returns a JavaScript Number (a 53-bit-safe integer pointer), but the cwrap-generated wrappers for the kernel entry points expect their pointer arguments as BigInt because the module is compiled with -sMEMORY64=1. The conversion throws a TypeError at the first kernel call. The browser was calling into this same broken ABI surface for any model at XL or above. Tracked as task #66; the fix in progress is to wrap the allocator returns into BigInts at the JS shim layer rather than at every call site.
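The shim-layer approach can be sketched as follows. This is a hedged outline of the idea, not the fix tracked as task #66: toPtr, wrapMalloc, and the AllocatorLike interface are hypothetical names, and the sketch assumes _malloc accepts a Number size, which may not match the real module.

```typescript
// Pointers in a Memory64 module are 64-bit, so the kernel wrappers want BigInt.
type Ptr = bigint;

// Promote whatever the allocator hands back to a BigInt pointer.
function toPtr(raw: number | bigint): Ptr {
  if (typeof raw === "bigint") return raw;
  if (!Number.isSafeInteger(raw) || raw < 0) {
    throw new RangeError(`not a valid pointer: ${raw}`);
  }
  return BigInt(raw);
}

// Hypothetical minimal view of the module; the real exports may differ.
interface AllocatorLike {
  _malloc(size: number): number | bigint;
}

// Wrap the allocator once so every call site receives BigInt pointers.
function wrapMalloc(mod: AllocatorLike): (size: number) => Ptr {
  return (size) => toPtr(mod._malloc(size));
}
```

Doing the promotion in one place keeps the Number-versus-BigInt decision out of the kernel call sites, which is the point of fixing it at the shim rather than per call.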

The transferable lesson is that a passing test vouches for one host running one compiled artifact, and for nothing else. “It works in Node” is a statement about a specific Node loader running a specific compiled artifact. If the browser uses a different compiled artifact, the Node test gives you exactly zero coverage of the browser path. The test matrix needs to be host × artifact, not a single row or column of it. This is also written up in docs/archive/lessons.md and referenced from docs/archive/status.md as the current top open issue.
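A host × artifact matrix is simple enough to enumerate mechanically, which also makes the gaps visible. The sketch below is a hypothetical helper, not part of the test suite; the names testMatrix and uncovered are assumptions.

```typescript
interface MatrixCell {
  host: string;
  artifact: string;
}

// Every host paired with every artifact.
function testMatrix(hosts: string[], artifacts: string[]): MatrixCell[] {
  const cells: MatrixCell[] = [];
  for (const host of hosts) {
    for (const artifact of artifacts) {
      cells.push({ host, artifact });
    }
  }
  return cells;
}

// Cells the matrix requires that no existing test exercises.
function uncovered(required: MatrixCell[], covered: MatrixCell[]): MatrixCell[] {
  const key = (c: MatrixCell) => `${c.host}|${c.artifact}`;
  const seen = new Set(covered.map(key));
  return required.filter((c) => !seen.has(key(c)));
}
```

Feeding it the hosts node and browser, the artifacts tinygpt.js and tinygpt64.js, and a covered list containing only node with tinygpt.js would have reported the node and tinygpt64.js cell as untested.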

5. A shipped demo model that produces something a human can read

The pre-session demo at browser/public/demo.tinygpt was a roughly 0.8M-parameter Medium-class checkpoint. Loaded into the playground, it produced output that was syntactically plausible at the character level but semantically gibberish to anyone who was not already familiar with the project’s training loop. To a first-time visitor — the primary audience for a “load the pretrained model and watch it generate” path — it read as a broken demo.

The contradiction surfaced when the new corpus and the corrected learning rate were combined: a Huge-preset model could now converge to a loss range where samples were recognisably Shakespearean. The 0.8M Medium model had been doing the best it could under the wrong defaults; with the right defaults and a real corpus, a 9.6M Huge model was within reach of producing demo-quality output in about fifteen minutes of training.

The fix was to actually train and ship that model. browser/train_demo.mjs drives Playwright against a local dev build of the playground, runs 5000 steps on the Huge preset against full Shakespeare, and downloads the resulting .tinygpt file to replace browser/public/demo.tinygpt. The first run of this script crashed at the download step: #modelMenuBtn is hidden behind a parent menu, and Playwright’s .click() waits for its target to become visible before acting, so it never fires on a hidden element. Switching that one interaction to page.evaluate(() => document.querySelector('#modelMenuBtn').click()) resolved it, because that call sidesteps Playwright’s visibility check and invokes the element’s click() directly in the page. The final model loads cleanly and produces samples in the loss-1.30 range from cold start.

The playground banner copy was reworked to make the two paths explicit instead of leaving the user to guess: “Load the pretrained model” (one click, instant, produces readable Shakespeare) versus “Train your own from scratch” (about fifteen minutes, watch a real training curve, end up with your own checkpoint). Previously the banner conflated these into one ambiguous call-to-action. The new framing matches what the underlying buttons actually do.

The transferable lesson is that “ship a demo model” is one task on the surface and three tasks underneath: (a) the model has to actually train to a loss that produces good samples, which depends on hyperparameters being correct; (b) the training data has to be real; (c) the path the user takes to load it has to be unambiguous and tested end-to-end through the same UI a visitor will use.

6. Generation feels slow because it actually is slow

The pre-session belief about the generation UX was that the perceived sluggishness on long completions was mostly the typewriter animation — that the model was producing tokens quickly and the UI was just metering them out. This belief was easy to hold because no token-rate measurement existed; the only visible artifact was the animation.

The contradiction surfaced when a stopwatch was put against the actual end-to-end generation path. There is no KV cache: every newly generated token re-runs a full forward pass over the entire context, so the per-token cost grows at least linearly (and the cumulative cost at least quadratically) in the sequence length, with attention adding a steeper term because it compares every position against every other. There is no streaming: the worker thread runs the entire decode loop to completion and only then posts the finished sequence back to the main thread, which then runs a typewriter animation over already-finished text. The animation is decorative, not load-bearing. The model is genuinely producing tokens at a slow rate, and the UI cannot show progress because the worker is not sending any until the end.
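The cost model can be made concrete by counting how many token positions are pushed through the network during a generation. The sketch below is a simplification under stated assumptions: it counts positions rather than FLOPs, it assumes the context never exceeds the model’s window, and both function names are hypothetical.

```typescript
// Positions fed through the network to produce newTokens tokens with no KV cache.
function positionsWithoutCache(promptLen: number, newTokens: number): number {
  let total = 0;
  for (let i = 0; i < newTokens; i++) {
    total += promptLen + i; // step i re-runs the whole context built so far
  }
  return total;
}

// With a KV cache: the prompt is processed once, then one position per later step.
function positionsWithCache(promptLen: number, newTokens: number): number {
  if (newTokens === 0) return 0;
  return promptLen + (newTokens - 1);
}
```

In closed form the uncached count is newTokens * promptLen + newTokens * (newTokens - 1) / 2, which is where the quadratic cumulative cost comes from. A cached decode step still attends over all stored keys, so it is cheaper per step but not constant-time.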

The fix is in three pieces, only the first of which landed in this session. First, browser/src/gpu_model.ts:generate was changed to accept an optional onToken callback so the inner decode loop can yield each token as it is produced, instead of accumulating into an array and returning at the end. The second piece, wiring onToken through browser/src/worker.ts as a postMessage stream and then through browser/src/main.ts as an incremental DOM update, is tracked as task #72. The third piece, an actual KV cache that reuses prior key/value tensors across decode steps, is a larger refactor that is captured on the roadmap but not scheduled for this session. The cost model is documented now so the next person picking this up does not have to re-derive it.
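The shape of an onToken-style decode loop can be sketched as below. This is not the code in gpu_model.ts: the NextToken stand-in, the synchronous signature, and the choice to also return the full array are assumptions made to keep the example self-contained.

```typescript
// Hypothetical stand-in for one forward pass plus sampling.
type NextToken = (context: number[]) => number;
type OnToken = (token: number, index: number) => void;

function generate(
  prompt: number[],
  maxNewTokens: number,
  nextToken: NextToken,
  onToken?: OnToken,
): number[] {
  const context = prompt.slice(); // do not mutate the caller's prompt
  const produced: number[] = [];
  for (let i = 0; i < maxNewTokens; i++) {
    const token = nextToken(context);
    context.push(token);
    produced.push(token);
    if (onToken) onToken(token, i); // report the token as soon as it exists
  }
  return produced;
}
```

A worker built on this shape could pass a callback that calls postMessage for each token, so the main thread renders arrivals instead of animating finished text.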

The transferable lesson is that “feels slow” needs to be decomposed into “is slow” (algorithmic cost), “looks slow” (UI metering), and “delivers late” (streaming). A typewriter animation over a batch-completed sequence is a UX lie: it pretends to stream something that has already finished. The animation should follow real token arrivals, or it should be removed.

7. Browser-WASM is roughly 15x slower than Node-WASM at small models

The pre-session belief was that WASM is WASM — that once the compiled module was loaded, the host environment was a thin layer and the per-step training time would be substantially the same between Node and the browser. This belief is approximately true at large model sizes, where each kernel call dominates and the per-call overhead is amortised over significant work.

The contradiction surfaced at the small end of the curve, in the same measurement sweep that produced the speedup curve. At Small, browser-WASM is about 15x slower per step than Node-WASM on the same machine running the same compiled artifact (for the 32-bit module, which is the one both hosts can actually load — see arc 4 for the 64-bit story). The kernels themselves are not slower; the cost is environmental. pthread orchestration through Atomics.wait and SharedArrayBuffer carries non-trivial per-call overhead in the browser. Every message that crosses the worker boundary serialises through postMessage. The browser’s WASM instantiation path and its memory model carry costs that the Node host does not.

There is no clean fix here, and the session did not attempt one. The framing change is that this is a host cost, not a kernel cost, and it dominates only at small model sizes where the per-call overhead is comparable to the per-call work. As models get bigger, the ratio collapses — which is part of why the speedup curve in arc 1 climbs with model size: not just because the Python reference scales worse, but because the WASM host overhead becomes invisible. This is documented now in docs/performance.md alongside the kernel-level perf numbers.
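The “ratio collapses” behaviour falls out of even a toy additive model in which each step costs a fixed host overhead plus the kernel work. The sketch below is that toy model only; it is an assumption for illustration, not how docs/performance.md derives its numbers, and hostSlowdown is a hypothetical name.

```typescript
// Toy additive model: step time = fixed host overhead + kernel work.
// Returns how many times slower the browser step is than the Node step.
function hostSlowdown(
  workMs: number,
  browserOverheadMs: number,
  nodeOverheadMs: number,
): number {
  if (workMs < 0 || browserOverheadMs < 0 || nodeOverheadMs < 0) {
    throw new RangeError("times must be non-negative");
  }
  const nodeStep = nodeOverheadMs + workMs;
  if (nodeStep === 0) {
    throw new RangeError("Node step time must be positive");
  }
  return (browserOverheadMs + workMs) / nodeStep;
}
```

With the overheads held fixed, the result tends toward 1 as workMs grows, which matches the observation that the host gap matters at Small and fades at the large presets.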

The transferable lesson is that perf comparisons need to name the host. “WASM is fast” is not a complete sentence. Node-WASM and browser-WASM are different runtimes with different overhead profiles, and a measurement in one is not a prediction for the other — especially at the boundaries of model size where overhead and work are comparable.

What’s still open

The session closed with the following items deliberately left unfinished. Each is tracked in the project’s task list and the items here cross-reference docs/archive/status.md for current state.

Memory64 ABI fix (task #66). tinygpt64.js cannot run a forward pass on any host until the mismatch between _malloc returning a Number and the kernel wrappers expecting a BigInt is addressed. The fix is a thin wrapper layer at the JS shim that promotes allocator returns to BigInt at the boundary so call sites do not all need to be updated. Until this lands, XL and above in the browser are running on broken footing: the kernels themselves are correct, but the ABI bridge is not, and the symptoms surface in confusing ways depending on which entry point happens to be called first.
KV cache plus streaming generation (task #72 and a follow-on). The onToken callback now exists in gpu_model.ts:generate. It needs to be wired through worker.ts as an incremental postMessage stream and through main.ts as a real DOM update on each arrival, replacing the decorative typewriter animation that runs after the fact. After that, the KV cache itself (reusing prior key/value tensors across decode steps so each new token no longer re-runs the forward pass over the entire context) is the larger architectural change that actually makes long generations fast rather than just visibly streaming.
More curated corpora. Shakespeare is the right default but the wrong only-option. A small set of curated corpora (a code corpus, a markdown-prose corpus, a small dialogue corpus) would let the “train your own from scratch” path produce visibly different model behaviour depending on the visitor’s choice, which is both pedagogically valuable and a better demo. Each corpus needs to be small enough to fit comfortably in browser memory and large enough to support a Huge-preset training run without collapsing to memorisation.
The gallery. A page showing other people’s trained models — uploaded checkpoints with sample outputs, training curves, and the corpus and hyperparameters used. This was scoped in the roadmap but not started. The shipping demo model and the working .tinygpt upload path are prerequisites; both now exist.
The Mac app. A native shell around the WASM core, packaged as a real desktop application. Captured on the roadmap; no work in this session. The Memory64 ABI fix is a prerequisite because the Mac app will want the 64-bit module for serious model sizes.
More comprehensive eval and val-loss reporting in the playground. Training currently reports only training loss. A held-out validation split with periodic val-loss reporting (and a visible train/val curve overlay) would make overfitting visible to the user instead of hiding behind a single descending number, and would have surfaced the 863-byte-corpus problem in arc 3 immediately — a train loss collapsing while val loss stays high is the canonical fingerprint of memorisation. docs/validation_report.md (the evaluation-and-safety appendix, merged from the former evaluation.md) sketches the design; the playground wiring is not yet done.
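
The held-out split and the memorisation fingerprint from the last item can be sketched in a few lines. This is a hypothetical outline, not the design in docs/validation_report.md: splitCorpus, looksMemorised, the tail-of-corpus split, and the maxGap threshold are all assumptions, and the threshold is an arbitrary placeholder.

```typescript
interface Split {
  train: string;
  val: string;
}

// Hold out the tail of the corpus as validation text.
function splitCorpus(text: string, valFraction = 0.1): Split {
  if (!(valFraction > 0 && valFraction < 1)) {
    throw new RangeError("valFraction must be strictly between 0 and 1");
  }
  const cut = text.length - Math.floor(text.length * valFraction);
  return { train: text.slice(0, cut), val: text.slice(cut) };
}

// Flag the memorisation fingerprint: train loss far below val loss.
// maxGap is an arbitrary illustrative threshold in nats.
function looksMemorised(trainLoss: number, valLoss: number, maxGap = 0.5): boolean {
  return valLoss - trainLoss > maxGap;
}
```

Splitting on the tail rather than sampling random characters keeps the validation text contiguous, which suits a character-level model; a real implementation might prefer a different scheme.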

The throughline across all seven arcs is that several of the project’s most-quoted claims turned out to be measurements of the wrong thing — the wrong slice of the speedup curve, the wrong learning rate, the wrong corpus, the wrong binary, the wrong layer of the latency stack. The fixes were individually small. The pattern is worth keeping in view: every claim about performance, correctness, or user experience is implicitly a claim about what was measured and under what defaults, and “what was measured under what defaults” is the part that quietly drifts when nobody is testing it. The longer-form version of this argument lives in docs/archive/lessons.md; the current state of every open item lives in docs/archive/status.md; the canonical perf numbers live in docs/performance.md; and the public-facing narrative for all of the above is in BLOG.md.