4,192-parameter transformer; same trained weights run on every layer of the stack from pure Python to FPGA. Hardware: Apple M4 Pro (24 GB, 10P + 4E cores, macOS 15.6.1, clang 17). Reference: TALOS-V2 on Cyclone V at 56 MHz. · overview · ▶ try the live WASM demo · source
Each implementation runs one autoregressive stream, batch=1, char-by-char with multinomial sampling. Token rate measured after warmup; the chart is log-scale (lower = slower, but visually small gaps can hide large ratios). A minimal timing sketch follows the table.
| Implementation | tok/sec | vs FPGA | What dominates |
|---|---|---|---|
| MLX fp32 (GPU) | 1,865 | 0.04× | GPU kernel launch latency per op |
| MLX fp32 (CPU) | 3,873 | 0.07× | Per-op dispatch through the array framework |
| pure Python | 4,332 | 0.08× | Python loop overhead per MAC |
| NumPy fp32 | 24,223 | 0.46× | Reduce/dispatch/typecheck (~96% non-math) |
| TALOS-V2 (FPGA, 56 MHz) | 53,000 | 1.00× | Reference baseline |
| WASM in browser (Chrome, M4 Pro) | 1,341,206 | 25.30× | JIT-compiled WASM, ~35% of LUT-optimized native C+NEON; mean of 5 runs in regular Chrome (CV 0.18%). Electron-embedded Chromium shows ~2.04M; see wasm/bench_runs.txt |
| C+NEON Q4.12 | 2,191,219 | 41.34× | int16 multiply on NEON |
| C+NEON fp32 | 3,820,760 | 72.09× | Whole model + KV cache fits in L1 |
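A minimal sketch of the measurement loop, assuming a hypothetical `step_fn(token) -> logits` wrapper around whichever implementation is under test (the name and signature are illustrative, not the repo's API). Warmup is excluded from the timed window, matching the methodology above:

```python
import time
import numpy as np

def bench(step_fn, prompt_token=0, warmup=2_000, measured=20_000):
    """One autoregressive stream, batch=1: sample char-by-char and report tok/sec."""
    rng = np.random.default_rng(0)
    tok = prompt_token

    def one_step(tok):
        logits = step_fn(tok)
        p = np.exp(logits - logits.max())      # softmax, numerically stable
        p /= p.sum()
        return rng.choice(len(p), p=p)         # multinomial sampling

    for _ in range(warmup):                    # let caches/JITs settle before timing
        tok = one_step(tok)
    t0 = time.perf_counter()
    for _ in range(measured):
        tok = one_step(tok)
    return measured / (time.perf_counter() - t0)  # tok/sec after warmup
```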
Independent autoregressive streams running on separate cores. M4 Pro has 10 P-cores + 4 E-cores; saturation visible around 14 streams.
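A toy sketch of that scaling measurement, using one OS process per stream so streams land on separate cores (Python threads would serialize on the GIL). The per-token work here is a placeholder, not the model:

```python
import multiprocessing as mp
import time

def stream(n: int = 200_000) -> float:
    """Stand-in for one autoregressive stream: a tight per-token loop.
    Returns this process's tok/sec."""
    x, t0 = 1, time.perf_counter()
    for _ in range(n):
        x = (x * 1103515245 + 12345) & 0x7FFFFFFF  # placeholder per-token work
    return n / (time.perf_counter() - t0)

if __name__ == "__main__":
    for n_streams in (1, 4, 10, 14, 20):
        with mp.Pool(n_streams) as pool:
            rates = pool.map(stream, [200_000] * n_streams)
        # Aggregate throughput should scale until the cores saturate (~14 on M4 Pro).
        print(f"{n_streams:>2} streams: {sum(rates):,.0f} tok/sec aggregate")
```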
Same dataset (32K names from karpathy/makemore), each step adds exactly one concept on top of the last.
NLL per character on the training corpus. A sketch of step 1 follows the table.
| Step | What's added | NLL | Params |
|---|---|---|---|
| 1. Counting bigram | Add-one smoothed frequency table; no learning | 2.4546 | 729 |
| 2. Neural bigram | Manual gradient descent over a 27×27 logit matrix | 2.4617 | 729 |
| 3. Autograd | Same model, but graph-based backprop replaces hand-derived gradients | 2.4617 | 729 |
| 4. MLP | Embeddings, 3-character context, hidden layer with tanh | 2.20 | 4,009 |
| 5. GPT, 1 head, SGD | Self-attention + RMSNorm + position embeddings; SGD plateaus | 2.29 | 4,240 |
| 6. GPT, 4 heads, Adam | Multi-head attention + Adam; same architecture family as TALOS | 2.21 train / 2.20 val | 4,240 |
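A minimal sketch of step 1, the add-one smoothed counting bigram, assuming the corpus is a `names.txt` file with one name per line (the filename is an assumption):

```python
import math

chars = "abcdefghijklmnopqrstuvwxyz"
stoi = {c: i + 1 for i, c in enumerate(chars)}
stoi["."] = 0                                   # '.' doubles as start/end token

names = [line.strip() for line in open("names.txt")]

N = [[1] * 27 for _ in range(27)]               # add-one smoothing: every cell starts at 1
for w in names:
    seq = [0] + [stoi[c] for c in w] + [0]
    for a, b in zip(seq, seq[1:]):
        N[a][b] += 1

P = [[n / sum(row) for n in row] for row in N]  # row-normalize counts to probabilities

nll, cnt = 0.0, 0                               # NLL per character on the training corpus
for w in names:
    seq = [0] + [stoi[c] for c in w] + [0]
    for a, b in zip(seq, seq[1:]):
        nll -= math.log(P[a][b])
        cnt += 1
print(f"NLL/char = {nll / cnt:.4f}")            # ≈ 2.45 on the makemore names corpus
```

The 27×27 table is the entire model: 729 parameters and no learning loop, which is what steps 2-6 then replace piece by piece.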
All steps trained from scratch on the same corpus. Step 6 uses a 90/10 split over unique names (the dataset has 32,033 rows but only 29,494 unique names; deduping before the split prevents leakage); the assertion in the code verifies the train and val name sets are disjoint. Val NLL ≤ train NLL, so no measurable overfitting. Step 6 has learnable RMSNorm gains and a final RMSNorm (4,240 params); the TALOS-trained reference used in the WASM benchmark has parameter-free RMSNorm and no final norm before the LM head (4,192 params). Same family, different parameter count and weights. Both land near NLL 2.2 — a model-capacity floor at this size, not a dataset floor (a larger model would dip below).
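A sketch of that dedupe-then-split logic, again assuming `names.txt`; the seed is illustrative:

```python
import random

names = [line.strip() for line in open("names.txt")]
unique = sorted(set(names))                # 32,033 rows -> 29,494 unique names
random.Random(42).shuffle(unique)          # seed is illustrative, not the repo's

n_train = int(0.9 * len(unique))           # 90/10 split over unique names
train, val = unique[:n_train], unique[n_train:]

# Deduping before the split means no name string can appear on both sides.
assert set(train).isdisjoint(set(val))
```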
Trained TALOS-V2 weights, requantized to various Q-formats with per-tensor symmetric scaling (int8 / int4 / int2). NLL measured on the first 500 names of the training corpus. The model is at most 8-bit-effective: per-tensor int8 shows no measurable degradation on this slice (NLL drift below the sampling noise floor), while Q4.12 (16-bit) carries headroom unused by this model's weight distribution. Note: this is a same-corpus eval, not a held-out set, so it bounds quantization "harm" but doesn't certify it. A quantization sketch follows the table.
| Format | max \|err\| | NLL | Sample quality |
|---|---|---|---|
| fp32 (baseline) | 0.00000 | 2.2633 | kana, keelan, alilan, ariel, cairi |
| Q4.12 strict (TALOS) | 0.00012 | 2.2633 | bit-identical |
| Q3.13 strict | 0.00006 | 2.2633 | bit-identical |
| per-tensor int8 | 0.00525 | 2.2631 | identical samples |
| per-tensor int4 | 0.09499 | 2.3009 | jaliny, mariel, calin (different but plausible) |
| per-tensor int2 | 0.65911 | 3.3499 | broken: rahftryckagnfern, fpvavsjdzjccwpvt |
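A sketch of both requantization schemes from the table: per-tensor symmetric int-N and Q4.12 fixed point. NumPy-based and illustrative of the arithmetic, not the repo's actual tooling:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> np.ndarray:
    """Per-tensor symmetric quantization: a single scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8, 7 for int4, 1 for int2
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).astype(np.float32)      # dequantized weights for the NLL eval

def quantize_q4_12(w: np.ndarray) -> np.ndarray:
    """Q4.12 fixed point: signed int16 with 12 fractional bits (step = 1/4096)."""
    q = np.clip(np.round(w * 4096), -32768, 32767).astype(np.int16)
    return q.astype(np.float32) / 4096

w = (np.random.randn(64, 64) * 0.5).astype(np.float32)   # stand-in weight tensor
for bits in (8, 4, 2):
    err = np.abs(quantize_symmetric(w, bits) - w).max()
    print(f"int{bits}: max |err| = {err:.5f}")
err = np.abs(quantize_q4_12(w) - w).max()
print(f"Q4.12: max |err| = {err:.5f}")         # ≤ 1/8192 for in-range weights
```

Round-to-nearest Q4.12 bounds the error at half a step, 1/8192 ≈ 0.00012, which is exactly the max-error column for the strict Q4.12 row.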
cProfile on bench_numpy.py, sorted by self-time. Only ~4% of wall-clock is in the actual matrix kernel (c_einsum); the other ~96% is reduction setup, type checks, and dispatcher overhead.
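A self-contained way to reproduce that kind of profile, with a toy stand-in for the bench_numpy.py inner loop (the real script isn't reproduced here):

```python
import cProfile
import pstats
import numpy as np

def forward_loop(steps: int = 5_000):
    """Stand-in for the bench inner loop: one small einsum per token."""
    x = np.random.randn(1, 64).astype(np.float32)
    W = np.random.randn(64, 64).astype(np.float32)
    for _ in range(steps):
        x = np.tanh(np.einsum("ij,jk->ik", x, W))

cProfile.run("forward_loop()", "bench.prof")
pstats.Stats("bench.prof").sort_stats("tottime").print_stats(10)
# "tottime" is self-time: at matrices this small, c_einsum's slice is tiny
# and NumPy's dispatch/reduction machinery dominates the listing.
```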
The same forward pass in seven implementations, ranked by tok/sec relative to the FPGA.
microgpt_inf.wasm on the same M4 Pro hardware runs at ~1.34M tok/sec in regular Chrome 145 and ~2.04M tok/sec in Electron 41's bundled Chromium 146, a ~50% delta from the runtime alone. WebAssembly performance is a band per (binary, host) pair, not a single number per machine. The published headline (25.30× FPGA) uses the regular-Chrome figure because that is what most readers will see.