A 4,192-parameter transformer compiled from C to WebAssembly, running the same trained weights as TALOS-V2's FPGA. It generates plausible names character by character via multinomial sampling. · overview · full report · source
Same model, different substrates. The "your browser" row updates after you run the benchmark.
| Implementation | Source | tok/sec | vs FPGA |
|---|---|---|---|
| MLX GPU | measured locally, M4 Pro (24 GB) | 1,865 | 0.04× |
| MLX CPU | measured locally, M4 Pro (24 GB) | 3,873 | 0.07× |
| pure Python | measured locally, M4 Pro (24 GB) | 4,332 | 0.08× |
| pure Python (M5 Pro) | talos-vs-macbook README | 8,491 | 0.16× |
| NumPy fp32 | measured locally, M4 Pro (24 GB) | 24,223 | 0.46× |
| NumPy fp32 (M5 Pro) | talos-vs-macbook README | 47,589 | 0.90× |
| TALOS-V2 (FPGA, 56 MHz) | v2.talos.wtf write-up · repo | 53,000 | 1.00× |
| WASM browser (M4 Pro 24GB, regular Chrome) | bench_runs.txt · raw runs | 1,341,206 | 25.30× |
| WASM browser (M4 Pro 24GB, Electron preview) | bench_runs.txt · raw runs | 2,038,131 | 38.46× |
| your browser (WASM, live) | live, this page | — | — |
| C fp32+AVX2 (Intel) | microgpt-c README | 2,631,689 | 49.65× |
| C+NEON Q4.12 (M4 Pro) | measured locally, M4 Pro (24 GB) | 2,191,219 | 41.34× |
| C+NEON Q4.12 (M5 Pro) | talos-vs-macbook README | 3,373,950 | 63.66× |
| C+NEON fp32 (M4 Pro) | measured locally, M4 Pro (24 GB) | 3,820,760 | 72.09× |
| C+NEON fp32 (M5 Pro) | talos-vs-macbook README | 6,713,978 | 126.68× |
| C+NEON ×14 streams (M4 Pro) | measured locally, M4 Pro (24 GB) | 32,894,149 | 620.6× |
| C+NEON ×18 streams (M5 Pro) | talos-vs-macbook README | 85,967,850 | 1,621.7× |
"Source" indicates whether a number was measured locally on the M4 Pro (24 GB, 10P+4E)
or quoted as published from the upstream repos. Important caveats:
(1) cross-machine numbers (M5 Pro vs M4 Pro) are not apples-to-apples: M5 Pro single-stream
throughput is roughly 1.5–1.8× the M4 Pro's due to clock and IPC changes;
(2) the C+NEON benchmark precomputes (token, pos) embedding + RMSNorm + Q/K/V
lookup tables outside its timed loop, while the WASM forward recomputes that work
on every step. WASM-vs-C+NEON is therefore "browser WASM vs LUT-optimized native," not a
strict same-workload comparison.
The WASM build is microgpt_inf.c compiled with emcc -O3 -msimd128 -ffast-math.
Weights are 16,768 bytes of fp32 (4,192 parameters × 4 bytes) loaded into the WASM module
at startup. The model architecture matches Karpathy's microGPT exactly: n_embd=16, n_head=4,
block_size=16, vocab=27, one transformer block. Sampling is multinomial at temperature 0.5.
The resulting .wasm binary on the same M4 Pro hardware measures
~1.34M tok/sec in regular Chrome 145 and ~2.04M tok/sec in Electron 41's
bundled Chromium 146, a roughly 50% delta from the runtime alone.
Click Benchmark in your own browser to see what your
combination of hardware + V8 produces. Both numbers are recorded in
bench_runs.txt.