A 4,192-parameter transformer compiled from C to WebAssembly, running the same trained weights as TALOS-V2's FPGA. It generates plausible names character by character via multinomial sampling. · overview · full report · source
Same model, different substrates. The "your browser" row updates after you run the benchmark.
| Implementation | Source | tok/sec | vs FPGA |
|---|---|---|---|
| MLX GPU | measured locally, M4 Pro (24 GB) | 1,865 | 0.04× |
| MLX CPU | measured locally, M4 Pro (24 GB) | 3,873 | 0.07× |
| pure Python | measured locally, M4 Pro (24 GB) | 4,332 | 0.08× |
| pure Python (M5 Pro) | talos-vs-macbook README | 8,491 | 0.16× |
| NumPy fp32 | measured locally, M4 Pro (24 GB) | 24,223 | 0.46× |
| NumPy fp32 (M5 Pro) | talos-vs-macbook README | 47,589 | 0.90× |
| TALOS-V2 (FPGA, 56 MHz) | v2.talos.wtf write-up · repo | 53,000 | 1.00× |
| WASM browser (M4 Pro 24GB, regular Chrome) | bench_runs.txt · raw runs | 1,341,206 | 25.30× |
| WASM browser (M4 Pro 24GB, Electron preview) | bench_runs.txt · raw runs | 2,038,131 | 38.46× |
| your browser (WASM, live) | live, this page | — | — |
| C fp32+AVX2 (Intel) | microgpt-c README | 2,631,689 | 49.65× |
| C+NEON Q4.12 (M4 Pro) | measured locally, M4 Pro (24 GB) | 2,191,219 | 41.34× |
| C+NEON Q4.12 (M5 Pro) | talos-vs-macbook README | 3,373,950 | 63.66× |
| C+NEON fp32 (M4 Pro) | measured locally, M4 Pro (24 GB) | 3,820,760 | 72.09× |
| C+NEON fp32 (M5 Pro) | talos-vs-macbook README | 6,713,978 | 126.68× |
| C+NEON ×14 streams (M4 Pro) | measured locally, M4 Pro (24 GB) | 32,894,149 | 620.6× |
| C+NEON ×18 streams (M5 Pro) | talos-vs-macbook README | 85,967,850 | 1,621.7× |
"Source" indicates whether a number was measured locally on the M4 Pro (24 GB, 10P+4E)
or quoted as published from the upstream repos. Important caveats:
(1) cross-machine numbers (M5 Pro vs M4 Pro) are not apples-to-apples: M5 Pro single-stream
throughput is roughly 1.5–1.8× the M4 Pro's due to clock and IPC changes;
(2) the C+NEON benchmark precomputes (token, pos) embedding + RMSNorm + Q/K/V
lookup tables outside its timed loop, while the WASM forward recomputes that work
on every step. WASM-vs-C+NEON is therefore "browser WASM vs LUT-optimized native," not a
strict same-workload comparison.
The WASM build is microgpt_inf.c compiled with emcc -O3 -msimd128 -ffast-math.
Weights are 16,768 bytes of fp32 (4,192 parameters × 4 bytes) loaded into the WASM module
at startup. The model architecture matches Karpathy's microGPT exactly: n_embd=16, n_head=4,
block_size=16, vocab=27, one transformer block. Sampling is multinomial at temperature 0.5.
The resulting .wasm binary on the same M4 Pro hardware measures
~1.34M tok/sec in regular Chrome 145 and ~2.04M tok/sec in Electron 41's
bundled Chromium 146, a roughly 50% delta from the runtime alone.
Click Benchmark in your own browser to see what your
combination of hardware + V8 produces. Both numbers are recorded in
bench_runs.txt.