micro-gpt across the abstraction stack

A 4,192-parameter transformer (Karpathy's microGPT) implemented from scratch and benchmarked on eight substrates: pure Python, NumPy, MLX-CPU/GPU, an FPGA, hand-written C+NEON, and WebAssembly running in your browser.

Headline findings

NumPy spends ~96% of wall-clock on dispatch/typecheck overhead, only ~4% on the actual matmul kernel.
MLX-GPU (1,865 tok/sec) loses to pure Python: GPU launch overhead is the wrong shape for a workload of only ~4K MACs/token.
WebAssembly in regular Chrome runs 25× faster than the FPGA and reaches ~35% of LUT-optimized native C+NEON.
The model has at most 8 effective bits of precision. Per-tensor int8 shows no measurable degradation.
The same WASM binary shows a ~50% throughput delta across V8 builds (Chrome 145 vs. Electron 41's Chromium 146).
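
For readers unfamiliar with the term, per-tensor int8 quantization stores each weight tensor as 8-bit integer codes plus a single floating-point scale. The sketch below is only an illustration of that idea, not the project's code: it assumes a symmetric scheme (the largest magnitude maps to 127), and the function names and sample weights are hypothetical.

```python
# Illustrative sketch of symmetric per-tensor int8 quantization.
# Assumption: symmetric range [-127, 127]; all names are hypothetical.

def quantize_per_tensor(weights):
    """Map a flat list of floats to int8 codes using one scale for the whole tensor."""
    max_abs = max(abs(w) for w in weights)
    # One scale per tensor: the largest magnitude maps to code 127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the int8 range used here.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes and the shared scale."""
    return [c * scale for c in codes]


# Made-up sample weights, for illustration only.
weights = [0.42, -1.30, 0.07, 0.91, -0.55]
codes, scale = quantize_per_tensor(weights)
restored = dequantize(codes, scale)

# Worst-case round-trip error; bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, max_err)
```

Because every weight is rounded to the nearest code, the round-trip error per weight is at most half of `scale`. That bound is the quantity to compare against the model's effective precision.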

▶ Try it now (live WASM demo)

📊 Full report (interactive charts)

Headline findings