microGPT across the abstraction stack

4,192-parameter transformer; same trained weights run on every layer of the stack from pure Python to FPGA. Hardware: Apple M4 Pro (24 GB, 10P + 4E cores, macOS 15.6.1, clang 17). Reference: TALOS-V2 on Cyclone V at 56 MHz. · overview · ▶ try the live WASM demo · source

TL;DR. For a 4K-parameter transformer, hardware speed is not the bottleneck — framework dispatch is. Pure Python is 8% of the FPGA. NumPy is 46%. MLX-GPU is 4% — slower than pure Python — because GPU launch overhead is the wrong shape for 4K MACs per token. WebAssembly in regular Chrome hits 25× the FPGA, ~35% of a LUT-optimized native C+NEON harness (note: the native harness precomputes the model's front half into lookup tables outside the timed loop; WASM does that work inside the loop, so this is "browser WASM vs LUT-optimized native," not strict apples-to-apples). C+NEON hits 72× the FPGA single-stream and 620× aggregate. The trained model is also surprisingly compressible: per-tensor int8 quantization gives no measurable degradation on a 500-name slice; the FPGA's Q4.12 carries headroom unused by this model's weight distribution (chosen for hardware reasons, not accuracy).

1   Single-stream throughput

Each implementation runs one autoregressive stream, batch=1, char-by-char with multinomial sampling. Token rate measured after warmup. Lower on this chart = slower. Log scale.

| Implementation | tok/sec | vs FPGA | What dominates |
|---|---|---|---|
| pure Python | 4,332 | 0.08× | Python loop overhead per MAC |
| MLX fp32 (CPU) | 3,873 | 0.07× | Per-op dispatch through the array framework |
| MLX fp32 (GPU) | 1,865 | 0.04× | GPU kernel launch latency per op |
| NumPy fp32 | 24,223 | 0.46× | Reduce/dispatch/typecheck (~96% non-math) |
| TALOS-V2 (FPGA, 56 MHz) | 53,000 | 1.00× | Reference baseline |
| WASM in browser (Chrome, M4 Pro) | 1,341,206 | 25.30× | JIT-compiled WASM, ~35% of LUT-optimized native C+NEON; mean of 5 runs in regular Chrome (CV 0.18%). Electron-embedded Chromium shows ~2.04M; see wasm/bench_runs.txt |
| C+NEON Q4.12 | 2,191,219 | 41.34× | int16 multiply on NEON |
| C+NEON fp32 | 3,820,760 | 72.09× | Whole model + KV cache fits in L1 |
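The measurement loop has the same shape in every implementation: one stream, batch=1, draw a character, feed it back. A minimal Python sketch of that loop; the `forward`, `BOS`, and token-count defaults below are illustrative, not the benchmark's actual API:

```python
# Sketch of the single-stream measurement: one autoregressive stream, batch=1,
# char-by-char multinomial sampling, rate measured only after a warmup phase.
import math
import random
import time

BOS = 0  # index of the start/stop symbol in the 27-char name vocabulary (assumed)

def sample_multinomial(logits):
    # softmax, then draw one index (mirrors the char-by-char multinomial sampling)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    r, acc = random.random() * z, 0.0
    for i, e in enumerate(exps):
        acc += e
        if r <= acc:
            return i
    return len(exps) - 1

def bench(forward, warmup_tokens=2_000, timed_tokens=20_000):
    def step(ctx):
        nxt = sample_multinomial(forward(ctx))
        return [BOS] if nxt == BOS else ctx + [nxt]   # new name when '.' is drawn
    ctx = [BOS]
    for _ in range(warmup_tokens):                     # warmup: identical loop, untimed
        ctx = step(ctx)
    t0 = time.perf_counter()
    for _ in range(timed_tokens):
        ctx = step(ctx)
    return timed_tokens / (time.perf_counter() - t0)   # tok/sec, batch=1
```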

2   Multi-stream aggregate (C+NEON)

Independent autoregressive streams running on separate cores. M4 Pro has 10 P-cores + 4 E-cores; saturation visible around 14 streams.
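A minimal sketch of how the aggregate figure is taken, assuming one independent stream per process and a fixed wall-clock window. The per-stream work below is a placeholder; the real harness runs the C+NEON forward pass pinned one stream per core:

```python
# N independent single-stream loops on separate processes; aggregate tok/sec is
# the total token count divided by the window. Sweeping n_streams shows the knee.
import multiprocessing as mp
import time

def run_stream(seconds: float) -> int:
    """One independent stream: loop until the window closes, count tokens."""
    produced = 0
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        _ = sum(i * i for i in range(4_192))   # placeholder for ~4K MACs per token
        produced += 1
    return produced

def aggregate_tok_per_sec(n_streams: int, seconds: float = 3.0) -> float:
    with mp.Pool(n_streams) as pool:
        totals = pool.map(run_stream, [seconds] * n_streams)
    return sum(totals) / seconds

if __name__ == "__main__":
    for n in (1, 4, 8, 10, 14, 16):
        print(n, round(aggregate_tok_per_sec(n)))
```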

3   The educational ladder — loss across six steps

Same dataset (32K names from karpathy/makemore), each step adds exactly one concept on top of the last. NLL per character on the training corpus.

| Step | What's added | NLL | Params |
|---|---|---|---|
| 1. Counting bigram | Add-one smoothed frequency table; no learning | 2.4546 | 729 |
| 2. Neural bigram | Manual gradient descent over a 27×27 logit matrix | 2.4617 | 729 |
| 3. Autograd | Same model, but graph-based backprop replaces hand-derived gradients | 2.4617 | 729 |
| 4. MLP | Embeddings, 3-character context, hidden layer with tanh | 2.20 | 4,009 |
| 5. GPT, 1 head, SGD | Self-attention + RMSNorm + position embeddings; SGD plateaus | 2.29 | 4,240 |
| 6. GPT, 4 heads, Adam | Multi-head attention + Adam; same architecture family as TALOS | 2.21 train / 2.20 val | 4,240 |
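Step 1 is small enough to show whole. A sketch of the add-one smoothed counting bigram and its per-character NLL (the `names` list is assumed to be the loaded makemore corpus; this is not the repo's actual code):

```python
# Add-one smoothed bigram counts over character pairs, scored as NLL per character.
import math

def bigram_nll(names: list[str]) -> float:
    chars = "." + "abcdefghijklmnopqrstuvwxyz"        # '.' marks start and end
    idx = {c: i for i, c in enumerate(chars)}
    counts = [[1] * 27 for _ in range(27)]            # add-one smoothing, 27x27 = 729 entries
    for name in names:
        seq = "." + name + "."
        for a, b in zip(seq, seq[1:]):
            counts[idx[a]][idx[b]] += 1
    probs = [[c / sum(row) for c in row] for row in counts]
    total, n = 0.0, 0
    for name in names:
        seq = "." + name + "."
        for a, b in zip(seq, seq[1:]):
            total += -math.log(probs[idx[a]][idx[b]])
            n += 1
    return total / n                                   # roughly 2.45 on this corpus
```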

All steps trained from scratch on the same corpus. Step 6 uses a 90/10 split over unique names (the dataset has 32,033 rows but only 29,494 unique names; deduping before the split prevents leakage); the assertion in the code verifies the train and val name sets are disjoint. Val NLL ≤ train NLL, so no measurable overfitting. Step 6 has learnable RMSNorm gains and a final RMSNorm (4,240 params); the TALOS-trained reference used in the WASM benchmark has parameter-free RMSNorm and no final norm before the LM head (4,192 params). Same family, different parameter count and weights. Both land near NLL 2.2 — a model-capacity floor at this size, not a dataset floor (a larger model would dip below).
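A sketch of that leakage-free split, with the disjointness assertion; the helper name, seed, and split ratio are illustrative, only the dedupe-before-split order matters:

```python
# Dedupe to unique names first, then shuffle and split 90/10, then assert disjointness.
import random

def split_names(names: list[str], val_frac: float = 0.10, seed: int = 0):
    unique = sorted(set(names))                 # 32,033 rows -> 29,494 unique names
    random.Random(seed).shuffle(unique)
    n_val = int(len(unique) * val_frac)
    val, train = unique[:n_val], unique[n_val:]
    assert set(train).isdisjoint(val), "train/val leakage"
    return train, val
```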

4   Quantization study — how much precision does the model actually need?

Trained TALOS-V2 weights, requantized to various Q-formats with per-tensor symmetric scaling (int8 / int4 / int2). NLL measured on the first 500 names of the training corpus. The model is at most 8-bit-effective — per-tensor int8 shows no measurable degradation on this slice (NLL drift below the sampling noise floor); Q4.12 (16-bit) carries headroom unused by this model's weight distribution. Note: this is a same-corpus eval, not a held-out set, so it bounds quantization "harm" but doesn't certify it.

| Format | max \|err\| | NLL | Sample quality |
|---|---|---|---|
| fp32 (baseline) | 0.00000 | 2.2633 | kana, keelan, alilan, ariel, cairi |
| Q4.12 strict (TALOS) | 0.00012 | 2.2633 | bit-identical |
| Q3.13 strict | 0.00006 | 2.2633 | bit-identical |
| per-tensor int8 | 0.00525 | 2.2631 | identical samples |
| per-tensor int4 | 0.09499 | 2.3009 | jaliny, mariel, calin (different but plausible) |
| per-tensor int2 | 0.65911 | 3.3499 | broken: rahftryckagnfern, fpvavsjdzjccwpvt |
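The per-tensor symmetric scheme behind the int8/int4/int2 rows, sketched in NumPy (an illustrative helper, not the study's actual code): one scale per tensor, symmetric around zero, no zero-point; the reported max |err| is the round-trip error after dequantization.

```python
# Per-tensor symmetric quantize/dequantize; bits = 8, 4, or 2 reproduces the int rows.
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int):
    qmax = 2 ** (bits - 1) - 1                  # 127 / 7 / 1 for int8 / int4 / int2
    scale = np.abs(w).max() / qmax if w.size else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)
    w_hat = q * scale
    return w_hat, np.abs(w_hat - w).max()       # dequantized tensor, max |err|

# Applied per tensor: every weight matrix gets its own scale, then NLL is
# re-scored on the 500-name slice with the dequantized weights.
```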

5   Where does NumPy actually spend its time?

cProfile on bench_numpy.py, sorted by self-time. Only ~4% of wall-clock time is in the actual matrix kernel (c_einsum); the other ~96% is reduction setup, type checks, and dispatcher overhead.
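To reproduce the breakdown (assuming bench_numpy.py runs as a script), the equivalent of `python -m cProfile -s tottime bench_numpy.py`:

```python
# Profile the benchmark script and print the top entries sorted by self-time (tottime).
import cProfile
import pstats

cProfile.run("import runpy; runpy.run_path('bench_numpy.py', run_name='__main__')",
             "numpy_bench.prof")
pstats.Stats("numpy_bench.prof").sort_stats("tottime").print_stats(15)
```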

6   The story in one chart

The same forward pass in seven implementations, ranked by tok/sec relative to the FPGA.

Lessons

  1. Hardware speed is not the bottleneck for a tiny model. Framework dispatch is. 4,192 multiply-accumulates per token finish in <1 microsecond on real silicon. NumPy spends 30 microseconds dispatching the matmul. MLX-GPU spends a millisecond launching the kernel. The math is irrelevant when the prologue costs more than the work.
  2. The FPGA's win isn't peak throughput; it's deterministic latency and a small power envelope. 53,000 tok/sec at 2W (Cyclone V) works out to roughly 38 µJ per token, versus roughly 1.3 µJ per token for 3.8M tok/sec at ~5W per active core (M4). But the FPGA's per-token latency is exactly cycle-counted; the M4's varies with cache state, scheduler, and thermals.
  3. Quantization is for the substrate, not for accuracy. Per-tensor int8 produces no measurable degradation on the 500-name slice scored. Q4.12 carries unused headroom because Q4.12 maps cleanly onto a single FPGA DSP block; per-tensor int8 with runtime scale would need extra silicon. The choice of bit-width is constraint-driven, not error-driven; a Q4.12 sketch follows this list.
  4. WASM throughput depends on the V8 build, not just the binary. The same microgpt_inf.wasm on the same M4 Pro hardware runs at ~1.34M tok/sec in regular Chrome 145 and ~2.04M tok/sec in Electron 41's bundled Chromium 146 — a 50% delta from runtime alone. WebAssembly performance is a band per (binary, host) pair, not a single number per machine. The published headline (25.30× FPGA) uses the regular-Chrome figure because that is what most readers will see.
  5. The educational progression matters more than the architecture. Step 6 of the ladder reaches NLL 2.21 train / 2.20 val (deduped split). The published TALOS-trained reference reaches 2.26. The published microgpt-java reference reaches 2.37. They are all in the same band: a model-capacity floor at this scale (~4K params), not a dataset floor; larger models do go below 2.0 on this corpus. Knowing each component separately is the durable knowledge.
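For concreteness, the Q4.12 arithmetic referenced in lesson 3, as an illustrative Python sketch (hardware details such as accumulator width and saturation behavior in the real RTL and NEON path are not modeled here):

```python
# Q4.12: 4 integer bits, 12 fractional bits, stored in int16. A Q4.12 x Q4.12
# product lands at 24 fractional bits; shifting right by 12 returns to Q4.12,
# which is the multiply shape a single DSP block (or NEON int16 lane) handles.
FRAC_BITS = 12

def to_q412(x: float) -> int:
    q = int(round(x * (1 << FRAC_BITS)))
    return max(-32768, min(32767, q))          # saturate into int16 range

def q412_mul(a: int, b: int) -> int:
    return (a * b) >> FRAC_BITS                # Q4.12 * Q4.12 -> Q4.12

# e.g. to_q412(0.5) == 2048; q412_mul(to_q412(0.5), to_q412(0.25)) == to_q412(0.125)
```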