microGPT across the abstraction stack

4,192-parameter transformer; same trained weights run on every layer of the stack from pure Python to FPGA. Hardware: Apple M4 Pro (24 GB, 10P + 4E cores, macOS 15.6.1, clang 17). Reference: TALOS-V2 on Cyclone V at 56 MHz. · overview · ▶ try the live WASM demo · source

TL;DR. For a 4K-parameter transformer, hardware speed is not the bottleneck — framework dispatch is. Pure Python is 8% of the FPGA. NumPy is 46%. MLX-GPU is 4% — slower than pure Python — because GPU launch overhead is the wrong shape for 4K MACs per token. WebAssembly in regular Chrome hits 25× the FPGA, ~35% of a LUT-optimized native C+NEON harness (note: the native harness precomputes the model's front half into lookup tables outside the timed loop; WASM does that work inside the loop, so this is "browser WASM vs LUT-optimized native," not strict apples-to-apples). C+NEON hits 72× the FPGA single-stream and 620× aggregate. The trained model is also surprisingly compressible: per-tensor int8 quantization gives no measurable degradation on a 500-name slice; the FPGA's Q4.12 carries headroom unused by this model's weight distribution (chosen for hardware reasons, not accuracy).

1   Single-stream throughput

Each implementation runs one autoregressive stream, batch=1, char-by-char with multinomial sampling. Token rate measured after warmup. Lower on this chart = slower. Log scale.

| Implementation | tok/sec | vs FPGA | What dominates |
|---|---|---|---|
| pure Python | 4,332 | 0.08× | Python loop overhead per MAC |
| MLX fp32 (CPU) | 3,873 | 0.07× | Per-op dispatch through the array framework |
| MLX fp32 (GPU) | 1,865 | 0.04× | GPU kernel launch latency per op |
| NumPy fp32 | 24,223 | 0.46× | Reduce/dispatch/typecheck (~96% non-math) |
| TALOS-V2 (FPGA, 56 MHz) | 53,000 | 1.00× | Reference baseline |
| WASM in browser (Chrome, M4 Pro) | 1,341,206 | 25.30× | JIT-compiled WASM, ~35% of LUT-optimized native C+NEON; mean of 5 runs in regular Chrome (CV 0.18%). Electron-embedded Chromium shows ~2.04M; see wasm/bench_runs.txt |
| C+NEON Q4.12 | 2,191,219 | 41.34× | int16 multiply on NEON |
| C+NEON fp32 | 3,820,760 | 72.09× | Whole model + KV cache fits in L1 |
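The measurement loop has the same shape in every implementation: one stream, batch=1, draw a character, feed it back. A minimal Python sketch of that loop; the `forward`, `BOS`, and token-count defaults below are illustrative, not the benchmark's actual API:

```python
# Sketch of the single-stream measurement: one autoregressive stream, batch=1,
# char-by-char multinomial sampling, rate measured only after a warmup phase.
import math
import random
import time

BOS = 0  # index of the start/stop symbol in the 27-char name vocabulary (assumed)

def sample_multinomial(logits):
    # softmax, then draw one index (mirrors the char-by-char multinomial sampling)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    r, acc = random.random() * z, 0.0
    for i, e in enumerate(exps):
        acc += e
        if r <= acc:
            return i
    return len(exps) - 1

def bench(forward, warmup_tokens=2_000, timed_tokens=20_000):
    def step(ctx):
        nxt = sample_multinomial(forward(ctx))
        return [BOS] if nxt == BOS else ctx + [nxt]   # new name when '.' is drawn
    ctx = [BOS]
    for _ in range(warmup_tokens):                     # warmup: identical loop, untimed
        ctx = step(ctx)
    t0 = time.perf_counter()
    for _ in range(timed_tokens):
        ctx = step(ctx)
    return timed_tokens / (time.perf_counter() - t0)   # tok/sec, batch=1
```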

2   Multi-stream aggregate (C+NEON)

Independent autoregressive streams running on separate cores. M4 Pro has 10 P-cores + 4 E-cores; saturation visible around 14 streams.
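A minimal sketch of how the aggregate figure is taken, assuming one independent stream per process and a fixed wall-clock window. The per-stream work below is a placeholder; the real harness runs the C+NEON forward pass pinned one stream per core:

```python
# N independent single-stream loops on separate processes; aggregate tok/sec is
# the total token count divided by the window. Sweeping n_streams shows the knee.
import multiprocessing as mp
import time

def run_stream(seconds: float) -> int:
    """One independent stream: loop until the window closes, count tokens."""
    produced = 0
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        _ = sum(i * i for i in range(4_192))   # placeholder for ~4K MACs per token
        produced += 1
    return produced

def aggregate_tok_per_sec(n_streams: int, seconds: float = 3.0) -> float:
    with mp.Pool(n_streams) as pool:
        totals = pool.map(run_stream, [seconds] * n_streams)
    return sum(totals) / seconds

if __name__ == "__main__":
    for n in (1, 4, 8, 10, 14, 16):
        print(n, round(aggregate_tok_per_sec(n)))
```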

3   The educational ladder — loss across six steps

Same dataset (32K names from karpathy/makemore), each step adds exactly one concept on top of the last. NLL per character on the training corpus.

| Step | What's added | NLL | Params |
|---|---|---|---|
| 1. Counting bigram | Add-one smoothed frequency table; no learning | 2.4546 | 729 |
| 2. Neural bigram | Manual gradient descent over a 27×27 logit matrix | 2.4617 | 729 |
| 3. Autograd | Same model, but graph-based backprop replaces hand-derived gradients | 2.4617 | 729 |
| 4. MLP | Embeddings, 3-character context, hidden layer with tanh | 2.20 | 4,009 |
| 5. GPT, 1 head, SGD | Self-attention + RMSNorm + position embeddings; SGD plateaus | 2.29 | 4,240 |
| 6. GPT, 4 heads, Adam | Multi-head attention + Adam; same architecture family as TALOS | 2.21 train / 2.20 val | 4,240 |
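Step 1 is small enough to show whole. A sketch of the add-one smoothed counting bigram and its per-character NLL (the `names` list is assumed to be the loaded makemore corpus; this is not the repo's actual code):

```python
# Add-one smoothed bigram counts over character pairs, scored as NLL per character.
import math

def bigram_nll(names: list[str]) -> float:
    chars = "." + "abcdefghijklmnopqrstuvwxyz"        # '.' marks start and end
    idx = {c: i for i, c in enumerate(chars)}
    counts = [[1] * 27 for _ in range(27)]            # add-one smoothing, 27x27 = 729 entries
    for name in names:
        seq = "." + name + "."
        for a, b in zip(seq, seq[1:]):
            counts[idx[a]][idx[b]] += 1
    probs = [[c / sum(row) for c in row] for row in counts]
    total, n = 0.0, 0
    for name in names:
        seq = "." + name + "."
        for a, b in zip(seq, seq[1:]):
            total += -math.log(probs[idx[a]][idx[b]])
            n += 1
    return total / n                                   # roughly 2.45 on this corpus
```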

All steps trained from scratch on the same corpus. Step 6 uses a 90/10 split over unique names (the dataset has 32,033 rows but only 29,494 unique names; deduping before the split prevents leakage); the assertion in the code verifies the train and val name sets are disjoint. Val NLL ≤ train NLL, so no measurable overfitting. Step 6 has learnable RMSNorm gains and a final RMSNorm (4,240 params); the TALOS-trained reference used in the WASM benchmark has parameter-free RMSNorm and no final norm before the LM head (4,192 params). Same family, different parameter count and weights. Both land near NLL 2.2 — a model-capacity floor at this size, not a dataset floor (a larger model would dip below).
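A sketch of that leakage-free split, with the disjointness assertion; the helper name, seed, and split ratio are illustrative, only the dedupe-before-split order matters:

```python
# Dedupe to unique names first, then shuffle and split 90/10, then assert disjointness.
import random

def split_names(names: list[str], val_frac: float = 0.10, seed: int = 0):
    unique = sorted(set(names))                 # 32,033 rows -> 29,494 unique names
    random.Random(seed).shuffle(unique)
    n_val = int(len(unique) * val_frac)
    val, train = unique[:n_val], unique[n_val:]
    assert set(train).isdisjoint(val), "train/val leakage"
    return train, val
```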

4   Quantization study — how much precision does the model actually need?

Trained TALOS-V2 weights, requantized to various Q-formats with per-tensor symmetric scaling (int8 / int4 / int2). NLL measured on the first 500 names of the training corpus. The model is at most 8-bit-effective — per-tensor int8 shows no measurable degradation on this slice (NLL drift below the sampling noise floor); Q4.12 (16-bit) carries headroom unused by this model's weight distribution. Note: this is a same-corpus eval, not a held-out set, so it bounds quantization "harm" but doesn't certify it.

| Format | max \|err\| | NLL | Sample quality |
|---|---|---|---|
| fp32 (baseline) | 0.00000 | 2.2633 | kana, keelan, alilan, ariel, cairi |
| Q4.12 strict (TALOS) | 0.00012 | 2.2633 | bit-identical |
| Q3.13 strict | 0.00006 | 2.2633 | bit-identical |
| per-tensor int8 | 0.00525 | 2.2631 | identical samples |
| per-tensor int4 | 0.09499 | 2.3009 | jaliny, mariel, calin (different but plausible) |
| per-tensor int2 | 0.65911 | 3.3499 | broken: rahftryckagnfern, fpvavsjdzjccwpvt |
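The per-tensor symmetric scheme behind the int8/int4/int2 rows, sketched in NumPy (an illustrative helper, not the study's actual code): one scale per tensor, symmetric around zero, no zero-point; the reported max |err| is the round-trip error after dequantization.

```python
# Per-tensor symmetric quantize/dequantize; bits = 8, 4, or 2 reproduces the int rows.
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int):
    qmax = 2 ** (bits - 1) - 1                  # 127 / 7 / 1 for int8 / int4 / int2
    scale = np.abs(w).max() / qmax if w.size else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)
    w_hat = q * scale
    return w_hat, np.abs(w_hat - w).max()       # dequantized tensor, max |err|

# Applied per tensor: every weight matrix gets its own scale, then NLL is
# re-scored on the 500-name slice with the dequantized weights.
```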

5   Where does NumPy actually spend its time?

cProfile on bench_numpy.py, sorted by self-time. Only ~4% of wall-clock time is in the actual matrix kernel (c_einsum); the other ~96% is reduction setup, type checks, and dispatcher overhead.
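To reproduce the breakdown (assuming bench_numpy.py runs as a script), the equivalent of `python -m cProfile -s tottime bench_numpy.py`:

```python
# Profile the benchmark script and print the top entries sorted by self-time (tottime).
import cProfile
import pstats

cProfile.run("import runpy; runpy.run_path('bench_numpy.py', run_name='__main__')",
             "numpy_bench.prof")
pstats.Stats("numpy_bench.prof").sort_stats("tottime").print_stats(15)
```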

6   The story in one chart

The same forward pass in seven implementations, ranked by tok/sec relative to the FPGA.

Lessons

  1. Hardware speed is not the bottleneck for a tiny model. Framework dispatch is. 4,192 multiply-accumulates per token finish in <1 microsecond on real silicon. NumPy spends 30 microseconds dispatching the matmul. MLX-GPU spends a millisecond launching the kernel. The math is irrelevant when the prologue costs more than the work.
  2. The FPGA's win isn't peak throughput; it's deterministic latency and a small power envelope. 53,000 tok/sec at 2W (Cyclone V) works out to roughly 38 µJ per token, versus roughly 1.3 µJ per token for 3.8M tok/sec at ~5W per active core (M4). But the FPGA's per-token latency is exactly cycle-counted; the M4's varies with cache state, scheduler, and thermals.
  3. Quantization is for the substrate, not for accuracy. Per-tensor int8 produces no measurable degradation on the 500-name slice scored. Q4.12 carries unused headroom because Q4.12 maps cleanly onto a single FPGA DSP block; per-tensor int8 with runtime scale would need extra silicon. The choice of bit-width is constraint-driven, not error-driven; a Q4.12 sketch follows this list.
  4. WASM throughput depends on the V8 build, not just the binary. The same microgpt_inf.wasm on the same M4 Pro hardware runs at ~1.34M tok/sec in regular Chrome 145 and ~2.04M tok/sec in Electron 41's bundled Chromium 146 — a 50% delta from runtime alone. WebAssembly performance is a band per (binary, host) pair, not a single number per machine. The published headline (25.30× FPGA) uses the regular-Chrome figure because that is what most readers will see.
  5. The educational progression matters more than the architecture. Step 6 of the ladder reaches NLL 2.21 train / 2.20 val (deduped split). The published TALOS-trained reference reaches 2.26. The published microgpt-java reference reaches 2.37. They are all in the same band: a model-capacity floor at this scale (~4K params), not a dataset floor; larger models do go below 2.0 on this corpus. Knowing each component separately is the durable knowledge.
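For concreteness, the Q4.12 arithmetic referenced in lesson 3, as an illustrative Python sketch (hardware details such as accumulator width and saturation behavior in the real RTL and NEON path are not modeled here):

```python
# Q4.12: 4 integer bits, 12 fractional bits, stored in int16. A Q4.12 x Q4.12
# product lands at 24 fractional bits; shifting right by 12 returns to Q4.12,
# which is the multiply shape a single DSP block (or NEON int16 lane) handles.
FRAC_BITS = 12

def to_q412(x: float) -> int:
    q = int(round(x * (1 << FRAC_BITS)))
    return max(-32768, min(32767, q))          # saturate into int16 range

def q412_mul(a: int, b: int) -> int:
    return (a * b) >> FRAC_BITS                # Q4.12 * Q4.12 -> Q4.12

# e.g. to_q412(0.5) == 2048; q412_mul(to_q412(0.5), to_q412(0.25)) == to_q412(0.125)
```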