Tapasya — train a language model in your browser
Tapasya (Sanskrit: तपस्या, "disciplined practice") runs three training experiments on your text, entirely in this tab. Nothing is sent to a server. No account needed.
The algorithm
Training uses EGGROLL (arXiv:2511.16652), an evolution-strategies method that needs forward passes only, with no backprop. A population of slightly perturbed model copies is evaluated; the copies that produce lower loss pull the weights in their direction. It is implemented in plain JS and WGSL shaders, with no ML framework.
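To make the "perturb, evaluate, pull" loop concrete, here is a minimal sketch of one evolution-strategies step on a toy problem. It is plain Gaussian evolution strategies, not the EGGROLL paper's specific perturbation scheme and not Tapasya's actual code. The names esStep, mulberry32, gaussian, toyLoss, and runToy are all hypothetical, and the pop size and LR values are picked for the toy, not taken from the app.

```javascript
// Small seeded PRNG (mulberry32) so a run is repeatable for a given seed.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal sample via the Box-Muller transform.
function gaussian(rand) {
  const u = 1 - rand(); // in (0, 1], keeps Math.log finite
  const v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// One ES step: perturb, evaluate with forward passes only, then move the
// weights toward the perturbations that scored a lower loss.
function esStep(weights, lossFn, { popSize, sigma, lr }, rand) {
  const noises = [];
  const losses = [];
  for (let i = 0; i < popSize; i++) {
    const eps = weights.map(() => gaussian(rand));
    const candidate = weights.map((w, j) => w + sigma * eps[j]);
    noises.push(eps);
    losses.push(lossFn(candidate));
  }
  // Normalize losses into fitness scores: lower loss means higher fitness.
  const mean = losses.reduce((a, b) => a + b, 0) / popSize;
  const variance = losses.reduce((a, b) => a + (b - mean) ** 2, 0) / popSize;
  const std = Math.sqrt(variance) || 1;
  const fitness = losses.map((l) => -(l - mean) / std);
  // Fitness-weighted average of the noise directions, scaled by the LR.
  return weights.map((w, j) => {
    let g = 0;
    for (let i = 0; i < popSize; i++) g += fitness[i] * noises[i][j];
    return w + (lr / popSize) * g;
  });
}

// Toy problem (hypothetical): recover a 3-number target by minimizing squared error.
const target = [1, -2, 0.5];
const toyLoss = (w) => w.reduce((s, x, i) => s + (x - target[i]) ** 2, 0);

function runToy(seed, steps) {
  const rand = mulberry32(seed);
  let w = [0, 0, 0];
  for (let s = 0; s < steps; s++) {
    w = esStep(w, toyLoss, { popSize: 32, sigma: 0.05, lr: 0.05 }, rand);
  }
  return w;
}
```

Calling runToy twice with the same seed yields identical weights, which is the sense in which a seeded run is deterministic. A larger popSize averages over more directions per step at the cost of more forward passes.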
The three stages
Stage 1: Raw bytes, 500 steps. Does the algorithm work at all?
Stage 2: Same bytes, 2000 steps. What does more compute add?
Stage 3: BPE sub-word tokens, 2000 steps. What does a better vocabulary add?
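The difference between raw bytes and BPE tokens can be sketched in a few lines. This is a generic illustration of byte tokenization and BPE merge training, assuming the usual scheme of repeatedly merging the most frequent adjacent pair. The function names are hypothetical and the app's tokenizer may differ in details such as tie-breaking and vocabulary size.

```javascript
// Byte-level tokens: every UTF-8 byte becomes a token id in 0..255.
function byteTokens(text) {
  return Array.from(new TextEncoder().encode(text));
}

// Find the adjacent pair of token ids that occurs most often.
function mostFrequentPair(tokens) {
  const counts = new Map();
  let best = null;
  let bestCount = 0;
  for (let i = 0; i + 1 < tokens.length; i++) {
    const key = tokens[i] + "," + tokens[i + 1];
    const c = (counts.get(key) || 0) + 1;
    counts.set(key, c);
    if (c > bestCount) {
      bestCount = c;
      best = [tokens[i], tokens[i + 1]];
    }
  }
  return best;
}

// Replace every non-overlapping occurrence of `pair` with the single id `newId`.
function mergePair(tokens, pair, newId) {
  const out = [];
  for (let i = 0; i < tokens.length; i++) {
    if (i + 1 < tokens.length && tokens[i] === pair[0] && tokens[i + 1] === pair[1]) {
      out.push(newId);
      i++; // skip the second half of the merged pair
    } else {
      out.push(tokens[i]);
    }
  }
  return out;
}

// Learn `numMerges` merges; new ids start right after the 256 byte values.
function trainBpe(tokens, numMerges) {
  const merges = [];
  let nextId = 256;
  for (let m = 0; m < numMerges; m++) {
    const pair = mostFrequentPair(tokens);
    if (!pair) break;
    tokens = mergePair(tokens, pair, nextId);
    merges.push({ pair, id: nextId });
    nextId++;
  }
  return { tokens, merges };
}
```

Each merge shortens the sequence, so the same context window covers more of your text. That is the lever a sub-word vocabulary adds over raw bytes.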
Your corpus
Paste any plain text — a novel chapter, Wikipedia article, your own writing. Around 10 KB is ideal. The model learns your text's patterns, nothing else.
Hardware
WebGPU acceleration is used automatically in Chrome, Edge, and Safari on a machine with a GPU. Without WebGPU, training falls back to the CPU and simply takes longer.
Guided tour
Guided mode walks the stages in order with teaching copy explaining what each stage adds and why. Settings are locked to defaults so you can focus on watching the model learn.
Flow
- Paste your corpus, then click Start on Stage 1.
- Stage 2 unlocks automatically at 300 steps of Stage 1.
- Stage 3 unlocks at 1000 steps of Stage 2.
- Compare mode unlocks when any two stages have finished.
What you'll see
- Loss chart — the model's prediction error, going down over time. Lower is better.
- Samples — text the model generates every 10 steps. Watch it go from noise toward structure.
- Teaching copy — the blue strip explains what this stage's lever adds.
Resuming
Training is checkpointed to your browser's local storage every 50 steps. Close the tab, reopen it, and paste the same corpus: you'll be offered the option to resume where you left off.
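One way such resume-by-corpus behavior could work is to key each checkpoint on a hash of the pasted text. The sketch below assumes a storage object with the Web Storage getItem/setItem interface (localStorage in a browser). The key format, the FNV-1a-style hash, and every function name here are hypothetical, not Tapasya's actual scheme.

```javascript
// FNV-1a-style 32-bit hash over UTF-16 code units, returned as 8 hex digits.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Hypothetical key layout: one checkpoint per stage and per corpus.
function checkpointKey(stage, corpus) {
  return "tapasya:stage" + stage + ":" + fnv1a(corpus);
}

// Save only on every `every`-th step; returns true when a save happened.
function maybeSave(storage, stage, corpus, step, weights, every = 50) {
  if (step % every !== 0) return false;
  const payload = JSON.stringify({ step, weights: Array.from(weights) });
  storage.setItem(checkpointKey(stage, corpus), payload);
  return true;
}

// Returns { step, weights } if this exact corpus has a checkpoint, else null.
function tryResume(storage, stage, corpus) {
  const raw = storage.getItem(checkpointKey(stage, corpus));
  return raw === null ? null : JSON.parse(raw);
}

// In-memory stand-in with the same getItem/setItem shape, for use outside a browser.
function memoryStorage() {
  const m = new Map();
  return {
    getItem: (k) => (m.has(k) ? m.get(k) : null),
    setItem: (k, v) => {
      m.set(k, String(v));
    },
  };
}
```

With a scheme like this, pasting a different corpus produces a different key, so no resume offer appears for it.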
Free mode
All three stages are open from the start. Adjust any setting before clicking Start.
Settings
- Steps — how many training steps to run.
- Pop size — perturbed copies per step. Higher = better gradient estimate, slower per step.
- σ (sigma) — perturbation scale. Too high → unstable; too low → no signal. Default 0.05 works well.
- LR — how aggressively weights are adjusted from the fitness signal.
- Batch size — context windows evaluated per step. Higher = less noisy loss, slower.
- Seed — RNG seed. Same seed + same corpus → deterministic run.
Compare mode
After any two stages have completed, the Compare button in the top bar activates. It generates text from all three models side-by-side with the same seed prompt — the clearest way to see what each addition buys.
Resuming and resetting
Checkpoints are saved every 50 steps. Use Reset on a stage to clear its checkpoint and start over from scratch.
b2 — LoRA fine-tuning
b2 fine-tunes SmolLM-135M (135M parameters) on your documents using rank-8 LoRA adapters and the EGGROLL evolution-strategies algorithm — all in your browser, no backprop required.
Workflow
- Load model — downloads SmolLM-135M once and caches it in browser Cache Storage (~540 MB). Subsequent loads are instant.
- Add documents — drop .txt or .md files. They're tokenized, chunked to the sequence length, shuffled, and split 90/10 train/val.
- Start training — EGGROLL perturbs and evaluates LoRA adapters every step. The loss chart updates after each batch.
- Export — save adapters as a .safetensors file in PEFT format, loadable by Hugging Face transformers.
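The "chunked, shuffled, and split 90/10" part of the Add documents step can be sketched as follows. This is an illustration only: dropping the trailing partial chunk, the seeded LCG, and all function names are assumptions, not b2's actual code.

```javascript
// Tiny seeded LCG (Numerical Recipes constants) so the shuffle is repeatable.
function lcg(seed) {
  let state = seed >>> 0;
  return function () {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Cut a token stream into fixed-length windows.
// Assumption: a trailing partial window is dropped rather than padded.
function chunkTokens(tokens, seqLen) {
  const chunks = [];
  for (let i = 0; i + seqLen <= tokens.length; i += seqLen) {
    chunks.push(tokens.slice(i, i + seqLen));
  }
  return chunks;
}

// Fisher-Yates shuffle driven by the supplied random function.
function shuffleInPlace(items, rand) {
  for (let i = items.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = items[i];
    items[i] = items[j];
    items[j] = tmp;
  }
  return items;
}

// Hold out the last `valFraction` of the shuffled chunks for validation.
function splitTrainVal(chunks, valFraction = 0.1) {
  const nVal = Math.max(1, Math.round(chunks.length * valFraction));
  return {
    train: chunks.slice(0, chunks.length - nVal),
    val: chunks.slice(chunks.length - nVal),
  };
}

// Convenience wrapper for the whole pipeline.
function prepareDataset(tokens, seqLen, seed) {
  const chunks = shuffleInPlace(chunkTokens(tokens, seqLen), lcg(seed));
  return splitTrainVal(chunks);
}
```

Shuffling before the split keeps the validation chunks from all coming from the end of the last document.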
Settings
- Steps — total training steps to run.
- Batch — steps per UI update cycle. Smaller = more responsive Pause; larger = slightly faster overall.
- Pop size — perturbation copies per step. Higher → better gradient signal, slower per step.
- σ / LR — perturbation scale and learning rate. Defaults (0.03 / 0.001) are conservative for LoRA.
- LoRA rank / alpha — rank 8, alpha 16 are standard starting points.
- LoRA targets — which projection matrices get adapters. Default: q_proj, v_proj, down_proj.
- Seq len — token window length for each training chunk (default 512).
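For readers new to LoRA, the rank and alpha settings enter through the standard LoRA formulation: the frozen weight W gets a low-rank correction scaled by alpha / rank, so the output is W x + (alpha / rank) * B (A x). The sketch below shows that formula on tiny matrices. The function names are hypothetical, and it uses plain arrays rather than whatever buffers b2 uses internally.

```javascript
// y = M x for a matrix stored as an array of rows.
function matVec(M, x) {
  return M.map((row) => row.reduce((s, m, j) => s + m * x[j], 0));
}

// Standard LoRA forward pass for one projection.
// W: outDim x inDim (frozen), A: rank x inDim, B: outDim x rank (trainable).
function loraForward(W, A, B, x, alpha) {
  const rank = A.length;
  const scale = alpha / rank; // rank 8 with alpha 16 gives a scale of 2
  const base = matVec(W, x);
  const delta = matVec(B, matVec(A, x));
  return base.map((v, i) => v + scale * delta[i]);
}

// Trainable parameters added to one outDim x inDim projection.
function loraParamCount(outDim, inDim, rank) {
  return rank * (inDim + outDim);
}
```

When B is all zeros the correction vanishes and the adapted layer behaves exactly like the base model. Only A and B are trained, which is why the adapter for a projection costs rank * (inDim + outDim) parameters instead of outDim * inDim.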
Notes
- Model download is ~540 MB and only happens once; subsequent opens use the browser cache.
- Training runs on WebGPU when it is available. Otherwise it falls back to the CPU; expect roughly 10× slower per step.
- Changing rank, alpha, or targets after loading resets adapters to zero on the next Start.