Reproducing paper results

End-to-end workflow for reproducing the BuzzASR results in Tables 5 and 6.

Placeholder — fill in once code-release path is finalized

This page should walk through cloning the repo, downloading data, training, and evaluating.

High-level steps

Set up the environment: see Installation
Download FLEURS: pulled lazily via datasets.load_dataset("google/fleurs", lang_code)
Download CommonVoice v25: from https://commonvoice.mozilla.org/datasets, capped at 90h per language
Download goldfish corpora: for the 102 languages we use
Train per-language tokenizers: python scaling/train_bpes_bulk.py (~10 min on CPU)
Run all FFT jobs (102 langs × 2 seeds = 204 runs): python dispatcher.py --methods fft --gpus 0-7 (~3-5 days on 8 GPUs)
Run all SFT jobs (102 langs × 2 configs = 204 runs): python dispatcher.py --methods vft --gpus 0-7 (~1.5-2 days, about half the FFT time)
Run test evaluation on all checkpoints: python run_test_eval.py --gpus 0-7 (~12 hours)
Aggregate to tidy_results.csv: python results/build_tidy_results.py
Generate plots: see notebook (TBD link)
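
The steps above say only that CommonVoice is capped at 90h per language; they do not say how clips are selected. The following is a minimal sketch of one possible cap, assuming clips are taken in their given order until the hour budget is used up. The function name `cap_by_duration` and the `(clip_id, duration_seconds)` input format are hypothetical and not part of the repository.

```python
from typing import Iterable, List, Tuple

MAX_HOURS_PER_LANG = 90.0  # per-language cap stated in the steps above


def cap_by_duration(
    clips: Iterable[Tuple[str, float]],
    max_hours: float = MAX_HOURS_PER_LANG,
) -> List[Tuple[str, float]]:
    """Keep clips in order until the next one would exceed the budget.

    `clips` yields (clip_id, duration_seconds) pairs. Taking clips in
    their given order is an assumption; the actual selection strategy
    (random, by speaker, by split) is not documented here.
    """
    budget_s = max_hours * 3600.0
    kept: List[Tuple[str, float]] = []
    total_s = 0.0
    for clip_id, dur_s in clips:
        if total_s + dur_s > budget_s:
            break  # stop at the first clip that would overshoot the cap
        kept.append((clip_id, dur_s))
        total_s += dur_s
    return kept
```

If a random subset is preferred, shuffling the clip list with a fixed seed before calling the function keeps the selection reproducible.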
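
The scripted steps above (tokenizers through aggregation) can be chained in a small driver. This is a sketch only: the command strings are copied from the list, but `PIPELINE`, `run_pipeline`, and the `dry_run` flag are hypothetical names, and it assumes the commands are run from the repository root after the data downloads have finished.

```python
import shlex
import subprocess
from typing import List, Sequence

# Commands copied verbatim from the steps above, in execution order.
PIPELINE: Sequence[str] = (
    "python scaling/train_bpes_bulk.py",
    "python dispatcher.py --methods fft --gpus 0-7",
    "python dispatcher.py --methods vft --gpus 0-7",
    "python run_test_eval.py --gpus 0-7",
    "python results/build_tidy_results.py",
)


def run_pipeline(
    commands: Sequence[str] = PIPELINE, dry_run: bool = True
) -> List[List[str]]:
    """Run each command in order, stopping at the first failure.

    With dry_run=True (the default) nothing is executed; the parsed
    argument lists are returned so the plan can be inspected first.
    """
    plan = [shlex.split(cmd) for cmd in commands]
    if not dry_run:
        for argv in plan:
            # check=True raises CalledProcessError on a non-zero exit,
            # so a failed training step never feeds into evaluation.
            subprocess.run(argv, check=True)
    return plan
```

Calling `run_pipeline()` returns the plan without launching anything; `run_pipeline(dry_run=False)` executes the steps sequentially, which is a multi-day job on 8 GPUs.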

Step	Wall-clock
Tokenizer training (102 langs)	~10 min
FFT main runs (102 langs × 2 seeds = 204)	~3-5 days
SFT main runs (102 langs × 2 configs = 204)	~1.5-2 days
Test eval (all ckpts)	~12 hours
External baseline runs (Cohere, Qwen3, MMS, Omni)	varies by API rate-limits
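
For budgeting, the table can be turned into rough per-run and end-to-end estimates. The sketch below assumes all 8 GPUs stay busy for the whole window, that the steps run back to back, and that external baseline runs are excluded; the names are illustrative only.

```python
from typing import Dict, Tuple

GPUS = 8
FFT_RUNS = 102 * 2  # 102 langs x 2 seeds, from the table

# (low, high) wall-clock estimates in days, taken from the table.
STEPS_DAYS: Dict[str, Tuple[float, float]] = {
    "tokenizers": (10 / (60 * 24), 10 / (60 * 24)),  # ~10 min
    "fft": (3.0, 5.0),
    "sft": (1.5, 2.0),
    "test_eval": (0.5, 0.5),  # ~12 hours
}


def gpu_hours_per_run(days: float, n_runs: int, n_gpus: int = GPUS) -> float:
    """Average GPU-hours per run if all GPUs are fully occupied."""
    return days * 24 * n_gpus / n_runs


def total_days(steps: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Sum the low and high estimates, assuming strictly sequential steps."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high
```

Under these assumptions an FFT run averages roughly 2.8 to 4.7 GPU-hours, and the full pipeline takes about 5 to 7.5 days end to end.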