Running a single FFT job

How to run one full fine-tune (FFT) end to end, without the dispatcher / matrix infrastructure.

Direct invocation

python train_fft.py \
  --lang spanish \
  --fleurs-code es_419 \
  --whisper-lang spanish \
  --config-id B \
  --seed 1337 \
  --asr-ratio 0.5 \
  --max-text-lines 500000 \
  --text-mtl-path /path/to/text_pretraining/goldfish/spa_latn.txt \
  --output-dir runs/spanish-cfgB-s1337 \
  --init-strategy C

What each flag means:

Flag               Meaning
--lang             Lang name (must match a train_strategy_c_<lang>.py template)
--fleurs-code      FLEURS dataset code (e.g. es_419)
--whisper-lang     Whisper's language token name (e.g. spanish)
--config-id        Which LR config to use (A=aggressive, B=conservative, C=embed-heavy, D=middle)
--seed             Random seed
--asr-ratio        Fraction of batches that are ASR (vs. text-MTL)
--max-text-lines   Cap on text-MTL data
--text-mtl-path    Path to per-lang goldfish corpus
--output-dir       Where to save best/ and latest/ checkpoints
--init-strategy    C for Strategy C (warm-start embedding init)
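
When scripting a handful of one-off runs, the same invocation can be assembled programmatically. The following is a minimal sketch, not part of the repository: build_fft_command and default_output_dir are hypothetical helpers, and the runs/<lang>-cfg<config>-s<seed> naming is inferred from the example path rather than a documented convention. Only flags from the list above are used.

```python
import shlex


def default_output_dir(lang: str, config_id: str, seed: int) -> str:
    # Naming inferred from the example path runs/spanish-cfgB-s1337 (assumption).
    return f"runs/{lang}-cfg{config_id}-s{seed}"


def build_fft_command(lang, fleurs_code, whisper_lang, config_id, seed,
                      asr_ratio, max_text_lines, text_mtl_path,
                      init_strategy="C", output_dir=None):
    """Return the argv list for one train_fft.py run (hypothetical helper)."""
    if not 0.0 <= asr_ratio <= 1.0:
        raise ValueError("asr_ratio is a fraction of batches; expected a value in [0, 1]")
    if output_dir is None:
        output_dir = default_output_dir(lang, config_id, seed)
    return [
        "python", "train_fft.py",
        "--lang", lang,
        "--fleurs-code", fleurs_code,
        "--whisper-lang", whisper_lang,
        "--config-id", config_id,
        "--seed", str(seed),
        "--asr-ratio", str(asr_ratio),
        "--max-text-lines", str(max_text_lines),
        "--text-mtl-path", text_mtl_path,
        "--output-dir", output_dir,
        "--init-strategy", init_strategy,
    ]


cmd = build_fft_command(
    lang="spanish",
    fleurs_code="es_419",
    whisper_lang="spanish",
    config_id="B",
    seed=1337,
    asr_ratio=0.5,
    max_text_lines=500000,
    text_mtl_path="/path/to/text_pretraining/goldfish/spa_latn.txt",
)
# Print a copy-pasteable shell command; launch with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Returning an argv list instead of a single string avoids shell quoting problems if a path contains spaces.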

What happens under the hood

train_fft.py reads the per-lang template at _pod_share/jobs/train_strategy_c_<lang>.py, patches the LR multipliers, seed, output dir, and other settings into it, writes the patched script to run_state/fft_scripts/, and then executes it. See Pipeline → Per-job script generation for the patching details.
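
As a rough illustration of the read, patch, write, execute flow described above, here is a minimal sketch. It is not the code in train_fft.py. Several things are assumptions made for the example: that the template exposes settings as top-level assignments such as SEED = 0, the names SEED and OUTPUT_DIR, the patched_ filename prefix, and the helpers patch_template and generate_and_run. The demo uses a toy template in a temporary directory so it can run anywhere.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path


def patch_template(source: str, overrides: dict) -> str:
    """Rewrite top-level `NAME = ...` lines with new values (assumed template layout)."""
    for name, value in overrides.items():
        pattern = re.compile(rf"^{re.escape(name)}\s*=.*$", flags=re.MULTILINE)
        # A function replacement avoids backslash-escape surprises in paths.
        source, count = pattern.subn(lambda m: f"{name} = {value!r}", source)
        if count != 1:
            raise ValueError(f"expected exactly one assignment to {name}, found {count}")
    return source


def generate_and_run(template_path: Path, script_dir: Path,
                     overrides: dict, run: bool = False) -> Path:
    """Patch the template, write the result to script_dir, optionally execute it."""
    patched = patch_template(template_path.read_text(), overrides)
    script_dir.mkdir(parents=True, exist_ok=True)
    out_path = script_dir / f"patched_{template_path.name}"  # hypothetical naming
    out_path.write_text(patched)
    if run:
        subprocess.run([sys.executable, str(out_path)], check=True)
    return out_path


# Toy demonstration with a stand-in template (not the real training script).
with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "train_strategy_c_spanish.py"
    template.write_text('SEED = 0\nOUTPUT_DIR = "unset"\nprint(SEED, OUTPUT_DIR)\n')
    script = generate_and_run(
        template,
        Path(tmp) / "fft_scripts",
        {"SEED": 1337, "OUTPUT_DIR": "runs/spanish-cfgB-s1337"},
    )
    patched_text = script.read_text()

print(patched_text)
```

Failing when a setting is not found exactly once is a deliberate choice in this sketch: a silent no-op patch would launch a multi-hour run with the template's default values.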

Expected runtime

  • A100 80 GB: ~3–4 hours for cfg B (grad_accum = 24), ~2.5 hours for cfg A (grad_accum = 16)
  • H100 80 GB: ~1.5× faster than A100
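
To turn the A100 figures into a rough H100 estimate, divide by the speedup. The sketch below reads "~1.5× faster" as wall-clock time divided by 1.5, which is an interpretation and not a measured result; h100_hours is a hypothetical helper.

```python
def h100_hours(a100_hours: float, speedup: float = 1.5) -> float:
    """Scale an A100 wall-clock estimate by an assumed H100 speedup factor."""
    return a100_hours / speedup


# (low, high) A100 estimates in hours, taken from the list above.
a100_estimates = {"cfg A": (2.5, 2.5), "cfg B": (3.0, 4.0)}

for cfg, (low, high) in a100_estimates.items():
    print(f"{cfg}: about {h100_hours(low):.1f} to {h100_hours(high):.1f} h on H100")
```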

Outputs

After training:

runs/spanish-cfgB-s1337/
├── best/
│   ├── checkpoint.pt         # ~6 GB FP32 state dict
│   └── training_config.json  # the resolved cfg + best metrics
└── latest/
    └── checkpoint.pt         # periodic backup (every 1000 steps)
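
Before moving on to evaluation, it can be useful to confirm that the run directory has the layout shown above. The following is a minimal sketch using only the standard library; inspect_run is a hypothetical helper, and it makes no assumption about which keys training_config.json contains.

```python
import json
from pathlib import Path


def inspect_run(output_dir) -> dict:
    """Check the best/ and latest/ layout and load the resolved config."""
    run = Path(output_dir)
    best_ckpt = run / "best" / "checkpoint.pt"
    best_cfg = run / "best" / "training_config.json"
    latest_ckpt = run / "latest" / "checkpoint.pt"

    missing = [str(p) for p in (best_ckpt, best_cfg) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing expected outputs: {missing}")

    return {
        "config": json.loads(best_cfg.read_text()),
        "best_checkpoint_gb": best_ckpt.stat().st_size / 1e9,
        # latest/ is a periodic backup, so its absence is reported, not fatal.
        "has_latest": latest_ckpt.is_file(),
    }


# Example usage (requires a finished run):
# info = inspect_run("runs/spanish-cfgB-s1337")
# print(info["best_checkpoint_gb"], info["has_latest"])
```

Loading the weights themselves would typically go through torch.load(path, map_location="cpu"), assuming the file is a standard PyTorch state dict as the comment in the tree suggests.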

Test eval

After training, evaluate on FLEURS-test + CV25-test:

python eval_strategy_c_test_combined.py \
  --lang spanish \
  --fleurs es_419 \
  --whisper-lang spanish \
  --cv-code es \
  --ckpt-path runs/spanish-cfgB-s1337/best/checkpoint.pt \
  --results-path results/test_spanish_cfgB_s1337.json

The output file results/test_spanish_cfgB_s1337.json contains fleurs_test, cv25_test, and combined blocks with raw and normalized WER and CER.
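
To skim the results without opening the JSON by hand, a short script can walk the three top-level blocks. This is a minimal sketch: summarize_results is a hypothetical helper, and because the metric key names inside each block are not documented here, it prints whatever keys it finds without assuming any particular names.

```python
import json
from pathlib import Path

# Top-level block names as described for the results file.
BLOCKS = ("fleurs_test", "cv25_test", "combined")


def summarize_results(results_path) -> list:
    """Return one 'block.metric = value' line per metric found in each block."""
    results = json.loads(Path(results_path).read_text())
    lines = []
    for block in BLOCKS:
        metrics = results.get(block)
        if metrics is None:
            lines.append(f"{block}: missing")
        elif isinstance(metrics, dict):
            for key in sorted(metrics):
                lines.append(f"{block}.{key} = {metrics[key]}")
        else:
            lines.append(f"{block} = {metrics}")
    return lines


# Example usage (requires a finished evaluation):
# for line in summarize_results("results/test_spanish_cfgB_s1337.json"):
#     print(line)
```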