5-language ptfix ablation
A focused experiment to validate the Split bug fix on a small set of languages before committing to a project-wide retrain.
Setup
- Languages: Arabic + 4 CJK (Mandarin, Cantonese, Japanese, Korean)
- Seed: 2024 (new seed; the existing main FFT runs used 42 and 1337)
- Config: paper-matched cfg per Whisper-ZS WER rule
- Tokenizer: post-Regex()-fix, from frankenstein_pretokenfix/
- Training data: FLEURS-train + CV-train combined IID (same as main FFT)
- Recipe: each lang's deployed save criterion + LR config (no per-lang recipe changes vs paper)
What this isolated
The point of this ablation was to test: does the tokenizer fix alone help, holding everything else fixed?
Specifically, the only differences from the paper's FFT main runs were:
- Tokenizer source: frankenstein_pretokenfix/ (Regex()-applied) instead of frankenstein/ (broken)
- New seed (2024)
Recipe, training data, eval methodology, and per-lang config all match the paper.
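One way to make the "only two things changed" claim checkable is to diff the two run configurations field by field. The sketch below is illustrative only: the `RunConfig` fields and the string values for training data, recipe, and eval set are hypothetical stand-ins, not the project's actual config schema. Only the tokenizer directories and the seeds are taken from the setup above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical field names; the real project config may look different.
    tokenizer_dir: str
    seed: int
    train_data: str
    recipe: str
    eval_set: str


def config_diff(a: RunConfig, b: RunConfig) -> dict:
    """Return {field: (a_value, b_value)} for every field that differs."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}


# One of the paper's main FFT runs (seed 42; the other run used 1337).
paper = RunConfig(
    tokenizer_dir="frankenstein/",
    seed=42,
    train_data="FLEURS-train + CV-train",
    recipe="deployed per-lang recipe",
    eval_set="FLEURS+CV combined test",
)

# The ptfix ablation run: same placeholders, new tokenizer and seed.
ablation = RunConfig(
    tokenizer_dir="frankenstein_pretokenfix/",
    seed=2024,
    train_data="FLEURS-train + CV-train",
    recipe="deployed per-lang recipe",
    eval_set="FLEURS+CV combined test",
)

changed = config_diff(paper, ablation)
print(sorted(changed))  # → ['seed', 'tokenizer_dir']
```

A check like this could run before launching a job, so that an accidental recipe or data change does not silently confound the comparison.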
Results (FLEURS+CV combined test set, normalized)
| Lang | Paper FFT WER | ptfix WER | Δ WER | Paper FFT CER | ptfix CER | Δ CER |
|---|---|---|---|---|---|---|
| japanese | 106.94 | 91.87 | −15.07 ✓ | 50.49 | 40.22 | −10.27 ✓ |
| arabic | 30.38 | 42.57 | +12.19 | 8.96 | 15.39 | +6.43 |
| cantonese | 43.23 | 54.75 | +11.52 | 30.44 | 32.86 | +2.42 |
| korean | 40.54 | 45.69 | +5.15 | 17.11 | 20.42 | +3.31 |
| mandarin | 43.18 | 61.98 | +18.80 | 23.27 | 41.30 | +18.03 |
Japanese rescued: the "FFT generation failure" (WER > 100%) documented in the footnote to Appendix Table 3 of the paper drops to WER 91.87%, CER 40.22%. The root cause was the tokenizer bug.
The other 4 languages regressed: this is a real cost of swapping only the tokenizer while changing nothing else. With the new vocabulary IDs, the model effectively starts from a different point in embedding space, and the existing recipe's hyperparameter choices don't translate one-to-one. Our analysis showed the same pattern: better tokenizer + unchanged recipe ≠ better final WER for these 4 languages.
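The Δ columns are simply ptfix minus paper FFT, so a negative value means the fix helped. As a minimal sketch, the snippet below recomputes them from the numbers in the table and splits the languages into improved and regressed by Δ WER. The `RESULTS` dictionary and the `deltas` helper are names made up for this illustration and are not part of the project code.

```python
# (paper_wer, ptfix_wer, paper_cer, ptfix_cer), copied from the results table.
RESULTS = {
    "japanese": (106.94, 91.87, 50.49, 40.22),
    "arabic": (30.38, 42.57, 8.96, 15.39),
    "cantonese": (43.23, 54.75, 30.44, 32.86),
    "korean": (40.54, 45.69, 17.11, 20.42),
    "mandarin": (43.18, 61.98, 23.27, 41.30),
}


def deltas(results):
    """Return {lang: (delta_wer, delta_cer)}, each computed as ptfix - paper."""
    out = {}
    for lang, (paper_wer, ptfix_wer, paper_cer, ptfix_cer) in results.items():
        # Round to 2 decimals to match the table and hide float noise.
        out[lang] = (
            round(ptfix_wer - paper_wer, 2),
            round(ptfix_cer - paper_cer, 2),
        )
    return out


d = deltas(RESULTS)
improved = sorted(lang for lang, (dw, _) in d.items() if dw < 0)
regressed = sorted(lang for lang, (dw, _) in d.items() if dw > 0)

for lang, (dw, dc) in d.items():
    print(f"{lang:10s} dWER={dw:+.2f} dCER={dc:+.2f}")
print("improved:", improved)
print("regressed:", regressed)
```

Splitting on the sign of Δ WER reproduces the one-versus-four pattern described above; using Δ CER instead gives the same grouping for this table.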
What this tells us
- The tokenizer bug is real, and the Japanese limitation documented in the paper is fixable by just swapping in the corrected tokenizer.
- For the project-wide retrain to improve every language rather than only some, the recipe needs alignment too. This is the same idea as the team's prior rep-trap recipe work for the 7 fix-equipped templates.
The full 102-lang retrain (in progress) tests both fixes together: corrected tokenizer + per-lang recipe alignment to deployed code.
See also
- Tokenizers → Split bug and fix — the root cause
- 102-lang tokfix retrain — the full project-wide retrain