BuzzASR

File map

Quick reference: where each piece of the codebase lives.

Top-level

/mnt/ssd-3/asr/
├── _repro_sync/                       # main codebase
├── _pod_share/                        # shared across pods
└── (plus the data directories listed under Data)

`_repro_sync/`

_repro_sync/
├── train_fft.py                       # FFT orchestrator (CLI entry)
├── train_vft.py                       # VFT/SFT orchestrator (CLI entry)
├── dispatcher.py                      # multi-GPU job scheduler
├── eval_strategy_c_test_combined.py   # FFT test eval
├── eval_vft_test.py                   # VFT/SFT test eval
├── run_test_eval.py                   # batch test eval driver
├── matrix.json                        # experiment manifest
├── matrix_status.json                 # per-job state
├── matrix_tokfix_only.json            # filtered subset (tokenizer-fix retrain)
│
├── scaling/
│   ├── train_bpes_bulk.py             # train all 102 per-lang tokenizers
│   └── train_bpe_nopretoken.py        # ablation: BPE with no Split pre-tokenizer
│
├── results/
│   ├── tidy_results.csv               # final results table
│   ├── matrix_test_fft_*.json         # per-lang FFT test eval
│   └── ptfix/                         # 5-lang ablation results
│
├── logs/
│   ├── matrix/<job_id>.log            # per-job training log
│   └── dispatcher_<pod>.log           # per-dispatcher log
│
└── run_state/
    ├── claims.lock                    # fcntl lock file for dispatcher claims
    └── fft_scripts/                   # generated per-job training scripts
        └── train_fft_<lang>_cfg<X>_s<N>_main.py
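
The `claims.lock` entry above is described as an fcntl lock file for dispatcher claims. The sketch below shows one way such a lock could be held while a dispatcher claims a job. It is an illustration only: the helper name `claim_lock` is hypothetical, and it assumes a `flock`-style advisory lock, whereas the real `dispatcher.py` may use `fcntl.lockf` or a different protocol.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def claim_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block.

    Hypothetical helper: the actual dispatcher's locking code is not shown
    in this file map.
    """
    lock_path = Path(lock_path)
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    # Open (or create) the lock file; its contents are irrelevant.
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Blocks until no other open file description holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield fd
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

A dispatcher would then read and update `matrix_status.json` inside `with claim_lock("run_state/claims.lock"):`, so that two pods cannot claim the same job at once. Advisory locks only protect against processes that also take the lock, and their behaviour on network filesystems depends on the mount.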

`_pod_share/`

_pod_share/
├── adapt-env/                         # Python venv (torch, transformers, etc.)
└── jobs/                              # per-lang FFT training templates
    └── train_strategy_c_<lang>.py     # one per language

Checkpoints

/mnt/ssd-3/checkpoints/
├── frankenstein/                      # OLD per-lang BPE tokenizers (broken Split)
│   └── <lang>/tokenizer/
│       ├── tokenizer.json
│       ├── vocab.json
│       └── merges.txt
│
├── frankenstein_fix/                  # NEW per-lang BPE tokenizers (Regex() fixed)
│   └── <lang>/tokenizer/...
│
└── matrix_runs/                       # all fine-tuned model checkpoints
    └── <job_id>/
        ├── best/checkpoint.pt
        └── latest/checkpoint.pt
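
The layout above can be turned into paths with a couple of small helpers. This is a minimal sketch based only on the tree shown here; the function names `tokenizer_dir` and `checkpoint_path` are hypothetical and do not come from the codebase.

```python
from pathlib import Path

CKPT_ROOT = Path("/mnt/ssd-3/checkpoints")


def tokenizer_dir(lang, fixed=True, root=CKPT_ROOT):
    """Return the per-language tokenizer directory.

    fixed=True selects frankenstein_fix/ (the NEW tokenizers),
    fixed=False selects frankenstein/ (the OLD ones with the broken Split).
    """
    family = "frankenstein_fix" if fixed else "frankenstein"
    return Path(root) / family / lang / "tokenizer"


def checkpoint_path(job_id, which="best", root=CKPT_ROOT):
    """Return the path of the best or latest checkpoint for a job."""
    if which not in ("best", "latest"):
        raise ValueError(f"which must be 'best' or 'latest', got {which!r}")
    return Path(root) / "matrix_runs" / job_id / which / "checkpoint.pt"
```

For example, `checkpoint_path("<job_id>")` resolves to `matrix_runs/<job_id>/best/checkpoint.pt` under the checkpoint root. Defaulting to `fixed=True` reflects the tree's labelling of `frankenstein/` as the old, broken set.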

Data

/mnt/ssd-3/asr/
├── fleurs/                            # FLEURS audio + transcripts
│   └── data/<fleurs_code>/
│
├── cv/cv-corpus-25.0-2026-03-09/      # CommonVoice v25
│   └── <cv_code>/
│       ├── clips/
│       └── test.tsv
│
├── training_sets/<fleurs_code>/       # FLEURS+CV combined IID (pre-built)
│
└── text_pretraining/goldfish/         # Per-lang text corpora (for tokenizer + text-MTL)
    └── <iso3>_<script>.txt
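
Before launching a job for a new language, it can help to check that every data path in the tree above exists. The sketch below is one possible check, assuming the layout is exactly as listed; the function names and the dictionary keys are hypothetical.

```python
from pathlib import Path

ASR_ROOT = Path("/mnt/ssd-3/asr")
CV_RELEASE = "cv-corpus-25.0-2026-03-09"


def expected_data_paths(fleurs_code, cv_code, iso3, script, root=ASR_ROOT):
    """Map a short label to each data path one language is expected to have."""
    root = Path(root)
    cv_dir = root / "cv" / CV_RELEASE / cv_code
    return {
        "fleurs": root / "fleurs" / "data" / fleurs_code,
        "cv_clips": cv_dir / "clips",
        "cv_test": cv_dir / "test.tsv",
        "training_set": root / "training_sets" / fleurs_code,
        "goldfish": root / "text_pretraining" / "goldfish" / f"{iso3}_{script}.txt",
    }


def missing_data_paths(fleurs_code, cv_code, iso3, script, root=ASR_ROOT):
    """Return the labels of expected paths that do not exist on disk."""
    paths = expected_data_paths(fleurs_code, cv_code, iso3, script, root)
    return [label for label, path in paths.items() if not path.exists()]
```

Calling `missing_data_paths(...)` with the three codes for a language returns an empty list when everything is in place. Note that each language carries three different identifiers (`<fleurs_code>`, `<cv_code>`, and `<iso3>_<script>`), so the mapping between them has to come from elsewhere.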

See also

Pipeline → Overview — how these pieces connect at runtime
Reference → Glossary — what the file/dir names mean