Dispatcher + matrix

The orchestration layer for running many fine-tunes in parallel across multiple GPUs and pods.

`matrix.json`: the experiment manifest

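A single JSON file listing every fine-tune job we want to run, with one entry per (lang × cfg × seed × method) combination. Schema of one entry (the `#` comments are annotations for this page; JSON itself has no comment syntax):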

{
    "job_id":         "fft_spanish_cfgB_s42_main",
    "method":         "fft",                    # or "vft", "vftmtl", "fft_strat_a", ...
    "lang":           "spanish",
    "fleurs_code":    "es_419",
    "whisper_lang":   "spanish",
    "config_id":      "B",                      # cfg A/B/C/D — see Recipes
    "seed":           42,
    "asr_ratio":      0.5,
    "max_text_lines": 500000,
    "text_mtl_path":  "/path/to/goldfish/spa_latn.txt",
    "output_dir":     "/checkpoints/matrix_runs/fft_spanish_cfgB_s42_main",
    "run_kind":       "main"                    # or "textmix_sweep", "scaling", ...
}

Across the project this grew to ~1200 entries (main FFT + textmix sweep + replication runs).
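The cross-product structure of the manifest can be sketched as follows. This is a minimal illustration, not the project's actual generator: the function name `build_matrix`, the `LANGS` table, and the `job_id` pattern (inferred from the single example entry above) are assumptions, and fields such as `asr_ratio` and `text_mtl_path` are left out.

```python
import itertools
import json

# Hypothetical language table; only the Spanish row comes from the example entry.
LANGS = {
    "spanish": {"fleurs_code": "es_419", "whisper_lang": "spanish"},
}

def build_matrix(methods, langs, config_ids, seeds, run_kind="main",
                 root="/checkpoints/matrix_runs"):
    """Return one manifest entry per (method, lang, cfg, seed) combination."""
    entries = []
    for method, lang, cfg, seed in itertools.product(methods, langs, config_ids, seeds):
        # job_id pattern inferred from "fft_spanish_cfgB_s42_main" (assumption)
        job_id = f"{method}_{lang}_cfg{cfg}_s{seed}_{run_kind}"
        entries.append({
            "job_id": job_id,
            "method": method,
            "lang": lang,
            **LANGS[lang],
            "config_id": cfg,
            "seed": seed,
            "output_dir": f"{root}/{job_id}",
            "run_kind": run_kind,
        })
    return entries

matrix = build_matrix(["fft"], ["spanish"], ["A", "B"], [42, 43])
print(json.dumps(matrix[0], indent=4))
```

Generating the manifest from a product like this keeps `job_id` and `output_dir` consistent by construction, which matters because the status file is keyed by `job_id`.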

`matrix_status.json`: the running state

A separate JSON keyed by job_id:

{
    "fft_spanish_cfgB_s42_main": {
        "state":       "done",                  # or "queued", "running", "failed", "abandoned"
        "pod":         "shivam-0_tokfix",
        "pid":         3328333,
        "gpu":         2,
        "started_at":  "2026-06-17T22:48:00Z",
        "finished_at": "2026-06-18T02:12:00Z"
    },
    ...
}

A dispatcher only modifies the entries it has claimed, but it reads the whole file on every poll to see what is still queued.
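Because the status file is a flat dict keyed by `job_id`, a progress summary is a few lines of Python. The sketch below is illustrative only: `summarize` is a hypothetical helper, and every `job_id` in the sample except the first is made up.

```python
from collections import Counter

def summarize(status):
    """Count jobs per state in a matrix_status-style dict."""
    return Counter(entry["state"] for entry in status.values())

# Tiny in-memory stand-in for matrix_status.json
example_status = {
    "fft_spanish_cfgB_s42_main": {"state": "done"},
    "fft_spanish_cfgB_s43_main": {"state": "running"},
    "fft_spanish_cfgA_s42_main": {"state": "queued"},
    "fft_spanish_cfgA_s43_main": {"state": "queued"},
}

counts = summarize(example_status)
print(dict(counts))
```

Against the real file, the same function can be fed the result of `json.load` on `matrix_status.json`.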

How claims work

Both pods can run a dispatcher concurrently. Coordination is via fcntl.flock:

import fcntl
import json
import os

LOCK_PATH = "/mnt/ssd-3/asr/_repro_sync/run_state/claims.lock"

def claim_job(pod_name, status_path="matrix_status.json"):
    """Flip the first "queued" job to "running" and return its job_id, or None."""
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)         # exclusive lock; blocks until acquired
        with open(status_path) as f:
            status = json.load(f)
        # Find a "queued" entry and flip it to "running" while holding the lock
        for job_id, st in status.items():
            if st["state"] == "queued":
                st["state"] = "running"
                st["pod"] = pod_name
                st["pid"] = os.getpid()          # the dispatcher's own PID at claim time
                with open(status_path, "w") as f:
                    json.dump(status, f)
                return job_id
        return None
    # closing the lock file on context exit releases the lock

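This works because the lock file lives on the shared NFS mount that both pods see, so only one dispatcher at a time can hold the lock and run the read-modify-write on matrix_status.json. It does depend on the mount supporting locking: on Linux, flock() on NFS is emulated with fcntl byte-range locks (since kernel 2.6.12), so the guarantee is only as strong as the NFS lock service.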
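The lock serializes dispatchers against each other, but it does not protect the status file against a crash halfway through a write. One common guard, sketched here as an assumption rather than something the dispatcher is known to do, is to write to a temporary file and rename it over the original. `write_json_atomic` is a hypothetical helper name.

```python
import json
import os
import tempfile

def write_json_atomic(path, data):
    """Write `data` as JSON so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be in the same directory (same filesystem) for os.replace to be a rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())      # push the bytes to disk before renaming
        os.replace(tmp_path, path)      # atomic rename on POSIX filesystems
    except BaseException:
        # Clean up the temp file if anything failed before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This would be called in place of the plain `json.dump` while the claims lock is still held. How quickly other NFS clients observe the renamed file depends on their attribute-caching settings, so treat it as a mitigation, not a guarantee.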

Dispatcher invocation

python dispatcher.py \
    --pod shivam-0_tokfix \
    --gpus 0,1,2,3,4,5,6,7 \
    --methods fft \
    --matrix matrix_tokfix_only.json \
    --poll-interval 30

Args:

Arg	Meaning
`--pod`	Pod identifier (just a label, recorded in the `pod` field of `matrix_status.json`)
`--gpus`	Which GPU IDs to use (claims one job per GPU)
`--methods`	Filter on `job["method"]` — only claim jobs of these types
`--matrix`	Filtered subset of `matrix.json` (useful for restricting to a specific run kind)
`--poll-interval`	Seconds between polls (default 30)
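For reference, a parser matching the table above could look like the sketch below. The real `dispatcher.py` parser is not shown in this page, so the details are assumptions: in particular that `--methods` accepts a comma-separated list like `--gpus` does, and that `--matrix` defaults to `matrix.json`.

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="File-locked job dispatcher (illustrative)")
    parser.add_argument("--pod", required=True, help="pod identifier recorded with each claim")
    parser.add_argument("--gpus", required=True,
                        type=lambda s: [int(g) for g in s.split(",")],
                        help="comma-separated GPU IDs, one job per GPU")
    parser.add_argument("--methods", required=True,
                        type=lambda s: s.split(","),
                        help="comma-separated job methods to claim (assumed format)")
    parser.add_argument("--matrix", default="matrix.json",
                        help="manifest file to read (default is an assumption)")
    parser.add_argument("--poll-interval", type=int, default=30,
                        help="seconds between polls")
    return parser.parse_args(argv)

args = parse_args([
    "--pod", "shivam-0_tokfix",
    "--gpus", "0,1,2,3,4,5,6,7",
    "--methods", "fft",
    "--matrix", "matrix_tokfix_only.json",
])
print(args.gpus, args.methods, args.poll_interval)
```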

Per-job process tree

When the dispatcher claims a job and spawns it:

dispatcher.py (parent)
└── bash -c "CUDA_VISIBLE_DEVICES=2 python train_fft.py --lang spanish ..."
    └── train_fft.py (patches the template)
        └── python run_state/fft_scripts/train_fft_spanish_cfgB_s42_main.py (the real training)

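The dispatcher records the bash child PID in matrix_status.json so the job can be killed from outside (via kill -TERM <pid>). The actual training subprocess (the leaf node) is not tracked directly; stopping it relies on SIGTERM reaching it through the tree. Note that kill -TERM <pid> delivers the signal to that one process only, so the leaf stops only if each parent forwards the signal. If orphaned training processes show up, signal the whole process group instead.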
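One way to make a kill reach the leaf regardless of signal forwarding is to start each job in its own session and signal the process group. The following is a sketch under that assumption, not a description of what `dispatcher.py` does; `spawn_job` and `terminate_job` are hypothetical names.

```python
import os
import signal
import subprocess

def spawn_job(cmd, gpu):
    """Start `cmd` pinned to one GPU, in its own session so the whole tree shares one process group."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    # start_new_session=True calls setsid() in the child, so proc.pid is also the group ID
    return subprocess.Popen(cmd, env=env, start_new_session=True)

def terminate_job(pid, sig=signal.SIGTERM):
    """Signal every process in the job's group, not just the recorded PID."""
    os.killpg(os.getpgid(pid), sig)
```

With this layout the PID stored in the status file doubles as the group ID, so an operator on the same pod can stop the bash wrapper, `train_fft.py`, and the generated training script in one call to `terminate_job(pid)`. The shell equivalent is `kill -TERM -- -<pid>`.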

What happens on failure

If the leaf training process exits non-zero:

1. train_fft.py propagates the exit code up.
2. dispatcher.py sees the bash child exit with a nonzero status.
3. The job's entry in matrix_status.json is flipped to state="failed".
4. The dispatcher moves on to the next queued job.
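The exit-code handling in the steps above can be sketched as a small function over the status dict. `record_exit` is a hypothetical helper, and the timestamp format is copied from the example status entry. A real dispatcher would call something like this while holding the claims lock and then write the file back.

```python
from datetime import datetime, timezone

def record_exit(status, job_id, returncode):
    """Update a matrix_status-style dict in place after a job's process exits."""
    entry = status[job_id]
    entry["state"] = "done" if returncode == 0 else "failed"
    entry["finished_at"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return entry

status = {"fft_spanish_cfgB_s42_main": {"state": "running", "gpu": 2}}
record_exit(status, "fft_spanish_cfgB_s42_main", returncode=1)
print(status["fft_spanish_cfgB_s42_main"]["state"])  # failed
```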

There is no automatic retry. Re-queueing is manual:

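import fcntl
import json

# Take the same lock the dispatchers use, so a concurrent claim is not clobbered
with open("/mnt/ssd-3/asr/_repro_sync/run_state/claims.lock", "w") as lock:
    fcntl.flock(lock, fcntl.LOCK_EX)
    with open("matrix_status.json") as f:
        s = json.load(f)
    for job_id in [...]:                 # fill in the job_ids to re-queue
        s[job_id]["state"] = "queued"
    with open("matrix_status.json", "w") as f:
        json.dump(s, f)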
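The list of `job_id`s to re-queue is chosen by hand after diagnosis. As a starting point, the failed entries can be pulled out of the status dict with a helper like the one below. `failed_job_ids` is a hypothetical name, and the sample `job_id`s other than the first are made up.

```python
def failed_job_ids(status, prefix=None):
    """Return job_ids whose state is "failed", optionally restricted to a job_id prefix."""
    return sorted(
        job_id for job_id, entry in status.items()
        if entry["state"] == "failed"
        and (prefix is None or job_id.startswith(prefix))
    )

status = {
    "fft_spanish_cfgB_s42_main": {"state": "failed"},
    "fft_spanish_cfgB_s43_main": {"state": "done"},
    "vft_spanish_cfgB_s42_main": {"state": "failed"},
}
print(failed_job_ids(status, prefix="fft_"))
```

Filtering by prefix is only a convenience; whether a failed job is worth re-running is still a judgment call made after looking at its logs.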

Why no real scheduler

We considered Celery, Slurm, and Ray. For ~1200 jobs on 16 GPUs across 2 pods with shared NFS, the file-locked dispatcher is genuinely simpler:

- No external dependency
- No broker process to crash
- Status JSON is human-readable and editable
- Failure recovery is "edit the file"

The cost is no automatic retries, no priority queues, no fancy scheduling. Acceptable for a research workflow where you re-queue failures by hand after diagnosing them.