Audit methodology¶

How we measure tokenizer quality and detect failure modes.

Compression rate: chars/token on FLEURS-test¶

from tokenizers import Tokenizer

def chars_per_token(tokenizer_path, sentences):
    tok = Tokenizer.from_file(str(tokenizer_path))
    total_chars = total_tokens = 0
    for s in sentences:
        enc = tok.encode(s)
        total_chars += len(s)
        total_tokens += len(enc.ids)
    return total_chars / total_tokens

Higher is better (fewer tokens per char of text → shorter decoder sequences → faster inference).

Note: chars/token computed on the training corpus and chars/token computed on a held-out corpus (e.g. FLEURS-test) should be similar. A large divergence indicates the tokenizer has memorized training-specific multi-character tokens.

Cross-boundary merge audit¶

Each BPE merge produces a token. In a properly pre-tokenized pipeline, no token should span multiple pre-token boundaries. A pre-token in a byte-level recipe has one of these structures:

Ġword (Ġ + letters)
word (letters only)
Ġnumbers (Ġ + 1–3 digits)
Ġpunct (Ġ + punctuation)
etc.

A cross-boundary token violates this structure. For example:

ĠwordĠword — contains two Ġs, so it spans two word boundaries
wordpunct — contains both letters and punctuation, with no Ġ between
word\nword — contains an embedded newline

def is_cross_boundary(token_str):
    # Multiple Ġs = spans multiple words
    if token_str.count('Ġ') > 1: return True
    # Embedded newline
    if any(c in token_str for c in '\r\n'): return True
    # Ġ at non-zero position
    if 'Ġ' in token_str and not token_str.startswith('Ġ'): return True
    # Mixed character classes in the body
    body = token_str.lstrip('Ġ')
    classes = {classify(c) for c in body}    # 'L' (letter), 'N' (digit), 'P' (punct)
    return len(classes) > 1

A healthy tokenizer should have <5% cross-boundary merges. A broken one (like our pre-fix tokenizers) can have 60–98%.

Per-script grouping¶

When reporting medians, we group langs by script family because the impact of pre-tokenization is highly script-dependent:

CJK (Mandarin, Cantonese, Japanese, Korean): char-isolation is critical; chars/token floor is ~1.0
Abugida / Brahmic (Hindi, Bengali, Tamil, Khmer, Lao, …): char-isolation gives biggest wins
Arabic-script (Arabic, Urdu, Persian, …): word-level pre-tokenization via \p{L}+; chars/token ~4 is healthy
Hebrew / Armenian / Georgian / Ethiopic: mixed isolation + word-level
Cyrillic: word-level; chars/token ~4–5
Latin: word-level; chars/token ~5–6

Verification chain (for the Regex() fix)¶

To verify the fix landed correctly:

Train tokenizer with Split(pattern=Regex(SPLIT_PATTERN), ...)
For each lang, compute chars/token on FLEURS-test
Compare to pre-fix numbers — should be substantially lower for non-Latin scripts
Sample tokens from FLEURS-test encoding and inspect: no multi-word phrases, CJK tokens are mostly 1 char each

For details on what the fix delivered, see Split bug and fix.

Audit methodology¶

Compression rate: chars/token on FLEURS-test¶

Cross-boundary merge audit¶

Per-script grouping¶

Verification chain (for the Regex() fix)¶

See also¶