Whisper-Accent

Conditioning via Adaptive Layer Normalization for Accent-Aware English Speech Recognition

Despite impressive multilingual performance, state-of-the-art ASR models like Whisper continue to exhibit elevated word error rates (WER) on non-native and regionally diverse English accents. Phonological variation across accents — differences in vowel quality, prosody, consonant realization, and rhythm — is systematic and structured. A model explicitly aware of speaker accent should be better equipped to attend to the relevant acoustic features.

We present Whisper-Accent: an extension of pretrained Whisper that conditions the decoder on learned accent embeddings via Adaptive Layer Normalization (AdaLN). The backbone encoder and decoder remain completely frozen, with only the AdaLN modulation weights, accent embeddings, and accent classifier trained from scratch. Whisper-Accent achieves 14.1% WER (whisper-accent-small.en) and 13.4% WER (whisper-accent-medium.en) compared to 17.6% and 17.5% for the respective Whisper baselines — absolute improvements of 3.5 and 4.1 percentage points.


Architecture

The core idea is simple: instead of fine-tuning the backbone, we inject accent-specific conditioning into the decoder’s normalization layers. Three lightweight components are added on top of a frozen Whisper checkpoint.

Figure 1: Whisper-Accent architecture. Accent embeddings are predicted from encoder hidden states via layer-weighted fusion and multi-head attention pooling, then used to modulate every frozen decoder LayerNorm via Adaptive Layer Normalization (AdaLN).

Accent Classifier. The encoder produces hidden states at every layer. We learn a set of scalar fusion weights over all \(L\) encoder layers plus the input embedding, yielding a single weighted-average representation of shape \((T, D)\). A linear projection reduces dimensionality, and multi-head attention pooling (MHA-pool) collapses the temporal axis using a learnable query vector. The resulting fixed-length vector is passed to a linear classification head over \(A\) accent classes.
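The classifier described above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation; class and parameter names are assumptions, and the frozen Whisper encoder is assumed to expose the hidden states of all layers.

```python
import torch
import torch.nn as nn

class AccentClassifier(nn.Module):
    """Layer-weighted fusion + multi-head attention pooling + linear head (sketch)."""

    def __init__(self, n_layers: int, d_model: int, d_proj: int,
                 n_accents: int, n_heads: int = 4):
        super().__init__()
        # One scalar fusion weight per encoder layer plus the input embedding.
        self.fusion_logits = nn.Parameter(torch.zeros(n_layers + 1))
        self.proj = nn.Linear(d_model, d_proj)
        # Learnable query vector for attention pooling over the time axis.
        self.query = nn.Parameter(torch.randn(1, 1, d_proj))
        self.pool = nn.MultiheadAttention(d_proj, n_heads, batch_first=True)
        self.head = nn.Linear(d_proj, n_accents)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (L+1, B, T, D) — stacked encoder layer outputs.
        w = torch.softmax(self.fusion_logits, dim=0)           # (L+1,)
        fused = torch.einsum("l,lbtd->btd", w, hidden_states)  # (B, T, D)
        x = self.proj(fused)                                   # (B, T, d_proj)
        q = self.query.expand(x.size(0), -1, -1)               # (B, 1, d_proj)
        pooled, _ = self.pool(q, x, x)                         # (B, 1, d_proj)
        return self.head(pooled.squeeze(1))                    # (B, n_accents)
```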

Accent Embeddings. A lookup table of \(A\) trainable embedding vectors — one per accent class — maps a predicted accent label to a conditioning vector \(e \in \mathbb{R}^d\). Ground-truth labels are used during training; predicted labels from the classifier are used at inference, making the system fully self-contained.
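In code, the lookup is a single embedding table; the label source is the only thing that changes between training and inference. A minimal sketch, with sizes chosen for illustration:

```python
import torch
import torch.nn as nn

N_ACCENTS, D_EMB = 23, 256          # assumed sizes for illustration
accent_emb = nn.Embedding(N_ACCENTS, D_EMB)

# Training: ground-truth labels. Inference: argmax of the Stage 1 classifier.
accent_ids = torch.tensor([3, 17])  # e.g. labels for a batch of two utterances
e = accent_emb(accent_ids)          # (2, D_EMB) conditioning vectors for AdaLN
```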

Adaptive Layer Normalization. Every LayerNorm in the Whisper decoder is replaced by an AdaLN module:

\[\text{AdaLN}(h, e) = \big(1 + \gamma(e)\big) \odot \text{LayerNorm}(h) + \beta(e)\]

where \(\gamma(\cdot)\) and \(\beta(\cdot)\) are learned linear projections from the accent embedding. The projection weights are zero-initialized (following ControlNet), so the accent embedding has no effect at the start of training, providing a stable, non-destructive initialization. The projection biases are initialized from the pretrained LayerNorm \(\gamma\) / \(\beta\) values and frozen, so each AdaLN module exactly reproduces its original LayerNorm before training begins.
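A sketch of the AdaLN replacement under these initialization rules (names are assumptions, not the released code). Zero-initialized projection weights plus frozen biases copied from the pretrained affine parameters make the module a functional copy of the original LayerNorm at step zero:

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """AdaLN(h, e) = (1 + gamma(e)) * LayerNorm(h) + beta(e)  (sketch)."""

    def __init__(self, pretrained_ln: nn.LayerNorm, d_accent: int):
        super().__init__()
        d_model = pretrained_ln.normalized_shape[0]
        # Normalize without a learned affine; the affine comes from e.
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_gamma = nn.Linear(d_accent, d_model)
        self.to_beta = nn.Linear(d_accent, d_model)
        # Zero-init weights: gamma(e) and beta(e) start independent of e.
        nn.init.zeros_(self.to_gamma.weight)
        nn.init.zeros_(self.to_beta.weight)
        # Biases reproduce the pretrained affine parameters and stay frozen.
        with torch.no_grad():
            self.to_gamma.bias.copy_(pretrained_ln.weight - 1.0)
            self.to_beta.bias.copy_(pretrained_ln.bias)
        self.to_gamma.bias.requires_grad_(False)
        self.to_beta.bias.requires_grad_(False)

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        gamma = self.to_gamma(e).unsqueeze(1)  # (B, 1, D), broadcast over time
        beta = self.to_beta(e).unsqueeze(1)
        return (1.0 + gamma) * self.norm(h) + beta
```

At initialization, \((1 + \gamma(e))\) equals the pretrained LayerNorm weight and \(\beta(e)\) equals its bias, so the output matches the frozen decoder exactly regardless of \(e\).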


Two-Stage Training

Training decouples accent classification from ASR conditioning to avoid conflicting gradient signals.

Stage 1 — Accent Classifier. The full Whisper backbone is frozen. Only the layer-fusion weights, projection, MHA pooling, and classification head are trained under pure accent cross-entropy loss (\(\lambda_\text{CE} = 0\), \(\lambda_\text{accent} = 1\)). A learning rate of 1e-3 is used with class weighting to handle label imbalance.

Stage 2 — Decoder AdaLN + Accent Embeddings. The Stage 1 checkpoint is loaded; everything except the AdaLN modulation parameters and the accent embedding table is frozen. The model is trained under pure ASR cross-entropy (\(\lambda_\text{CE} = 1\), \(\lambda_\text{accent} = 0\)) conditioned on ground-truth accent labels. A primary learning rate of 5e-5 applies to AdaLN parameters; a separate embedding learning rate of 5e-4 applies to accent embeddings. Weight decay is disabled, consistent with zero-initialized AdaLN weights.
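The Stage 2 freezing and per-group learning rates can be sketched as follows. A toy stand-in model is used here because the real module names are assumptions; only the optimizer wiring is the point:

```python
import torch
import torch.nn as nn

# Toy stand-in for the full model: only the trainable pieces matter here.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)        # stands in for frozen Whisper
        self.to_gamma = nn.Linear(4, 8)        # AdaLN scale projection
        self.to_beta = nn.Linear(4, 8)         # AdaLN shift projection
        self.accent_emb = nn.Embedding(23, 4)  # accent embedding table

model = ToyModel()

# Freeze everything, then re-enable only AdaLN weights and the embeddings.
for p in model.parameters():
    p.requires_grad_(False)
adaln_params = [model.to_gamma.weight, model.to_beta.weight]
emb_params = list(model.accent_emb.parameters())
for p in adaln_params + emb_params:
    p.requires_grad_(True)

optimizer = torch.optim.AdamW(
    [{"params": adaln_params, "lr": 5e-5},   # primary learning rate
     {"params": emb_params, "lr": 5e-4}],    # embedding learning rate
    weight_decay=0.0,  # disabled, consistent with zero-initialized AdaLN
)
```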


Results

Comparison with Whisper Baselines

Whisper-Accent consistently outperforms its size-matched baselines — and even the much larger whisper-large-v3 — demonstrating that targeted accent conditioning is a more effective lever than raw model scale.

| Model | Overall WER ↓ |
| --- | --- |
| openai/whisper-small.en | 17.6% |
| openai/whisper-medium.en | 17.5% |
| openai/whisper-large-v3 | 17.7% |
| openai/whisper-large-v3-turbo | 20.1% |
| mavleo96/whisper-accent-small.en | 14.1% (−3.5 pp) |
| mavleo96/whisper-accent-medium.en | 13.4% (−4.1 pp) |
Table 1: Overall WER on the English Accent Dataset test split. Our models improve over all Whisper baselines including large-v3, despite being significantly smaller.

Per-Accent WER and Accent Classification Accuracy

Improvements are observed across all 23 accent classes. The largest absolute reductions occur for accents with the highest baseline WER (Vietnamese, Spanish, French, Italian), where phonological distance from standard American English is greatest. The accent classifier achieves 85.1% accuracy on whisper-accent-small.en and 95.7% on whisper-accent-medium.en.

| Accent | Whisper-small WER | Accent-small WER | Accent Acc. |
| --- | --- | --- | --- |
| American | 14.2% | 11.8% | 91.3% |
| British | 16.3% | 13.1% | 87.6% |
| Indian | 22.1% | 17.4% | 88.2% |
| Spanish | 25.4% | 19.7% | 82.1% |
| German | 21.8% | 17.0% | 84.5% |
| French | 24.6% | 19.2% | 81.7% |
| Scottish | 19.3% | 15.6% | 88.9% |
| Dutch | 20.5% | 16.3% | 83.4% |
| Irish | 18.7% | 15.0% | 86.2% |
| Vietnamese | 28.1% | 22.3% | 79.4% |
| Canadian | 14.9% | 12.2% | 90.1% |
| Polish | 23.2% | 18.5% | 80.8% |
Table 2: Per-accent WER and classifier accuracy for a subset of accents (whisper-accent-small.en). Full 23-accent results in the repository.

Ablation: Ground-Truth vs. Predicted vs. Random Accent Labels

To isolate the contribution of accurate accent classification, we compare three conditioning modes at evaluation time.

| Conditioning | WER (small) | WER (medium) |
| --- | --- | --- |
| Ground-truth accent label | 13.6% | 12.9% |
| **Predicted accent label** | **14.1%** | **13.4%** |
| Random accent label | 17.4% | 17.2% |
Table 3: WER under different accent conditioning strategies. The bold row is the operational (deployment) setting.

Three findings stand out. First, random conditioning degrades performance to near-baseline Whisper WER, confirming that the gains are attributable to the specificity of the predicted accent label rather than a generic regularization effect. Second, the gap between predicted and ground-truth conditioning is only 0.5 pp, validating the end-to-end utility of the Stage 1 classifier. Third, the remaining gap to the ground-truth ceiling suggests that further improvements to accent classification could yield additional WER reductions.


Accent Embedding Analysis

Figure 2: Left — cosine similarity matrix of the 23 learned accent embeddings (whisper-accent-medium.en). Right — UMAP projection color-coded by broad linguistic/geographic region. Phonologically proximate accent families cluster naturally without explicit supervision.

The cosine similarity heatmap reveals meaningful structure: Germanic accents (German, Dutch) are most similar to each other, as are the native British Isles varieties (British, Irish, Scottish, Northern Irish). Slavic accents (Czech, Polish, Slovak, Croatian, Slovene) form a coherent cluster in the UMAP projection. Vietnamese stands apart from all European accents, consistent with its typological distance.
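A similarity matrix like the one in Figure 2 can be computed directly from the embedding table. This sketch uses a random placeholder table; the real embeddings live in the released checkpoints:

```python
import torch
import torch.nn.functional as F

emb = torch.randn(23, 256)      # placeholder for the learned accent embeddings
emb = F.normalize(emb, dim=-1)  # unit-norm rows
sim = emb @ emb.T               # (23, 23) cosine similarity matrix
```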

These geometries emerge from training on WER loss alone — the model was never explicitly supervised to cluster accents by linguistic family. This suggests that the AdaLN conditioning pressure encourages the accent embeddings to internalize phonological proximity as a useful organizational principle for decoder modulation.


Discussion

Why AdaLN works for accent conditioning. LayerNorm controls the scale and mean of internal activations and has been shown to carry significant representational control in generative models. By conditioning these normalization statistics on a predicted accent label, AdaLN lets the model softly “tune” the decoder’s representational manifold toward the expected phonological properties of each accent, without modifying any attention or feed-forward weights.

Generalization preserved. Because the encoder and decoder weights are entirely frozen, the model retains Whisper’s original WER on speech from accent classes not represented in training. The worst-case behavior under distribution shift is to predict an incorrect accent label and condition with the wrong embedding — as Table 3 shows, this still produces near-baseline Whisper performance rather than degrading below it.

Limitations. The current approach requires explicit accent labels in the training corpus. Additionally, the 23 accent classes are coarse categories; within-class variation (e.g., regional varieties of Indian English) is not captured. Future work could explore a continuous accent embedding space trained via contrastive objectives rather than discrete classification.


Checkpoints & Code

Pretrained checkpoints are available on the Hugging Face Hub:

mavleo96/whisper-accent-small.en
mavleo96/whisper-accent-medium.en

Training and evaluation code: github.com/mavleo96/whisper-accent