
"Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM"

Technical Use Case for : Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM
Jun 18th 2025
The paper "Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM" introduces several practical techniques for improving pseudo-labeling in Automatic Speech Recognition (ASR) systems, all of them useful in real-world production settings. Here are the key tricks and their applications:

1. Multi-ASR Fusion for Robust Pseudo-Labeling
   - Trick: Combine outputs from multiple ASR models (e.g., Whisper, Wav2Vec2, Conformer) to generate more accurate pseudo-labels.
   - Real-world Use:
     - Reduces dependency on a single ASR model’s biases/errors.
     - Helps in low-resource or noisy environments where no single model performs optimally.
   - Implementation: Use majority voting, confidence-weighted averaging, or ROVER (Recognizer Output Voting Error Reduction).

2. Leveraging SpeechLLM for Error Correction
   - Trick: Fine-tune a speech-adapted large language model (e.g., LLaMA, GPT-style models) to correct ASR errors in pseudo-labels.
   - Real-world Use:
     - Fixes common ASR mistakes (homophones, domain-specific terms, speaker variations).
     - Improves transcript quality without manual relabeling.
   - Implementation:
     - Train the LLM on ASR errors (e.g., using N-best lists or human corrections).
     - Use constrained decoding to avoid hallucination.

3. Confidence-Based Filtering & Iterative Refinement
   - Trick: Use model confidence scores to filter out unreliable pseudo-labels before training.
   - Real-world Use:
     - Prevents noisy labels from degrading model performance.
     - Enables semi-supervised learning with minimal human verification.
   - Implementation:
     - Threshold-based filtering (e.g., keep only tokens with >90% confidence).
     - Iterative self-training: retrain the ASR model on high-confidence labels, then expand.

4. Domain-Adaptive Pseudo-Labeling
   - Trick: Use in-domain data (even unlabeled) to fine-tune ASR models before pseudo-labeling.
   - Real-world Use:
     - Improves accuracy in specialized domains (medical, legal, technical).
     - Reduces mismatch between training and deployment conditions.

5. N-Best Rescoring with LLMs
   - Trick: Use an LLM to rerank multiple ASR hypotheses (N-best lists) and select the best one.
   - Real-world Use:
     - Better than single-model decoding in ambiguous cases.
     - Helpful for rare words, accents, and noisy speech.

Practical Deployment Tips
- Start with strong open-source ASRs (Whisper, NVIDIA NeMo) for fusion.  
- Use lightweight generative LLMs (e.g., Flan-T5-small, TinyLlama) for cost-efficient correction.
- Deploy confidence thresholds dynamically (tighter for critical tasks).  
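
One way to make thresholds dynamic (a minimal sketch; the task names and values are illustrative, not from the paper):

```python
# Hypothetical per-task confidence policy: stricter filtering for critical domains
THRESHOLDS = {"medical": 0.95, "legal": 0.90, "general": 0.80}

def confidence_threshold(task: str) -> float:
    # Unknown tasks fall back to the general-purpose threshold
    return THRESHOLDS.get(task, THRESHOLDS["general"])
```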

These tricks can significantly cut annotation costs while improving ASR robustness in production.

Below are detailed implementation steps for applying the tricks from "Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM" in a real-world environment.

---

1. Multi-ASR Fusion
Goal: Combine outputs from multiple ASR models to generate higher-quality pseudo-labels.

Implementation Steps:
1. Select Diverse ASR Models:
   - Use models with different architectures (e.g., Whisper, Wav2Vec2, NVIDIA Conformer).
   - Example:

     ```python
     import whisper
     from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

     # Load two ASR models with different architectures
     whisper_model = whisper.load_model("medium")
     wav2vec2_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
     wav2vec2_processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
     ```

2. Generate Hypotheses:
   - Run inference on all models and collect N-best lists (top candidate transcriptions).
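
   A minimal sketch of this step, assuming the models loaded above and a 16 kHz waveform `audio` (a float32 NumPy array); "audio.wav" is a placeholder path:

     ```python
     import torch

     # Whisper decodes directly from an audio file
     whisper_transcript = whisper_model.transcribe("audio.wav")["text"]

     # Wav2Vec2: greedy CTC decoding over the raw waveform
     inputs = wav2vec2_processor(audio, sampling_rate=16000, return_tensors="pt")
     with torch.no_grad():
         logits = wav2vec2_model(inputs.input_values).logits
     pred_ids = torch.argmax(logits, dim=-1)
     wav2vec2_transcript = wav2vec2_processor.batch_decode(pred_ids)[0]
     ```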

3. Fusion Strategies:
   - ROVER (voting-based): Align hypotheses and pick the most frequent tokens (a pure-Python voting sketch follows this list).

     ```python
     # `asr_metrics.rover` is a hypothetical helper shown for illustration only;
     # in practice, use NIST SCTK's `rover` tool or your own alignment + voting
     from asr_metrics import rover

     hypotheses = [whisper_transcript, wav2vec2_transcript]  # add more model outputs here
     best_transcript = rover(hypotheses)
     ```

   - Confidence-weighted Averaging: Use model confidence scores to weight predictions.
   - LLM-based Reranking: Use a small LLM (e.g., DistilBERT) to pick the best hypothesis.
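
   Since a ready-made `rover` library may not be available, here is a minimal word-level majority-voting sketch (it assumes the hypotheses are already roughly aligned; real ROVER first aligns them with dynamic programming):

     ```python
     from collections import Counter
     from itertools import zip_longest

     def majority_vote(hypotheses):
         # Vote per word position across all model outputs
         tokenized = [h.lower().split() for h in hypotheses]
         voted = []
         for position in zip_longest(*tokenized, fillvalue=""):
             word, _count = Counter(position).most_common(1)[0]
             if word:  # skip padding from shorter hypotheses
                 voted.append(word)
         return " ".join(voted)

     best_transcript = majority_vote([whisper_transcript, wav2vec2_transcript])
     ```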

When to Use:
- When working with noisy/unseen data where no single ASR dominates.
- For low-resource languages, combining models improves coverage.

---

2. SpeechLLM for Error Correction
Goal: Fix ASR errors in pseudo-labels using a fine-tuned LLM.

Implementation Steps:
1. Prepare Training Data:
   - Collect ASR errors (e.g., Whisper outputs vs. human corrections).
   - Format: {"asr_transcript": "play some song", "corrected": "play the song"}

2. Fine-tune a Lightweight LLM:
   - Use Llama-2-7B, Mistral, or Flan-T5 for efficiency.
   - Example (HuggingFace):

     ```python
     from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

     model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
     tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

     # Train on ASR correction pairs (fine-tuning code depends on framework)
     ```
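
   A minimal sketch of one fine-tuning pass over correction pairs, assuming the model and tokenizer above; the example pairs and the "fix ASR errors:" task prefix are illustrative, not from the paper:

     ```python
     import torch
     from torch.utils.data import DataLoader

     pairs = [
         {"asr_transcript": "the patient has a leason", "corrected": "the patient has a lesion"},
         # ... more pairs mined from N-best lists or human edits
     ]

     def collate(batch):
         # Encode ASR outputs as inputs and human corrections as targets
         inputs = tokenizer(["fix ASR errors: " + p["asr_transcript"] for p in batch],
                            padding=True, return_tensors="pt")
         labels = tokenizer([p["corrected"] for p in batch],
                            padding=True, return_tensors="pt").input_ids
         labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
         inputs["labels"] = labels
         return inputs

     loader = DataLoader(pairs, batch_size=8, shuffle=True, collate_fn=collate)
     optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

     model.train()
     for batch in loader:
         loss = model(**batch).loss
         loss.backward()
         optimizer.step()
         optimizer.zero_grad()
     ```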

3. Inference with Constrained Decoding:
   - Force the LLM to stay close to the original ASR output to avoid hallucinations (see the sketch below).
   - Use beam search + length normalization for better coherence.
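
   A simple way to approximate this (an assumption, not the paper's exact decoding scheme): generate with beam search, then fall back to the raw ASR text if the correction drifts too far:

     ```python
     import difflib

     def correct_transcript(asr_text, max_drift=0.3):
         # Beam search + length normalization with the fine-tuned model above
         inputs = tokenizer("fix ASR errors: " + asr_text, return_tensors="pt")
         output_ids = model.generate(**inputs, num_beams=5, length_penalty=1.0,
                                     max_new_tokens=128)
         candidate = tokenizer.decode(output_ids[0], skip_special_tokens=True)
         # Reject corrections whose word sequence diverges too much from the
         # ASR output: a cheap guard against hallucination
         sim = difflib.SequenceMatcher(None, asr_text.split(), candidate.split()).ratio()
         return candidate if sim >= 1 - max_drift else asr_text
     ```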

When to Use:
- For domain-specific terms (e.g., medical, legal jargon).
- When post-editing ASR outputs at scale.

---

3. Confidence-Based Filtering
Goal: Keep only high-confidence pseudo-labels for training.

Implementation Steps:
1. Extract Token/Word-Level Confidence:
   - Whisper: Use model.transcribe(..., word_timestamps=True) for word-level confidences.
   - Wav2Vec2: Use softmax probabilities from the CTC head.

2. Apply Thresholding:

   ```python
   def filter_low_confidence(result, min_confidence=0.8):
       # `result` is the dict from whisper's transcribe(..., word_timestamps=True);
       # each segment carries word dicts with "word" and "probability" fields
       words = [w for seg in result["segments"] for w in seg["words"]]
       return " ".join(w["word"].strip() for w in words
                       if w["probability"] >= min_confidence)
   ```
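
   Usage on a single file (the path is a placeholder; whisper_model is the model loaded earlier):

   ```python
   result = whisper_model.transcribe("audio.wav", word_timestamps=True)
   pseudo_label = filter_low_confidence(result, min_confidence=0.8)
   ```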

3. Iterative Self-Training:
   - Train initial model → Generate pseudo-labels → Filter → Retrain (schematic below).
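
   Schematically (train_asr and transcribe_with_confidence are hypothetical stand-ins for your own training and inference pipelines):

   ```python
   # Each round: pseudo-label unlabeled audio, filter, and retrain
   model = train_asr(labeled_data)
   for _ in range(3):  # a few self-training rounds
       pseudo = [(clip, filter_low_confidence(transcribe_with_confidence(model, clip)))
                 for clip in unlabeled_clips]
       pseudo = [(clip, text) for clip, text in pseudo if text]  # drop empty labels
       model = train_asr(labeled_data + pseudo)  # retrain on the expanded set
   ```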

When to Use:
- For semi-supervised learning (limited labeled data).
- When noise robustness is critical.

---

4. N-Best Rescoring with LLMs
Goal: Pick the best ASR hypothesis from multiple candidates.

Implementation Steps:
1. Extract N-Best Lists:
   - Whisper: Decode with beam search and multiple return sequences (the openai-whisper transcribe API returns only the top hypothesis, so use the HuggingFace port; see the sketch below).
   - Kaldi-style ASRs: Use lattice rescoring.
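
   A minimal N-best sketch with the HuggingFace Whisper port, assuming `audio` is a 16 kHz float waveform:

   ```python
   from transformers import WhisperForConditionalGeneration, WhisperProcessor

   processor = WhisperProcessor.from_pretrained("openai/whisper-small")
   asr = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

   features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
   output_ids = asr.generate(features, num_beams=5, num_return_sequences=5,
                             max_new_tokens=128)
   hypotheses = processor.batch_decode(output_ids, skip_special_tokens=True)
   ```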

2. LLM Reranking:
   - Fine-tune a small LLM to score hypotheses based on:
     - Fluency (perplexity).
     - Semantic correctness (domain-specific scoring).

   ```python
   from transformers import pipeline

   # Illustrative only: a stock distilbert-base-uncased classification head is
   # untrained, so fine-tune it on good-vs-bad transcripts before relying on it
   scorer = pipeline("text-classification", model="distilbert-base-uncased")
   best_hypothesis = max(hypotheses, key=lambda x: scorer(x)[0]["score"])
   ```
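
   For the fluency criterion, a perplexity-style reranker with GPT-2 is a simple alternative (an assumption here, not the paper's scorer):

   ```python
   import torch
   from transformers import AutoModelForCausalLM, AutoTokenizer

   lm = AutoModelForCausalLM.from_pretrained("gpt2")
   lm_tokenizer = AutoTokenizer.from_pretrained("gpt2")
   lm.eval()

   def avg_neg_log_likelihood(text):
       # Average token negative log-likelihood under GPT-2; lower = more fluent
       ids = lm_tokenizer(text, return_tensors="pt").input_ids
       with torch.no_grad():
           return lm(ids, labels=ids).loss.item()

   best_hypothesis = min(hypotheses, key=avg_neg_log_likelihood)
   ```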

When to Use:
- For ambiguous speech (e.g., accents, background noise).

---

5. Domain Adaptation
Goal: Improve pseudo-labeling for specialized domains.

Implementation Steps:
1. Fine-tune ASR on In-Domain Data:
   - Even unlabeled data helps (via the pseudo-labeling loop).
   - Example for Whisper:
   
     ```python
     whisper_model = whisper.load_model("small")
     # Pseudocode: openai-whisper has no built-in finetune() method; for real
     # fine-tuning, use the HuggingFace Whisper port (see the sketch below)
     whisper_model.finetune(custom_dataset)
     ```
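
   A minimal fine-tuning sketch with the HuggingFace port, assuming custom_dataset yields preprocessed {"input_features", "labels"} examples; a real setup also needs a padding data collator and an evaluation loop:

   ```python
   from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                             WhisperForConditionalGeneration)

   model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

   args = Seq2SeqTrainingArguments(
       output_dir="whisper-domain",
       per_device_train_batch_size=8,
       learning_rate=1e-5,
       max_steps=1000,
   )
   trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=custom_dataset)
   trainer.train()
   ```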

2. Use a Domain-Specific LM for Correction:
   - Train the LLM on medical/legal/financial text for better corrections.

When to Use:
- For niche applications (e.g., call centers, radiology reports).