Interspeech 2026

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in S2ST

A Mixture-of-LoRA-Experts framework that preserves non-verbal vocalizations (NVs) such as laughter and crying in speech-to-speech translation (S2ST), reaching 95% of full-data emotional fidelity with as little as 30 minutes of curated fine-tuning data.

Anonymous Submission

NV Preservation: 76%
Dataset Scale: ~1K hr
Min. Fine-tune Data: 30 min
Models Compared: 7

Explore Our Work

Listen to our dataset samples and compare model outputs across different emotion categories.

Our Approach

MoVE tackles the expressiveness gap in S2ST through three key contributions.

Scalable Data Pipeline

An automated generation-selection pipeline that creates high-quality expressive S2ST training data, producing a ~1,000-hour bilingual corpus balanced across five emotion categories.
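
As a rough illustration of what a generation-selection loop can look like, the sketch below generates several candidates per source item, scores each one, and keeps the best candidate only if it clears a quality threshold. This is not the authors' pipeline: `generate_and_select`, the toy generator and scorer, the number of candidates, and the threshold are all hypothetical placeholders chosen for the example.

```python
from typing import Any, Callable, Optional


def generate_and_select(
    item: str,
    generate: Callable[[str, int], Any],
    score: Callable[[str, Any], float],
    n_candidates: int = 4,   # assumed value, not from the paper
    threshold: float = 0.7,  # assumed value, not from the paper
) -> Optional[Any]:
    """Generate several candidates for one item; keep the best if it passes."""
    scored = []
    for take in range(n_candidates):
        candidate = generate(item, take)
        scored.append((score(item, candidate), candidate))
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None


# Toy stand-ins for a real synthesis model and a real quality scorer.
def toy_generate(item: str, take: int) -> dict:
    return {"source": item, "take": take}


def toy_score(item: str, candidate: dict) -> float:
    return [0.2, 0.9, 0.5, 0.6][candidate["take"]]


kept = generate_and_select("utt-001", toy_generate, toy_score)
rejected = generate_and_select("utt-001", toy_generate, toy_score, threshold=0.95)
```

With the toy scorer, the second take is kept under the default threshold, and raising the threshold above every score discards the item entirely, which is how a selection stage trades corpus size for quality.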

Mixture of LoRA Experts

Five specialized vocalization adapters (Angry, Happy, Sad, Laugh, Crying) with a learned soft-weighting router that dynamically blends experts for hybrid expressive states.
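
To make the soft-weighting idea concrete, here is a minimal sketch of one linear layer with a frozen base weight, five low-rank (LoRA-style) expert updates, and a router whose softmax output blends the experts. It is an illustration under stated assumptions, not the authors' implementation: the layer sizes, the rank, the linear router, and the random initialization are all assumed for the example, and the class name `MixtureOfLoRAExperts` is hypothetical.

```python
import math
import random

EXPERTS = ["Angry", "Happy", "Sad", "Laugh", "Crying"]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class MixtureOfLoRAExperts:
    """One linear layer: frozen base weight plus softly blended low-rank experts."""

    def __init__(self, d_in, d_out, rank=2, seed=0):
        rng = random.Random(seed)
        self.base = rand_matrix(d_out, d_in, rng)            # frozen weight W
        self.router = rand_matrix(len(EXPERTS), d_in, rng)   # assumed linear router
        # Each expert is a low-rank pair (B: d_out x rank, A: rank x d_in).
        # Random init is used here only so every expert contributes in the demo.
        self.experts = [
            (rand_matrix(d_out, rank, rng), rand_matrix(rank, d_in, rng))
            for _ in EXPERTS
        ]

    def route(self, x):
        """Soft weights over experts: positive and summing to 1."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        weights = self.route(x)
        y = matvec(self.base, x)
        for w, (b, a) in zip(weights, self.experts):
            delta = matvec(b, matvec(a, x))  # low-rank update B(Ax)
            y = [yi + w * di for yi, di in zip(y, delta)]
        return y, weights


layer = MixtureOfLoRAExperts(d_in=4, d_out=3)
x = [0.5, -1.0, 0.25, 2.0]
y, weights = layer.forward(x)
```

Because the router emits a full distribution rather than a hard top-1 choice, an input that is, say, partly "Happy" and partly "Laugh" receives a weighted mix of both adapters instead of being forced into a single category.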

AudioLLM Fine-tuning

First to fine-tune general-purpose AudioLLMs for end-to-end S2ST, demonstrating that as little as 30 minutes of curated data achieves 95% of full-data emotional fidelity.

Key Findings

MoVE achieves state-of-the-art performance on emotion-preserving English–Chinese S2ST.

NV Reproduction Rate: 76% (vs. ≤14% for existing S2ST systems)
Human Naturalness: #1 (highest rated among all compared systems)
Data Efficiency: 95% of full-data fidelity with just 30 min data

Cite Our Work

If you find our work useful, please consider citing:

@article{anonymous2026move,
  title   = {MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation},
  author  = {Anonymous},
  journal = {Interspeech 2026 (Under Review)},
  year    = {2026}
}