A Mixture-of-LoRA-Experts framework that preserves non-verbal vocalizations (laughter, crying, etc.) in speech-to-speech translation with remarkable data efficiency.
Listen to our dataset samples and compare model outputs across different emotion categories.
Browse ~1,000 hours of bilingual (Chinese ↔ English) expressive speech pairs across 5 emotion categories with interactive waveform visualization.
Explore Dataset →
Listen to and compare speech outputs from 7 S2ST models side by side, including OpenAI, SeamlessM4T, Kimi, and our MoVE, across 6 emotion categories.
Compare Models →
MoVE tackles the expressiveness gap in S2ST through three key contributions.
An automated generation-selection pipeline that creates high-quality expressive S2ST training data, producing a ~1,000-hour bilingual corpus with 5 balanced emotion categories (see the selection sketch below).
Five specialized vocalization adapters (Angry, Happy, Sad, Laugh, Crying) with a learned soft-weighting router that dynamically blends experts for hybrid expressive states (see the routing sketch below).
First to fine-tune general-purpose AudioLLMs for end-to-end S2ST, demonstrating that as little as 30 minutes of curated data achieves 95% of full-data emotional fidelity.
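The selection step of the data pipeline can be pictured as a generate-then-filter loop: synthesize several expressive candidates per translated sentence, score each for emotion and content fidelity, and keep only the best. The sketch below is a minimal illustration, not the released pipeline; `synthesize`, `emotion_confidence`, `transcribe`, and `text_similarity` are hypothetical callables standing in for an expressive TTS system, an emotion classifier, an ASR model, and a text-similarity metric, and the thresholds and scoring mix are assumed values.

```python
# Minimal generate-then-select sketch (assumptions noted above; not the released pipeline).
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Candidate:
    audio: bytes
    emotion_score: float   # classifier confidence that the target emotion is expressed
    content_score: float   # agreement between the ASR transcript and the target text

def select_best(
    text: str,
    emotion: str,
    synthesize: Callable[[str, str], bytes],            # expressive TTS (hypothetical)
    emotion_confidence: Callable[[bytes, str], float],  # emotion classifier (hypothetical)
    transcribe: Callable[[bytes], str],                 # ASR model (hypothetical)
    text_similarity: Callable[[str, str], float],       # e.g. character-level F1 (hypothetical)
    n_candidates: int = 4,
    min_emotion: float = 0.7,                           # assumed thresholds, not the paper's
    min_content: float = 0.9,
) -> Optional[Candidate]:
    """Generate several expressive candidates and keep the best one that passes both checks."""
    candidates = []
    for _ in range(n_candidates):
        audio = synthesize(text, emotion)
        candidates.append(Candidate(
            audio=audio,
            emotion_score=emotion_confidence(audio, emotion),
            content_score=text_similarity(transcribe(audio), text),
        ))
    kept = [c for c in candidates
            if c.emotion_score >= min_emotion and c.content_score >= min_content]
    if not kept:
        return None   # caller can regenerate or drop this sentence pair
    return max(kept, key=lambda c: c.emotion_score + c.content_score)
```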
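The Mixture-of-LoRA-Experts routing itself can be summarized in a few lines of PyTorch. The sketch below illustrates how the pieces fit together and is not the released implementation: the module names, adapter rank, placement on a single linear projection, and the mean-pooled routing input are all assumptions made for the example.

```python
# Illustrative Mixture-of-LoRA-Experts layer (assumptions noted above; not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

EXPERTS = ["angry", "happy", "sad", "laugh", "crying"]   # the five vocalization experts

class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> up(down(x)) * (alpha / r)."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.down = nn.Linear(d_in, r, bias=False)
        self.up = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.up.weight)                   # start as a zero delta on the base layer
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scale

class MoLoRALinear(nn.Module):
    """Frozen base projection plus a soft-weighted blend of LoRA experts."""
    def __init__(self, base: nn.Linear, num_experts: int = len(EXPERTS), r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # only the adapters and router are trained
        self.experts = nn.ModuleList(
            [LoRAExpert(base.in_features, base.out_features, r) for _ in range(num_experts)]
        )
        self.router = nn.Linear(base.in_features, num_experts)   # emits soft expert weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (batch, time, d_in)
        pooled = x.mean(dim=1)                                    # assumed utterance-level routing input
        weights = F.softmax(self.router(pooled), dim=-1)          # (batch, num_experts), soft blend
        deltas = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, T, d_out, E)
        mixed = (deltas * weights[:, None, None, :]).sum(dim=-1)     # weighted sum over experts
        return self.base(x) + mixed

# Example: wrap a single 1024-dim projection of a hypothetical AudioLLM block.
layer = MoLoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 50, 1024))                    # -> shape (2, 50, 1024)
```

Because the router emits soft weights rather than a hard expert choice, a hybrid expressive state such as tearful laughter can draw on both the Laugh and Crying experts at once.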
MoVE achieves state-of-the-art performance on emotion-preserving English–Chinese S2ST.
If you find our work useful, please consider citing: