Interspeech 2026

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in S2ST

A Mixture-of-LoRA-Experts framework that preserves non-verbal vocalizations (NVs) such as laughter and crying in speech-to-speech translation (S2ST), using as little as 30 minutes of fine-tuning data.

Anonymous Submission


76%

NV Preservation

~1K hr

Dataset Scale

30 min

Min. Fine-tune Data

Models Compared

Interactive Demos

Explore Our Work

Listen to our dataset samples and compare model outputs across different emotion categories.

🎧

MoVE Dataset

Browse ~1,000 hours of bilingual (Chinese ↔ English) expressive speech pairs across 5 emotion categories with interactive waveform visualization.


🎵

Model Comparison

Listen to and compare speech outputs from 7 S2ST models side by side, including OpenAI, SeamlessM4T, Kimi, and our MoVE, across 6 emotion categories.


Method

Our Approach

MoVE tackles the expressiveness gap in S2ST through three key contributions.

Scalable Data Pipeline

An automated generation-selection pipeline that creates high-quality expressive S2ST training data, producing a ~1,000-hour bilingual corpus with 5 balanced emotion categories.
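As a rough illustration of what a generation-selection loop can look like, the sketch below generates several candidate target utterances per source and keeps the best-scoring one only if it clears a quality threshold. The page does not describe the actual generator, scorer, or selection rule, so `generate`, `score`, `n_candidates`, and `threshold` are all hypothetical placeholders, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Candidate:
    audio_id: str   # identifier of a generated target utterance (hypothetical)
    score: float    # quality score from an assumed automatic scorer


def select_best(candidates: List[Candidate], threshold: float) -> Optional[Candidate]:
    """Keep the highest-scoring candidate, or None if none reaches the threshold."""
    kept = [c for c in candidates if c.score >= threshold]
    return max(kept, key=lambda c: c.score) if kept else None


def build_corpus(
    sources: List[str],
    generate: Callable[[str, int], str],   # assumed: (source id, attempt index) -> audio id
    score: Callable[[str, str], float],    # assumed: (source id, audio id) -> quality score
    n_candidates: int = 4,
    threshold: float = 0.7,
) -> Dict[str, str]:
    """Generate n_candidates per source and keep one (source, target) pair at most."""
    corpus: Dict[str, str] = {}
    for src in sources:
        candidates = []
        for attempt in range(n_candidates):
            audio_id = generate(src, attempt)
            candidates.append(Candidate(audio_id, score(src, audio_id)))
        best = select_best(candidates, threshold)
        if best is not None:   # sources with no acceptable candidate are dropped
            corpus[src] = best.audio_id
    return corpus
```

Dropping sources with no acceptable candidate trades corpus size for quality; a real pipeline might instead retry generation or lower the threshold per category to keep the emotion classes balanced.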

Mixture of LoRA Experts

Five specialized vocalization adapters (Angry, Happy, Sad, Laugh, Crying) with a learned soft-weighting router that dynamically blends experts for hybrid expressive states.
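The soft-weighting idea can be sketched in a few lines of dependency-free Python: a frozen base projection is augmented by five low-rank (LoRA) updates, and a router produces softmax weights that blend them per input. This is a minimal sketch under assumptions, not the released model: the layer sizes, the rank, the linear router, and the random initialization are illustrative choices (standard LoRA zero-initializes one factor), and only the five expert names come from the description above.

```python
import math
import random

EXPERTS = ["Angry", "Happy", "Sad", "Laugh", "Crying"]  # expert names from the text


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class MixtureOfLoRAExperts:
    """One linear layer: frozen base weight plus a soft blend of LoRA updates."""

    def __init__(self, d_in, d_out, rank=2, seed=0):
        rng = random.Random(seed)
        self.base = rand_matrix(d_out, d_in, rng)  # stands in for a frozen weight W
        # Each expert is a low-rank pair (A: rank x d_in, B: d_out x rank).
        # Random init is for illustration only.
        self.experts = [
            (rand_matrix(rank, d_in, rng), rand_matrix(d_out, rank, rng))
            for _ in EXPERTS
        ]
        self.router = rand_matrix(len(EXPERTS), d_in, rng)  # assumed linear router

    def gate(self, x):
        """Soft routing weights: one positive weight per expert, summing to 1."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        """Compute y = W x + sum_k g_k(x) * B_k (A_k x)."""
        weights = self.gate(x)
        y = matvec(self.base, x)
        for w, (a, b) in zip(weights, self.experts):
            delta = matvec(b, matvec(a, x))
            y = [yi + w * di for yi, di in zip(y, delta)]
        return y
```

Because the gate is a softmax rather than a hard top-1 choice, an input that mixes states (for example laughing while sad) can draw on several experts at once, which is the behavior the description above attributes to the router.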

AudioLLM Fine-tuning

The first approach to fine-tune general-purpose AudioLLMs for end-to-end S2ST, showing that as little as 30 minutes of curated data recovers 95% of the emotional fidelity obtained with the full dataset.

Results

Key Findings

MoVE achieves state-of-the-art performance on emotion-preserving English–Chinese S2ST.

76%

NV Reproduction Rate

vs. ≤14% for existing S2ST systems

Human Naturalness

Highest rated among all compared systems

95%

Data Efficiency

of full-data fidelity with just 30 min of data

Citation

Cite Our Work

If you find our work useful, please consider citing:

@article{anonymous2026move,
  title   = {MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation},
  author  = {Anonymous},
  journal = {Interspeech 2026 (Under Review)},
  year    = {2026}
}