MoVE Dataset

A large-scale bilingual (Chinese ↔ English) expressive speech-to-speech translation dataset spanning ~1,000 hours and ~900k parallel pairs, presented in our paper:
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

🎙️ ~1,000 Hours 🔢 ~900k Pairs 🌏 Chinese ↔ English 🎭 5 Emotions 📝 Parallel Text

Dataset Overview

The MoVE Dataset is a large-scale expressive speech-to-speech translation (S2ST) dataset designed to advance research in emotion-preserving cross-lingual speech synthesis. Each sample consists of a parallel Chinese–English audio pair produced with matching expressiveness, accompanied by the corresponding transcription text.

- Total hours: ~1,000
- Parallel pairs: ~900k
- Emotion categories: 5 (balanced)
- Languages: 2 (ZH / EN)

Dataset Structure

The released dataset follows the directory structure below. Each audio pair shares the same identifier across languages.

1000hr/
├── metadata.tsv        # zh_path  en_path  zh_text  en_text  category
├── en/
│   ├── angry/
│   ├── happy/
│   ├── sad/
│   ├── laugh/
│   └── crying/
└── zh/
    ├── angry/
    ├── happy/
    ├── sad/
    ├── laugh/
    └── crying/

metadata.tsv contains one row per parallel pair, with five tab-separated columns: zh_path, en_path, zh_text, en_text, category.
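As a minimal sketch of how the metadata could be parsed, the snippet below reads the five-column TSV described above into one record per parallel pair and tallies pairs per category. It assumes metadata.tsv has no header row; the sample rows, file paths, and texts are purely illustrative, not real dataset entries.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row excerpt of metadata.tsv (illustrative values only).
SAMPLE_TSV = (
    "zh/happy/0001.wav\ten/happy/0001.wav\t" "你好" "\tHello\thappy\n"
    "zh/laugh/0002.wav\ten/laugh/0002.wav\t" "真好笑" "\tSo funny\tlaugh\n"
)

# Column order as documented in the README.
COLUMNS = ["zh_path", "en_path", "zh_text", "en_text", "category"]

def load_pairs(fileobj):
    """Parse metadata.tsv rows into dicts, one per ZH-EN parallel pair."""
    reader = csv.DictReader(fileobj, fieldnames=COLUMNS, delimiter="\t")
    return list(reader)

pairs = load_pairs(io.StringIO(SAMPLE_TSV))
by_category = Counter(p["category"] for p in pairs)
print(by_category)  # Counter({'happy': 1, 'laugh': 1})
```

On the released data, the same function would be pointed at `1000hr/metadata.tsv`, and each row's `zh_path`/`en_path` resolved relative to the `1000hr/` root.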

Audio Demos

Below are randomly sampled audio pairs from each expressiveness category. Each sample presents the Chinese and English audio along with the corresponding transcription text.

📦 Data Availability. The full dataset (~1,000 hours) will be publicly released with the camera-ready version of our paper, and a download link will be provided at that time. Stay tuned!

Citation

If you use this dataset in your research, please cite our paper:

@article{anonymous2026move,
  title   = {MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation},
  author  = {Anonymous},
  journal = {Under Review},
  year    = {2026}
}