MoVE Dataset

A large-scale bilingual (Chinese ↔ English) expressive speech-to-speech translation dataset spanning ~1,000 hours and 858,312 parallel pairs, presented in our paper:
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

🎙️ ~1,000 Hours 🔢 858,312 Pairs 🌏 Chinese ↔ English 🎭 5 Emotions 📝 Parallel Text

Dataset Overview

The MoVE Dataset is a large-scale expressive speech-to-speech translation (S2ST) dataset designed to advance research in emotion-preserving cross-lingual speech synthesis. Each sample consists of a parallel Chinese–English audio pair produced with matching expressiveness, accompanied by the corresponding transcription text.

Total Hours: ~1,000
Parallel Pairs: 858,312
Emotion Categories: 5 (Balanced)
Languages: 2 (ZH / EN)

Dataset Structure

The released dataset follows the directory structure below. Each audio pair shares the same identifier across languages.

1000hr/
├── metadata.tsv    # zh_path  en_path  zh_text  en_text  category
├── en/
│   ├── angry/
│   ├── happy/
│   ├── sad/
│   ├── laugh/
│   └── crying/
└── zh/
    ├── angry/
    ├── happy/
    ├── sad/
    ├── laugh/
    └── crying/

metadata.tsv contains tab-separated columns: zh_path en_path zh_text en_text category
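
As a rough illustration of how the metadata might be consumed, the sketch below parses rows in the column order listed above, counts pairs per category, and checks that the two paths in a row share an identifier. Several details are assumptions rather than facts about the release: the file is treated as having no header row (pass has_header=True otherwise), the identifier is taken to be the file stem, and the sample paths, the .wav extension, and the sample sentences are invented for the example.

```python
import csv
import io
from collections import Counter
from pathlib import PurePosixPath

# Column order as documented for metadata.tsv.
COLUMNS = ["zh_path", "en_path", "zh_text", "en_text", "category"]


def read_metadata(handle, has_header=False):
    """Yield one dict per row of a metadata.tsv-style file.

    Whether the real file starts with a header row is an assumption;
    set has_header=True to skip the first line.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        if not row or (i == 0 and has_header):
            continue  # skip blank lines and the optional header
        if len(row) != len(COLUMNS):
            raise ValueError(
                f"row {i}: expected {len(COLUMNS)} fields, got {len(row)}"
            )
        yield dict(zip(COLUMNS, row))


def same_identifier(row):
    """Check that both paths share a file stem (assumed to be the pair ID)."""
    zh_id = PurePosixPath(row["zh_path"]).stem
    en_id = PurePosixPath(row["en_path"]).stem
    return zh_id == en_id


# Invented sample rows; real paths, extensions, and texts may differ.
sample = (
    "zh/happy/000001.wav\ten/happy/000001.wav\t今天真开心\tI am so happy today\thappy\n"
    "zh/crying/000002.wav\ten/crying/000002.wav\t我好难过\tI feel so sad\tcrying\n"
)

rows = list(read_metadata(io.StringIO(sample)))
counts = Counter(row["category"] for row in rows)
mismatched = [row for row in rows if not same_identifier(row)]

print(dict(counts))
print(f"{len(mismatched)} pair(s) with mismatched identifiers")
```

For the real file, the in-memory sample would be replaced by something like open("1000hr/metadata.tsv", encoding="utf-8", newline=""), with the paths resolved relative to the dataset root if they turn out to be stored that way.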

Audio Demos

Below are randomly sampled audio pairs from each expressiveness category. Each sample shows the Chinese and English audio along with available transcription text.


Data Availability. The full dataset (~1,000 hours) will be publicly released alongside the camera-ready version of our paper, and a download link will be provided at that time. Stay tuned!

Citation

If you use this dataset in your research, please cite our paper:

@article{chen2026move,
  title         = {MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation},
  author        = {Chen, Szu-Chi and Tsai, I-Ning and Lin, Yi-Cheng and Huang, Sung-Feng and Lee, Hung-yi},
  journal       = {arXiv preprint arXiv:2604.17435},
  year          = {2026},
  eprint        = {2604.17435},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2604.17435}
}