MoVE Dataset

A large-scale bilingual (Chinese ↔ English) expressive speech-to-speech translation dataset spanning ~1,000 hours and ~900k parallel pairs, presented in our paper:
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

🎙️ ~1,000 Hours 🔢 ~900k Pairs 🌏 Chinese ↔ English 🎭 5 Emotions 📝 Parallel Text

Dataset Overview

The MoVE Dataset is a large-scale expressive speech-to-speech translation (S2ST) dataset designed to advance research in emotion-preserving cross-lingual speech synthesis. Each sample consists of a parallel Chinese–English audio pair produced with matching expressiveness, accompanied by the corresponding transcription text.

- Total hours: ~1,000
- Parallel pairs: ~900k
- Emotion categories: 5 (balanced)
- Languages: 2 (ZH / EN)

Dataset Structure

The released dataset follows the directory structure below. Each audio pair shares the same identifier across languages.

1000hr/
├── metadata.tsv        # zh_path  en_path  zh_text  en_text  category
├── en/
│   ├── angry/
│   ├── happy/
│   ├── sad/
│   ├── laugh/
│   └── crying/
└── zh/
    ├── angry/
    ├── happy/
    ├── sad/
    ├── laugh/
    └── crying/

metadata.tsv contains one row per parallel pair, with five tab-separated columns: zh_path, en_path, zh_text, en_text, category.
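As a minimal sketch of how the metadata could be parsed, the snippet below reads the five-column TSV described above into one record per parallel pair and tallies pairs per category. It assumes metadata.tsv has no header row; the sample rows, file paths, and texts are purely illustrative, not real dataset entries.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row excerpt of metadata.tsv (illustrative values only).
SAMPLE_TSV = (
    "zh/happy/0001.wav\ten/happy/0001.wav\t" "你好" "\tHello\thappy\n"
    "zh/laugh/0002.wav\ten/laugh/0002.wav\t" "真好笑" "\tSo funny\tlaugh\n"
)

# Column order as documented in the README.
COLUMNS = ["zh_path", "en_path", "zh_text", "en_text", "category"]

def load_pairs(fileobj):
    """Parse metadata.tsv rows into dicts, one per ZH-EN parallel pair."""
    reader = csv.DictReader(fileobj, fieldnames=COLUMNS, delimiter="\t")
    return list(reader)

pairs = load_pairs(io.StringIO(SAMPLE_TSV))
by_category = Counter(p["category"] for p in pairs)
print(by_category)  # Counter({'happy': 1, 'laugh': 1})
```

On the released data, the same function would be pointed at `1000hr/metadata.tsv`, and each row's `zh_path`/`en_path` resolved relative to the `1000hr/` root.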

Audio Demos

Below are randomly sampled audio pairs from each expressiveness category. Each sample presents the Chinese and English audio along with the corresponding transcription text.

📦 Data Availability. The full dataset (~1,000 hours) will be publicly released with the camera-ready version of our paper, and a download link will be provided at that time. Stay tuned!

Citation

If you use this dataset in your research, please cite our paper:

@article{anonymous2026move,
  title   = {MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation},
  author  = {Anonymous},
  journal = {Under Review},
  year    = {2026}
}