Dataset Overview
The MoVE Dataset is a large-scale expressive speech-to-speech translation (S2ST) dataset designed to advance research in emotion-preserving cross-lingual speech synthesis. Each sample consists of a parallel Chinese–English audio pair produced with matching expressiveness, accompanied by the corresponding transcription text. Samples are organized into five expressiveness categories:
- Angry — speech delivered with frustration, annoyance, or intensity
- Happy — cheerful, upbeat, joyful speech
- Sad — melancholic, sorrowful, or subdued speech
- Laugh — speech containing laughter or spoken in an amused manner
- Crying — speech with a weeping, sobbing, or distressed quality
Dataset Structure
The released dataset follows the directory structure below. Each audio pair shares the same identifier across languages.
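As an illustration only (all directory and file names here are hypothetical, not the released layout), a structure consistent with this description could look like:

```
MoVE/
├── zh/
│   ├── angry/
│   │   └── 000001.wav
│   └── happy/
│       └── 000002.wav
├── en/
│   ├── angry/
│   │   └── 000001.wav
│   └── happy/
│       └── 000002.wav
└── metadata.tsv
```

Here the Chinese and English files for a pair share the identifier 000001, matching the note above.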
metadata.tsv contains tab-separated columns:
- zh_path — path to the Chinese audio file
- en_path — path to the parallel English audio file
- zh_text — Chinese transcription text
- en_text — English transcription text
- category — expressiveness category label (e.g. Angry, Happy)
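The metadata file above can be parsed with the standard library alone. The sketch below is a minimal example, assuming metadata.tsv has no header row and exactly the five tab-separated columns listed (the function name `load_metadata` is illustrative):

```python
import csv

def load_metadata(path):
    """Parse metadata.tsv into a list of per-sample dicts.

    Assumes five tab-separated columns and no header row:
    zh_path, en_path, zh_text, en_text, category.
    """
    fieldnames = ["zh_path", "en_path", "zh_text", "en_text", "category"]
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, fieldnames=fieldnames, delimiter="\t")
        return list(reader)
```

Each returned dict maps a column name to that sample's value, so pairs can be grouped or filtered by `category` directly.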
Audio Demos
Below are randomly sampled audio pairs from each expressiveness category. Each sample includes the Chinese and English audio along with the corresponding transcription text.
Data Availability. The full dataset (~1,000 hours) will be publicly released together with the camera-ready version of our paper; a download link will be provided at that time. Stay tuned!
Citation
If you use this dataset in your research, please cite our paper: