CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

Related tags

Text Data & NLPcvss
Overview

CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

License: CC BY 4.0

CVSS is a massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel speech-to-speech translation pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation corpus. The translation speech in CVSS is synthesized with two state-of-the-art TTS models trained on the LibriTTS corpus.

CVSS includes two versions of spoken translation for all the 21 x-en language pairs from CoVoST 2, with each version providing unique values:

  • CVSS-C: All the translation speeches are in a single canonical speaker's voice. Despite being synthetic, these speeches are of very high naturalness and cleanness, as well as having consistent speaking style. These properties ease the modelling of the target speech and enable models to produce high quality translation speech suitable for user-facing applications.

  • CVSS-T: The translation speeches are in voices transferred from the corresponding source speeches. Each translation pair has similar voices on the two sides despite of being in different languages, making this dataset suitable for building models that preserve speakers' voices when translate speech into different languages.

In together with the source speeches originated from Common Voice, they make two multilingual speech-to-speech tranlsation datasets each with about 1,900 hours of speech.

In addition to translation speech, CVSS also provides normalized translation text matching the pronunciation in the translation speech (e.g. on numbers, currencies, acronyms, etc.), which can be use for both model training as well as standalizing evaluation.

Please check out our paper for the detailed description of this corpus, as well as the baseline models we trained on both datasets.

Getting the data

The translation speech and the normalized translation text in CVSS can be downloaded from the links in the following table:

Source language Code CVSS-C CVSS-T
Arabic ar link link
Catalan ca link link
Welsh cy link link
German de link link
Estonian et link link
Spanish es link link
Persian fa link link
French fr link link
Indonesian id link link
Italian it link link
Japanese ja link link
Latvian lv link link
Mongolian mn link link
Dutch nl link link
Portuguese pt link link
Russian ru link link
Slovenian sl link link
Swedish sv link link
Tamil ta link link
Turkish tr link link
Chinese zh link link

Each tar.gz file in the links above includes train, dev and test directories containing audio clips as the translation speech, as well as train.tsv, dev.tsv and test.tsv files containing the normalized translation text. The normalized translation text files included in CVSS-C and CVSS-T are identical.

These translation audio clips and translation texts are to be paired with the Common Voice release version 4 (required) based on the audio file names. If you need the original translation text without the normalization, they are provided by CoVoST 2.

License

CVSS is released under the very permissive Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Citation

Please cite this paper when referencing the CVSS corpus:

@misc{jia2022cvss,
    title={{CVSS} Corpus and Massively Multilingual Speech-to-Speech Translation},
    author={Jia, Ye and Tadmor Ramanovich, Michelle and Wang, Quan and Zen, Heiga},
    eprint={2201.03713},
    archivePrefix={arXiv},
    year={2022}
}
Owner
Google Research Datasets
Datasets released by Google Research
Google Research Datasets
Correctly generate plurals, ordinals, indefinite articles; convert numbers to words

NAME inflect.py - Correctly generate plurals, singular nouns, ordinals, indefinite articles; convert numbers to words. SYNOPSIS import inflect p = in

Jason R. Coombs 762 Dec 29, 2022
Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models

Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.

Prithivida 681 Jan 01, 2023
Repository for Project Insight: NLP as a Service

Project Insight NLP as a Service Contents Introduction Features Installation Setup and Documentation Project Details Demonstration Directory Details H

Abhishek Kumar Mishra 286 Dec 06, 2022
Pytorch implementation of Tacotron

Tacotron-pytorch A pytorch implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model. Requirements Install python 3 Install pytorc

soobin seo 203 Dec 02, 2022
Generate text line images for training deep learning OCR model (e.g. CRNN)

Generate text line images for training deep learning OCR model (e.g. CRNN)

532 Jan 06, 2023
Rootski - Full codebase for rootski.io (without the data)

📣 Welcome to the Rootski codebase! This is the codebase for the application run

Eric 20 Nov 18, 2022
Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers.

Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers. Cherche is meant to be used with small to medium sized corpora. C

Raphael Sourty 224 Nov 29, 2022
Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Introduction This is a PyTorch implementation of the following research papers: (1) Hierarchical Text Generation and Planning for Strategic Dialogue (

Facebook Research 1.4k Dec 29, 2022
A simple chatbot based on chatterbot that you can use for anything has basic features

Chatbotium A simple chatbot based on chatterbot that you can use for anything has basic features. I have some errors Read the paragraph below: Known b

Herman 1 Feb 16, 2022
A CSRankings-like index for speech researchers

Speech Rankings This project mimics CSRankings to generate an ordered list of researchers in speech/spoken language processing along with their possib

Mutian He 19 Nov 26, 2022
Flaxformer: transformer architectures in JAX/Flax

Flaxformer: transformer architectures in JAX/Flax Flaxformer is a transformer library for primarily NLP and multimodal research at Google. It is used

Google 114 Dec 29, 2022
The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

Neural Machine Translation communication system The model is basically direct to convert one source language to another targeted language using encode

Nishant Banjade 7 Sep 22, 2022
Healthsea is a spaCy pipeline for analyzing user reviews of supplementary products for their effects on health.

Welcome to Healthsea ✨ Create better access to health with spaCy. Healthsea is a pipeline for analyzing user reviews to supplement products by extract

Explosion 75 Dec 19, 2022
Twitter-NLP-Analysis - Twitter Natural Language Processing Analysis

Twitter-NLP-Analysis Business Problem I got last @turk_politika 3000 tweets with

Çağrı Karadeniz 7 Mar 12, 2022
Paradigm Shift in NLP - "Paradigm Shift in Natural Language Processing".

Paradigm Shift in NLP Welcome to the webpage for "Paradigm Shift in Natural Language Processing". Some resources of the paper are constantly maintaine

Tianxiang Sun 41 Dec 30, 2022
Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

730 Jan 09, 2023
This project uses unsupervised machine learning to identify correlations between daily inoculation rates in the USA and twitter sentiment in regards to COVID-19.

Twitter COVID-19 Sentiment Analysis Members: Christopher Bach | Khalid Hamid Fallous | Jay Hirpara | Jing Tang | Graham Thomas | David Wetherhold Pro

4 Oct 15, 2022
Yet Another Sequence Encoder - Encode sequences to vector of vector in python !

Yase Yet Another Sequence Encoder - encode sequences to vector of vectors in python ! Why Yase ? Yase enable you to encode any sequence which can be r

Pierre PACI 12 Aug 19, 2021
LeBenchmark: a reproducible framework for assessing SSL from speech

LeBenchmark: a reproducible framework for assessing SSL from speech

11 Nov 30, 2022
Task-based datasets, preprocessing, and evaluation for sequence models.

SeqIO: Task-based datasets, preprocessing, and evaluation for sequence models. SeqIO is a library for processing sequential data to be fed into downst

Google 290 Dec 26, 2022