Official implementation of the paper: "LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech"

LDNet

Author: Wen-Chin Huang (Nagoya University) Email: [email protected]

This is the official implementation of the paper "LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech". The model takes a synthetic speech sample as input and outputs a simulated human rating, that is, a predicted mean opinion score (MOS).

Results

Usage

Currently we support only the VCC2018 dataset. We plan to release the BVCC dataset in the near future.

Requirements

  • PyTorch 1.9 (other reasonably recent versions should also work)
  • librosa
  • pandas
  • h5py
  • scipy
  • matplotlib
  • tqdm

Data preparation

# Download the VCC2018 dataset.
cd data
./download.sh vcc2018

Training

We provide configs that correspond to the following rows in the above figure:

  • (a): MBNet.yaml
  • (d): LDNet_MobileNetV3_RNN_5e-3.yaml
  • (e): LDNet_MobileNetV3_FFN_1e-3.yaml
  • (f): LDNet-MN_MobileNetV3_RNN_FFN_1e-3_lamb4.yaml
  • (g): LDNet-ML_MobileNetV3_FFN_1e-3.yaml

python train.py --config configs/<config_name> --tag <tag_name>

By default, the experimental results will be stored in exp/<tag_name>, including:

  • model-<steps>.pt: model checkpoints.
  • config.yml: the config file.
  • idtable.pkl: the dictionary that maps listener to ID.
  • training_<inference_mode>: the validation results generated during training. This file is useful for model selection. Note that the inference_mode setting in the config file determines which mode is used for validation during training.

There are some arguments that can be changed:

  • --exp_dir: The directory for storing the experimental results.
  • --data_dir: The data directory. Default is data/vcc2018.
  • seed: random seed.
  • update_freq: This is very important. See below.

Batch size and update_freq

By default, all LDNet models are trained with a batch size of 60. In my experiments, I used a single NVIDIA GeForce RTX 3090 with 24GB of memory for training. I cannot fit a full batch in GPU memory, so I accumulate gradients over update_freq forward passes and then perform a single parameter update. Before training, please check the train_batch_size in the config file and set update_freq so that train_batch_size × update_freq = 60. For instance, in configs/LDNet_MobileNetV3_FFN_1e-3.yaml the train_batch_size is 20, so update_freq should be set to 3.
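The arithmetic above can be illustrated with a small, dependency-free sketch. It is not taken from train.py: the helper names (compute_update_freq, mse_grad) and the toy linear model are hypothetical. It also assumes that each micro-batch loss is averaged over its samples and scaled by 1/update_freq before accumulation, which is a common convention but is not stated in this README.

```python
import math


def compute_update_freq(target_batch_size, train_batch_size):
    """Number of forward passes to accumulate before one parameter update."""
    if target_batch_size % train_batch_size != 0:
        raise ValueError("target batch size must be a multiple of train_batch_size")
    return target_batch_size // train_batch_size


def mse_grad(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error of the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data standing in for one batch of 60 samples (hypothetical values).
xs = [float(i) for i in range(60)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

train_batch_size = 20
update_freq = compute_update_freq(60, train_batch_size)

# Gradient computed on the full batch of 60 in one pass.
full_grad = mse_grad(w, xs, ys)

# Gradient accumulated over update_freq micro-batches of 20 samples each.
# Assumption: each micro-batch gradient is scaled by 1 / update_freq.
accumulated_grad = 0.0
for k in range(update_freq):
    chunk = slice(k * train_batch_size, (k + 1) * train_batch_size)
    accumulated_grad += mse_grad(w, xs[chunk], ys[chunk]) / update_freq

print(update_freq, math.isclose(full_grad, accumulated_grad, rel_tol=1e-9))
```

Under these assumptions, three accumulated micro-batches of 20 give the same gradient as one batch of 60, which is why update_freq has to match train_batch_size.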

Inference

python inference.py --tag LDNet-ML_MobileNetV3_FFN_1e-3 --mode mean_listener

Use --mode to specify which inference mode to use. Choices are: mean_net, all_listeners and mean_listener. By default, all checkpoints in the exp directory will be evaluated.
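To compare the inference modes for one trained tag, the command can be generated once per mode. The following is a minimal sketch that uses only the --tag and --mode flags shown above. The helper build_inference_command is hypothetical, and whether every mode applies to every model type is not stated here.

```python
import shlex

# The three choices listed above.
INFERENCE_MODES = ("mean_net", "all_listeners", "mean_listener")


def build_inference_command(tag, mode):
    """Return the argument list for one inference.py run (hypothetical helper)."""
    if mode not in INFERENCE_MODES:
        raise ValueError(f"unknown inference mode: {mode}")
    return ["python", "inference.py", "--tag", tag, "--mode", mode]


for mode in INFERENCE_MODES:
    command = build_inference_command("LDNet-ML_MobileNetV3_FFN_1e-3", mode)
    # Print a copy-pastable shell line for each mode.
    print(shlex.join(command))
```

Each argument list can be passed to subprocess.run(command, check=True) from the repository root to actually launch the evaluation.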

There are some arguments that can be changed:

  • ep: if you want to evaluate one model checkpoint, say, model-10000.pt, then simply pass --ep 10000.
  • start_ep: if you want to evaluate only the model checkpoints saved after a certain number of steps, say, from 10000 steps onward, then simply pass --start_ep 10000.
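The effect of these two arguments on checkpoint selection can be sketched as follows. This is an illustration, not the code in inference.py: select_checkpoints is a hypothetical helper, the file list is made up, and treating the --start_ep boundary as inclusive is an assumption.

```python
import re


def select_checkpoints(filenames, ep=None, start_ep=None):
    """Pick files named model-<steps>.pt, mimicking --ep and --start_ep."""
    pattern = re.compile(r"^model-(\d+)\.pt$")
    steps_to_name = {}
    for name in filenames:
        match = pattern.match(name)
        if match:
            steps_to_name[int(match.group(1))] = name

    # Sort numerically by step count, not alphabetically by filename.
    steps = sorted(steps_to_name)
    if ep is not None:
        steps = [s for s in steps if s == ep]
    elif start_ep is not None:
        # Assumption: the checkpoint at exactly start_ep is included.
        steps = [s for s in steps if s >= start_ep]
    return [steps_to_name[s] for s in steps]


# Hypothetical contents of exp/<tag_name>.
files = ["config.yml", "idtable.pkl", "model-5000.pt", "model-10000.pt", "model-15000.pt"]

print(select_checkpoints(files))                  # default: every checkpoint
print(select_checkpoints(files, ep=10000))        # a single checkpoint
print(select_checkpoints(files, start_ep=10000))  # checkpoints from 10000 steps onward
```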

There are some files you can inspect after the evaluation:

  • <dataset_name>_<inference_mode>.csv: the validation and test set results.
  • <dataset_name>_<inference_mode>_<test/valid>/: figures that visualize the prediction distributions, including:
    • <ep>_distribution.png: distribution over the score range (1-5).
    • <ep>_utt_scatter_plot_utt: utterance-wise scatter plot of the ground truth and the predicted scores.
    • <ep>_sys_scatter_plot_utt: system-wise scatter plot of the ground truth and the predicted scores.

Acknowledgement

This repository inherits from this great unofficial MBNet implementation.

Citation

If you find this recipe useful, please consider citing the following paper:

@article{huang2021ldnet,
  title={LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech},
  author={Huang, Wen-Chin and Cooper, Erica and Yamagishi, Junichi and Toda, Tomoki},
  journal={arXiv preprint arXiv:2110.09103},
  year={2021}
}