On the Semantic Latent Space of Diffusion-Based Text-To-Speech Models


Miri Varshavsky-Hassid*, Roy Hirsch*, Regev Cohen, Tomer Golany, Daniel Freedman, Ehud Rivlin

Verily AI

*Equal contributors


[ACL 2024] [arXiv] [BibTeX]


Abstract

The incorporation of Denoising Diffusion Models (DDMs) in the Text-to-Speech (TTS) domain is rising, providing great value in synthesizing high quality speech. Although they exhibit impressive audio quality, the extent of their semantic capabilities is unknown, and controlling their synthesized speech's vocal properties remains a challenge. Inspired by recent advances in image synthesis, we explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser. We identify that this space contains rich semantic information, and outline several novel methods for finding semantic directions within it, both supervised and unsupervised. We then demonstrate how these enable off-the-shelf audio editing, without any further training, architectural changes or data requirements. We present evidence of the semantic and acoustic qualities of the edited audio, and provide supplemental samples on this page.

Overview

We propose a simple yet effective semantic audio-editing method that can be applied to any frozen diffusion-based TTS model that contains a bottleneck. First, a latent semantic direction is defined, either in a supervised or an unsupervised manner, by capturing the bottleneck latents of example speech samples during their generation. Then, the corresponding speech attribute is edited by applying that direction to the latent space during the generation of a new speech sample. The method is demonstrated primarily with the male-to-female editing direction and the publicly available GradTTS model.

Supervised latent space editing

Latent space editing during sample generation, along a direction defined in a supervised manner, allows control over selected vocal attributes, such as the speaker's perceived gender. Such an editing direction can be defined by selecting several pairs of positive (e.g., male) and negative (e.g., female) samples and averaging the pairwise differences of their latent vectors.

Applying that direction at different scales λ to the latent vector of a generated sample then alters that vocal attribute by different magnitudes. Note that when λ = 0, no editing is performed.
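The supervised procedure can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names and the flattened-vector representation are assumptions, and in practice each latent is a bottleneck activation captured from the DDM's denoiser during generation.

```python
import numpy as np

def editing_direction(pos_latents, neg_latents):
    """Average the pairwise differences between positive (e.g., male)
    and negative (e.g., female) latent vectors of shape (n_pairs, d)."""
    diffs = np.asarray(pos_latents) - np.asarray(neg_latents)
    direction = diffs.mean(axis=0)
    # Normalize so that the scale lambda alone controls edit strength.
    return direction / np.linalg.norm(direction)

def apply_edit(latent, direction, lam):
    """Shift a sample's bottleneck latent along the editing direction;
    lam = 0 performs no editing."""
    return latent + lam * direction
```

In a full pipeline, `apply_edit` would be applied to the denoiser's bottleneck activations during the diffusion steps of a new sample's generation.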


Male-to-female editing

[Audio samples: three input texts, each synthesized from a female-source and a male-source speaker and edited at a range of λ values. Interactive players are not reproduced in text.]

Speaker embedding space editing

Speaker embedding editing serves as a baseline comparable to latent space editing. Here, the editing direction for a specific vocal attribute is defined analogously, using each sample's speaker embedding instead of its latent vector: several pairs of positive (e.g., male) and negative (e.g., female) samples are selected, and the pairwise differences of their speaker embeddings are averaged.

However, applying that direction at different scales to the speaker embedding, and generating samples from the edited embedding, does not produce satisfying results.


Male-to-female editing

Notice the collapse to a male voice at both ends of the scale, as well as the unintelligible speech produced at larger scales.


[Audio samples: three input texts, each synthesized from a female-source and a male-source speaker with the editing direction applied to the speaker embedding at a range of λ values. Interactive players are not reproduced in text.]

Principal Component latent space editing

The latent space of diffusion-based TTS models inherently captures semantic information. Specifically, the projections of latent vectors of samples onto the first Principal Components (PCs) correlate with specific vocal attributes of the generated speech samples. Furthermore, interpolation along the first PCs of the latent space alters these vocal attributes of the generated samples.
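The unsupervised variant can be sketched similarly (again illustrative, with hypothetical names): collect bottleneck latents from many generated samples, extract their principal components via an SVD of the centered data, and interpolate a new sample's latent along a chosen component.

```python
import numpy as np

def principal_directions(latents, k=2):
    """Unsupervised editing directions: the first k principal components
    of bottleneck latents collected from many generated samples.

    latents: array of shape (n_samples, d).
    Returns (mean, components), where components has shape (k, d).
    """
    X = np.asarray(latents)
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def edit_along_pc(latent, pc, lam):
    """Interpolate a sample's latent along a principal component,
    e.g., PC1 for perceived gender, PC2 for intensity and HNR."""
    return latent + lam * pc
```

Projecting a sample's latent onto a component (`(latent - mean) @ pc`) gives the scalar that, per the section above, correlates with the corresponding vocal attribute.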


PC1 - Male-to-female editing

[Audio samples: three input texts, each synthesized from a female-source and a male-source speaker and edited along PC1 at a range of λ values. Interactive players are not reproduced in text.]

PC2 - Intensity and HNR editing

* Note the speech enhancement when moving toward the negative side of the scale (e.g., λ = -2).

[Audio samples: three input texts edited along PC2 at a range of λ values. Interactive players are not reproduced in text.]

BibTeX

@inproceedings{varshavsky-hassid-etal-2024-semantic,
    title = "On the Semantic Latent Space of Diffusion-Based Text-To-Speech Models",
    author = "Varshavsky-Hassid, Miri  and
      Hirsch, Roy  and
      Cohen, Regev  and
      Golany, Tomer  and
      Freedman, Daniel  and
      Rivlin, Ehud",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-short.24",
    pages = "246--255",
}