GAN-Based Prosody Modeling and Character Voice Control for Audiobook Speech Synthesis

Abstract

Conventional speech synthesis techniques have made significant strides towards human-like performance, but audiobook speech synthesis still presents notable challenges. First, audiobook speech exhibits rich prosodic expressiveness, which makes prosody modeling difficult. Second, audiobook narrators use different voices to perform the dialogue of different characters, a behavior that existing speech synthesis methods have not adequately explored. To address the first challenge, we integrate discourse-scale prosody modeling into the conventional autoencoder-based framework and introduce generative adversarial networks (GANs) for phoneme-level prosody code prediction. To address the second, we explore a character voice encoder based on a pretrained speaker verification model and integrate it into the proposed method. Experimental results show that the proposed method enhances the prosodic expressiveness of synthesized audiobook speech and can produce distinctive voices for different audiobook characters without compromising the naturalness of the synthesized speech.

This page is for research demonstration purposes only.

Model Overview

Demos


Comparative experiments

Text

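皇帝坐在椅子上似乎有些心事,发了好一会呆。 (The emperor sat in his chair, seemingly preoccupied, and stared into space for a good while.)
没了巨盾,王阔海奔跑速度变得快起来,可也少了遮挡。 (Without the giant shield, Wang Kuohai ran faster, but he also had less cover.)
沈冷只是把自己的身体压在那,仅仅靠着自身的重量而已。 (Shen Leng simply pressed his body down there, relying on nothing but his own weight.)
最终他多看了两眼孟长安手中的小猎刀,于是沈冷嘴角多了些老母亲般的微笑。 (In the end he took a couple more looks at the small hunting knife in Meng Chang'an's hand, and a motherly smile appeared at the corner of Shen Leng's mouth.)
其中一个老妇取出个瓶子,倒出来几粒药丸,分给其他人。 (One of the old women took out a bottle, poured out a few pills, and handed them to the others.)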
proposed
proposed+
FastSpeech 2
GST
VITS2
StyleTTS
ground truth

Ablation study on BERT and GAN

Text

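风声,雨声,呼啸而过,张小凡却只觉得,自己的脑海中一片空白。 (The wind and rain howled past, yet Zhang Xiaofan felt only that his mind had gone blank.)
二人向前又走了几步,曾书书忙着端详怀里的小灰,张小凡却似是满腹心思,沉默不语。 (The two walked a few more steps forward; Zeng Shushu was busy examining Xiao Hui in his arms, while Zhang Xiaofan seemed lost in thought and said nothing.)
这时,这个草庙之内,在电光强烈照耀之下,已如白昼。 (At that moment, under the fierce glare of the lightning, the inside of the thatched temple was as bright as day.)
又走了一会,但见林中古木参天,阴气阵阵,看来已到树林深处。 (After walking a while longer, they saw ancient trees towering in the forest and felt waves of chill; it seemed they had reached the depths of the woods.)
但见他细眉方脸,眉目儒雅,与刚才那些凶狠粗豪的魔教中人大不相同。 (He had thin eyebrows and a square face, with refined, scholarly features, quite unlike the fierce and coarse members of the demonic sect seen earlier.)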
proposed+
proposed+ w/o GAN
proposed+ w/o BERT
proposed+ w/o FML
proposed+ w/o condition

Ablation study on discourse-scale modeling

Text

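风声,雨声,呼啸而过,张小凡却只觉得,自己的脑海中一片空白。 (The wind and rain howled past, yet Zhang Xiaofan felt only that his mind had gone blank.)
二人向前又走了几步,曾书书忙着端详怀里的小灰,张小凡却似是满腹心思,沉默不语。 (The two walked a few more steps forward; Zeng Shushu was busy examining Xiao Hui in his arms, while Zhang Xiaofan seemed lost in thought and said nothing.)
这时,这个草庙之内,在电光强烈照耀之下,已如白昼。 (At that moment, under the fierce glare of the lightning, the inside of the thatched temple was as bright as day.)
又走了一会,但见林中古木参天,阴气阵阵,看来已到树林深处。 (After walking a while longer, they saw ancient trees towering in the forest and felt waves of chill; it seemed they had reached the depths of the woods.)
但见他细眉方脸,眉目儒雅,与刚才那些凶狠粗豪的魔教中人大不相同。 (He had thin eyebrows and a square face, with refined, scholarly features, quite unlike the fierce and coarse members of the demonic sect seen earlier.)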
proposed+ (K=5)
proposed+ (K=3)
proposed+ (K=0)

Ablation study of character voice encoder (CVE)

Text

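皇帝坐在椅子上似乎有些心事,发了好一会呆。 (The emperor sat in his chair, seemingly preoccupied, and stared into space for a good while.)
没了巨盾,王阔海奔跑速度变得快起来,可也少了遮挡。 (Without the giant shield, Wang Kuohai ran faster, but he also had less cover.)
沈冷只是把自己的身体压在那,仅仅靠着自身的重量而已。 (Shen Leng simply pressed his body down there, relying on nothing but his own weight.)
“不是,我不是让你说媒,我的意思是啊,女大不由爹。” ("No, I'm not asking you to play matchmaker. What I mean is, once a daughter grows up, her father no longer has the say.")
“若真的招你做御厨,你带我做配菜可好?” ("If they really do recruit you as an imperial chef, would you take me along to prepare the side dishes?")
“朕好像已经有差不多七八年没来过这了。” ("It seems I have not been here for some seven or eight years.")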
proposed+ (SV)
proposed+ w/o CVE (SV)

Effectiveness of character voice control (using the coefficient described in Section 3.5 of the paper)

Reference wav

hero, heroine, emperor, empress, elder, child

Demos

Text

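“不是,我不是让你说媒,我的意思是啊,女大不由爹。” ("No, I'm not asking you to play matchmaker. What I mean is, once a daughter grows up, her father no longer has the say.")
“若真的招你做御厨,你带我做配菜可好?” ("If they really do recruit you as an imperial chef, would you take me along to prepare the side dishes?")
“朕好像已经有差不多七八年没来过这了。” ("It seems I have not been here for some seven or eight years.")
“因为说了,就会有太多太多人死,那可能要灭三族,也许是九族啊。” ("Because if it were said, far too many people would die; it could mean wiping out three clans, perhaps even nine.")
“我是什么都好,皇后倒是这么多年来没变过,一直都是那只母狼。” ("Whatever I am, the empress has not changed in all these years; she has always been that she-wolf.")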
ground truth
single voice (SV)
ground-truth voice (GV)
voice control: hero (x0.5)
voice control: hero (x1.0)
voice control: hero (x1.5)
voice control: heroine (x0.5)
voice control: heroine (x1.0)
voice control: heroine (x1.5)
voice control: emperor (x0.5)
voice control: emperor (x1.0)
voice control: emperor (x1.5)
voice control: empress (x0.5)
voice control: empress (x1.0)
voice control: empress (x1.5)
voice control: elder (x0.5)
voice control: elder (x1.0)
voice control: elder (x1.5)
voice control: child (x0.5)
voice control: child (x1.0)
voice control: child (x1.5)

Ablation study of duration modeling

Text

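风声,雨声,呼啸而过,张小凡却只觉得,自己的脑海中一片空白。 (The wind and rain howled past, yet Zhang Xiaofan felt only that his mind had gone blank.)
二人向前又走了几步,曾书书忙着端详怀里的小灰,张小凡却似是满腹心思,沉默不语。 (The two walked a few more steps forward; Zeng Shushu was busy examining Xiao Hui in his arms, while Zhang Xiaofan seemed lost in thought and said nothing.)
这时,这个草庙之内,在电光强烈照耀之下,已如白昼。 (At that moment, under the fierce glare of the lightning, the inside of the thatched temple was as bright as day.)
又走了一会,但见林中古木参天,阴气阵阵,看来已到树林深处。 (After walking a while longer, they saw ancient trees towering in the forest and felt waves of chill; it seemed they had reached the depths of the woods.)
但见他细眉方脸,眉目儒雅,与刚才那些凶狠粗豪的魔教中人大不相同。 (He had thin eyebrows and a square face, with refined, scholarly features, quite unlike the fierce and coarse members of the demonic sect seen earlier.)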
proposed+ (K=5)
proposed+ (K=0)
proposed+ w/o BERT (K=5)
proposed+ w/o BERT (K=0)