LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading








PESQ

Using intrusive measures like PESQ and STOI in our problem is fundamentally incorrect, as they cannot differentiate between another valid speaker and added noise. We show this in the following experiment. We took a speech signal x and created two versions of it: y = x + n, where n is Gaussian noise at an SNR of 5 dB, and z, the output of a voice conversion system applied to x. In other words, z is identical to x with respect to the spoken words and their timing, but has a different, yet fairly close, voice. We computed the five metrics and obtained the results below. The intrusive metrics rate the voice-converted version as no better than the highly noisy signal y, which is obviously wrong.


             PESQ   STOI   ESTOI   STOI-Net   DNS-MOS
Clean (x)    -      -      -       0.86       3.31
Noisy (y)    1.38   0.74   0.52    0.74       2.45
Cloned (z)   1.14   0.76   0.56    0.92       3.15
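To illustrate how such a comparison can be run, here is a minimal sketch using the pesq and pystoi packages. The file paths and the 16 kHz sampling rate are assumptions, and the voice-converted signal z is simply loaded as the pre-computed output of a voice conversion system.

    # A minimal sketch, assuming the pesq, pystoi, soundfile and numpy
    # packages and 16 kHz mono WAV files; the paths below are hypothetical.
    import numpy as np
    import soundfile as sf
    from pesq import pesq          # intrusive: needs the clean reference
    from pystoi import stoi        # intrusive: needs the clean reference

    FS = 16000
    x, fs = sf.read("clean.wav")            # reference signal x
    z, _ = sf.read("voice_converted.wav")   # z: pre-computed VC output for x
    assert fs == FS

    # y = x + n with white Gaussian noise at 5 dB SNR
    rng = np.random.default_rng(0)
    n = rng.standard_normal(len(x))
    snr_db = 5.0
    scale = np.sqrt(np.sum(x ** 2) / (np.sum(n ** 2) * 10 ** (snr_db / 10)))
    y = x + scale * n

    # Trim to a common length so the intrusive metrics align sample-by-sample
    L = min(len(x), len(y), len(z))
    for name, sig in [("Noisy (y)", y[:L]), ("Cloned (z)", z[:L])]:
        print(f"{name}: "
              f"PESQ={pesq(FS, x[:L], sig, 'wb'):.2f}  "
              f"STOI={stoi(x[:L], sig, FS):.2f}  "
              f"ESTOI={stoi(x[:L], sig, FS, extended=True):.2f}")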




[Audio examples: Noisy / Clean / Voice Converted]





Long Videos






Accents






Face Embedding

We evaluated the impact of the face identity embedding by replacing the face image with the null token used during training for classifier-free guidance. The full model is slightly better, possibly because the model never saw the face image replaced with the null embedding on its own during training (the lip-region video was always replaced alongside it). In any case, the main consequence of discarding the face embedding is the loss of a personalized voice for each video. The table and audio examples below compare the full and ablated models.


                      WER     STOI-Net   DNS-MOS   LSE-C   LSE-D
w/o face embedding    25.8%   0.92       3.05      6.091   8.407
with face embedding   24.1%   0.92       3.11      6.239   8.266
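For illustration only, below is a minimal sketch of how such a null-token ablation is commonly wired in a PyTorch-style model. LipVoicer's actual modules and parameter names are not shown on this page, so the class and names here are hypothetical.

    # Hypothetical sketch: swapping the face embedding for the learned
    # classifier-free-guidance null token at inference time.
    import torch
    import torch.nn as nn

    class FaceConditioning(nn.Module):
        """Supplies either the face identity embedding or the learned
        null token used for classifier-free guidance (hypothetical)."""
        def __init__(self, embed_dim: int = 512):
            super().__init__()
            # Learned null embedding, trained by randomly dropping the condition
            self.null_token = nn.Parameter(torch.zeros(embed_dim))

        def forward(self, face_emb: torch.Tensor, drop_face: bool = False) -> torch.Tensor:
            # drop_face=True reproduces the ablation: the model receives
            # the null token instead of the speaker's face embedding
            if drop_face:
                return self.null_token.unsqueeze(0).expand_as(face_emb)
            return face_emb

In this sketch, drop_face=True corresponds to the "w/o face embedding" row above, while the full model corresponds to drop_face=False.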


[Audio examples: With Face Embedding / Without Face Embedding]