LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading






PESQ

Using intrusive measures such as PESQ and STOI in our problem is fundamentally incorrect, because they cannot differentiate between a different but valid speaker and added noise. We show this in the following experiment. We took a speech signal x and created two versions of it. The first is y = x + n, where n is Gaussian noise at an SNR of 5 dB. The second is z, the output of a voice conversion system whose input is x. In other words, z is identical to x with respect to the spoken words and their timing, but has a different yet fairly close voice. We computed the five metrics and obtained the results below. The intrusive metrics (PESQ, STOI, ESTOI) rate the voice-converted signal z about as poorly as the highly noisy signal y (for example, PESQ 1.14 vs. 1.38 and STOI 0.76 vs. 0.74), which is obviously wrong. The non-intrusive metrics (STOI-Net, DNS-MOS), by contrast, rate z close to the clean signal x and clearly above y.


Signal        PESQ   STOI   ESTOI   STOI-Net   DNS-MOS
Clean (x)     -      -      -       0.86       3.31
Noisy (y)     1.38   0.74   0.52    0.74       2.45
Cloned (z)    1.14   0.76   0.56    0.92       3.15




Audio samples: Noisy (y) | Clean (x) | Voice Converted (z)
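The noisy signal y = x + n from the experiment above can be reproduced with a few lines of NumPy. The following is a minimal sketch, not the authors' code: the function names `add_gaussian_noise` and `measure_snr_db` are hypothetical, the noise is assumed to be white, and a synthetic tone stands in for the speech signal x.

```python
import numpy as np


def add_gaussian_noise(x, snr_db, rng=None):
    """Return y = x + n, with white Gaussian noise n scaled to the target SNR (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.float64)
    signal_power = np.mean(x ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    n = rng.normal(0.0, 1.0, size=x.shape)
    # Rescale so the realized noise power matches the target exactly.
    n *= np.sqrt(noise_power / np.mean(n ** 2))
    return x + n


def measure_snr_db(clean, noisy):
    """Measure the SNR (dB) of `noisy` relative to the reference `clean`."""
    noise = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))


# Assumption: a 1-second 220 Hz tone at 16 kHz stands in for the speech signal x.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
x = 0.1 * np.sin(2 * np.pi * 220 * t)

y = add_gaussian_noise(x, snr_db=5.0, rng=np.random.default_rng(0))
```

Because the noise is rescaled after sampling, `measure_snr_db(x, y)` recovers the 5 dB target up to floating-point error. An intrusive metric would then be computed between x and y (or x and z), which is exactly where the comparison against a different speaker breaks down.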





Long Videos






Accents






Face Embedding

We evaluated the impact of the face identity embedding by replacing the face image with the null token that is used during training for classifier-free guidance. The full model is slightly better. One possible reason is that the ablated setting was never seen in training: whenever the face image was replaced with the null embedding, the lip region video was replaced as well. In any case, the main effect of discarding the face embedding is the loss of a personalized voice for each video. Below are examples comparing audio generated by the full and ablated models.


Face embedding   WER     STOI-Net   DNS-MOS   LSE-C   LSE-D
w/o              25.8%   0.92       3.05      6.091   8.407
with             24.1%   0.92       3.11      6.239   8.266


Audio samples: With Face Embedding | Without Face Embedding
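The difference between the training-time null replacement and the face-only ablation described above can be illustrated with a small sketch. Everything here is hypothetical and for illustration only: the function `select_conditioning`, the embedding shapes, and the use of zero vectors as null embeddings are assumptions, not the model's actual interface.

```python
import numpy as np


def select_conditioning(face_emb, lip_emb, null_face, null_lip, mode):
    """Return the (face, lip) conditioning pair for a given replacement mode.

    mode="none":  keep both inputs (full model at inference).
    mode="joint": replace both with null embeddings, the case the text
                  describes for classifier-free guidance training.
    mode="face":  replace only the face embedding (the ablation), a
                  combination the text says was not seen in training.
    """
    if mode == "none":
        return face_emb, lip_emb
    if mode == "joint":
        return null_face, null_lip
    if mode == "face":
        return null_face, lip_emb
    raise ValueError(f"unknown mode: {mode!r}")


# Assumed shapes: one face vector and a sequence of per-frame lip features.
rng = np.random.default_rng(0)
face_emb = rng.normal(size=(128,))
lip_emb = rng.normal(size=(75, 256))
null_face = np.zeros_like(face_emb)  # assumption: null token modeled as zeros
null_lip = np.zeros_like(lip_emb)

ablated_face, ablated_lip = select_conditioning(
    face_emb, lip_emb, null_face, null_lip, mode="face"
)
```

In the `"face"` mode the lip features still carry the spoken content, so intelligibility metrics change little, while the speaker identity cue is gone. This matches the observation that the main loss is the personalized voice.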