

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. This work is about creating facial animation from speech in real time.

This means that after recording audio of us speaking, we give it to a learning algorithm, which creates a high-quality animation depicting our digital characters uttering these words. This learning algorithm is a convolutional neural network, which was trained on as little as three to five minutes of footage per actor, and was able to generalize its knowledge from this training data to a variety of real-world expressions and words.


And if you think you've seen everything, you should watch until the end of the video as it gets better than that because of two reasons. Reason number one, it not only takes audio input, but we can also specify an emotional state that the character should express when uttering these words.


… … … …


… … … …


… … … …


Number two, and this is the best part, we can also combine this together with DeepMind's WaveNet, which synthesizes audio from our text input. It basically synthesizes a believable human voice and says whatever text we write down.

And then that sound clip can be used with this technique to make a digital character say what we've written. So we can go from text to speech with WaveNet, and put the speech onto a virtual actor with this work.
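The two-stage idea described here (text to speech, then speech plus an emotional state to animation) can be sketched as a simple function composition. This is only an illustration of how the stages connect: `text_to_animation`, `fake_tts` and `fake_animator` are hypothetical names, the stand-in stages contain no learning at all, and nothing here reflects the real WaveNet or the paper's actual interfaces.

```python
from typing import Callable, List

Audio = List[float]   # mono waveform samples (assumed representation)
Frame = List[float]   # one animation frame, e.g. flattened face parameters

def text_to_animation(
    text: str,
    tts: Callable[[str], Audio],
    animator: Callable[[Audio, List[float]], List[Frame]],
    emotion: List[float],
) -> List[Frame]:
    """Chain the two stages: text -> audio -> animation frames."""
    audio = tts(text)                 # stage 1: a WaveNet-like speech synthesizer
    return animator(audio, emotion)   # stage 2: audio-driven facial animation

# Hypothetical stand-ins so the sketch runs; real models would replace these.
def fake_tts(text: str) -> Audio:
    # Pretend every character produces 100 silent samples.
    return [0.0] * (len(text) * 100)

def fake_animator(audio: Audio, emotion: List[float]) -> List[Frame]:
    # Pretend one frame is produced per 100 samples, and that each frame
    # simply echoes the requested emotional state.
    n_frames = len(audio) // 100
    return [list(emotion) for _ in range(n_frames)]

frames = text_to_animation("hello", fake_tts, fake_animator, emotion=[0.2, 0.8])
print(len(frames))  # 5
```

Passing the stages in as callables keeps the pipeline independent of any particular synthesizer or animation model, which is the point the narration makes: each stage can be a learned component.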

This way, we get a whole pipeline that works by learning and does everything for us in the most convenient way. No actors needed for voiceovers.


No motion capture for animations. This is truly incredible.


And if you look at the left side, you can see that in their video, there is some Two Minute Papers action going on. How cool is that?


Make sure to have a look at the paper to see the three-way loss function the authors came up with to make sure that the results work correctly for longer animations. And of course, in research, we have to prove that our results are better than previous techniques.
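The narration does not say what the three terms of the loss are, so the following is only a guess at the general shape such a loss could take, not the authors' formulation. It assumes a position term (per-frame error), a motion term (error in frame-to-frame change, which is what would help longer animations stay smooth), and a regularization term that discourages the emotional state from jumping between adjacent frames. The function name, the weights and the list-based representation are all hypothetical.

```python
from typing import List

def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def three_term_loss(
    pred_t: List[float], pred_t1: List[float],      # predicted frames t and t+1
    target_t: List[float], target_t1: List[float],  # ground-truth frames t and t+1
    emo_t: List[float], emo_t1: List[float],        # emotion vectors at t and t+1
    w_motion: float = 1.0,                          # assumed weight, not from the paper
    w_reg: float = 1.0,                             # assumed weight, not from the paper
) -> float:
    # Position term: how far each predicted frame is from its target.
    position = 0.5 * (mse(pred_t, target_t) + mse(pred_t1, target_t1))

    # Motion term: compare frame-to-frame change instead of absolute position.
    pred_motion = [b - a for a, b in zip(pred_t, pred_t1)]
    target_motion = [b - a for a, b in zip(target_t, target_t1)]
    motion = mse(pred_motion, target_motion)

    # Regularization term: penalize abrupt changes in the emotion vector.
    regularization = mse(emo_t, emo_t1)

    return position + w_motion * motion + w_reg * regularization

loss = three_term_loss([0, 0], [1, 1], [0, 0], [0, 0], [0.0], [2.0])
print(loss)  # 5.5
```

In the example call, the position term is 0.5, the motion term is 1.0 and the regularization term is 4.0, which gives 5.5 with both weights set to 1.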


To accomplish this, there are plenty of comparisons in the supplementary video. But we need more than that.


Since these results cannot be boiled down to a mathematical theorem that we need to prove, we have to do it some other way. And the ultimate goal is that a human being would judge these videos as real more often than ones made with a previous technique.


This is the core idea behind the user study carried out in the paper. We bring in a bunch of people, show them videos made with the old and the new technique without telling them which is which, and ask them which one they feel to be more natural.
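A blind pairwise comparison like this can be sketched in a few lines: randomize which side the new technique appears on, collect votes, then check whether the preference is stronger than coin-flipping would explain. The helper names and the vote counts below are made up for illustration, and the two-sided sign test is one common choice of analysis, not necessarily the one used in the paper.

```python
import random
from math import comb
from typing import Tuple

def blind_pair(old_clip: str, new_clip: str, rng: random.Random) -> Tuple[Tuple[str, str], str]:
    """Return the clips in random left/right order, plus the hidden side of the new one."""
    if rng.random() < 0.5:
        return (new_clip, old_clip), "left"
    return (old_clip, new_clip), "right"

def preference_summary(votes_for_new: int, total_votes: int) -> Tuple[float, float]:
    """Share of votes for the new technique and a two-sided sign-test p-value.

    The null hypothesis is that viewers pick either clip with probability 0.5.
    """
    share = votes_for_new / total_votes
    extreme = max(votes_for_new, total_votes - votes_for_new)
    # Probability of a result at least this lopsided in one direction.
    tail = sum(comb(total_votes, k) for k in range(extreme, total_votes + 1)) / 2 ** total_votes
    return share, min(1.0, 2 * tail)

# Hypothetical outcome: 16 of 20 viewers preferred the new technique.
share, p_value = preference_summary(16, 20)
print(f"{share:.0%} preferred the new method, p = {p_value:.4f}")
```

With these made-up numbers the share is 80% and the p-value is about 0.012, so a preference that lopsided would be unlikely if viewers were choosing at random.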


And the result was not even close — the new method is not only better overall, but I haven't found a single case, scenario or language where it didn't come out ahead. And that's extremely rare in research.


Typically, in a maturing field, new techniques introduce a different kind of trade-off. A classic case is lower execution time at the cost of higher memory consumption. But here, it's just simply better in every regard.


Excellent. … …

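[On-screen text: Final rendering in a game engine (Remedy Entertainment's Northlight engine), with high-level procedural controls driving the eyes. Left: video-based capture; right: our audio-driven result.]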

If you enjoyed this episode, and would like to help us make better videos in the future, please consider supporting us on Patreon. You can pick up cool perks like watching these episodes in early access.

Details are available in the video description. Beyond telling these important research stories, we're also using part of these funds to empower other research projects.


I just made a small write-up about this which is available on our Patreon page. The link is in the video description, make sure to have a look.

Thanks for watching and for your generous support, and I'll see you next time!


