Artificial Intelligence Creates Facial Animation from Audio


Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. This work is about creating facial animation from speech in real time.


This means that after recording the audio footage of us speaking, we give it to a learning algorithm, which creates a high-quality animation depicting our digital characters uttering these words. This learning algorithm is a Convolutional Neural Network, which was trained on as little as three to five minutes of footage per actor, and was able to generalize its knowledge from this training data to a variety of real-world expressions and words.

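To make the idea a bit more concrete, here is a minimal sketch of an audio-to-face convolutional network, assuming PyTorch; the layer sizes, input features and output parameterization are illustrative choices, not the authors' exact architecture.

```python
# Minimal sketch of an audio-to-face CNN, assuming PyTorch.
# Shapes and layer sizes are illustrative, not the paper's exact architecture.
import torch
import torch.nn as nn

class AudioToFaceNet(nn.Module):
    def __init__(self, n_audio_features=32, n_frames=64, n_face_params=150):
        super().__init__()
        # Convolutions over the time axis of a short window of per-frame
        # audio features (e.g. autocorrelation or spectral features).
        self.encoder = nn.Sequential(
            nn.Conv1d(n_audio_features, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis
        )
        # Map the pooled features to facial animation parameters
        # (e.g. blendshape weights or 3D vertex offsets) for the centre frame.
        self.head = nn.Linear(256, n_face_params)

    def forward(self, audio_window):
        # audio_window: (batch, n_audio_features, n_frames)
        x = self.encoder(audio_window).squeeze(-1)  # (batch, 256)
        return self.head(x)                         # (batch, n_face_params)

model = AudioToFaceNet()
dummy = torch.randn(8, 32, 64)   # 8 windows of audio features
print(model(dummy).shape)        # torch.Size([8, 150])
```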

And if you think you've seen everything, you should watch until the end of the video, because it gets even better than that, for two reasons. Reason number one: it not only takes audio input, but we can also specify an emotional state that the character should express when uttering these words.


… … … …

[Training footage: the character is in a sad or worried state] [New performance: the same emotion applied to different audio]

… … … …

[Training footage: the character shows mild surprise] [New performance: the same emotion applied to different audio]

… … … …

[Training footage: the character is in pain] [New performance: the same emotion applied to different audio]
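
One simple way to feed such an emotional state into a network is to learn an embedding vector per emotion and concatenate it with the audio features before the final layers. The sketch below assumes this scheme purely for illustration; the paper's own conditioning mechanism may differ in detail.

```python
# Sketch: conditioning the face prediction on a learned emotion vector.
# The emotion embedding table and its sizes are assumptions for illustration.
import torch
import torch.nn as nn

class EmotionConditionedHead(nn.Module):
    def __init__(self, n_audio_feat=256, n_emotions=16, emo_dim=24, n_face_params=150):
        super().__init__()
        # One learnable vector per emotional state (sad, surprised, in pain, ...).
        self.emotion_table = nn.Embedding(n_emotions, emo_dim)
        self.head = nn.Sequential(
            nn.Linear(n_audio_feat + emo_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_face_params),
        )

    def forward(self, audio_code, emotion_id):
        # audio_code: (batch, n_audio_feat) from an audio encoder
        # emotion_id: (batch,) integer index of the desired emotional state
        emo = self.emotion_table(emotion_id)  # (batch, emo_dim)
        return self.head(torch.cat([audio_code, emo], dim=1))

head = EmotionConditionedHead()
audio_code = torch.randn(4, 256)
emotion_id = torch.tensor([0, 3, 3, 7])    # e.g. neutral, sad, sad, surprised
print(head(audio_code, emotion_id).shape)  # torch.Size([4, 150])
```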

Number two, and this is the best part, we can also combine this together with DeepMind's WaveNet, which synthesizes audio from our text input. It basically synthesizes a believable human voice and says whatever text we write down.


And then that sound clip can be used with this technique to make a digital character say what we've written. So we can go from text to speech with WaveNet, and put the speech onto a virtual actor with this work.

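A hedged sketch of what such a text-to-animated-face pipeline could look like; wavenet_tts and audio_to_face are hypothetical placeholders for a WaveNet-style text-to-speech model and the audio-driven animation model, not real library calls.

```python
# Illustrative pipeline: text -> synthesized speech -> facial animation.
# wavenet_tts and audio_to_face are hypothetical placeholders, not real APIs.

def wavenet_tts(text):
    """Synthesize a speech waveform from text with a WaveNet-style model."""
    raise NotImplementedError("plug in an actual text-to-speech model here")

def audio_to_face(waveform, emotion="neutral"):
    """Predict per-frame facial animation parameters from a waveform."""
    raise NotImplementedError("plug in the audio-driven animation model here")

def text_to_animated_face(text, emotion="neutral"):
    # Step 1: text -> audio with a WaveNet-style synthesizer.
    waveform = wavenet_tts(text)
    # Step 2: audio -> facial animation curves, optionally steered by a
    # chosen emotional state, as described above.
    return audio_to_face(waveform, emotion=emotion)

# Usage, once real models are plugged in:
# frames = text_to_animated_face("Dear Fellow Scholars, this is Two Minute Papers.",
#                                emotion="slightly_surprised")
```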

This way, we get a whole pipeline that works by learning and does everything for us in the most convenient way. No actors needed for voiceovers.


No motion capture for animations. This is truly incredible.


And if you look at the left side, you can see that in their video, there is some Two Minute Papers action going on. How cool is that?


Make sure to have a look at the paper to see the three-way loss function the authors came up with to make sure that the results work correctly for longer animations. And of course, in research, we have to prove that our results are better than previous techniques.

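For the exact three terms, see the paper; as a rough illustration only, a loss of this flavour could combine a per-frame position term, a temporal motion term that keeps longer animations consistent, and a regularization term, for instance:

```python
# Rough illustration of a three-term loss of this flavour (not the paper's
# exact formulation): per-frame position error, temporal-motion error, and
# a regularization term to keep the learned emotion vectors well-behaved.
import torch

def three_way_loss(pred, target, emotion_vectors,
                   w_pos=1.0, w_motion=1.0, w_reg=0.01):
    # pred, target: (batch, frames, n_face_params)
    pos_term = torch.mean((pred - target) ** 2)
    # Compare frame-to-frame differences so the motion stays smooth and in
    # sync over longer animations, not just individual poses.
    motion_term = torch.mean(((pred[:, 1:] - pred[:, :-1])
                              - (target[:, 1:] - target[:, :-1])) ** 2)
    reg_term = torch.mean(emotion_vectors ** 2)
    return w_pos * pos_term + w_motion * motion_term + w_reg * reg_term
```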

To accomplish this, there are plenty of comparisons in the supplementary video. But we need more than that.


Since these results cannot be boiled down to a mathematical theorem that we need to prove, we have to do it some other way. And the ultimate goal is that a human being would judge these videos as being real with a higher chance than one made with a previous technique.


This is the core idea behind the user study carried out in the paper. We bring in a bunch of people, present them with a video made with the old technique and one made with the new, without telling them which is which, and ask them which one they feel to be more natural.

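As a small illustration of how such pairwise "which looks more natural?" answers can be summarized, here is a sketch assuming SciPy's binomial test; the counts are made up and the study's actual statistics may differ.

```python
# Summarising a pairwise preference study: each answer is 1 if the viewer
# picked the new method, 0 if they picked the old one. Numbers are made up.
from scipy.stats import binomtest

answers = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
wins = sum(answers)
n = len(answers)
print(f"new method preferred in {wins}/{n} = {wins / n:.0%} of comparisons")

# Two-sided binomial test against the 50/50 "no preference" baseline.
result = binomtest(wins, n, p=0.5)
print(f"p-value: {result.pvalue:.3f}")
```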

And the result was not even close — the new method is not only better overall, but I haven't found a single case, scenario or language where it didn't come out ahead. And that's extremely rare in research.


Typically, in a maturing field, new techniques introduce a different kind of trade-off, for instance, less execution time but at the cost of higher memory consumption is a classical case. But here, it's just simply better in every regard.


Excellent. … …

[Final frames rendered in a game engine: Remedy Entertainment's Northlight engine; advanced procedural controls drive the eyes] [Left: video-based capture; Right: the result generated from audio]

If you enjoyed this episode, and would like to help us make better videos in the future, please consider supporting us on Patreon. You can pick up cool perks like watching these episodes in early access.


Details are available in the video description. Beyond telling these important research stories, we're also using part of these funds to empower other research projects.


I just made a small write-up about this which is available on our Patreon page. The link is in the video description, make sure to have a look.


Thanks for watching and for your generous support, and I'll see you next time!


