Emotional Speech-Driven Animation with Content-Emotion Disentanglement(EMOTE) 논문 리뷰 | Gihun Son

Emotional Speech-Driven Animation with Content-Emotion Disentanglement(EMOTE) 논문 리뷰

Posted Dec 18, 2023

By Gihun Son 16 min read

Figure 1

‘audio input’과 ‘emotion label’이 주어졌을 때, “EMOTE”는 ‘emotion’을 표현하면서 SOTA ‘lip synchronization’을 갖는 ‘animated 3D head’를 생성한다.
EMOTE는 새로운 ‘video emotion loss’와 ‘speech’로부터 emotion을 disentangling하는 mechanism을 사용하여 ‘2D vidieo sequences’를 training한다.

Abstract

많은 분야에 사용하기 위해, ‘3D facial avatars’는 ‘speech signals’를 통해 쉽고, 사실적이고, 직접적으로 animate되어야 한다.
최근 가장 성능이 좋은 method들은 input audio와 synchronize되는 3D animation을 생성하지만, 대부분 ‘facial expression’에 대한 ‘emotion’의 영향을 무시한다.
사실적인 facial animation은 자연스러운 emotion의 expression이 있는 ‘lip-sync’가 필요하다.
위 목적을 달성하기 위해서 논문은 “EMOTE(Expressive Model Optimized for Talking with Emotion)”을 제안한다.

⇒ ‘emotion’의 ‘expression’에 대한 명확한 통제가 가능하면서, 동시에 speech에 대한 ‘lip-sync’는 유지하는 ‘3Dtalking-head avatars’를 생성

위 모델을 얻기 위해, ‘speech’와 ‘emotion’을 위한 분리된 loss들을 사용하여 “EMOTE”를 supervised learning하였다.
loss들은 2가지 주요 관점에 의거한다.
- (1) ‘speech’에 의한 ‘face deformation’은 공간적으로 mouth 주변에 존재하고, ‘high temporal frequency(변화가 많음)’를 갖는다.
- (2) ‘facial expressions’는 전체 ‘face’를 deformation할 가능성이 있고, 더 긴 시간동안 발생한다.

⇒ 따라서 “EMOTE”는 ‘lip-reading loss’는 ‘speech-dependent content’를 보존하기 위해 ‘per-frame’단위로 training하고, ‘emotion’에 대해서는 ‘sequence level’로 supervised learning을 한다.

또한, 논문은 ‘content-emotion exchange mechanism’을 사용한다.

⇒ speech와 synchronize된 ‘lip motion’을 유지하면서, 같은 audio상의 서로 다른 ‘emotions’를 supervised하기 위해

적절하지 않은 artifacts없이 ‘deep perceptual losses’를 사용하기 위해, ‘temporal VAE’형태의 ‘motion prior’를 고안하였다.
좋은 품질의 ‘speech’에 align된 ‘emotional 3D face datasets’가 부족하기 때문에, “EMOTE”는 ‘emotional video dataset(i.e. MEAD)’에서 extract된 ‘3D pseudo-ground-truth’를 사용한다.

논문의 기여도 요약
1. ‘speech-driven 3D facial animation’의 semantic emotion editing을 한 첫번째 method
2. ‘perceptual lip-reading losses’와 ‘dynamic emotion losses’를 사용한 새로운 supervision mechanism이고, 새로운 ‘content-emotion disentanglement’ mechanism이다.
3. 자연스러운 ‘animation’을 유지하면서, ‘perceptual losses’를 통해 ‘facial motion’의 조절을 돕기 위해 디자인 된 ‘statistical prior’
4. ‘bidirectional non-autoregressive architecture이다.
  ⇒ ‘autoregressive transformer-based SOTA method’보다 효율적임

3. Background and Notation

Face model

EMOTE는 “FLAME(parametric 3D head model)”의 ‘expression paramters’와 ‘jaw pose parameters’를 예측한다.
FLAME은 다음과 같은 function으로 정의된다.
- $M(\beta,\theta,\psi) \rightarrow (V,F)$
  - identity shape $\beta \in \R^{|\beta|}$
  - facial expression $\psi \in \R^{|\psi|}$
  - pose $\theta \in \R^{3k+3}$, $k=4$ joints
  - 3D mesh vertex $V \in \R^{n_v \times 3}$
  - 3D mesh triangle $F \in \R^{n_f \times 3}$

Emotion feature extraction

논문은 image로부터 ‘emotion features’를 예측할 때, “EMOCA”의 ‘emotion recognition network’를 사용한다.(“AffectNet”으로 pre-train된 ‘ResNet-50’으로 구성되어 있음)

⇒ ‘in-the-wild emotion recognition’ task인 ‘valence’, ‘arousal’, 8개의 기본 expression(neutral, happiness, sadness, surprise, fear, disgust, anger, contempt)를 ‘classifiction’을 regression

network를 training한 이후에, prediction head(FC layer)는 쓰지 않는다.
마지막 ResNet layer의 output이 ‘emotion feature vector($\epsilon \in \R^{|\epsilon|}$)’로 사용된다.
network는 아래 식과 같이 나타낼 수 있다.

⇒ $E^{im}_{emo}(I) \rightarrow \epsilon$

Video emotion feature extraction

emotion은 일반적으로 몇초 또는 그 이상 지속되는 현상이다.
single-frame emotion feature들로 ‘emotion’을 묘사하기에는 불충분하다.

(single frame은 ‘emotion’과 ‘speech’ 모두의 영향을 가지고 있기 때문)

이는 ‘emotional cues’에 의해 ‘speech-induced articulation’을 잘못 유발할 수도 있다.
따라서, ‘emotion features’는 시간에 따른 정보들을 종합해야 한다.
이를 해결하기 위해, 논문은 ‘lightweight transformer-based emotion classifier’를 training했다.

⇒ ‘emotion features sequence($\epsilon^{1:T} \in \R^{T \times |\epsilon|}$)’을 input으로 받고, ‘video emotion classification vector ($e \in \R^8$)’와 ‘video emotion feature($\phi \in \R^{|\phi|}$)’를 output으로 내보낸다.

‘video emotion feature($\phi$)’는 classification 전, 마지막 transformer layer에서 만들어진 ‘sequence-aggregated feature’이다.($|\phi|=256$)
위 과정(video motion feature extraction)을 식으로 나타내면 아래와 같다.
- $E^{vid}_{emo}(\epsilon^{1:T}) \rightarrow (e,\phi)$

Speech feature extraction

‘audio signal’을 encoding하기 위해서, 논문은 pretrained ‘Wav2Vec2.0(ASR network)’를 사용하였다.(input으로 16kHz로 sampled된 ‘raw waveform’을 사용)
위 ‘waveform’은 먼저 ‘temporal convolutional layers’를 통과한다.

⇒ 50Hz로 sampled된 feature를 생성

‘input videos’의 ‘frame-rate’와 matching될 수 있도록, ‘linear interpolation’을 사용하여 feature를 downsampling한다.
위 ‘resampled feature’는 “Wav2Vec 2.0”의 ‘transformer-based part’에 입력된다.

⇒ ‘speech feature’를 생성

공식으로 나타내면 아래와 같다.
- $A(w) \rightarrow s^{1:T}$
  - $A$는 “Wav2Vec 2.0” network
  - $w$는 ‘raw waveform’
  - $s^{1:T} \in \R^{T \times d_s}$는 25Hz로 resampling된 ‘speech feature’
    - $T$는 frame의 수
    - 각 frame의 dimension은 $d_s=768$

4. Method

Motivation

EMOTE는 ‘two-step pipline’을 따른다.
- 먼저, ‘temporal variational autoencoder’를 training하고
- 그 후에, autoencoder의 ‘latent space’를 ‘motion prior’로 사용한다.
- 구체적으로, ‘speech audio’를 주어진 ‘target emotion’, ‘target emotion의 intensity(mild, medium, hard)’, ‘subject-specific speaking style’에 conditioning된 ‘motion prior’의 ‘latent space’로 mapping하는 ‘regressor’를 training한다.

4.1. Facial Motion Prior: FLINT

‘facial motion’은 복잡하고, ‘facial motion’을 modeling하는 것은 challenge하다.
위 문제를 간단하게 하기 위해, ‘facial motion’을 학습된 ‘low-dimensional representation’으로 나타낸다.
기반으로, sequence의 $T$ frames의 각 frame에서 “FLAME”을 사용하여 ‘face’를 represent한다.
- frame당 $|\psi|+|\theta_{jaw}|=53$ dimensions(50 expression parameters, 3 jaw pose parameters)
하지만, ‘facial motions’는 frame들 사이에서 independent하지 않다.(연관되어 있음)
따라서, sequence는 ‘lower-dimensional space’로 나타낼 수 있다.
위를 위해서, 논문은 “FLINT(FLAME IN Time)”이라 불리는 ‘temporal variational autoencoder’를 training한다. ⇒ ‘facial motion sequences’를 representation함
formulation은 ‘transformer encoder’를 사용한다.

⇒ VAE framework를 ‘temporal modeling’문제로 확장하기 위해

논문에서는 “EMOTE”를 training할 때, “FLINT”를 ‘prior’로 사용하고, 이것이 ‘high-frequency jitter’와 부자연스러운 ‘jaw rotations’를 줄인다는 것을 발견했다.

Architecture

‘encoder’가 ‘$T$개 frames ($\psi^{1:T},\theta^{1:T}_{jaw}$)’를 ‘$T/q$개 latent frames $z^{1:T/q}$’로 압축한다.
- $q$는 하나의 ‘latent frame’을 요약한 연이은 ‘original frames’의 개수이다.
연이은 ‘latent frame’은 서로 겹치지 않는다.
논문에서는 실험적으로 $q=8$로 설정하였다.
위 과정을 식으로 나타내면 위 식(1)과 같다.

‘VAE reparametrization’을 사용하면 ‘final latent sequence’가 나온다.
- Final latent sequence: $z^t=\sigma^t * z^t_s+\mu^t$
  - $z^t$는 하나의 latent frame
  - $z^t_s$는 $\mathcal{N}(0,I)$에 의해 sampling된 latent frame
위 ‘reparametrization’은 ‘final latent sequence($z^{1:T/q}$)’가 완전히 구성되기 전까지 각각의 ‘latent frame($t \in {1,…,T/q}$)’마다 따로 수행된다.
그 후 ‘final latent sequence($z^{1:T/q}$)’는 ‘original space(original frame)’으로 다시 decoding된다.
그 과정을 나타낸 식은 위 식(2)와 같다.

‘autoencoder’ architecture의 전체적인 outline은 Fig2에 있다.

Figure 2

“FLINT motion prior” Architecture
$T$개의 “FLAME” parameters sequence가 주어졌을 때, ‘encoder’는 ‘$T$ frames FLAME parameters sequence’를 압축된 ‘latents sequence’로 mapping한다.
그 후 ‘decoder’는 위 ‘latent sequence’를 다시 ‘$T$ frames FLAME parameters sequence’로 reconstructtion한다.
예측된 ‘means, sigmas의 sequence’로 ‘latent sequence’를 sampling할 때, ‘reparametrization’을 사용한다.

Losses

“FLINT”를 training할 때 위 Loss fuction 식(3)을 사용한다.

Reconstruction loss

각 ‘frame $t$’마다, ‘pseudo-GT’와 ‘predicted meshes’사이의 MSE를 계산한다. (식(4))
‘vertex coordinates $V^t,\hat{V}^t$’는 ‘GT’와 “FLAME”을 통한 ‘reconstructed parameters’를 통해 생성된다.

KL divergence

sequence의 각 ‘latent frame’마다, ‘standard VAE KL divergence term’을 계산한다.(식(5))
각 ‘latent means, sigmas’는 sequence단위로 이용되는 것이 아니라, 각각 계산된다.

4.2. Emotional Speech-Driven Animation: EMOTE

Figure 3

EMOTE architecture
‘speech input’은 ‘raw audion waveform’과 ‘conditioning’으로 구성되어 있다.
- ‘conditioning’은 ‘training speaker ID의 one-hot vectors’와 ‘emotion class, intensity’를 포함
‘audio’는 ‘Wav2Vec 2.0’을 통해 encoding되고, ‘input conditioning’은 ‘linear styling layer’에 의해 mapping된다.
위 두 feature는 concat되고, ‘latent space sequence’로 mapping하기 위해 추가적인 ‘convolutional layer’를 통과한다.
‘pretrained frozen FLINT decoder’는 위 ‘latent space sequence’를 ‘FLAME parameter sequence’로 다시 변환한다.
최종적으로 “FLAME”을 통해 mesh들이 생성된다.

Architecture

“EMOTE”는 ‘encoder-decoder architecture’이다.
‘encoder’로 “Wav2Vec 2.0”을 사용하여 ‘audio feature sequence’를 extract한다.
- $A(w)=s^{1:T}$
각각의 ‘extracted audio feature $s^t$’는 ‘style vector’와 concat된다.
- $s^{1:T}_s=[S(c)^{1:T}|s^{1:T}]$
  - $S(c)$는 ‘styling function’⇒ ‘input condition($c$)’의 ‘linear projection’
training시에, $c$는 ‘emotion type’, ‘emotion intensity’, ‘speaker ID’의 Ground Truth이다.
- $c=[c_{emo}|c_{int}|c_{id}]$
  - $c_{emo},c_{int},c_{id}$는 각각 ‘emotion’, ‘intensity’, ‘identity’ index의 ‘one-hot vectors’이다.

test시에, $c$는 수동으로 설정 가능하다.

⇒ output sequence의 ‘emotion’을 animator가 조절할 수 있게 해줌

‘style’이 포함된 후에, ‘speech feature’는 ‘motion prior’의 ‘latent space’로 mapping된다.
구체적으로, ‘speech feature’는 ‘$q$개의 연속된 frames’와 concat된 후, ‘linear layer’를 통해 하나의 ‘latent frame’으로 projection됨으로써, 일시적으로 downsampling되는 것이다.
- $SQUASH(s^{1:T}_s)=z^{1:T/q}$
최종적으로, 얻어진 ‘latent sequence $z^{1:T/q}$’는 ‘pretrained, frozen motion decoder’의 input으로 들어간디ㅏ.

⇒ Eq2를 사용하여 ‘FLAME parameters’를 output으로 내보낸다.

(예측된 ‘FLAME parameters’는 $\hat{\psi^{1:T}},\hat{\theta^{1:T}_{jaw}}$)

training동안, ‘GT geometry’와 ‘predicted geometry’는 ’differentiable renderer’로 rendering된다. 그리고, images는 ‘lip-reading network $E_{lip}$’과 ‘video emotion network $E^{vid}_{emo}$’를 통과한다.
- ‘differentiable renderer’이란 renderer의 gradient가 전파 가능하다는 것을 의미
differentiable rendering을 포함한 ‘forward pass’와 emotion과 lip-reading features의 ‘extraction’은 위 식(7)과 같이 나타낼 수 있다.
$\hat{V}^{1:T}$는 ‘generated vertex sequence’, $\hat{\eta}^{1:T}$는 ‘lip-reading features’의 sequence, $\phi$는 ‘video emotion feature’

최근 ‘transformer-based SOTA method’들과 달리, “EMOTE”는 ‘autoregressive’하지 않다.
따라서 ‘decoder’는 한번만 호출된다.⇒ ‘decoding’ 계산 복잡도는 $O(1)$

⇒ 이는 “FaceFormer”와 “CodeTalker”의 $O(T)$ autoregressive decoding loop보다 효율적이다.

또한 “BERT”와 유사하게 ‘bidirectional decoding’을 사용하여 ‘future context’를 고려할 수도 있다.

Training

Training시에, model을 supervised learning방식으로 위 식(8)의 loss function을 사용한다.

Reconstruction loss

sequence의 각 ‘frame $t$’마다, ‘pseudo-GT mesh’와 ‘predicted mesh’의 MSE를 계산한다.

Video emotion loss

논문은 ‘original video’로 부터 ‘emotion feature’를 extract한다.
- $E^{vid}{emo}(E^{im}{emo}(I^{1:T}))=\phi$
마찬가지로 ‘differentiably-rendered predicted sequence’로부터 같은 방식으로 $\hat{\phi}$를 구한다.
위 ‘emotional content’는 동일해야 하기 때문에, 그들의 distance에 대해 penalize한다.
$d_e$는 ‘negative cosine similarity’

Lip-reading loss

sequence의 각 ‘frame $t$’마다, ‘perceptual lip-reading loss’를 계산한다.
‘mouth region’을 crop하고, 이를 ‘lip-reading network’에 feed한다.
$E_{lip}$을 사용하여 frame마다 ‘lip-reading feature’를 extract하고, ‘pseudo GT lip-reading features’와 ‘predicted lip-reading features’ 사이의 distance를 계산한다.
위 식(11)과 같다.

Figure 4

Disentanglement mechanism
training시에, batch들을 복사하고, 복사된 batch에 conditioning된 ‘emotion’을 교환한다.
‘augmented batch’는 model을 통과하고, ‘augmented batch’가 ‘desired emotion(exchanged emotion)’을 갖지만, ‘original articulation’을 유지하도록 ‘disentanglement losses’를 계산한다.

Disentangling emotion and content

논문의 목표는 ‘content’와 ‘emotion’을 disentangling하는 것이다.

⇒ 둘 중 하나를 유지하면서, 다른 하나를 조절할 수 있다.

식(6)에 설명된 ‘conditioning’은 위 목표를 달성하기에 불충분하다.
따라서 논문은 새로운 ‘emotion-content disentanglement mechanism’을 고안했다.(Fig4)
training시에, 서로 다른 emotion을 갖는 2개의 sequence를 가지고, 그들의 ‘emotion condition’을 서로 교환한다.
decoding된 각각의 결과들의 ‘lip-reading loss’는 ‘emotion’이 바뀌었더라도 ‘original’과 matching되어야 한다.
하지만, decoding된 결과의 ‘emotion’은 새로운 ‘conditioning’에 matching되어야 한다.

조금 더 공식적으로 나타내면,
$EMOTE(s^{1:T}_i,c_i)=(\hat{V}^{1:T}_i,\hat{\eta}^{1:T}_i,\hat{\phi}_i)$
- ‘sample $i$’에 대한 EMOTE의 ‘forward pass’
$EMOTE(s^{1:T}_j,c_j)$는 ‘sample j’에 대한 식
위 식에서, ‘$i\leftrightarrow j$’의 의미는 ‘audio $i$’의 ‘emotion, intensity condition’을 ‘audio $j$’와 함께 사용하여 generation한다는 의미이다.

Disentanglement losses

논문은 ‘emotion losses’와 ‘lip-reading perceptual losses’를 모두 ‘augmented samples’에 적용하였다.⇒ 식(12)
‘emotion’을 ‘per-frame phenomenon’보다는 sequence 전반에 발생하는 현상으로 다루기 때문에, ‘emotion features $\phi_{i \leftrightarrow j}, \phi_i$’ 사이의 ‘temporal alignment’를 하지 않아도 된다.

Paper, 3D Talking Head Generation

Paper 3D Talking Head Generation

This post is licensed under CC BY 4.0 by the author.

Abstract

3. Background and Notation

Face model

Emotion feature extraction

Video emotion feature extraction

Speech feature extraction

4. Method

Motivation

4.1. Facial Motion Prior: FLINT

Architecture

Losses

Reconstruction loss

KL divergence

4.2. Emotional Speech-Driven Animation: EMOTE

Architecture

Training

Reconstruction loss

Video emotion loss

Lip-reading loss

Disentangling emotion and content

Disentanglement losses

Trending Tags