EMOCA: Emotion Driven Monocular Face Capture and Animation, Paper Review | Gihun Son

Posted Dec 14, 2023

By Gihun Son 10 min read

Abstract

Existing methods fail to fully capture facial expressions.
The paper finds that the ‘standard reconstruction metrics’ used for training (landmark reprojection error, photometric error, face recognition loss) are insufficient for capturing high-fidelity expressions. ⇒ The resulting facial geometries do not match the expression in the input image.
The paper addresses this with “EMOCA (EMOtion Capture and Animation)”. ⇒ A novel ‘deep perceptual emotion consistency loss’
- It helps the reconstructed 3D expression match the expression depicted in the input image.
The paper also directly regresses the levels of ‘valence’ and ‘arousal’ and classifies ‘basic expressions’ from the estimated 3D face parameters.
- valence: how positive or negative an emotion is.
- arousal: how intense or calm an emotion is.

1. Introduction

The paper designs “EMOCA (EMOtion Capture and Animation)”. ⇒ It learns an animatable face model from in-the-wild images without 3D supervision.
The authors train a SOTA ‘emotion recognition’ model and use it as supervision when training EMOCA.
EMOCA also uses a novel ‘perceptual emotion consistency loss’. ⇒ It pushes the emotional content of the input and of the rendered reconstruction to be similar.

The new ‘emotion consistency loss’ leads to better emotion reconstruction, but on its own it is not sufficient.
The large image datasets used by existing 3D reconstruction models cover many subjects of diverse ethnicities, but they lack emotional variety.
Conversely, large datasets annotated with facial expression, valence, and arousal are emotion-rich, but they do not provide multiple images of each subject under varied conditions.
Multiple images of the same person are essential for training SOTA 3D face reconstruction methods.
To address this, EMOCA builds on DECA, which achieves SOTA ‘identity shape reconstruction accuracy’.
- Specifically, a ‘trainable prediction branch’ for ‘facial expression’ is added to the DECA architecture. (The rest of the architecture is left unchanged.)
This makes it possible to train the expression part of EMOCA on emotion-rich image data. ⇒ Emotion reconstruction performance improves (while retaining DECA’s identity face shape quality).

After training, EMOCA reconstructs a 3D face from a single image. (Fig. 1)
In terms of ‘reconstructed expression quality’, EMOCA significantly outperforms previous SOTA models.
EMOCA preserves SOTA identity shape reconstruction accuracy.
The ‘expression parameters’ regressed by EMOCA carry enough information for in-the-wild emotion recognition.

Figure 2

“Coarse training stage (green box)”
- The input image is fed to the ‘coarse shape encoder’ (initialized from DECA and kept fixed) and to EMOCA’s ‘trainable expression shape encoder’.
- From the regressed ‘identity shape’, ‘expression shape’, ‘pose’, and ‘albedo’ parameters, a ‘textured 3D mesh’ is reconstructed using FLAME’s geometry and albedo models as fixed decoders. ⇒ The mesh is then rendered by a ‘differentiable renderer’ using the regressed ‘camera’ and ‘spherical harmonics lighting’.
- The paper’s new ‘emotion consistency loss’ (Eq. 8) penalizes the difference between the ‘emotion features’ of the ‘input image’ and those of the ‘rendered coarse shape’ (after both images have passed through the ‘fixed emotion recognition network’).
“Detail training stage (yellow box)”
- EMOCA’s expression encoder is kept fixed, and the regressed ‘expression (and jaw-pose)’ parameters are used to condition the ‘detail decoder’.

3. Preliminaries

Face model

“FLAME” is a ‘statistical 3D head model’ controlled by several sets of parameters.
- identity shape $\beta \in \R^{|\beta|}$
- facial expression $\psi \in \R^{|\psi|}$
- pose parameters $\theta \in \R^{3k+3}$
  - rotations of the $k=4$ joints (neck, jaw, eyeballs) plus the global rotation, i.e. $3(k+1)$ parameters
Given all of these parameters, FLAME outputs a mesh with $n_v=5023$ vertices.
“FLAME”: $M(\beta,\theta,\psi) \rightarrow (V,F)$
- vertices $V \in \R^{n_v \times 3}$
- $n_f=9976$ faces $F \in \R^{n_f \times 3}$
FLAME is used together with an ‘appearance model’ that has been converted from the ‘albedo space of the Basel Face Model’ into ‘FLAME’s UV layout’.
Given parameters $\alpha \in \R^{|\alpha|}$, the ‘appearance model’ outputs a ‘FLAME texture map’ $A(\alpha) \in \R^{d \times d \times 3}$.
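
As a quick sanity check of the sizes quoted above, the pose dimension and mesh shapes can be worked out in a few lines. This is only a bookkeeping sketch: `pose_dim` and `mesh_shapes` are hypothetical helper names, not part of FLAME or of the paper's code, and the numbers are simply the ones stated in the text.

```python
# Sizes stated in the text: k = 4 joints, 5023 vertices, 9976 faces.
K_JOINTS = 4
N_VERTICES = 5023
N_FACES = 9976


def pose_dim(k: int) -> int:
    """Three rotation values per joint plus three for the global rotation."""
    return 3 * (k + 1)


def mesh_shapes(n_v: int = N_VERTICES, n_f: int = N_FACES):
    """Shapes of the vertex array V and the face array F returned by M."""
    return (n_v, 3), (n_f, 3)


print(pose_dim(K_JOINTS))  # 15
print(mesh_shapes())       # ((5023, 3), (9976, 3))
```

So with $k=4$ the pose vector $\theta$ has 15 entries.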

Face reconstruction

Face reconstruction is almost identical to DECA, so it is only summarized here. See the DECA paper review for details.

Feeding the image ($I$) into the ‘Coarse Encoder ($E_c$)’ yields the parameters $\beta, \theta, \psi, \alpha, l, c$.

Feeding the image ($I$) into the ‘Detail Encoder ($E_d$)’ yields the ‘detail code ($\delta$)’.

Feeding the ‘Coarse Encoder’ outputs $\psi, \theta_{jaw}$ and the ‘Detail Encoder’ output $\delta$ into the ‘Detail Decoder ($F_d$)’ produces the ‘expression-dependent detail UV displacement map ($D$)’.

Rendering the coarse shape with the ‘rendering function ($R$)’ gives $I_{Rd}$. To render the FLAME mesh with the ‘expression-dependent details’ added, the displacement map $D$ is converted into the normal map $N_d$, which is then passed to $R$ as input.

The relative keypoint loss is the same as in DECA.

This part is the main difference from the “DECA” model.
The paper uses a ResNet-50 backbone followed by an FC layer as the ‘emotion network’. ⇒ Its outputs are ‘expression classification’, ‘valence’, and ‘arousal’.
The ‘emotion network’ is trained on ‘AffectNet’, a large-scale annotated emotion dataset.
The loss functions are:
- ‘expression classification’ ⇒ categorical cross entropy
- ‘valence’, ‘arousal’ ⇒ mean squared error, correlation coefficient loss
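
The three loss terms listed above can be sketched in plain Python as follows. This is an illustration under assumptions, not the paper's implementation: the exact form of the correlation coefficient loss and the weighting of the terms are not given in this post, so `1 - Pearson correlation` and an unweighted sum are stand-ins chosen here, and all function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Categorical cross entropy for one sample, given raw class scores."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def mse(pred, target):
    """Mean squared error over a batch of scalar predictions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson(pred, target):
    """Pearson correlation; undefined if either sequence is constant."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def correlation_loss(pred, target):
    """Assumed form: 1 - Pearson correlation (0 when perfectly correlated)."""
    return 1.0 - pearson(pred, target)


def emotion_training_loss(logits_batch, labels, val_pred, val_gt, aro_pred, aro_gt):
    """Unweighted sum of the three terms; the weighting is an assumption."""
    ce = sum(cross_entropy(z, y) for z, y in zip(logits_batch, labels)) / len(labels)
    reg = mse(val_pred, val_gt) + mse(aro_pred, aro_gt)
    corr = correlation_loss(val_pred, val_gt) + correlation_loss(aro_pred, aro_gt)
    return ce + reg + corr
```

The correlation term rewards predictions that move together with the annotations across a batch, which the per-sample MSE alone does not capture.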
After training, the prediction head (FC layer) is discarded.
The output of the trained ‘emotion network’ is then the ‘emotion feature’ $\epsilon \in \R^{|\epsilon|}$.
The ‘emotion network’ is written as $A(I) \rightarrow \epsilon$.

Contributions of “EMOCA”
- It proposes a novel ‘emotion consistency loss’. ⇒ Used as supervision, it increases the ‘emotion similarity’ between the input and the rendered image.
- It trains only the expression part of EMOCA on ‘emotion-rich image data’ while keeping DECA’s identity shape reconstruction performance. (Everything else uses DECA’s trained model.)

EMOCA builds on the DECA architecture and adds a new approach so that ‘expression’ is captured well (so that the rendered image matches the input).

A model like ‘DECA’ cannot be trained on ‘emotion-rich image data’. ⇒ Training the ‘identity shape reconstruction’ of $E_c$ requires multiple training images of the same subject for regularization.
“EMOCA” therefore modifies DECA by adding an ‘expression encoder’. ⇒ expression encoder $E_e(I)\rightarrow \psi_e$

During training the weights of $E_c$ are fixed, so its predictions of $\beta, \theta, \alpha, l, c$ are unchanged.
DECA’s $\psi$, however, is not used.
$R(M(\beta, \theta, \psi_e),\alpha,l,c) \rightarrow I_{Re}$ is the rendering obtained from the ‘expression of the input image’ ($\psi_e = E_e(I)$) together with the outputs of $E_c$.
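
The data flow of $R(M(\beta, \theta, \psi_e),\alpha,l,c) \rightarrow I_{Re}$ can be sketched as a small function. Everything here is a hypothetical stand-in: the encoders, FLAME model, and renderer are passed in as plain callables, and the toy versions below only return labels so that the wiring is visible. It is not the actual EMOCA code.

```python
def emoca_coarse_forward(image, coarse_encoder, expression_encoder, flame, render):
    """Render I_Re from the frozen E_c outputs and the trainable E_e expression."""
    params = coarse_encoder(image)     # frozen E_c: beta, theta, psi, alpha, l, c
    psi_e = expression_encoder(image)  # trainable E_e
    # DECA's psi is discarded; psi_e takes its place in the FLAME model M.
    mesh = flame(params["beta"], params["theta"], psi_e)
    return render(mesh, params["alpha"], params["l"], params["c"])


# Toy stand-ins that return labels instead of tensors (illustration only).
def toy_coarse_encoder(image):
    return {"beta": "beta", "theta": "theta", "psi": "psi_deca",
            "alpha": "alpha", "l": "l", "c": "c"}


def toy_expression_encoder(image):
    return "psi_e"


def toy_flame(beta, theta, psi):
    return ("mesh", beta, theta, psi)


def toy_render(mesh, alpha, light, camera):
    return (mesh, alpha, light, camera)


rendered = emoca_coarse_forward(
    "I", toy_coarse_encoder, toy_expression_encoder, toy_flame, toy_render
)
print(rendered)  # (('mesh', 'beta', 'theta', 'psi_e'), 'alpha', 'l', 'c')
```

Only `expression_encoder` would receive gradient updates in this setup; the other callables stay fixed.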

Training only $E_e$ has several advantages.
- Multiple images per subject are not needed.
- Identity prediction is not trained, so no face recognition loss is used.
- The other parameters are fixed, so no landmark reprojection loss is used.
- Fewer parameters are trained, so training needs fewer resources.

The loss functions above are very similar to DECA’s, so only the parts that differ are examined in detail.

This is where EMOCA differs most from DECA, so let us look at it closely.
The ‘emotion consistency loss’ computes the difference between the ‘emotion features of the input image’ $\epsilon_I=A(I)$ and the ‘emotion features of the rendered image’ $\epsilon_{Re}=A(I_{Re})$.
- $L_{emo}=d(\epsilon_I,\epsilon_{Re})$, where $d(\epsilon_1,\epsilon_2)=\|\epsilon_1-\epsilon_2\|_2$
Instead of measuring a geometry error, $L_{emo}$ measures the ‘perceptual difference’ between the input image and the rendered image.
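
The distance $d$ is just an L2 norm between two feature vectors, which can be written out directly. A minimal sketch, assuming the emotion features are given as plain lists of floats; `emotion_consistency_loss` is a hypothetical name, and in practice the features would come from the fixed emotion network $A$ and the computation would run on tensors.

```python
import math


def emotion_consistency_loss(eps_input, eps_rendered):
    """L2 distance between the emotion features of the input and the rendering."""
    if len(eps_input) != len(eps_rendered):
        raise ValueError("emotion features must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(eps_input, eps_rendered)))


print(emotion_consistency_loss([0.0, 3.0], [4.0, 0.0]))  # 5.0
```

The loss is zero only when both images produce the same emotion feature, regardless of how the underlying geometry differs.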