Learning an Animatable Detailed 3D Face Model from In-The-Wild Images(DECA) 논문 리뷰 | Gihun Son

Learning an Animatable Detailed 3D Face Model from In-The-Wild Images(DECA) 논문 리뷰

Posted Nov 16, 2023

By Gihun Son 45 min read

Abstract

기존 monocular 3D face reconstruction method는 geometric details를 꽤 잘 복원하지만, 몇몇의 한계점이 존재한다. → 몇몇의 method들은 expression(표정)에 따라 주름이 어떻게 변화하는지 modeling하지 않기 때문에 사실적으로 움직이지 못하는 face를 생성한다. → 다른 method들은 high-quality fcae scans으로 학습이 되지만, 실제 image들에 일반화가 잘 되지 않는다.
논문은 개인마다 specific하지만, 표정에 따라 달라지는 3D face shape와 animatable(생동적인) detail을 regression하는 첫 접근법을 제시한다.
논문의 모델 DECA(Detailed Expression Capture and Animation)은 person-specific detail parameters와 generic expression parameters로 구성된 low-dimensional latent representation을 통해 robust하게 UV displacement map을 만들어내도록 학습된다. → Regressor는 single image로부터 detail, shape, albedo(반사도), expression, pose, illumination(채도) parameter들을 학습한다.

UV displacement map: 3D image를 2D로 변환하기 위해 사용한다. 3D image는 좌표가 x, y, z인 3차원으로 이루어져 있지만, texture는 2차원(x, y)로 표현되기 때문에, 이를 UV map으로 나타내는 것

이를 가능하게 하기 위해, 논문은 새로운 detail-consistency loss를 소개한다. ⇒ expression에 의한 주름으로부터 person-specific detail을 구별해준다.
이러한 구별은 바뀌지 않는 person-specific details는 유지하면서, expression(표정) parameter를 조절하여 현실적인 person-specific wrinkles를 synthesize한다.
DECA는 3D supervision과 paired되지 않은(unlabeled) 실제 image들을 통해 학습하고, shape reconstruction에서 SOTA의 성능을 갖는다.

1. Introduction

single image로 3D facial geometry를 reconstruction한 기술이 나온지 20년이 되었다. 그 이후 3D face reconstruction method는 빠르게 발전하였고, 3D avatar creation, video editing 등의 기술들에 적용이 가능해졌다. 문제를 다루기 쉽게 만들기 위해, 가장 많이 존재하는 method는 pre-computed 3D face models를 이용하여, geometry 또는 appearance의 이전 지식을 포함한다. 이 모델들은 coarse(거친) face shape을 만들지만, 사실성에 필수적이고 human emotion분석에 도움이 되는 geometric details (ex. expression-dependent wrinkles)를 잡아내지 못한다.
몇몇의 method들은 facial geometry의 detail을 복원하지만, high-quality training scan이 필요하거나, occlusion에 취약하다. 어떤 연구들도 expression의 변화에 따른 wrinkle변화의 복원에 대해서는 탐구하지 않았다. expression-dependent detail을 학습하는 이전 method들은 모두 detailed 3D scans를 training data로 사용하고, 결과적으로 일반적인 image들에 generalization하지 않거나, expression-dependent details를 geometry가 아닌 appearance map의 일부로 modeling하여 realistic mesh relighting을 방지한다.

논문은 DECA(Detailed Expression Capture and Animation)을 소개한다. ⇒ 2D-to-3D supervision없는 in-the-wild(일반적인) image로부터 animatable displacement model을 학습한다.
이전 연구들과 대조적으로, 이 animatable expression-dependent wrinkles는 각 개인에 특화되어 있고, 하나의 image로부터 regression된다.
구체적으로 DECA는 1) subject-specific detail parameter와 expression parameter들로 구성된 low-dimensional representation을 통해 UV displacement map을 만드는 Geometric detail model 2) subject-specific detail, albedo, shape, expression, pose, lighting 등의 parameter를 image로부터 예측하는 Regressor 를 같이 학습한다.

⇒ detail model은 FLAME모델의 coarse geometry를 기반으로 한다. ⇒ 논문은 displacement(변위)를 subject-specific detail parameter와 FLAME의 jaw(턱) pose와 expression parameter의 함수로 표현한다.

이 것은 single image로부터 쉽게 avatar creation과 같은 중요한 분야를 가능하게 한다.
이전의 method들이 image에서 detailed geometry를 capture할 수 있었지만, 대부분의 분야들은 생동감있는(animated) face를 필요로 한다. →input image에서 정확한 geometry를 복원하는 것만으로는 충분하지 않다.
논문은 detailed geometry를 animate하게 할 수 있어야 하고, 특히 detail들이 person specific해야 한다고 주장.

person-specific details(ex mole(점), pore(모공), eyebrow, expression-dependent wrinkles)을 유지하면서, reconstructed face에서 expression-dependent wrinkles 통제능력을 얻기 위해서는, person-specific details와 expression-dependent wrinkles가 반드시 구별(disentangled)되어야 한다.
논문의 주요 기여는 새로운 detail consistency loss이다. →disentanglement를 강제한다.
training시에, 서로 다른 expression을 갖는 같은 사람의 image 2장이 주어지면, 3D face shape와 person-specific details가 2개의 image에서 같지만 expression과 expression에 따른 wrinkle intensity가 다르다는 것을 발견했다.
논문은 training에서의 이러한 특성을 이용한다. ⇒ 같은 사람의 다른 사진 사이의 detail code를 바꾼다. ⇒ 새롭게 만들어진 결과가 원래의 image와 유사하게 보이도록 강제한다.
한번의 training으로, DECA는 single image로부터 detailed 3D face를 real time에 reconstruct하고(Fig1, 3번째 줄), realistic adaptive expression wrinkle이 있는 reconstruction을 animate하게 만들 수 있다(Fig1, 5번째 줄).

Fig 1.

[Fig1에 대한 설명]

1번째 줄: Sample image

2번째 줄: Coarse shape Regression

3번째 줄: Detail shape

4번째 줄: Coarse shape reposing

5번째 줄: Reposing with person-specific details

6번째 줄: Source expression은 DECA에 의해 추출되어 표현되는 것

즉, DECA를 통해 Reposing을 진행하는 것이다.

요약하자면, 논문의 주요 기여도는 다음과 같다.

1) 일반 images를 통해 animatable displacement model을 학습하는 첫 접근법 →expression parameter들을 변화시킴으로써 그럴듯한 geometric details를 synthesize한다.

2) 새로운 detail cosistency loss

→ identity-dependent, expression-dependent한 facial details를 구별한다.

3) 기존 기법들과는 다르게 geometric details를 Reconstruction

→ Occlusion, wide pose variation, ilumination variation에 강인하다.

→ 이는, 논문의 low-dimensional detail representation, detail dientanglement, 일반 images의 large dataset을 이용한 training을 통해 가능할 수 있다.

4) 2개의 benchmark에서 SOTA달성

5) 코드와 모델을 연구목적으로 사용할 수 있음

visual input으로부터의 3D face reconstruction은 지난 10년간 상당한 관심을 받아왔다. (multi-view image로부터 3D face reconstruction을 한 첫번째 method가 소개된 이후로)
관련 연구의 큰 틀은 다양한 input modality(ex multi-view images, video data, RGB-D data, subject-specific image collection)로부터 3D face를 reconstruction하는 것이었지만, 논문은 하나의 RGB image만을 사용한 방법에 중점을 두었다.

Coarse Reconstruction

많은 monocular 3D face reconstruction method는 Vetter와 Blanz를 따른다. → analysis-by-synthesis fashion의 pre-computed statistical(통계적) model의 coefficient를 예측
이러한 method들은 optimization-based 또는 learning based method로 분류될 수 있다.
위 method들은 statistical face model의 parameter들을 예측한다. → low-frequency shape정보를 capture하는 fixed linear shape space에서
이는 매우 smooth한 reconstruction을 한다.

몇몇의 연구들은 model-free하고, 직접적으로 3D face voxel이나 mesh(그물망)을 regression하고, 결과적으로 model-based method들보다 더 다양하게 capture할 수 있다.
하지만, 하지만 이와 같은 methode들은 정확한 3D supervision이 필요하다. → optimization-based model fitting 또는 statistical face model을 sampling함으로써 생성된 synthetic data로부터 제공된다.
따라서 coarse shape variation만을 capture할 수 있다.

high-frequency geometric detail를 capture하는 대신, 어떤 method들은 figh-fidelity textures에 따라 coarse facial geometry를 reconstruct한다.
이 shading detail을 texture에 “bake”하면, lighting changes가 이 details에 영향을 미치지 못한다. ⇒ realism과 application의 범위에 제한이된다. (bake: bake는 Computer Graphics에서 세부 정보를 texture에 rendering한다는 것을 의미, baking은 render mapping이라고도 부른다)
animation과 relighting이 가능하도록, DECA는 이러한 details를 geometry의 일부로 capture한다.

Detail Reconstruction

연구의 다른 주요부분은 “mid-frequency” detail을 갖는 face를 reconstruction하는 것이다.
일반적인 optimization-based method들은 statistical face model을 image에 fit하여 coarse shape estimate를 획득한다. → shape from shading(SfS) meothd를 따라, single image 또는 video로부터 facial deatils를 reconstruct한다.
DECA와 다르게, 이러한 방법들은 느리고, occlusion에 취약하며, coarse model fitting step은 facial landmarks가 필요하다. → 많은 viewing angles와 occlusion에 대한 오류가 발생하기 쉽게 만든다.

대부분의 regression-based approach들은 coarse shape를 얻기 위한 statistical face model의 parameter들은 reconstruction하는 비슷한 접근을 한다. 그 후 localized detail을 capture하기 위한 refinement step을 진행
‘Chen’과 ‘Cao’는 high resolution scan으로부터 local wrinkle statistic을 계산하고, image나 video로부터 fine-scale detail reconstruction을 위해 사용
‘Guo’와 ‘Richardson’은 직접적으로 per-pixel displacement map을 regression한다.
위 모든 method들은 non-occluded region(occlusion이 없는)에서 fine-scale detail을 reconstruction한다. ⇒ occlusion이 있는 곳에서 visible artifacts(noise)를 발생시킬 수 있다.
‘Tran’은 face segmentation method를 적용하여 occluded region을 알아냄으로써 occlusion에 강인한 성능을 얻었고, occluded region을 다루는 것에example-based hole filling approach를 사용하였다.
더 나아가, coarse reconstruction에 detail을 추가하기 위해 detailed mesh 또는 surface normal(법선 벡터)를 직접적으로 reconstruction하는 model-free method가 존재한다.

mesh

객체의 외관을 생성하기위해 연결된 다각형(Polygon)들의 집합

vertex(점)가 모여 Polygon(다각형)이 되고, Polygon이 모여 Mesh가 되는 것이다.

‘Tran’과 ‘Tewari’는 statistical face model을 jointly learning하고, image로부터 3D face를 reconstruction한다. ⇒ jointly learning(jointly training): 여러개의 loss를 더하여 여러 task를 한번에 학습하는 방법

→ fixed statistical model보다 flexibility를 제공하면, 이 method들은 다른 detail reconstruction method들보다 한정된 geometric detail을 capture한다.

‘Lattas’는 diffuse normal과 specular normal을 추론하기 위해 image translation networks를 사용한다. ⇒ realistic rendering이 가능 → DECA와 다르게, detail reconstruction method들은 animatable details를 제공하지 않는다.

Animatable Detail Reconstruction

대부분 DECA와 관련있는 것은 animation이 존재하는 detailed face reconstruction method이다.
존재하는 method들은 wrinkles 또는 나이와 성별과 같은 특성들, high-quality 3D face mesh의 pose 또는 expression의 연관성을 학습한다.
‘Fyffe’는 dynamic video frame으로부터 계산된 optical flow correspondence를 static high-resolution을 animate하는 것에 사용한다.
대조적으로, DECA는 paired 3D training data없이 일반적인 사진 하나만으로 animatable detail model을 학습한다.
FaceScape모델은 single image로부터 animatable 3D face를 예측하지만, occlusion에 취약하다. ⇒ 이것은 2개의 reconstruction process때문이다.
1. coarse shape optimization
2. coarse reconstruction에서 extract된 texture map으로부터 displacement map을 예측한다.

‘Chaudhuri’는 dynamic(expression-dependent) albedo map에 따라 identity와 expression을 바로잡는 blendshapes를 학습한다. (blendshapes란 mesh의 virtex를 변형시켜 shape에 변형을 주는 것) → 위 방법에서는 geometric detail을 albedo map의 일부분으로 modeling하여, 이 details의 명암이 다양한 빛 변화에 조정되지 못한다. 이는 unrealistic rendering을 유발한다.
반면 DECA는 details를 geometric displacements로 modeling하였다. → 빛이 바뀌어도 자연스럽게 보인다.

요약하자면, DECA는 유일무이하다.
DECA는 single image를 input으로 사용하고, 현실성 있게 animate되게 해주는 person-specific details를 만들어낸다.
몇몇의 method들은 higher-frequency pixel-aligned detail을 만들지만, animatable하지 않다.
여전히 다른 method들은 high-resolution scan이 training에 필요하다.
논문은 high-resolution scan이 필수적이지 않다는 것을 보여주고, paired 3D ground truth가 없는 2D image만으로도 animatable details를 training할 수 있다는 것을 보여준다.
이것은 편리할 뿐만 아니라, DECA가 현실 속 변화의 넓은 다양성을 robust하게 학습할 수 있음을 의미한다.
DECA의 elements는 잘 알려진 원리(‘Vetter and Blanz’부터)를 기반으로 만들어졌지만, 논문의 기여도는 새롭고 필수적이라는 것을 강조한다. ⇒ DECA가 동작할 수 있게 하는 주요한 점은 detail consistency loss이다.(이전에 없었다)

3. Preliminaries

FLAME은 ‘분리된 linear identity shape과 expression space의 linear blend skinning(LBS)’와 neck, jaw, eyeball을 잘 표현하는 pose-dependent corrective blendshapes를 결합한 statistical 3D head model이다.

Linear Blend Skinning(LBS)

LBS는 3차원 물체를 컴퓨터 그래픽으로 형상화시킬 때에 사람의 skeleton structure(뼈 구조)로부터 Mesh를 만드는 방법이다.

장점: 연산량이 적고 적당한 mesh생성 가능, Translation이나 Rotation, Scaling과 같은 Transformation을 바로 적용 가능

단점: 접히거나 꼬이는 부분에 대한 표현이 부자연스러움

각 joint의 영향을 받는 mesh는 각각의 transformation에 따라 동작
Linear Blend Skinning 알고리즘이 사용되는 부분은 1, 2번 joint가 모두 영향을 미치는 mesh이다.

각 joint에 rotation을 주었을 때, 1, 2번 joint 모두 영향을 받는 mesh의 vertex위치를 계산해야 한다. (LBS)

joint의 weight map을 통해 vertex의 위치를 계산하는데, weight map에 대한 자세한 내용은 추후에 공부해보려고 한다.

facial identity(β), pose(θ)(k는 neck, jaw, eyeball의 4개의 joint), expression(ψ)의 parameter들이 주어졌을 때, FLAME은 n=5023개의 vertex를 갖는 mesh를 반환한다. 모델은 식(1)과 같이 정의된다.

blend skinning function: W(T, J, θ, w) ⇒ T: vertex rotation parameter ⇒ J: joint location ⇒ w: blendweights (linearly smooth)
J는 identity(β)의 function으로 정의된다.

T는 “zero pose”를 의미한다. → shape blendshapes $B_s(β;S)$, pose correctives $B_p(θ;P)$, expression blendshapes $B_E(ψ;ϵ)$를 통해 shape이 추가된다. (S, P, ϵ에 기반을 두는 학습된 identity, pose, expression)

FLAME은 appearance model이 없다. ⇒논문은 Basel Face Model의 linear albedo subspace를 FLAME UV layout으로 변환시켰다.
appearance model은 UV albedo map(A(α))을 출력한다.(α는 albedo parameters)

in-the-wild face datasets에 존재하는 사진들은 보통 distance의 질을 떨어뜨린다.
따라서 논문은 orthographic(정사영) camera model(c)를 사용한다. ⇒ 3D mesh를 image space로 투영시킨다.
Face vertex들은 $v = 𝑠Π(𝑀_𝑖)+t$에 따라 image로 project된다. $M_i$: M에 존재하는 vertex $Π$: orthographic(정사영) 3D-2D projection matrix $s, t$: isotropic(등방성) scale과 2D translation을 의미한다.(s, t는 c로 요약됨)

Face reconstruction에서 가장 자주 사용되는 illumination model은 Spherical Harmonics(SH) 을 기반으로 한다. ⇒ illumination model: 광원에 의한 pixel의 밝기를 계산하는 방법
light source(광원)이 멀리 떨어져 있고, face의 surface reflectance(표면 반사)가 Lambertian이고 가정할 때, shaded face image는 다음과 같이 계산된다.

⇒ Lambertian Surface: 어떤 방향에서 보아도 똑같은 밝기 값을 보이는 표면. 보통 움직이며 같은 지점을 바라볼 때, 밝기 값이 변화한다. 하지만 Lambertian surface는 바뀌지 않는 것으로 가정한다. 즉 모든 방향으로 같은 양의 빛을 반사한다고 가정하는 것. (A4용지와 같이 가공되지 않은 표면은 Lambertian surface에 가깝다고 볼 수 있지만, 차 표면과 같이 매끈한 것은 Lambertian surface와 다르다.

albedo(A), surface normal(N), shaded texture(B) 는 UV coordinates에 존재하고, (i,j)는 UV coordinate의 pixel위치를 의미한다. SH(Spherical Harmonics)의 basis와 coefficients는 $H_k$와 $l$로 정의되어 있다. (평면의 bias는 (1,0), (0,1)이다). ⊙는 Hadamard product를 의미한다.

Hadamard product

geometry parameters $(β,θ, ψ)$, albedo(α), lighting($l$), camera information(c)가 주어졌을 때, 2D image($I_r$)을 생성할 수 있다. → $I_r=R(M, B, c)$ ($R$은 rendering function)
FLAME은 low-dimensional latent space로부터 다양한 poses, shapes, expressions를 갖는 face geometry를 생성할 수 있다.
하지만 model의 representational power는 low mesh resolution에 의해 제한되고, 그로 인해 mid-frequency detail이 FLAME의 surface에서 대부분 소멸된다.
다음 section에서는 논문의 expression-dependent displacement model을 소개한다. ⇒mid-frequency detail이 있는 FLAME을 입증한다.
또한, single image로부터 geometry를 어떻게 reconstruct하고, animatable하게 만드는지 증명할 것이다.

4. Method

DECA는 in-the-wild training images을 단독으로 사용하여 geometric detail이 존재하는 parameterized face model을 regression하도록 학습된다. (Fig2, Left)
training 후에, DECA는 single face image($I$)로부터 detailed face geometry가 있는 3D head를 reconstruct할 수 있다.(Fig2, Right)
reconstructed details의 학습된 parameterization(매개변수화)는 detail reconstruction이 animatable하게 만들 수 있다. → FLAME의 expression과 jaw pose parameter를 조절함으로써
위는 person-specific detail이 바뀌지 않도록 유지하면서 새로운 wrinkles를 synthesize할 수 있다.

[Fig2에 대한 설명]

Training 시에(left box),

DECA는 각 image별로 face shape reconstruction을 위한 parameter들을 예측한다. ⇒ shape consistency information의 정보를 참고하여(파란색 화살표)
그 후, expression-conditioned displacement model을 학습한다. ⇒ 같은 individual의 여러 image들로부터 얻은 detail consistency information을 활용하여(빨간색 화살표)
지금까지는 analysis-by-synthesis pipline은 기존과 같지만, Fig2 left의 노란색 박스 부분은 논문의 새로운 key이다. ⇒ 위 displacement consistency loss는 Fig3에 더 자세히 나와있다.

Training된 후에(right box),

DECA는 reconstructed source identity의 shape, head pose, detail code를 reconstructed source expression의 jaw pose, expression parameter와 결합함으로써 face를 animatable하게 만든다. ⇒ animated coarse shape과 animated displacement map을 얻기 위한 연산
최종적으로 DECA는 animated detail shape을 출력한다.

(위 이미지들은 NoW Dataset에서 가져온 것이지만, DECA를 학습시킨 것이 아니라 illustration의 목적으로 사용된 것)

결국 기존 논문들과 차별점은 displacement map을 통해 detail을 학습하는 부분인 것 같다.

Detail Consistency Loss: 논문에서 강조하는 부분(새로운 method)
DECA는 같은 사람의 여러 image들을 training에 사용한다. ⇒ expression-dependent detail(표정에 따른 detail)로부터 static person-specific details를 구별하기 위해
적절하게 factored된다면, 한 사람의 하나의 이미지로부터 detail code를 얻을 수 있고, 이를 통해 different expression을 갖는 그 사람의 다른 image를 reconstruct하는 것에 사용할 수 있다.

Key Idea

DECA의 key idea는 각 individual의 face가 서로 다른 details(i.e. wrinkles)을 보인다는 것을 발견한 것에 기반을 둔다. ⇒ facial expression에 따라 details가 달라지지만, shape의 다른 properties(특징)는 바뀌지 않는다.
따라서, facial details는 static person-specific details와 dynamic expression dependent details(i.e. wrinkles)로 구별되어야 한다.

⇒ static person-specific detail: expression에 따라 변화하는 것이 아닌, 각 individual마다 다른, 하지만 같은 individual이라면 항상 같은 detail을 의미하는 것

⇒ dynamic expression dependent detail: wrinkles와 같은 expression에 따라 변화하는 detail

하지만 static, dynamic facial details를 구별하는 것은 간단한 task가 아니다.
static facial details는 사람마다 다르지만, dynamic expression dependent facial details는 같은 사람이라도 매우 다양하다.
따라서 DECA는 expression-conditioned detail model을 학습한다. ⇒ person-specific detail latent space와 expression space 둘 다로부터 facial details를 추론하기 위함

detail displacement model을 학습시키는 것에 가장 까다로운 점은 training data가 부족하다는 것이다.
이전 연구들은 detailed facial geometry를 획득하기 위해, 통제된 환경에서 전문적인 카메라 시스템으로 사람을 scan하여 사용했다.
하지만 이 방법은 다양한 표정과 인종, 나이에 따른 다양성을 갖는 많은 identity를 capturing하는 것에 비용이 많이 들고 비실용적이다.
따라서 논문은 in-the-wild images로부터 detail geometry를 학습하는 방법을 제안한다.

4.1 Coarse Reconstruction

논문은 먼저 analysis-by-synthesis방식으로 coarse reconstruction을 학습한다. ⇒ 2D image($I)$이 input으로 주어지면, image를 latent code로 encoding하고, 2D image($I_r$)을 synthesize하기 위한 decoding을 한다. ⇒ 그 후, input과 synthesized image의 difference를 최소화한다.
Fig2에서 볼 수 있듯이, 논문은 encoder $E_c$을 low-dimensional latent code를 regression할 수 있도록 학습시킨다. ⇒ $E_c$는 FC layer가 있는 ResNet50으로 구성되어 있다.
latent code는 다음과 같이 구성된다.

⇒FLAME parameters β(100개의 FLAME shape parameters), ψ(50개의 expression parameters), θ

위 parameter들은 coarse geometry를 나타냄

⇒albedo coefficient α(50개의 albedo parameters)

⇒camera c

⇒ lighting parameters $l$

($E_c$는 총 236 dimensional latent code를 예측한다.)

subject당 multi image, corresponding identity labels($c_i$), image당 68개의 2D keypoints($k_i$) 와 함께 2D face images($I_i$) dataset이 주어졌을 때, coarse reconstruction branch는 위 식(4)를 minimize하도록 학습된다. ⇒ $L_{lmk}$: landmark loss ⇒ $L_{eye}$: eye closure loss ⇒ $L_{pho}$: photometric loss ⇒ $L_{id}$: identity loss ⇒ $L_{sc}$: shape consistency loss ⇒ $L_{reg}$: regularization

Landmark Re-projection Loss

Landmark: 얼굴에서 주요 부분(눈, 코, 턱 등)을 나타내는 좌표, 대부분 68개의 좌표로 나타낸다.

landmark loss는 ground-truth 2D face landmarks($k_i$) 와 FLAME model의 surface위의 corresponding landmarks($M_i$)(예측된 camera model에 의해 image로 project됨) 사이의 difference를 측정한다. 식(5)에서 해당 Loss를 확인할 수 있다.

Eye Closure Loss

Eye closure loss는 눈꺼풀 위 아래의 lendmark $k_i$와 $k_j$사이의 relative offset을 계산한다.
이후 image로 project된 FLAME의 surface위의 corresponding landmarks $M_i$와 $M_j$사이의 relative offset과의 difference를 측정한다. Loss의 식은 위 식(6)과 같다.

$E$는 upper/lower eyelid landmark pairs의 set이다.
landmark loss($L_{lmk}$)는 완전한 landmark location differences에 불리하도록 동작한다면, Eye closure loss($L_{eye}$)는 eyelide landmark사이의 relative difference에 대해 불리하도록 동작한다.
eye closure loss($L_{eye}$)가 translation invariant하기 때문에, projected 3D face와 image사이의 misalignment를 덜 의심할 수 있다.($L_{lmk}$와 비교하여)
대조적으로, eye landmark를 위한 landmark loss를 단순하게 증가시키는 것은 전체적인 face shape에 영향을 미칠 수 있고, 만족스럽지 못한 reconstruction으로 이어질 수 있다. (Fig 10은 eye-closure loss의 영향을 보여준다.)

⇒ 위 Fig 10을 보면 두번째 열의 그림이 $L_{eye}$를 적용하지 않은 DECA의 결과인데, 눈이 감긴 expression을 인식하지 못한다는 것을 알 수 있다.

Photometric Loss

Photometric Loss는 input image($I$)와 rendering image($I_r$)사이의 error를 계산한다. 식은 위와 같이 나타낼 수 있다.

$V_I$는 face skin region에서 value 1값, 다른 곳은 face segmentation method를 통해 획득한 value 0값을 갖는 face mask이다. ‘⊙’는 Hadamard product를 의미한다.
face region에서만 error를 계산하는 것은 hair, clothes, sunglasses 등에 의해 발생하는 occlusion에 robutst하게 동작하도록 한다.
만약 해당 Loss가 존재하지 않는다면, predicted albedo는 occluder의 color를 고려하게 된다. ⇒ occluder의 색은 피부색과 다르기 때문에 unnatural rendering을 유발한다. (Fig10 의 마지막 column에서 볼 수 있다.)

Identity Loss

최근의 3D face reconstruction method들은 identity loss를 사용하는 것이 realistic face shapes를 생성하는 것에 효과적이라는 것을 입증했다.
이에 동기를 받아, 논문은 training시에 identity loss를 적용하기 위해 pretrained face recognition network를 사용한다.

face recognition network($f$)는 rendered image와 input image의 feature embeddings를 출력한다. 그 후 identity loss는 2개의 embedding사이의 cosine similarity를 측정한다.

[cosine similarity]

Identity Loss($L_{id}$)는 위의 식(7)과 같이 나타낼 수 있다.
embedding사이의 error를 계산함으로써, loss는 rendered image가 person’s identity의 핵심적인 특성들을 capture하도록 한다. ⇒ rendered image가 input subject와 똑같은 사람처럼 보이도록 해준다. (Fig 10에서 $L_{id}$를 포함한 coarse shape의 결과는 input subject와 더 유사하게 보인다.)

Shape Consistency Loss

같은 subject(person)의 $I_i$와 $I_j$ 2개의 image가 주어졌을 때, coarse encoder($E_c$) 는 같은 shape parameters( $β_i$=$β_j$)를 출력해야 한다.
이전 연구들은 $β_i$와 $β_j$의 distance가 다른 subject에 해당하는 shape coefficient(고정된 값)보다 조금 더 작게 하여 shape consistency를 유지하도록 하였다.
하지만, fixed margin을 고르는 것은 여전히 challeng하다.
논문은 $β_i$를 $β_j$로 대체하는 새로운 전략을 제안한다.(다른 parameter들은 변하지 않도록 한다)
$β_i$와 $β_j$가 같은 subject를 나타낸다고 할 때, 위의 새로운 parameter set는 $I_i$를 잘 reconstruct할 것이다. 해당 과정은 아래 식(8)과 같이 나타낼 수 있다.

목표는 rendered image가 real person처럼 보이도록 만드는 것이다.
만약 해당 method가 같은 사람인 2개 image에서 face shape이 올바르게 예측한다면, 이 image들 사이의 shape parameter들을 swapping하더라도 rendered image가 구별되지 않도록 해야 한다.
따라서 논문은 swapped shape parameter들로부터 image를 rendering할 때, photometric loss와 identity loss를 사용한다.

Regularization

4.2 Detail Reconstruction

detail reconstruction은 detailed UV displacement map(D)를 포함한 coarse FLAME geometry를 augment한다.
coarse reconstruction과 유사하게, 논문은 encoder $E_d$($E_c$와 동일한 architecture)를 training ⇒ $I$를 128-dimensional latent code(δ)로 encoding하도록(subject-specific details가 표현됨)
그 후, latent code(δ)는 FLAME의 expression(ψ), jaw pose parameters($θ_{jaw}$)와 concatenate되고, detail decoder $F_d$에 의해 D로 decoding된다.

Detail Decoder

Detail Decoder는 위 식(9)와 같이 정의된다.
detail code(δ): static person-specific detail(expression에 따라 변화하지 않는 detail)을 조절
coarse reconstruction branch의 expression(ψ), jaw pose parameter($θ_{jaw}$): dynamic expression wrinkle details를 capture
rendering하기 위해, D는 normal map으로 변환된다. ⇒ normal map: vertex 3개가 모였을 때, 평면이 만들어지는데, 그 평면의 normal vector가 존재한다. 즉 normal map은 mesh의 normal vector들을 의미하는 것

Detail Rendering

detail displacement model은 mid-frequency surface details를 갖는 image를 생성할 수 있다.
detailed geometry($M’$)을 reconstruct하기 위해서, 논문은 $M$과 $M$의 surface noraml $N$을 UV space로 변환한다. ($M_{uv}$과 $N_{uv}$) 그 후 D와 combine한다. 식(10)에서 해당 연산을 볼 수 있다.

$M’$을 통해 noramls $N’$을 계산함으로써, normal map과 함께 $M$을 rendering하여 detail rendering $I’_r$을 얻을 수 있다. 식(11)과 같다.
detail reconstruction은 식(12)를 최소화하는 방향으로 학습된다.

$L_{phoD}$: photometric detail loss
$L_{mrf}$: ID-MRF loss
$L_{sym}$: soft symmetry loss
$L_{regD}$: detail regularization
논문의 estimated albedo는 50 basis vector를 갖는 linear model로 생성되었기 때문에, rendering된 coarse face image는 skin tone, basic facial attributes와 같은 low frequency information만을 recover할 수 있다.
rendering된 image의 hight frequency details는 주로 displacement map에서 도출된다. ⇒ 따라서, $L_{detail}$은 rendered detailed image와 real image를 비교하기 때문에, $F_d$는 detailed geometric information을 강제적으로 modeling하게 된다.

Detail Photometric Losses

적용된 detail displacement map과 함께, rendered image $I’_r$은 geometric details를 포함한다.
coarse rendering과 같이, 논문은 photometric loss($L_{phoD}$) 를 사용한다. ⇒식은 위와 같고, $V_I$는 visible skin pixel의 mask representing이다.

ID-MRF Loss

논문은 geometric details reconstruction을 위해 Implicit Diversified Markov Random Field(ID-MRF) Loss를 사용한다.
input image와 detail rendering이 주어졌을 때, ID-MRF Loss는 pre-trained network의 서로 다른 layer들로부터 feature patch들을 extract한다.
그 후, 두 이미지의 corresponding nearest neighbor feature patches사이의 difference를 최소화한다.
‘Larsen’과 ‘Isola’는 L1 loss가 data의 high frequency information을 복원하지 못한다는 것을 찾아냈다. ⇒따라서, 위 2개의 method는 high-frequency detail을 얻기 위해 discriminator를 사용하였다.
하지만, 이것은 불안정한 adversarial training process이다.
대신, ID-MRF Loss는 local patch level에서 generated content를 original input으로 regularize할 수 있다.⇒ 이것이 DECA가 high-frequency details를 capture할 수 있게 한다.

‘Wang’에 따르면, ID-MRF Loss는 VGG19의 conv3_2와 conv4_2 layer에서 계산된다.(식(13))
$L_M(layer_{th})$: VGG19의 $layer_{th}$ layer를 통해 $I’_r$과 $I$로부터 extract되는 feature patch에서 사용되는 ID-MRF Loss를 의미한다.
photometric loss에서와 마찬가지로, 논문은 $L_{mrf}$를 UV space의 face skin region을 위해서만 계산한다.

Soft Symmetry Loss

self-occlusions(아마 머리카락과 같은 요소들에 의한 occlusion을 의미하는 것 같다.)에 robustness를 추가하기 위해, 논문은 non-visible face parts를 regularize하기 위한 soft symmetry loss를 추가하였다.(식(14)를 minimize하도록 학습)
$V_{uv}$: UV space의 face skin mask
$flip$: horizontal flip operation
극단적인 경우, $L_{sym}$이 없으면, boundary artifcats(경계 부분의 noise)가 occluded region에 보인다.( Fig 9의 가장 아래 그림)

Detail Regularization

Detial regularization을 통해 noise를 줄여준다.

4.3 Detail Disentanglement

$L_{detail}$을 optimization하는 것은 mid-frequency detail을 갖는 face reconstruction이 가능하게 한다.
하지만, 위 Detail reconstruction을 animatable하게 만드는 것은 person specific details(i.g. moles, pores, eyebrows, expression-independent wrinkles) 를 구별하는 것이 필요 ⇒ person specific details는 expression-dependent wrinkles의 δ에 의해 제어되고, expression-dependent wrinkles는 FLAME의 expression, jaw parameters($ψ, θ_{jaw}$)에 의해 제어된다.
논문의 key observation은 같은 사람의 2개의 image는 유사한 coarse geometry와 personalized details를 갖고 있다는 것이다.

구체적으로, detail image를 rendering하기 위해서는, 같은 subject(person)의 2개의 image사이의 detail code를 교환하는 것은 rendered image에 영향을 미치지 않는다.(이는 Fig3에 나와있다.)
논문은 image(i)의 jaw, expression parameters 얻고, image(j)로부터 detail code를 extract하고, wrinkle detail을 예측하기 위해 이를 combine한다.
같은 사람의 서로 다른 이미지끼리 detail codes를 swap하면, 생성된 결과는 realistic할 것이다.

Detail Consistency Loss

같은 subject의 2개의 image $I_i$와 $I_j$가 주어질 때, Loss는 식(15)와 같이 정의된다.
위 식에서 i가 들어간 parameter들은 image $I_i$의 parameter들이고, $δ_j$는 image $I_j$의 detail code이다.
detail consistency loss는 identity-dependent details와 expression-dependent details를 구별하기 위해서 필수적이다.
detail consistency loss가 없이, person-specific detail code(δ)가 identity, expression dependent details를 capture하므로, reconstructed details는 FLAME jaw pose와 expression을 변화하여 re-pose될 수 없다.
- 논문은 $L_{dc}$의 필요, 효과성을 section 6.3에 말한다.

Fig 4는 DECA의 animatbale details의 효과를 보여준다.
source identity($I$)와 source expression($E$) (left) 의 image들이 주어졌을 때, DECA는 detail shapes (middle) 를 reconstruct하고, $E$의 expression을 통해 $I$의 detail shape을 animatable (right, middle) 하게 한다.
위와 같은 synthesized DECA expression은 reconstruct된 같은 subject의 reference detail shape과 거의 동일한 것을 볼 수 있다. (right,botton)
$I$의 reconstructed detail을 사용하는 대신(i.e. static details), coarse shape만을 animatable하게 만드는 것은, visible aritifacts를 유발한다.(right, top)

6.3 Ablation Experiment

Detail Consistency Loss

ID-MRF Loss

Figure 9(right)sms $L_{mrf}$가 detail reconstruction에 미치는 효과를 보여준다.
$L_{mrf}$가 없는 경우(Fig9, middle), wrinkle details가 reconstruct되지 않아, 전체적으로 smooth한 결과를 나타낸다.
$L_{mrf}$가 있는 경우(right), DECA가 wrinkle details를 capture한다.

Other Losses

논문은 eye-closure loss($L_{eye}$), segmentation on the photo metric loss, identity loss($L_{id}$)의 효과를 평가해보았다.
Fig10에서 DECA coarse model에서 위 loss들을 사용했을 때와 안했을 때를 질적으로 비교한다.

8. Conclusion

논문은 in-the-wild images로부터 animatable detail model을 학습하여, 하나의 image에서 detailed expression capture와 animation이 가능한 DECA를 소개했다.
DECA는 2D-to-3D supervison없이 2M in-the-wild face image를 통해 학습되었다.
DECA는 shape consistency loss를 통해 shape reconstruction분야에서 SOTA를 달성했다. ⇒ 새로운 detail consistency loss는 DECA가 person-specific detail에서 expression-dependent wrinkles를 구별할 수 있게 해준다.
low-dimensional detail latent space는 noise와 occlusion에 robust한 fine-scale reconstruction이 가능하게 하고, 새로운 loss는 identity detail과 expression-dependent wrinkle detail를 구별할 수 있게 해준다.

Paper, 3D Face Reconstruction

Paper 3D Face Reconstruction

This post is licensed under CC BY 4.0 by the author.

Abstract

1. Introduction

Fig 1.

2. Related Work

Coarse Reconstruction

Detail Reconstruction

mesh

Animatable Detail Reconstruction

3. Preliminaries

Linear Blend Skinning(LBS)

4. Method

Key Idea

4.1 Coarse Reconstruction

Landmark Re-projection Loss

Eye Closure Loss

Photometric Loss

Identity Loss

Shape Consistency Loss

Regularization

4.2 Detail Reconstruction

Detail Decoder

Detail Rendering

Detail Photometric Losses

ID-MRF Loss

Soft Symmetry Loss

Detail Regularization

4.3 Detail Disentanglement

Detail Consistency Loss

6.3 Ablation Experiment

Detail Consistency Loss

ID-MRF Loss

Other Losses

8. Conclusion

Trending Tags