Capture, Learning, and Synthesis of 3D Speaking Styles(VOCA) 논문 리뷰 | Gihun Son

Capture, Learning, and Synthesis of 3D Speaking Styles(VOCA) 논문 리뷰

Posted Dec 15, 2023

By Gihun Son 18 min read

Abstract

‘audio-driven 3D facial animation’은 널리 연구되고 있지만, ‘human-like performance’는 여전히 해결되지 않았다. ⇒ 사용 가능한 ‘3D dataset’, ‘models’, ‘standard evaluation metrics’가 부족하기 때문
위 문제를 해결하기 위해 60fps의 4D scans와 12 speakers의 audio가 있는 29분의 ‘4D face dataset’을 소개한다.
그 후, facial motion에서 identity를 factorize하여, 위 dataset을 통해 neural network를 training한다.
“VOCA(Voice Operated Character Animation)”은 어떠한 ‘speech signal’이라도 input으로 받을 수 있고(영어가 아닌 다른 언어의 speech라도 가능), 다양한 adult face들을 animate할 수 있다.
training동안 ‘subject labels’를 conditioning하는 것은 다양한 realistic speaking styles을 학습할 수 있게 해준다.
“VOCA”는 또한 animation하는 동안, ‘speaking style’, ‘identity-dependent facial shape’,’pose(head, jaw, eyeball rotation)’를 변경할 수 있는 “animator controls”를 제공한다.
VOCA는 retargeting없이 보이지 않는 subject에 쉽게 적용할 수 있는 유일한 ‘realistic 3D facial animation model’이다. ⇒ ‘in-game video’, ‘virtual reality avatars’, ‘speaker, speech, language 등을 모르는 상황’에 적합
dataset과 모델 모두 사용 가능하다.

1. Introduction

computer에게 face를 보고 이해하도록 가르치는 것은 human behavior을 이해하기 위해서는 중요한 문제이다.
image와 video로부터 ‘3D face shape’, ‘facial expressions’, ‘facial motion’을 estimate하는 많은 연구들이 있다.
‘sound’를 통해 ‘3D face properties’를 estimate하는 것은 상대적으로 덜 주목 받았다. ⇒ 하지만, ‘speech’를 말하는 것이 많은 ‘facial motions’의 직접적인 원인이 된다.
따라서, ‘speech’와 ‘facial motion’ 사이의 correlation을 이해하는 것은 ‘human analyzing’에 추가적인 가치있는 정보를 준다. ⇒ 특히 visual data가 noisy, missing, ambiguous 할 때
‘speech’와 ‘facial motion’ 사이의 relation은 이전부터 audio-visual speech를 분리하거나, audio-video driven facial animation에 사용되었다.
현재까지 존재하지 않았던 것은, ‘어떠한 사람의 어떠한 언어의 speech’라도 ‘어떠한 face shape의 3D facial motion’에 relate할 수 있는 general, robust한 방법이다. ⇒ VOCA가 만족

‘speech-driven 3D facial animation’이 널리 연구되고 있지만, ‘speaker-independent modeling’은 여전히 challenge하다.
1. ‘Speech signal’과 ‘facial motion’은 강하게 연관되어 있지만, 너무 다른 space에 존재한다. ⇒2개의 data를 related하기 위해서는 ‘non-linear regression function’이 필요하다. 이는 상당히 많은 training data가 필요하다는 말과 같다.
2. phonemes(음소)와 facial motion 사이의 many-to-many mapping ⇒ people, styles에 따라 training할 때 훨씬 더 challenge하다.
3. 논문은 face에 신경써서(특히 realistic face), ‘Uncanny Valley(사람과 유사하지만, 완벽히 같지는 않아 불쾌감을 주는 현상)’에 빠지지 않도록 animation이 realistic해야 한다.
4. 여러 speaker들의 3D face shape에 speech를 relating할 수 있는 training data가 거의 없다.
5. ‘speaker-specific animations’에 대한 이전 연구들이 있지만, ‘speaker independent’하거나 다양한 speaking styles를 capture하는 ‘generic method’는 없다.

Figure 1

임의의 ‘speech signal’과 ‘static 3D face mesh’가 input으로 주어졌을 때, VOCA는 ‘realistic 3D character animation’을 ouput으로 내보낸다.

VOCASET

위에서 언급한 문제를 해결하기 위해, 논문에서는 speech가 존재하는 새로운 4D face scan dataset을 취득하였다.
dataset은 12 subjects와 480개의 3-4초 sequence로 구성되어 있고, phonetic(음성) 다양성을 최대화하는 standard protocols array를 통해 선택된 sentence들이 함께 있다.
4D scans는 60fps로 capture되었고, 모든 scan에 대해 face template mesh를 alignment하였다.
위 dataset를 “VOCASET”이라고 부르고, 기존 dataset들과는 다르다.
새로운 data로 generalize할 수 있는 ‘speech-to-animation models’의 training과 test를 가능하게 한다.

VOCA

위와 같이 “VOCASET”이 주어졌을 때, 논문은 VOCA를 학습시킨다. ⇒ 새로운 speakers에 대해 generalize할 수 있다.
Deep network를 사용하는 최근 연구들은 ‘speech’를 통해 ‘speaker-dependent facial animation’을 regression하는 문제에 대해 놀라운 성능을 보여준다. ⇒ 하지만, 위 연구들은 individual의 ‘idiosyncrasies(특이한 버릇)’까지 capture한다. 따라서 characters 전반에 걸쳐 generalization하기에 부적합하다.
deep learning이 해당 분야를 빠르게 발전시키고 있지만, 최근 가장 좋은 method들조차 수동적인 절차에 의존하거나 또는 mouth에만 초점을 둔다. ⇒ 확실한 automatic full facial animation에는 부적합

이전 연구들의 주요 문제들은 facial motion과 facial identity가 혼돈된다는 것이다.
논문의 주요 insight는 ‘facial motions’의 ideneity를 factorize하고, motion에만 관련된 speech를 model에 학습시키는 것이다.
training시에 ‘subject labels’를 conditioning하는 것은 training process안의 많은 subject들의 data를 결합할 수 있게 해준다. ⇒ training시에 없었던 새로운 subjects에 대해 generalize하고 다른 speaker styles를 synthesize할 수 있다.
‘audio feature extraction’으로 “DeepSpeech”사용하는 것은 VOCA가 다른 audio source와 noise에 대해 robust하게 한다.
FLAME model을 통해
- neck을 포함한 full face motion을 modeling할 수 있고
- scan 또는 image를 통해 ‘subject-specific templates’을 reconstruction하는 것에 사용할 수 있기 때문에, 다양한 adult faces를 animate할 수 있고
- animation동안 ‘identity-dependent shape’과 ‘head pose’를 수정할 수 있다.

3. Preliminaries

VOCA의 목표는, training시에 없었던 임의적인 subject에 대해서도 generalization되는 것이다.
subjects에 대한 Generalizaion은 아래 2가지를 모두 포함한다.
- audio의 관점에서 different speakers에 대한 generalizaion(accent, speed, audio source, noise, environment의 다양성)
- different facial shapes와 motion에 대한 generalization

DeepSpeech

‘different audio source’, ‘regardless noise’, ‘recording artifacts’, ‘language’에 robustness를 얻기 위해, “DeepSpeech”를 model에 사용하였다.
“DeepSpeech”는 ‘Automatic Speech Recognition(ASR)’을 위한 end-to-end deep learning model이다.
DeepSpeech는 5개의 hidden layer로 구성되어 있다.
- 첫 3개 layer는 ReLU activation이 있는 ‘non-recurrent FC layer’이다.
- 4번째 layer는 bi-directional RNN이다.
- 5번째 layer는 ReLU activation이 있는 FC layer이다. ⇒ 마지막 layer의 output은 characters에 대한 확률분포를 output으로 하는 ‘softmax’ function의 input으로 들어간다.

FLAME

‘facial shape’과 ‘head motion’은 subject에 따라 매우 다양하다.
게다가 서로 다른 사람들은 서로 다른 speaking style을 가지고 있다. ‘facial shape’, ‘motion’, ‘speaking style’의 큰 변동성은 공통의 learning space를 사용해야 하는 이유이다.
이 문제를 FLAME을 사용하여 해결하였다. ⇒ statistical head model

4. VOCA

해당 section은 model architecture를 설명하고, input audio가 어떻게 처리되는지 자세히 설명한다.

Overview

VOCA는 input으로 ‘subject-specific template($T$)’와 ‘raw audio signal’을 받는다. ⇒ “DeepSpeech”를 통해 feature extraction
원하는 output은 ‘target 3D mesh’이다.
VOCA는 ‘encoder-decoder network’처럼 동작한다.
- Encoder: ‘audio features’를 ‘low-dimensional embedding’으로 전환하도록 학습된다.
- Decoder: ‘low-dimensional embedding’을 ‘high dimensional space 3D vertex displacements’로 mapping한다.

Figure 2

Speech feature extraction

T초의 audio clip input이 주어졌을 때, ‘speech features’를 extract하는 것에 “DeepSpeech”를 사용한다.
output은 0.02s frame(초당 50 frame)에 해당하는 characters의 ‘unnormalized log probabilities’이다. ⇒ output은 ‘50T x D matrix’이다. (D는 characters 수(alphabet의 개수 + 1(blank label)))
- unnormalized log probabilities: 로그를 취한 확률 값이지만 정규화되지 않아서, 원래의 확률 분포가 아직 전체 확률의 합이 1이 되지 않은 상태
그 후, ‘linear interpolation’을 활용하여 output을 60fps로 resample한다.
‘temporal information’을 포함시키기 위해서, ‘audio frames’를 WxD 크기의 ‘overlapping windows’로 변환한다.(W는 window의 크기)
output은 3차원 배열 (60T x W x D)

Encoder

encoder는 4개의 ‘Convolutional layers’와 2개의 ‘FC layers’로 구성된다.
‘Speech features’와 ‘final convolutional layer’는, 여러 subjects들에 걸쳐 training될 때 ‘subject-specific styles’를 학습하기 위해 subject labels로 conditioning된다.
subject들을 8번 training하기 위해, 각 ‘subject($j$)’는 ‘one-hot-vector($I_j=(\delta_{ij})_{1 \le i \le 8}$)’로 encoding된다. ⇒ 위 vector는 각 ‘D-dimensional speech feature vector’에 concat된다.
- Windows의 dimension은 W x (D+8)이 된다.
⇒ 또한 위 vector는 final convolution layer의 output에 concat된다.

‘temporal features’를 학습하고, input의 dimension을 줄이기 위해, 각 ‘convolutional layer’는 ‘3x1 kernel’, ‘2x1 stride’를 사용한다.
“DeepSpeech”를 사용해서 extract된 feature들은 ‘spatial correlation’을 가지고 있지 않기 때문에, 논문에서는 input window가 ‘W x 1 x (D+8)’ dimension을 갖도록 reshape하고, temporal dimension에 1D convolutions를 수행한다.
overfitting을 방지하기 위해서, parameter 수는 작게 유지하고, 처음 2개 convolutional layer의 32 filters과 나머지 2개 convolutional layer의 64 filters에 대해 training한다.

‘final convolution layer’과 ‘subject encoding’의 concatenation은 2개의 FC layer로 들어간다.
1번째 FC layer는 128개의 units와 hyperbolic tangent activation function이 있다.
2번째 FC layer는 50개의 linear layer이다.

VOCA의 decoder는 linear activation function이 있는 ‘FC layer’이다. ⇒ ‘subject-specific template($T$)’로부터의 5023 x 3 ‘vertex displacements’ (FLAME의 output이 5023 vertex이다.)
layer의 weights는 training data의 vertex displacements를 통해 계산된 ‘50 PCA components’에 의해 initialize된다. (bias는 0으로 initialized)

Animation control

Inference하는 동안, ‘8-d one-hot-vector’를 바꾸면 ‘speaking style’이 바뀐다.
VOCA의 output은 “FLAME”과 같은 ‘ “zero pose” expressed 3D face’이다.
FLAME과 호환 가능하기 때문에, FLAME의 ‘weighted shape blendshapes’을 추가하여 ‘identity-dependent facial shape’을 변경할 수 있다.
face expression과 pose 또한 FLAME의 blendshapes를 통해 변경할 수 있다.

5. Model training

Training set-up

‘audio-4D scan pair’ large dataset(${(x_i,y_i)}^F_{i=1}$)으로 시작한다.
- $x_i \in \R^{W \times D}$는 ‘$i$번째 video frame($y_i \in \R^{N \times 3}$)’의 중심에 위치하는 ‘input audio window’이다.
$f_i \in \R^{N \times 3}$는 $x_i$에 대한 VOCA의 ouput

training을 위해, 논문에서는 captured data를 ‘training set(8 subjects)’, ‘validation set(2 subjects)’, ‘test set(2 subject)’로 나누었다.
- Training set
  - 8 subjects의 40 sentences로 구성되었다. (총 320 sentences)
- Validataion&Test data set
  - 다른 subject들이랑 공유되지 않은 unique sentences 20문장(각각 40 sentences)
training, validation, test set은 모두 subjects와 sentences가 겹치지 않는다.

Loss function

Loss Function은 2개의 term으로 구성된다.
- position term( $E_p=||y_i-f_i||^2_F$ )
  - 예측된 outputs과 training vertex의 distance를 계산한다. ⇒ model이 ground truth와 matching되도록 한다.
- velocity term($E_v=||(y_i-y_{i-1})-(f_i-f_{i-1})||_F^2$)
  - backward finite differences ⇒연속적인 frame사이의 predicted outputs와 training vertex간의 distance를 계산, ‘temporal stability’를 유지

Training parameters

논문은 ‘held-out validation set’에 대해 hyperparameter tuning을 하였다.
VOCA는 constant ‘learning rate’ 1e-4로 50epoch 학습되었다.
‘position term’, ‘velocity term’에 대한 weights는 각각 1.0과 10.0이다.
training동안, batch size 64로 batch normalization을 진행하였다.
window size W=16, speech features D=29이다.

Implementation details

VOCA는 TensorFlow를 통해 실행되고, Adam을 통해 학습된다.
Training 1 epoch당 10분정도 소요된다.(1개의 NVIDIA Tesla K20으로)
training시에 fixed된 pre-trained “DeepSpeech” model을 사용한다.

해당 section에서는 “VOCASET”에 대해 소개하고, setup과 data processing에 대해 설명한다.

VOCASET

VOCASET은 6명의 여자, 6명의 남자 subject를 통해 capture된 ‘audio-4D scan pair’을 포함한다.
각 subject마다, 영어로 말해지는 sentence인 40개의 sequence를 수집하였고, 각 sequence의 길이는 3-5초이다.
sentences는 ‘standard protocols array’를 통해 가져온 것이고, ‘phonetic(음성)’ 다양성을 최대화하기 위해 [27] 논문을 사용하였다.
- 특히, 각 subject는 “TIMIT corpus”의 27 sentences를, [33]에서 사용된 3 ‘pangrams(알파벳이 모두 들어간 글’, “SQuAD”에서 10 questions를 말한다.

Capture setup

논문은 ‘high-quality 3D head scans and audio’를 capture하기 위해 ‘multi-camera active stereo system(3dMD LLC, Atlanta)’를 사용했다. (나머지 내용들도 capture환경에 대한 설명들이다.)

Data processing

‘raw 3D head scans’는 FLAME모델을 사용하여, ‘sequential alignment method’로 registration된다.
‘image-based landmark prediction method’는 빠른 ‘facial motion’을 tracking하는 동안 robustness를 위해 ‘alignment’할 때 사용한다.
‘alignment’ 이후, 각 mesh는 5023 3D vertex로 구성된다.
모든 scans에 대해, ‘각 scan vertex’와 ‘FLAME alignment surface의 closest point’와의 절대오차(distance)를 측정한다. ⇒ alignments가 raw data를 신뢰할 수 있게 represent한다.

그 후, 모든 mesh들은 ‘unposing’된다.
- ‘global rotation’, ‘translation’, ‘neck 주변의 head rotation’의 영향이 모두 지워진다.
unposing 후, 모든 mesh들은 “zero pose”인 상태로 되어있다.
각 sequence마다, ‘neck’영역과 ‘ears’가 자동적으로 고정되고, ‘eyes’근처 region는 noise를 제거하기 위한 ‘Gaussian filtering’을 사용하여 smoothing된다. ⇒ ‘mouth region’에는 smoothing을 적용하지 않아, ‘subtle motion’을 보존한다.

Figure 3

Paper, 3D Talking Head Generation

Paper 3D Talking Head Generation

This post is licensed under CC BY 4.0 by the author.

Abstract

1. Introduction

VOCASET

VOCA

3. Preliminaries

DeepSpeech

FLAME

4. VOCA

Overview

Speech feature extraction

Encoder

Animation control

5. Model training

Training set-up

Loss function

Training parameters

Implementation details

VOCASET

Capture setup

Data processing

Trending Tags