UCTNet 논문 리뷰 | Gihun Son

UCTNet 논문 리뷰

Posted Jan 7, 2023

By Gihun Son 14 min read

[Uncertainty-aware Cross-modal Transformer Network for Indoor RGB-D Semantic Segmentation]

[요약]

RGB-D Semantic Segmentation에서 1)depth sensor data에서 Feature를 추출하는 방법과 2)두 개의 양식에서 추출된 feature들을 효과적으로 융합하는 방법은 중요한 문제이다.

1) First Challenge(depth sensor data에서 Feature를 추출하는 방법)

sensor를 통해 얻은 depth information은 항상 믿을 수는 없다.(빛을 반사하거나 표면이 어두운 물체들에서 보통 부정확하거나, sensor의 감지를 피한다)
ConvNet을 사용하여 depth feature를 추출하는 기존의 method들은 서로 다른 pixel location에서 depth value의 신뢰도를 명쾌하게 고려하지 않는다.

⇒위 문제를 해결하기 위해 ‘Uncertainty-Aware Self-Attention’라는 새로운 mechanism을 제안한다.

‘Uncertainty-Aware Self-Attention’는 Feature Extraction동안 information이 신뢰할 수 없는 depth pixel에서 신뢰할 수 있는 depth pixel로 이동할 수 있도록 명확하게 control한다.

2) Seconde Challenge(두 개의 양식에서 추출된 feature들을 효과적으로 추출하는 방법)

효과적이고 scalable한 Cross-Attention 기반의 fusion module을 제안한다.

⇒ RGB와 depth Encoder사이의 적응 가능하고 비대칭한 information 교환을 수행할 수 있다.

————————————————————————————————————————————

논문에서 제안된 famework(UCTNet)은 탄탄하고 정확한 RGB-D Segmentation을 위해 위에서 말한 2개의 design을 자연스럽게 통합하는 Encoder-Decoder Network이다.

Introduction

Semantic Segmentation의 목표는 Input으로 RGB image가 주어졌을 때, 각 pixel을 미리 정의된 Semantic category들로 분류하는 것이다.
Single monocular RGB image는 3D scene에서 2D projection으로 ⇒ 필연적으로 Depth Dimension은 손실 될 수 밖에 없다.
Depth sensor의 발달로 이러한 정보손실이 줄어들기는 했다.

논문에서는 Depth-Assisted RGB-D Semantic Segmentation을 중점으로 다룬다

[Two Major Challenges]

1) How to effectively extract features from the additional depth input

⇒input값의 신뢰도를 명확하게 나타내는 것이 목표(CNN에서는 kernel의 크기가 고정되어 있기 때문에 이러한 것이 쉽지 않다)

따라서 Convolution operation 대신 ViT(Vision Transformer)를 사용(Self-Attention)
SA(Self-Attention)연산은 node가 pixel인 fully connected된 undirected graph(무방향 그래프)를 통해 정보를 전파 ⇒ UASA(Uncertainty-Aware Self-Attention)는 SA를 수정하여 directed graph를 통해 정보를 전파하고 불확실한 node들로부터 나오는 정보들을 명확하게 제어한다. (정확하게는 불확실한 node들의 정보들의 흐름은 제한하고, 확실한 node들의 정보는 불확실한 node들이 받아들일 수 있도록 제어한다[불확실한 node의 기능이 개선될 수 있다])

2) How to aggregate and fuse the features extracted from two input modalities

We design a new fusion module that can perform adaptive and asymmetric information exchange between two branches. Our fusion module is based on the Cross-Attention (CA) technique that aligns well with our ViT backbone and we propose two modifications to make it scalable to high-resolution feature maps and easier to train.

⇒Our final framework, namely UCTNet, is an encoder-decoder network that incorporates our proposed two designs for RGB-D Semantic Segmentation.

[논문에서 언급한 기여]

We introduce a novel Uncertainty-Aware Self-Attention mechanism to explicitly handle the feature extraction from inputs with uncertain values. ⇒불확실한 값을 갖는 input에서 Feature extraction을 명확하게 처리하기 위해 UASA mechanism 제안
We design an effective and scalable fusion module that can perform adaptive and asymmetric information exchange between two branches. ⇒2개의 branch에서 적응적이고 비대칭적인 정보를 교환할 수 있게 해주는 effective and scalable fusion module을 design
Our proposed framework, namely UCTNet, achieves new state-of-the-art performance on two public benchmarks and outperforms all existing methods with significant improvements. ⇒제안된 famework, UCTNet은 최고의 성능을 보여준다.

Main

[요약]

UCTNet은 두개의 병렬 Encoder(RGB, Depth Encoder)와 Segmentation결과를 만들어내는 Semantic Decoder로 이루어져 있다.

RGB image 하나를 input으로 가지고 강력하고 효율적인 ViT Backbone인 Swin-S Architecture를사용
RGB Encoder는 RGB image를 input으로 받으면 Patch Embedding layer를 통해 Patch Feature들을 만든다.
Patch Feature들은 image feature들을 각각 1/4, 1/8, 1/16, 1/32 해상도로 만드는 4개의 순차적인 Transformer Block들을 통과하게 된다.

Depth Encoder는 기존의 방식과 달리, Depth Map뿐만 아니라 Depth Uncertainty Map(뒤에 자세한 설명 나온다)도 input으로 받는다. Depth Encoder는 아래의 2가지만 빼면 RGB Encoder와 동일한 Architecture를 갖는다.

1) 모든 Self-Attention(SA) layer를 Uncertainty-Aware Self-Attention(UASA) layer로 대체

2) Depth Uncertainty를 Depth Map과 concatenate하여 Patch Embedding layer에 feed

각 Encoder block의 output에서 RGB Encoder와 Depth Encoder사이의 정보를 융합하고 교환하기 위해 제안된 ‘Fusion Module’을 사용한다.
‘Fusion Module은 RGB와 Depth branch로부터 input을 받고 해당 Encoder의 다음 block에 update된 Feature를 return한다.

Semantic Decoder는 각 Fusion Module에서 융합된 Feature들을 input으로 받고, 최종 segmentation 결과를 생성한다.
semantic decoder로 UperNet decoder를 사용하였다.

[자세한 설명]

일반적으로 기존의 깊이 센서는 일반적으로 반사율이 높거나 빛 흡수율이 높은 표면의 깊이를 측정하는 데 어려움(센서로 Depth를 측정하는 것은 물리적 환경에 영향을 받을 수 밖에 없다)
Kinect과 같은 일반적인 depth sensor는 depth를 정확히 측정할 수 없는 경우 무효값을 반환 ⇒이러한 경우에 Uncertainty Map(binary map) U ∈ {0, 1} (H×W)으로 나타낸다. (0은 sensor판독값 없음, 1은 유효한 sensor값을 나타낸다)
최신 Sensor는 3단계 신뢰도 map을 가지는 기술도 있지만 binary map으로 normalize할 수 없기 때문에 사용x
Binary, Multi-Level Discrete, 또는 Continuous한 값들로 구성된 Uncertainty map(U ∈ [0, 1])의 일반적인 유형을 위해 Uncertainty-Aware Self-Attention(UASA)를 공식화
Kinect과 같이 binary uncertainty map만을 제공하는 일반적인 Sensor를 사용하여 대부분의 RGB-D Semantic Segmentation을 진행하였다는 것이 언급할 가치가 있다. ⇒ 단순히 Uncertainty를 Framework에 결합하는 것으로 성능을 향상

위는 일반적인 Self-Attention에 대한 설명으로 생략(Transformer논문 정리 참고)

Uncertainty-Aware Self-Attention은 방향성이 존재 ⇒ E(i→j)와 E(j→i)가 다르다. Node끼리의 information 흐름을 제어
방법은 1) Cut-off 와 2) Suppression 2가지

Cut-off방식은 Uncertain nodes(불확실한 node)로부터 나오는 information을 아예 차단한다.
Uncertain nodes는 다른 confident nodes로 부터 information을 받을 수 있고, node features를 update할 수 있다.

UASA(Suppression)방식은 UASA(Cut-off)방식이 너무 과격하다는 생각에서 나옴 (Uncertain nodes의 information도 여전히 유용할 수 있기 때문)
Multiple Transformer Layer를 거치면서, Uncertain nodes의 불확실성이 줄어듦
‘T’라는 hyper-parameter를 두어 Uncertain nodes의 Attention Weight를 줄인다.

위의 formulation들은 일반적인 SA(Self-Attention)에 기초하여 논문(UCTNet)의 UASA를 정의
논문의 Encoder에 적용한 Swin-S Backbone은 Shift-Window Self-Attention(SWSA)라는 새로운 버전의 SA를 사용한다. ⇒ 하지만 Window-Partition 연산을 Input Image와 함께 Uncertainty map(U)에 적용하기만 한다면, 위의 formulations는 SWSA에도 동일하게 동작한다.

Fusion Module의 목표는 2개의 Encoding Stream사이에서 Feature Fusion(융합)과 Information Exchange를 달성하는 것이다.
Decoder는 수정하지 않았다.(기존 Decoder와 호환되기 위해)

[Fusion Module 설계 원칙]

Attentive(세심함): The features from different modalities should be combined in an attentive way instead of simple element-wise addition. ⇒단순한 요소 별 덧셈이 아닌 서로 다른 modality(양식)의 Features는 Attentive한 방식으로 결합되어야 한다.
Adaptive(적응 가능함): The attention/weight to perform attentive fusion should be generated by adaptively considering both input modalities. ⇒input modality 두가지 모두 적응적으로 고려하여 Attentive Fusion을 수행하기 위한 Attention과 Weight를 생성해야 한다.
Bidirectional(양방향성): Instead of one-way passing the feature from one modality to another, we prefer to exchange the information between two modalities. ⇒하나의 modality에서 다른 modality로만 가는 단방향 대신, 두개의 modality사이에 information이 교환되는 양방향을 선호한다.
Asymmetric(비대칭): The combined features passing back to different encoders should be different, i.e. F(depth→rgb) != F(rgb→depth), where F denotes the fusion function. ⇒서로 다른 Encoder로 다시 전달되는 Combined Feature는 서로 달라야 한다. (F는 Fusion Function을 의미)

위의 4가지 Design 원칙 외에도 ViT BackBone에 맞추어 정렬하기 위해 Attention mechanism을 사용하여 Fusion Module을 설계하는 것을 선호한다.
‘Window Cross Attentive Fusion(WCAF) layer는 Fusion module의 핵심 ⇒’Source’ Feature와 ‘Target’ Feature를 입력으로 받는다 ⇒’Source’ Feature에서의 information을 ‘Target’ Feature로 융합하는 것이 목표

’Source’ Feature와 ‘Target’ Feature는 Fusion direction에 따라 RGB 또는 Depth Modality에서 나올 수 있다.
Cross-Attention mechanism을 기반으로 하는 WCAF layer는 2개의 중요한 변화가 있다. 1) 원래의 Cross-Attention은 high-resolution(높은 해상도)의 Feature로 scale하기 힘든 quadratic complexity를 가지고 있어, 논문은 linear complexity를 가지고, Encoder의 초기 단계에서 얻은 hight-resolution features를 융합하는 것에 사용될 수 있는 Window Cross-Attention을 제안한다. 2) dense features를 바로 융합하는 것을 잘 수행하는 Cross-Attention layer를 학습시키는 것은 매우 어렵다. 논문에서는 Cross-Attention을 Channnel-Attention과 결합하여, Training Difficulties를 매우 줄이는 Channel Weight를 생성하면 되도록 해결하였다.

WCAF layer는 1, 2번 조건(Attentive, Adaptive)을 충족하였다.
나머지 2개의 조건을 충족시키기 위해, 2개의 독립적인 WCAF layer를 Fusion Module에 적용 ⇒F(depth→rgb)와 F(rgb→depth)를 따로 적용(Fusion Model의 WCAF layer가 2개) (WCAF layer는 input의 SRC와 TGT가 서로 바뀌는 것을 제외하면 같다) ⇒각각 그들만의 modality-specific feature를 extract하기 위해 RGB와 Depth Encoder를 독립적으로 두면서 각각의 Feature를 모두 향상시킬 수 있다.

논문은 Indoor RGB-D Semantic Segmentation의 새로운 Framework를 제안
[2가지의 주요 과제를 해결] 1) Depth image에서 Features를 더 잘 extract하는 것 ⇒Uncertainty-Aware Self-Attention을 제안 (Uncertain, Confident nodes 사이의 information flow를 명확하게 control할 수 있다) 2) 2개의 modality(RGB, Depth)의 information을 더 효과적으로 융합하고 결합하는 것 ⇒기존의 Fusion module들의 문제점을 분석하고, 4가지 원칙에 따라 새로운 Fusion module 만듦

Paper, RGB-D Segmentation

This post is licensed under CC BY 4.0 by the author.

Introduction

논문에서는 Depth-Assisted RGB-D Semantic Segmentation을 중점으로 다룬다

[논문에서 언급한 기여]

Main

[요약]

[자세한 설명]

[Fusion Module 설계 원칙]

Trending Tags