📚[Read the paper](https://arxiv.org/abs/1807.06521)

# Abstract
- This work proposes an **attention mechanism module** for improving the performance of **CNNs**.
- It computes **attention** along two dimensions of an intermediate CNN **feature map**.
    1. Channel Attention: what to look at
    2. Spatial Attention: where to look
- The two attentions are applied **sequentially** to turn the **feature map** into more meaningful *features*.
- The attention module is **lightweight** and has a **general structure**.
    - It can be **easily attached** to any CNN.
    - It adds almost no **extra parameters or computational cost**.
    - It can be trained **end-to-end** together with the rest of the network.

# 1. Introduction
- Previous work improved performance along **three axes**.

| Factor | Description | Representative work |
| ----------------- | --------------------- | --------------------- |
| Depth | Stack more layers to represent complex features | LeNet → VGG → ResNet |
| Width | Widen the parallel structure to increase representational diversity | GoogLeNet, WideResNet |
| Cardinality | Learn several groups of features in parallel | ResNeXt, Xception |

- Instead of these **structural factors**, **CBAM** focuses on **attention**, a cognitive mechanism.
- Just as a person looks at and processes the important parts of a scene rather than the whole scene, the CNN is trained to **emphasize important features** and **suppress less important ones**.

![[Convolutional Block Attention Module.png]]
- CBAM takes a CNN feature map, computes attention along the **channel dimension ("what to look at") and the spatial dimension ("where to look")**, and **applies the two sequentially**.
- CBAM showed strong results across several datasets and evaluations:
    - Improved classification accuracy on ImageNet-1K
    - Improved object detection performance on MS COCO and VOC 2007
    - In Grad-CAM visualizations, the model focuses more accurately on the target object
    - In a user study (human evaluation), the CBAM model was rated as **looking at images in a more human-like way**
- The paper states **three contributions**.
    - It proposes **CBAM attention**, a simple yet effective module.
    - It **validates the design choices** through extensive ablation experiments, testing several combinations to analyze whether **each component** actually contributes to the **performance gain**.
    - It shows **consistent performance improvements** across several benchmarks (ImageNet, MS COCO, VOC, etc.).

# 2. Related Work
## Network engineering
- Well-designed networks have been a main driver of model performance improvements.
- Earlier CNN models improved performance by adjusting **depth, width, and cardinality**.
    - [[Deep Residual Learning for Image Recognition|ResNet]] used **skip connections** to increase depth and strengthen representational power.
    - **WideResNet** increased width instead of depth, using a wider parallel structure to build more feature combinations and improve performance.
    - **ResNeXt** used **grouped convolution** to establish **cardinality** as a third axis.
- **CBAM**, in contrast, takes its cue from the human visual system and offers a new perspective: **attention**.

## Attention mechanism
- To capture objects more effectively, humans do not perceive a whole image at once; they selectively focus on its **salient parts**.
- Several attention-based CNN studies existed before CBAM, but each had limitations.
    - Wang et al. (2017): encoder-decoder style attention / higher computational cost and a more complex structure.
    - Hu et al. (2017): channel-wise attention (SE-Net) / ignores spatial attention.
    - SE-Net learns **channel attention** from **global average pooling** alone, whereas CBAM adds **max pooling** to learn richer channel information and then adds **spatial attention**.
    - **BAM (Bottleneck Attention Module)** is similar in that it learns channel and spatial attention separately, but it is structurally more complex than CBAM and is inserted only at the bottlenecks of the network; **CBAM is designed as a lighter, more general module**.
- In short, CBAM can be **inserted into every convolutional block** and is applied **sequentially (channel → spatial)**, so it can be added to a CNN in a **plug-and-play** fashion.

# 3. Convolutional Block Attention Module
- CBAM takes an intermediate CNN feature map F ∈ R<sup>C×H×W</sup> as input and applies two attention stages sequentially.
    - ![[CBAM Formula.png]]
    - M<sub>c</sub>(F): channel attention map (size C×1×1)
    - M<sub>s</sub>(F′): spatial attention map (size 1×H×W)
    - ⊗: element-wise multiplication (the attention map is broadcast and multiplied into the feature map)
    - F′ is the output after channel attention, and F′′ is the final output after spatial attention is applied to F′.

## Channel attention module
- Channel attention learns what to look at, that is, **which channels are more important**.

![[Channel Attention Module.png]]
- **Average pooling** and **max pooling** are each applied to the input feature map **F** to summarize its spatial information.
- Both pooled results are passed through a **shared MLP** to compute per-channel importance.
    - The **MLP** is a two-layer fully connected network.
    - It has two layers to keep the parameter count low: the first layer reduces the dimension by a ratio r to R<sup>C/r×1×1</sup> (**squeeze**), and the second restores the original dimension (**excitation**).
    - ![[Channel Attention Formula.png]]
    - σ: **sigmoid function**
    - F<sup>c</sup><sub>avg</sub>, F<sup>c</sup><sub>max</sub>: the results of average pooling and max pooling over the spatial dimensions of the feature map, respectively
    - W<sub>0</sub>, W<sub>1</sub>: the MLP weights of the **squeeze** and **excitation** layers, respectively

## Spatial attention module
- Spatial attention learns where to look, that is, **the important locations (coordinates) within the feature map**.

![[Spatial Attention Module.png]]
- Average pooling and max pooling are each applied along the channel axis to collapse the channels.
- The two pooled maps, F<sup>s</sup><sub>avg</sub>, F<sup>s</sup><sub>max</sub> ∈ R<sup>1×H×W</sup>, are concatenated into a single map in R<sup>2×H×W</sup>.
- A **7×7 convolution filter** then produces the final spatial attention map. The 7×7 kernel is used to obtain a **wide receptive field** and locate important positions more accurately.
    - ![[Spatial Attention Formula.png]]
    - σ: **sigmoid function**
    - F<sup>s</sup><sub>avg</sub>, F<sup>s</sup><sub>max</sub>: the results of average pooling and max pooling over the channel axis of the feature map, respectively
    - f<sup>7×7</sup>: 7×7 **convolution filter**

## Arrangement of attention modules
- An important design question is whether the two modules should run in parallel or sequentially.
- The experiments show that running them **sequentially rather than in parallel**, with **channel attention before spatial attention**, gives the best results.

# 4. Experiments
- The experiments consist of three parts:
    - 4.1 Ablation Studies: finding the most effective CBAM structure
    - 4.2 Image Classification (ImageNet-1K): how much CBAM improves classification accuracy
    - 4.3~4.6 Object Detection + Visualization: how well CBAM generalizes

## 4.1 Ablation Studies
- The ablation studies use the **ImageNet-1K** dataset with **ResNet-50** as the base model.
    - Input images are cropped to 224×224.
    - The **learning rate starts at 0.1** and is dropped every 30 epochs.
    - Training runs for **90 epochs in total**.

### Channel attention
![[Comparison of different channel attention methods.png]]
- ResNet-50 with **average pooling only** (the SE-Net approach) reflects only the **average importance** over the whole image, so the result is smooth but fine-grained distinctions are hard to make.
- ResNet-50 with **max pooling only** focuses solely on the **strongest features (peak activations)**, so it is strong locally but weak on overall context.
- ResNet-50 with **average pooling + max pooling** combined reflects both **global** and **local** features when learning channel attention.

### Spatial attention
![[Comparison of different spatial attention methods.png]]
- As with channel attention, combining **average pooling + max pooling** on ResNet captures more spatial information and improves overall performance.
- In addition, a larger kernel size in the spatial module widens the **receptive field**, so the module sees a broader **context**.
- Overall, **average pooling + max pooling** followed by a **7×7 conv** gave consistent performance gains for spatial attention.

### Arrangement of the channel and spatial attention
![[Combining methods of channel and spatial attention.png]]
- On ResNet, applying the two modules **sequentially** gave better results than applying them in parallel, regardless of the order.
- In particular, running **channel attention before spatial attention** gave better results than the reverse order.
- The reason channel attention comes before spatial attention lies in **the level of abstraction and the flow of information**.
    - Each channel of a CNN feature map acts as a **filter that detects a specific pattern**.
        - Channel 1 -> detects cat ears
        - Channel 2 -> detects vertical edges
        - Channel 3 -> detects grass texture
    - Keeping only the **important channels** of this feature map, rather than all of them, makes it **semantically cleaner**. This is channel attention.
    - On the resulting feature map, it is then efficient to learn **"where the important features are"**. This is spatial attention.
    - Put simply, the process resembles **the brain deciding what to look for** and then **the eyes locating that object**.

## 4.2 Image Classification(ImageNet-1K)
- Does attaching CBAM to various CNN architectures improve performance?
- **SE** and **CBAM** were each attached to four families of architectures and the results compared.

| Architecture | Purpose | Description |
| --------------------- | -------------------- | ------------------------ |
| **ResNet-50 / 101** | High-performance standard model | Deep CNN based on residual blocks |
| **ResNeXt-50 / 101** | Testing cardinality | Grouped convolution structure |
| **WideResNet-50-2** | Width-expanded structure | Version with wider channels in each block |
| **MobileNet (α=0.7)** | Lightweight model | Tests efficiency and suitability for real-time use |

- Overall, architectures with **SE** outperformed the plain architectures, and architectures with **CBAM** outperformed those with **SE**.

## 4.3~4.6 Object Detection + Visualization
### 4.3 Network Visualization with Grad-CAM
- **Grad-CAM** was used to compare visually where each model's **activation (heatmap)** concentrates when looking at the same image.

![[Grad-CAM visualization results.png]]
- The results confirm **visually** that CBAM does more than raise accuracy: it actually learns better **attention**.
### 4.5~4.6 Object Detection
- Beyond classification, CBAM showed **consistent performance gains** on other vision tasks, object detection in particular.
- CBAM was also effective in **real-time detectors (SSD, StairNet)** built on lightweight models.
- Taken together, CBAM improved both performance and interpretability consistently across tasks.
- In other words, the experiments demonstrate that it is a **plug-and-play module** that can be easily attached to any CNN.

# 5. Conclusion
- **CBAM** is a **plug-and-play** module that can be added without changing the existing architecture.
- **CBAM** consists of two **attention stages (Channel Attention, Spatial Attention)** and uses them to *progressively refine* the feature map.
- The earlier [[Squeeze-and-Excitation|SE(Squeeze-and-Excitation)]] module used only *average pooling*, whereas CBAM additionally uses **max pooling** to learn finer attention.
- In particular, **spatial attention**, which learns **"where to focus"**, guides the CNN to **focus on the key locations of objects**.
- Applied to various models (ResNet, ResNeXt, MobileNet, etc.) and datasets, CBAM achieved **higher accuracy**.
    - **Lower Top-1/Top-5 error** on ImageNet
    - Higher **object detection accuracy (mAP)** on MS COCO/VOC

# Code review
📚[View the code](https://github.com/kuangliu/pytorch-cifar/blob/master/models/resnet.py)
- I implemented ResNet + CBAM and compared its performance with SE-ResNet-50 and ResNet-50.
- The dataset is **CIFAR10**.

## Training data and augmentation
1. Random cropping: pad the image by 4 pixels to enlarge it, then crop at a random position.
2. Random horizontal flipping.
3. Input normalization: subtract the per-channel mean.
4. The training data is split 8:2; the training split is used for learning, and top-1 error is measured on the validation split periodically during training.

## Training infrastructure
- GPU: 4080 Super
- Mixed precision training

## Optimization settings
- Optimizer: SGD with momentum 0.9
- Batch size: 512
- Loss: cross-entropy
- Initial learning rate: 0.001
- Learning rate schedule: multiply by 0.2 when there is no improvement for 5 epochs, with a minimum of 1e-6

| | Loss | Error | Total training time |
| ------------- | ------ | ------ | ---------- |
| ResNet-50 | 0.3488 | 20.20% | 108m 57.1s |
| SE-ResNet-50 | 0.2512 | 17.53% | 128m 38.5s |
| ResNet + CBAM | 0.2599 | 17.81% | 172m 25.6s |

![[Pasted image 20251031154000.png]]
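
The note does not include the implementation itself. As a rough illustration of the module described in Section 3, a minimal PyTorch sketch might look like the following. The class names `ChannelAttention`, `SpatialAttention`, and `CBAM`, the reduction ratio `r = 16`, the ReLU between the two MLP layers, and the bias-free 1×1 convolutions standing in for the shared MLP are all assumptions made for illustration; this is not the code that produced the results above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))), shape (N, C, 1, 1)."""

    def __init__(self, channels: int, reduction: int = 16):  # reduction=16 is an assumed default
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Shared MLP written as 1x1 convolutions: W0 maps C -> C/r, W1 maps C/r -> C.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))  # spatial average per channel
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))   # spatial maximum per channel
        return torch.sigmoid(avg + mx)


class SpatialAttention(nn.Module):
    """M_s(F) = sigmoid(conv7x7([AvgPool(F); MaxPool(F)])), shape (N, 1, H, W)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Padding keeps H and W unchanged for odd kernel sizes.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = torch.mean(x, dim=1, keepdim=True)    # average over channels
        mx, _ = torch.max(x, dim=1, keepdim=True)   # maximum over channels
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


class CBAM(nn.Module):
    """Channel attention followed by spatial attention (the sequential order from the note)."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.channel_att = ChannelAttention(channels, reduction)
        self.spatial_att = SpatialAttention(kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.channel_att(x) * x  # F'  = M_c(F)  ⊗ F   (broadcast over H, W)
        x = self.spatial_att(x) * x  # F'' = M_s(F') ⊗ F'  (broadcast over C)
        return x


# Usage: the output has the same shape as the input feature map.
features = torch.randn(2, 64, 32, 32)
refined = CBAM(channels=64)(features)
print(refined.shape)
```

With these assumed settings and 64 channels, the module adds 610 parameters (512 in the shared MLP, 98 in the 7×7 convolution), which is consistent with the "lightweight" claim. Where exactly to place the module inside each residual block of the linked ResNet is a separate design choice that this sketch leaves open.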