Objective Paper:

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

0. Abstract

  • Transformer architectures have so far seen limited use in computer vision (CV)
  • Findings: a pure-Transformer model applied directly to sequences of image patches can perform very well in CV
    • Significantly fewer computational resources needed for training
    • Accuracy comparable to or better than state-of-the-art CNNs

1. Introduction

  • In the NLP domain, self-attention-based Transformers have performed very well: pre-training on a large text corpus and then fine-tuning on a task-specific dataset has proven highly effective
  • In the CV domain, various attempts to use attention exist, but CNN-based architectures have remained more effective
  • Training the model on mid-sized datasets (such as ImageNet)? Accuracy is only modest
  • Training the model on large datasets? It can match or even beat cutting-edge CNNs

2. Preceding Works

  • Naive application of self-attention models to CV is not practical: relating each pixel to every other pixel makes the cost quadratic in the number of pixels, which is far too much input for realistic image sizes (see the cost sketch after this list)
  • Restricting attention to neighboring pixels (local self-attention): promising results
  • Most similar prior work to ViT: extracting 2 x 2 patches, embedding them, and applying full self-attention on top; however, that approach only handles small-resolution images, while ViT works at medium resolution as well
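To see why pixel-level self-attention is impractical, here is a rough back-of-the-envelope comparison. The 224 x 224 resolution and 16 x 16 patch size are illustrative assumptions, not numbers quoted in this section:

```python
# Rough cost comparison: pixels-as-tokens vs. 16x16 patches-as-tokens.
# Assumes a 224x224 RGB image; self-attention cost grows with the square
# of the sequence length.
pixels  = 224 * 224           # 50,176 tokens if every pixel attends to every pixel
patches = (224 // 16) ** 2    # 196 tokens with 16x16 patches

print(pixels ** 2)            # ~2.5e9 pairwise interactions per attention layer
print(patches ** 2)           # 38,416 -- roughly 65,000x fewer
```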

3. Method

(Figure: ViT model overview: the image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence is fed to a standard Transformer encoder)

  • ViT follows the original Transformer architecture as closely as possible, so existing scalable Transformer implementations can be reused
  • Model Structure
    1. Linear Projection of Flattened Patches
      • Original Image: x ∈ ℝ^(H × W × C)
      • Single Patch Size: P × P (e.g. 16 × 16)
      • Image Resolution: H × W
      • Number of Channels: C
      • Number of Patches (Input Sequence Length for the Transformer): N = HW / P²
      • Constant Latent Vector Size (Single Flattened Patch): D
      • Single Patch Before Flattening: x_p^i ∈ ℝ^(P × P × C), flattened into a vector of length P²·C
      • Trainable Linear Projection (for each patch): E ∈ ℝ^((P²·C) × D)
      • Class Vector: z_0^0 = x_class, a learnable embedding prepended to the patch sequence (analogous to BERT's [class] token)
      • Patch Position: learnable 1D position embeddings E_pos ∈ ℝ^((N+1) × D)
      • Input for Transformer Encoder: z_0 = [x_class; x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos
    2. Alternating MSA and MLP Layers
      • Layernorm (LN) is applied before every MSA and MLP block, and a residual connection is added after every block
      • MSA Layer: z'_ℓ = MSA(LN(z_{ℓ-1})) + z_{ℓ-1},  ℓ = 1 … L
      • MLP Layer: z_ℓ = MLP(LN(z'_ℓ)) + z'_ℓ,  ℓ = 1 … L
    3. Output Layer
      • Image representation: y = LN(z_L^0), i.e. layernorm of the class token after the last encoder layer, fed to a classification head (an MLP with one hidden layer during pre-training, a single linear layer at fine-tuning); a code sketch of the full forward pass follows below
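To make the projection and encoder equations above concrete, here is a minimal PyTorch sketch of the forward pass. The class names, the patch-extraction code, and the default hyperparameters (224 x 224 input, 16 x 16 patches, D = 768, 12 layers, 12 heads) are illustrative assumptions, not the authors' reference implementation; the single linear head corresponds to the fine-tuning setup.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-LN Transformer block: z' = MSA(LN(z)) + z, then z = MLP(LN(z')) + z'."""
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]  # MSA layer + residual
        z = z + self.mlp(self.ln2(z))                     # MLP layer + residual
        return z

class ViTSketch(nn.Module):
    def __init__(self, image_size=224, patch=16, channels=3, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        n = (image_size // patch) ** 2                        # N = HW / P^2 patches
        self.patch = patch
        self.proj = nn.Linear(patch * patch * channels, dim)  # trainable projection E
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))       # class token x_class
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))   # position embeddings E_pos
        self.blocks = nn.Sequential(*[EncoderBlock(dim, heads) for _ in range(depth)])
        self.ln = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)               # linear classification head

    def forward(self, x):                                     # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch
        # cut the image into P x P patches and flatten each into a (P*P*C)-vector
        x = x.unfold(2, P, P).unfold(3, P, P)                 # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        z = self.proj(x)                                      # (B, N, D) patch embeddings
        z = torch.cat([self.cls.expand(B, -1, -1), z], dim=1) + self.pos
        z = self.blocks(z)                                    # L encoder blocks
        return self.head(self.ln(z[:, 0]))                    # classify from the class token

logits = ViTSketch()(torch.randn(2, 3, 224, 224))             # -> shape (2, 1000)
```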

4. Experiments

  • Comparison of Models and the Size of Dataset
    • ViT, ResNet, and Hybrid
    • ViT configurations are based on those used for BERT (Base and Large), plus a larger "Huge" variant
    • ViT models are denoted by variant and patch size (e.g. ViT-L/16 = Large variant with 16 × 16 patches). The sequence length is inversely proportional to the square of the patch size, so models with smaller patches are computationally more expensive (see the sketch at the end of this section)
    • ViT model hyperparameters:
      (Table: layers, hidden size D, MLP size, heads, and parameter count for ViT-Base, ViT-Large, and ViT-Huge)
    • Baseline CNN: a modified ResNet with Group Normalization in place of Batch Normalization and standardized convolutions, which improves transfer
  • Comparison of Performance
    • ViT-H/14 and ViT-L/16: Huge and Large ViT models with patch sizes of 14 and 16, respectively
    • BiT-L (a large supervised ResNet): previous state of the art in image classification on most benchmarks aside from ImageNet
    • Noisy Student: CNN model, previous state of the art in image classification on ImageNet
    • Result: the ViT models match or outperform the previous state of the art while consuming significantly fewer pre-training compute resources
  • Data Size and Performance of Pre-Trained Models
    • The capacity of the larger ViT models is fully utilized only with larger pre-training datasets (e.g. JFT-300M); on smaller datasets the bigger models do not pay off
  • Scaling of Computational Resources
    • ViT outperforms ResNet on the compute vs. accuracy trade-off, reaching the same performance with substantially less pre-training compute
    • The hybrid model outperforms ViT at small computational budgets, but the difference vanishes and ViT catches up as the computational budget grows
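As a quick check of the patch-size vs. sequence-length relation N = HW / P² mentioned in the model comparison above (the 224 x 224 input resolution is an assumption corresponding to the usual pre-training setting):

```python
# Sequence length N = H*W / P^2 for a 224x224 input at different patch sizes.
for name, p in [("ViT-B/32", 32), ("ViT-B/16", 16), ("ViT-H/14", 14)]:
    n = (224 * 224) // (p * p)
    print(f"{name}: {n} patches")   # 49, 196, and 256 patches respectively
```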