Objective Paper:

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

0. Abstract

  • Transformer architectures have so far seen limited use in computer vision (CV)
  • Findings: a pure-Transformer model applied directly to sequences of image patches can perform very well in CV
    • Significantly fewer computational resources needed for training
    • Accuracy comparable to or better than state-of-the-art CNNs

1. Introduction

  • In the NLP domain, self-attention-based Transformers have performed very well: pre-training on a large text corpus and then fine-tuning on a task-specific dataset has proven highly effective
  • In the CV domain, various attempts to use attention exist, but CNN-based architectures have remained more effective
  • Training the model on mid-sized datasets (such as ImageNet)? Accuracy is only modest
  • Training the model on large datasets? It can match or even beat cutting-edge CNNs

2. Preceding Works

  • Naive application of self-attention models to CV is not practical: relating each pixel to every other pixel makes the cost quadratic in the number of pixels, which is far too much input for realistic image sizes (see the cost sketch after this list)
  • Restricting attention to neighboring pixels (local self-attention): promising results
  • Most similar prior work to ViT: extracting 2 x 2 patches, embedding them, and applying full self-attention on top; however, that approach only handles small-resolution images, while ViT works at medium resolution as well
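To see why pixel-level self-attention is impractical, here is a rough back-of-the-envelope comparison. The 224 x 224 resolution and 16 x 16 patch size are illustrative assumptions, not numbers quoted in this section:

```python
# Rough cost comparison: pixels-as-tokens vs. 16x16 patches-as-tokens.
# Assumes a 224x224 RGB image; self-attention cost grows with the square
# of the sequence length.
pixels  = 224 * 224           # 50,176 tokens if every pixel attends to every pixel
patches = (224 // 16) ** 2    # 196 tokens with 16x16 patches

print(pixels ** 2)            # ~2.5e9 pairwise interactions per attention layer
print(patches ** 2)           # 38,416 -- roughly 65,000x fewer
```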

3. Method

(Figure: ViT model overview: the image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence is fed to a standard Transformer encoder)

  • ViT follows the original Transformer architecture as closely as possible, so existing scalable Transformer implementations can be reused
  • Model Structure
    1. Linear Projection of Flattened Patches
      • Original Image: x ∈ ℝ^(H × W × C)
      • Single Patch Size: P × P (e.g. 16 × 16)
      • Image Resolution: H × W
      • Number of Channels: C
      • Number of Patches (Input Sequence Length for the Transformer): N = HW / P²
      • Constant Latent Vector Size (Single Flattened Patch): D
      • Single Patch Before Flattening: x_p^i ∈ ℝ^(P × P × C), flattened into a vector of length P²·C
      • Trainable Linear Projection (for each patch): E ∈ ℝ^((P²·C) × D)
      • Class Vector: z_0^0 = x_class, a learnable embedding prepended to the patch sequence (analogous to BERT's [class] token)
      • Patch Position: learnable 1D position embeddings E_pos ∈ ℝ^((N+1) × D)
      • Input for Transformer Encoder: z_0 = [x_class; x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos
    2. Alternating MSA and MLP Layers
      • Layernorm (LN) is applied before every MSA and MLP block, and a residual connection is added after every block
      • MSA Layer: z'_ℓ = MSA(LN(z_{ℓ-1})) + z_{ℓ-1},  ℓ = 1 … L
      • MLP Layer: z_ℓ = MLP(LN(z'_ℓ)) + z'_ℓ,  ℓ = 1 … L
    3. Output Layer
      • Image representation: y = LN(z_L^0), i.e. layernorm of the class token after the last encoder layer, fed to a classification head (an MLP with one hidden layer during pre-training, a single linear layer at fine-tuning); a code sketch of the full forward pass follows below
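To make the projection and encoder equations above concrete, here is a minimal PyTorch sketch of the forward pass. The class names, the patch-extraction code, and the default hyperparameters (224 x 224 input, 16 x 16 patches, D = 768, 12 layers, 12 heads) are illustrative assumptions, not the authors' reference implementation; the single linear head corresponds to the fine-tuning setup.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-LN Transformer block: z' = MSA(LN(z)) + z, then z = MLP(LN(z')) + z'."""
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]  # MSA layer + residual
        z = z + self.mlp(self.ln2(z))                     # MLP layer + residual
        return z

class ViTSketch(nn.Module):
    def __init__(self, image_size=224, patch=16, channels=3, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        n = (image_size // patch) ** 2                        # N = HW / P^2 patches
        self.patch = patch
        self.proj = nn.Linear(patch * patch * channels, dim)  # trainable projection E
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))       # class token x_class
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))   # position embeddings E_pos
        self.blocks = nn.Sequential(*[EncoderBlock(dim, heads) for _ in range(depth)])
        self.ln = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)               # linear classification head

    def forward(self, x):                                     # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch
        # cut the image into P x P patches and flatten each into a (P*P*C)-vector
        x = x.unfold(2, P, P).unfold(3, P, P)                 # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        z = self.proj(x)                                      # (B, N, D) patch embeddings
        z = torch.cat([self.cls.expand(B, -1, -1), z], dim=1) + self.pos
        z = self.blocks(z)                                    # L encoder blocks
        return self.head(self.ln(z[:, 0]))                    # classify from the class token

logits = ViTSketch()(torch.randn(2, 3, 224, 224))             # -> shape (2, 1000)
```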

4. Experiments

  • Comparison of Models and the Size of Dataset
    • ViT, ResNet, and Hybrid
    • ViT configurations are based on those used for BERT (Base and Large), plus a larger "Huge" variant
    • ViT models are denoted by variant and patch size (e.g. ViT-L/16 = Large variant with 16 × 16 patches). The sequence length is inversely proportional to the square of the patch size, so models with smaller patches are computationally more expensive (see the sketch at the end of this section)
    • ViT model hyperparameters:
      (Table: layers, hidden size D, MLP size, heads, and parameter count for ViT-Base, ViT-Large, and ViT-Huge)
    • Baseline CNN: a modified ResNet with Group Normalization in place of Batch Normalization and standardized convolutions, which improves transfer
  • Comparison of Performance
    • ViT-H/14 and ViT-L/16: Huge and Large ViT models with patch sizes of 14 and 16, respectively
    • BiT-L (a large supervised ResNet): previous state of the art in image classification on most benchmarks aside from ImageNet
    • Noisy Student: CNN model, previous state of the art in image classification on ImageNet
    • Result: the ViT models match or outperform the previous state of the art while consuming significantly fewer pre-training compute resources
  • Data Size and Performance of Pre-Trained Models
    • The capacity of the larger ViT models is fully utilized only with larger pre-training datasets (e.g. JFT-300M); on smaller datasets the bigger models do not pay off
  • Scaling of Computational Resources
    • ViT outperforms ResNet on the compute vs. accuracy trade-off, reaching the same performance with substantially less pre-training compute
    • The hybrid model outperforms ViT at small computational budgets, but the difference vanishes and ViT catches up as the computational budget grows
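As a quick check of the patch-size vs. sequence-length relation N = HW / P² mentioned in the model comparison above (the 224 x 224 input resolution is an assumption corresponding to the usual pre-training setting):

```python
# Sequence length N = H*W / P^2 for a 224x224 input at different patch sizes.
for name, p in [("ViT-B/32", 32), ("ViT-B/16", 16), ("ViT-H/14", 14)]:
    n = (224 * 224) // (p * p)
    print(f"{name}: {n} patches")   # 49, 196, and 256 patches respectively
```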