A survey of the Vision Transformers and its CNN-Transformer based Variants

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Vision transformers have recently become popular as a possible alternative to convolutional neural networks (CNNs) for a variety of computer vision applications. These vision transformers due to their ability to focus on global relationships in images have large capacity, but may result in poor generalization as compared to CNNs. Very recently, the hybridization of convolution and self-attention mechanisms in vision transformers is gaining popularity due to their ability of exploiting both local and global image representations. These CNN-Transformer architectures also known as hybrid vision transformers have shown remarkable results for vision applications. Recently, due to the rapidly growing number of these hybrid vision transformers, there is a need for a taxonomy and explanation of these architectures. This survey presents a taxonomy of the recent vision transformer architectures, and more specifically that of the hybrid vision transformers. Additionally, the key features of each architecture such as the attention mechanisms, positional embeddings, multi-scale processing, and convolution are also discussed. This survey highlights the potential of hybrid vision transformers to achieve outstanding performance on a variety of computer vision tasks. Moreover, it also points towards the future directions of this rapidly evolving field.

Related collections

Author and article information

Journal

Publication date Created: 16 May 2023

Article

ArXiV ID: 2305.09880

SO-VID: c19c908e-015a-4b84-b1d9-c5376b838113

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

History

Custom metadata

Comments Pages: 58, Figures: 14

Categories cs.CV

ScienceOpen disciplines: Computer vision & Pattern recognition

Data availability:

ScienceOpen disciplines: Computer vision & Pattern recognition

A survey of the Vision Transformers and its CNN-Transformer based Variants

Read this article at

Abstract

Related collections

Recursive Rule based Visual Categorization

Author and article information

Journal

Article

History

Custom metadata

Comments

Comment on this article

Similar content 95