
      PF-ViT: Parallel and Fast Vision Transformer for Offline Handwritten Chinese Character Recognition

      research-article
      Computational Intelligence and Neuroscience
      Hindawi


          Abstract

          Recently, the Vision Transformer (ViT) has been widely used in image recognition. Unfortunately, the ViT model stacks 12 encoder layers, resulting in a large amount of computation, many parameters, and slow training, which makes it difficult to deploy on mobile devices. To reduce the computational complexity of the model and improve training speed, a parallel and fast Vision Transformer method for offline handwritten Chinese character recognition is proposed. The method adds parallel branches of the encoder module to the structure of the Vision Transformer model; the parallel modes include two-way, four-way, and seven-way parallelism. The original image is flattened and linearly embedded and then fed to the encoder module. The core step in the encoder is the multihead attention mechanism, whose multihead self-attention learns the interdependence between image sequence blocks. In addition, data augmentation strategies increase the diversity of the data. In the two-way parallel experiment, the model reaches 98.1% accuracy on the dataset with 43.11 million parameters and 4.32 G FLOPs. Compared with the ViT model, whose parameters and FLOPs are 86 million and 16.8 G, respectively, the two-way parallel model shows a 50.1% decrease in parameters and a 34.6% decrease in FLOPs. This method is demonstrated to effectively reduce the computational complexity of the model while indirectly improving image recognition speed.
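          As a rough illustration of the parallel-encoder idea described in the abstract, the PyTorch sketch below embeds flattened patches and feeds them to several parallel stacks of standard Transformer encoder layers. The class name (ParallelViT), the averaging used to fuse branch outputs, and every size (image size, embedding dimension, depth per branch, the 3755 output classes) are illustrative assumptions, not the paper's exact configuration.

            import torch
            import torch.nn as nn

            class ParallelViT(nn.Module):
                """Minimal sketch of a parallel-branch ViT-style classifier.

                Patches are flattened and linearly embedded, then processed by
                several parallel stacks of Transformer encoder layers (multihead
                self-attention + MLP) whose outputs are averaged before
                classification. All sizes are illustrative assumptions."""

                def __init__(self, image_size=64, patch_size=8, in_chans=1,
                             embed_dim=256, depth_per_branch=6, num_heads=8,
                             num_branches=2, num_classes=3755):
                    super().__init__()
                    self.patch_size = patch_size
                    num_patches = (image_size // patch_size) ** 2
                    patch_dim = in_chans * patch_size * patch_size
                    # Flatten each patch and project it to the embedding dimension.
                    self.patch_embed = nn.Linear(patch_dim, embed_dim)
                    self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
                    self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
                    # Parallel encoder branches, each a shallow stack of encoder layers.
                    self.branches = nn.ModuleList([
                        nn.TransformerEncoder(
                            nn.TransformerEncoderLayer(embed_dim, num_heads,
                                                       dim_feedforward=4 * embed_dim,
                                                       batch_first=True),
                            num_layers=depth_per_branch)
                        for _ in range(num_branches)])
                    self.norm = nn.LayerNorm(embed_dim)
                    self.head = nn.Linear(embed_dim, num_classes)

                def forward(self, x):
                    b, c, _, _ = x.shape
                    p = self.patch_size
                    # Split the image into non-overlapping patches and flatten them.
                    x = x.unfold(2, p, p).unfold(3, p, p)                  # (B, C, H/p, W/p, p, p)
                    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
                    x = self.patch_embed(x)                                # (B, N, D)
                    x = torch.cat([self.cls_token.expand(b, -1, -1), x], dim=1) + self.pos_embed
                    # Run the branches side by side and average their token outputs.
                    x = torch.stack([branch(x) for branch in self.branches]).mean(dim=0)
                    return self.head(self.norm(x[:, 0]))                   # classify from the class token

            model = ParallelViT()                        # two-way parallel by default
            logits = model(torch.randn(2, 1, 64, 64))    # e.g. grayscale character images
            print(logits.shape)                          # torch.Size([2, 3755])

          With more branches (four-way or seven-way), only num_branches changes in this sketch; how the original work fuses or sizes the branches is not specified in the abstract, so the averaging here is simply one plausible choice.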


          Most cited references (33)


          2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)


            Image Edge Detection: A New Approach Based on Fuzzy Entropy and Fuzzy Divergence


              Vision Transformers for Remote Sensing Image Classification

              In this paper, we propose a remote-sensing scene-classification method based on vision transformers. These networks, now recognized as state-of-the-art models in natural language processing, do not rely on convolution layers as standard convolutional neural networks (CNNs) do. Instead, they use multihead attention mechanisms as the main building block to derive long-range contextual relations between pixels in images. In a first step, the images under analysis are divided into patches, then converted to a sequence by flattening and embedding. To retain positional information, position embeddings are added to these patches. The resulting sequence is then fed to several multihead attention layers to generate the final representation. At the classification stage, the first token of the sequence is fed to a softmax classification layer. To boost classification performance, we explore several data augmentation strategies to generate additional training data. Moreover, we show experimentally that we can compress the network by pruning half of the layers while keeping competitive classification accuracy. Experimental results on different remote-sensing image datasets demonstrate the promising capability of the model compared to state-of-the-art methods. Specifically, the Vision Transformer obtains average classification accuracies of 98.49%, 95.86%, 95.56%, and 93.83% on the Merced, AID, Optimal31, and NWPU datasets, respectively, while the compressed version obtained by removing half of the multihead attention layers yields 97.90%, 94.27%, 95.30%, and 93.05%, respectively.
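              The compression step mentioned in this cited abstract (removing half of the encoder layers) can be sketched with standard PyTorch modules. The function name prune_encoder_layers, its default of keeping every other layer, and all sizes below are illustrative assumptions, not the cited paper's exact procedure.

                import copy
                import torch
                import torch.nn as nn

                def prune_encoder_layers(encoder: nn.TransformerEncoder,
                                         keep_every: int = 2) -> nn.TransformerEncoder:
                    """Return a compressed copy of `encoder` that keeps only every
                    `keep_every`-th layer (half of the stack for keep_every=2),
                    reusing the retained layers' trained weights."""
                    pruned = copy.deepcopy(encoder)
                    pruned.layers = nn.ModuleList(
                        [copy.deepcopy(encoder.layers[i])
                         for i in range(0, len(encoder.layers), keep_every)])
                    pruned.num_layers = len(pruned.layers)
                    return pruned

                # Usage: compress a 12-layer encoder to 6 layers and check it still runs.
                layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
                full = nn.TransformerEncoder(layer, num_layers=12)
                half = prune_encoder_layers(full, keep_every=2)
                tokens = torch.randn(2, 37, 256)            # (batch, sequence, embedding)
                assert half(tokens).shape == full(tokens).shape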

                Author and article information

                Contributors
                Journal
                Comput Intell Neurosci
                Comput Intell Neurosci
                cin
                Computational Intelligence and Neuroscience
                Hindawi
                1687-5265
                1687-5273
                2022
                28 September 2022
                : 2022
                : 8255763
                Affiliations
                School of Electronic Information, Zhongyuan University of Technology, Zhengzhou 450007, Henan, China
                Author notes

                Academic Editor: Mario Versaci

                Author information
                ORCID: https://orcid.org/0000-0003-1636-7416
                Article
                DOI: 10.1155/2022/8255763
                PMC: 9534625
                Copyright © 2022 Yongping Dan et al.

                This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 7 March 2022
                Revised: 5 August 2022
                Accepted: 16 August 2022
                Categories
                Research Article

                Neurosciences
