Sub-token ViT Embedding via Stochastic Resonance Transformers

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Vision Transformer (ViT) architectures represent images as collections of high-dimensional vectorized tokens, each corresponding to a rectangular non-overlapping patch. This representation trades spatial granularity for embedding dimensionality, and results in semantically rich but spatially coarsely quantized feature maps. In order to retrieve spatial details beneficial to fine-grained inference tasks we propose a training-free method inspired by "stochastic resonance". Specifically, we perform sub-token spatial transformations to the input data, and aggregate the resulting ViT features after applying the inverse transformation. The resulting "Stochastic Resonance Transformer" (SRT) retains the rich semantic information of the original representation, but grounds it on a finer-scale spatial domain, partly mitigating the coarse effect of spatial tokenization. SRT is applicable across any layer of any ViT architecture, consistently boosting performance on several tasks including segmentation, classification, depth estimation, and others by up to 14.9% without the need for any fine-tuning.

Related collections

Author and article information

Journal

Publication date Created: 05 October 2023

Publication date Updated: 2024-05-06

Article

ArXiV ID: 2310.03967

SO-VID: c02e7e88-1647-4b06-a958-5146e57d6a64

License:

http://creativecommons.org/licenses/by/4.0/

History

Custom metadata

Categories cs.CV cs.AI

ScienceOpen disciplines: Computer vision & Pattern recognition,Artificial intelligence

Data availability:

ScienceOpen disciplines: Computer vision & Pattern recognition, Artificial intelligence

Sub-token ViT Embedding via Stochastic Resonance Transformers

Read this article at

Abstract

Related collections

Semantic Knowledge Base

Author and article information

Journal

Article

History

Custom metadata

Comments

Comment on this article

Similar content 183