SELF-SUPERVISED VISION TRANSFORMERS FOR CROSS-MODAL LEARNING (REVIEW)

Received: February 28, 2025
Revised: March 28, 2025
Accepted: April 01, 2025
1, 2 Lviv Polytechnic National University

Computer vision systems are finding increasingly wide application in visual data analysis. Model training methods are the component undergoing the most intensive development and refinement, since the results of this stage largely determine the final classification of objects and the interpretation of input information. Typically, computer vision systems are trained with convolutional neural networks (CNNs). The disadvantages of such systems include substantial limitations in cross-modal learning and the implementation of multimodality, as well as the need to label large amounts of data. One way to overcome these problems is to use Vision Transformers (ViTs), which, compared with classical CNNs, achieve higher performance owing to their weaker inductive biases and highly efficient parallel computation. Introducing Self-Supervised Learning (SSL) significantly reduces the dependence on manually labeled data and contributes to the formation of generalized image representations, while Cross-Modal Learning (CML) extends processing capabilities by combining data of different types. Developing a new approach that unites cross-modal and self-supervised learning with ViT in a single architecture would ensure the adaptability, efficiency, and scalability of such systems across a wide range of applications. The aim of this research is to provide a detailed overview of ViTs, approaches to their architecture, and methods for improving their efficiency. The mathematical foundations of the key concepts of ViT, cross-modal learning, and self-supervised learning are considered, together with the main ViT modifications and their integration with SSL and CML technologies. The methods are compared in terms of their characteristics, performance, and efficiency. The key challenges and prospects facing researchers and developers of universal computer vision models are outlined. ViTs are changing computer vision by capturing global dependencies within images. Despite certain challenges, they offer excellent scalability and performance on large datasets, and the active search for methods to overcome their limitations makes ViTs a key tool for improving image classification, object detection, and other computer vision tasks.
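For orientation, the formal core of these concepts can be condensed as follows (a brief sketch in the notation of the cited works, not a substitute for the derivations given in the review). ViT [2] splits an image into fixed-size patches, embeds them as a token sequence, and processes that sequence with the scaled dot-product self-attention of the Transformer [5]; cross-modal pre-training in the spirit of CLIP [4] then aligns image and text embeddings with a contrastive, InfoNCE-style [3] objective (shown here in its image-to-text direction; the full loss is the symmetric sum of both directions):

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
\qquad
\mathcal{L}_{\mathrm{CML}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\left(\mathrm{sim}(v_i, t_i)/\tau\right)}{\sum_{j=1}^{N}\exp\!\left(\mathrm{sim}(v_i, t_j)/\tau\right)},
\]

where \(Q\), \(K\), \(V\) are the query, key, and value projections of the patch-token sequence, \(d_k\) is the key dimension, \(v_i\) and \(t_i\) are the embeddings of the \(i\)-th image and its paired text, \(\mathrm{sim}(\cdot,\cdot)\) is cosine similarity, and \(\tau\) is a temperature parameter.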

[1] R. Szeliski, “Computer Vision: Algorithms and Applications”, 2nd ed., Springer Cham, 2022, XXII, 925 p.

[2]  A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale”, 2021, arXiv preprint, arXiv:2010.11929. https://arxiv.org/abs/2010.11929

[3] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A Simple Framework for Contrastive Learning of Visual Representations”, Proc. 37th Int. Conf. on Machine Learning (ICML), 2020, P. 1597–1607.

[4] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, et al., “Learning Transferable Visual Models from Natural Language Supervision”, Proc. of the 38th Int. Conf. on Machine Learning (ICML), 2021, arXiv preprint, arXiv:2103.00020. https://arxiv.org/abs/2103.00020

[5] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention Is All You Need”, NIPS’17: Proc. of the 31st Int. Conf. on Neural Information Processing Systems, 2017, P. 6000–6010.

[6] H. Touvron, M. Cord, M. Douze, et al., “Training data-efficient image transformers & distillation through attention”, Proc. 38th Int. Conf. on Machine Learning (PMLR), 2021, Vol. 139, P. 10347–10357.

[7]  Z. Liu, Y. Lin, Y. Cao, and H. Hu, “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, 2021, arXiv preprint, arXiv:2103.14030. https://arxiv.org/pdf/2103.14030

[8] K. Han, A. Xiao, E. Wu, et al., “Transformer in Transformer”, Advances in Neural Information Processing Systems (NeurIPS), 2021, Vol. 34. https://arxiv.org/abs/2103.00112

[9] W. Wang, E. Xie, X. Li, et al., “Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions”, Proc. IEEE/CVF Int. Conf. on Computer Vision (ICCV), 2021, P. 568–578.

[10] H. Wu, B. Xiao, N. Codella, et al., “CvT: Introducing Convolutions to Vision Transformers”, Proc. IEEE/CVF Int. Conf. on Computer Vision (ICCV), 2021, P. 22–31.

[11] C. Jia, Y. Yang, Y. Xia, et al., “Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision”, Proc. of the 38th Int. Conf. on Machine Learning (PMLR), 2021, Vol. 139, P. 4904–4916.

[12] K. He, X. Chen, S. Xie, et al., “Masked Autoencoders Are Scalable Vision Learners”, arXiv preprint, arXiv:2111.06377, 2022. https://arxiv.org/abs/2111.06377

[13] A. Kirillov, E. Mintun, N. Ravi, et al., “Segment Anything”, Proc. of the IEEE/CVF Int. Conf. on Computer Vision (ICCV), 2023, https://doi.org/10.1109/ICCV51070.2023.00371

[14] M. Oquab, T. Darcet, T. Moutakanni, et al., “DINOv2: Learning Robust Visual Features without Supervision”, arXiv preprint, arXiv:2304.07193, 2023. https://arxiv.org/abs/2304.07193

[15] X. Liu, H. Peng, N. Zheng, et al., “EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention”, Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2023, P. 14421–14430. https://arxiv.org/abs/2305.07027

[16] S. Mehta and M. Rastegari, “MobileViT v2: An Efficient Neural Architecture for Mobile Vision”, 2023, arXiv preprint, arXiv:2206.02680. https://arxiv.org/abs/2206.02680

[17]  Y. Zhang, X. Li, S. Lin, and K. He, “You Only Need Less Attention at Each Stage in Vision Transformers”, Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2024.

[18] H. Guo, Y. Wang, Z. Ye, J. Dai, and Y. Xiong, “big.LITTLE ViT: Dynamic Token Routing for Vision Transformers”, arXiv preprint, arXiv:2410.10267, 2024. https://arxiv.org/abs/2410.10267

[19] H. Akbari, L. Yuan, R. Qian, et al., “VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text”, Proc. of the 35th Int. Conf. on Advances in Neural Information Processing Systems (NeurIPS), 2021, arXiv preprint, arXiv:2104.11178.

[20]  M. Caron, H. Touvron, I. Misra, et al., “Emerging Properties in Self-Supervised Vision Transformers”, 2021, arXiv preprint, arXiv:2104.14294.

[21] N. Mu, A. Kirillov, D. Wagner, and S. Xie, “SLIP: Self-supervision meets Language-Image Pre-training”, 2022, arXiv preprint, arXiv:2112.12750.

[22] J. Yu, Z. Wang, V. Vasudevan, et al., “CoCa: Contrastive Captioners are Image-Text Foundation Models”, 2022, arXiv preprint, arXiv:2205.01938.

[23]  R. Azad, A. Kazerouni, M. Heidari, et al., “Advances in medical image analysis with vision Transformers: A comprehensive review”, Medical Image Analysis, 2024, 91, 103000.

[24] D. Shan and G. Chen, “GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets”, 2024, arXiv preprint, arXiv:2404.04924.

[25]  R. Girdhar, A. El-Nouby, Z. Liu, et al., “ImageBind: One Embedding Space to Bind Them All”, Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2023, arXiv preprint, arXiv:2305.05665.