Vision Transformers; Self-Supervised Learning; Cross-Modal Learning; Computer Vision; Deep Learning

SELF-SUPERVISED VISION TRANSFORMERS FOR CROSS-MODAL LEARNING (REVIEW)

Computer vision systems are finding ever wider application in visual data analysis. Model training methods are developing most actively, because the results of the training stage significantly affect the final classification of objects and the interpretation of input information. Typically, the models trained in computer vision systems are convolutional neural networks (CNNs).