Autonomous navigation of unmanned aerial vehicles (UAVs) in unstructured industrial environments remains challenging due to irregular geometry, dynamic obstacles, and sensor uncertainty. Classical Simultaneous Localization and Mapping (SLAM) systems, though geometrically consistent, often fail under poor initialization and in textureless or reflective regions. To overcome these issues, this work proposes a hybrid transformer-geometric framework that fuses learned scene priors with keyframe-based SLAM. A TinyViT encoder and a lightweight multi-task decoder jointly estimate inverse depth, surface normals, and semantic segmentation, providing dense geometric and semantic cues that stabilize localization and mapping. These priors are incorporated into the SLAM optimization to improve convergence, reject dynamic objects, and aid relocalization. The system runs at approximately 1 FPS on a Raspberry Pi 5 CPU, which is adequate for keyframe-level inference rather than full frame-rate tracking. Experiments show robust localization and consistent mapping in cluttered, reflective, and dynamic industrial scenes, confirming that transformer-based dense perception effectively complements classical SLAM for resource-efficient UAV navigation.
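For concreteness, the sketch below shows one plausible wiring of the shared multi-task decoder: a TinyViT-style backbone produces a coarse feature map, and three lightweight heads predict inverse depth, surface normals, and segmentation logits at keyframe resolution. The channel widths, the Softplus activation used to keep inverse depth positive, and the class count are all illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskDecoder(nn.Module):
    """Shared lightweight decoder with three task-specific heads.

    Hypothetical sketch: channel widths, head design, and activations are
    illustrative assumptions, not the paper's exact configuration. Any
    backbone (e.g., TinyViT) that yields a [B, C, h, w] feature map can
    feed this decoder.
    """

    def __init__(self, in_ch: int = 320, mid_ch: int = 64, num_classes: int = 20):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.inv_depth = nn.Conv2d(mid_ch, 1, 1)            # 1-channel inverse depth
        self.normals = nn.Conv2d(mid_ch, 3, 1)              # 3-channel surface normals
        self.semantics = nn.Conv2d(mid_ch, num_classes, 1)  # per-class logits

    def forward(self, feats: torch.Tensor, out_hw: tuple):
        x = self.shared(feats)

        def up(t):  # upsample each prediction to the keyframe resolution
            return F.interpolate(t, size=out_hw, mode="bilinear", align_corners=False)

        inv_depth = F.softplus(up(self.inv_depth(x)))       # constrain inverse depth > 0
        normals = F.normalize(up(self.normals(x)), dim=1)   # unit-length normal vectors
        sem_logits = up(self.semantics(x))
        return inv_depth, normals, sem_logits


if __name__ == "__main__":
    # Dummy 1/16-resolution features for a 640x480 keyframe (shapes illustrative).
    decoder = MultiTaskDecoder()
    feats = torch.randn(1, 320, 30, 40)
    inv_depth, normals, sem = decoder(feats, out_hw=(480, 640))
    print(inv_depth.shape, normals.shape, sem.shape)
```

Sharing a single decoder trunk across the three heads keeps per-keyframe compute low, which matters when inference runs on a Raspberry Pi 5 CPU at roughly 1 FPS.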