Transformer-Based Network for Robust 3D Industrial Environment Understanding in Autonomous UAV Systems

2025; pp. 210–216
Received: October 15, 2025
Revised: December 11, 2025
Accepted: December 18, 2025
1–4 National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”

Autonomous navigation of unmanned aerial vehicles (UAVs) in unstructured industrial environments remains challenging due to irregular geometry, dynamic obstacles, and sensor uncertainty. Classical Simultaneous Localization and Mapping (SLAM) systems, though geometrically consistent, often fail under poor initialization, in textureless areas, or on reflective surfaces. To overcome these issues, this work proposes a hybrid transformer-geometric framework that fuses learned scene priors with keyframe-based SLAM. A TinyViT encoder and a lightweight multi-task decoder jointly estimate inverse depth, surface normals, and semantic segmentation, providing dense geometric and semantic cues that stabilize localization and mapping. These priors are incorporated into the SLAM optimization to improve convergence, reject dynamic objects, and aid relocalization. The system runs at roughly 1 FPS on a Raspberry Pi 5 CPU, which is sufficient for keyframe-level inference. Experiments show robust localization and consistent mapping in cluttered, reflective, and dynamic industrial scenes, confirming that transformer-based dense perception effectively complements classical SLAM for resource-efficient UAV navigation.
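To make the perception front end described above concrete, the following is a minimal PyTorch sketch of a ViT-style encoder feeding a lightweight multi-task decoder with inverse-depth, surface-normal, and segmentation heads. The encoder is a generic stand-in for TinyViT, and all module names, channel widths, and head designs are illustrative assumptions rather than the paper's implementation.

# Illustrative sketch only: a stand-in patch-embedding transformer encoder
# (in place of TinyViT) feeding a lightweight multi-task decoder with three
# heads: inverse depth, surface normals, and semantic segmentation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StandInEncoder(nn.Module):
    """Small ViT-style encoder used as a placeholder for the TinyViT backbone."""
    def __init__(self, in_ch=3, dim=192, depth=4, heads=3, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        feat = self.proj(x)                        # (B, dim, H/16, W/16)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)   # (B, HW, dim)
        tokens = self.blocks(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class MultiTaskDecoder(nn.Module):
    """Shared trunk with per-task heads; outputs are upsampled to the input size."""
    def __init__(self, dim=192, num_classes=20):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())
        self.inv_depth = nn.Conv2d(dim, 1, 1)            # non-negative inverse depth
        self.normals = nn.Conv2d(dim, 3, 1)              # unit-length surface normals
        self.semantics = nn.Conv2d(dim, num_classes, 1)  # per-pixel class logits

    def forward(self, feat, out_hw):
        f = self.trunk(feat)
        def up(t):
            return F.interpolate(t, size=out_hw, mode='bilinear', align_corners=False)
        inv_depth = F.softplus(up(self.inv_depth(f)))
        normals = F.normalize(up(self.normals(f)), dim=1)
        sem_logits = up(self.semantics(f))
        return inv_depth, normals, sem_logits


if __name__ == "__main__":
    encoder, decoder = StandInEncoder(), MultiTaskDecoder()
    rgb = torch.randn(1, 3, 224, 224)                # one keyframe image
    inv_depth, normals, sem = decoder(encoder(rgb), rgb.shape[-2:])
    print(inv_depth.shape, normals.shape, sem.shape)

In a keyframe-based pipeline such as the one summarized in the abstract, this forward pass would be invoked only on newly selected keyframes, and the resulting dense priors would then be passed to the SLAM optimization as additional constraints.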
