Inference-Time Optimization for Fast and Accurate Visual Object Tracking

Received: August 28, 2025
Revised: October 28, 2025
Accepted: November 20, 2025

Borsuk V., Yakovyna V.  Inference-Time Optimization for Fast and Accurate Visual Object Tracking.  Mathematical Modeling and Computing. Vol. 12, No. 4, pp. 1211–1220 (2025) 

1 Lviv Polytechnic National University
2 Lviv Polytechnic National University

Visual object tracking has recently benefited from the adoption of transformer architectures, which provide strong modeling capacity but incur high computational and memory costs, limiting real-time deployment.  Existing efficiency-focused trackers primarily address this challenge through architectural redesign, often trading accuracy for speed. In this work, we explore an alternative and complementary direction: inference-time optimization.  Using HiT as our baseline, we integrate memory-efficient attention into its hierarchical transformer blocks, reducing high-bandwidth memory accesses during self-attention without altering the model's representational capacity.  Our experiments show that the proposed optimization reduces average latency from 7.82 ms to 6.45 ms, increasing throughput from 127 FPS to 155 FPS, while preserving tracking accuracy.  These results demonstrate that inference-time optimizations can significantly improve the practicality of transformer-based trackers for real-time applications, opening a new path toward efficient, high-performance tracking beyond architectural modifications.
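As a rough illustration of the class of optimization described above (not the authors' actual HiT integration, which is not reproduced here), the following PyTorch sketch contrasts naive self-attention, which materializes the full attention matrix in high-bandwidth memory, with the fused scaled_dot_product_attention available in PyTorch 2.0+, which computes the same result with FlashAttention-style tiling. The tensor shapes and function names are assumptions made for the example.

    import torch
    import torch.nn.functional as F

    def naive_attention(q, k, v):
        # Materializes the full (L x L) score matrix in GPU memory; the
        # reads and writes of this intermediate dominate latency as the
        # token count L grows.
        scale = q.shape[-1] ** -0.5
        scores = (q @ k.transpose(-2, -1)) * scale
        return scores.softmax(dim=-1) @ v

    def fused_attention(q, k, v):
        # Mathematically identical computation, dispatched to a fused
        # FlashAttention-style / memory-efficient kernel that streams
        # tiles of Q, K, V through on-chip memory instead of writing the
        # L x L matrix to high-bandwidth memory.
        return F.scaled_dot_product_attention(q, k, v)

    # Illustrative sizes: batch 1, 8 heads, 320 tokens, head dim 64.
    q = k = v = torch.randn(1, 8, 320, 64)
    assert torch.allclose(naive_attention(q, k, v),
                          fused_attention(q, k, v), atol=1e-5)

Because the fused kernel computes exact (not approximate) attention, swapping it into a trained tracker requires no retraining and leaves the model's outputs unchanged, which is what lets this kind of inference-time optimization preserve accuracy while cutting latency.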

  1. Chen X., Yan B., Zhu J., Wang D., Yang X., Lu H.  Transformer Tracking.  2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).  8122–8131 (2021).
  2. Yan B., Peng H., Fu J., Wang D., Lu H.  Learning Spatio-Temporal Transformer for Visual Tracking.  2021 IEEE/CVF International Conference on Computer Vision (ICCV).  10428–10437 (2021).
  3. Cui Y., Cheng J., Wang L., Wu G.  MixFormer: End-to-End Tracking with Iterative Mixed Attention.  2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).  13598–13608 (2022).
  4. Fan H., Bai H., Lin L., Yang F., Chu P., Deng G., Yu S., Harshit, Huang M., Liu J., Xu Y., Liao C., Yuan L., Ling H.  LaSOT: A High-quality Large-scale Single Object Tracking Benchmark.  International Journal of Computer Vision.  129, 439–461 (2021).
  5. Huang L., Zhao X., Huang K.  GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild.  IEEE Transactions on Pattern Analysis and Machine Intelligence.  43 (5), 1562–1577 (2021).
  6. Borsuk V., Vei R., Kupyn O., Martyniuk T., Krashenyi I., Matas J.  FEAR: Fast, Efficient, Accurate and Robust Visual Tracker.  Computer Vision – ECCV 2022.  644–663 (2022).
  7. Kang B., Chen X., Wang D., Peng H., Lu H.  Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking.  2023 IEEE/CVF International Conference on Computer Vision (ICCV).  9578–9587 (2023).
  8. Yan B., Peng H., Wu K., Wang D., Fu J., Lu H.  LightTrack: Finding Lightweight Neural Networks for Object Tracking via One-Shot Architecture Search.  2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).  15175–15184 (2021).
  9. Kristan M., Matas J., Leonardis A., Vojíř T., Pflugfelder R., Fernández G., Nebehay G., Porikli F., Čehovin L.  A Novel Performance Evaluation Methodology for Single-Target Trackers.  IEEE Transactions on Pattern Analysis and Machine Intelligence.  38 (11), 2137–2155 (2016).
  10. Wu Y., Lim J., Yang M.-H.  Online Object Tracking: A Benchmark.  2013 IEEE Conference on Computer Vision and Pattern Recognition.  2411–2418 (2013).
  11. Bertinetto L., Valmadre J., Golodetz S., Miksik O., Torr P. H. S.  Staple: Complementary Learners for Real-Time Tracking.  2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).  1401–1409 (2016).
  12. Henriques J. F., Caseiro R., Martins P., Batista J.  High-Speed Tracking with Kernelized Correlation Filters.  IEEE Transactions on Pattern Analysis and Machine Intelligence.  37 (3), 583–596 (2015).
  13. Vojir T., Noskova J., Matas J.  Robust Scale-Adaptive Mean-Shift for Tracking.  Image Analysis.  652–663 (2013).
  14. Bertinetto L., Valmadre J., Henriques J. F., Vedaldi A., Torr P. H. S.  Fully-Convolutional Siamese Networks for Object Tracking.  Computer Vision – ECCV 2016 Workshops.  850–865 (2016).
  15. Tian Z., Shen C., Chen H., He T.  FCOS: Fully Convolutional One-Stage Object Detection.  2019 IEEE/CVF International Conference on Computer Vision (ICCV).  9626–9635 (2019).
  16. Ye B., Chang H., Ma B., Shan S., Chen X.  Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework.  Computer Vision – ECCV 2022.  341–357 (2022).
  17. Chen X., Wang D., Li D., Lu H.  Efficient Visual Tracking via Hierarchical Cross-Attention Transformer.  Computer Vision – ECCV 2022 Workshops.  461–477 (2022).
  18. Dao T., Fu D. Y., Ermon S., Rudra A., Ré C.  FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.  Preprint arXiv:2205.14135 (2022).
  19. Jacob B., Kligys S., Chen B., Zhu M., Tang M., Howard A., Adam H., Kalenichenko D.  Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.  2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.  2704–2713 (2018).
  20. Molchanov P., Tyree S., Karras T., Aila T., Kautz J.  Pruning Convolutional Neural Networks for Resource Efficient Inference.  Preprint arXiv:1611.06440 (2016).
  21. Huerta R., Shoushtary M. A., Cruz J. L., Gonzalez A.  Dissecting and Modeling the Architecture of Modern GPU Cores.  MICRO '25: Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture.  369–384 (2025).
  22. Jouppi N. P., Young C., Patil N., Patterson D., Agrawal G., Bajwa R., Bates S., Bhatia S., Boden N., Borchers A., et al.  In-Datacenter Performance Analysis of a Tensor Processing Unit.  ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture.  1–12 (2017).