Inference-Time Optimization for Fast and Accurate Visual Object Tracking

Received: August 28, 2025
Revised: October 28, 2025
Accepted: November 20, 2025

Borsuk V., Yakovyna V.  Inference-Time Optimization for Fast and Accurate Visual Object Tracking.  Mathematical Modeling and Computing. Vol. 12, No. 4, pp. 1211–1220 (2025) 

1 Lviv Polytechnic National University
2 Lviv Polytechnic National University

Visual object tracking has recently benefited from the adoption of transformer architectures, which provide strong modeling capacity but incur high computational and memory costs, limiting real-time deployment.  Existing efficiency-focused trackers primarily address this challenge through architectural redesign, often trading accuracy for speed. In this work, we explore an alternative and complementary direction: inference-time optimization.  Using HiT as our baseline, we integrate memory-efficient attention into its hierarchical transformer blocks, reducing high-bandwidth memory accesses during self-attention without altering the model's representational capacity.  Our experiments show that the proposed optimization reduces average latency from 7.82 ms to 6.45 ms, increasing throughput from 127 FPS to 155 FPS, while preserving tracking accuracy.  These results demonstrate that inference-time optimizations can significantly improve the practicality of transformer-based trackers for real-time applications, opening a new path toward efficient, high-performance tracking beyond architectural modifications.
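As a rough illustration of the class of optimization described above (not the authors' actual HiT integration, which is not reproduced here), the following PyTorch sketch contrasts naive self-attention, which materializes the full attention matrix in high-bandwidth memory, with the fused scaled_dot_product_attention available in PyTorch 2.0+, which computes the same result with FlashAttention-style tiling. The tensor shapes and function names are assumptions made for the example.

    import torch
    import torch.nn.functional as F

    def naive_attention(q, k, v):
        # Materializes the full (L x L) score matrix in GPU memory; the
        # reads and writes of this intermediate dominate latency as the
        # token count L grows.
        scale = q.shape[-1] ** -0.5
        scores = (q @ k.transpose(-2, -1)) * scale
        return scores.softmax(dim=-1) @ v

    def fused_attention(q, k, v):
        # Mathematically identical computation, dispatched to a fused
        # FlashAttention-style / memory-efficient kernel that streams
        # tiles of Q, K, V through on-chip memory instead of writing the
        # L x L matrix to high-bandwidth memory.
        return F.scaled_dot_product_attention(q, k, v)

    # Illustrative sizes: batch 1, 8 heads, 320 tokens, head dim 64.
    q = k = v = torch.randn(1, 8, 320, 64)
    assert torch.allclose(naive_attention(q, k, v),
                          fused_attention(q, k, v), atol=1e-5)

Because the fused kernel computes exact (not approximate) attention, swapping it into a trained tracker requires no retraining and leaves the model's outputs unchanged, which is what lets this kind of inference-time optimization preserve accuracy while cutting latency.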

  1. Chen X., Yan B., Zhu J., Wang D., Yang X., Lu H.  Transformer Tracking.  2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).  8122–8131 (2021).
  2. Yan B., Peng H., Fu J., Wang D., Lu H.  Learning Spatio-Temporal Transformer for Visual Tracking.  2021 IEEE/CVF International Conference on Computer Vision (ICCV).  10428–10437 (2021).
  3. Cui Y., Cheng J., Wang L., Wu G.  MixFormer: End-to-End Tracking with Iterative Mixed Attention.  2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).  13598–13608 (2022).
  4. Fan H., Bai H., Lin L., Yang F., Chu P., Deng G., Yu S., Harshit, Huang M., Liu J., Xu Y., Liao C., Yuan L., Ling H.  LaSOT: A High-quality Large-scale Single Object Tracking Benchmark.  International Journal of Computer Vision.  129, 439–461 (2021).
  5. Huang L., Zhao X., Huang K.  GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild.  IEEE Transactions on Pattern Analysis and Machine Intelligence.  43 (5), 1562–1577 (2021).
  6. Borsuk V., Vei R., Kupyn O., Martyniuk T., Krashenyi I., Matas J.  FEAR: Fast, Efficient, Accurate and Robust Visual Tracker.  Computer Vision – ECCV 2022.  644–663 (2022).
  7. Kang B., Chen X., Wang D., Peng H., Lu H.  Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking.  2023 IEEE/CVF International Conference on Computer Vision (ICCV).  9578–9587 (2023).
  8. Yan B., Peng H., Wu K., Wang D., Fu J., Lu H.  LightTrack: Finding Lightweight Neural Networks for Object Tracking via One-Shot Architecture Search.  2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).  15175–15184 (2021).
  9. Kristan M., Matas J., Leonardis A., Vojíř T., Pflugfelder R., Fernández G., Nebehay G., Porikli F., Čehovin L.  A Novel Performance Evaluation Methodology for Single-Target Trackers.  IEEE Transactions on Pattern Analysis and Machine Intelligence.  38 (11), 2137–2155 (2016).
  10. Wu Y., Lim J., Yang M.-H.  Online Object Tracking: A Benchmark.  2013 IEEE Conference on Computer Vision and Pattern Recognition.  2411–2418 (2013).
  11. Bertinetto L., Valmadre J., Golodetz S., Miksik O., Torr P. H. S.  Staple: Complementary Learners for Real-Time Tracking.  2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).  1401–1409 (2016).
  12. Henriques J. F., Caseiro R., Martins P., Batista J.  High-Speed Tracking with Kernelized Correlation Filters.  IEEE Transactions on Pattern Analysis and Machine Intelligence.  37 (3), 583–596 (2015).
  13. Vojir T., Noskova J., Matas J.  Robust Scale-Adaptive Mean-Shift for Tracking.  Image Analysis.  652–663 (2013).
  14. Bertinetto L., Valmadre J., Henriques J. F., Vedaldi A., Torr P. H. S.  Fully-Convolutional Siamese Networks for Object Tracking.  Computer Vision – ECCV 2016 Workshops.  850–865 (2016).
  15. Tian Z., Shen C., Chen H., He T.  FCOS: Fully Convolutional One-Stage Object Detection.  2019 IEEE/CVF International Conference on Computer Vision (ICCV).  9626–9635 (2019).
  16. Ye B., Chang H., Ma B., Shan S., Chen X.  Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework.  Computer Vision – ECCV 2022.  341–357 (2022).
  17. Chen X., Wang D., Li D., Lu H.  Efficient Visual Tracking via Hierarchical Cross-Attention Transformer.  Computer Vision – ECCV 2022 Workshops.  461–477 (2022).
  18. Dao T., Fu D. Y., Ermon S., Rudra A., Ré C.  FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.  Preprint arXiv:2205.14135 (2022).
  19. Jacob B., Kligys S., Chen B., Zhu M., Tang M., Howard A., Adam H., Kalenichenko D.  Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.  2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.  2704–2713 (2018).
  20. Molchanov P., Tyree S., Karras T., Aila T., Kautz J.  Pruning Convolutional Neural Networks for Resource Efficient Inference.  Preprint arXiv:1611.06440 (2016).
  21. Huerta R., Shoushtary M. A., Cruz J. L., Gonzalez A.  Dissecting and Modeling the Architecture of Modern GPU Cores.  MICRO '25: Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture.  369–384 (2025).
  22. Jouppi N. P., Young C., Patil N., Patterson D., Agrawal G., Bajwa R., Bates S., Bhatia S., Boden N., Borchers A., et al.  In-Datacenter Performance Analysis of a Tensor Processing Unit.  ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture.  1–12 (2017).