In this paper, we address the challenge of visual geo-localization from low-quality UAV imagery captured in real-world environments. We propose a two-stage architecture comprising Super-Resolution and visual geo-localization. We introduce a novel, non-learnable Ensemble Super-Resolution (ESR) module that first refines upscaled aerial frames and then feeds the enhanced imagery into a visual geo-localization pipeline. Designed as a parallelizable block that can be integrated directly into any SR computation graph, ESR combines classical Bicubic interpolation with neural SR models, boosting image fidelity and overall system accuracy without additional training while executing efficiently on most hardware accelerators. We validate our approach on a dataset of 37,000 real-world UAV images, each downscaled by a factor of four and then restored via baseline methods (Bicubic, Bilinear, Nearest Neighbour, DRCT, HMA, HAT, SwinFIR) as well as our ESR-enhanced pipeline. Quantitative evaluation shows that standalone Super-Resolution methods yield PSNR values in the low 20 dB range and SSIM of 0.6–0.7, far below standard benchmarks, leading to a marked drop in geo-localization accuracy (Recall@1 and AP).
In contrast, our ESR module stabilizes SR outputs and recovers image fidelity, raising geo-localization Recall@1 to 87.0% (vs. 84.96% with HMA restoration) and AP to 89.1% (vs. 87.41% with HMA restoration).
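A minimal sketch of this degrade-and-restore evaluation protocol is shown below. It assumes OpenCV and scikit-image as the measurement tooling and uses Bicubic interpolation as a stand-in for any of the listed baselines; the function names and file paths are illustrative assumptions rather than the exact implementation used in our experiments.

```python
# Sketch of the x4 degrade-restore protocol: downscale a UAV frame, restore it,
# and measure PSNR/SSIM against the original. Bicubic stands in for any SR baseline.
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim  # assumes scikit-image >= 0.19

def psnr(ref: np.ndarray, est: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a reference and a restored frame."""
    mse = np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def evaluate_restoration(hr_path: str, scale: int = 4) -> tuple[float, float]:
    hr = cv2.imread(hr_path)                                    # original UAV frame (uint8 BGR)
    h, w = hr.shape[:2]
    lr = cv2.resize(hr, (w // scale, h // scale),
                    interpolation=cv2.INTER_AREA)               # simulate low-quality capture
    sr = cv2.resize(lr, (w, h), interpolation=cv2.INTER_CUBIC)  # baseline Bicubic restoration
    return psnr(hr, sr), ssim(hr, sr, channel_axis=-1)

# Usage (hypothetical path): psnr_db, ssim_val = evaluate_restoration("frame_0001.png")
```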
Our contributions are:
A two-stage framework combining image Super-Resolution and visual geo-localization, tailored to low-resolution, noisy UAV data.
A non-learnable, parallelizable ESR block that fuses Bicubic interpolation with neural restoration within the Super-Resolution network graph, requiring no retraining and fully compatible with most accelerators (a minimal sketch follows this list).
A comprehensive empirical study demonstrating that ESR substantially narrows the domain gap and boosts geo-localization performance in real-world conditions.
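The sketch below illustrates the non-learnable fusion idea behind the ESR block, assuming a PyTorch pipeline with NCHW tensors. The uniform averaging weights, the class name EnsembleSR, and the wrapped model handles are illustrative assumptions, not the exact formulation of our module.

```python
# Minimal sketch of a non-learnable ensemble SR block: the bicubically upscaled
# input and the outputs of the wrapped neural SR branches are averaged pixel-wise,
# so the block adds no trainable parameters and needs no retraining.
import torch
import torch.nn.functional as F

class EnsembleSR(torch.nn.Module):
    def __init__(self, sr_models: list[torch.nn.Module], scale: int = 4):
        super().__init__()
        self.sr_models = torch.nn.ModuleList(sr_models)  # pretrained SR branches (e.g. HAT, DRCT)
        self.scale = scale

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        # Classical branch: plain bicubic upsampling of the low-resolution frame.
        bicubic = F.interpolate(lr, scale_factor=self.scale,
                                mode="bicubic", align_corners=False)
        # Neural branches can run in parallel; all outputs share the upscaled shape.
        branches = [bicubic] + [m(lr) for m in self.sr_models]
        # Uniform, non-learnable fusion of all branches.
        return torch.stack(branches, dim=0).mean(dim=0)

# Usage (hypothetical models): esr = EnsembleSR([pretrained_hat, pretrained_drct]); hr = esr(lr_batch)
```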
We conclude that embedding lightweight, hardware-agnostic ensemble strategies into SR pipelines is a promising direction for robust UAV-based visual localization. Future work will explore adaptive ensemble weighting and domain-aware SR architectures to further mitigate aerial imaging noise and variability.
[1] Zheng, Z., Wei, Y., & Yang, Y. (2020). University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization. In Proceedings of the 28th ACM International Conference on Multimedia (pp. 1395-1403). Association for Computing Machinery. https://doi.org/10.1145/3394171.3413896
[2] Li, K., Yang, S., Dong, R., Wang, X., & Huang, J. (2020). Survey of single image super-resolution reconstruction. IET Image Processing, 14(11), 2273-2290. https://doi.org/10.1049/iet-ipr.2019.1438
[3] Hsu, C.C., Lee, C.M., & Chou, Y.S. (2024). DRCT: Saving Image Super-Resolution Away from Information Bottleneck. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (pp. 6133-6142). https://doi.org/10.1109/CVPRW63382.2024.00618
[4] Chen, X., Wang, X., Zhou, J., Qiao, Y., & Dong, C. (2023). Activating More Pixels in Image Super-Resolution Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 22367-22377). https://doi.org/10.1109/CVPR52729.2023.02142
[5] Chu, S. C., Dou, Z. C., Pan, J. S., Weng, S., & Li, J. (2024). HMANet: Hybrid Multi-Axis Aggregation Network for Image Super-Resolution. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 6257-6266). https://doi.org/10.1109/CVPRW63382.2024.00629
[6] Zhang, D., Huang, F., Liu, S., Wang, X., & Jin, Z. (2023). SwinFIR: Revisiting the SwinIR with Fast Fourier Convolution and Improved Training for Image Super-Resolution. https://doi.org/10.48550/arXiv.2208.11247
[7] Deuser, F., Habel, K., & Oswald, N. (2023). Sample4Geo: Hard Negative Sampling For Cross-View Geo-Localisation. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 16801-16810). https://doi.org/10.1109/ICCV51070.2023.01545
[8] Lin, J., Zheng, Z., Zhong, Z., Luo, Z., Li, S., Yang, Y., & Sebe, N. (2022). Joint Representation Learning and Keypoint Detection for Cross-View Geo-Localization. IEEE Transactions on Image Processing, 31, 3780-3792. https://doi.org/10.1109/TIP.2022.3175601
[9] Wang, T., Zheng, Z., Yan, C., Zhang, J., Sun, Y., Zheng, B., & Yang, Y. (2022). Each Part Matters: Local Patterns Facilitate Cross-View Geo-Localization. IEEE Transactions on Circuits and Systems for Video Technology, 32(2), 867-879. https://doi.org/10.1109/TCSVT.2021.3061265
[10] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778). https://doi.org/10.1109/CVPR.2016.90
[11] Dai, M., Hu, J., Zhuang, J., & Zheng, E. (2022). A Transformer-Based Feature Segmentation and Region Alignment Method for UAV-View Geo-Localization. IEEE Transactions on Circuits and Systems for Video Technology, 32(7), 4376-4389. https://doi.org/10.1109/TCSVT.2021.3135013
[12] van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748. https://doi.org/10.48550/arXiv.1807.03748
[13] Yang, J., Wright, J., Huang, T., & Ma, Y. (2010). Image Super-Resolution Via Sparse Representation. IEEE Transactions on Image Processing, 19(11), 2861-2873. https://doi.org/10.1109/TIP.2010.2050625
[14] Dong, C., Loy, C., He, K., & Tang, X. (2016). Image Super-Resolution Using Deep Convolutional Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2), 295-307. https://doi.org/10.1109/TPAMI.2015.2439281
[15] Dong, C., Loy, C. C., & Tang, X. (2016). Accelerating the Super-Resolution Convolutional Neural Network. In Computer Vision - ECCV 2016 (pp. 391-407). Springer International Publishing. https://doi.org/10.1007/978-3-319-46475-6_25
[16] Kim, J., Lee, J., & Lee, K. (2016). Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1646-1654). https://doi.org/10.1109/CVPR.2016.182
[17] Kim, J., Lee, J., & Lee, K. (2016). Deeply-Recursive Convolutional Network for Image Super-Resolution. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1637-1645). https://doi.org/10.1109/CVPR.2016.181
[18] Lim, B., Son, S., Kim, H., Nah, S., & Lee, K. (2017). Enhanced Deep Residual Networks for Single Image Super-Resolution. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 1132-1140). https://doi.org/10.1109/CVPRW.2017.151
[19] Zhang, Y., Tian, Y., Kong, Y., Zhong, B., & Fu, Y. (2018). Residual Dense Network for Image Super-Resolution. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2472-2481). https://doi.org/10.1109/CVPR.2018.00262
[20] Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., & Fu, Y. (2018). Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the European Conference on Computer Vision (ECCV). https://doi.org/10.1007/978-3-030-01234-2_18
[21] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., & Timofte, R. (2021). SwinIR: Image Restoration Using Swin Transformer. In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) (pp. 1833-1844). https://doi.org/10.1109/ICCVW54120.2021.00210
[22] Chen, K., Li, L., Liu, H., Li, Y., Tang, C., & Chen, J. (2023). SwinFSR: Stereo Image Super-Resolution using SwinIR and Frequency Domain Knowledge. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 1764-1774). https://doi.org/10.1109/CVPRW59228.2023.00177