Contrastive Language-Image Pre-Training (CLIP) in E-Commerce: Applications, Methodologies, and Performance

2025; pp. 100-104
1 Lviv Polytechnic National University, Ukraine
2 Lviv Polytechnic National University, Ukraine; Uniwersytet Rolniczy im. Hugona Kołłątaja
3 Lviv Polytechnic National University, Ukraine; Comenius University Bratislava

This article examines the architecture and applications of the Contrastive Language-Image Pre-training (CLIP) model in e-commerce, focusing on three key tasks: visual search, product recommendation, and attribute extraction. It analyzes the methodologies used to adapt CLIP to these tasks and the datasets employed. The article highlights CLIP's distinctive capabilities, namely contrastive pre-training and the zero-shot learning it supports, and their potential impact on the industry, while acknowledging limitations such as the 'domain gap' and the resulting need for adaptation strategies. It also compares CLIP with traditional and other multimodal AI techniques and outlines future research directions for improving its performance in specialized e-commerce contexts.
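As an illustration of the zero-shot capability mentioned above, the following minimal sketch scores one product image against a set of candidate text prompts by cosine similarity in a shared embedding space. It is not the article's implementation: the embeddings are hand-made toy vectors standing in for the outputs of CLIP's image and text encoders, and the function names, the prompts, and the temperature value are all assumptions chosen for the example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_scores(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one image and N text prompts.

    `temperature` is an assumed illustrative value, not a setting taken
    from the article.
    """
    image_emb = l2_normalize(np.asarray(image_emb, dtype=float))
    text_embs = l2_normalize(np.asarray(text_embs, dtype=float))
    logits = text_embs @ image_emb / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Hypothetical prompts and toy 3-dimensional embeddings (assumptions);
# a real system would obtain these from the text and image encoders.
labels = [
    "a photo of a red dress",
    "a photo of running shoes",
    "a photo of a leather bag",
]
text_embs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
])
image_emb = np.array([0.8, 0.2, 0.1])

probs = zero_shot_scores(image_emb, text_embs)
best_label = labels[int(np.argmax(probs))]
```

No category-specific training is involved in this sketch: adding a new product category only requires adding a new prompt to `labels` along with its text embedding.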
