Pitfalls of Training Generative Models for Video: From Mode Collapse to Unstable Dynamics

2025, pp. 89–94

Lviv Polytechnic National University, Department of Electronic Computational Machines

This paper analyzes common pitfalls encountered during video GAN training and explores methods to mitigate them through hybrid loss functions. We focus on combining adversarial, pixel-wise reconstruction, perceptual, and temporal consistency losses to stabilize learning and improve the realism and coherence of generated video. An empirical study compares several loss configurations on a human action video dataset, using PSNR, LPIPS, FVD, and a custom temporal consistency metric. Results show that adding reconstruction and perceptual losses enhances fidelity and detail, while the temporal loss reduces flicker and motion artifacts. The proposed hybrid loss achieves balanced gains in fidelity and temporal stability.
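
As a minimal sketch of how such a hybrid objective can be assembled, the following PyTorch-style function sums adversarial, pixel-wise reconstruction, perceptual, and temporal consistency terms. The weights (w_adv, w_rec, w_perc, w_temp), the perceptual_fn callback, and the frame-difference temporal term are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def hybrid_generator_loss(fake_logits, fake_video, real_video,
                          perceptual_fn,
                          w_adv=1.0, w_rec=10.0, w_perc=1.0, w_temp=5.0):
    """Hypothetical hybrid generator loss for a video GAN.

    fake_video, real_video: tensors of shape (B, T, C, H, W);
    fake_logits: discriminator outputs for the generated clips;
    perceptual_fn: caller-supplied feature-space distance
    (e.g., a frozen VGG- or LPIPS-style network).
    """
    # Adversarial term: non-saturating GAN loss on discriminator logits.
    adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))

    # Pixel-wise reconstruction term: L1 between generated and target frames.
    rec = F.l1_loss(fake_video, real_video)

    # Perceptual term: distance in a learned feature space.
    perc = perceptual_fn(fake_video, real_video)

    # Temporal consistency term: penalize frame-to-frame differences of the
    # generated video that deviate from those of the reference video.
    fake_diff = fake_video[:, 1:] - fake_video[:, :-1]
    real_diff = real_video[:, 1:] - real_video[:, :-1]
    temp = F.l1_loss(fake_diff, real_diff)

    return w_adv * adv + w_rec * rec + w_perc * perc + w_temp * temp
```

In practice, the relative weights would be tuned per dataset; overly large w_temp can oversmooth motion, while a dominant adversarial term tends to reintroduce flicker.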
