Pitfalls of Training Generative Models for Video: From Mode Collapse to Unstable Dynamics

2025, pp. 89–94

Lviv Polytechnic National University, Department of Electronic Computational Machines

This paper analyzes common pitfalls encountered during video GAN training and explores methods to mitigate them through hybrid loss functions. We focus on combining adversarial, pixel-wise reconstruction, perceptual, and temporal consistency losses to stabilize learning and improve the realism and coherence of generated video. An empirical study compares several loss configurations on a human action video dataset, using PSNR, LPIPS, FVD, and a custom temporal consistency metric. Results show that adding reconstruction and perceptual losses enhances fidelity and detail, while the temporal loss reduces flicker and motion artifacts. The proposed hybrid loss achieves balanced gains in fidelity and temporal stability.
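
As a minimal sketch of how such a hybrid objective can be assembled, the following PyTorch-style function sums adversarial, pixel-wise reconstruction, perceptual, and temporal consistency terms. The weights (w_adv, w_rec, w_perc, w_temp), the perceptual_fn callback, and the frame-difference temporal term are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def hybrid_generator_loss(fake_logits, fake_video, real_video,
                          perceptual_fn,
                          w_adv=1.0, w_rec=10.0, w_perc=1.0, w_temp=5.0):
    """Hypothetical hybrid generator loss for a video GAN.

    fake_video, real_video: tensors of shape (B, T, C, H, W);
    fake_logits: discriminator outputs for the generated clips;
    perceptual_fn: caller-supplied feature-space distance
    (e.g., a frozen VGG- or LPIPS-style network).
    """
    # Adversarial term: non-saturating GAN loss on discriminator logits.
    adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))

    # Pixel-wise reconstruction term: L1 between generated and target frames.
    rec = F.l1_loss(fake_video, real_video)

    # Perceptual term: distance in a learned feature space.
    perc = perceptual_fn(fake_video, real_video)

    # Temporal consistency term: penalize frame-to-frame differences of the
    # generated video that deviate from those of the reference video.
    fake_diff = fake_video[:, 1:] - fake_video[:, :-1]
    real_diff = real_video[:, 1:] - real_video[:, :-1]
    temp = F.l1_loss(fake_diff, real_diff)

    return w_adv * adv + w_rec * rec + w_perc * perc + w_temp * temp
```

In practice, the relative weights would be tuned per dataset; overly large w_temp can oversmooth motion, while a dominant adversarial term tends to reintroduce flicker.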
