Pitfalls of Training Generative Models for Video: From Mode Collapse to Unstable Dynamics
This paper analyzes common pitfalls encountered when training generative adversarial networks (GANs) for video, from mode collapse to unstable training dynamics, and examines how hybrid loss functions can mitigate them. We combine adversarial, pixel-wise reconstruction, perceptual, and temporal consistency losses to stabilize learning and to improve the realism and temporal coherence of generated video. An empirical study compares several loss configurations on a human action video dataset, evaluating each with PSNR, LPIPS, FVD, and a custom temporal consistency metric.
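As a minimal sketch of how such a hybrid objective can be assembled, the PyTorch-style function below sums weighted adversarial, L1 reconstruction, perceptual, and frame-difference temporal terms. The function name, the loss weights, and the frame-difference formulation of temporal consistency are illustrative assumptions for exposition, not the exact configuration studied in the paper; the perceptual network is passed in as an arbitrary fixed feature extractor.

```python
import torch
import torch.nn.functional as F

def hybrid_generator_loss(fake_logits, fake_video, real_video,
                          perceptual_net=None,
                          w_adv=1.0, w_rec=10.0, w_perc=1.0, w_temp=1.0):
    """Illustrative hybrid generator objective (weights are assumptions).

    fake_video, real_video: (B, T, C, H, W) tensors on the same scale.
    fake_logits: discriminator logits for the generated clip.
    perceptual_net: any frozen per-frame feature extractor, or None.
    """
    # Adversarial term: non-saturating GAN loss on the generator side.
    adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))

    # Pixel-wise reconstruction; L1 is a common choice over L2 for sharpness.
    rec = F.l1_loss(fake_video, real_video)

    # Perceptual term: feature-space distance under a fixed pretrained net,
    # computed per frame by folding the time axis into the batch axis.
    perc = torch.tensor(0.0, device=fake_video.device)
    if perceptual_net is not None:
        b, t, c, h, w = fake_video.shape
        f_fake = perceptual_net(fake_video.reshape(b * t, c, h, w))
        f_real = perceptual_net(real_video.reshape(b * t, c, h, w))
        perc = F.l1_loss(f_fake, f_real)

    # Temporal consistency (assumed formulation): match frame-to-frame
    # differences of the generated clip to those of the real clip.
    temp = F.l1_loss(fake_video[:, 1:] - fake_video[:, :-1],
                     real_video[:, 1:] - real_video[:, :-1])

    return w_adv * adv + w_rec * rec + w_perc * perc + w_temp * temp
```

In practice, the relative weights control the stability/fidelity trade-off the paper investigates: a large reconstruction weight tends to stabilize training early on, while the adversarial and perceptual terms drive realism and the temporal term penalizes flicker between frames.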