[B] Controlling Video Generation for Diffusion Models

Master Assignment

Type: Master EE/CS/ITC

Period: TBD

Student: (Unassigned)

If you are interested, please contact:

Background:

Diffusion probabilistic models [1] have achieved impressive results in synthesising high-quality images [2,3], surpassing previous generative models such as GANs [4] on most vision-based generation tasks. Diffusion models treat generation as iterative denoising: starting from pure noise, the model gradually removes noise until a clean sample emerges. Their success has also extended to videos [5,6], where in most cases a large-scale image diffusion model is adapted to produce consistent motion across frames (e.g. with additional conditioning [7]).
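To make the denoising idea concrete, below is a minimal sketch of the DDPM reverse (sampling) loop in the spirit of [1], written in PyTorch. The names (`eps_model`), the linear beta schedule, and the small step count are illustrative assumptions; in practice the denoiser is a trained U-Net conditioned on the timestep, not the untrained placeholder used here.

```python
import torch

# Minimal sketch of the DDPM reverse (denoising) loop, in the spirit of [1].
# `eps_model` stands in for a trained noise-prediction network (in practice a
# U-Net conditioned on the timestep); here it is an untrained placeholder.
T = 50                                            # number of diffusion steps (kept small for the sketch)
betas = torch.linspace(1e-4, 0.02, T)             # linear noise schedule (assumption)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

eps_model = torch.nn.Conv2d(3, 3, 3, padding=1)   # placeholder denoiser

@torch.no_grad()
def sample(shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                        # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = eps_model(x)                        # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise   # one reverse denoising step
    return x                                      # the "denoised" sample

image = sample()
print(image.shape)                                # torch.Size([1, 3, 64, 64])
```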

The field of video generation is growing rapidly (as shown by recent advances such as [8]), with direct applications across media, entertainment, and other industries. This project aims to explore low-cost methods for controlling existing image-based diffusion models so that they generate consistent videos of varying durations.

Objectives:

You will work on video generation, specifically by adapting existing pre-trained image-based diffusion models to the video setting; one representative direction is sketched below.
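One recurring low-cost adaptation in works such as [5,6] is to replace the self-attention of the pre-trained image denoiser with a cross-frame variant, so that every frame attends to an anchor frame and thereby inherits its appearance. The sketch below illustrates the idea in plain PyTorch on dummy feature maps; the tensor shapes, the missing query/key/value projections, and the choice of the first frame as anchor are simplifying assumptions for illustration, not the exact design of either paper.

```python
import torch

def cross_frame_attention(frame_feats, anchor_idx=0):
    """Toy cross-frame attention: queries come from each frame,
    keys/values from a single anchor frame (assumption: frame 0)."""
    # frame_feats: (num_frames, num_tokens, dim) per-frame spatial tokens
    q = frame_feats                                  # queries: every frame
    kv = frame_feats[anchor_idx:anchor_idx + 1]      # keys/values: anchor frame only
    k = kv.expand_as(frame_feats)
    v = kv.expand_as(frame_feats)
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                  # every frame is rewritten from the anchor's content

# Dummy per-frame features standing in for denoiser activations.
frames = torch.randn(8, 64, 320)                     # 8 frames, 64 tokens, feature dim 320
out = cross_frame_attention(frames)
print(out.shape)                                     # torch.Size([8, 64, 320])
```

In an actual pipeline this replacement would be patched into the attention layers of the pre-trained denoiser, and the choice of anchor (first frame, previous frame, or both) is one of the design axes the project can explore.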

Your profile:

You are a graduate student with a solid understanding of computer vision and machine learning methods. You have prior experience in developing and applying vision-based models. The scope of the project is broad but closely follows recent advances in the subfield of image/video generation, so you will develop your research skills in tandem with your coding skills. Prior experience with generative models (e.g. GANs) is a plus.

Related works:

  1. Ho, J., Jain, A. and Abbeel, P., 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, pp.6840-6851.
  2. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695).
  3. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T. and Ho, J., 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, pp.36479-36494.
  4. Dhariwal, P. and Nichol, A., 2021. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34, pp.8780-8794.
  5. Ceylan, D., Huang, C.H.P. and Mitra, N.J., 2023. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 23206-23217).
  6. Wu, J.Z., Ge, Y., Wang, X., Lei, S.W., Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X. and Shou, M.Z., 2023. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7623-7633).
  7. Zhang, L., Rao, A. and Agrawala, M., 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3836-3847).
  8. Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C.W.Y., Wang, R. and Ramesh, A., 2024. Video generation models as world simulators. OpenAI. URL: https://openai.com/research/video-generation-models-as-world-simulators