Master Assignment
Will it fail? Generating videos of actions fails
Type: Master CS
Student: Unassigned
Duration: TBD
If you are interested please contact:
Background:
Generative models are used to create images, video, and audio [1,2,3]. Advances in controllable generation make it possible to generate videos from text prompts, among other modalities. This project aims to build adaptors [4] on top of existing models that steer generation, in the manner of [5], towards either positive outcomes (the action in the video is executed successfully) or negative outcomes (fail cases in which the action is not performed as expected).
Denoising Diffusion Probabilistic Models (DDPMs) [1] generate new images from noise using a denoising network, typically an encoder-decoder. They work in two steps. In the first (forward) step, they corrupt the input data by progressively adding Gaussian noise until it is close to pure noise. In the second (backward) step, the network learns to reverse this corruption step by step and reconstruct the original input from its noisy version. Because the model learns to map noise back to data, new samples that are not in the training data can be generated by starting from freshly sampled noise.
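The forward step has a closed form in [1], x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, which the following minimal sketch illustrates in plain Python. The linear schedule values, the toy 4-element "image", and all function names are assumptions made for illustration. No network is trained here: the true noise stands in for a perfect noise prediction, only to show how a noise estimate maps back to the input.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """abar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in closed form; returns (x_t, noise)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


def estimate_x0(xt, predicted_noise, t, alpha_bars):
    """Invert the forward equation given an estimate of the added noise."""
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a)
            for x, e in zip(xt, predicted_noise)]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

x0 = [0.5, -1.0, 0.25, 0.0]          # toy stand-in for image pixels
xt, noise = forward_diffuse(x0, 500, alpha_bars, rng)

# A trained denoiser would predict `noise` from (xt, t); here the true noise
# is used as a stand-in for a perfect prediction.
x0_hat = estimate_x0(xt, noise, 500, alpha_bars)
```

In this sketch `alpha_bars` shrinks towards zero as the step index grows, so late steps are almost pure noise, which is why sampling can start from random noise.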
Low-Rank Adaptation (LoRA) [4] adds small trainable modules to the layers of a pre-trained network in order to inject new information. Instead of fine-tuning the full weights of those layers, LoRA keeps them frozen and learns a low-rank update that is added to them, which keeps adaptation efficient.
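A minimal sketch of the low-rank update from [4], W' = W + (alpha / r) * B A, written in plain Python with matrices as lists of rows. The class name `LoRALinear`, the layer sizes, and the initialisation scale are assumptions for illustration. In practice this would be done with a DL framework and applied to the attention layers of a diffusion model.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                                  # frozen, never updated
        # A gets a small random init, B starts at zero, so the adapted layer
        # initially behaves exactly like the pre-trained one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scaling = alpha / rank

    def effective_weight(self):
        delta = matmul(self.B, self.A)
        return [[w + self.scaling * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to a vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_parameters(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


rng = random.Random(0)
d = 8
weight = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
layer = LoRALinear(weight, rank=2, alpha=4.0, rng=rng)

x = [1.0] * d
base_out = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
lora_out = layer.forward(x)
full_params = d * d
lora_params = layer.trainable_parameters()
```

With these toy sizes only `A` and `B` would be trained, which is fewer parameters than the full weight matrix; the gap grows quickly for realistic layer widths.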
Objectives:
The project aims to generate plausible futures in video data based on the context of the action performed. The plausibility of each future will be adjusted with a slider that can produce videos with a perfect or failed execution of an action (max-min slider options), as well as all in-between performances.
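One way such a slider could work, following the idea in [5] of scaling a learned low-rank direction at inference time, is sketched below. The mapping of +1 to a perfect execution and -1 to a failed one, the function name `slider_weight`, and the toy 2x2 values are assumptions for illustration, not a fixed design for the project.

```python
def slider_weight(base, direction, strength):
    """Return base + strength * direction, element-wise, for matrices as lists of rows."""
    return [[b + strength * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, direction)]


# Toy frozen weight of a pre-trained layer.
base = [[1.0, 0.0],
        [0.0, 1.0]]

# Hypothetical learned rank-1 direction (outer product of u and v), standing in
# for a trained adaptor that separates successful from failed executions.
u, v = [1.0, -1.0], [0.5, 0.5]
direction = [[ui * vj for vj in v] for ui in u]

# Slider positions from the assumed "fail" end (-1) to the assumed "success" end (+1).
positions = [-1.0, -0.5, 0.0, 0.5, 1.0]
weights = {s: slider_weight(base, direction, s) for s in positions}
```

At position 0 the layer equals the pre-trained one, and intermediate positions interpolate linearly between the two extremes, which is what would produce the in-between performances.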
Your profile:
You are a graduate student who is enthusiastic about generative models. You have previous experience developing projects with deep learning (DL) frameworks. You are also interested in researching new directions and in applying, testing, and analyzing the outcomes of your ideas.
Related works:
[1] Ho, J., Jain, A. and Abbeel, P., 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, pp.6840-6851.
[2] Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695).
[3] Fei, H., Wu, S., Ji, W., Zhang, H. and Chua, T.S., 2024. Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7641-7653).
[4] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W., 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
[5] Gandikota, R., Materzyńska, J., Zhou, T., Torralba, A. and Bau, D., 2025. Concept sliders: Lora adaptors for precise control in diffusion models. In European Conference on Computer Vision (pp. 172-188). Springer, Cham.