[B] Discovering and Recommending Relevant Audio and Videos

Master Assignment

Type: Master EE/CS/ITC

Period: TBD

Student: (Unassigned)

If you are interested, please contact:

Background:

Music is an essential component of most of the media we consume daily: soundtracks in films, background music in YouTube videos, and audio snippets in social media posts. Across all these media, selecting the right music can be difficult, as it needs to match both the overall theme of the video and the individual scenes within it. An automated tool that recommends relevant music can therefore offer a simple solution to professional and amateur editors alike. The inverse task is also of interest, as music artists may need to find video footage that artistically matches their music.

Audio and video have a physical correspondence, as both are signals transmitted over time. However, discovering semantic associations between them requires going beyond their physical attributes. Understanding the interplay of the factors that are crucial for associating audio with video is important for exploring the relations between the two.

Objectives:

You will develop a model that learns context from audio and video and explores relations between the two modalities. The objectives also include a detailed analysis that gives insight into how well the correspondence between audio and video is learned.
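
As an illustration of what such an analysis could look like, the sketch below computes Recall@K for video-to-audio retrieval. This is only one possible measure and is an assumption, not a metric prescribed by the assignment. It also assumes that a trained model has already produced one embedding per video and per audio clip, and that matching pairs share the same index. The function names and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(video_emb, audio_emb, k):
    """Fraction of videos whose paired audio (same index) ranks in the top k."""
    hits = 0
    for i, v in enumerate(video_emb):
        scores = [cosine(v, a) for a in audio_emb]
        # Rank audio indices from most to least similar to this video.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(video_emb)


# Toy 2-D embeddings, made up purely for illustration.
video_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_emb = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]

for k in (1, 2):
    print(f"Recall@{k}: {recall_at_k(video_emb, audio_emb, k):.3f}")
```

On this toy data, two of the three videos retrieve their paired audio at rank 1, and all three do within the top 2. In a real study the embeddings would come from the learned audio and video encoders, and the same measure could be reported in the audio-to-video direction.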

Your profile:

You are a graduate student with solid experience in Computer Vision and Machine Learning. You should be a capable programmer with prior experience using ML packages (e.g. PyTorch). Part of the project requires critical thinking and exploring new directions for encoding both video and audio, so you will also have the opportunity to go beyond current approaches.

Related works:

  1. Surís, D., Vondrick, C., Russell, B. and Salamon, J., 2022. It's Time for Artistic Correspondence in Music and Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10564-10574).
  2. Zhuo, L., Wang, Z., Wang, B., Liao, Y., Bao, C., Peng, S., Han, S., Zhang, A., Fang, F. and Liu, S., 2023. Video background music generation: Dataset, method and evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15637-15647).
  3. McKee, D., Salamon, J., Sivic, J. and Russell, B., 2023. Language-Guided Music Recommendation for Video via Prompt Analogies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14784-14793).
  4. Su, K., Li, J.Y., Huang, Q., Kuzmin, D., Lee, J., Donahue, C., Sha, F., Jansen, A., Wang, Y., Verzetti, M. and Denk, T., 2024. V2Meow: Meowing to the Visual Beat via Video-to-Music Generation.