Language-guided Audio-Visual Source Separation

CONTEXT

The world is full of audiovisual signals, and multiple audio sources often overlap with one another. For example, in a theatre, an orchestra led by a conductor performs a piece of music on a variety of instruments, producing an enjoyable combination of complementary melodies for the audience. Humans have an innate ability to pick out the melody of a single instrument from such a mixture. Replicating this ability in machines enables a broad range of useful applications, such as noise cancellation, speaker separation, and voice enhancement. With the proliferation of digital data, three modalities are typically available, namely visual, audio, and textual information, and combining them can improve the separation of multiple sources. The emergence of deep learning techniques has pushed this line of work further.

TASK

In this project, students are expected to briefly review state-of-the-art work in this research area and then carefully analyze the strengths and weaknesses of each approach. Building on that analysis, they are encouraged to propose their own novel solutions that can significantly improve fidelity and efficiency in sound source separation.

YOU WILL GET

REQUIREMENTS

Contact:

Minh Son Nguyen, m.s.nguyen@utwente.nl