Master Assignment
Deriving text descriptions from a visual model's predictions
Type: Master CS
Student: Unassigned
Duration: TBD
If you are interested, please contact:
Background:
Vision-language models (VLMs) [1,2,3] are now baselines for many tasks. However, it is unclear whether their performance improvements rely more on the text or the vision components of the input [4,5]. This project aims to explore the relevance of vision versus text inputs under recent contrastive objectives (e.g. SigLIP [6]).
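For orientation, the sigmoid loss in SigLIP [6] replaces CLIP's batch-wise softmax with an independent binary classification over every image-text pair. Below is a minimal NumPy sketch of that loss; the temperature `t` and bias `b` are learnable in the paper but fixed here for illustration, and the exact function name and values are this sketch's assumptions, not the paper's code.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Sigmoid loss over all image-text pairs in a batch (after Zhai et al., 2023).

    img_emb, txt_emb: (n, d) L2-normalised embeddings of paired images and texts.
    t, b: temperature and bias (learnable in the paper; fixed here).
    """
    logits = t * img_emb @ txt_emb.T + b          # (n, n) pairwise similarity logits
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0                # +1 on matching pairs, -1 elsewhere
    # -log sigmoid(label * logit), computed stably and averaged over the batch
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n
```

A correctly matched batch (diagonal pairs similar, off-diagonal pairs dissimilar) should yield a lower loss than a shuffled one, which is a quick sanity check for any reimplementation.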
Objectives:
You will develop a cross-modal relevance method for image-language models. The method should highlight whether the model relies more on general text context (which might not always be relevant to the visual input) or on visual elements from the input. The project will address a currently open question about bias towards text in vision-language models.
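As a point of departure (this is a simple baseline sketched under assumptions, not the project's prescribed method), one can contrast how much a model's matching score drops when each modality is replaced by a neutral baseline, e.g. a blank image or an empty caption. The helper below is hypothetical; `score_fn` stands in for any image-text scoring model.

```python
import numpy as np

def modality_reliance(score_fn, image, text, image_baseline, text_baseline):
    """Contrast the score drop when each modality is ablated to a neutral baseline.

    score_fn(image, text) -> float, e.g. an image-text matching score.
    Returns (image_reliance, text_reliance); a larger value suggests the
    model leans more on that modality for this input pair.
    """
    full = score_fn(image, text)
    no_image = score_fn(image_baseline, text)   # image replaced by baseline
    no_text = score_fn(image, text_baseline)    # text replaced by baseline
    return full - no_image, full - no_text
```

On a toy score function that weights the image twice as heavily as the text, the image-reliance term comes out larger, as expected; a real study would of course use model scores and carefully chosen baselines rather than zero inputs.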
Your profile:
You are a graduate student who is enthusiastic about discovering methods for model interpretability. You have previous experience developing projects with deep-learning frameworks. You are also keen to research new directions and to apply, test, and analyse the outcomes of your ideas.
Related works:
[1] Li, F., Zhang, R., Zhang, H., Zhang, Y., Li, B., Li, W., Ma, Z. and Li, C., 2024. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895.
[2] Zhang, D., Yu, Y., Dong, J., Li, C., Su, D., Chu, C. and Yu, D., 2024. MM-LLMs: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601.
[3] Lin, J., Yin, H., Ping, W., Molchanov, P., Shoeybi, M. and Han, S., 2024. VILA: On pre-training for visual language models. CVPR.
[4] Li, S., Koh, P.W. and Du, S.S., 2024. On erroneous agreements of CLIP image embeddings. arXiv preprint arXiv:2411.05195.
[5] Rahmanzadehgervi, P., Bolton, L., Taesiri, M.R. and Nguyen, A.T., 2024. Vision language models are blind. ACCV.
[6] Zhai, X., Mustafa, B., Kolesnikov, A. and Beyer, L., 2023. Sigmoid loss for language image pre-training. ICCV.