[M] Uni-modal to multi-modal model merging

Master Assignment


Type: Master CS

Student: Unassigned

Duration: TBD

If you are interested, please contact:

Background:

Attention-based models are used far and wide for multiple tasks (e.g. classification, detection, QA) and modalities (e.g. images, text, audio, video). In most cases, when you train one model on one modality (e.g. images) and another model on a different modality (e.g. audio), you need to re-train a joint model to combine the two. This project aims to investigate the feasibility of test-time model merging: instead of training a single multimodal model from scratch, you will extend model-merging techniques such as layer merging [1,2,3] or token pruning [4] to multiple modalities.
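To make the idea concrete, here is a minimal, hedged sketch of the simplest training-free merging baseline: element-wise interpolation of two aligned sets of parameters, in the spirit of the layer-merging works cited above [1,2]. The dict-of-lists representation, the layer names, and the two "expert" models are illustrative stand-ins; with real models these would be PyTorch state dicts (`model.state_dict()`) holding tensors.

```python
# Toy sketch of training-free model merging via parameter interpolation.
# State dicts are plain {name: list-of-floats} here; in practice they
# would be PyTorch tensors from model.state_dict(). Names and values
# below are illustrative assumptions, not from any real checkpoint.

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Return alpha * sd_a + (1 - alpha) * sd_b, parameter by parameter."""
    # Naive interpolation assumes both models share an architecture and
    # that corresponding parameters are already aligned; methods like
    # ZipIt! [1] relax this by first matching correlated features.
    assert sd_a.keys() == sd_b.keys(), "models must share an architecture"
    return {
        name: [alpha * wa + (1 - alpha) * wb
               for wa, wb in zip(sd_a[name], sd_b[name])]
        for name in sd_a
    }

# Two hypothetical modality experts with identical layer names.
vision_expert = {"layer1.weight": [1.0, 2.0], "layer1.bias": [0.0]}
audio_expert  = {"layer1.weight": [3.0, 4.0], "layer1.bias": [2.0]}

merged = merge_state_dicts(vision_expert, audio_expert, alpha=0.5)
# merged["layer1.weight"] == [2.0, 3.0]; merged["layer1.bias"] == [1.0]
```

This equal-weight average is only the starting point: the cited works improve on it by sparsifying task differences [2], aligning subspaces with SVD [3], or operating on tokens rather than weights [4].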

Objectives:

This project aims at foundational work on training-free model merging. Instead of developing a new model from scratch, you will combine multiple models (i.e. modality experts) into a single model (i.e. a multimodal oracle).

Your profile:

You are a graduate student with prior experience in DL frameworks (e.g. PyTorch). You are also enthusiastic about foundational research in new directions: applying, testing, and analysing the outcomes of your ideas.

Related works:

[1] Stoica, G., Bolya, D., Bjorner, J., Ramesh, P., Hearn, T. and Hoffman, J., 2023. ZipIt! Merging models from different tasks without training. arXiv preprint arXiv:2305.03053.

[2] Davari, M. and Belilovsky, E., 2024, September. Model breadcrumbs: Scaling multi-task model merging with sparse masks. In European Conference on Computer Vision (pp. 270-287). Cham: Springer Nature Switzerland.

[3] Stoica, G., Ramesh, P., Ecsedi, B., Choshen, L. and Hoffman, J., 2024. Model merging with SVD to tie the Knots. arXiv preprint arXiv:2410.19735.

[4] Kim, M., Gao, S., Hsu, Y.C., Shen, Y. and Jin, H., 2024. Token fusion: Bridging the gap between token pruning and token merging. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).