Master Assignment
Diving Video Classification with Vision-Language Foundation Models
Type: Master CS
Student: Unassigned
Duration: TBD
If you are interested, please contact:
Background:
In recent years, fine-grained video classification has gained popularity as a video understanding task. Fine-grained classification of actions remains challenging, as illustrated by the diving video dataset Diving48.
In this project, you will investigate how action classification models and large-scale self-supervised models encode videos, and how these different encodings affect classification performance.
You can look at:
- Meta’s recent self-supervised V-JEPA 2.
- VT‑LVLM‑AR: A Video‑Temporal Large Vision‑Language Model Adapter for Fine‑Grained Action Recognition in Long‑Term Videos, Leveraging Vision‑Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization.
- TimeSformer: Is Space-Time Attention All You Need for Video Understanding?
- ViViT: A Video Vision Transformer.
State-of-the-art methods, such as the ones above, should be compared based on the quality of their embeddings. It will be up to you to define metrics for that quality.
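One possible quality metric (a suggestion, not prescribed by the project) is a leave-one-out k-nearest-neighbour probe on the frozen embeddings: if nearby embeddings share the same action label, the encoding separates the fine-grained classes well. A sketch, assuming embeddings and labels are already available as NumPy arrays:

```python
import numpy as np

def knn_probe_accuracy(emb, labels, k=5):
    """Leave-one-out k-NN accuracy on L2-normalized embeddings.
    Higher accuracy suggests the embeddings cluster by action class."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                   # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)      # exclude self-matches
    nn_idx = np.argsort(-sim, axis=1)[:, :k]
    preds = []
    for row in nn_idx:
        vals, counts = np.unique(labels[row], return_counts=True)
        preds.append(vals[np.argmax(counts)])  # majority vote among neighbours
    return float(np.mean(np.array(preds) == labels))

# Toy example: two well-separated synthetic classes
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (20, 8)) + 1.0,
                 rng.normal(0, 0.1, (20, 8)) - 1.0])
labels = np.array([0] * 20 + [1] * 20)
acc = knn_probe_accuracy(emb, labels, k=5)
print(acc)  # 1.0 for these well-separated clusters
```

Other candidate metrics include linear-probe accuracy, clustering scores (e.g. silhouette), or retrieval mAP; part of the project is to argue which are most appropriate.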
References:
- Yingwei Li, Yi Li, and Nuno Vasconcelos. Resound: Towards action recognition without representation bias. In ECCV, pages 513–528, 2018 (Diving48 dataset)
- Jinglin Xu, Yongming Rao, Xumin Yu, Guangyi Chen, Jie Zhou, and Jiwen Lu. FineDiving: A Fine-Grained Dataset for Procedure-Aware Action Quality Assessment. In CVPR, pages 2949–2958, 2022
- Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Yuyang Gao, Siyi Gu, Junji Jiang, Sungsoo Ray Hong, Dazhou Yu, and Liang Zhao. Going Beyond XAI: A Systematic Survey for Explanation-Guided Learning. ACM Computing Surveys, 56(7):1–39, 2024.
- Expresso-AI: A framework for Explainable Video Based Deep Learning Models through gestures and expressions.
- Ex-VAD: Explainable Fine-grained Video Anomaly Detection Based on Visual-Language Models.

