
DSI Meet Up: Seminar Series on Computer Vision
Improving the Generalization of ViTs for Action Understanding with VLM Pre-Training

This DSI Meet Up focuses on Pure Motion for Video Analysis, exploring how motion-based representations can improve the way computer vision models analyse and understand video. The series is co-organised by members of the DMB group and collaborating colleagues across the University of Twente.

The “Computer Vision” seminar series aims to bring together students and researchers passionate about enabling machines to interpret and understand visual information. Our goal is to highlight cutting-edge research, showcase innovative applications, and foster collaborations among experts from nearby institutions. Each session will feature an invited speaker presenting their work on a specific topic within the broad and fast-evolving field of computer vision.

By joining these seminars, you’ll gain insights into the latest academic and industrial developments, exchange ideas with peers, and explore opportunities for joint research and innovation. Whether you are a PhD candidate, an experienced researcher, or simply curious about computer vision, feel welcome to join these seminars.

The seminars are open to everyone at UT interested in the topic. 

Abstract

Generalization of ViTs

Vision Transformers (ViTs) are currently the best-performing models in video action understanding. However, when these models are frozen and applied to downstream tasks, their performance drops significantly, revealing poor generalization. In this presentation, we will introduce the Four-Tiered Prompts (FTP) framework, which employs feature processors to transform the ViT's visual embeddings. With FTP, we increase the ViT's generalization ability by forcing the visual encoder to incorporate relevant semantic information. Importantly, we employ the vision-language model (VLM) only during training, so inference incurs minimal computational cost. We discuss the application of the framework to video action understanding tasks.

Speaker

Ronald Poppe

Ronald Poppe is an associate professor in the Information and Computing Sciences Department of Utrecht University. His research interests center around the analysis of human (interactive) behavior from videos and other sensors, with applications in media analysis and generation, and in the clinical domain. He received a Ph.D. from the University of Twente, The Netherlands (2009) and was a visiting researcher at the Delft University of Technology, Stanford University, and the University of Lancaster.

Registration & contact 

Please register via the registration button below. For more information and questions regarding the seminar series on computer vision, please contact Estefanía Talavera Martínez.
