DRONE-BASED OBJECT DETECTION AND EXPLANATION USING VISION-LANGUAGE MODELS

PROBLEM STATEMENT:

Autonomous drones are increasingly deployed for tasks such as surveillance, inspection, and search-and-rescue, where robust perception is critical. Traditional object detection models (e.g., YOLO-style CNNs) provide bounding boxes and class labels but offer limited insight into why a particular prediction was made. This lack of explainability can be problematic for safety-critical decisions, operator trust, and post-hoc analysis of failures. Recent advances in Vision-Language Models (VLMs) enable joint reasoning over images and natural language, allowing not only open-vocabulary object recognition but also textual explanations of model behaviour. However, it is unclear how well VLMs can (1) localize objects in drone imagery and (2) provide meaningful, operator-relevant explanations of their detections, especially under challenging aerial conditions (small objects, clutter, unusual viewpoints).

TASK:

The goal of this project is to investigate the use of Vision-Language Models for drone-based object detection and explainability, without relying solely on traditional detectors. The student will evaluate how well a VLM can detect and describe objects in aerial imagery and to what extent its explanations are accurate and useful. The core research questions to be addressed are:

· How accurately can a VLM localize and identify objects in drone imagery compared to a conventional detector? (A minimal scoring sketch follows this list.)

· Can the VLM generate meaningful, human-readable explanations for its detections (e.g., why a region is considered a “car” or a “person”)?

· In which scenarios (e.g., small objects, occlusions, unusual angles) do VLM-based explanations fail or become misleading, and how can these limitations be characterized?
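The first question can be made measurable with standard detection scoring. The sketch below greedily matches predicted boxes (from either the VLM or the baseline detector) to ground-truth boxes by intersection-over-union. The (x_min, y_min, x_max, y_max) box format, the dict layout, and the 0.5 IoU threshold are illustrative assumptions, not requirements of the assignment.

```python
# Minimal localization scoring: greedily match predicted boxes (from the
# VLM or the baseline detector) to ground-truth boxes by IoU.
# Boxes are assumed to be (x_min, y_min, x_max, y_max); the 0.5 IoU
# threshold mirrors common detection benchmarks but is a free choice.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_detections(predicted, ground_truth, threshold=0.5):
    """Greedy score-ordered matching; returns (TP, FP, FN) for one image."""
    unmatched_gt = list(ground_truth)
    tp = 0
    for pred in sorted(predicted, key=lambda p: p["score"], reverse=True):
        best = max(unmatched_gt,
                   key=lambda gt: iou(pred["box"], gt["box"]), default=None)
        if best is not None and pred["label"] == best["label"] \
                and iou(pred["box"], best["box"]) >= threshold:
            tp += 1
            unmatched_gt.remove(best)
    fp = len(predicted) - tp
    fn = len(unmatched_gt)
    return tp, fp, fn

# Example: precision/recall for one image; the same code scores VLM and
# YOLO outputs, making the two directly comparable.
preds = [{"box": (10, 10, 50, 40), "score": 0.9, "label": "car"}]
gts = [{"box": (12, 11, 49, 42), "label": "car"}]
tp, fp, fn = match_detections(preds, gts)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Filtering the ground truth by object size or occlusion attributes before scoring yields the per-scenario breakdown asked for in the third question.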

WORK:

20% Theory: Study state-of-the-art VLMs for open-vocabulary object understanding and existing work on explainable object detection and visual explanations.

60% Implementation & Evaluation:
Select and prepare an existing drone/aerial object detection dataset (e.g., VisDrone) with bounding-box annotations. Implement a pipeline in which a Vision-Language Model operates on the images to (a) localize objects (e.g., via prompted region proposals or open-vocabulary queries) and (b) generate a natural-language explanation for each predicted object. Compare the VLM's localization performance against a baseline detector (e.g., YOLO), and assess the quality and correctness of the generated explanations, both qualitatively and quantitatively, using the existing annotations and simple evaluation metrics.
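One possible shape for this pipeline is sketched below, using OWL-ViT for prompted open-vocabulary localization and BLIP-2 for per-region explanations, both through Hugging Face transformers. The model checkpoints, file name, query prompts, and the 0.3 score threshold are placeholder choices, and the post-processing call should be checked against the installed transformers version; this illustrates the two stages, it is not a prescribed implementation.

```python
# Sketch of one possible VLM pipeline: open-vocabulary localization with
# OWL-ViT, then a per-region natural-language explanation from BLIP-2.
# Model names, prompts, and the 0.3 threshold are illustrative choices.
import torch
from PIL import Image
from transformers import (
    OwlViTProcessor, OwlViTForObjectDetection,
    Blip2Processor, Blip2ForConditionalGeneration,
)

det_processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
exp_processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
explainer = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("visdrone_frame.jpg")  # any aerial image (placeholder path)
queries = ["a car", "a person", "a bicycle"]  # open-vocabulary text prompts

# (a) Localize: score each text query against image regions.
inputs = det_processor(text=[queries], images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = det_processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes)[0]

# (b) Explain: crop each detection and ask the VLM why the label fits.
for box, label_idx in zip(detections["boxes"], detections["labels"]):
    label = queries[int(label_idx)]
    crop = image.crop([int(v) for v in box.tolist()])
    prompt = f"Question: What visual evidence suggests this is {label}? Answer:"
    exp_inputs = exp_processor(images=crop, text=prompt, return_tensors="pt")
    with torch.no_grad():
        generated = explainer.generate(**exp_inputs, max_new_tokens=40)
    explanation = exp_processor.decode(generated[0], skip_special_tokens=True)
    print(f"{label} at {box.tolist()}: {explanation}")
```

Cropping the detected region keeps the explanation tied to that specific detection; passing the full frame with the box overlaid, or swapping in another instruction-tuned VLM, are design alternatives worth comparing in the evaluation.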

20% Writing: Document the methodology, experiments, and findings, and provide recommendations for when and how VLMs can be used for explainable drone-based object detection.

CONTACT:

Adarsh Nanjaiya Latha (a.nanjaiyalatha@utwente.nl)