Multi-Modal Earth Observation and Deep Learning for Urban Scene Understanding
Abhisek Maiti is a PhD student in the Department of Earth Observation Science. (Co)Promotors are prof.dr.ir. M.G. Vosselman and dr.ir. S.J. Oude Elberink from the Faculty of Geo-Information Science and Earth Observation.
Semantic segmentation of remote sensing data assigns each pixel to one of a predefined set of classes. Leveraging deep learning in the segmentation process enables us to deal with intricate visual patterns and more complex landscapes, as deep neural networks can process and interpret vast amounts of data efficiently and accurately. This research collectively addresses several aspects of deep-learning-based semantic segmentation in Earth Observation (EO), focusing on the integration of multi-modal data, the impact of label noise, and the need for diverse datasets.
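To make the per-pixel formulation concrete, the sketch below illustrates semantic segmentation as pixel-wise classification, assuming a hypothetical model that outputs one score channel per class; the shapes and random scores are placeholders, not part of the thesis.

```python
import numpy as np

# Minimal sketch: semantic segmentation as per-pixel classification.
# A (hypothetical) network produces a score map of shape
# (num_classes, height, width); each pixel is assigned the class
# with the highest score.
num_classes, height, width = 6, 4, 4
scores = np.random.rand(num_classes, height, width)  # stand-in for network output
label_map = scores.argmax(axis=0)                     # (height, width) class index per pixel
print(label_map)
```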
Firstly, we explored the synergistic potential of combining 2D color images and 3D point clouds to enhance semantic segmentation, a task that has traditionally relied on uni-modal data. We recognized that fusing these two modalities is inherently complex due to their distinct characteristics, differences in dimensionality, the challenge of aligning them in the same spatial frame, and the biases unique to each modality. To address these complexities, we developed a novel model named TransFusion. Our approach is innovative in its ability to fuse 2D images with 3D point clouds directly, eliminating the need for any potentially lossy preprocessing of the point clouds. This direct fusion sets our model apart from existing techniques. We compared TransFusion's performance against a baseline Fully Convolutional Network (FCN) model, which uses images with depth maps for segmentation. Our model demonstrated superior performance, particularly in terms of mean Intersection over Union (mIoU), a key metric for assessing segmentation quality. Specifically, we observed improvements of 4% and 2% in mIoU for the Vaihingen and Potsdam datasets, respectively. These results are significant as they showcase TransFusion's ability to effectively learn and interpret the spatial and structural information in the multi-modal data. This leads to more accurate semantic segmentation, underscoring the value of multi-modal data fusion in enhancing the quality and reliability of segmentation tasks in our field.
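Since mIoU is the headline metric in these comparisons, the following sketch shows one common way to compute per-class IoU and its mean from a predicted and a reference label map; the toy label maps are purely illustrative and not drawn from the Vaihingen or Potsdam data.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class IoU and mean IoU from predicted and reference label maps.

    IoU_c = TP_c / (TP_c + FP_c + FN_c); mIoU averages over the classes
    that occur in either map.
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        intersection = np.logical_and(pred_c, target_c).sum()
        union = np.logical_or(pred_c, target_c).sum()
        if union > 0:                       # skip classes absent from both maps
            ious.append(intersection / union)
    return np.mean(ious)

# Toy example: two 3x3 label maps with 3 classes.
pred = np.array([[0, 0, 1], [1, 1, 2], [2, 2, 2]])
ref  = np.array([[0, 1, 1], [1, 1, 2], [2, 2, 0]])
print(f"mIoU = {mean_iou(pred, ref, num_classes=3):.3f}")
```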
Secondly, we delved into the significant yet often overlooked impact of label noise on the performance of deep learning models in semantic segmentation. Label noise refers to incorrect annotations in the training data, a common issue, especially in remote sensing applications. These applications typically involve limited spatial datasets labeled by domain experts, where high variability between and within observers introduces annotation errors that ultimately lead to inaccurate predictions. To investigate this issue, we first simulated label noise and conducted experiments using two datasets comprising very high-resolution aerial images, height data, and deliberately inaccurate labels. These datasets were instrumental in training our deep learning models. Our focus was on understanding how different types of label noise affect the performance of these models. We discovered that the response to label noise varies across classes. A crucial factor we identified was the typical size of an object in a given class, which significantly influences how well the model performs when trained with erroneous labels. One of the key findings was that errors caused by relative shifts of labels have the most substantial impact on the model. Interestingly, the model exhibited more tolerance towards random label noise than towards other types of label errors. Our observations showed a noticeable decrease in accuracy of at least 3%, even when only 5% of the label pixels were erroneous. Through this study, we offer a new perspective on evaluating and quantifying the effect of label noise on model performance. This understanding is crucial for developing more reliable semantic segmentation practices, particularly in fields where precise labeling is challenging and prone to variability.
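As an illustration of the two noise types discussed above, the sketch below simulates random label flips and a relative shift of the label map; the function names, noise fraction, and shift offsets are hypothetical stand-ins and not the exact simulation protocol used in the experiments.

```python
import numpy as np

def add_random_noise(labels, fraction, num_classes, rng):
    """Flip a given fraction of pixels to a randomly chosen wrong class."""
    noisy = labels.copy()
    n_pixels = labels.size
    idx = rng.choice(n_pixels, size=int(fraction * n_pixels), replace=False)
    flat = noisy.ravel()                                  # view into `noisy`
    flat[idx] = (flat[idx] + rng.integers(1, num_classes, size=idx.size)) % num_classes
    return noisy

def add_shift_noise(labels, dx, dy):
    """Shift the whole label map by (dx, dy) pixels to mimic misregistered labels."""
    return np.roll(labels, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
labels = rng.integers(0, 6, size=(256, 256))              # stand-in reference label map
noisy_random = add_random_noise(labels, fraction=0.05, num_classes=6, rng=rng)
noisy_shift = add_shift_noise(labels, dx=3, dy=3)
print((noisy_random != labels).mean(), (noisy_shift != labels).mean())
```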
Finally, we addressed a growing interest in semantic segmentation within Earth Observation, particularly in complex urban scenes. While several datasets are available for semantic segmentation, we noticed a geographical imbalance: most high-resolution urban datasets come from Europe and North America, leaving regions such as South and Southeast Asia underrepresented. Because urban design varies widely around the world, this lack of diversity in training data limits the applicability of computer vision models. To address this issue, we introduced the UAVPal dataset featuring complex urban scenes from Bhopal, India. Alongside this dataset, we developed a novel dense predictor head for semantic segmentation. This predictor head is designed to effectively utilize multi-scale features, enhancing the capabilities of a strong feature-extractor backbone. Our segmentation head learns the importance of features at various scales for each class and refines the dense predictions based on this understanding. We tested the newly designed head with a state-of-the-art backbone across multiple UAV datasets and a high-resolution satellite image dataset for land use and land cover classification. The results were promising, showing mIoU improvements of up to 2% across various classes. Notably, our approach also reduced computational requirements substantially: we observed nearly a 50% decrease in the computing operations needed when using our proposed head compared to traditional segmentation heads. This not only improved performance but also enhanced efficiency, marking a significant advancement in the field of semantic segmentation for Earth Observation.
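To illustrate the idea of a head that learns class-specific importance of multi-scale features, the PyTorch sketch below weights per-scale class logits with a learned softmax over scales; it is not the UAVPal head itself, and the channel sizes, number of classes, and weighting scheme are assumptions made only for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleClassWeightedHead(nn.Module):
    """Illustrative sketch: per-scale class logits combined with learned,
    class-specific weights over scales (not the thesis implementation)."""

    def __init__(self, in_channels=(96, 192, 384, 768), num_classes=8):
        super().__init__()
        self.classifiers = nn.ModuleList(
            nn.Conv2d(c, num_classes, kernel_size=1) for c in in_channels
        )
        # One learnable importance weight per (scale, class) pair.
        self.scale_logits = nn.Parameter(torch.zeros(len(in_channels), num_classes))

    def forward(self, features, out_size):
        # features: list of backbone maps at decreasing resolution (e.g. strides 4/8/16/32).
        per_scale = [
            F.interpolate(clf(f), size=out_size, mode="bilinear", align_corners=False)
            for clf, f in zip(self.classifiers, features)
        ]
        stacked = torch.stack(per_scale, dim=0)              # (S, B, C, H, W)
        weights = torch.softmax(self.scale_logits, dim=0)    # (S, C), sums to 1 per class
        return (stacked * weights[:, None, :, None, None]).sum(dim=0)

# Toy usage with random multi-scale features.
feats = [torch.randn(1, c, 64 // s, 64 // s)
         for c, s in zip((96, 192, 384, 768), (1, 2, 4, 8))]
head = MultiScaleClassWeightedHead()
logits = head(feats, out_size=(256, 256))
print(logits.shape)   # torch.Size([1, 8, 256, 256])
```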