Evaluating geospatial machine learning predictions: a data-driven perspective on cross-validation
Yanwen Wang is a PhD student in the Department of Geo-information Processing. (Co)Promotors are prof.dr. R. Zurita Milla and dr. M. Khodadadzadeh from the Faculty ITC.
Machine learning (ML) has become a mainstream approach for geospatial predictions because it efficiently supports the creation of a wide array of spatially continuous products, such as soil and ecological maps. Random cross-validation (CV) is widely used to evaluate geospatial ML predictions. For random CV to provide realistic and accurate evaluations, the training samples and the prediction locations should have the same distribution. However, in most practical situations, the samples do not uniformly cover the entire prediction area. This often means that samples and prediction locations have varying degrees of dissimilarity, and this strongly influences the quality of evaluations. Since random CV cannot naturally cope with this dissimilarity, it tends to provide over-optimistic evaluations. In this context, geoscientists have proposed a series of spatial CV methods that consider spatial autocorrelation. Although spatial CV methods help to avoid over-optimistic evaluations, they still fail to address the core of the dissimilarity between samples and prediction locations and, as a result, they can still provide nonrepresentative evaluations.
The difficulty that random and spatial CV methods have in handling these dissimilarities stems from the data-driven nature of geospatial ML predictions. We hypothesize that adding a data-driven perspective to CV methods will lead to more representative evaluations. In this PhD thesis, this hypothesis is addressed by means of three studies.
In the first study, we proposed spatial+ CV (SP-CV) to evaluate predictions with substantial dissimilarities. SP-CV employs a cluster ensemble to incorporate information from the data feature space into the fold split. Compared with random CV and with block CV, a representative spatial CV method, our experimental results show that the evaluations of SP-CV are the closest to the actual prediction errors. This suggests that SP-CV is more accurate than the other CV methods when predictions have substantial dissimilarities.
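The idea of letting the feature space shape the fold split can be sketched as follows. This is a minimal illustration and not the SP-CV algorithm itself: it replaces the cluster ensemble with a single plain k-means clustering over concatenated coordinates and features, and it assumes all inputs are already scaled to comparable ranges. The function names `kmeans_labels` and `cluster_folds` are hypothetical.

```python
import random


def kmeans_labels(points, k, iters=50, seed=0):
    """Plain k-means on a list of equal-length tuples; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise centers from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_folds(coords, features, k, seed=0):
    """Hold out one cluster at a time; clusters are built on (coords + features).

    Assumption: a single clustering stands in for the cluster ensemble
    mentioned in the text.
    """
    points = [tuple(c) + tuple(f) for c, f in zip(coords, features)]
    labels = kmeans_labels(points, k, seed=seed)
    folds = []
    for c in range(k):
        test = [i for i, lab in enumerate(labels) if lab == c]
        train = [i for i, lab in enumerate(labels) if lab != c]
        if test:  # skip clusters that ended up empty
            folds.append((train, test))
    return folds


if __name__ == "__main__":
    # Synthetic samples: three groups that differ in both location and feature value.
    coords = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.1)]
    features = [(0.0,), (0.1,), (1.0,), (1.1,), (2.0,), (2.1,)]
    for train_idx, test_idx in cluster_folds(coords, features, k=3):
        print("train:", train_idx, "test:", test_idx)
```

Because each held-out fold is a cluster that is compact in feature space, the model is tested on samples that differ from its training data, which is the situation that random CV fails to reproduce.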
In the second study, we proposed dissimilarity quantification by adversarial validation (DAV). DAV was tested in a series of experiments with gradually increasing dissimilarities, and our results show that it can effectively quantify dissimilarity across the entire range of dissimilarity values. In addition, through numerous experiments, we investigated in detail the relationship between dissimilarity and the evaluations produced by random and spatial CV methods. These results also supplement previous studies, which considered only a few dissimilarity scenarios.
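Adversarial validation in general works by training a classifier to tell samples apart from prediction locations: if the classifier cannot separate them (AUC near 0.5), the two sets are similar, and if it separates them perfectly (AUC of 1.0), they are dissimilar. The sketch below illustrates that general idea under explicit assumptions and is not the DAV implementation: it uses a leave-one-out k-nearest-neighbour classifier as a stand-in, and the linear mapping from AUC to a 0-to-1 dissimilarity score in `dissimilarity` is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking; ties count as 0.5."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def adversarial_auc(sample_feats, prediction_feats, k=5):
    """Label samples 0 and prediction locations 1, then score every point by the
    share of label-1 points among its k nearest neighbours (the point itself is
    left out), and return the AUC of those scores."""
    points = list(sample_feats) + list(prediction_feats)
    labels = [0] * len(sample_feats) + [1] * len(prediction_feats)
    scores = []
    for i, p in enumerate(points):
        # Squared Euclidean distance to every other point, nearest first.
        neighbours = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), labels[j])
            for j, q in enumerate(points)
            if j != i
        )
        scores.append(sum(lab for _, lab in neighbours[:k]) / k)
    return auc(scores, labels)


def dissimilarity(auc_value):
    """Hypothetical mapping: AUC 0.5 -> 0.0 (indistinguishable), AUC 1.0 -> 1.0."""
    return max(0.0, 2.0 * auc_value - 1.0)


if __name__ == "__main__":
    # Synthetic one-dimensional feature values, far apart on purpose.
    samples = [(0.0,), (0.1,), (0.2,), (0.3,)]
    targets = [(10.0,), (10.1,), (10.2,), (10.3,)]
    print(dissimilarity(adversarial_auc(samples, targets, k=3)))
```

Any classifier that yields out-of-sample scores could replace the nearest-neighbour step; the AUC is used because it does not depend on the relative sizes of the two sets.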
In the third study, we proposed dissimilarity-adaptive CV (DA-CV) to evaluate predictions with diverse dissimilarities. DA-CV combines random CV and SP-CV according to the specific dissimilarity of the case study. We set up a large number of experiments with varying dissimilarities on both synthetic and real datasets and compared DA-CV with random and spatial CV methods. Our experimental results show that, for most predictions, DA-CV provides the most accurate evaluations.
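One simple way to picture a dissimilarity-dependent combination of two CV estimates is a linear blend, sketched below. The blending rule and the function name `blended_cv_error` are assumptions made purely for illustration; the text above does not specify how DA-CV combines the two methods, and the actual procedure may differ.

```python
def blended_cv_error(random_cv_error, spatial_cv_error, dissimilarity):
    """Blend two CV error estimates with a dissimilarity score in [0, 1].

    Hypothetical rule: at dissimilarity 0 the random CV estimate is returned
    unchanged, at dissimilarity 1 the spatial-type estimate is returned
    unchanged, and values in between are interpolated linearly.
    """
    if not 0.0 <= dissimilarity <= 1.0:
        raise ValueError("dissimilarity must lie in [0, 1]")
    return (1.0 - dissimilarity) * random_cv_error + dissimilarity * spatial_cv_error


if __name__ == "__main__":
    # Illustrative numbers only, not results from the thesis.
    print(blended_cv_error(random_cv_error=0.2, spatial_cv_error=0.6, dissimilarity=0.25))
```

The sketch shows why an adaptive scheme can cover both extremes: when samples resemble the prediction locations, the random CV estimate dominates, and as they diverge, the estimate from the more conservative method takes over.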
The strong performance of SP-CV shows that incorporating information on the data feature space improves the evaluation of geospatial ML predictions. The first study is a successful first step towards bringing a data-driven perspective to CV methods. The second study, on DAV, extends our work from a particular case (substantial dissimilarity) to more general scenarios (diverse dissimilarities). DAV provides a clear data-driven perspective for understanding different predictions and their evaluations. The third study is the culmination of the first and second studies. The ability of DA-CV to consistently provide accurate evaluations for diverse dissimilarities is attributed to its comprehensive consideration of both samples and prediction locations through the feature space. It demonstrates that reforming CV from a data-driven perspective improves the evaluation of geospatial ML predictions across diverse dissimilarities.




