PhD Defence Alby Duarte Rocha

Tuning a statistical trade-off between spectral and spatial domains to predict plant traits with hyperspectral remote sensing

Alby Duarte Rocha is a PhD student in the department of Natural Resources. His supervisor is prof.dr. A.K. Skidmore from the faculty of Geo-Information Science and Earth Observation.

Monitoring biochemical and biophysical cycles in different ecosystems and at relevant locations leads to a better understanding of vegetation dynamics. A well-chosen set of plant traits can serve as surrogate indicators for monitoring ecosystem dynamics. Less utopian, yet still important, plant trait assessment can guide more efficient and sustainable crop management, improving food security through precision agriculture. However, plant traits are hard to sample and measure, which severely limits the availability of data at the spatial and temporal scales needed.

As direct plant trait measurements are time-consuming, expensive and usually destructive to the sample, indirect forms of estimation are needed to make an efficient monitoring system feasible. So far, the most widely accepted alternative is to measure the vegetation surface with optical instruments. An optical sensor captures the reflectance of the plant across the electromagnetic spectrum, which can be used to estimate the biochemical and biophysical characteristics of the vegetation. Remote sensing has the potential to speed up the measurement (or estimation) of plant traits, allowing vegetation dynamics to be monitored over a wider range of spatial and temporal scales.

An optical instrument usually measures the reflectance of a target surface in a specific range of the spectrum and divides it into wavebands. The number of wavebands can vary from a couple to thousands. Variations in leaf chlorophyll concentration affect reflectance in different regions of the spectrum than variations in leaf water content. Narrow wavebands therefore generally quantify these plant traits better than wide bands, which is why hyperspectral remote sensing often gives more accurate estimates of plant traits than multispectral measurements. Although such spectral specificity allows for higher accuracy, it also causes problems, because the data are high-dimensional and contain many redundant wavelengths.

Modelling a large number of serially correlated wavelengths with relatively few ground reference observations to support them can cause serious issues, of which multicollinearity and model overfitting are the most common. Overfitting leads to very specific models that generalise poorly and are accurate only for the dataset used to fit them. The risk of overfitting is magnified by noise from atmospheric effects and by variations in sunlight illumination that affect certain regions of the spectrum. The risk also increases when machine learning algorithms or supervised methods (which make use of the response) are used for model selection or parameter tuning. A new method for tuning model complexity, called Naïve Overfitting Index Selection (NOIS), was developed to reduce the risk of overfitting when modelling with machine learning algorithms. The NOIS method uses simulated data based on the covariance matrix of the original dataset to determine the maximum model complexity supported by the number of observations.
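
The idea of using covariance-based simulated data to find the complexity a dataset can support might be sketched roughly as follows. This is an illustrative sketch only, not the published NOIS procedure. It assumes that the simulated response is pure noise unrelated to the spectra, that the model is a principal component regression whose complexity is its number of components, and that overfitting is measured as 1 - RMSE(calibration) / RMSE(cross-validation). The function names, the tolerance of 0.2 and the synthetic spectra are all hypothetical choices made for the example.

```python
import numpy as np


def pcr_predict(X_train, y_train, X_test, n_comp):
    """Principal component regression: regress y on the first n_comp PC scores."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_comp].T  # loadings, shape (bands, n_comp)
    coef, *_ = np.linalg.lstsq(Xc @ V, y_train - y_mean, rcond=None)
    return (X_test - x_mean) @ V @ coef + y_mean


def rmse(y, pred):
    return float(np.sqrt(np.mean((y - pred) ** 2)))


def overfitting_index(X, y, n_comp, k=5):
    """Assumed index: 1 - RMSE_calibration / RMSE_cv (near 0 means little overfitting)."""
    n = len(y)
    cv_pred = np.empty(n)
    for test_idx in np.array_split(np.arange(n), k):
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        cv_pred[test_idx] = pcr_predict(X[train_idx], y[train_idx], X[test_idx], n_comp)
    cal_pred = pcr_predict(X, y, X, n_comp)
    return 1.0 - rmse(y, cal_pred) / rmse(y, cv_pred)


def naive_overfitting_curve(X, max_comp, n_sim=20, k=5, seed=0):
    """Mean overfitting index per complexity, using simulated spectra and a noise response."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    curve = np.zeros(max_comp)
    for _ in range(n_sim):
        # Simulated spectra share the mean and covariance of the original data.
        X_sim = rng.multivariate_normal(mean, cov, size=n)
        # Assumption: the "naive" response has no relation to the spectra,
        # so any apparent calibration fit is overfitting.
        y_naive = rng.standard_normal(n)
        curve += np.array(
            [overfitting_index(X_sim, y_naive, c, k) for c in range(1, max_comp + 1)]
        )
    return curve / n_sim


def max_supported_complexity(index_curve, tolerance=0.2):
    """Largest number of components before the index first exceeds the tolerance."""
    above = np.flatnonzero(np.asarray(index_curve) > tolerance)
    return len(index_curve) if above.size == 0 else int(above[0])


if __name__ == "__main__":
    # Hypothetical data: 40 observations of 200 serially correlated "bands".
    demo_rng = np.random.default_rng(1)
    spectra = np.cumsum(demo_rng.normal(size=(40, 200)), axis=1)
    curve = naive_overfitting_curve(spectra, max_comp=12, n_sim=10)
    print("overfitting index per complexity:", np.round(curve, 2))
    print("maximum supported components:", max_supported_complexity(curve))
```

Because the simulated response carries no signal, the index reflects only the model's capacity to fit noise given the number of observations and the correlation structure of the bands. The complexity selected this way can then serve as an upper bound when tuning the model on the real response.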

Besides variation in the spectral domain, remote sensing data captured from vegetated landscapes are likely to show spatial and temporal variation as well. Spatiotemporal autocorrelation in the observations is often neglected during modelling because of the dimensionality of the spectral domain, yet it can cause serious inferential problems. Machine learning algorithms give unstable and unreliable predictions under significant autocorrelation, as was shown here by a model assessment using simulated landscapes with increasing levels of spatial dependency. A spatial model is indicated in this case, but to model space explicitly with spectral information as covariates, the number of wavelengths has to be reduced drastically. This can be achieved by a spectral-spatial tuning process that balances the trade-off between accuracy and overfitting in both domains.
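
To make the notion of simulated landscapes with increasing spatial dependency concrete, the sketch below generates a trait surface on a grid and measures its autocorrelation. The summary does not say how the landscapes in the thesis were simulated, so every choice here is an assumption made for illustration: a Gaussian random field with an exponential covariance function, a hypothetical `corr_range` parameter controlling the strength of the dependency, and Moran's I with binary neighbour weights as the autocorrelation measure.

```python
import numpy as np


def pairwise_distances(coords):
    """Euclidean distance between every pair of points."""
    return np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)


def simulate_landscape(coords, corr_range, rng):
    """Gaussian random field with covariance exp(-distance / corr_range) (assumed model)."""
    cov = np.exp(-pairwise_distances(coords) / corr_range)
    # A small jitter on the diagonal keeps the Cholesky factorisation stable.
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(len(coords)))
    return chol @ rng.standard_normal(len(coords))


def morans_i(values, coords, max_dist=1.5):
    """Moran's I with binary weights: points within max_dist count as neighbours."""
    d = pairwise_distances(coords)
    w = ((d > 0) & (d <= max_dist)).astype(float)
    z = values - values.mean()
    return float(len(values) / w.sum() * (z @ w @ z) / (z @ z))


def grid_coords(side):
    """Coordinates of a side x side grid with unit spacing."""
    xs, ys = np.meshgrid(np.arange(side), np.arange(side))
    return np.column_stack([xs.ravel(), ys.ravel()]).astype(float)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords = grid_coords(12)
    for corr_range in (0.05, 1.0, 5.0):  # increasing spatial dependency
        trait = simulate_landscape(coords, corr_range, rng)
        print(f"range={corr_range}: Moran's I = {morans_i(trait, coords):.2f}")
```

A Moran's I close to zero indicates observations that are nearly independent in space, while larger positive values indicate that neighbouring locations carry similar trait values. In that situation random cross-validation tends to give an overly optimistic picture of model accuracy.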

Temporal variations in sunlight illumination and atmospheric conditions are stochastic in nature and consequently carry some degree of unpredictability. Patterns in soil fertility, slope or moisture drive the spatial dependency of plant traits and can also be inherently stochastic. Whether or not the three domains (spectral, spatial and temporal) are modelled explicitly with hyperspectral data, a certain level of uncertainty will always be present. Appropriate sampling designs and regression models are therefore crucial, but worthless without an in-depth assessment of the uncertainties coming from the three domains.