
Tackling quality concerns around big data

ITC researchers develop method for evaluating volunteered data

As the adoption of online information and location-aware technologies grows, more and more data will become available for research purposes. Data collected by volunteers or citizen scientists is a valuable source for analysing geographic phenomena such as the impact of climate change on plant and animal biological cycles, but only if the quality of the data can be evaluated. Researchers from the Faculty ITC of the University of Twente have found a way to assess the quality of volunteered data efficiently, making it useful for scientists.

The researchers recently published their method in the multidisciplinary scientific journal PLoS ONE, where they applied it to long-term records of volunteered observations of plant flowering and leafing [1]. For this work, the ITC researchers collaborated with the USA National Phenology Network (USANPN), and earlier this year they published a quality-checked phenological dataset in Nature Scientific Data [2].

Big data: growing data collections

Improvements in online communication and mobile location-aware technologies have led to a dramatic increase in the amount of volunteered geographic information (VGI) in recent years. The collection of volunteered data on geographic phenomena has a rich history worldwide. For example, data from the Christmas Bird Count, which has been held in North America since 1900, has been used to study the impacts of climate change on the spatial distribution and population trends of selected bird species. Nowadays, several citizen observatories collect information about our environment. This information is complementary or, in some cases, essential for tackling a wide range of geographic problems.

Quality concerns

Despite the wide applicability and acceptability of VGI in science, many studies argue that the quality of the observations remains a concern. Data collected by volunteers often does not follow scientific principles of sampling design, and levels of expertise vary among volunteers. This makes it hard for scientists to integrate VGI into their research.

Low-quality, inconsistent observations can bias analysis and modelling results because they are not representative of the variable studied, or because they decrease the signal-to-noise ratio. Hence, identifying inconsistent observations clearly benefits VGI-based applications and provides more robust datasets to the scientific community.

Identify inconsistencies in volunteered data

In their paper, the researchers describe a novel automated workflow to identify inconsistencies in VGI. “Leveraging a digital control mechanism means we can give value to the millions of observations collected by volunteers” and “it allows a new kind of science where citizens can directly contribute to the analysis of global challenges like climate change”, say Hamed Mehdipoor and Dr. Raul Zurita-Milla, who work at the Geo-Information Processing department of ITC.

The workflow relies on the availability of contextual information and combines dimensionality reduction, clustering and outlier detection techniques. In the PLoS ONE publication, the workflow is illustrated using observations of the timing of the first flower of lilac plants collected by volunteers in North America. These observations have a long history (volunteers started to collect this data in the 1950s, as indicated in the Nature Scientific Data publication) and many uses: from supporting the planning and execution of agronomical practices to studying the magnitude and direction of climate change at continental scales.
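To make the three steps concrete, here is a minimal sketch in Python. It is not the workflow from the PLoS ONE paper: the records are invented, and the specific techniques (a single principal component found by power iteration, one-dimensional k-means, and a median-based robust z-score with a threshold of 3.5) are assumptions chosen only for illustration. The contextual information here is a hypothetical latitude and mean spring temperature for each site.

```python
import statistics

# Hypothetical records: (latitude, mean spring temperature in deg C, first-bloom day of year).
RECORDS = [
    (35.0, 16.0, 96), (35.5, 15.5, 98), (36.0, 15.2, 100),
    (36.5, 14.8, 101), (37.0, 14.5, 103),
    (44.0, 8.0, 131), (44.5, 7.6, 133), (45.0, 7.1, 135),
    (45.5, 6.8, 60),   # implausibly early for a cold, northern site
    (46.0, 6.5, 138),
]

def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def first_component(rows, iters=200):
    """Dimensionality reduction: project the rows onto the first principal
    component, found by power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]
    v = [1.0] + [0.0] * (d - 1)  # starting vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [sum(a * b for a, b in zip(r, v)) for r in rows]

def kmeans_1d(values, k=2, iters=50):
    """Clustering: Lloyd's algorithm on the one-dimensional scores."""
    ordered = sorted(values)
    # Spread the initial centres over the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in values]
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = statistics.fmean(members)
    return labels

def flag_outliers(days, labels, threshold=3.5):
    """Outlier detection: robust z-score (median and MAD) within each cluster."""
    flags = [False] * len(days)
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        med = statistics.median(days[i] for i in idx)
        mad = statistics.median(abs(days[i] - med) for i in idx) or 1.0
        for i in idx:
            flags[i] = 0.6745 * abs(days[i] - med) / mad > threshold
    return flags

def find_inconsistent(records):
    """Return the indices of records whose bloom date is inconsistent
    with other records that share a similar context."""
    context = standardize([r[:2] for r in records])
    scores = first_component(context)
    labels = kmeans_1d(scores)
    flags = flag_outliers([r[2] for r in records], labels)
    return [i for i, flagged in enumerate(flags) if flagged]

print(find_inconsistent(RECORDS))
```

The point of clustering on context first is that a bloom date is only judged against observations from comparable conditions: day 96 is normal for the warm sites in this toy data, while day 60 at a cold site is not.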

While some inconsistent observations may reflect real, unusual events, the researchers demonstrated that these observations also bias the estimated trends (advancement rates), in this case the trend in the onset date of lilac flowering. This shows that identifying inconsistent observations is a prerequisite for studying and interpreting the impact of climate change on the timing of life cycle events.
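How a single inconsistent observation can distort a trend is easy to see with a toy calculation. The sketch below uses invented numbers, not data from the publications: a hypothetical series of first-bloom dates that advances by 0.2 days per year, and the same series with one implausibly early record in the final year. The trend is estimated with an ordinary least-squares slope.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical series: bloom day advances by 0.2 days per year.
years = list(range(2000, 2010))
bloom_days = [120 - 0.2 * (year - 2000) for year in years]

# Same series, but the last year holds one inconsistent, very early record.
with_outlier = bloom_days[:-1] + [80.0]

clean_trend = ols_slope(years, bloom_days)
biased_trend = ols_slope(years, with_outlier)

print(f"trend without the inconsistent record: {clean_trend:.2f} days/year")
print(f"trend with the inconsistent record:    {biased_trend:.2f} days/year")
```

In this toy series, one bad record out of ten makes the apparent advancement rate many times larger than the true one, which is why inconsistent observations need to be identified before trends are interpreted.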

More information on the related publications:

1. Mehdipoor H, Zurita-Milla R, Rosemartin A, Gerst KL, Weltzin JF. Developing a Workflow to Identify Inconsistencies in Volunteered Geographic Information: A Phenological Case Study. PLoS ONE. 2015;10(10):e0140811. doi: 10.1371/journal.pone.0140811

2. Rosemartin AH, Denny EG, Weltzin JF, Lee Marsh R, Wilson BE, Mehdipoor H, Zurita-Milla R, Schwartz MD. Lilac and honeysuckle phenology data 1956–2014. Sci Data. 2015;2:150038. doi: 10.1038/sdata.2015.38

L.P.W. van der Velde MSc (Laurens)
Spokesperson Executive Board (EB)