Half a million photos every minute
The amount of data we churn out daily is growing rapidly. The Internet, social media, discussion forums and comment chains host unimaginable volumes of photos, text and (measurement) data. Google processes more than 40,000 searches per second – about 3.5 billion per day. On Snapchat, more than half a million photos appear every minute. The number and range of devices that generate all these data are also increasing: computers, databases, cameras, mobile phones, smart watches, cars, sensors. One of the challenges for data scientists: figure out how all those data can be used in a reliable manner to create real value for society.
Early screening means better treatment
‘The research in which we have been able to use blog texts written by people with traumatic experiences to predict the likelihood of post-traumatic stress disorder (PTSD) later on in their lives is an example of a societally relevant big data application,’ says Bernard Veldkamp. ‘You can imagine that police trade unions, for example, would highly value any insight into the extent to which police officers may face PTSD. Accurate predictions mean you can take more effective, more targeted prevention measures. Military veterans could also benefit, or victim support organizations. Early screening means better treatment.’
Veldkamp's colleague Raymond Veldhuis notes that his research, in which automated photo analysis is used to determine whether or not the person in a photo is a CEO, also suggests many potential applications. But it also raises questions, he adds. ‘Think, for example, of the ethical side. Assessing whether someone is a CEO or not is harmless enough, but what if someone were to use photos to decide whether a person is or is not a criminal? Ethical responsibility, or accountability, plays an important role in this type of research.’
How it works
Veldkamp – whose specializations include text mining, or extracting information from large amounts of text – explains how the process of predicting the likelihood of PTSD on the basis of text analysis works. ‘We started with texts written by two groups of people, all of whom had suffered trauma in the past. One group had later faced post-traumatic stress disorder, the other group had not. We then used an algorithm to analyse per text which words occur at which frequency and trained the computer to compare the texts of the two groups. Based on the outcomes, we built a model that uses a set of 2,000 words per group to perform the same measurement and comparison with new texts. The result is a likelihood ratio; in this scenario we were able to predict on the basis of a single blog text written by the trauma victim whether or not the writer would develop PTSD – with an accuracy of up to 85%.’ In the course of the research, Veldkamp and his colleagues further refined the system, reducing the key word set from 2,000 to a mere 200.
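The article does not publish the researchers' actual model, but the idea described above – comparing word frequencies in two training groups and scoring a new text with a likelihood ratio – can be sketched roughly as follows. All names, the smoothing choice and the toy texts are assumptions for illustration only:

```python
# Hedged sketch of a word-frequency likelihood-ratio score, NOT the
# researchers' actual system. Toy data and add-one smoothing are assumptions.
import math
from collections import Counter

def word_freqs(texts):
    """Word counts aggregated over one group's texts."""
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    return counts

def likelihood_ratio(text, ptsd_counts, control_counts):
    """Log-likelihood ratio of a new text under the two groups' word models.
    Positive: word use resembles the PTSD group more than the control group."""
    ptsd_total = sum(ptsd_counts.values())
    control_total = sum(control_counts.values())
    v = len(set(ptsd_counts) | set(control_counts))  # vocabulary size
    llr = 0.0
    for word in text.lower().split():
        p = (ptsd_counts[word] + 1) / (ptsd_total + v)      # add-one smoothing
        q = (control_counts[word] + 1) / (control_total + v)
        llr += math.log(p / q)
    return llr

# Toy training texts, purely illustrative
ptsd_texts = ["nightmares fear flashbacks sleepless", "fear panic nightmares"]
control_texts = ["work hobby friends travel", "friends travel relaxed"]
ptsd_c = word_freqs(ptsd_texts)
ctrl_c = word_freqs(control_texts)

print(likelihood_ratio("fear and nightmares again", ptsd_c, ctrl_c))  # positive
print(likelihood_ratio("travel with friends", ptsd_c, ctrl_c))        # negative
```

In a real system the score would be calibrated on held-out texts before any clinical claim could be made; the sketch only shows the shape of the computation.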
Beyond a simple yes or no
So what makes the work of Veldkamp, Veldhuis and their colleagues stand out from other big data achievements? ‘Machine learning – or training systems to draw conclusions based on measurements – is everywhere nowadays. There’s nothing distinctive about that,’ says Veldkamp. ‘But most current systems can only give a yes or a no; we call it end-to-end machine learning. What takes place inside the machine between the data input and the conclusion is unclear in most systems. The conclusion is not explainable. That reduces the reliability, the generalizability and potentially also the honesty and ethical responsibility of such applications. Our research stems from our dissatisfaction with that.’
Veldhuis adds that there is a lot of buzz surrounding big data and expectations are high. ‘With today’s huge computing power, we can indeed uncover hidden patterns in large data quantities, gaining genuinely valuable insights. But the big data phenomenon has not yet delivered on its real promise: in too many cases, because of the absence of explainability combined with the use of insufficiently large data sets, the generalizability of conclusions cannot be proven: we can’t say whether system A will draw the same conclusions as system B. The emphasis on explainability in our research changes that. The question is no longer whether we can train a computer to find certain answers, but how – by which steps – the computer does it. The better we can explain that, the more reliable, responsible and valuable the outcomes will be.’
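One concrete way a word-based classifier can be made explainable, in the spirit described above, is to report which individual words pushed the score in which direction rather than returning only a yes or no. The following is an illustrative sketch, not the researchers' system; the toy counts and function names are assumptions:

```python
# Hedged sketch: per-word score contributions as a simple form of
# explainability. Toy data; not the researchers' actual model.
import math
from collections import Counter

def contributions(text, group_a, group_b):
    """Per-word log-likelihood-ratio contributions (add-one smoothing).
    Positive values pull the text toward group A, negative toward group B.
    Returned sorted by influence, most influential words first."""
    ta, tb = sum(group_a.values()), sum(group_b.values())
    v = len(set(group_a) | set(group_b))
    out = []
    for w in text.lower().split():
        c = math.log((group_a[w] + 1) / (ta + v)) \
            - math.log((group_b[w] + 1) / (tb + v))
        out.append((w, round(c, 2)))
    return sorted(out, key=lambda item: -abs(item[1]))

# Toy word counts for two groups, purely illustrative
a = Counter({"fear": 3, "nightmares": 2})
b = Counter({"travel": 3, "friends": 2})
print(contributions("nightmares about travel", a, b))
```

Instead of an opaque verdict, the output names the words that drove the decision, which is exactly the kind of inspectable step-by-step account the researchers argue end-to-end systems lack.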
Teaming up with specialists is crucial
Veldkamp adds that collaboration with specialists makes a crucial difference. ‘In the test with the blog texts of trauma victims, we worked closely with psychiatrists. They can do what a data scientist cannot: interpret the possible implications of common words. This cross-disciplinary approach yields a system with high explainability. This is the future of big data.’