Half a million photos every minute

The amount of data we churn out daily is growing rapidly. The Internet, social media, discussion forums and comment chains host unimaginable volumes of photos, text and (measurement) data. Google processes more than 40,000 searches per second – or about 3.5 billion per day. On Snapchat, more than half a million photos appear every minute. The number and range of devices that generate all these data are also increasing: computers, databases, cameras, mobile phones, smart watches, cars, sensors. One of the challenges for data scientists: figure out how all those data can be used in a reliable manner to create real value for society.

Early screening means better treatment

‘The research in which we have been able to use blog texts written by people with traumatic experiences to predict the likelihood of post-traumatic stress disorder (PTSD) later on in their lives is an example of a societally relevant big data application,’ says Bernard Veldkamp. ‘You can imagine that police trade unions, for example, would highly value any insight into the extent to which police officers may face PTSD. Accurate predictions mean you can take more effective, more targeted prevention measures. Military veterans could also benefit, or victim support organizations. Early screening means better treatment.’

Veldkamp's colleague Raymond Veldhuis notes that his research, in which automated photo analysis is used to determine whether or not the person in a photo is a CEO, also suggests many potential application directions. But it also raises questions, he adds. ‘Think, for example, of the ethical side. Assessing whether someone is a CEO is harmless enough, but what if someone were to use photos to decide whether a person is or is not a criminal? Ethical responsibility, or accountability, plays an important role in this type of research.’

How it works

Veldkamp – whose specializations include text mining, or extracting information from large amounts of text – explains how the process of predicting the likelihood of PTSD on the basis of text analysis works. ‘We started with texts written by two groups of people, all of whom had suffered trauma in the past. One group had later faced post-traumatic stress disorder, the other group had not. We then used an algorithm to analyse per text which words occur at which frequency and trained the computer to compare the texts of the two groups. Based on the outcomes, we built a model that uses a set of 2,000 words per group to perform the same measurement and comparison with new texts. The result is a likelihood ratio; in this scenario we were able to predict on the basis of a single blog text written by the trauma victim whether or not the writer would develop PTSD – with an accuracy of up to 85%.’ In the course of the research, Veldkamp and his colleagues further refined the system, reducing the key word set from 2,000 to a mere 200.
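The general idea Veldkamp describes – comparing word frequencies between two groups of texts and scoring a new text with a likelihood ratio – can be sketched in a few lines of Python. This is an illustrative toy, not the researchers' actual system: the corpora, the word-splitting, and the add-one smoothing are all assumptions made for the example, and a real model would use thousands of curated key words and clinical validation.

```python
from collections import Counter
import math

# Hypothetical toy corpora standing in for blog texts by trauma victims
# who later did (group A) or did not (group B) develop PTSD.
group_a_texts = [
    "i still cannot sleep the fear returns every night",
    "the fear never leaves flashbacks haunt my sleep",
]
group_b_texts = [
    "talking with friends helped me recover and move on",
    "i recovered slowly and life feels normal again",
]

def word_counts(texts):
    """Aggregate word frequencies over all texts in one group."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

def log_likelihood_ratio(text, counts_a, counts_b):
    """Score a new text by summing log P(word|A)/P(word|B) per word,
    with add-one smoothing so unseen words contribute zero-ish evidence.
    Positive score: text resembles group A; negative: group B."""
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    score = 0.0
    for word in text.split():
        p_a = (counts_a[word] + 1) / total_a
        p_b = (counts_b[word] + 1) / total_b
        score += math.log(p_a / p_b)
    return score

counts_a = word_counts(group_a_texts)
counts_b = word_counts(group_b_texts)
print(log_likelihood_ratio("the fear keeps me from sleep", counts_a, counts_b))
```

A sentence sharing vocabulary with group A scores positive; one echoing group B scores negative. Thresholding that score yields the yes/no prediction the article mentions, while the per-word contributions remain inspectable – which is exactly the explainability the researchers emphasize.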

Beyond a simple yes or no

So what makes the work of Veldkamp, Veldhuis and their colleagues stand out from other big data achievements? ‘Machine learning – or training systems to draw conclusions based on measurements – is everywhere nowadays. There’s nothing distinctive about that,’ says Veldkamp. ‘But most current systems can only give a yes or a no; we call it end-to-end machine learning. What takes place inside the machine between the data input and the conclusion is unclear in most systems. The conclusion is not explainable. That reduces the reliability, the generalizability and potentially also the honesty and ethical responsibility of such applications. Our research stems from our dissatisfaction with that.’

Veldhuis adds that there is a lot of buzz surrounding big data, and expectations are high. ‘With today’s huge computing power, we can indeed uncover hidden patterns in large data quantities, gaining genuinely valuable insights. But the big data phenomenon has not yet delivered on its real promise: in too many cases, owing to the absence of explainability combined with the use of insufficiently large data sets, the generalizability of conclusions cannot be proven: we can’t say whether system A will draw the same conclusions as system B. The emphasis on explainability in our research changes that. The question is no longer whether we can train a computer to find certain answers, but how – by which steps – the computer does it. The better we can explain that, the more reliable, responsible and valuable the outcomes will be.’

Teaming up with specialists is crucial

Veldkamp adds that collaboration with specialists makes a crucial difference. ‘In the test with the blog texts of trauma victims, we worked closely with psychiatrists. They can do what a data scientist cannot: interpret the possible implications of common words. This cross-disciplinary approach yields a system with high explainability. This is the future of big data.’

prof.dr.ir. R.N.J. Veldhuis (Raymond)
Professor of Biometric Pattern Recognition, University of Twente. Studied Electrical Engineering at the University of Twente, and Mathematics and Physics at Radboud University Nijmegen. Focus areas: fundamental and applied research in the areas of face recognition (2D and 3D), fingerprint recognition, vascular pattern recognition, multibiometric fusion, and biometric template protection.

‘So much is happening so fast in automation that we do not always understand how it all works. I always say: if it can be automated, it has to be explainable. That’s my drive. We are developing more and more systems that produce very interesting conclusions or decisions. The navigation systems in our cars are just one example. But I am not satisfied until we know exactly where those decisions come from. Only then will we have systems that add real value to society.’

prof.dr.ir. B.P. Veldkamp (Bernard)
Professor of Research Methodology, University of Twente. Studied Applied Mathematics and Psychometrics at the University of Twente. Focus areas: research methodology, data analysis, optimization, text mining, computer-based assessment.

‘In all kinds of sectors we’re seeing increasing amounts of data. Psychometry, for example, generates large amounts of highly controlled data. The ways in which we use those data to assess or comment on people must be both efficient and honest. I see contributing to that efficiency and honesty as an important task.’