Towards smart speech recognition
Automatic speech recognition is ready for the next level, thinks computer scientist Khiet Truong. She is developing technology that gets more out of a conversation than only the spoken words.
According to Truong, we still know too little about the nonverbal aspects of communication, that is, how a person says something. For a long time, technology has mainly focused on the basic form of speech recognition: what a person says. You can see this in virtual assistants such as Siri and Alexa.
It’s all about nuances
"The interesting part for me is the layer beyond the words, the emotions or mental state of a speaker. Can we extract that layer from speech? Are both speakers on the same page in a dialogue? What is the meaning of the laugh in this conversation?" says Truong. "That's important information because in communication it's all about nuance. About silences, a sigh, or a laugh of a person. It makes my research relevant."
The bottom line of Truong’s research is that the machine (the healthcare robot, for example) should better sense what the human is saying. And vice versa: the machine should communicate naturally with the user. In the past few years of research, she has found out that developing improved speech recognition is not that easy. "It's super complex. Even psychologists don't quite know how it works yet. So, it's extra difficult for computer scientists like me to convert human behavior into ones and zeros."
Interaction depends on context and on the person speaking, many studies have shown. "For a long time, I thought that it would be possible to make a general model for automatic speech recognition. Now I know the individual should be given much more attention and the context should also be modeled. This concerns nonverbal communication such as giving a nod."
Psychology has long assumed six basic emotions: anger, joy, disgust, fear, surprise, and sadness. In addition, there are speech characteristics such as pitch, speaking rate, disfluencies, and sighs. Databases with speech recordings of actors could serve as input for an intelligent form of speech recognition, but such acted recordings do not capture spontaneous emotions. Moreover, it is an open question how often an emotion like anger can actually be heard in a conversation.
Khiet Truong wants to focus more on mental well-being. She is doing research on psychiatric disorders such as bipolar disorder, depression, and dementia. "Hopefully in the future, we will be able to predict depression by examining the way a person talks."
A research project is underway in collaboration with colleague Vanessa Evers, a professor in the Human Media Interaction group in Twente, and the Institute for Sound and Vision in Hilversum (Nederlands Instituut voor Beeld en Geluid). The research focuses on the interaction between robot and child using media archives. It investigates what factors could potentially influence the trust that the child has in the robot when it acts as a mediator interfacing with a media archive and whether we can hear this trust reflected in the interaction between child and robot.
Research and education
Khiet Truong teaches several courses within the Interaction Technology master's program. "In the classroom, I like to use current news topics from the media to show that our work is relevant. The link to reality is important for me. It shows that we are doing this for a reason and that it ends up somewhere. What I like the most about teaching is the one-on-one coaching of students, because I can discuss my research in more detail and can be closer to my students. It's great to see how young people gain scientific insights and develop further."
About Khiet Truong
Dr. Khiet Truong (1980) studied linguistics at Utrecht University and was a researcher at TNO on the automatic recognition of emotion in speech and laughter. In 2009 she obtained her PhD in computer science from the University of Twente, where she is now an assistant professor in the Human Media Interaction group. Truong is also a board member of stichting Open Spraaktechnologie (Open Speech Technology Foundation), which collects speech recognition software and aims to make it available via open source.