
PhD Defense Jaebok Kim

Automatic Recognition of Engagement and Emotion in a Group of Children

Jaebok Kim is a PhD student in the research group Human Media Interaction (HMI). His supervisor is prof.dr. V. Evers of the Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS).

Young children develop social skills at their own pace; large variances in development occasionally occur and lead to imbalanced engagement in small groups. In such situations, an agent such as a social robot can recognise individual levels of engagement and support a child who is less engaged than the others. This dissertation aimed to develop automatic methods for recognising the engagement levels and emotions of children in a small-group play setting. This posed great challenges, such as modelling temporal and group dynamics in a noisy environment. My thesis consists of three parts: (1) automatic ranking of engagement in a group of children; (2) automatic recognition of emotion in speech; and (3) integration of emotional states into an engagement-ranking model.

Firstly, I created a corpus of spontaneous interactions of children in group play and defined non-verbal features: vocal turn-taking and body movement. To deal with the group and temporal dynamics of engagement, I developed an ordinal coding scheme that compares children’s engagement levels. Secondly, several automatic recognition methods were examined, and an SVM-based ranking approach worked best for our task. Finally, I proposed a temporal-ranking algorithm that utilizes temporal transitions between engagement levels; applying this algorithm, I found significant gains over state-of-the-art methods.
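To make the ranking approach concrete, the sketch below shows one common way to realize SVM-based ranking (a RankSVM-style pairwise reduction) with scikit-learn: per-child feature vectors are converted into difference vectors, a linear SVM learns to classify their order, and its decision values rank the children. The feature names and toy data are hypothetical placeholders, not the corpus features used in the thesis.

```python
# Minimal sketch of SVM-based pairwise ranking (RankSVM-style).
import numpy as np
from sklearn.svm import LinearSVC

def to_pairwise(X, y):
    """Turn per-child feature vectors X and ordinal engagement
    labels y into difference vectors with binary order targets."""
    Xp, yp = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:             # child i more engaged than child j
                Xp.append(X[i] - X[j])  # positive difference example
                yp.append(1)
                Xp.append(X[j] - X[i])  # mirrored negative example
                yp.append(-1)
    return np.array(Xp), np.array(yp)

# Toy data: 4 children, 3 hypothetical features each (e.g. turn
# frequency, speech-overlap ratio, body-movement energy).
X = np.array([[0.8, 0.1, 0.5],
              [0.2, 0.4, 0.1],
              [0.5, 0.3, 0.3],
              [0.9, 0.2, 0.7]])
y = np.array([3, 1, 2, 3])           # ordinal engagement labels

Xp, yp = to_pairwise(X, y)
ranker = LinearSVC(C=1.0).fit(Xp, yp)

# Higher decision value = higher predicted engagement rank.
scores = ranker.decision_function(X)
print(np.argsort(-scores))           # children, most to least engaged
```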

I aimed to further enhance my engagement-ranking method by utilizing the emotional states of children. To this end, I developed methods that generalize speech emotion recognition models over diverse contextual factors, such as speakers, expression styles, and languages, and I aggregated many corpora that cover these factors. In the first study, I proposed to learn discriminative representations of emotional utterances automatically by utilizing gender and naturalness information as subtasks. My results indicated that multi-task learning of gender and naturalness improved the generalization of models over the diversity of both factors. The next two studies focused on the optimization of deep temporal architectures. I found that skip connections between layers helped the optimization of such deep temporal architectures, and that three-dimensional convolutional networks were capable of automatically learning discriminative representations of emotional utterances without Long Short-Term Memory (LSTM) units.
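The sketch below illustrates the multi-task idea in PyTorch, assuming a shared encoder with an emotion head plus auxiliary gender and naturalness heads; the layer sizes, input features, and loss weights are illustrative assumptions, not the architectures evaluated in the thesis.

```python
# Hedged sketch: multi-task speech emotion recognition with gender and
# naturalness as auxiliary subtasks regularizing a shared encoder.
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    def __init__(self, n_features=40, n_emotions=4):
        super().__init__()
        # Shared encoder over utterance-level acoustic features.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.emotion_head = nn.Linear(64, n_emotions)  # main task
        self.gender_head = nn.Linear(64, 2)            # auxiliary task
        self.natural_head = nn.Linear(64, 2)           # acted vs. spontaneous

    def forward(self, x):
        h = self.encoder(x)
        return self.emotion_head(h), self.gender_head(h), self.natural_head(h)

model = MultiTaskSER()
ce = nn.CrossEntropyLoss()
x = torch.randn(8, 40)              # random batch for illustration
y_emo = torch.randint(0, 4, (8,))
y_gen = torch.randint(0, 2, (8,))
y_nat = torch.randint(0, 2, (8,))

e, g, n = model(x)
# Auxiliary losses (weights assumed) push the shared representation
# to be informative about gender and naturalness as well as emotion.
loss = ce(e, y_emo) + 0.3 * ce(g, y_gen) + 0.3 * ce(n, y_nat)
loss.backward()
```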

Finally, I integrated the speech emotion recognition methods into the engagement-ranking model. Emotional-state-based features turned out to be as informative as turn-taking-based features. I proposed a deep convolutional ranking network (DCRN) that learns compact representations of high-dimensional features more efficiently, and examined various fusion methods such as early, intermediate, and late fusion. I found that late fusion, in which multiple independent DCRNs each learn an abstract representation of one feature set and a simple classifier uses the compact representations, delivered the best performance.
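A minimal late-fusion sketch follows, assuming two independent 1-D convolutional encoders (one per feature stream) whose compact embeddings are concatenated into a simple linear classifier; the channel counts, sequence length, and number of engagement ranks are placeholders rather than the DCRN configuration itself.

```python
# Illustrative late fusion: independent encoders per feature set,
# a simple classifier on the concatenated compact representations.
import torch
import torch.nn as nn

def make_encoder(in_channels):
    # 1-D convolutions over time yield a compact per-stream embedding.
    return nn.Sequential(
        nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),   # -> (batch, 32)
    )

turn_encoder = make_encoder(in_channels=6)     # turn-taking features
emotion_encoder = make_encoder(in_channels=4)  # emotional-state features
classifier = nn.Linear(64, 3)                  # e.g. 3 engagement ranks

turns = torch.randn(8, 6, 100)     # (batch, features, time)
emotions = torch.randn(8, 4, 100)

# Late fusion: each stream is encoded independently, then concatenated.
fused = torch.cat([turn_encoder(turns), emotion_encoder(emotions)], dim=1)
logits = classifier(fused)
print(logits.shape)                # torch.Size([8, 3])
```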

Throughout the chapters of this dissertation, I developed automatic methods that recognize individual engagement levels by considering the temporal and group dynamics of engagement. My results indicated that ordinal coding and learning were the most suitable for the task. This dissertation also provides practical and robust guidelines for evaluating deep architectures for speech emotion recognition tasks. In particular, large-scale cross-validations using aggregated corpora are necessary to assess real performance in the wild.
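One way to realize such a cross-validation is a leave-one-corpus-out protocol: train on all but one corpus, test on the held-out one, and report unweighted average recall (UAR). The sketch below uses placeholder corpora and a generic classifier; it shows the protocol, not the thesis experiments.

```python
# Hedged sketch of leave-one-corpus-out cross-validation over
# aggregated corpora, with random placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
corpora = {name: (rng.standard_normal((50, 20)), rng.integers(0, 4, 50))
           for name in ["corpus_a", "corpus_b", "corpus_c"]}

for held_out in corpora:
    X_tr = np.vstack([X for n, (X, y) in corpora.items() if n != held_out])
    y_tr = np.hstack([y for n, (X, y) in corpora.items() if n != held_out])
    X_te, y_te = corpora[held_out]
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    uar = recall_score(y_te, model.predict(X_te), average="macro")
    print(f"{held_out}: UAR = {uar:.2f}")   # unweighted average recall
```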