Seeing with the sound - Sound-based human-context recognition using machine learning
Due to the COVID-19 crisis, the PhD defence of Wei Wang will take place (partly) online.
The PhD defence can be followed by a live stream.
Wei Wang is a PhD student in the research group Pervasive Systems (PS). His supervisor is prof.dr. P.J.M. Havinga from the Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS).
This research is about using sound to detect human context, i.e. what people are doing and how many people are around. Many modern applications, such as smart buildings, healthcare and retail, depend heavily on this human-context information to make our lives better. Take smart buildings as an example: energy can be saved by turning lights off when no one is around, and comfort can be improved by automatically adjusting the room temperature to people's activities.
Many kinds of sensors are widely used in this domain, such as simple PIR sensors that detect human presence, camera systems that monitor and recognize very detailed human actions, and wearable devices that track user movement and identity. None of these sensors and technologies is perfect; each has its pros and cons. For instance, PIR sensors are cheap and non-intrusive but only give binary presence information. Cameras can identify human activities at a fine-grained level, but they are more expensive, pose privacy risks and require line-of-sight. Wearable devices are diverse, but all of them need careful maintenance, and users often find them intrusive and troublesome.
Our research addresses the above-mentioned challenges by using sound sensors to detect human-context information indoors. Sound is everywhere and has many advantages: it is rich in information and does not require line-of-sight. Audio sensors, i.e. microphones, are also very suitable for indoor applications as they are cheap, small and easy to install. On the other hand, sound also poses obvious challenges, such as noise interference and the overlap of multiple sounds. In addition, sound-based applications in buildings need further consideration of privacy concerns and the resource constraints of devices.
To study and address the impact of noise and overlapping sounds, our research is conducted on different scenarios and datasets, ranging from clean sounds in quiet environments to overlapping sounds in crowded environments. To tackle the challenges in these scenarios, several methodologies are carefully designed and compared. Besides accuracy, we also compare memory and time costs to show how well the methods fit resource-constrained devices. We also propose an innovative lightweight CNN-based model to balance performance and complexity. In some experiments, we strip the voice bands from the audio inputs to explore the possibility of using low-privacy-risk data while still maintaining high accuracy.
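As a rough illustration of what stripping the voice bands could look like, the sketch below zeroes the frequency bins of a short signal that fall inside an assumed voice band. The 300 to 3400 Hz range (the conventional telephony band), the naive DFT and the function names strip_voice_band and tone are assumptions made for this example; the thesis may define the bands and the filtering differently.

```python
import cmath
import math


def strip_voice_band(samples, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Return the signal with DFT bins in [low_hz, high_hz] set to zero.

    The band limits are illustrative defaults, not values from the thesis.
    """
    n = len(samples)
    # Forward DFT (naive O(n^2); acceptable for a short illustrative signal).
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bin k and its mirror n - k carry the same physical frequency.
        freq = min(k, n - k) * sample_rate / n
        if low_hz <= freq <= high_hz:
            spectrum[k] = 0
    # Inverse DFT; the output is real because the spectrum stays symmetric.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def tone(freq_hz, sample_rate, n):
    """Generate n samples of a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


if __name__ == "__main__":
    rate, n = 8000, 80
    low = tone(100, rate, n)      # outside the assumed voice band
    voice = tone(1000, rate, n)   # inside the assumed voice band
    mixed = [a + b for a, b in zip(low, voice)]
    filtered = strip_voice_band(mixed, rate)
    print(max(abs(f - l) for f, l in zip(filtered, low)))
```

In this toy setting the 1000 Hz component is removed while the 100 Hz component is preserved. A deployed system would more likely use an FFT or a band-stop filter running on the sensor itself.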
In this thesis, we use sound to detect human activities and the number of people, the two most basic elements of human context. First, we classify single-source sounds that are related to human activities. Second, we propose a multi-microphone model for noisy environments. This model first decomposes overlapping sounds and then recognizes each of the decomposed signals. Third, we provide a crowd-activity-based solution for crowded environments. Instead of recognizing individual sound events, we estimate the ratio of different sounds in order to sense the human context in the crowd. Finally, we propose a voice-based people-counting model. We count people both by estimating the number of speakers in multi-speaker sounds and by clustering single-speaker sounds. In this task, the key voice feature is not hand-engineered but obtained through transfer learning.
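The idea of counting people by clustering single-speaker sounds can be sketched as follows. The example assumes that each single-speaker segment has already been mapped to a non-zero feature vector (for instance by a pretrained network), and it groups vectors greedily by cosine similarity. The function names, the threshold of 0.8 and the greedy strategy are hypothetical choices for illustration, not the clustering method used in the thesis.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def count_speakers(embeddings, threshold=0.8):
    """Estimate the number of speakers as the number of greedy clusters.

    Each embedding joins the most similar existing cluster if the cosine
    similarity reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each entry: (sum of member vectors, member count)
    for emb in embeddings:
        best_index, best_sim = None, threshold
        for i, (total, _) in enumerate(clusters):
            # Cosine similarity to the sum equals similarity to the mean.
            sim = cosine_similarity(emb, total)
            if sim >= best_sim:
                best_index, best_sim = i, sim
        if best_index is None:
            clusters.append((list(emb), 1))
        else:
            total, count = clusters[best_index]
            clusters[best_index] = ([t + e for t, e in zip(total, emb)], count + 1)
    return len(clusters)


if __name__ == "__main__":
    # Toy 3-dimensional "voice features"; real embeddings are much larger.
    segments = [
        [1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
        [0.1, 0.95, 0.0],
        [0.0, 0.0, 1.0],
    ]
    print(count_speakers(segments))
```

The result depends on the threshold and on the order of the segments, so this sketch only conveys the principle that similar voice features are merged and the number of groups is reported as the people count.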
In brief, our study has shown that sound is suitable for both human activity recognition and people counting in quiet environments. When environments become noisier, sound-based models become less reliable and also require far more resources.