Unsupervised learning approaches for non-stationary data streams
Due to the COVID-19 crisis, the PhD defence of Kemilly Dearo Garcia will take place online (until further notice).
The PhD defence can be followed by a live stream.
Kemilly Dearo Garcia is a PhD student in the research group Datamanagement & Biometrics (DMB). Her supervisor is prof.dr. J.N. Kok from the Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS).
Modern society is surrounded by applications that generate large volumes of data every day. Nowadays, anyone can monitor their physical activities in real time using smartphones or wearable devices. Businesses and governments can likewise learn more about their clients and citizens by analysing information from social media, for example. Such data is called a data stream when it is generated continuously as a sequence, usually at high speed. A data stream is also potentially unbounded in size and may not be strictly stationary.
Extracting useful knowledge from data streams is challenging because of several constraints. A data stream requires a learning algorithm to act in a dynamic environment, which means that the algorithm should allow for real-time processing. Moreover, it should be able to adapt to changes over time, given the non-stationary nature of the data stream.
In the last few decades, many machine learning approaches have been proposed for data streams. Most of them are based on supervised learning and rely on labeled data to adapt their models to changes in the data stream. However, labeling data is usually costly and can require domain expertise. Moreover, if the data is collected at high speed, there may not be enough time to label it.
In this thesis, we propose unsupervised and incremental machine learning algorithms for data streams. We focus on algorithms that can update their classification model with little or no external feedback. We start by addressing the problem of concept drift in data streams with little labeled data. For that problem, we propose a semi-supervised approach called Sliding Window Clusters. This method learns the current patterns from the data stream by selecting and summarising the most relevant data. We also study how to learn from data streams when novelties appear over time. To that end, we propose an unsupervised learning method called Higia, which can classify data as normal, novelty or concept drift. We further propose an approach that combines different unsupervised approaches into a classification model, and we test it in two scenarios. The first, Homogeneous Ensemble Clustering for Data Streams, combines different runs of the same clustering algorithm. The second, Heterogeneous Ensemble Clustering for Data Streams, combines different clustering algorithms. These methods allow clustering approaches with different biases to be used together to obtain a more robust classification model.
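The idea of combining several runs of one clustering algorithm can be illustrated with a small sketch. It is not the thesis's method: the announcement does not describe how the runs are combined, so the sketch assumes a co-association matrix, a common ensemble clustering technique, and uses a plain batch k-means on a fixed window of data rather than a true stream. All function names (`kmeans`, `co_association`, `consensus_labels`) and parameters are hypothetical.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=20):
    """Plain k-means; returns one cluster label per row of X."""
    # Fancy indexing copies the rows, so X itself is never modified.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels

def co_association(X, k, n_runs=10, seed=0):
    """Fraction of runs in which each pair of points shares a cluster."""
    rng = np.random.default_rng(seed)
    C = np.zeros((len(X), len(X)))
    for _ in range(n_runs):
        labels = kmeans(X, k, rng)  # same algorithm, different random start
        C += labels[:, None] == labels[None, :]
    return C / n_runs

def consensus_labels(C, threshold=0.5):
    """Group points linked by co-association above the threshold
    (connected components of the thresholded matrix)."""
    n = len(C)
    labels = np.full(n, -1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero((C[i] > threshold) & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

# Toy window of data: two well-separated groups (assumed for illustration).
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0.0, 0.1, size=(20, 2)),
               data_rng.normal(10.0, 0.1, size=(20, 2))])
C = co_association(X, k=2)
labels = consensus_labels(C)
```

A heterogeneous variant could fill the same matrix with label vectors produced by different clustering algorithms instead of repeated k-means runs; only the inner call in `co_association` would change.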
Furthermore, we evaluate the state-of-the-art approaches commonly referred to in the literature on novelty detection in data streams.
Most of this thesis focuses on clustering approaches. However, given the popularity of neural networks, we also propose the Ensemble of Auto-Encoders. This approach combines auto-encoders into an ensemble model, in which each auto-encoder is specialised in recognising one particular class. The Ensemble of Auto-Encoders has a modular structure, which makes the model easy to adapt to changes in the data. It also allows for personalised models, because the model can adapt to the most requested classes. This contribution is applied to the problem of Human Activity Recognition. Experimental results show the potential of the approaches mentioned.
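A minimal sketch of the one-auto-encoder-per-class idea follows. It makes several assumptions that do not come from the thesis: each neural auto-encoder is replaced by a linear auto-encoder fitted in closed form (equivalent to PCA), a sample is assigned to the class whose auto-encoder reconstructs it with the smallest error, and the class names and toy data are invented for illustration. The class names `LinearAutoEncoder` and `AutoEncoderEnsemble` are hypothetical.

```python
import numpy as np

class LinearAutoEncoder:
    """Stand-in for a neural auto-encoder: a linear encoder/decoder
    obtained from the SVD of the centred training data (i.e. PCA)."""

    def __init__(self, n_components=1):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[:self.n_components]
        return self

    def reconstruction_error(self, X):
        centred = X - self.mean_
        codes = centred @ self.components_.T   # encode
        recon = codes @ self.components_       # decode
        return np.sum((centred - recon) ** 2, axis=1)

class AutoEncoderEnsemble:
    """One auto-encoder per class; members can be added or replaced
    independently, which is what makes the structure modular."""

    def __init__(self, n_components=1):
        self.n_components = n_components
        self.members = {}

    def add_class(self, label, X):
        self.members[label] = LinearAutoEncoder(self.n_components).fit(X)

    def predict(self, X):
        names = list(self.members)
        errors = np.column_stack(
            [self.members[name].reconstruction_error(X) for name in names]
        )
        # Each sample goes to the class that reconstructs it best.
        return [names[i] for i in errors.argmin(axis=1)]

# Toy data (assumed): each activity varies along a different axis.
rng = np.random.default_rng(0)
t = np.linspace(-1.0, 1.0, 50)
walking = np.outer(t, [1.0, 0.0, 0.0]) + 0.01 * rng.normal(size=(50, 3))
running = np.outer(t, [0.0, 1.0, 0.0]) + 0.01 * rng.normal(size=(50, 3))

ensemble = AutoEncoderEnsemble(n_components=1)
ensemble.add_class("walking", walking)
ensemble.add_class("running", running)

predictions = ensemble.predict(np.array([[2.0, 0.0, 0.0], [0.0, -3.0, 0.0]]))
```

Under these assumptions, supporting a new activity only requires another `add_class` call, and retraining one member leaves the others untouched.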
Keywords: data streams, unsupervised learning, incremental learning