Handling Missing Data with Meta-Learning and Large Language Models
Işıl Baysal Erez is a PhD student in the department Datamanagement & Biometrics. Her (co)promotors are prof.dr.ir. M. van Keulen and dr. M. Poel from the faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente.
Missing data is a common data quality issue that can make data analysis challenging. Tabular datasets, widely used by companies and organizations in a variety of sectors, often suffer from missing data problems. Because most predictive models are not designed to handle missing data natively, these problems must be addressed before analysis, for example by means of data imputation.
For several decades, new imputation methods have been developed continuously; however, no single approach has proven universally effective across all missing data scenarios. Choosing the appropriate imputation method for a given dataset may depend on several factors, such as the missingness level and the missing data mechanism. A commonly used approach is trial and error, typically carried out via grid search, in which data analysts experiment with different imputation methods. However, this approach is time-consuming and computationally expensive. As an alternative, Automated Machine Learning (AutoML) tools are end-to-end machine learning pipeline systems that automate key steps such as method selection, hyperparameter tuning, and evaluation. The selection of data imputation algorithms also falls within the scope of AutoML. These tools use optimized search-based approaches for method selection, which are still computationally expensive, although less so than trial and error. In this thesis, a meta-learning-based recommendation system has been developed that uses data characteristics derived from datasets with missing values. With this recommendation system, data analysts can obtain a specific imputation method recommendation in a computationally inexpensive manner. The recommendation system is itself produced by our recommender system development framework, with which more advanced recommenders can be developed in the future.
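To make the idea of recommending an imputer from data characteristics concrete, the following is a minimal sketch, not the recommender developed in the thesis. It assumes three simple missingness statistics as meta-features and a 1-nearest-neighbour lookup as the meta-learner; the function names, the meta-dataset, and its values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def meta_features(rows):
    """Describe a table (list of equal-length rows, None = missing value)
    by three simple missingness statistics."""
    n_rows, n_cols = len(rows), len(rows[0])
    missing_cells = sum(v is None for row in rows for v in row)
    rows_with_gap = sum(any(v is None for v in row) for row in rows)
    cols_with_gap = sum(
        any(row[j] is None for row in rows) for j in range(n_cols)
    )
    return (
        missing_cells / (n_rows * n_cols),  # overall missingness level
        rows_with_gap / n_rows,             # share of incomplete rows
        cols_with_gap / n_cols,             # share of incomplete columns
    )


def recommend(features, meta_dataset):
    """Return the imputer recorded for the most similar known dataset
    (1-nearest neighbour on Euclidean distance between meta-features)."""
    def distance(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    _, best_imputer = min(
        meta_dataset, key=lambda entry: distance(features, entry[0])
    )
    return best_imputer


# Hypothetical meta-dataset: (meta-features, imputer that performed best).
# The numbers and labels are made up for this sketch.
META_DATASET = [
    ((0.05, 0.10, 0.25), "mean"),
    ((0.30, 0.70, 1.00), "knn"),
]

table = [
    [1.0, None, 3.0],
    [None, 2.0, None],
    [4.0, 5.0, None],
    [7.0, None, 9.0],
]

features = meta_features(table)
recommendation = recommend(features, META_DATASET)
print(features, recommendation)
```

The point of the sketch is the cost profile: once a meta-dataset exists, a recommendation needs only one pass over the new table and a lookup, with no imputation method actually being run on it.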
A dataset is usually imputed so that a predictive model can subsequently be developed on it. We have shown that well or even perfectly reconstructed data does not guarantee the best predictive performance. Therefore, it is important to treat imputation method selection and machine learning method selection as a joint selection problem that accounts for the relationship between them. Unfortunately, joint method selection is more complex and costly than selecting only one method, since it inherently requires analysis of all combinations. Our meta-learning approach can provide joint imputer and regressor recommendations in an equally inexpensive manner.
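The cost of joint selection can be illustrated with a small exhaustive search, given here as a sketch under simplifying assumptions rather than as the procedure used in the thesis. It assumes a single feature column, two toy imputers (mean and median), two toy regressors (a constant predictor and a least-squares line), and a fully observed test set; all names and data values are hypothetical.

```python
from itertools import product
from statistics import mean, median


def impute(values, fill):
    """Replace None with fill(observed values), e.g. the mean or median."""
    filler = fill([v for v in values if v is not None])
    return [filler if v is None else v for v in values]


def fit_constant(xs, ys):
    """Toy regressor: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Toy regressor: one-dimensional least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    )
    return lambda x: my + slope * (x - mx)


IMPUTERS = {"mean": mean, "median": median}
REGRESSORS = {"constant": fit_constant, "line": fit_line}

# Made-up data: the feature has missing values, the test set is complete.
x_train = [1.0, 2.0, None, 4.0, 20.0, None]
y_train = [2.0, 4.0, 6.0, 8.0, 40.0, 10.0]
x_test = [3.0, 5.0]
y_test = [6.0, 10.0]

# Every imputer is paired with every regressor, so the number of
# pipelines to train and evaluate is len(IMPUTERS) * len(REGRESSORS).
scores = {}
for (imp_name, fill), (reg_name, fit) in product(
    IMPUTERS.items(), REGRESSORS.items()
):
    model = fit(impute(x_train, fill), y_train)
    errors = [(model(x) - y) ** 2 for x, y in zip(x_test, y_test)]
    scores[(imp_name, reg_name)] = mean(errors)  # mean squared error

best = min(scores, key=scores.get)
print(len(scores), "pipelines evaluated; best:", best)
```

With realistic candidate lists the product grows quickly, and each pipeline must be fitted and scored, which is the cost a joint recommendation is meant to avoid.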
Besides the challenges of finding appropriate imputers and regressors for the datasets with missing values, explaining the recommendations provided by these models is also crucial for data analysts. We propose several recommendation models for imputer+regressor selection using the meta-learning approach. To ensure transparency and trust in these automated recommendations, we incorporate global and local explanation techniques that provide insight into both general model behaviour and reasoning behind individual recommendations.
In recent years, large language models (LLMs) have penetrated domains far beyond natural language processing, including tabular data analysis, code generation, and recommendation systems. As the use of LLMs for handling missing data remains understudied, we explore their potential to support algorithm selection and to inspire novel imputation strategies based on the type of missing data mechanism. We find that LLMs offer data analysts a promising alternative approach for addressing missing data issues, both as a new imputation method and as a tool for imputation method selection.
In conclusion, this thesis provides insight and practical tools for addressing missing data issues: an explainable, computationally inexpensive meta-learning-based recommender system, several frameworks for further development, and promising ways in which LLMs can be utilized for this purpose.