BACHELOR Assignment
A probabilistic approach towards handling data quality problems and imperfect data integration tasks
Type: Bachelor CS
Period: TBD
Student: (Unassigned)
If you are interested, please contact:
Description:
DuBio is an extension for PostgreSQL that we are currently developing for managing and manipulating uncertain data or, to use a more technical term, probabilistic data. Being able to manage data together with the uncertainty about that data is an effective way to get a grip on data quality issues and deal with them effectively. One prominent application is data integration. Probabilistic data integration (PDI) is a specific kind of data integration where integration problems such as inconsistency and uncertainty are handled by means of a probabilistic data representation. The approach is based on the view that data quality problems (as they occur in an integration process) can be modeled as uncertainty, and that this uncertainty is an important result of the integration process itself. The PDI process consists of two phases: (i) a quick partial integration, in which certain data quality problems are not solved immediately but are explicitly represented as uncertainty in the resulting integrated data, stored in a probabilistic database such as DuBio; (ii) continuous improvement by using the data (a probabilistic database can be queried directly, resulting in possible or approximate answers) and by gathering evidence (e.g., user feedback) to improve the data quality.
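To make the idea of querying uncertain data concrete, here is a minimal, self-contained sketch in Python. The table layout, names, and data are purely illustrative assumptions; DuBio's actual representation (with random variables and sentences) is richer than this:

```python
# Toy sketch of a probabilistic relation: each tuple carries the
# probability that it belongs to the "true" integrated database.
# Illustration only; this is not DuBio's internal representation.

prob_person = [
    # (name, city, probability that this alternative is correct)
    ("J. Smith", "Amsterdam", 0.7),   # two conflicting source records,
    ("J. Smith", "Rotterdam", 0.3),   # kept as explicit uncertainty
    ("A. Jones", "Utrecht",   1.0),   # no quality problem: certain tuple
]

def possible_answers(table, predicate):
    """Query directly on uncertain data: return every matching
    alternative together with its probability."""
    return [(tuple(row), p) for *row, p in table if predicate(row)]

answers = possible_answers(prob_person, lambda r: r[0] == "J. Smith")
# Both cities are returned, each with its confidence, instead of
# forcing a premature (and possibly wrong) cleaning decision.
```

The point of the sketch is phase (ii): the data remains usable while still uncertain, and the probabilities are the hooks for later improvement.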
We invite you to participate in this research and in making DuBio a success! You can do so by choosing one of the following subprojects:
- In-database indeterministic deduplication
An algorithm has been developed (in Python) for indeterministic deduplication, a form of deduplication where the uncertainty about whether certain records are duplicates is explicitly managed. The work on indeterministic deduplication could have more impact if the process could execute fully inside the database, i.e., given a table of records and a table with similarity scores for these records, a PL/pgSQL function produces a probabilistic table with the indeterministic deduplication result. Execution performance should be evaluated under relevant conditions. This assignment may require some SQL trickery, so an aptitude for programming (in SQL) is beneficial.
- A probabilistic version of the MICE data imputation approach
MICE, or Multiple Imputation by Chained Equations, is a powerful and flexible statistical technique for handling missing data. It is an iterative, predictive approach that imputes missing values one variable at a time using a series of regression models. It is particularly effective because it preserves the relationships between variables and accounts for the uncertainty inherent in imputation. MICE runs the imputation process multiple times and then performs an analysis and pooling step to produce a single final dataset. This assignment is about replacing that final analysis and pooling step with the generation of a probabilistic table representing the final imputed dataset. Experiments should be done to evaluate the quality of the imputation compared to the original MICE.
- Impact of caching probabilities in BDDs
The current implementation relies on (a) a dictionary that stores the probabilities of the random variables, and (b) a binary decision diagram (BDD) for each record. The BDD is implemented as a user-defined type. It is unclear whether or not it would be beneficial to cache the probabilities in the BDD type. This subproject is an experimental investigation into the circumstances under which this is beneficial.
- Evaluation of the conditioning task
One aspect of the probabilistic data integration approach is to improve data quality based on gathered evidence; this is called conditioning. This subproject is about experimentally investigating the performance and scalability of conditioning.
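As a rough illustration of the setting of the deduplication subproject, the following Python sketch derives possible deduplication "worlds" from pairwise similarity scores. The thresholds and the reuse of a similarity score as a merge probability are assumptions made here for illustration; they are not the developed algorithm:

```python
from itertools import product

# Hypothetical sketch of indeterministic deduplication: pairs with very
# high similarity are merged, very low ones are kept apart, and the
# borderline pairs become explicit uncertain merge decisions.
similarities = {("r1", "r2"): 0.95, ("r1", "r3"): 0.55, ("r2", "r4"): 0.10}
T_HIGH, T_LOW = 0.90, 0.20   # illustrative thresholds

certain_merges, uncertain_pairs = [], []
for pair, s in similarities.items():
    if s >= T_HIGH:
        certain_merges.append(pair)
    elif s > T_LOW:
        uncertain_pairs.append((pair, s))  # s reused as merge probability

# Each combination of decisions on the uncertain pairs is a possible
# world; its probability is the product of the individual decisions.
worlds = []
for choices in product([True, False], repeat=len(uncertain_pairs)):
    p, merged = 1.0, list(certain_merges)
    for (pair, s), merge in zip(uncertain_pairs, choices):
        p *= s if merge else (1.0 - s)
        if merge:
            merged.append(pair)
    worlds.append((merged, p))
```

Each possible world corresponds to one consistent deduplication decision; a probabilistic database represents these worlds compactly (e.g., via BDDs over random variables) rather than enumerating them as done here.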
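For the MICE subproject, the intended change can be illustrated with a toy sketch: instead of pooling the m imputed datasets into one, every imputed alternative for a missing cell becomes a weighted entry of a probabilistic table. The "imputation model" below is a deliberately naive stand-in (sampling from observed values), not the real chained-equations procedure:

```python
import random

# Hypothetical sketch of the subproject's output format. Real MICE
# would impute with regression models; here we just sample observed
# values to obtain m imputed alternatives per missing cell.
random.seed(0)
ages = [23, None, 31, None, 45]            # a column with missing values
observed = [a for a in ages if a is not None]
m = 5                                       # number of imputations

alternatives = {}                           # cell index -> imputed values
for _ in range(m):
    for i, a in enumerate(ages):
        if a is None:
            alternatives.setdefault(i, []).append(random.choice(observed))

# Probabilistic table: one alternative per distinct imputed value,
# weighted by how often it was drawn across the m imputations.
prob_table = {
    i: [(v, vals.count(v) / m) for v in sorted(set(vals))]
    for i, vals in alternatives.items()
}
```

The alternatives for each cell sum to probability 1, so the result is a well-formed probabilistic column rather than a single pooled value.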
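For the conditioning subproject, here is a minimal sketch of the operation itself: evidence rules out possible worlds, and the probabilities of the surviving worlds are renormalized. Names and data are illustrative assumptions:

```python
# Hypothetical sketch of conditioning on evidence (e.g. user feedback).
worlds = [
    ({"r1=r2": True},  0.55),   # world where r1 and r2 are duplicates
    ({"r1=r2": False}, 0.45),   # world where they are distinct
]

def condition(worlds, var, observed):
    """Keep only the worlds consistent with the observation and
    renormalize the remaining probability mass."""
    kept = [(w, p) for w, p in worlds if w[var] == observed]
    total = sum(p for _, p in kept)
    return [(w, p / total) for w, p in kept]

# A user confirms that r1 and r2 are indeed the same entity:
conditioned = condition(worlds, "r1=r2", True)
```

In DuBio the worlds are encoded symbolically in BDDs, so conditioning manipulates those structures instead of an explicit world list; evaluating how this scales is precisely the subproject's question.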
A good place to start learning about what a probabilistic database is and what it can be useful for is the book chapter Probabilistic Data Integration, published in the Springer "Encyclopedia of Big Data Technologies". Besides being an introduction, it also contains references to scientific publications on DuBio and on probabilistic databases in general. Secondly, the documentation of DuBio is also a good place to learn about the system: https://github.com/utwente-db/DuBio/wiki.