[D] Probabilistic data benchmark

BACHELOR Assignment

Type: Bachelor CS 

If you are interested, please contact:

Description:

Advances in probabilistic databases could make data integration more automatic and more robust [1]. Probabilistic data is data that is uncertain: for example, a value could be "25 with a probability of 70% or 26 with a probability of 30%". The idea is that if two sources disagree about a value, one can integrate the sources anyway and store the disputed value as an uncertain value. Many other data integration problems can be handled in an analogous manner [1].
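As an illustration only, the uncertain value from the example above could be represented as follows. This is a minimal sketch in Python, not the data model of any particular probabilistic database; the names UncertainValue, integrate and trust_a are hypothetical, and the merge rule (weighting the two alternatives by the trust placed in the first source) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainValue:
    """A value given as mutually exclusive (value, probability) alternatives."""
    alternatives: tuple

    def __post_init__(self):
        # The alternatives are assumed to be exhaustive, so they must sum to 1.
        total = sum(p for _, p in self.alternatives)
        if abs(total - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")

    def probability_of(self, value):
        """Probability that the true value equals `value`."""
        return sum(p for v, p in self.alternatives if v == value)


def integrate(value_a, value_b, trust_a=0.5):
    """Hypothetical merge rule for one attribute reported by two sources.

    Agreeing values stay certain; disagreeing values are both kept,
    weighted by the (assumed) trust in source A.
    """
    if value_a == value_b:
        return UncertainValue(((value_a, 1.0),))
    return UncertainValue(((value_a, trust_a), (value_b, 1.0 - trust_a)))


# Source A says 25, source B says 26, and source A is trusted at 70%.
age = integrate(25, 26, trust_a=0.7)
print(age.probability_of(25))  # 0.7
```

The point of the sketch is that the disagreement does not block integration: both alternatives are stored, and a query can later decide how to treat them.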

Assignment:

The probabilistic data integration approach requires scalable probabilistic database technology. There are research prototypes such as MayBMS and Trio, as well as probabilistic logics such as ProbLog and JudgeD, that can also store and query probabilistic data. The objective of this research project is to develop a benchmark for comparing the scalability of probabilistic databases on data integration tasks. The benchmark should include a data generator capable of generating data integration results of varying sizes. The idea is to generate data modelled after the real-world case of integrating biological databases as in [2]. The benchmark should also include several relevant queries whose execution time can be measured.
To validate the benchmark, it can be used to compare the scalability of the systems mentioned above.
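To give a rough idea of the two components, the following Python sketch pairs a small data generator with a timing harness. It is only an illustration under stated assumptions: the function names, the conflict_rate parameter and the example query are hypothetical, the generated records are not modelled after the data in [2], and the query runs in plain Python as a stand-in for a query sent to an actual system such as MayBMS or Trio.

```python
import random
import time


def generate_integration_result(n_records, conflict_rate=0.2, seed=0):
    """Generate n_records integrated records as (id, alternatives) pairs.

    A fraction conflict_rate of the records is disputed between two
    (simulated) sources and therefore gets two weighted alternatives.
    """
    rng = random.Random(seed)  # fixed seed keeps benchmark runs repeatable
    records = []
    for record_id in range(n_records):
        value = rng.randint(0, 100)
        if rng.random() < conflict_rate:
            # The sources disagree: keep both values as alternatives.
            p = round(rng.uniform(0.1, 0.9), 2)
            alternatives = [(value, p), (value + 1, 1.0 - p)]
        else:
            alternatives = [(value, 1.0)]
        records.append((record_id, alternatives))
    return records


def expected_sum(records):
    """Example query: expected value of the sum over all records."""
    return sum(v * p for _, alts in records for v, p in alts)


def time_query(query, records, repetitions=3):
    """Return the best wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        query(records)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Measure how the execution time grows with the size of the data.
for size in (1_000, 10_000):
    data = generate_integration_result(size)
    print(size, f"{time_query(expected_sum, data):.6f} s")
```

Taking the minimum over a few repetitions reduces the influence of background noise on the measurement; a real benchmark would also need to decide on warm-up runs and on how the data is loaded into each system.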

References:

[1] van Keulen, M. (2018). Probabilistic Data Integration. In S. Sakr, & A. Zomaya (Eds.), Encyclopedia of Big Data Technologies. Cham: Springer. https://doi.org/10.1007/978-3-319-63962-8_18-1
https://research.utwente.nl/en/publications/probabilistic-data-integration-3

[2] Wanders, B., van Keulen, M., & van der Vet, P. E. (2015). Uncertain Groupings: Probabilistic combination of grouping data. In Proceedings of the 26th International Conference on Database and Expert Systems Applications, DEXA 2015 (pp. 236-250). (Lecture Notes in Computer Science; Vol. 9261). Switzerland: Springer. https://doi.org/10.1007/978-3-319-22849-5_17
https://research.utwente.nl/en/publications/uncertain-groupings-probabilistic-combination-of-grouping-data-2