DuBio - Probabilistic Database Engine

Project duration:

July 2021 - June 2025


Project summary

DuBio is a research prototype developed at the University of Twente that extends PostgreSQL with native support for probabilistic, or uncertain, data. The system is motivated by the challenge of data quality in data integration scenarios: rather than resolving inconsistencies and ambiguities immediately, DuBio explicitly represents them as probabilistic uncertainty in the database. This approach — known as Probabilistic Data Integration (PDI) — allows data quality problems encountered during integration to be captured as a first-class result of the process, enabling continuous improvement over time as additional evidence, such as user feedback, becomes available. Uncertainty is modeled through random variables with mutually exclusive alternatives and associated probabilities, and records are annotated with logical sentences that define their existence across a set of possible worlds.
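The possible-worlds model described above can be sketched in a few lines of Python. This is an illustrative toy, not DuBio's actual API or storage model: the variable names, alternatives, and probabilities below are hypothetical, and a real engine would use compact symbolic representations rather than enumerating worlds.

```python
import math
from itertools import product

# Hypothetical random variables: each maps a mutually exclusive
# alternative to its probability (alternatives sum to 1).
variables = {
    "x": {1: 0.7, 2: 0.3},  # e.g. two candidate matches for one entity
    "y": {1: 0.6, 2: 0.4},
}

def worlds(vs):
    """Enumerate every possible world (one alternative per variable)
    together with its probability, assuming independent variables."""
    names = list(vs)
    for combo in product(*(vs[n].items() for n in names)):
        assignment = dict(zip(names, (alt for alt, _ in combo)))
        yield assignment, math.prod(pr for _, pr in combo)

def sentence_prob(sentence, vs):
    """The probability that a record exists is the total weight of the
    worlds in which its guarding logical sentence holds."""
    return sum(p for w, p in worlds(vs) if sentence(w))

# A record annotated with the sentence (x=1 AND y=2) exists only in
# worlds where both assignments hold: 0.7 * 0.4 = 0.28.
p = sentence_prob(lambda w: w["x"] == 1 and w["y"] == 2, variables)
print(round(p, 2))
```

Enumerating worlds is exponential in the number of variables; it serves here only to make the semantics concrete.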

DuBio has broad potential as infrastructure for data-intensive applications where data quality cannot be guaranteed upfront. In domains such as healthcare, logistics, and smart cities, data is routinely incomplete, conflicting, or sourced from heterogeneous systems. By preserving rather than discarding uncertainty, DuBio enables a pay-as-you-go approach to data quality: integrated data can be queried and used immediately, yielding probabilistic answers, while quality is refined incrementally as evidence accumulates. This makes the system particularly well-suited for iterative data science workflows, decision-support systems, and scenarios involving information extraction from unstructured sources.
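The incremental refinement step can also be made concrete. The sketch below, again a hypothetical toy rather than DuBio's API, shows how new evidence (say, user feedback confirming a record) conditions the distribution: worlds inconsistent with the evidence are discarded and the remaining probabilities renormalized.

```python
import math
from itertools import product

# Hypothetical random variables, as in the model description above.
variables = {"x": {1: 0.7, 2: 0.3}, "y": {1: 0.6, 2: 0.4}}

def worlds(vs):
    """Enumerate all possible worlds with their probabilities,
    assuming independent variables."""
    names = list(vs)
    for combo in product(*(vs[n].items() for n in names)):
        assignment = dict(zip(names, (alt for alt, _ in combo)))
        yield assignment, math.prod(pr for _, pr in combo)

def condition(vs, evidence):
    """Keep only the worlds consistent with the evidence sentence and
    renormalize their probabilities (Bayesian conditioning)."""
    kept = [(w, p) for w, p in worlds(vs) if evidence(w)]
    total = sum(p for _, p in kept)
    return [(w, p / total) for w, p in kept]

# Feedback confirms a record guarded by the sentence x=1, so only
# worlds with x=1 survive.
posterior = condition(variables, lambda w: w["x"] == 1)

# The probability of y=2 is unaffected (y is independent of x): 0.4.
p_y2 = sum(p for w, p in posterior if w["y"] == 2)
print(round(p_y2, 2))
```

Queries posed before and after conditioning use the same machinery; only the probabilities attached to the answers change, which is what makes the pay-as-you-go workflow possible.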

DuBio has been released as open source: [GitHub repository] | [Documentation]

Selected publications

  • Van Keulen, M. (2018). Probabilistic Data Integration. In S. Sakr & A. Zomaya (Eds.), Encyclopedia of Big Data Technologies (pp. 1–9). Springer. https://doi.org/10.1007/978-3-319-63962-8_18-1
  • Van Keulen, M., Kaminski, B., Matheja, C., & Katoen, J.-P. (2018). Rule-based conditioning of probabilistic data. In Proceedings of SUM 2018 (LNCS, vol. 11142, pp. 290–305). Springer. https://doi.org/10.1007/978-3-030-00461-3_20
  • Wanders, B., Van Keulen, M., & Flokstra, J. (2016). JudgeD: a probabilistic Datalog with dependencies. In Proceedings of DeLBP 2016. AAAI.
  • Wanders, B., & Van Keulen, M. (2015). Revisiting the formal foundation of probabilistic databases. In Proceedings of IFSA-EUSFLAT 2015 (p. 47). Atlantis Press. https://doi.org/10.2991/ifsa-eusflat-15.2015.43
  • Van Keulen, M. (2012). Managing uncertainty: the road towards better data interoperability. IT – Information Technology, 54(3), 138–146. https://doi.org/10.1524/itit.2012.0674
  • Van Keulen, M., & de Keijzer, A. (2009). Qualitative effects of knowledge rules and user feedback in probabilistic data integration. VLDB Journal, 18(5), 1191–1217.
  • Panse, F., Van Keulen, M., & Ritter, N. (2013). Indeterministic handling of uncertain decisions in deduplication. Journal of Data and Information Quality, 4(2), Article 9. https://doi.org/10.1145/2435221.2435225

Principal Investigator