Pay-as-you-go data integration for bio-informatics (PayDIBI)

Scientific research in bio-informatics is often data-driven and supported by biological databases. The Nucleic Acids Research journal's collection of publicly available, free on-line biological databases alone contains 1330 databases and grows by about 5-8% each year. Information contained in biological databases includes gene function, structure, localization, clinical effects, and similarities, collected from scientific experiments and computational analyses. In a growing number of research projects, researchers want to ask questions that require combining information from more than one database. Combining information from several biological databases can be a painstaking process for many reasons, including errors, imprecision, (evolving) opinions and consensus, conflicts and ambiguities in overlapping data, and other data quality and trust problems. As a consequence, much effort is necessarily devoted to low-level data coupling and integration tasks, which significantly slows down the process of scientific discovery. By one estimate, 30% of all tasks in scientific workflows are data transformation tasks.

The objective of this project is to develop data coupling and integration technology that supports scientists in constructing targeted data sets from multiple biological databases and other data sources according to their own views, opinions, and trust. The approach will be based on recent ideas for “pay-as-you-go” and “good-is-good-enough” data integration. It should allow scientists to quickly construct a targeted data set that properly reflects the continuous flow of updates to the sources, and to adapt and clean that data set according to their own views and opinions, while making full use of the trusted work of other scientists in a non-interfering manner. The research focuses on the following main scientific challenges: how to adapt and extend the ideas for pay-as-you-go and good-is-good-enough data integration to the bio-informatics domain; how to define a method and tool support for the quick construction and enrichment of a targeted data set from a set of external data sources; and how to define a method and tool support for effective continuous improvement of the data quality. We have established a relationship with the bio-informatics group of Wageningen University (see http://www.bif.wur.nl/UK/). This group supplies a use case and base data for validating the technology to be developed in an active bio-informatics research project.

A more detailed description of the project can be found at http://www.utwente.nl/ewi/db/research/currentprojects/paydibi.pdf

Related publications can be found at http://eprints.eemcs.utwente.nl/view/project/PayDIBI.html