[M] Creating fake data that maintains dependencies

Master Assignment

Type: Master CS

Location: CTcue

Period: TBD

Student: (Unassigned)

If you are interested, please contact:

Introduction of company:

CTcue is a small company that builds a search engine for patient populations and provides an easy way to collect data about those populations. Example use cases for hospitals are clinical trials and quality indicators (statistics about hospital performance for insurance companies, the government, etc.). A major challenge is that valuable information is archived as text (e.g. reports or notes), which makes it unavailable for analysis without natural language processing. We have created a pipeline that analyses Dutch medical text so that the full scope of patient health records, both structured and unstructured data, can be used. The steps in the pipeline include (but are not limited to) measurement extraction, concept extraction, context classification, and temporal classification. Our solution is currently implemented in 25 hospitals. The company consists of a team of 14 people and is located at Science Park in Amsterdam.

Project description:

One of the challenges of working with medical data is that the data is strictly confidential. To test solutions outside a hospital, you therefore need data that is 100% fake. However, the data also contains many complex relations that fake data must preserve in order to be realistic. Examples of these relations are which medication fits which kind of disease, which range of values is normal for a measurement, and which specialism is most likely for what kind of content. An example of an existing data generator is Synthea. This tool does not fit our data model, creates English data, and uses handcrafted definitions for data, but parts of it, such as generating data based on a model, might be reusable or at least serve as inspiration. We already have a fake text generator for Dutch medical text, so the focus will be on generating structured data that describes realistic but fake patients.
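To make the idea of "fake data that maintains dependencies" concrete, here is a minimal Python sketch. It is only an illustration, not CTcue's or Synthea's method: the tables, field names, and the function `sample_patient` are hypothetical, and the probabilities and value ranges are made up rather than taken from clinical references. Each patient's medication, specialism, and measurement are drawn conditionally on the diagnosis, so the generated fields stay consistent with each other.

```python
import random

# Hypothetical, handcrafted dependency tables (illustrative values only).
MEDICATION_GIVEN_DIAGNOSIS = {
    "type 2 diabetes": {"metformin": 0.7, "insulin": 0.3},
    "hypertension": {"amlodipine": 0.6, "lisinopril": 0.4},
}
SPECIALISM_GIVEN_DIAGNOSIS = {
    "type 2 diabetes": {"internal medicine": 0.9, "cardiology": 0.1},
    "hypertension": {"cardiology": 0.6, "internal medicine": 0.4},
}
# Assumed (low, high) range of an HbA1c measurement per diagnosis, in mmol/mol.
HBA1C_RANGE = {
    "type 2 diabetes": (48.0, 90.0),
    "hypertension": (30.0, 42.0),
}


def weighted_choice(rng, weights):
    """Draw one key from a {value: probability} mapping."""
    values = list(weights)
    return rng.choices(values, weights=[weights[v] for v in values], k=1)[0]


def sample_patient(rng):
    """Generate one fake patient whose fields all depend on the diagnosis."""
    diagnosis = rng.choice(sorted(MEDICATION_GIVEN_DIAGNOSIS))
    low, high = HBA1C_RANGE[diagnosis]
    return {
        "diagnosis": diagnosis,
        "medication": weighted_choice(rng, MEDICATION_GIVEN_DIAGNOSIS[diagnosis]),
        "specialism": weighted_choice(rng, SPECIALISM_GIVEN_DIAGNOSIS[diagnosis]),
        "hba1c_mmol_mol": round(rng.uniform(low, high), 1),
    }


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are reproducible
    for _ in range(3):
        print(sample_patient(rng))
```

Sampling every field independently would instead produce implausible combinations, such as a diabetes medication for a patient without a diabetes diagnosis. The handcrafted tables here mirror the limitation noted for Synthea; the assignment is about obtaining such a model from data instead.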

Expected product:

A model of the data that can be used to create realistic test data.

Available resources:

The solution can be developed and tested on data generated by Synthea, and can then be run by your CTcue supervisor on real hospital data to create a representation of Dutch hospital data. This does mean that the method needs to be adaptable to different data models.
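One possible way to read "adaptable to different data models" is that the method should not hard-code field names. The Python sketch below is an assumption-laden illustration, not a prescribed design: the functions `fit_marginal`, `fit_conditional`, and `sample_records`, the field names, and the toy records are all hypothetical. It learns how often each child value occurs for each parent value from a list of records, then samples new records from those counts. The parent and child fields are passed as parameters, so the same code could be pointed at Synthea output or at another schema.

```python
import random
from collections import Counter, defaultdict


def fit_marginal(records, field):
    """Count how often each value of `field` occurs."""
    return Counter(record[field] for record in records)


def fit_conditional(records, parent, child):
    """Count how often each `child` value occurs for each `parent` value."""
    table = defaultdict(Counter)
    for record in records:
        table[record[parent]][record[child]] += 1
    return table


def sample_records(marginal, conditional, parent, child, n, rng):
    """Sample n fake records: first the parent, then the child given the parent."""
    parents = list(marginal)
    parent_weights = [marginal[p] for p in parents]
    samples = []
    for _ in range(n):
        p = rng.choices(parents, weights=parent_weights, k=1)[0]
        children = list(conditional[p])
        child_weights = [conditional[p][c] for c in children]
        c = rng.choices(children, weights=child_weights, k=1)[0]
        samples.append({parent: p, child: c})
    return samples


# Toy training records standing in for generated or real data (made up).
TOY_RECORDS = [
    {"condition": "hypertension", "drug": "amlodipine"},
    {"condition": "hypertension", "drug": "lisinopril"},
    {"condition": "hypertension", "drug": "amlodipine"},
    {"condition": "type 2 diabetes", "drug": "metformin"},
    {"condition": "type 2 diabetes", "drug": "insulin"},
]

if __name__ == "__main__":
    marginal = fit_marginal(TOY_RECORDS, "condition")
    conditional = fit_conditional(TOY_RECORDS, "condition", "drug")
    rng = random.Random(0)  # fixed seed so runs are reproducible
    for row in sample_records(marginal, conditional, "condition", "drug", 3, rng):
        print(row)
```

Because sampling only draws child values that were observed together with the chosen parent, every generated pair respects the dependencies seen in the training data. A real solution would need to handle more than two fields, numeric measurements, and time, which this sketch deliberately leaves out.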