Master thesis project: Privacy verification for sensitive data science

Type: Master CS 

Start date: as soon as possible

Student: Unassigned

If you are interested, please contact:

Dr. Afshin Amighi
HBO-postdoc Kenniscentrum Creating 010 - Hogeschool Rotterdam

Objective:
There are generally two ways of making sensitive data accessible. The first is open data after anonymization: datasets are stripped of personal identifiers and generalized before they are shared. The second is the algorithms-to-data approach: the data stays securely with its owners, and an external party instead provides a script to the data owners. Only the output of the script is shared, not the sensitive data itself.

The remaining gap in the algorithms-to-data approach is that not every script or algorithm is suitable for execution on sensitive data: the algorithm might not preserve anonymity, or worse, directly leak the data itself. To avoid this, the algorithm requires a human audit, which is subtle and time-consuming work. 

We propose to develop a (semi-)automated framework to enforce privacy in this new model of data sharing. The core idea is to define privacy rules and automatically verify whether a submitted analysis program complies with these rules before it is executed on the dataset. 

Some examples of privacy rules are: 

  • “The submitted program must not attempt to leak information.”
  • “The submitted program must not attempt to reconstruct individual-level data.” 
  • “Averaging of rows requires at least 20 rows as input.” 
  • “The script is not allowed to call certain system functionality marked as dangerous.” 
  • “Executed queries must be proven to be safe in accordance with a privacy-preserving logic.” 
  • “Data must not flow directly to certain sinks marked as dangerous.” 
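As an illustration of how one of these rules could be enforced at run time, the following is a minimal sketch of the "at least 20 rows" rule for averaging. The names `PrivacyViolation` and `guarded_mean` are hypothetical and chosen for this example only; they are not part of any existing platform.

```python
MIN_ROWS = 20  # threshold taken from the example rule above


class PrivacyViolation(Exception):
    """Raised when an operation would break a privacy rule (hypothetical)."""


def guarded_mean(values, min_rows=MIN_ROWS):
    """Return the mean of `values`, refusing inputs with fewer than `min_rows` rows."""
    rows = list(values)
    if len(rows) < min_rows:
        raise PrivacyViolation(
            f"averaging requires at least {min_rows} rows, got {len(rows)}"
        )
    return sum(rows) / len(rows)
```

A submitted script would then only have access to `guarded_mean` instead of an unrestricted averaging function. A guard like this covers a single call; it does not by itself stop a script from combining several permitted aggregates to isolate one individual, which is the kind of gap the project aims to address.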

Using techniques from program verification (static or run-time) or policy languages, we aim to build a system in which compliance can be checked automatically or with minimal human intervention.

Some examples of concrete techniques that can be used for this project: 

  • Static analysis, e.g. using allow- and disallow-lists for known safe and dangerous functions, taint/dataflow analysis, and others. 
  • Dynamic analysis, e.g. run-time trace capture, to determine actual script behaviour.
  • Query type systems & formal policy languages, e.g. PICACHV, which can formally show that queries respect data policies.
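To make the first technique concrete, here is a minimal sketch of a disallow-list check over a submitted Python script, using the standard `ast` module. The contents of `DISALLOWED_CALLS` and the helper names are assumptions made for this example; a real policy would be considerably more complete.

```python
import ast

# Hypothetical deny-list for illustration only.
DISALLOWED_CALLS = {"eval", "exec", "open", "os.system", "subprocess.run"}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_disallowed_calls(source):
    """Return sorted (line, name) pairs for calls to deny-listed functions."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in DISALLOWED_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)
```

For example, a script that contains `os.system('curl ...')` on its third line is reported as `(3, "os.system")`. This purely syntactic check is easy to evade, for instance through `import os as o` or `getattr`, which is why it would need to be combined with taint/dataflow analysis or run-time checks.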

Where possible, this project will try to apply its results to the vantage6 platform, developed at the Netherlands Comprehensive Cancer Organization (IKNL) and the Netherlands eScience Center. The platform follows a client-server architecture in which a researcher submits an analytical task or algorithm to a central server, which distributes it to the participating data nodes. Only aggregated results are returned to the server, ensuring that raw data never leaves its original location.
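The flow described above can be sketched in a few lines of plain Python. This is not the vantage6 API; `node_partial` and `server_combine` are hypothetical names that only illustrate the split between what runs at a data node and what runs centrally.

```python
def node_partial(local_values):
    """Runs at a data node: returns only an aggregate, never the rows themselves."""
    return {"sum": sum(local_values), "count": len(local_values)}


def server_combine(partials):
    """Runs centrally: combines the node aggregates into a global mean."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return total / count


# Two simulated nodes, each holding its own private rows.
node_data = [[1, 2, 3], [4, 5]]
global_mean = server_combine([node_partial(rows) for rows in node_data])
```

The privacy question this project targets is whether the function that runs at the node really returns only an aggregate; in this sketch that holds by construction, but for an arbitrary submitted script it has to be verified.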

Within the project, you can choose where to focus your efforts. You can work at the conceptual level, taking the vantage6 architecture as a blueprint and identifying where a domain-specific logic can have the most impact. You can also work at the practical level, studying patterns and idioms in the vantage6 ecosystem and building an approach based on them.