Guidelines Personal Information | Research Data & Privacy

When working with personal data, various considerations with regard to data protection, privacy regulations and ethical and scientifically responsible behavior should play a role in the data management phase. This page provides an overview of the conditions researchers should be aware of for various tasks, like levels of sensitivity and general regulations for gathering, processing, and storing data.

UT privacy website: codes of conduct to be familiar with and comply with

Check the UT website on Personal Data for relevant Codes of Conduct, such as the 'use of personal data in research' or the 'GDPR', and the VSNU code of conduct and the Code of Conduct for Health Research for dealing responsibly with (personal) data and human tissue in Dutch health research.

Personally identifiable data

Personally identifiable data is any information that can be used to directly or indirectly identify a person, such as name, photo, email address, social security number, bank details, posts on social networking websites, date and place of birth, mother's maiden name, or biometric records; and any other information that is linked or linkable to an individual, such as a computer IP address, or medical, educational, financial, and employment information. A lot of data can be viewed as personal information, depending on the context. Researchers must handle such personal data appropriately, in compliance with EU legislation.
Within personal data, we distinguish 'personal', 'sensitive', and 'special categories' of data, as well as the 'BSN' and 'data relating to criminal convictions and offences'.

Special categories of (sensitive) personally identifiable data

Sensitive personally identifiable data covers racial or ethnic origin, political opinions, religious or philosophical beliefs, and trade union membership, as well as genetic data, biometric data processed for the purpose of uniquely identifying a natural person, data concerning health, and data concerning a natural person’s sex life or sexual orientation. The GDPR refers to sensitive personal data as “special categories of personal data”, and this data has an extra layer of legal protection. Processing of these data is prohibited in principle, but there are exceptions for research: it is allowed if the subjects have given explicit consent or if it is necessary for scientific research. Even then, you still need a legitimate aim and a legal ground for the processing.

Medical personal data, or health data, is defined as information relating to the health status of a person, encompassing both medical and administrative data. This data is considered sensitive and is subject to strict regulations to protect an individual's privacy. The GDPR and healthcare legislation outline specific rules for its processing and sharing. Anonymized or pseudonymized health data can be used for research purposes, such as studying disease patterns or developing new treatments. The EU also provides guidance on the reuse of health data.

What is the definition of 'processing' under the GDPR?

'Processing' means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organization, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction.

Legal grounds for lawful processing

Processing of personal data is only lawful if at least one of the six legal grounds as mentioned in the GDPR applies. Check the legal grounds on the UT privacy website.

Principles of data processing

If you have a legal ground you can lawfully process personal data. The GDPR has some important principles you need to take into account. Check the principles of data processing on the UT privacy website.

Informed Consent under the GDPR (EU PRIVACY LAW)

One important condition for working with personal data is the permission of the person in question.

You must be able to show that you have received people’s valid permission to process their personal data. They must grant this permission voluntarily; otherwise, you are not permitted to process their information. People are also entitled to withdraw their permission.

This informed consent must satisfy certain requirements:

- Simply obtaining permission is not enough. The information on the basis of which the permission has been given must also be documented. In this way, you can show that you informed the people well and that they gave their permission specifically for this situation.
- You must be able to show a clear link between the permission obtained and the personal data you are processing. Permission must be obtained separately for each different purpose.
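
One way to keep the link between a permission, its purpose, and the information provided is to store each consent as a separate record. The following is a minimal sketch only: the `ConsentRecord` fields, the `may_process` helper, and the sample values are hypothetical illustrations, not a prescribed UT format.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ConsentRecord:
    # All field names are assumptions made for this illustration.
    subject_id: str                       # study-internal identifier, not a name
    purpose: str                          # the single purpose this permission covers
    information_sheet: str                # reference to the information the subject received
    given_on: date
    withdrawn_on: Optional[date] = None   # set when the subject withdraws permission


def may_process(records, subject_id, purpose, today):
    """Return True only if a non-withdrawn consent exists for this exact purpose."""
    for record in records:
        if record.subject_id != subject_id or record.purpose != purpose:
            continue
        if record.given_on > today:
            continue
        if record.withdrawn_on is not None and record.withdrawn_on <= today:
            continue
        return True
    return False


records = [
    ConsentRecord("S001", "survey analysis", "info_sheet_v2.pdf", date(2024, 3, 1)),
    ConsentRecord("S002", "survey analysis", "info_sheet_v2.pdf", date(2024, 3, 1),
                  withdrawn_on=date(2024, 6, 1)),
]
```

Because each record names exactly one purpose, consent given for "survey analysis" does not cover a different purpose such as follow-up interviews; a separate record would be needed for that.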

For more information on informed consent procedures see the BMS Ethical Committee website.

Checklist: informed consent for researchers

Explain as clearly as possible:
- the reason why you are collecting the personal data
- that you will not use the personal data for any other purpose

When test subjects are under 16, you should also obtain (additional) permission from the subjects’ parents/guardians.

Levels of sensitivity

Dutch law describes personal data as “any information concerning an identified or unidentified natural person, which can lead to the identification of a natural person without unreasonable effort”. It should be noted that a combination of different variables in a dataset might also lead to subjects being identifiable. For example, neither age, place of birth nor newspaper subscription on their own are likely to make subjects identifiable. But when these pieces of information are combined, it becomes much more likely that subjects can be identified. As a general rule, any data about a natural person should therefore be regarded as personal data.

DIRECTLY IDENTIFIABLE

Directly identifiable data is (a combination of) information that can lead directly to a subject, like address, phone number, e-mail address, citizen’s service number, IP address, bank account number, etc.

INDIRECTLY IDENTIFIABLE

Indirectly identifiable data is information that does not allow for a direct identification of subjects, but nevertheless permits researchers or third parties to identify subjects without unreasonable effort. For example, a person of a certain age, who has an uncommon profession and lives in a small village can still be easily identified even when all directly identifiable data is deleted.
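
A simple way to spot this kind of risk is to count how many records share the same combination of indirectly identifying variables. The sketch below is an illustration under assumptions: the column names, the sample rows, and the threshold `k=2` are hypothetical, and a low count is only a warning sign, not a complete re-identification analysis.

```python
from collections import Counter


def rows_at_risk(rows, quasi_identifiers, k=2):
    """Return rows whose combination of quasi-identifier values occurs fewer than k times."""
    def key(row):
        return tuple(row[name] for name in quasi_identifiers)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] < k]


# Hypothetical dataset with all directly identifiable data already removed.
rows = [
    {"age": 34, "profession": "teacher", "village": "A"},
    {"age": 34, "profession": "teacher", "village": "A"},
    {"age": 61, "profession": "falconer", "village": "B"},
    {"age": 61, "profession": "teacher", "village": "B"},
]

# Age alone leaves no unique rows, but the combination of all three variables does.
unique_on_age = rows_at_risk(rows, ["age"])
unique_on_combination = rows_at_risk(rows, ["age", "profession", "village"])
```

Here age on its own singles out nobody, while the combination of age, profession, and village isolates the last two subjects.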

SENSITIVE INFORMATION

Sensitive information, that is a person’s political or religious affiliation, sexual orientation, medical and criminal records and union membership, should be treated with additional care.

Anonymization/Pseudonymization explained

Anonymization

Personal data that has been rendered anonymous in such a way that the individual is not or no longer identifiable is no longer considered personal data. For data to be truly anonymized, the anonymization must be irreversible. This also implies that even the researchers no longer have access to personally identifiable information.
Complete anonymization means your dataset should not contain any privacy-sensitive information that can be traced back to a unique living person. You can do this by deleting all personal information and replacing it with an anonymized identifier (Subject 1, 2, 3, etc.). Variables such as names, addresses, contact information, citizen’s service, social security and tax numbers (BSN/Sofi), and medical record numbers must be removed. Dates directly related to the individual (birth, death, admission, discharge) must be re-encoded to years, and postal codes must be re-encoded to four digits. Finally, it is recommended to re-encode specific occupations to classifications of occupations, using the Standard Classification of Occupations (SBC, in Dutch) by Statistics Netherlands or the International Standard Classification of Occupations (ISCO) by the International Labour Organization.
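
The re-encoding steps described above can be sketched as follows. This is a minimal illustration, not a complete anonymization procedure: the field names, the ISO date format, and the sample record are assumptions, and occupation re-encoding is left out.

```python
# Hypothetical field names; adapt them to the variables in your own dataset.
DIRECT_IDENTIFIERS = {"name", "address", "email", "bsn", "medical_record_number"}
DATE_FIELDS = {"birth_date", "admission_date"}


def anonymize(records):
    """Drop direct identifiers, reduce dates to years and postal codes to four digits."""
    anonymized = []
    for number, record in enumerate(records, start=1):
        clean = {"subject": f"Subject {number}"}  # anonymized identifier
        for field, value in record.items():
            if field in DIRECT_IDENTIFIERS:
                continue  # removed entirely
            if field in DATE_FIELDS:
                # Assumes ISO dates such as '1985-04-12'; keep only the year.
                clean[field.replace("_date", "_year")] = int(value[:4])
            elif field == "postal_code":
                # Assumes Dutch postal codes such as '7522 NB'; keep the four digits.
                clean[field] = value.replace(" ", "")[:4]
            else:
                clean[field] = value
        anonymized.append(clean)
    return anonymized


records = [
    {"name": "J. Jansen", "email": "j.jansen@example.org",
     "birth_date": "1985-04-12", "postal_code": "7522 NB", "score": 7},
]
result = anonymize(records)
```

Note that the subject numbers here follow the original row order; if that order is itself revealing (for example, order of enrollment), shuffling the records first may be advisable.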

However, anonymization cannot be done in all cases, for instance when you collect 24-hour location data. Be alert when combining different parameters in your dataset: the combination can sometimes point to a unique living person as well.

After anonymization, the data cannot be traced back to a person. Datasets that are completely anonymized are not covered by the GDPR (i.e., GDPR no longer applies) and can therefore be shared and made openly available without any restrictions. However, keep in mind the earlier example given under indirectly identifiable data and additionally delete any indirectly identifiable data for specific persons if you think this is necessary!

Before you share your data with anyone else, make sure that the dataset is completely anonymized.
We advise anonymizing your data as quickly as possible. If this is not possible, you will need to pseudonymize.

Check our Practical Guide on Research Data Anonymisation and Pseudonymisation

Pseudonymization

Pseudonymization is “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information, as long as such additional information is kept separately and subject to technical and organizational measures to ensure non-attribution to an identified or identifiable individual”. Personal data that has been de-identified, encrypted or pseudonymized but can be used to re-identify a person remains personal data and falls within the scope of the GDPR. Many of the techniques traditionally used to protect privacy in research settings, such as key-coding, fall within the definition of pseudonymization and therefore remain subject to the GDPR.

Coding is justified if you need to be able to link separate datasets about the same individuals. In all other cases, full anonymization is the preferred option.

Please note: The linking files (which establish a link between individuals and their data for the use of researchers) should be encrypted and stored offline. The coding key should not be saved in a folder together with the coded datasets. The environment in which the coding key/linking files are stored must be properly secured. In addition, you should clearly document all individuals who will have access to the coding key.

If there is no need to relate the research data to the personal data anymore, delete both the key and the personal data file.
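
Key-coding as described above can be sketched like this. It is an illustration under assumptions: the `pseudonymize` helper, the `name` field, and the code format are hypothetical, and the sketch does not cover encrypting or securely storing the key.

```python
import secrets


def pseudonymize(records, id_field="name"):
    """Replace the identifying field with a random code; return coded data and the key."""
    key = {}          # identifier -> code (the linking file; store it separately)
    used_codes = set()
    coded = []
    for record in records:
        identifier = record[id_field]
        if identifier not in key:
            code = "P" + secrets.token_hex(4)
            while code in used_codes:  # extremely unlikely, but avoid duplicate codes
                code = "P" + secrets.token_hex(4)
            used_codes.add(code)
            key[identifier] = code
        coded_record = {k: v for k, v in record.items() if k != id_field}
        coded_record["code"] = key[identifier]
        coded.append(coded_record)
    return coded, key


records = [
    {"name": "A. de Vries", "score": 1},
    {"name": "B. Bakker", "score": 2},
    {"name": "A. de Vries", "score": 3},
]
coded, key = pseudonymize(records)
```

The random codes carry no information about the person, so the coded dataset can only be linked back through the key. Deleting the key (together with the personal data file) removes that link.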

Check our Practical Guide on Research Data Anonymisation and Pseudonymisation
Created by Data Stewards from BMS & ITC Faculty

Tools for anonymization/pseudonymization

Check our new practical guidance on Research Data Anonymisation and Pseudonymisation, created by the Data Stewards from BMS and the ITC faculty.

The guide provides practical guidance on two methods of data de-identification: data pseudonymisation and anonymisation. Data de-identification refers to the process of removing or minimising information in a dataset that can be traced back to a research participant. This guide includes hands-on examples of various techniques for pseudonymising and anonymising different types of research data.

Several anonymization tools are available. Here is a list of anonymization tools (information from https://windowsreport.com/data-anonymization-software/):

ARX Data Anonymization Tool
Download for free from official tool website
ARX is open source software for anonymizing sensitive personal data. The tool transforms datasets so that they satisfy syntactic privacy models that mitigate attacks leading to privacy breaches.
ARX removes direct identifiers such as names from datasets and adds further constraints on indirect identifiers, such as email addresses or phone numbers. The tool also provides built-in data import facilities for relational databases (MS SQL, DB2, SQLite, MySQL), MS Excel and CSV files.
Anonymizer
Download from Eyedea with a free trial. A license is needed to use Anonymizer.
Data can also come in the form of images. If you need a tool to blur images, we recommend Anonymizer. The software detects faces, car number plates and other image information in various scales and orientations and applies blurring filters to make the information unreadable.
The Anonymizer SDK is a standalone C/C++ library for Windows and Linux. You can use it to anonymize street-view photos, webcam photos, or any other photos for which privacy is a major concern.
Imperva Camouflage Data Masking Software
Download a free trial of the Imperva software. A license is needed to use this tool.
Imperva Camouflage’s data masking tool removes the sensitive data that hinders testing, outsourcing, and analytics. The tool de-identifies sensitive data, and retains the realism and functionality of the original data set.
The data categories this tool can work with include names, addresses, credit cards, SSN/SIN, phone, and more.
NLM Scrubber
Download NLM Scrubber
NLM-Scrubber is a new, free clinical text de-identification tool. The software is currently in its initial beta stage, but you can already try it out. Its developers will release a non-beta version of the tool. NLM-Scrubber is mainly intended for de-identifying medical documents.
Aircloak
Try Aircloak
Aircloak is a new data anonymisation solution that you can use to keep personal identifiable information private.
The software does not affect data quality or quantity in any way. You can still use the respective data to analyze information while complying with the latest user privacy regulations.

Risk assessment

These levels of sensitivity can be directly translated into a risk assessment – how large are the consequences for respondents of their data being leaked, stolen, or otherwise becoming public?

SURFnet and IGS datalab use the guidelines set out in the table below for risk assessment. To which exact class a particular data set belongs is a somewhat subjective decision, which depends on the particular combination of individual pieces of information, and the context in which they occur.

Risk class	Type of information	Countermeasures
0 (public)	Public information; e.g. professional e-mail address.	No specific measures besides those mandated by the WBP.
1 (basic)	Limited amount of personal data regarding the connection between respondent(s) and organization(s); e.g. student enrollment.	Standard information protection measures are adequate; e.g. password secured access.
2 (increased risk)	Special personal data; e.g. economic status, dyslexia statement.	Increased information protection measures; e.g. encryption.
3 (high risk)	Special personal data; e.g. psychological evaluation, medical records.	Highest possible level of security measures; e.g. encryption + off-line storage.
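
For use in a data management script, the table above could be represented as a simple lookup. This is only a sketch: the labels and countermeasures are copied from the table, while the `countermeasures_for` helper and the rule that a dataset follows the highest class among its items are assumptions, since the actual classification depends on context.

```python
# Risk classes as listed in the table above.
RISK_CLASSES = {
    0: ("public", "No specific measures besides those mandated by the WBP."),
    1: ("basic", "Standard information protection measures; e.g. password secured access."),
    2: ("increased risk", "Increased information protection measures; e.g. encryption."),
    3: ("high risk", "Highest possible level of security measures; e.g. encryption + off-line storage."),
}


def countermeasures_for(item_classes):
    """Return (class, label, countermeasures) for a dataset.

    Assumption (not stated in the table): the dataset is treated according to
    the highest risk class among the pieces of information it contains.
    """
    if not item_classes:
        raise ValueError("at least one risk class is required")
    highest = max(item_classes)
    label, measures = RISK_CLASSES[highest]
    return highest, label, measures


# Hypothetical dataset containing a professional e-mail address (0),
# student enrollment (1) and a dyslexia statement (2).
dataset_class = countermeasures_for([0, 1, 2])
```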

Working with data / Rights of data subjects (respondents)

RESPONDENTS’ RIGHTS

Respondents have the right to ask the responsible party (in practice such a request would likely be made to you, the researcher) whether or not personal information regarding them is used. Such a request needs to be sufficiently precise, i.e. the respondent needs to make clear in which project they think their data is being used. The data in question also needs to be directly identifiable.

If personal data regarding the respondent is indeed on file, the respondent has the right to view said data, and request the data be changed, supplemented or deleted if the data is factually inaccurate, irrelevant to or inadequate for the research, or otherwise violates any legal requirements.

CONTACT INFORMATION

Contact information, that is, directly identifiable information, should be separated from other information. This separation should be as thorough as possible. Ideally, both datasets should be stored in different physical locations, and have different access protocols. Both datasets can be linked with an otherwise meaningless administrative identification number. Contact information should be destroyed when it is no longer necessary.

The potential consequences of high-risk data becoming public can be (partially) mitigated if such data is either anonymized or at least separated from contact information.