BMS - DataLab

Guidelines Personal Information

When working with personal data, various considerations with regard to data protection, privacy regulations and ethical and scientifically responsible behaviour should play a role in the data management phase. This page provides an overview of the conditions researchers should be aware of for various tasks, like levels of sensitivity and general regulations for gathering, processing and storing data. 

Personal Identifiable data 

Personal identifiable data is any information that can be used to directly or indirectly identify the person, such as name, photo, email address, social security number, bank details, posts on social networking websites, date and place of birth, mother's maiden name, or biometric records; and any other information that is linked or linkable to an individual, such as a computer IP address, medical, educational, financial, and employment information.

Sensitive personal identifiable data

Sensitive personal identifiable data are data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, as well as genetic data, biometric data processed for the purpose of uniquely identifying a natural person, data concerning health, and data concerning a natural person’s sex life or sexual orientation. Depending on the context, a lot of data can be viewed as personal information. Researchers must handle such personal data appropriately, in compliance with EU legislation.

Check the UT website on Personal Data for relevant Codes of Conduct, such as the 'use of personal data in research' or the 'GDPR'.

Personal Data in research Poster

Scientific research often uses personal data of participants. All processing of personal data is subject to the General Data Protection Regulation (GDPR). This poster is designed to help you address the different steps before, during, and after your research to comply with the GDPR. Make sure you read the explanation with the poster.


Levels of sensitivity

Dutch law describes personal data as “any information concerning an identified or unidentified natural person, which can lead to the identification of a natural person without unreasonable effort”. It should be noted that a combination of different variables in a dataset might also lead to subjects being identifiable. For example, neither age, place of birth nor newspaper subscription on their own are likely to make subjects identifiable. But when these pieces of information are combined, it becomes much more likely that subjects can be identified. As a general rule, any data about a natural person should therefore be regarded as personal data.
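The effect of combining variables can be checked before sharing a dataset. The following is a minimal sketch, not a prescribed procedure: it counts how many records share each combination of values and reports the size of the smallest group. A result of 1 means at least one subject is unique on that combination. The function name `smallest_group_size` and the example records are hypothetical and only serve as illustration.

```python
from collections import Counter


def smallest_group_size(records, columns):
    """Return the size of the smallest group of records that share the
    same combination of values for the given columns."""
    groups = Counter(
        tuple(record[column] for column in columns) for record in records
    )
    return min(groups.values())


# Hypothetical example records (not real data).
records = [
    {"age": 34, "birthplace": "Enschede", "newspaper": "A"},
    {"age": 34, "birthplace": "Hengelo", "newspaper": "B"},
    {"age": 51, "birthplace": "Enschede", "newspaper": "B"},
    {"age": 51, "birthplace": "Hengelo", "newspaper": "A"},
]

# Each variable on its own is shared by two subjects...
alone = smallest_group_size(records, ["age"])
# ...but the combination of all three singles out every subject.
combined = smallest_group_size(records, ["age", "birthplace", "newspaper"])
print(alone, combined)
```

In this toy dataset no single variable isolates a subject, yet the three variables together make every record unique, which is exactly the situation described above.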

DIRECTLY IDENTIFIABLE

Directly identifiable data is (a combination of) information that can lead directly to a subject, such as address, phone number, e-mail address, citizen’s service number, IP address, bank account number, etc.

INDIRECTLY IDENTIFIABLE

Indirectly identifiable data is information that does not allow for a direct identification of subjects, but nevertheless permits researchers or third parties to identify subjects without unreasonable effort. For example, a person of a certain age, who has an uncommon profession and lives in a small village can still be easily identified even when all directly identifiable data is deleted.

SENSITIVE INFORMATION

Sensitive information, that is, a person’s political or religious affiliation, sexual orientation, medical and criminal records, and union membership, should be treated with additional care.

Pseudonymization

Pseudonymization is “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information, as long as such additional information is kept separately and subject to technical and organizational measures to ensure non-attribution to an identified or identifiable individual.”
Unlike anonymous data, pseudonymous data remains subject to the remit of the GDPR. Many of the techniques traditionally used to protect privacy in research settings, such as key-coding, fall within the definition of pseudonymization and therefore remain subject to the GDPR.

Coding is justified if you need to be able to link separate datasets about the same individuals. In all other cases, full anonymization is the preferred option.

Please note: The linking files (which establish a link between individuals and their data for the use of researchers) should be encrypted and stored offline. The coding key should not be saved in a folder together with the coded datasets. The environment in which the coding key/linking files are stored must be properly secured. In addition, you should clearly document all individuals who will have access to the coding key.

If there is no need to relate the research data to the personal data anymore, delete both the key and the personal data file.
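Key-coding as described above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: the function name `key_code`, the field names and the example records are hypothetical. Each individual receives a random code, the directly identifiable field is removed from the research data, and the link between code and individual is returned as a separate object so that it can be stored apart from the coded dataset.

```python
import secrets


def key_code(records, id_field):
    """Replace a directly identifiable field by a random code.

    Returns (coded_records, linking_key). The linking key maps each code
    back to the original identifier and must be stored separately.
    """
    code_for = {}      # original identifier -> random code
    linking_key = {}   # random code -> original identifier
    coded_records = []
    for record in records:
        identifier = record[id_field]
        if identifier not in code_for:
            code = secrets.token_hex(8)
            while code in linking_key:  # guard against an unlikely collision
                code = secrets.token_hex(8)
            code_for[identifier] = code
            linking_key[code] = identifier
        # Copy everything except the identifying field, then add the code.
        coded = {k: v for k, v in record.items() if k != id_field}
        coded["code"] = code_for[identifier]
        coded_records.append(coded)
    return coded_records, linking_key


# Hypothetical example records (not real data).
survey = [
    {"email": "a@example.org", "wave": 1, "score": 7},
    {"email": "b@example.org", "wave": 1, "score": 4},
    {"email": "a@example.org", "wave": 2, "score": 8},
]
coded_records, linking_key = key_code(survey, "email")
```

The same individual receives the same code in every record, so separate measurements can still be linked. Encrypting the linking key and storing it offline is not shown here. Once the link is no longer needed, deleting `linking_key` and the original file corresponds to the deletion step described above.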

Anonymization

Anonymization means there is no subject in the dataset that can be traced back to a unique living person. 
Complete anonymization means your dataset should not contain any privacy-sensitive information. You can achieve this by deleting all personal information and replacing it with an anonymized identifier (Subject 1, 2, 3, etc.). Variables such as names, addresses, contact information, citizen’s service, social security and tax numbers (BSN/Sofi), and medical record numbers must be removed. Dates directly related to the individual (birth, death, admission, discharge) must be re-encoded to years, and postal codes must be re-encoded to four digits. Finally, it is recommended to re-encode specific occupations to classifications of occupations, using the Standard Classification of Occupations (SBC, in Dutch) by Statistics Netherlands or the International Standard Classification of Occupations (ISCO) by the International Labour Organization.
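The re-encoding steps above can be sketched for a single record as follows. This is a minimal illustration, assuming a hypothetical record layout with the fields `name`, `birth_date`, `postal_code` and `score`. It does not cover occupation coding, and it does not by itself guarantee that the result is anonymous.

```python
from datetime import date


def anonymize_record(record, subject_number):
    """Drop direct identifiers and coarsen dates and postal codes."""
    return {
        # Replace the name by a meaningless identifier.
        "subject": f"Subject {subject_number}",
        # Re-encode the date of birth to the year only.
        "birth_year": record["birth_date"].year,
        # Re-encode the postal code to its first four characters,
        # i.e. the four digits of a Dutch postal code.
        "postal_code4": record["postal_code"].replace(" ", "")[:4],
        # Research variables are kept unchanged.
        "score": record["score"],
    }


# Hypothetical example record (not real data).
raw = {
    "name": "J. Jansen",
    "birth_date": date(1988, 5, 17),
    "postal_code": "7522 NB",
    "score": 12,
}
anonymized = anonymize_record(raw, 1)
print(anonymized)
```

Building the output from an explicit list of permitted fields, rather than deleting known identifiers from a copy, means that any field not named in the function is dropped by default.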

However, anonymization cannot be done in all cases, for instance when you collect 24-hour location data. Be alert when combining different parameters in your dataset: a combination can sometimes point to a unique living person as well.

After anonymization, the data cannot be traced back to a person. Datasets that are completely anonymized are not covered by the GDPR (i.e., GDPR no longer applies) and can therefore be shared and made openly available without any restrictions. However, keep in mind the earlier example given under indirectly identifiable data and additionally delete any indirectly identifiable data for specific persons if you think this is necessary!

Before you share your data with anyone else, make sure that the dataset is completely anonymized.  

We advise anonymizing your data as quickly as possible. If this is not possible, you will need to pseudonymize. 

Risk assessment

These levels of sensitivity can be directly translated into a risk assessment – how large are the consequences for respondents of their data being leaked, stolen, or otherwise becoming public?

SURFnet and IGS datalab use the guidelines set out in the table below for risk assessment. To which exact class a particular data set belongs is a somewhat subjective decision, which depends on the particular combination of individual pieces of information, and the context in which they occur.

| Risk class | Type of information | Countermeasures |
|---|---|---|
| 0 (public) | Public information; e.g. professional e-mail address. | No specific measures besides those mandated by the WBP. |
| 1 (basic) | Limited amount of personal data regarding the connection between respondent(s) and organization(s); e.g. student enrollment. | Standard information protection measures are adequate; e.g. password secured access. |
| 2 (increased risk) | Special personal data; e.g. economic status, dyslexia statement. | Increased information protection measures; e.g. encryption. |
| 3 (high risk) | Special personal data; e.g. psychological evaluation, medical records. | Highest possible level of security measures; e.g. encryption + off-line storage. |
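The table above can be encoded as a simple lookup, for example in a data management script. The sketch below is illustrative only. It assumes, as a rule not stated in the table, that a dataset takes the highest risk class of any information it contains; the names `COUNTERMEASURES` and `required_countermeasures` are hypothetical.

```python
# Countermeasures per risk class, copied from the table above.
COUNTERMEASURES = {
    0: "No specific measures besides those mandated by the WBP",
    1: "Standard information protection measures, e.g. password secured access",
    2: "Increased information protection measures, e.g. encryption",
    3: "Highest possible level of security measures, e.g. encryption + off-line storage",
}


def required_countermeasures(risk_classes):
    """Return the countermeasures for the highest risk class present.

    Assumption: the most sensitive piece of information determines the
    risk class of the dataset as a whole.
    """
    return COUNTERMEASURES[max(risk_classes)]


# Hypothetical dataset: professional e-mail (0), enrollment (1), medical records (3).
measures = required_countermeasures([0, 1, 3])
print(measures)
```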

Working with data

RESPONDENTS’ RIGHTS

Respondents have the right to ask the responsible party (in practice such a request would likely be made to you – the researcher) whether or not personal information regarding them is used. Such a request needs to be sufficiently precise, i.e. the respondent needs to make clear in which project they think their data is being used. The data in question also needs to be directly identifiable.

If personal data regarding the respondent is indeed on file, the respondent has the right to view said data, and request the data be changed, supplemented or deleted if the data is factually inaccurate, irrelevant to or inadequate for the research, or otherwise violates any legal requirements.

CONTACT INFORMATION

Contact information, that is, directly identifiable information, should be separated from other information. This separation should be as thorough as possible. Ideally, both datasets should be stored in different physical locations, and have different access protocols. Both datasets can be linked with an otherwise meaningless administrative identification number. Contact information should be destroyed when it is no longer necessary.

The potential consequences of high-risk data becoming public can be (partially) mitigated if such data is either anonymized or at least separated from contact information.