Data Protection and Reproducibility

Planning how research data is to be managed during a research project naturally raises questions about making data secure, especially if you are collecting, using and analysing personal or special category data.

GDPR

The General Data Protection Regulation (GDPR) applies to pseudonymised or identifiable personal or special category data. More information on what you need to consider when your research involves such data can be found here.

The University of Liverpool’s Data Protection Officer (Dan Howarth) has produced three short videos about GDPR for researchers: an introduction, followed by videos on principles and consent.

GDPR for Researchers - Introduction

GDPR for Researchers - Principles

GDPR for Researchers - Consent

Storage and information security

It is important that you keep your research data safe and stored appropriately, whether or not it is personal or special category data.

More details are available about storage and the Active Data Store.

The Knowledge Base has information on the most appropriate place to store your data, including research data.

Survey Software

The University supports several options for conducting surveys in the course of academic research, including Microsoft Surveys and Jisc Online Surveys.

The Knowledge Base has more information on these options and others. Please note that using Jisc Online Surveys requires an access request through the self-service portal.

Secondary Data

If your research involves analysing data created by a third party, you should check the third party’s requirements regarding storage, access, length of availability and destruction.

If you work with NHS data, you will need to know how to comply with the requirements of the Data Security and Protection Toolkit (DSPT). This Knowledge Base article from IT Services is essential reading: ‘Data Security and Protection Toolkit status for my NHS data research application’.

Accessing Government and Administrative data

As a University of Liverpool researcher, you can access a wealth of secondary data via the University of Liverpool Data SafePod.

Anonymisation

Sharing data that contains personal information is often problematic from a legal point of view, especially since the introduction of the GDPR in May 2018. The best way of dealing with the legal problems is to de-personalise the data. This means replacing or removing information that may identify an individual, so that the data is no longer ‘personal data’.

The following methods help in de-personalising/anonymising a dataset:

  • Removal of direct identifiers. This can include, but is not restricted to, identifiers such as names, dates, geographic information, telephone numbers, email addresses, etc.
  • Reduction in precision. For instance, day and month could be removed from dates of birth, which are highly identifying, leaving only the year of birth, which better preserves anonymity. Postcode information could be reduced to the postcode district (eg L69) or, for even less precision, to the postcode area alone (eg L).
  • Rather than including the raw data itself, it may be advisable to group the data. Instead of including exact ages, age bands could be introduced: 16-25, 26-35, 36-45 etc. Care should be taken at the upper and lower ranges of certain variables to ensure anonymity is preserved. Taking the age example, there may be very few people in a dataset over the age of 90, and the bands may have to be modified to take this into account.
  • Textual data should be thoroughly searched for identifying information such as the direct identifiers listed above. When found, these identifiers should be replaced with a consistent pseudonym. Where search and replace techniques are used, you should take care that misspelled identifiers are not missed. In many cases, given the time and effort required to check textual data, it may be worth considering how much data is really necessary and how much can be discarded before sharing takes place.
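
As an illustration only, the methods above could be sketched in Python along the following lines. The field names (name, email, phone, date_of_birth, postcode, age), the band boundaries and the helper functions are all hypothetical and chosen for this example; they are not a University tool or a recommended standard, and any real dataset will need its own decisions about what counts as identifying.

```python
import re
from datetime import date

# Hypothetical field names, invented for this illustration.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def postcode_district(postcode: str) -> str:
    """Reduce a full UK postcode to its district, e.g. 'L69 3BX' -> 'L69'.

    Assumes a complete postcode, whose final three characters are the inward code.
    """
    compact = postcode.replace(" ", "").upper()
    return compact[:-3]


def age_band(age: int, start: int = 16, width: int = 10, top: int = 86) -> str:
    """Group an exact age into bands (16-25, 26-35, ...), with open-ended extremes."""
    if age < start:
        return f"under {start}"
    if age >= top:
        return f"{top}+"  # merge sparse upper ages into a single band
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"


def anonymise_record(record: dict) -> dict:
    """Drop direct identifiers and reduce the precision of the remaining fields."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["year_of_birth"] = out.pop("date_of_birth").year
    out["postcode_district"] = postcode_district(out.pop("postcode"))
    out["age_band"] = age_band(out.pop("age"))
    return out


def pseudonymise_text(text: str, pseudonyms: dict) -> str:
    """Replace each known identifier with its consistent pseudonym.

    Matching is exact (whole word, case-insensitive), so misspelled
    identifiers will NOT be caught and still need a manual check.
    """
    for real, fake in pseudonyms.items():
        pattern = re.compile(r"\b" + re.escape(real) + r"\b", re.IGNORECASE)
        text = pattern.sub(lambda _match, fake=fake: fake, text)
    return text


record = {
    "name": "Jane Doe",
    "email": "jane@example.org",
    "phone": "0151 000 0000",
    "date_of_birth": date(1980, 5, 17),
    "postcode": "L69 3BX",
    "age": 44,
    "response": "agree",
}
anonymised = anonymise_record(record)
transcript = pseudonymise_text(
    "Jane met jane's GP in Toxteth.",
    {"Jane": "Participant A", "Toxteth": "[area]"},
)
```

The open-ended top band is one way of handling the sparsely populated upper ages mentioned above; where exactly to place it depends on how many participants fall into each band in your own data.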

Anonymisation of data is not an exact science and throughout the process you should be aware of the potential for re-identification. If you consider there may be a high risk of your research subjects being re-identified (for instance by combining the data with other easily-obtainable datasets), it may be appropriate to control distribution by using data sharing agreements.
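
One rough way to gauge re-identification risk, sketched here as an assumption rather than as a method set out in this guidance, is to count how many records share each combination of indirect identifiers (the idea behind k-anonymity). Combinations held by very few records are the ones most easily matched against other datasets. The function name, the field names and the threshold k are hypothetical; the threshold in particular is illustrative and not a University rule.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k=5):
    """Return each combination of quasi-identifier values shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical, already-banded data for illustration.
records = [
    {"age_band": "26-35", "postcode_district": "L69"},
    {"age_band": "26-35", "postcode_district": "L69"},
    {"age_band": "26-35", "postcode_district": "L69"},
    {"age_band": "86+", "postcode_district": "L1"},
]
risky = small_groups(records, ["age_band", "postcode_district"], k=2)
```

Any combinations returned are candidates for further grouping, for removal, or for sharing only under a data sharing agreement.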

The RDM team have produced a short introductory video about anonymisation.

The UK Data Archive has guidance about anonymising qualitative or quantitative data in a research setting.

There are several web-based tools that can help you manage your data, cleanse it and ensure it is as anonymous as it can be. More details can be found on our Research KnowHow page.

Reproducibility

Research data management and the sharing of research data form one part of the Open Research movement, which embodies ideas of best research practice by opening access to results, data, protocols and other aspects of the research process. Promoting activities such as data sharing and public pre-registration of studies also helps to address issues of reproducibility and replicability.

The Liverpool Research Data team work closely with our UK Reproducibility Network (UKRN) leads. At a 2020 workshop on data management and reproducibility, Prof Greenhalf presented the following video, addressing reproducibility and Covid-19 research.

Data sharing promotes research integrity and reproducibility. Sharing data opens opportunities for scientific enquiry through the promotion of innovation via new data uses and collaboration. It maximises transparency and helps ensure the reliability of the scientific record. The UK Reproducibility Network has produced a primer about Data Sharing.