Organising your data and archiving

Working with data

Organising your data at the start of a project makes it easier to access later, both for you and your research partners during the project and when you prepare to share it. It is not the most exciting of tasks, but it is very important.

At the beginning of a project, decide where the data will be stored and how it will be named and catalogued. This prevents duplication and version confusion, and keeps the data easy to access if someone leaves the project at a later date.

Records Management have provided some good practice guides for organising your data, including version control to help ensure everyone is working from the same document. The filing systems and naming conventions document contains some useful guidance on creating a comprehensive filing system. Further information can be found from JISC on choosing a file name that is compatible with different operating systems.
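As an illustration of a naming convention that works across operating systems, here is a minimal Python sketch. The pattern it uses (ISO date, short description, two-digit version number) and the function name safe_file_name are assumptions made for this example, not the convention set out in the Records Management or JISC guidance; adapt it to whatever your project agrees.

```python
import re
import unicodedata
from datetime import date


def safe_file_name(description: str, version: int, day: date, extension: str) -> str:
    """Build a file name from letters, digits, hyphens and underscores only."""
    # Decompose accented characters and drop the accents so the name is plain ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", description)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace each run of other characters (spaces, slashes, colons, ...) with one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", ascii_text).strip("-").lower()
    # Date first so files sort chronologically; zero-padded version so v02 sorts before v10.
    return f"{day.isoformat()}_{slug}_v{version:02d}.{extension.lstrip('.').lower()}"


# Hypothetical example: a title containing characters that some systems reject.
print(safe_file_name("Interview notes: site A/B (café)", 3, date(2024, 5, 17), ".TXT"))
# prints 2024-05-17_interview-notes-site-a-b-cafe_v03.txt
```

Putting the version number in the file name is only one way to track versions; it complements, rather than replaces, the version control guidance mentioned above.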

During a project you should, of course, use a file format that best suits the way you work. This may well be the format of discipline-specific software that is routinely used in your field. If you can be flexible in your approach, consider using open and archive-friendly formats, and avoid older and less open software if at all possible.

The ‘bit list’ of Digitally Endangered Species

Dealing with personal or special category data, as defined by the GDPR, requires you to agree on an anonymisation protocol. This should outline how and when anonymisation will be implemented and who has access to the raw data. Do not leave anonymisation until the end of your project: this will waste time and may well result in extra costs.

What to keep

It would be impractical and unnecessary to archive and preserve everything.

Preserving data costs time and money. To consider what needs to be kept and what can be discarded you should ask yourself the following questions:

  • What is needed to validate findings in your thesis/publications?
  • What might others conceivably find useful?
  • How expensive will it be to reproduce this data if it is destroyed?
  • How expensive will it be to preserve?
  • Are you obliged to destroy anything?

Digitise and archive as you go along

Finished working with a particular dataset? Then, if necessary, transform it into a more stable, standard format for archiving. Don't leave older files to become unreadable because the software is no longer available.

Your archival format should be at least one of the following:

  • Readable using free tools (ideally plain text): so it can be accessed without a potentially expensive licence
  • A well-documented standard: so a wide variety of software is available to access it
  • A de facto standard in your research area: so the majority of researchers you share it with can be expected to have access to the right software.
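To make the first criterion concrete, the sketch below exports a table from a SQLite database to a plain-text CSV file using only the Python standard library. It is one possible approach, not a prescribed workflow: the table name, column names, values and the helper export_table_to_csv are all hypothetical, and your own data may be better served by a different open format.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_table_to_csv(connection: sqlite3.Connection, table: str, out_path: Path) -> int:
    """Write every row of `table` to a UTF-8 CSV file with a header row.

    Returns the number of data rows written.
    """
    # Table names cannot be passed as SQL parameters, so check the name first.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    cursor = connection.execute(f'SELECT * FROM "{table}"')
    header = [column[0] for column in cursor.description]
    rows = cursor.fetchall()
    # newline="" lets the csv module control line endings itself.
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


# Hypothetical example data, held in an in-memory database for the demonstration.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE readings (site TEXT, temperature_c REAL)")
connection.executemany("INSERT INTO readings VALUES (?, ?)", [("A", 12.5), ("B", 14.0)])

with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "readings.csv"
    exported = export_table_to_csv(connection, "readings", csv_path)
    archived_text = csv_path.read_text(encoding="utf-8")

connection.close()
print(exported)       # number of data rows written
print(archived_text)  # header line followed by one line per row
```

A CSV file carries no information about units or column meanings, so an export like this is best archived alongside a short plain-text description of each column.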

Physical research data is easier and cheaper to digitise as the project progresses than at the end. The costs of doing so can be incorporated in your funding application.

Guidance on the quality assurance of data

Findable and Accessible

As part of making your data FAIR (Findable, Accessible, Interoperable, Reusable), you need to make sure the datasets you share are findable and accessible.

Use trusted repositories, as they will have the appropriate metadata standards. Discipline-specific data repositories, supported by funders or the community, have tailored fields and descriptions. Complete as many fields as you can and use keywords associated with your data, project and discipline.

Digital Object Identifier (DOI) – datasets that can be shared publicly or in a controlled manner should have a DOI so that the record is discoverable.

If you are involved in a large project, the datasets may well be deposited in another institutional data catalogue or a discipline-specific repository. If that is the case, it is still worthwhile creating a metadata-only record in the Liverpool Research Data Catalogue, using the original DOI. It all aids discovery.

Have you ever lost your data?

How to avoid a data management nightmare

Guidance from UK Data Archive

How-to guide "Five steps to decide what data to keep" (DCC)