“The challenge that life science organizations face is not so much analyzing their data, but rather organizing it.”
--- Anthony Philippakis, Broad Institute Chief Data Officer
Data is the lifeblood of life sciences, biotech, and pharmaceutical companies. But without thoughtful organization and management of that data, the work of scientists and data analysts can be severely hampered. In this post we'll take an introductory look at the FAIR Data Principles, which you can use to guide your data management strategy.
Your Data is Your Value
Software companies know that their source code is the source of their value. Therefore, they take special care to organize it with standard folder structures, naming conventions, versioning, check-in/check-out procedures, etc. For Life Science, Pharmaceutical, and Biotech companies, your data is your value. You should be just as rigorous in organizing your data and keeping it consistent as software companies are with their source code. For that you need a data management strategy. Without one, individual scientists or groups will each come up with their own conventions and processes. While that might be manageable when the company is small, it will become a big headache and a drain on efficiency as the company grows.
FAIR Data Principles
Think about all the types of scientific data you may need to organize. Here are just some examples:
- Static referential / knowledge base
- Departmental logistics / planning
- Meetings documentation
- Project high-level documentation
- Project final reporting
- Project scratchpad / temporary analyses
- Project raw/instrument data
- Data intended for external users
- Intentionally deprecated historical data
It might seem overwhelming to create a strategy that encompasses all the many kinds of scientific data your company produces. Fortunately, you don’t have to start with a blank piece of paper and invent a strategy out of whole cloth. The FAIR Data Principles (Wilkinson et al., 2016) provide excellent guidance for scientific data management and stewardship. FAIR stands for Findable, Accessible, Interoperable, Reusable. The earlier in the company’s life cycle these principles are adopted, the fewer data management headaches the company will experience down the road.
Developing a Data Strategy
The goals for a scientific data management strategy may include:
- Avoid losing or overlooking data
- Provide context and explicit linkage of data
- Provide common, project-centric data views
- Have frameworks that will allow decision making based on analysis of large data sets
- Give all stakeholders an equal ability to quickly interpret data and contribute
Let’s look at some of the ways that an early-stage company might apply the FAIR Data Principles to a data strategy. A company at this stage likely does not yet have complex Data Biospheres and fit-for-purpose databases. It would probably keep its data in files and folders in a cloud storage solution such as Box or OneDrive.
- Data folders should be based on project – NOT on who worked on it. Data lives with Projects – not People.
- Give experiments single succinct unique names/numbers; don’t name by description.
- Use consistent nomenclature for naming of experiments, protocols, etc.
- Do not change names; do not delete metadata.
- Metadata lives near the data in the folder hierarchy and explicitly names the files it describes.
- Keep a registry/index file (Excel) explicitly summarizing experiments to allow human search.
- R&D data should live under a central R&D folder, fully accessible to anyone in the company who might need access.
- Sensitive data should be explicitly named/catalogued and should be managed by a designated Data Steward. The old saying is “If everyone is responsible, that means no one is responsible.” Name a specific person with responsibility for and control over the management of sensitive data.
- The sharing of internal data with external collaborators should follow the guidance of the designated Data Steward. Team members need guidelines on which data is OK to share, how to share it, and with whom.
- Use consistent naming methods to refer to experiments, data types, instruments, protocols, etc.
- Instrument data should be captured in consistent templates.
- Analysis data should follow department-defined workflows and file formats.
- Include as much contextual metadata as possible; retain instrument data; use common vocabularies.
- Define/capture common project workflows; identify which workflow is in use per experiment.
- Explicitly indicate provenance for data and analysis.
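Several of these guidelines (project-based folders, succinct experiment names, metadata that lives next to the data, and a registry file) can be checked and maintained with a small script. The following is only a minimal sketch, not a prescribed implementation. It assumes a hypothetical layout of `<R&D root>/<project>/<experiment>/`, a hypothetical experiment naming pattern such as `EXP-0042`, and a hypothetical `metadata.json` file in each experiment folder with `protocol`, `workflow`, and `files` fields. None of these names come from the FAIR principles themselves; substitute your own conventions.

```python
import csv
import json
import re
from pathlib import Path

# Assumption: experiment folders are named like "EXP-0042" (hypothetical convention).
EXPERIMENT_ID = re.compile(r"^EXP-\d{4}$")
# Assumption: each experiment folder holds a metadata file with this name.
METADATA_FILE = "metadata.json"


def build_registry(rd_root):
    """Scan <rd_root>/<project>/<experiment>/ and return (rows, problems)."""
    rows, problems = [], []
    for exp_dir in sorted(Path(rd_root).glob("*/*")):
        if not exp_dir.is_dir():
            continue
        project, exp_id = exp_dir.parent.name, exp_dir.name
        if not EXPERIMENT_ID.match(exp_id):
            problems.append(f"{project}/{exp_id}: name does not match convention")
            continue
        meta_path = exp_dir / METADATA_FILE
        if not meta_path.is_file():
            problems.append(f"{project}/{exp_id}: missing {METADATA_FILE}")
            continue
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        # The metadata explicitly names the files it describes; flag absent ones.
        for name in meta.get("files", []):
            if not (exp_dir / name).is_file():
                problems.append(f"{project}/{exp_id}: listed file not found: {name}")
        rows.append({
            "project": project,
            "experiment": exp_id,
            "protocol": meta.get("protocol", ""),
            "workflow": meta.get("workflow", ""),
            "files": ";".join(meta.get("files", [])),
        })
    return rows, problems


def write_registry(rows, out_path):
    """Write the registry as a CSV file, which can be opened in Excel."""
    fields = ["project", "experiment", "protocol", "workflow", "files"]
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `build_registry` on the central R&D folder returns one row per well-formed experiment plus a list of problems (misnamed folders, missing metadata, files named in the metadata but not present). `write_registry` then saves the rows as a registry file that people can search by hand. A script like this reports problems rather than renaming or deleting anything, in keeping with the rule not to change names or delete metadata.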
This is just an example of how a company could apply the FAIR Data Principles. Each company will have its own processes and workflows to consider.
Data is the lifeblood of Life Science companies and organizing it is no small challenge. FAIR Data Principles can guide you as you design your own rules, procedures, and guidelines for your valuable data. If you create this data strategy as early in your company’s lifecycle as possible, you can avoid the pain and inefficiency of poorly organized, hard-to-find, hard-to-analyze data and the time and expense of major data restructuring projects as the company grows and the quantity of data becomes overwhelming.
This blog post is based on information provided by Charles O’Donnell, Director of Computational Biology at Evelo Biosciences. Evelo Biosciences is pioneering therapies that modulate systemic immune response by acting on the gut-body network - monoclonal microbials.