By Bevin P. Engelward (MIT Superfund Research Program Director) and Amy L. Nurnberger (MIT Data Management Services Program Head)
Under the leadership of Director Dr. William Suk, the NIEHS Superfund Research Program (SRP) is playing a pioneering role in enabling the development of novel tools for leveraging big data in new and exciting ways. Big data comes in many forms, ranging from life-science data sets that measure which genes are turned on and off to engineering models of the spatiotemporal dynamics of contaminants in our environment. Not only are large data sets being created faster and more efficiently, but innovative tools for combining data sets are also emerging. To fully leverage these large data sets, we need to be able to find them, access the data, and manipulate the data. In essence, the goal is to reuse data to gain new and deeper understanding and greater predictive capacity. In recognition of the importance of these activities, the NIEHS SRP is adopting innovative strategies for making data findable, accessible, interoperable, and reusable; in other words, making data FAIR [1, 2].
One of the first questions is often “how can FAIR be accomplished?” It turns out that the most important step toward achieving FAIR data is to assign and share effective metadata. Metadata is data about data. Some basic elements of metadata might include the name of the person who generated the data, the date when the data were collected, and the place where the experiments were performed. While these are obviously valuable attributes, metadata needs to be much more complete in order to enable FAIR data. As a start, think about what essential attributes you would need to know about data before reusing it. For example, for an experiment where cultured mammalian cells are exposed to a carcinogenic chemical found in Superfund Sites, you would need to know exactly what the exposure was, including dose and duration, and you would want to have the instructions that were used for the dosing regimen.
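To make the idea of "complete" metadata concrete, here is a minimal Python sketch of a metadata record for the cell-exposure example above, along with a check for attributes that were left unfilled. The class name, field names, and sample values are all hypothetical illustrations; they are not taken from SEEK or from any MIT SRP schema.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class ExposureMetadata:
    """Hypothetical metadata record for a cell-exposure experiment."""
    creator: Optional[str] = None           # who generated the data
    collected_on: Optional[date] = None     # when the data were collected
    location: Optional[str] = None          # where the experiment was performed
    cell_type: Optional[str] = None         # which cultured cells were used
    chemical: Optional[str] = None          # what the cells were exposed to
    dose: Optional[float] = None            # how much
    dose_unit: Optional[str] = None         # in what units
    duration_hours: Optional[float] = None  # for how long
    protocol: Optional[str] = None          # dosing instructions that were followed


def missing_fields(record: ExposureMetadata) -> List[str]:
    """Return the names of attributes that have not been filled in."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Placeholder values for illustration only; this is not real data.
record = ExposureMetadata(
    creator="A. Researcher",
    collected_on=date(2020, 1, 15),
    location="Example Lab",
    chemical="example carcinogen",
    dose=10.0,
    dose_unit="uM",
)

print(missing_fields(record))  # ['cell_type', 'duration_hours', 'protocol']
```

A record with only the basic "who, when, where" elements would pass a casual inspection, yet a check like this shows that the details someone would need for reuse (duration and protocol, in this case) are still missing.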
To achieve the goal of creating and storing metadata along with its research datasets, the MIT SRP is adopting and adapting the SEEK [3] architecture with support from the NIEHS Superfund Research Program. The most exciting aspect of SEEK is that it structures the data and metadata so that computers can be programmed to find and use submitted data. This combination of structured data and metadata is known as “machine-actionable data” and is accessible via computer programs that can harvest preexisting data sets for novel analyses. These approaches are effective when there are repositories in which properly structured data and metadata reside. On the other hand, there are cases where data are in a form that is primarily human-understandable rather than machine-actionable. In these cases, the SEEK architecture can still be used to help users identify the existence of data sets that might be useful for further transformation and analysis.
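As a rough illustration of what "machine-actionable" means in practice, the Python sketch below searches a small catalog of structured metadata records for datasets that match given criteria. The catalog contents, field names, and the `find_datasets` helper are hypothetical and are not part of SEEK's actual interface; the point is only that a program, rather than a person, can do the finding once metadata is structured.

```python
import json

# Hypothetical catalog of metadata records, stored as structured JSON.
CATALOG_JSON = """
[
  {"id": "ds-001", "data_type": "gene expression",
   "chemical": "chemical A", "machine_actionable": true},
  {"id": "ds-002", "data_type": "fate and transport model",
   "chemical": "chemical A", "machine_actionable": true},
  {"id": "ds-003", "data_type": "gene expression",
   "chemical": "chemical B", "machine_actionable": false}
]
"""


def find_datasets(catalog, **criteria):
    """Return catalog entries whose metadata matches every given criterion."""
    return [
        entry for entry in catalog
        if all(entry.get(key) == value for key, value in criteria.items())
    ]


catalog = json.loads(CATALOG_JSON)

# Find every machine-actionable dataset about one chemical, across disciplines.
hits = find_datasets(catalog, chemical="chemical A", machine_actionable=True)
print([entry["id"] for entry in hits])  # ['ds-001', 'ds-002']
```

Note that the entry marked as not machine-actionable still appears in the catalog: its metadata makes the dataset discoverable even though a person would need to transform the data before analysis.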
SEEK is an online data management platform developed for the purpose of enabling data integration across heterogeneous data types in order to model (and ultimately predict) biological outcomes. Although SEEK was originally developed for the purpose of leveraging data for systems biology, the architecture was designed to accommodate a wide range of data types, making it ideal for all MIT SRP members, including those from the environmental science and engineering projects. In essence, the SEEK platform facilitates data sharing by providing a structure for metadata creation that makes it possible to accurately describe datasets and to link corresponding datasets. Ultimately, this makes it possible for researchers to find relevant data sets. Data are not only findable in SEEK, but the metadata structure also makes access possible. Integration of data into the SEEK architecture allows the key information needed to access data from publicly available repositories to be entered for each dataset. Making data findable and accessible is key to data integration across diverse data sets. In other words, you can answer research questions without collecting a single sample! For dataset creators, the impact of their work can be extended far beyond the normal audience. With appropriate citation, supported by good metadata, they get credit for the datasets they have authored whenever those datasets are reused.
While the SEEK platform makes it easy to add metadata at the time when someone submits their data into a publicly available repository, to be maximally effective, the SEEK platform needs to be dynamic, enabling creation of metadata as data are generated in real time. This is critical because it often takes years between the time of data creation and the time of data deposition, and by the time data are deposited, important metadata and associated supporting information can be lost. For example, a graduate student might inherit a project from someone who has graduated. When asked for key metadata, such as which protocol was used for analysis, the new student might not know with certainty, especially since protocols evolve over time. At MIT, we are tackling this problem by re-engineering the SEEK architecture to make it easy for people to submit metadata in real time. This exciting advance promises to greatly accelerate research by providing researchers with the essential details that are needed for data reuse.
It is interesting to note that the SEEK architecture is useful for documenting and sharing information within a lab, as well as between labs. For example, someone might want to find samples that had been collected by a former labmate. Historically this has been quite difficult, as one might not know which freezer to search, let alone which box to look in. SEEK makes it easy not only to link specific samples with the data generated from them, but also to indicate where a sample is located (and, equally important, whether or not the sample still exists, since samples are often used up).
As more and more Superfund data get uploaded into publicly accessible frameworks, there will in turn be countless opportunities to combine data sources in new and exciting ways. Importantly, this creates an opportunity to integrate data across disciplines. This is a particularly exciting aspect of the program, since Superfund is unique among research programs in the incredible diversity of data, ranging from biological responses to environmental fate and transport data. In fact, with its rich diversity of data types, Superfund is the ideal program to propel data science into new and unknown territory. By combining data in new ways, we will increase our potential to make valuable predictions, such as knowing what the impact of cleanup will be on the health of people living near a particular Superfund Site. Stay tuned as we open our eyes to new ways of reusing and integrating data, enabling better science and thus a greater impact on the world around us!
Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). doi:10.1038/sdata.2016.18