In psychology, as in many other fields, routine data sharing, defined here as the publication of the primary data and any supporting materials required to interpret the data acquired as part of a research study, is still in its infancy. With increased focus on reproducibility and more funder mandates calling for data sharing, the debate over data sharing is shifting from one of benefit or harm to one of what data should be shared and how. In this overview, we focus on sharing so-called “long tail” data: data generated by individual laboratories in the course of mostly hypothesis-driven research. In discussing data-sharing attitudes, costs and benefits, best practices, and infrastructure, we draw on examples from other fields. In our view, data publishing is an essential part of twenty-first-century scholarship. Although some issues about what data to share and how to share them remain unresolved, the infrastructure for sharing a wide range of data is already in place.
Why share data at all? The short answer is: because we can (and we should). Before computers and the internet, sharing data beyond what could be printed in journals or books was impractical. This led to an academic publishing culture in which data were considered disposable after some specified retention period, and in which the production of a data set was certainly not considered a work of scholarship in its own right. Journals and books thus served as the record of our hypotheses, experiments, analyses, and the findings drawn from the data we collected.
Prose cannot capture the full potential of a data set, yet the scientific article has been so prized for so long that the current push to rethink how scientific data are communicated has met with resistance across fields. The resistance is particularly acute in fields such as psychology that are characterized by relatively small data sets produced by small teams of scientists through complex experimental designs: so-called “long tail” data.
Let’s start with some terminology. In this article, data are anything collected or assembled for analysis as part of an investigation and used to draw conclusions about a study’s results and findings. “Primary data” refers to information gathered for a study in its unaltered or minimally processed form. “Metadata” is information about data or about a set of data; descriptive metadata (e.g., the subjects and experimental paradigms) are of particular interest here. Following Borgman’s (2015) treatment of data, we note that while almost anything can be considered data, it becomes data only when it is used to support a claim (p. 27). This definition covers a wide range of psychological data. Readers interested in data and their role in scholarship are referred to Borgman (2015), particularly Chapter 2, for more on definitions of data.
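To make the distinction concrete, here is a toy Python sketch of primary data versus descriptive metadata for a hypothetical study; the field names are invented for illustration and do not follow any standard metadata schema.

```python
# Primary data: minimally processed observations, one record per trial.
primary_data = [
    {"subject": "S01", "trial": 1, "rt_ms": 512, "correct": True},
    {"subject": "S01", "trial": 2, "rt_ms": 634, "correct": False},
]

# Descriptive metadata: information *about* the data set, covering the
# subjects, paradigm, and what each variable means.
descriptive_metadata = {
    "title": "Simulated lexical decision task (hypothetical)",
    "subjects": "2 adult volunteers",
    "paradigm": "two-alternative forced choice",
    "variables": {
        "rt_ms": "response time in milliseconds",
        "correct": "response accuracy",
    },
}
```

Without the metadata, a reuser (human or machine) cannot tell what “rt_ms” measures or how the trials were generated, which is why the two must travel together.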
Nearly all major funding agencies in the United States and abroad are developing policies that require the sharing of research data and other research products (Gewin, 2016; McKiernan et al., 2016). Both human and machine considerations drive the movement toward open science, which at a minimum includes the sharing of research articles, data, and code. The push to modernize our scholarly communications system reflects new technologies (the internet, cloud storage, and mobile devices) as well as the promise of new insights from increased human and machine access to research outputs.
There is a great deal of interest these days in machine-based access to data and in how it can unlock research questions and insights that cannot be gleaned from individual data sets, especially in the age of big data and data science (Perrino et al., 2013). Our current practice of analyzing effect sizes pooled across studies arose because we did not have the access to raw data sets that we have today: meta-analysis was developed out of an urgent need to extract useful information from the cryptic records of inferential data analyses contained in the abbreviated reports of research in journals and other printed sources (Glass, 2000). Because it allows more powerful analyses and, especially when negative data are shared, fewer biases than current methods, sharing data sets with individual participant data can be considered the next evolution of meta-analysis (Cooper & Patall, 2009; Perrino et al., 2013). As Glass (2000) writes, “Meta-analysis needs to be replaced by archives of raw data that allow the construction of complex data landscapes that depict the relationships among independent, dependent, and mediating variables.”
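To illustrate the difference Glass points to, the following Python sketch contrasts a classic fixed-effect meta-analysis of study-level effect sizes with a direct analysis of pooled individual participant data; all data are simulated and the numbers carry no empirical meaning.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two simulated studies, each with a true standardized effect of 0.3.
studies = []
for n in (40, 60):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(0.3, 1.0, n)
    studies.append((treatment, control))

# (1) Classic aggregate meta-analysis: pool standardized mean differences
#     (Cohen's d) with inverse-variance (fixed-effect) weights.
ds, ws = [], []
for t, c in studies:
    n_t, n_c = len(t), len(c)
    sp = np.sqrt(((n_t - 1) * t.var(ddof=1) + (n_c - 1) * c.var(ddof=1))
                 / (n_t + n_c - 2))              # pooled standard deviation
    d = (t.mean() - c.mean()) / sp               # Cohen's d for this study
    var_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
    ds.append(d)
    ws.append(1.0 / var_d)                       # inverse-variance weight
print(f"fixed-effect pooled d: {np.average(ds, weights=ws):.3f}")

# (2) Individual participant data (IPD): with the raw data shared, all
#     participants can be analyzed directly (here, a naive pooled t-test).
all_t = np.concatenate([t for t, _ in studies])
all_c = np.concatenate([c for _, c in studies])
t_stat, p = stats.ttest_ind(all_t, all_c)
print(f"IPD t-test: t = {t_stat:.2f}, p = {p:.4f}")
```

A real individual-participant analysis would model study membership (e.g., as a fixed or random effect) rather than naively pooling, but the sketch shows the kind of analysis that becomes possible only when the raw data, not just summary statistics, are shared.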
Data sharing offers numerous advantages, including promoting progress, improving transparency, accountability, and reproducibility, and increasing the power and breadth of psychological studies (BSA, 2015b). There are growing calls to establish a data-sharing culture that supports not only a human-centric vision (i.e., the routine availability of the data underlying a paper for others to reuse) but also an e-Science vision, in which researchers conduct their research digitally and in which data standards and programmatic interfaces make it easy for machines to access and consume large amounts of data.
Open science and eScience come in many flavors, but both demand that a domain’s concepts, researchers, and instruments have a digital presence designed for both humans and machines. As data move from one repository to another, it is crucial that the process of turning ideas and resources into electronic products rest on a solid digital foundation within the original laboratory. An essential component is ensuring that research products can be connected to computational resources that perform at scale, and that their origins can be traced through a user’s interactions with them. To ensure the long-term viability of psychological research, data must be dependable and hosted by trustworthy repositories (see Lishner, 2012, for a concise set of core recommendations to improve the dependability of psychological research). Digital artifacts must be findable, accessible, and reusable through application programming interfaces (APIs), with minimal restrictions. We discuss these requirements in more detail later; they have recently been summarized as the “FAIR” data principles: data should be Findable, Accessible, Interoperable, and Reusable.
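As a rough sketch of what machine-actionable, FAIR-style access might look like, the Python snippet below queries a hypothetical repository over a REST API using the third-party requests library; the base URL, endpoints, and response fields are all invented for illustration and do not correspond to any real service.

```python
import requests

BASE = "https://repo.example.org/api"   # hypothetical repository

# Findable: query the repository's search endpoint by keyword.
hits = requests.get(f"{BASE}/search",
                    params={"q": "working memory"}, timeout=30).json()

# Accessible: retrieve a data set's metadata via a persistent identifier
# (here assumed to be a DOI returned by the search).
record = requests.get(f"{BASE}/datasets/{hits['results'][0]['doi']}",
                      timeout=30).json()

# Interoperable/Reusable: the metadata should state each file's format and
# the license, so both humans and machines know how the data may be used.
for f in record["files"]:
    print(f["name"], f["format"], record["license"])
```

The point is not the specific endpoints but that discovery, retrieval, and reuse conditions are all exposed programmatically, with no human gatekeeper in the loop.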
Does open eScience have real-world examples to back up its claims? That is, can researchers do anything meaningful with a plethora of independently collected data? One of the most compelling examples of how shared data and new data science approaches can lead to new findings comes from research on spinal cord injury and traumatic brain injury in biomedicine. Despite numerous promising laboratory studies, translational research in both fields has had a poor track record of reproducibility and of producing clinically meaningful outcomes (Steward, Popovich, Dietrich, & Kleitman, 2012). For this reason, the US National Institute of Neurological Disorders and Stroke (NINDS) established the Facilities of Research Excellence in Spinal Cord Injury, a program dedicated solely to reproducing the most important findings in the field. The results were dreadful, and they brought to light the difficulties of reproducing paradigms and protocols across laboratories. These efforts may be unfamiliar to many psychologists, but the issue of reproducibility is now at the forefront of many fields, psychology among them.
These findings led the community to release individual data sets (Visualized Syndromic Information and Outcomes for Neurotrauma-SCI; VISION-SCI) so that they could be aggregated across numerous labs and thousands of animals with spinal cord injuries (Nielson et al., 2014). The releases included not only researchers’ primary data but also animal care and laboratory records, the so-called “file drawer” and “background” data (Wallis, Rolando, Borgman, Jackson, & Brinkley, 2013). This approach let researchers sample more of the “syndromic space,” which Ferguson and colleagues define as the “full presentation and course of a disease or syndrome given many varying initial and subsequent conditions” (e.g., extent and location of injury, physiological state, level of care; Nielson et al., 2014). Each research project supplied one piece of the multidimensional space representing the effects of spinal cord injury. This and similar efforts in TBI have yielded more accurate predictive models, new therapeutic areas, and more robust cross-species biomarkers of functional recovery. By sharing all of their data, good and bad, rather than just formulating hypotheses and protocols and observing results, researchers gained a deeper understanding of the phenomenon. NIH and the scientific community also worked together to define common data elements (CDEs) and minimal information models that researchers in these fields should routinely collect.
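The value of common data elements is easiest to see in miniature. In the hypothetical Python sketch below, two labs have recorded the same constructs under different local conventions, and mapping both onto shared CDE names is what makes the records poolable; the column names and values are invented for illustration.

```python
import pandas as pd

# Two labs recorded the same constructs under different local conventions.
lab_a = pd.DataFrame({"subj": [1, 2],
                      "injury_severity": ["mild", "severe"],
                      "bbb_score": [18, 4]})
lab_b = pd.DataFrame({"animal_id": [7, 8],
                      "severity": ["MILD", "SEVERE"],
                      "BBB": [17, 6]})

# Map each lab's local schema onto shared CDE names before pooling.
cde_map_a = {"subj": "subject_id", "injury_severity": "severity",
             "bbb_score": "bbb"}
cde_map_b = {"animal_id": "subject_id", "severity": "severity",
             "BBB": "bbb"}

pooled = pd.concat([lab_a.rename(columns=cde_map_a),
                    lab_b.rename(columns=cde_map_b)], ignore_index=True)
pooled["severity"] = pooled["severity"].str.lower()  # harmonize coding
print(pooled)
```

Agreeing on the CDEs before data collection, as NINDS and the community did, avoids exactly this after-the-fact harmonization work, which grows painful as the number of contributing labs increases.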
Arguments based on utility (or the lack thereof) generally assert that, beyond large prospective data sets designed for public release, data collected in complex and intricate experiments have little value because no one else would understand them. Many researchers are therefore baffled by the idea that sharing their data could help advance science, and some worry that it could actually harm it. Yet given current concerns about reproducibility in the behavioral sciences (Open Science Collaboration, 2015), a lack of sharing may itself foster poor, or at least non-reproducible, science. Data science is in its infancy and relies on a steady stream of data from which to draw its first conclusions. As our examples show, and as the new data science demonstrates, data can be put to a variety of uses that we have not yet imagined.
The most compelling utilitarian argument for data sharing is transparency and reproducibility. Regardless of whether the data can be reused, it is hard to argue against making the primary data available for inspection alongside an article: an interested party can then examine the measurements with full access to the data. Data audits, and suspicions sparked by a lack of data, have revealed research irregularities at best and outright fraud at worst in several high-profile studies (McNutt, 2015; Yong, 2012). Graduate students in psychology have long been taught to preserve all records, including data records, for five years after a study is published (American Psychological Association, 2010), and a look around an older researcher’s laboratory or office suggests that such records are often kept far longer. With the original data in hand, an independent researcher can repeat the reported analyses or apply new analysis methods to the same data to see whether the results can be reproduced.
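As a minimal illustration of this kind of check, the Python sketch below re-derives a reported test statistic from a deposited data file; the file name, column names, and the “reported” value are all hypothetical.

```python
import pandas as pd
from scipy import stats

# Load the primary data deposited alongside the article (hypothetical file).
df = pd.read_csv("study_primary_data.csv")
a = df.loc[df["condition"] == "treatment", "score"]
b = df.loc[df["condition"] == "control", "score"]

# Recompute the reported independent-samples t-test and compare.
t_stat, p = stats.ttest_ind(a, b)
reported_t = 2.31   # value claimed in the article (hypothetical)
print(f"recomputed t = {t_stat:.2f} (reported {reported_t}); "
      f"match: {abs(t_stat - reported_t) < 0.01}")
```

Even this trivial audit is impossible when only summary statistics are published, which is why inspection-with-the-article is the floor, not the ceiling, of data sharing.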
Researchers who share their data may be harmed if others benefit from the data before the original researchers can fully mine them. This risk is especially acute for early-career researchers, who may be penalized if the time and effort spent gathering and preparing data for sharing comes at the expense of publishing research articles. Indeed, in surveys of data-sharing attitudes, younger scientists are less willing to share their data than older researchers (Tenopir et al., 2011; Tenopir et al., 2015).
Researchers also worry that their data will be used against them, particularly in replication studies (“hostile replications,” “weaponizing data”), whether through errors being discovered or through deliberate attempts to undermine a study by running multiple re-analyses of the data and selectively reporting the results (Lewandowsky & Bishop, 2016). According to Rouder (2015), “the self-appointed replication police who are viewed in some quarters as trying to shame otherwise good researchers” cause researchers to feel “professionally vulnerable” (p. 1068). Some researchers who participated in the psychology replication project expressed negative feelings about the experience (Bohannon, 2014). While these concerns are understandable, at least one study has found that a failure to replicate does not harm scientists’ reputations when rigorous standards are maintained (Ebersole et al., 2016).
The relationship among data, scientists, and a scientific work is at the heart of many debates over data sharing. Are data merely a “supplement to the written record of science” (Wallis et al., 2013, p. 2)? Are they owned by individual labs, research assets amassed over time? Or are data part of the scholarly work itself, alongside the introduction, materials and methods, results, and discussion? Can a data set, however difficult and time-consuming it was to collect, be more than a collection: an object of scholarship in its own right? On this view, data sets should be accorded the same status as narrative works and regarded as publications in their own right. Published data would be recognized and rewarded when their analysis led to a new discovery, and the reuse of data would be celebrated rather than decried (Longo & Drazen, 2016).
Data policies, and arguments for and against sharing, take on a different flavor depending on whether one views data as proprietary research resources or as an integral part of a research study’s output. If the data are viewed as just another research resource, then the researcher owns them, though professional norms may require sharing them on request. APA ethical guidelines explicitly state that psychologists “…should not withhold the data on which their conclusions are based from other competent professionals who seek to verify the substantive claims through reanalysis…” (American Psychological Association, 2010), although few mechanisms exist to track compliance with such requests (Vanpaemel, Vermorgen, Deriemaecker, & Storms, 2015). Under this view, it is common for researchers to thank a data provider in the acknowledgments rather than to credit the contribution formally; distributing or posting modified versions of the data without the author’s express consent is not permitted; and the producers of the data may even seek compensation for the direct costs of producing a usable copy (Vanpaemel et al., 2015). If researchers do give up sole access to the data they collected, embargo periods give them time to fully exploit the data before others may freely use it. Data may also be released under a license that restricts certain uses without permission (e.g., no commercial use).