December | 2008 | Reproducible Research

Even when lab work and statistical analysis are carried out perfectly, microarray experiment conclusions have a high probability of being incorrect for probabilistic reasons. Of course, lab work and statistical analysis are not carried out perfectly. I went to a talk earlier this week that demonstrated reproducibility problems coming both from the wet lab and from the statistical analysis.

The talk presented a study that supposedly discovered genes that can distinguish those who will respond to a certain therapy from those who will not. On closer analysis, the paper actually demonstrated that it is possible to distinguish microarray experiments conducted on one day from experiments conducted on another day. That is, batch effects from the lab were much larger than differences between patients who did and did not respond to therapy. I hear that this is typical unless gene expression levels vary dramatically between subgroups.

The talk also discussed problems with reproducing the statistical analysis. As is so often the case, data were mislabeled. In fact, 3/4 of the samples were mislabeled. Simply keeping up with indexes is the biggest barrier to reproducibility. It is shocking how often studies simply did not analyze the data they say they analyzed. This seems like a simple matter to get right; perhaps people give little attention to it precisely because it seems so simple.
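
One way to make "keeping up with indexes" less fragile is to look up each sample's label by its ID instead of relying on row or column order. The following is a minimal sketch under assumed inputs; the function name `align_labels`, the sample IDs, and the labels are all hypothetical and do not come from the study discussed in the talk.

```python
def align_labels(sample_ids, labels_by_id):
    """Return one label per sample, looked up by ID rather than by position."""
    # Refuse to continue if the same sample appears twice in the data.
    if len(set(sample_ids)) != len(sample_ids):
        raise ValueError("duplicate sample IDs in the data")
    # Refuse to continue if any sample has no label at all.
    missing = [s for s in sample_ids if s not in labels_by_id]
    if missing:
        raise KeyError(f"samples without a label: {missing}")
    return [labels_by_id[s] for s in sample_ids]


# Hypothetical data: the order of samples in the expression matrix
# differs from the order in which the labels were recorded.
sample_ids = ["S3", "S1", "S2"]
labels_by_id = {"S1": "responder", "S2": "non-responder", "S3": "responder"}

labels = align_labels(sample_ids, labels_by_id)
print(labels)
```

A check like this cannot detect a label that was recorded wrongly at the source, but it does catch labels that drift out of step with the data when rows are sorted, dropped, or merged.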

So, three reasons to be skeptical of microarray experiment conclusions:

High probability of false discovery
Statistical reproducibility problems
Physical reproducibility problems
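
To see why the first item holds even with flawless lab work and analysis, here is a rough back-of-the-envelope sketch. All of the numbers (20,000 genes, 1% truly differentially expressed, a 0.05 significance threshold, 80% power) are illustrative assumptions, not figures from the talk, and the calculation assumes no multiple-testing correction is applied.

```python
def expected_false_discovery_proportion(n_genes, frac_true, alpha, power):
    """Expected share of 'significant' genes that are actually false positives."""
    n_true = n_genes * frac_true          # genes with a real difference
    n_null = n_genes - n_true             # genes with no real difference
    false_pos = n_null * alpha            # nulls that pass the threshold by chance
    true_pos = n_true * power             # real differences that are detected
    return false_pos / (false_pos + true_pos)


# Illustrative assumptions only.
fdp = expected_false_discovery_proportion(
    n_genes=20000, frac_true=0.01, alpha=0.05, power=0.8
)
print(f"{fdp:.2f}")
```

Under these assumptions, roughly 990 of about 1,150 flagged genes would be false positives, which is why uncorrected gene lists deserve skepticism before any lab or bookkeeping errors are even considered.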

Most people would agree that reproducible results are important in all areas of science. I think reproducibility is particularly important in areas of science where replication of an experiment or study (where a similar question is addressed using independent investigators, data, and methodology) is highly unlikely. Such studies are typically difficult to replicate because of time, money, ethics, or perhaps all three. In these cases, all we are left with is the data at hand, and being able to reproduce the published results from these data is critical.

Much heat has been generated over the question of whether scientists should be forced to make their data and methodology public. Journals such as Science and Nature have adopted data dissemination policies; the National Institutes of Health requires data sharing plans for some of its grants; and the Office of Management and Budget Circular A-110 requires that data generated under federally sponsored research be made available upon request if those data were used in developing a government agency action. While the debate over such dissemination policies is highly relevant, I think it can obscure and cause people to overlook an important question related to reproducible research.

One way I sometimes think of this question is as follows: Suppose a collaborator comes to you and says “I desperately want to make my research reproducible. What should I do?” I don’t mean to frame this as purely a hypothetical question—I have actually had people ask me this before.

The problem right now is that I don’t think proponents of reproducible research (myself included) have a good answer to this question. A typical response might be “make the code and data available”. Yes, but how? If we cannot come up with a concrete and coherent answer to this question for people who are willing to make their work reproducible, we cannot realistically expect to change the minds of people who are currently unwilling to make their research reproducible.

I think there are two important roadblocks that make it difficult to publish reproducible research. The first is the lack of a broad toolset that a wide range of researchers can use to assist them in publishing their data and methodology. There are a number of efforts out there to develop tools, but many of these tools either have important limitations or are only accessible to more sophisticated users (the Sweave/LaTeX combination comes to mind, although it is a great contribution). A related problem involves getting people to use tools that are already out there. For example, I believe the use of version control software is a critical aspect of reproducible research and there are many high-quality software packages available for all operating systems. I personally use git but many others would also fit the bill. I must say I’ve had limited success convincing people they need to use version control systems. I think the basic problem is that it involves learning Yet Another Software Package.

The second roadblock for reproducible research is distribution. Suppose I carefully keep track of all the code I use to analyze my data and am happy to give the code and data to others. How do I do that? Many knowledgeable people will set up a web site for themselves and post code and data on their own web pages. But demanding that everyone create a web site for distributing reproducible research is, in my opinion, a steep demand. Many researchers do not have this capability, and even if they did, it is not clear to me that web pages are the ideal medium for disseminating reproducible research. How much data analysis is done in your web browser?

The distribution problem can be addressed by creating some basic infrastructure. Analogous infrastructure already exists in other domains. Users of the R statistical system have the Comprehensive R Archive Network (CRAN), which is used to disseminate R packages (add-on functionality) to anyone around the world. In practice, there is no need to interact with the web site through a browser because R itself can fetch the packages from the Archive and install them without the user ever having to change applications. Similar facilities exist for Perl (CPAN) and TeX (CTAN). Of course, we cannot expect such resources to appear out of thin air. Developing a useful archive requires hardware and administrative time.

I have been trying to develop a system for R users that can be used to distribute reproducible research via a central repository. The software is an R package called ‘cacher’ and the associated repository is what I call the Reproducible Research Archive. The basic idea of the ‘cacher’ package is to take code that represents a data analysis and cache the code and associated data in a series of key-value databases. This “data analysis cache” can then be packaged and uploaded to the Archive. Each cache package is given a unique ID (via SHA-1) so that it can be referenced by others in a global fashion. On the other side of things, the ‘cacher’ package can download an available cache package and a user can run the code in the package to reproduce the results.
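
The general idea of storing an analysis in a key-value store and naming it by a SHA-1 digest can be sketched in a few lines. This is not the ‘cacher’ package or its API; it is a toy Python illustration, and the class `AnalysisCache`, its methods, and the example code string are all hypothetical.

```python
import hashlib
import json


class AnalysisCache:
    """Toy key-value store whose keys are SHA-1 digests of the stored content."""

    def __init__(self):
        self._store = {}

    def put(self, code, result):
        # Serialize deterministically so identical content gives an identical ID.
        payload = json.dumps(
            {"code": code, "result": result}, sort_keys=True
        ).encode("utf-8")
        key = hashlib.sha1(payload).hexdigest()
        self._store[key] = payload
        return key

    def get(self, key):
        payload = self._store[key]
        # Verify that the content still matches the ID it was filed under.
        if hashlib.sha1(payload).hexdigest() != key:
            raise ValueError("cache entry does not match its ID")
        return json.loads(payload.decode("utf-8"))


cache = AnalysisCache()
key = cache.put("mean(x)", {"mean": 2.5})   # hypothetical analysis step
entry = cache.get(key)
print(len(key), entry["result"])
```

Because the ID is derived from the content itself, anyone who holds the ID can check that what they downloaded is exactly what was published, which is what makes such an ID usable as a global reference.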

Not all of the above-mentioned functionality is complete, but many aspects of the ‘cacher’ package are available. There is also a paper in the Journal of Statistical Software that describes the package in greater detail. The advantage of the ‘cacher’ system is that R users have relatively little to learn: just a few functions. Of course, the disadvantage is that it is only available to R users, who are a minority of people conducting data analysis in the world.

There are, of course, other challenges that I haven’t mentioned that will need to be solved before reproducible research goes mainstream. The development of the necessary infrastructure (software and distribution media) is just one of them, but I think it is critical to adoption because less technical users need to be able to easily “plug in” to an existing framework without having to build a piece of it themselves. By learning from experiences in other domains, I think we can successfully build this infrastructure and bring reproducible research to a much wider audience.

Roger D. Peng
Department of Biostatistics
Johns Hopkins Bloomberg School of Public Health
