Category Archives: Uncategorized

CiSE special issue on reproducible research

Computing in Science and Engineering has just come out with a special issue on reproducible research.  (When you first visit the link, you need to click on “vol 11.” The page is doing some fancy JavaScript that makes it impossible to link directly to the issue.)

The following articles on RR are included.

Guest Editors’ Introduction: Reproducible Research

Sergey Fomel, University of Texas at Austin
Jon F. Claerbout, Stanford University

Reproducible Research in Computational Harmonic Analysis
David L. Donoho, Stanford University
Arian Maleki, Stanford University
Inam Ur Rahman, Apple Computer
Morteza Shahram, Stanford University
Victoria Stodden, Harvard University

Python Tools for Reproducible Research on Hyperbolic Problems
Randall J. LeVeque, University of Washington

Distributed Reproducible Research Using Cached Computations
Roger D. Peng, Johns Hopkins Bloomberg School of Public Health
Sandrah P. Eckel, Johns Hopkins Bloomberg School of Public Health

The Legal Framework for Reproducible Scientific Research: Licensing and Copyright
Victoria Stodden, Harvard University

BioMed Critical Commentary

I just found out about BioMed Critical Commentary. Here’s an excerpt from the site’s philosophy statement.

The current system of scientific journals serves well certain constituencies: the advertisers, the journals themselves, and the authors. It is the underlying philosophy of BioMed Critical Commentary to serve the readers in preference to any other constituency.

In particular, this site could serve as a public forum for criticism that journals are not eager to publish. It could be a good place to discuss specific examples of irreproducible analyses.

Three reasons to distrust microarray results

Even when lab work and statistical analysis are carried out perfectly, microarray experiment conclusions have a high probability of being incorrect for purely probabilistic reasons. Of course lab work and statistical analysis are not carried out perfectly. I went to a talk earlier this week that demonstrated reproducibility problems coming both from the wet lab and from the statistical analysis.

The talk presented a study that supposedly discovered genes that can distinguish those who will respond to a certain therapy from those who will not. On closer analysis, the paper actually demonstrated that it is possible to distinguish microarray experiments conducted on one day from experiments conducted on another day. That is, batch effects from the lab were much larger than differences between patients who did and did not respond to therapy. I hear that this is typical unless gene expression levels vary dramatically between subgroups.

The talk also discussed problems with reproducing the statistical analysis. As is so often the case, data were mislabeled; in fact, three quarters of the samples were mislabeled. Simply keeping up with indexes is the biggest barrier to reproducibility. It is shocking how often studies simply do not analyze the data they say they analyzed. This seems like a simple matter to get right; perhaps people give little attention to it precisely because it seems so simple.
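One way to guard against this kind of index error is an automated consistency check between the sample annotation table and the columns of the data matrix, run before any analysis. Here is a minimal sketch in Python; the file layout and the `sample_id` field are hypothetical, not taken from the study discussed above:

```python
def check_labels(annotation_rows, data_header):
    """Verify that sample IDs in the annotation table match the
    columns of the data matrix, in the same order. Raises ValueError
    on any mismatch rather than silently analyzing the wrong data."""
    ann_ids = [row["sample_id"] for row in annotation_rows]
    data_ids = data_header[1:]  # assume first column holds gene names
    if len(ann_ids) != len(data_ids):
        raise ValueError(
            f"{len(ann_ids)} annotated samples vs {len(data_ids)} data columns"
        )
    mismatches = [
        (i, a, d)
        for i, (a, d) in enumerate(zip(ann_ids, data_ids))
        if a != d
    ]
    if mismatches:
        raise ValueError(f"label mismatches at (position, annotation, data): {mismatches}")
    return True

# Hypothetical example: annotation and data matrix agree
annotation = [{"sample_id": "S1"}, {"sample_id": "S2"}, {"sample_id": "S3"}]
header = ["gene", "S1", "S2", "S3"]
check_labels(annotation, header)  # passes silently
```

A check like this takes minutes to write, and making it a mandatory first step of the analysis script is exactly the sort of bookkeeping discipline the talk argued for.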

So, three reasons to be skeptical of microarray experiment conclusions:

  1. High probability of false discovery
  2. Statistical reproducibility problems
  3. Physical reproducibility problems

Seven presentations on RR

Sergey Fomel just told me about a special session on reproducible research at the “Berlin 6 Open Access Conference” in Düsseldorf, Germany. Presentations from the conference are available online.

Sergey Fomel and Sünje Dallmeier-Tiessen gave presentations in geophysics. Patrick Vandewalle and Jelena Kovacevic gave presentations in signal processing. Mark Liberman, Kai von Fintel, and Steven Krauwer gave presentations related to language and technology.

Video of the presentations is available here.

The Fastware project

Thomas Guest has a new blog post Books, blogs, comments and code samples discussing the challenges of writing a book that contains code samples, may be rendered to multiple devices as well as paper, etc. He points to a project by author Scott Meyers called Fastware that explores ways of meeting these challenges. I haven’t had time to explore Fastware yet, but it sounds like it is concerned with some of the same problems that come up in reproducible research.

Biggest barrier to reproducibility

My previous post discussed Keith Baggerly and his efforts as a “forensic bioinformatician.”

In that article, the reporter asks Keith to name the biggest problem he sees in trying to reproduce results.

It’s not sexy, it’s not higher mathematics. It’s bookkeeping … keeping track of the labels and keeping track of what goes where. The thing that we have found repeatedly in our analyses is that it actually is one of the most difficult steps in performing some of these analyses.

I’ve seen presentations where Keith discusses specific bookkeeping errors. Quite often columns get transposed in spreadsheets, so researchers are not analyzing the data they say they are analyzing.

Forensic bioinformatics

The October 2008 issue of AMSTAT News has an article entitled “Forensic Bioinformatician Aims To Solve Mysteries of Biomarker Studies.” The article is about Keith Baggerly of M. D. Anderson Cancer Center and his efforts to reproduce analyses in bioinformatics papers.

The article quotes David Ransohoff, professor of medicine at UNC Chapel Hill, saying this about Keith Baggerly.

I think Keith is doing a wonderful and needed job … But the fact that we need people like him means that our journals are failing us. The kinds of things that Keith spends time finding out — what [the researchers] actually do — that’s what methods and results are supposed to be for in journals. … We need to figure out how to do science without needing people like Keith.

One of the reasons for lack of reproducibility is that journals press authors for space and so statistics sections get abbreviated. (Why not put the full details online?) Another reason is that bioinformatics articles are inherently cross-disciplinary and it may be that no single person is responsible for or even understands the entire article.

Embedding .NET code in Office documents

I recently heard about some interesting tools from Blue Reference. I haven’t had a chance to try them out yet, but they look promising.

Sweave has received a fair amount of attention with regard to reproducibility because it lets you embed R code in LaTeX. Code stays with the presentation document, reducing the chance of error and increasing transparency. However, the number of people who use R and LaTeX is small, and asking people to learn both packages before they can do reproducible research is not going to fly. The number of people who use C# and Microsoft Word is orders of magnitude larger than the number of folks who use R and LaTeX.
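For readers unfamiliar with Sweave, a minimal document looks like the sketch below. R code sits between `<<>>=` and `@` markers inside otherwise ordinary LaTeX, and running the file through Sweave replaces each chunk with the code and its computed output, so the numbers in the paper come straight from the analysis:

```
\documentclass{article}
\begin{document}
The mean of 100 standard normal draws:
<<>>=
set.seed(1)   # chunk body is R code, not LaTeX
x <- rnorm(100)
mean(x)
@
\end{document}
```

The same document-plus-code-chunks pattern is what the tools below bring to Word users.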

It looks like Blue Reference’s product Inference for .NET lets .NET programmers do the kinds of things Sweave lets R programmers do: embed .NET code in Microsoft Office documents. They also make a product, Inference for MATLAB, for embedding MATLAB code in Office documents.

Python developers who don’t think of themselves as .NET developers might want to use Inference for .NET to embed Python code in Word documents via IronPython. Ruby developers might want to use IronRuby similarly.

Programming is understanding

Bjarne Stroustrup’s book The C++ Programming Language begins with the quote

Programming is understanding.

Many times I’ve thought I understood something until I tried to implement it in software. Then the process of writing and testing the software exposed my lack of understanding.

One thing that can make reproducible research difficult is that you have to deeply understand what you’re doing. Making work reproducible may require automating steps that you do not fully understand; you may not realize you don’t understand them until you try to automate them.

Stated more positively, attempts to make research reproducible can lead to new insights into the research itself.

Related post: Paper doesn’t abort