Sweave has been discussed here many times, but here’s a brief description for those just joining the discussion. Sweave is a tool for embedding R code inside LaTeX files, analogous to the way web technologies such as PHP or CGI embed scripting code in HTML. When you compile an Sweave file, the R code executes and the results (and optionally the source code) are inserted into the LaTeX output.
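For readers who haven’t seen the syntax, here is a minimal sketch of what an Sweave (.Rnw) file looks like; the chunk name, options, and computation are illustrative, not taken from any particular document.

```latex
\documentclass{article}
\begin{document}

The summary below is computed when the document is compiled:

% An R code chunk: "summary" is its label, echo=TRUE prints the source too.
<<summary, echo=TRUE>>=
x <- rnorm(100)
summary(x)
@

% \Sexpr{} inserts the value of an R expression inline in the text.
The sample mean was \Sexpr{round(mean(x), 2)}.

\end{document}
```

Running `Sweave("report.Rnw")` in R executes the chunks and writes out a plain LaTeX file, which is then compiled to PDF in the usual way.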
Sweave has the potential to make statistical analyses more reproducible. But I doubt many realize its vulnerabilities. The Sweave files are likely to have implicit dependencies on R session state or data located outside the file. You don’t really know that the output is reproducible until it’s compiled by someone else in a fresh environment.
My proposal is a service that lets you submit an Sweave file and get back the resulting LaTeX and PDF output. An extension to this would allow users to also upload data files along with their Sweave file so not all data would have to be in the Sweave file itself. For good measure, there should be some checksums to certify just what input went into producing the output.
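The checksum idea could be as simple as the build service recording a digest of every input file alongside the outputs it produced. Here is a minimal sketch in Python; the function names and manifest shape are hypothetical, not part of any existing service.

```python
import hashlib

def sha256sum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def manifest(paths):
    """Map each file path to the digest of its contents."""
    return {str(p): sha256sum(p) for p in paths}
```

A service could publish `manifest(["report.Rnw", "data.csv"])` together with the generated PDF, so anyone can later verify exactly which inputs produced which output.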
Here’s one way I see this being used. Suppose you’re about to put a project on the shelf for a while. For example, you’re about to submit a paper to a journal. You may need to come back and make changes six months later. You think about the difficulty you’ve had in the past with these sorts of edits and want to make sure it doesn’t happen again. So you submit your Sweave document to the build server to verify that it is self-contained.
Here’s another scenario. Suppose you’ve asked someone whom you supervise to produce a report. Instead of letting them give you a PDF, you might insist they give you an Sweave file that you then run through the build service to make your own PDF. That way you can have the whole “but it works on my machine” discussion now rather than having it months later, after the person who made the report has a new computer or a new job.
I just found out that Keith Baggerly will be speaking at Rice University this afternoon. His talk is entitled “Cell Lines, Microarrays, Drugs and Disease: Trying to Predict Response to Chemotherapy.” Here is part of the seminar announcement most relevant to reproducibility.
In this talk, we will describe how we have analyzed the data, and the implications of the ambiguities for the clinical findings. We will also describe methods for making such analyses more reproducible, so that progress can be made more steadily.
The talk will be at 4 PM in Keck Hall room 102.
The current issue of Computing in Science and Engineering (CiSE) is a special issue on reproducible research, edited by two pioneers in the field: Jon Claerbout and Sergey Fomel. They have assembled a great set of articles from experts with a lot of first-hand, personal reproducible research experience, so I would highly recommend it to my fellow researchers!
I got a pointer earlier this week to a New York Times article about R. It’s a very interesting article about the use of R in scientific communities and industrial research, mainly for statistical analysis. R is open source software, so it is free and has already benefited from contributions by many authors. And (although I haven’t used it myself yet) it is a great tool for reproducible research. Using the package Sweave, authors can write a single document containing their article and the R code to reproduce the results and put them in place. This ensures that all the material is in a single place.
It also shows something about the amazing power of open source software developed by a community of authors (and typically users at the same time).
Michael Nielsen posted an excellent article this morning, Three myths of scientific peer review. He points out that peer review has only become common in the last 40 or 50 years. Maybe a few years from now someone will write an article looking back at how reproducible research came to be de rigueur. No one questions whether peer review is a good thing, though many people have complaints about the current system and argue about ways to make it better. Maybe the same will be said for reproducible research some day.
When I was in college, a friend of mine told me he liked to take his code out for a walk every now and then. By that he meant recompiling and running all of his programs. At the time I thought that was unnecessary. If a program compiled and ran the last time you touched it, why shouldn’t it compile and run now? He simply said that I might be surprised.
Even when your source code isn’t changing, the environment around it is changing. When I was in college, computers didn’t have automatic weekly updates, but they changed often enough that taking your code out for a walk now and then made sense. Now it makes even more sense. See Jon Claerbout’s story along these lines.
The following articles on reproducible research are included.

Guest Editors’ Introduction: Reproducible Research
  Sergey Fomel (University of Texas at Austin) and Jon F. Claerbout (Stanford University)

Reproducible Research in Computational Harmonic Analysis
  David L. Donoho (Stanford University), Arian Maleki (Stanford University), Inam Ur Rahman (Apple Computer), Morteza Shahram (Stanford University), and Victoria Stodden (Harvard University)

Python Tools for Reproducible Research on Hyperbolic Problems
  Randall J. LeVeque (University of Washington)

Distributed Reproducible Research Using Cached Computations
  Roger D. Peng and Sandrah P. Eckel (Johns Hopkins Bloomberg School of Public Health)

The Legal Framework for Reproducible Scientific Research: Licensing and Copyright
  Victoria Stodden (Harvard University)
I seem to be spending quite some time on the web lately… After my post about the lifetime of URLs, here’s one about domain names and reproducibility. Looking around recently, I noticed that there are quite a few websites and domain names related to reproducible research.
reproducibleresearch.org is an overview website by John D. Cook containing links to reproducible research projects, articles about the topics, and relevant tools. It also contains a blog about reproducible ideas.
reproducibleresearch.com is owned by the people at Blue Reference, who created Inference for Office, a commercial tool to perform reproducible research from within Microsoft Office.
reproducibility.org is used by Sergey Fomel and his colleagues as home for their Madagascar open source package for reproducible research experiments.
reproducible.org is a reproducible research archive maintained by Roger D. Peng at the Johns Hopkins Bloomberg School of Public Health, with the goal of hosting reproducible research packages.
Quite a range of domain names containing the word “reproducible” (or a derivative), if you ask me! And I haven’t even started on the Open Research or Research 2.0 sites. Let’s hope this also means that research itself will soon see a big boost in reproducibility!