Sweave has been discussed here many times, but here’s a brief description for those just joining the discussion. Sweave is a tool for embedding R code inside LaTeX files, analogous to the way web development languages such as PHP or CGI embed scripting code in HTML. When you compile an Sweave file, the R code executes and the results (and optionally the source code) are inserted into the LaTeX output.
Sweave has the potential to make statistical analyses more reproducible. But I doubt many realize its vulnerabilities. The Sweave files are likely to have implicit dependencies on R session state or data located outside the file. You don’t really know that the output is reproducible until it’s compiled by someone else in a fresh environment.
My proposal is a service that lets you submit an Sweave file and get back the resulting LaTeX and PDF output. An extension to this would allow users to also upload data files along with their Sweave file so not all data would have to be in the Sweave file itself. For good measure, there should be some checksums to certify just what input went into producing the output.
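Here is a minimal sketch, in Python, of the bookkeeping such a service might do before it builds anything. The names `stage_submission` and `build_command` are hypothetical, and running the build through `Rscript --vanilla` with `utils::Sweave` is my assumption about one reasonable way to get a fresh R session, not a description of any existing service.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_submission(sweave_file: Path, data_files: list[Path]) -> tuple[Path, dict[str, str]]:
    """Copy the submitted files into an empty directory and checksum them.

    Only the files named here are visible to the build, so a dependency
    on anything else should surface as a build error.
    """
    workdir = Path(tempfile.mkdtemp(prefix="sweave-build-"))
    manifest = {}
    for source in [sweave_file, *data_files]:
        target = workdir / source.name
        shutil.copy(source, target)
        manifest[source.name] = sha256_of(target)
    return workdir, manifest


def build_command(sweave_name: str) -> list[str]:
    """Command a server might run inside the staged directory (an assumption).

    --vanilla starts R without a saved workspace or user profile.
    """
    return ["Rscript", "--vanilla", "-e", f"utils::Sweave('{sweave_name}')"]
```

The manifest returned by `stage_submission` is the record of exactly which inputs went in; a service could return it alongside the LaTeX and PDF output.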
Here’s one way I see this being used. Suppose you’re about to put a project on the shelf for a while. For example, you’re about to submit a paper to a journal. You may need to come back and make changes six months later. You think about the difficulty you’ve had in the past with these sorts of edits and want to make sure it doesn’t happen again. So you submit your Sweave document to the build server to verify that it is self-contained.
Here’s another scenario. Suppose you’ve asked someone whom you supervise to produce a report. Instead of letting them give you a PDF, you might insist they give you an Sweave file that you then run through the build service to make your own PDF. That way you can have the whole “but it works on my machine” discussion now rather than having it months later after the person who made the report has a new computer or a new job.
It’s worth recalling that PDF was conceived as a container mechanism for streamlining document workflows. Therefore there is nothing to stop you asking for a self-replicating PDF. This is a PDF that contains everything needed to re-generate itself and the computational work it reports. In your case this would amount to an Sweave program embedding itself, and whatever data files it uses, into the PDF report it generates. The “build server” would then check the integrity of the report by extracting the embedded files and processing them to see if it could replicate the supplied PDF. Thus the “build server” is more of a “check server.” And only PDFs which check out (i.e. can actually replicate themselves) would be allowed out into the wild.
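The check itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `check_report` is a hypothetical name, the embedded files are assumed to have been extracted already, and the `rebuild` step is passed in as a callable because how the Sweave source gets compiled is left open here.

```python
import hashlib
from typing import Callable, Mapping


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def check_report(
    supplied_pdf: bytes,
    embedded_files: Mapping[str, bytes],
    rebuild: Callable[[Mapping[str, bytes]], bytes],
) -> bool:
    """Return True if rebuilding from the embedded files reproduces the PDF.

    `rebuild` is a placeholder for the real build step (hypothetical):
    it takes the extracted files and returns the bytes of the new PDF.
    """
    rebuilt_pdf = rebuild(embedded_files)
    return digest(rebuilt_pdf) == digest(supplied_pdf)
```

A byte-for-byte comparison like this only works if the build is deterministic. PDF producers commonly write a creation date into the file, so a real check server would probably need to pin or ignore such fields before comparing.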
I demonstrated a self-replicating PDF at Burton Wendroff’s 70th birthday celebration, March 8th 2000; see http://www-troja.fjfi.cvut.cz/~liska/bbw . My application was computational fluid dynamics, but the associated software plumbing is independent of the end-application, so if anyone is interested, I could easily put an Sweave example together.
Here it’s worth noting that PDF has come a very long way since 2000 and a new, exciting opportunity for reproducible research is appearing on the horizon. Specifically, it is now notionally possible to compile R using Adobe’s Alchemy SDK so as to produce a SWF file that could be run directly inside a PDF. I say “notionally” because Alchemy-generated SWFs require Flash Player 10, and Acroread 9 currently ships with Flash Player 9. Also, as Alchemy is still in its infancy, it can easily be stumped by applications that work with shared objects.
But fast forwarding to 2018, I have every confidence that PDF reports will routinely house virtualized machines that allow the interested reader to sample the recorded work firsthand. The obvious market for such PDFs will be electronic textbooks. But the associated infrastructure, as it matures, will be a tremendous boon for true “reproducible research.”