Last week Greg Wilson asked me what I would do if I had one hour to teach a group about reproducible research. He said to assume that the group is already convinced of the need for reproducibility.
First here are some thoughts on what I’d say if the group had not given much thought to reproducibility. I would start impersonal and then become more personal. I’d start by relating some horror stories of how someone else’s work was impossible to reproduce and contained false conclusions. It’s easy to gang up on some third party researcher, griping about how sloppy someone not in the room was in their research. This plants the idea that at least some people need to think more about reproducibility. Then I’d transition by talking about times when I’ve had difficulty reproducing my own work. Then I would try to convince them that their own work is probably not reproducible or at least not easily reproducible. So my outline would be they have problems, I have problems, you have problems.
I believe that convincing people of the need to be concerned about reproducibility is most of the problem. If people are highly motivated, they will come up with their own ways to make their work easier to reproduce and they will gladly take advantage of tools they are introduced to.
To Greg’s original question, now what? First I’d expound the merits of version control systems. You can’t possibly reproduce software if you can’t put your hands on the source code, and you can’t reproduce software as it existed at a particular point in time without revision history. Then I’d emphasize that version control is necessary but not sufficient. When people first understand version control, they tend to think it takes care of all their reproducibility problems when in fact it’s just the first step. I’d share some war stories of projects that have taken many hours to build even though we had all the source code. (If I had a semester rather than an hour, I’d let them experience this for themselves rather than just telling them about it by bringing in some outside projects for them to rebuild.) I’d also emphasize that it’s not enough to put code in version control: data needs to be versioned as well.
Once they grok version control, I’d discuss automation. When a process is 99% automated and 1% manual, the reproducibility problems come from the 1% that is manual. The principle behind many reproducibility tools is automating steps that are otherwise manual, undocumented, and error-prone. (See Programming the last mile.)
Finally, I’d emphasize the need for auditing. As I pointed out in an earlier post “You cannot say whether your own research is reproducible. It’s reproducible when someone else can reproduce it.” Again if I had a semester rather than an hour, I’d let them experience this by having them reproduce each other’s assignments. I could hear it now: “What do you mean you can’t reproduce my homework? It’s all right there!”