
Where do you want to share?

Given that you are reading this blog, I assume you are into sharing your code and data. However, what is not so clear to me is where we should be sharing these data (assuming data includes code, which people often seem to forget).

  • On our own personal web page? Seems like an excellent place to me, but too many of those personal sites are extremely short-lived. I notice this every time I update links.
  • On our institution’s publication pages? That would probably be my preferred choice at this moment. Often with a longer life-span than personal webpages, and still “close enough to yourself”. Some issues arise with people working at a company (but then, are you often allowed to share code/data in such a situation?), or with people moving from one institution to another, but those all seem fairly limited compared to some of the alternatives.
  • On the publisher’s web pages? That would make it consistent with the related publication. However, I’m not sure I want to transfer ownership of my code and data to the publisher as well.
  • On “social media” such as ResearchGate, or Academia.edu? At first I was enthusiastic about these, but I am starting to have my doubts. Who is behind these sites? How do they plan to make money from my data? Now that some of them have started spamming me with e-mails, asking whether I have questions for the authors of a paper I downloaded, I have become even more skeptical.
  • Any other suggestions?

Maybe I am too critical about this, or too old-fashioned. Or just too commercially oriented, and not open enough to share with everyone potentially interested in my work. Who can tell?

Elsevier Executable Paper Grand Challenge

On two recent occasions, I heard about Elsevier’s Executable Paper contest. The intention was to showcase concepts for the next generation of publications. Or as Elsevier put it:

Executable Paper Grand Challenge is a contest created to improve the way scientific information is communicated and used.

It asks:

  • How can we develop a model for executable files that is compatible with the user’s operating system and architecture and adaptable to future systems?
  • How do we manage very large file sizes?
  • How do we validate data and code, and decrease the reviewer’s workload?
  • How do we support the registering and tracking of actions taken on the ‘executable paper’?

By now, the contest is over, and the winners have been announced:

First Prize: The Collage Authoring Environment by Nowakowski et al.
Second Prize: SHARE: a web portal for creating and sharing executable research papers by Van Gorp and Mazanek.
Third Prize: A Universal Identifier for Computational Results by Gavish and Donoho.

Congratulations to all! At the AMP Workshop, where I am now, we were lucky to have a presentation about the work by Gavish and Donoho, which sounds very cool! I also know the work by Van Gorp and Mazanek, which uses virtual machines to allow others to reproduce results. I still need to look into the first-prize work…

If any of this sounds interesting to you, and I believe it should, please take a look at the Grand Challenge website, and also check out some of the other participants’ contributions!

Here at the workshop, we also had an interesting related presentation yesterday by James Quirk about all that can be done with a PDF. Quite impressive! For examples, see his Amrita work and webpage.

What is the best reproducible research?

What is best research practice in terms of reproducibility? At the recent workshop in Ås (Norway), I had a discussion with Marc-Oliver Gewaltig, similar to discussions I had earlier with some other colleagues. So I decided to put it up here. All feedback is welcome!

The discussion boils down to the following question: is it better (in terms of reproducibility) to make code and data available online, allowing readers to repeat your experiments (or simulations, as Marc-Oliver would call them) and obtain the same results, or to describe your theory (model, in Marc-Oliver’s terminology) in sufficient detail that readers can verify your results by re-implementing your experiments and checking that they obtain the same thing?

I personally believe both approaches have their pros and cons. With the first one, readers can download the related code and data, and very easily verify that they obtain the same results as presented in the paper. If they want to analyze things further, there is already a first implementation available to build on, or to test on other data. However, that certainly doesn’t take away the need for a good and clear description in the paper!

With the second approach, one avoids the risk that a bug in the code producing the results goes unnoticed, simply because readers can just “double-click” to repeat the experiment without looking any further. It allows a thorough verification of the presented concept/theory, as the reader independently re-implements the work and checks the results. I believe certain standardization bodies, such as MPEG, use this approach to make sure that descriptions are sufficiently precise.

Personally, I think the second approach is the better, more thorough one in an ideal world. Currently, I prefer the first, because most people won’t go to the depth of re-implementing things, and the first approach already gives those people something. Something more than just the paper, allowing them to get their hands dirty. And more interested readers may still re-implement the work, or start analyzing the code in detail.

On doing research

I was just reading the following two articles/notes. While they are not entirely about reproducible research, I think they reflect well the worries that many researchers have about current “publish or perish” research practices. I am not sure I agree with everything in them, but they do make a number of good remarks.

D. Geman, Ten Reasons Why Conference Papers Should be Abolished, Johns Hopkins University, Nov. 2007.

Y. Ma, Warning Signs of Bogus Progress in Research in an Age of Rich Computation and Information, ECE, University of Illinois, Nov. 2007.


Climate science

Just like many other domains, climate science is a mixture of theory, models, and empirical results. Often different scientists work on the different parts (theory/models/experiments), each claiming their part to be (by far) the most important of the three. A nice analysis is given on the IEEE Spectrum site. Unlike in many other domains, it seems hard to me (not being a climate scientist) to do a lot of small experiments to validate the models. This makes it even more important to be open about the precise models used, their parameters, and the data used to validate those models.

We’ve only got one planet Earth to validate models on. And it takes soooo long to check whether a model is correct, that we’d better be open about it, collaborate, check each other’s assumptions, and make sure it’s the best model we can make!

For some more discussion on the recent climate study scandal and reproducible research, see also Victoria Stodden’s blog (or also here).

ORCID: on being a number

I just learned about ORCID: the Open Researcher and Contributor ID initiative. Its goal is to provide a unique ID for every researcher, and in that way provide better traceability of all the work by a researcher. It should avoid ambiguity between authors with the same name, as well as problems caused by typos. They even intend to include not only ‘standard’ conference/journal publications, but also more ‘exotic’ research output such as data sets, blog posts, etc. The initiative is supported by a large number of major publishers, such as Springer, Elsevier and Nature.

A very nice initiative, which should solve a few problems. However, I am not sure how it is supposed to work in practice. Does it mean that we should soon add an ORCID number (without typos) below the title and the author name? And cite works by citing the ORCID and the DOI (digital object identifier)? And will we write these numbers with fewer errors than the author names now?
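
Regarding the typo worry: as far as I know, ORCID iDs carry a check digit computed with the ISO 7064 MOD 11-2 scheme, so most typing errors are caught automatically. Below is a minimal sketch in Python; the helper names and the sample iDs are purely for illustration.

```python
# Minimal sketch: ISO 7064 MOD 11-2 check character, as used by ORCID iDs.
# The last character of an iD is a checksum over the first 15 digits, so most
# single-digit typos are detected automatically when the iD is entered.

def orcid_check_char(base_digits: str) -> str:
    """Compute the check character for the first 15 digits of an ORCID iD."""
    total = 0
    for d in base_digits:
        total = (total + int(d)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Validate a full iD of the form '0000-0002-1825-0097'."""
    digits = orcid.replace("-", "")
    return len(digits) == 16 and orcid_check_char(digits[:15]) == digits[15]

print(is_valid_orcid("0000-0002-1825-0097"))  # True: check digit 7 matches
print(is_valid_orcid("0000-0002-1825-0096"))  # False: a typo in the last digit
```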

It does make me think of that other unique number, the DOI, which was introduced to uniquely identify a document (a publication, as far as I have seen it used). I have seen it for some time now when I look up articles, and I have no doubt it uniquely identifies those articles, but what is it actually used for? Maybe it has its uses… but I haven’t seen them yet.

If you know of practical cases where the DOI is used, feel free to comment! (Others too, of course.)

Repository server for publications

I think it’s probably a lot easier, and more consistent, if instead of making a web page for each RR paper we publish (http://lcavwww.epfl.ch/reproducible_research), we have a setup (a bit) like Infoscience, where everyone can enter publications by filling in the required and optional fields. I would like to build such a setup based on EPrints (http://www.eprints.org/software/) and make it public, so that other labs/universities can also easily set up a similar server. We will probably have the people from EPrints develop this system, but for that we need accurate requirements… So your comments on this would be very welcome!

I was thinking about the following fields (a rough sketch of such a record follows right after the list):
– standard publication fields (title, author, reviewing status, journal, volume, number, pages, year, DOI, abstract, keywords, PDF, publisher, official URL)
– specifically for RR:
* code and data (in a zip archive, also specifying the type of code), mandatory
* tested configurations, mandatory
* contact e-mail address, mandatory
* figures, optional
– additional features for readers (cf. http://clare.eprints.org/10/ for an example of the latter)
* a check box saying ‘I have tested this code and it runs/does not run’
* a check box saying ‘I was/was not able to reproduce the results described in this paper’
* a field where anyone can add comments
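
To make this concrete, here is a rough sketch (in Python, with purely made-up field names and values) of what a single record on such a server might contain:

```python
# Purely illustrative sketch of one record on such a repository server.
# Field names follow the list above; all values are made up.
record = {
    # standard publication fields
    "title": "An Example Reproducible Paper",
    "authors": ["A. Author", "B. Author"],
    "reviewing_status": "published",
    "journal": "Some Journal",
    "volume": 12,
    "number": 3,
    "pages": "45-56",
    "year": 2011,
    "doi": "10.1234/example.doi",
    "abstract": "One-paragraph abstract goes here.",
    "keywords": ["reproducible research", "example"],
    "pdf": "paper.pdf",
    "publisher": "Some Publisher",
    "official_url": "http://publisher.example.org/paper",
    # RR-specific fields
    "code_and_data": "code_and_data.zip",                       # mandatory (zip archive)
    "code_type": "MATLAB",                                      # mandatory (type of code)
    "tested_configurations": ["MATLAB R2011a, Linux 64-bit"],   # mandatory
    "contact_email": "first.author@university.example",         # mandatory
    "figures": ["figure1.eps", "figure2.eps"],                  # optional
    # additional features for readers
    "reader_feedback": [
        {
            "name": "C. Reader",          # optional, see the question below
            "date": "2011-07-01",
            "code_runs": True,            # 'I have tested this code and it runs'
            "results_reproduced": True,   # 'I was able to reproduce the results'
            "comment": "Ran out of the box on my machine.",
        },
    ],
}

# e.g. checking that the mandatory RR fields are filled in:
for field in ["code_and_data", "tested_configurations", "contact_email"]:
    print(field, "->", record[field])
```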

Any comments? More/less things needed?
Some specific questions:
– should we link these ‘additional features’ to a name and/or date, so that we can avoid the author clicking 10 times? 😉
– should we separate code and data? Data might get quite large, while code is generally small.