Pierre de Buyl's homepage

License: CC-BY

ActivePapers is a technology developed by Konrad Hinsen to store code, data and documentation with several benefits: storage in a single HDF5 file, internal provenance tracking (what code created what data/figure, with a Make-like conditional execution) and a containerized execution environment.

Implementations for the JVM and for Python are provided by the author. In this article, I go over the first steps of creating an ActivePaper. Being a regular user of Python, I cover only this language.

An overview of ActivePapers

First, a "statement of fact": An ActivePaper is a HDF5 file. That is, it is a binary, self-describing, structured and portable file whose content can be explored with generic tools provided by the HDF Group.

The ActivePapers project is developed by Konrad Hinsen as a vehicle for the publication of computational work. This description is a bit short and does not convey the depth that has gone into the design of ActivePapers, the ActivePapers paper will provide more information.

ActivePapers come, by design, with restrictions on the code that is executed. For instance, only Python code (in the Python implementation) can be used, with the scientific computing module NumPy. All data is accessed via the h5py module. The goals behind these design choices are related to security and to a good definition of the execution environment of the code.

Creating an ActivePaper

The tutorial on the ActivePapers website start by looking at an existing ActivePaper. I'll go the other way around, as I found it more intuitive. Interactions with an ActivePaper are channeled by the aptool program (see the installation notes).

Currently, ActivePapers lack a "hello, world" program, so here is mine. ActivePapers work best when you dedicate a directory to a single ActivePaper. You may enter the following in a terminal:

mkdir hello_world_ap                 # create a new directory
cd hello_world_ap                    # visit it
aptool -p hello_world.ap create      # This lines create a new file "hello_world.ap"
mkdir code                           # create the "code" directory where you can
                                     # write program that will be stored in the AP
echo "print 'hello, world'" > code/hello.py # create a program
aptool checkin -t calclet code/hello.py     # store the program in the AP

That's is, you have created an ActivePaper!

You can observe its content by issuing

aptool ls                            # inspect the AP

And execute it

aptool run hello                     # run the program in "code/hello.py"

This command looks into the ActivePapers file and not into the directories visible in the filesystem. The filesystem acts more like a staging area.

A basic computation in ActivePapers

The "hello, world" program above did not perform a computation of any kind. An introductory example for science is the computation of the number $\pi$ by the Monte Carlo method.

I will now create a new ActivePaper (AP) but comment on the specific ways to define parameters, store data and create plots. The dependency on the plotting library matplotlib has to be given when creating the ActivePaper:

mkdir pi_ap
cd pi_ap
aptool -p pi.ap create -d matplotlib

To generate a repeatable result, I store the seed for the random number generator

aptool set seed 1780812262
aptool set N 10000

The line above store a data element in the AP that is of type integer. The value of seed can be accessed in the Python code of the AP.

I will create several programs to mimic the workflow of more complex problems: one to generate the data, one to analyze the data and one for generating a figure.

The first program is generate_random_numbers.py

import numpy as np
from activepapers.contents import data

seed = data['seed'][()]
N = data['N'][()]   
np.random.seed(seed)
data['random_numbers'] = np.random.random(size=(N, 2))

Apart from importing the NumPy module, I have also imported the ActivePapers data

from activepapers.contents import data

data is a dict-like interface to the content of the ActivePaper and so only work in code that is checked in the ActivePaper and executed with aptool. data can be used to read values, such a the seed and number of samples, and to store data, such as the samples here.

The [()] returns the value of scalar datasets in HDF5. To have more information on this, see the dataset documentation of h5py.

The second program is compute_pi.py

import numpy as np
from activepapers.contents import data

xy = data['random_numbers'][...]
radius_square = np.sum(xy**2, axis=1)
N = len(radius_square)
data['estimator'] = np.cumsum(radius_square < 1) * 4 / np.linspace(1, N, N)

And the third is plot_pi.py

import numpy as np
import matplotlib
matplotlib.use('PDF')
import matplotlib.pyplot as plt
from activepapers.contents import data, open_documentation

estimator = data['estimator']
N = len(estimator)
plt.plot(estimator)
plt.xlabel('Number of samples')
plt.ylabel(r'Estimation of $\pi$')
plt.savefig(open_documentation('pi_figure.pdf', 'w'))

Notice:

The setting of the PDF driver for matplotlib before importing matplotlib.pyplot.
The use of open_documentation. This function provides file descriptors that can read and write binary blurbs.

Now, you can checkin and run the code

aptool checkin -t calclet code/*.py
aptool run generate_random_numbers
aptool run compute_pi
aptool run plot_pi

Concluding words

That's it, we have created an ActivePaper and ran code with it.

For fun: issue the command

aptool set seed 1780812263

(or any number of your choosing that is different from the previous one) and then

aptool update

ActivePapers handle dependencies! That's is, everything that depends on the seed will be updated. That include the random numbers, the estimator for pi and the figure. To see the update, check the creation times in the ActivePaper

aptool ls -l

It is good to know that ActivePapers have been used as companions to research articles! See Protein secondary-structure description with a coarse-grained model: code and datasets in ActivePapers format for instance.

You can have a look at the resulting files that I uploaded to Zenodo: doi:10.5281/zenodo.55268

References

ActivePapers paper K. Hinsen, ActivePapers: a platform for publishing and archiving computer-aided research, F1000Research (2015), 3 289.

ActivePapers website The website for ActivePapers