ActivePapers is a technology developed by Konrad Hinsen to store code, data and documentation with several benefits: storage in a single HDF5 file, internal provenance tracking (what code created what data/figure, with a Make-like conditional execution) and a containerized execution environment.
Implementations for the JVM and for Python are provided by the author. In this article, I go over the first steps of creating an ActivePaper. Being a regular user of Python, I cover only this language.
An overview of ActivePapers
First, a "statement of fact": an ActivePaper is an HDF5 file. That is, it is a binary, self-describing, structured and portable file whose content can be explored with generic tools provided by the HDF Group.
The ActivePapers project is developed by Konrad Hinsen as a vehicle for the publication of computational work. This description is brief and does not convey the depth that has gone into the design of ActivePapers; the ActivePapers paper provides more information.
ActivePapers come, by design, with restrictions on the code that is executed. For instance, only Python code (in the Python implementation) can be used, with the scientific computing module NumPy. All data is accessed via the h5py module. The goals behind these design choices are related to security and to a good definition of the execution environment of the code.
Creating an ActivePaper
The tutorial on the ActivePapers website starts by looking at an existing ActivePaper. I'll go the other way around, as I found it more intuitive. Interactions with an ActivePaper are channeled through the
aptool program (see the
Currently, ActivePapers lack a "hello, world" program, so here is mine. ActivePapers work best when you dedicate a directory to a single ActivePaper. You may enter the following in a terminal:
    mkdir hello_world_ap                      # create a new directory
    cd hello_world_ap                         # visit it
    aptool -p hello_world.ap create           # this line creates a new file "hello_world.ap"
    mkdir code                                # create the "code" directory where you can
                                              # write programs that will be stored in the AP
    echo "print('hello, world')" > code/hello.py   # create a program (this form runs under Python 2 and 3)
    aptool checkin -t calclet code/hello.py   # store the program in the AP
That's it, you have created an ActivePaper!
You can observe its content by issuing
aptool ls # inspect the AP
And execute it
aptool run hello # run the program in "code/hello.py"
This command looks into the ActivePapers file and not into the directories visible in the filesystem. The filesystem acts more like a staging area.
A basic computation in ActivePapers
The "hello, world" program above did not perform a computation of any kind. An introductory example for science is the computation of the number $\pi$ by the Monte Carlo method.
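For readers who have not met the method before, here is the reasoning behind it, assuming points $(x_i, y_i)$ drawn uniformly in the unit square, which is what the programs in this article do. The probability of landing inside the quarter disc of radius 1 is the ratio of the two areas, and the estimator simply rescales the observed fraction of hits:

$$
P\left(x^2 + y^2 < 1\right) = \frac{\pi/4}{1} = \frac{\pi}{4},
\qquad
\hat{\pi}_n = \frac{4}{n} \sum_{i=1}^{n} \mathbf{1}\left[x_i^2 + y_i^2 < 1\right]
$$

The statistical error of $\hat{\pi}_n$ decreases as $1/\sqrt{n}$, so the estimate converges slowly.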
I will now create a new ActivePaper (AP) and comment on the specific ways to define parameters, store data and create plots. The dependency on the plotting library matplotlib has to be given when creating the ActivePaper:
    mkdir pi_ap
    cd pi_ap
    aptool -p pi.ap create -d matplotlib
To generate a repeatable result, I store the seed for the random number generator, along with the number of samples N:
    aptool set seed 1780812262
    aptool set N 10000
Each of the lines above stores a data element of type integer in the AP. These values can be accessed in the Python code of the AP.
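Why storing the seed makes the result repeatable can be shown outside of ActivePapers. The sketch below uses the standard library `random` module instead of NumPy so that it runs anywhere; `draw_samples` is a hypothetical helper of mine, not part of ActivePapers.

```python
import random


def draw_samples(seed, n):
    """Return n pseudo-random (x, y) pairs in the unit square for a given seed."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    return [(rng.random(), rng.random()) for _ in range(n)]


first = draw_samples(1780812262, 5)
second = draw_samples(1780812262, 5)
other = draw_samples(1780812263, 5)

print(first == second)  # compares two runs that use the same seed
print(first == other)   # compares two runs that use different seeds
```

Two runs with the same seed produce identical samples, which is why the seed is worth recording alongside the data.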
I will create several programs to mimic the workflow of more complex problems: one to generate the data, one to analyze the data and one for generating a figure.
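The first program, code/generate_random_numbers.py, is
    import numpy as np
    from activepapers.contents import data

    # read the parameters stored with "aptool set"
    seed = data['seed'][()]
    N = data['N'][()]

    # draw N points uniformly in the unit square and store them in the AP
    np.random.seed(seed)
    data['random_numbers'] = np.random.random(size=(N, 2))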
Apart from importing the NumPy module, I have also imported the ActivePapers data interface:
    from activepapers.contents import data
data is a dict-like interface to the content of the ActivePaper, so it only works in code that is checked into the ActivePaper and executed with aptool run. data can be used to read values, such as the seed and the number of samples, and to store data, such as the samples here.
The second program, code/compute_pi.py, is
    import numpy as np
    from activepapers.contents import data

    # load the stored samples as a NumPy array
    xy = data['random_numbers'][...]
    radius_square = np.sum(xy**2, axis=1)
    N = len(radius_square)

    # running estimate of pi after 1, 2, ..., N samples
    data['estimator'] = np.cumsum(radius_square < 1) * 4 / np.linspace(1, N, N)
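To see what the cumulative sum in the second program computes, here is a sketch of the same running estimate that you can run outside of an ActivePaper. It assumes Python 3 and swaps NumPy for the standard library (`itertools.accumulate` plays the role of `np.cumsum`); the function name `running_pi_estimates` is mine.

```python
import math
import random
from itertools import accumulate


def running_pi_estimates(n, seed):
    """Return the estimates of pi obtained after 1, 2, ..., n samples."""
    rng = random.Random(seed)
    # 1 if the point falls inside the quarter disc of radius 1, else 0
    hits = [1 if rng.random() ** 2 + rng.random() ** 2 < 1 else 0
            for _ in range(n)]
    # running count of hits, then rescale by 4 / (number of samples so far)
    counts = accumulate(hits)
    return [4 * count / k for k, count in enumerate(counts, start=1)]


estimates = running_pi_estimates(10000, 1780812262)
print(len(estimates), abs(estimates[-1] - math.pi))
```

The numbers differ from those in the ActivePaper because the two random number generators are not the same, but the structure of the computation is identical: element k of the result is the estimate based on the first k samples.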
And the third, code/plot_pi.py, is
    import numpy as np
    import matplotlib
    matplotlib.use('PDF')  # select the backend before importing pyplot
    import matplotlib.pyplot as plt
    from activepapers.contents import data, open_documentation

    estimator = data['estimator']
    N = len(estimator)

    plt.plot(estimator)
    plt.xlabel('Number of samples')
    plt.ylabel(r'Estimation of $\pi$')
    # write the figure into the documentation section of the AP
    plt.savefig(open_documentation('pi_figure.pdf', 'w'))
Two points in this program deserve attention:
- The setting of the matplotlib backend with matplotlib.use('PDF').
- The use of open_documentation. This function provides file-like objects that can read and write binary blobs.
Now, you can check in and run the code:
    aptool checkin -t calclet code/*.py
    aptool run generate_random_numbers
    aptool run compute_pi
    aptool run plot_pi
That's it, we have created an ActivePaper and run code with it.
For fun: issue the command
aptool set seed 1780812263
(or any number of your choosing that is different from the previous one) and then
ActivePapers handle dependencies! That is, everything that depends on the seed will be updated. That includes the random numbers, the estimator for pi and the figure. To see the update, check the creation times in the ActivePaper:
aptool ls -l
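The dependency handling can be pictured with a toy model. The sketch below is not how ActivePapers stores provenance internally; it is a hypothetical dictionary that mirrors the items of this example, with a small function of mine (`stale_items`) that finds everything downstream of a changed item.

```python
# Toy model: each item lists the items it was computed from.
# The names mirror this article's example; the structure itself is hypothetical.
DEPENDS_ON = {
    "random_numbers": ["seed", "N"],
    "estimator": ["random_numbers"],
    "pi_figure.pdf": ["estimator"],
}


def stale_items(changed, depends_on):
    """Return every item that depends, directly or indirectly, on `changed`."""
    stale = set()
    frontier = {changed}
    while frontier:
        # items not yet marked that have a dependency in the current frontier
        newly_stale = {item for item, deps in depends_on.items()
                       if item not in stale and frontier.intersection(deps)}
        stale |= newly_stale
        frontier = newly_stale
    return stale


print(sorted(stale_items("seed", DEPENDS_ON)))
# prints ['estimator', 'pi_figure.pdf', 'random_numbers']
```

Changing the seed therefore invalidates all three computed items, whereas changing only the estimator would invalidate the figure alone.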
It is good to know that ActivePapers have been used as companions to research articles! See Protein secondary-structure description with a coarse-grained model: code and datasets in ActivePapers format for instance.
You can have a look at the resulting files that I uploaded to Zenodo: doi:10.5281/zenodo.55268
ActivePapers paper: K. Hinsen, ActivePapers: a platform for publishing and archiving computer-aided research, F1000Research 3, 289 (2015).
ActivePapers website: the website for ActivePapers.