Welcome to our round table! Each participant writes one blog post about his or her experiences with distributing scientific software. You are invited to post. More information here.


Matt T

My name is Matt Turk, and I'm a computational astrophysicist working on structure formation in the early universe.  I am the original author of the yt code and a developer on the Enzo project.  I'm at Columbia University, on an NSF postdoctoral fellowship to drive forward my studies of the First Stars while developing infrastructure for these simulations, targeting simulations from laptop- to peta-scale.  yt is a python project, while Enzo is largely C/C++/Fortran.  yt is designed to target the output of multiple different simulation codes, and has a growing user- and developer-base.

My primary interests with respect to software are ensuring that communities of users can deploy appropriate analysis packages easily and on multiple systems.  While the majority of the users of yt utilize XSEDE resources, we have a large number that also use laptops and local computing clusters.

yt started out as very, very difficult to install.  The software stack was quite large and installation was not automated.  For the most part, we have addressed this in two ways.  The first is that the dependency stack has been whittled away substantially; we are extremely conservative about adding new dependencies to yt, and the core dependencies for most simulation input types are simply numpy, hdf5, and Python itself.  The second approach is to provide a hand-written installer script, which handles installation of the following dependencies into an isolated directory structure:

  • zlib
  • bzlib
  • libpng
  • freetype (optional)
  • sqlite (optional)
  • Python
  • numpy
  • matplotlib (optional)
  • ipython (optional)
  • hdf5
  • h5py (optional)
  • Cython (optional)
  • Forthon (optional)
  • mercurial
  • yt
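As a rough illustration of what an installer script like this does, here is a minimal sketch of the isolated-prefix pattern: every dependency is configured with the same `--prefix`, so nothing touches system directories and no root/sudo access is needed. The directory name and package names below are made up for the example, not taken from the actual yt script.

```shell
#!/bin/sh
# Sketch of an isolated-prefix dependency build (illustrative only).
set -e

DEST_DIR="$HOME/yt-dest"          # private install tree; name is hypothetical
mkdir -p "$DEST_DIR/src"

build_pkg() {
    # Build one configure-style package into the shared prefix.
    pkg="$1"
    cd "$DEST_DIR/src/$pkg"
    ./configure --prefix="$DEST_DIR"
    make && make install
    cd - > /dev/null
}

# Each library lands in the same tree, e.g.:
#   build_pkg zlib-1.2.x
#   build_pkg hdf5-1.8.x
# Afterwards, point the environment at the private stack:
export PATH="$DEST_DIR/bin:$PATH"
export LD_LIBRARY_PATH="$DEST_DIR/lib:$LD_LIBRARY_PATH"
echo "stack prefix: $DEST_DIR"
```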
This seems like a large stack, but the trickiest libraries are usually matplotlib and numpy.  We have also reached out to XSEDE, and modules are now available on several HPC installations; the install script takes care of the rest.  We are currently attempting to make yt available as a component of both ParaView's superbuild and VisIt's build_visit script, both of which also handle dependency stacks.  I'm extremely concerned with ensuring that yt's installation works everywhere, especially on systems where root / sudo access is not available.

Easily the hardest problem, and the one that I hope we can solve in some way, is that of static builds.  Building a static stack (for use, for instance, on Compute Node Linux on some Cray systems) is difficult; starting from the GPAW instructions, we at one time attempted to maintain static builds of yt, but the inclusion of C++ components (and the lack of C++ ABI interoperability) became too much of a burden, and we no longer do so.  Now we face the problem of needing one, because file systems typically cannot keep up with every MPI task importing the Python stack (which becomes burdensome at as few as 256 processes and essentially impossible above a couple thousand).  While egg imports and zipped file systems alleviate this problem for pure-Python libraries, they do not work for shared libraries.  Neither I nor my fellow developers have found a simple and easy way to generate static builds that are easily updated, but this is a primary concern for me.
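To illustrate the zip-import trick mentioned above: pure-Python modules can be bundled into a single archive, so N MPI tasks open one file instead of hammering the file system with thousands of small reads. The toy module below is invented for the example; note that this does not help for compiled extension modules (.so files), which the interpreter cannot load from inside a zip.

```python
import os
import sys
import tempfile
import zipfile

# Build a toy library zip (a stand-in for a zipped site-packages).
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "purelib.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("toylib/__init__.py", "ANSWER = 42\n")

# Prepending the archive to sys.path lets Python's zipimport machinery
# resolve the package directly from the archive.
sys.path.insert(0, archive)
import toylib

print(toylib.ANSWER)  # -> 42
```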

I don't have a particular takeaway or suggestion for a call to action; we have lately simply come to terms with the time it takes to load shared libraries, and we'll probably have another go at a unified static builder at some point in the future.  But for now, our install script works reasonably well, and we will probably continue using it while still reaching out to system administrators for assistance building on individual supercomputer installations.


Ondřej Č

Here is my answer to all the questions from the Introduction post.

My name is Ondřej Čertík and I am doing my PhD in Chemical Physics at the University of Nevada, Reno. For my work, I need C/C++ libraries (like PETSc and Trilinos) as well as Fortran libraries (BLAS, LAPACK, ARPACK and some FEM packages). My own code used to be in C++, but last year I switched to Fortran (so I need Fortran as a first-class citizen). I then wrap it using Cython and call it from Python.

I took Sage, rewrote the build system in Python, and created Qsnake (the core is BSD-licensed; other packages have their own licenses). The packages are hosted on GitHub, and Qsnake uses a JSON file to describe package dependencies. Quite a few people have tried it already, and here is our plan for the package management. I need a lot of packages, which I would call engineering packages, mainly around the SciPy stack, plus a few more numeric packages.
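As a hypothetical sketch of this approach, a build system can load a JSON dependency file and install packages in dependency order via a topological sort. The package names and JSON layout below are invented for illustration, not Qsnake's actual format.

```python
import json

# A toy JSON dependency specification (illustrative only).
spec = json.loads("""
{
  "python": {"deps": []},
  "numpy":  {"deps": ["python"]},
  "scipy":  {"deps": ["numpy"]}
}
""")

def install_order(packages):
    """Return package names topologically sorted so deps come first."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in packages[name]["deps"]:
            visit(dep)       # install dependencies before the package
        order.append(name)

    for name in sorted(packages):
        visit(name)
    return order

print(install_order(spec))  # -> ['python', 'numpy', 'scipy']
```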

I don't think there is any fundamental problem with my approach; it works great for me and does exactly what I need. Obviously further improvements would be welcome (see the packages plan), and I work on them as my time allows.

My key insight is that it is a lot of work to get things working and tested on various platforms (Linux and Mac so far). It is also advantageous to keep packages compatible with Sage, because then people can reuse them; rather than creating yet another fork, I view Qsnake as a complement to Sage.

The way forward, as I see it, is simply to continue working on Qsnake, or, if I see significant progress on some competing product, to join it. I don't have much money to spend on Qsnake, but thanks to its compatibility with Sage there is the possibility of a joint workshop (William Stein offered to organize such a thing, but I am currently too busy). I use Qsnake almost daily for my own work, and I improve it as time permits. I don't have much time in general, but I do my best.

Finally, the most important lesson I have learned is that discussing good design and so on is important, but it is even more important to simply start working on something and make it fix a problem for somebody (me, in my case). Qsnake does exactly that, and, as I said, I am open to adjusting its goals if anyone wants to join, as well as to joining any competing project if somebody else thinks they can do (and does) a better job (which shouldn't be that difficult given my own time constraints). In the meantime, I simply continue with Qsnake.

What now?

This blog was hardly a success (though thanks to those who did post!). Quite a few people said they wanted to post, and quite a few more read the blog (800 page views), so at least it's obvious there's a lot of interest in this subject.

What now? Ideas? (Beyond just giving up...)
  • If you really did intend to put aside time to write a post, please do it now. Don't worry too much about quality at this point; in particular, those of you who sent me private emails, feel free to just dump them to the blog.
  • We could announce a migration to a mailing list (python-hpc? Or a new Google group?) and promise to keep each other posted on our individual efforts there. Though I'm not sure starting a mailing list is really easier than a blog; many mailing lists die out too.
  • I did throw out the idea of a Skype meeting; if anyone would prefer that, please say so. It's probably more time-consuming overall than getting the blog to work, but perhaps more comfortable and easier to get people to commit to.