Welcome to our round table! Each participant writes one blog post about his or her experiences with distributing scientific software. You are invited to post. More information here.


Experiences with Python under git Control

I find Dag's idea very direct and brave.  I'm facing pretty much the same problem with Python only (i.e. no C libs, no Fortran libs, whatever).  I do use mpl, though.  Maybe you remember me doing some builds of numpy for the OS X users.  That time I resolved the reproducibility and branchability problem using .. git.  Yes, the Python path under git control.  I've put the whole Python installation directory, which is usually /Library/Frameworks/Python.framework/Versions/2.x/, under git control as a single repository (i.e. I made this directory a repo).  This works pretty flawlessly after some tinkering.  There are some pitfalls, which can be avoided, and some problems, which can be solved; the pitfalls will not be covered here.  One of the "major" problems is the .pth file issue: when merging in branches of different software installed via a .pth file, git sometimes cannot merge the .pth file properly, and it has to be fixed manually.
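A minimal sketch of this branch-per-package workflow; the paths, branch names, and the stand-in "installer" below are all illustrative, not taken from an actual setup:

```shell
# Put a Python installation under git control.  On OS X the directory would
# be /Library/Frameworks/Python.framework/Versions/2.x; a scratch dir here.
cd "$(mktemp -d)"
git init -q
git config user.email demo@example.com && git config user.name demo

echo baseline > packages.txt                 # stand-in for the installed tree
git add -A && git commit -qm "baseline Python installation"
base=$(git rev-parse --abbrev-ref HEAD)      # master/main, whichever git made

# Install each new package on its own branch, so it can be merged or discarded.
git checkout -qb numpy-install
echo numpy >> packages.txt                   # stand-in for running an installer
git add -A && git commit -qm "install numpy"

# Merge back into the main line; .pth files are where merges typically
# conflict and need manual fixing.
git checkout -q "$base"
git merge -q numpy-install
```

The payoff is that rolling back a package is just deleting or reverting its branch, and two machines can be kept consistent by pushing and pulling the whole installation.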

There is one more major problem where I would need feedback, or where I lack knowledge: the speed issue with .pyc files being older than the corresponding .py file.  The .pyc files in general are one of the pitfalls of the method.  They are hard to reproduce so that they look the same as when originally installed.  And you cannot git-ignore them, because .pyc files lingering around can, for instance, disturb nose.  For this problem, which can be ignored entirely if load speed does not matter, I haven't found a real solution so far; I simply ignored it and kept the .pyc files in the repo.
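One possible workaround for the staleness half of the problem, sketched here under the assumption that a full recompile pass is acceptable after each checkout, is the stdlib compileall module (Python 3 shown, which writes into __pycache__/; Python 2 writes mod.pyc next to the source):

```shell
# After a checkout/merge the committed bytecode may be older than the
# sources, so Python recompiles on every import.  One pass of compileall
# over the installation refreshes it.  A scratch module stands in for the
# real site-packages tree.
cd "$(mktemp -d)"
echo 'x = 1' > mod.py
python3 -m compileall -q .
```

Note that the freshly compiled files embed the new source timestamps, which is also part of why committed .pyc files rarely reproduce bit-for-bit.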

I'm going to use this approach for my OS X Lion Python installation, with a shared system Python and a user-local Python (probably via virtualenv), because I want the system Python to be free of users' software, whether that software is interesting to others or not.  I've worked out the strategy for handling mpkgs (i.e. point-and-click installers): fetching from the system Python and deleting the branch in the system Python afterwards.  Anyway, this is planned, not yet tested.  Tested is only the strategy for handling and maintaining the system Python, and that one I have used on a regular basis.

Of course, it would be possible to wrap git in a Python-specific application, but that would hinder portability to the C, Cython, Fortran etc. folks amongst us.

I must add that I don't have experience with most of the packaging software that is around, except Python's distutils and Bento (of which I clearly prefer the latter).

In my experience, there will not be the solution.  The world is not as simple as we scientists would like it to be.  It's like a zoo out there [from a well-known movie].  Nothing will ever change that.  Life would be boring if everything were standardised and there were only one standard.  In my opinion, a discussion serves as a .. discussion in its own right, i.e. it propels the mind by developing new ideas.  Standards might be a side effect of a good democratic evolution.  So this round table might have perfect conditions, since it brings together the different parties, which were separated in mind before.


Chris Kees

Introducing myself

I am a research hydraulic engineer at the US Army Engineer Research & Development Center (ERDC) and one of the developers of Proteus, a Python toolkit for computational methods and simulation.  I've been working in the area of numerical methods for partial differential equations and high performance computing since 1995.

In the interest of contributing to the roundtable discussion in a timely manner, I'm going to basically post what I put on the sage-support list. I have learned a lot since that post, and I see a lot of good ideas in what others have shared about their knowledge of Linux packaging systems like Nix and Gentoo.

One area where I have a slightly different opinion is that I think we should focus on just the needs of the Python environment on HPC systems. That includes the difficulties of working with many other packages and system libraries, but I am looking for an evolutionary step beyond what we currently have working for our Python software. If the resulting Python distribution solves the more general problem, then so be it.

The way I see forward (from sage-support)

Here's what I think we need:

1) A standard, which specifies a Python version and a list of Python packages and their dependent packages. This allows for-profit vendors to build to our standard.

2) A build system that allows extensive configuration of the entire system but with enough granularity that the format of a package is standardized and relatively straightforward. On the other hand, the whole system must be designed such that it can be built repeatedly from scratch without any interactive steps.

3) A testing system that is simple enough that the community can easily contribute tests to ensure that the community Python is reliable for their needs.

4) A framework for making this environment extensible without requiring users to fork it and create yet more distributions.

Here's a straw man:

1) Standard:

Python 2.7.2 PLUS:
  • numpy *
  • scipy
  • matplotlib *
  • vtk (python wrappers + C++ libs) *
  • elementtree *
  • ctypes *
  • readline (i.e. a functional readline extension module) *
  • swig
  • mpi4py *
  • petsc4py *
  • pympi
  • nose *
  • pytables *
  • basemap
  • cython *
  • sympy *
  • pycuda
  • pyopencl
  • IPython *
  • wxpython
  • PyQt  *
  • pygtk
  • PyTrilinos
  • virtualenv *
  • Pandas
  • numexpr *
  • pygrib
*Our group has these in the python stack we build for our PDE solver framework (http://proteus.usace.army.mil), which we build on a range of machines at 4 major supercomputing centers. 

The main issue I see with 1) is that this is somewhat different from the sage package list. We would need many optional sage packages but wouldn't need some of the standard sage packages.

2) Build System: 

a. Use cmake* for the top-level configuration, storing the part relevant to each package in a subdirectory for each package (call it package_name_Config, e.g. numpyConfig, petsc4pyConfig, ...)

b. store each package as an spkg** that meets sage community standards except that spkg-install will rely on information from package_name_Config (maybe it would be OK to edit files in package_name_Config located INSIDE package_name_version.spkg during the interactive configuration step?)  

c. each package will still get built with its native build system***


*Our group simply uses make instead of cmake, with a top-level Makefile containing 'editConfig' and 'newConfig' targets that allow you to edit and copy existing configurations
**Our group only produces a top level spkg, but I think we could easily generate a finer grained set of spkg's for ones that don't already exist
***Our group does this (i.e. we don't rewrite upstream build systems).  I think spkg's also use the native build system in most cases, right?

The main issue with 2) (the build system) is that building on HPC systems requires extensive configuration of individual packages: numpy needs to get built with the right vendor BLAS/LAPACK and potentially the correct, non-gcc, optimizing compilers (maybe even a mixture of gcc and some vendor Fortran). Likewise, petsc4py might need to use PETSc libraries installed as part of the HPC baseline configuration rather than building the source included with this distribution. My impression is that sage very reasonably opted to focus on the web notebook and a GNU-based Linux environment, so the spkg system alone doesn't fully meet the needs of the HPC community. We need the ability to specify different compilers for different packages and to do a range of things, from building every dependency to building only Python wrappers for many dependencies.
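As a concrete illustration of the per-package configuration problem: numpy historically picks its BLAS/LAPACK from a site.cfg file next to setup.py. The section and key names below follow numpy's documented site.cfg.example, but the MKL paths are placeholders, not a tested configuration:

```shell
# Sketch: point numpy's build at a vendor BLAS/LAPACK via site.cfg.
# The [mkl] section and its keys mirror numpy's site.cfg.example; the
# paths are placeholders for an actual MKL installation on the machine.
cd "$(mktemp -d)"
cat > site.cfg <<'EOF'
[mkl]
library_dirs = /opt/intel/mkl/lib/intel64
include_dirs = /opt/intel/mkl/include
mkl_libs = mkl_rt
EOF

# A non-default (vendor) Fortran compiler is selected per package in the
# same spirit, e.g.:
#   python setup.py build --fcompiler=intelem
```

The point is that each package needs its own such knobs, which is exactly what the package_name_Config directories above are meant to hold.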

3) buildbot + nose and a package_nameTest directory for community-supplied tests of each package, in addition to the packages' own tests. This way users only have to add test_NAME.py files to the corresponding package_nameTest directory.

4) virtualenv + pip should allow users to extend the Python installation into their own private environment, where they can update and add new packages as necessary. An issue here is that this wouldn't allow a per-user sage environment, so I'm not sure whether users could also install spkg's or even use their modified Python environment from sage.
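A sketch of that per-user layering: virtualenv was the usual tool for this, but the stdlib venv module used here accepts the same --system-site-packages flag (the pip bootstrap is skipped only to keep the sketch self-contained):

```shell
# Create a per-user environment layered on a shared base Python: packages
# from the base install remain visible, and anything the user installs
# lands in the private environment and shadows the base install.
cd "$(mktemp -d)"
python3 -m venv --without-pip --system-site-packages myenv
myenv/bin/python -c 'import sys; print(sys.prefix)'

# New or updated packages would then go into myenv, e.g.:
#   myenv/bin/pip install --upgrade mpi4py
```

The shared installation stays pristine while each user still gets write access to an environment of their own.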

Dag Sverre Seljebotn

Introducing myself

I'm a Ph. D. student doing statistical analysis of the cosmic microwave background (Institute of Theoretical Astrophysics, Oslo, Norway). This is a very Fortran-oriented place, with only a couple of Python developers.

I'm one of the developers of Cython. I also worked for some months helping Enthought to port SciPy to .NET.

In a couple of months the institute will likely be buying a small cluster (~700 cores).  The upside is we don't have to use a certain ever-failing cluster (which will remain unnamed) nearly so much. The downside is we need to build all those libraries yet again.

My Ph. D. project is to rewrite an existing code so that it can scale up to higher resolutions than today (including work on statistical methods and preconditioners).  My responsibility will be the computationally heavy part of the program. The current version is 10k lines of Fortran code; the rewritten version will likely be a mix of Python, Cython, C and Fortran: MPI code with many dependencies (libraries for dense and sparse linear algebra, Fourier transforms, and spherical harmonic transforms).

What I have tried

During my M. Sc. I relied on a Sage install:

  • It got so heavily patched with manually installed packages that I never dared upgrade
  • matplotlib was botched and needed configuration + rebuild (no GUI support)
  • NumPy was built with ATLAS, which produced erroneous results on my machine, so I made the NumPy SPKG work with Intel MKL
  • I needed to work both on my own computer and the cluster, and keep my heavily patched setups somewhat consistent. I started writing SPKGs to do this, but it was more pain than gain
  • I still ended up with a significant body of C and Fortran code in $HOME/local, and Python code in $PYTHONPATH.

In the end I started afresh with EPD, simply because the set of packages that I wanted was larger and the set of packages I didn't want smaller. My current setup is a long $PYTHONPATH of the code I modify, and EPD + manually easy_install'ed packages. There are probably better ways of using EPD, but I don't want to invest in something which only solves one half of my problems.

In search of a better solution I've tried Gentoo Prefix and Nix. Both were a bit difficult to get started with (though Nix much less so), and both assume that you want to build everything, including gcc and libc. Building its own libc makes the stack incompatible with any shared libraries on the "host" machine, so it's an all-or-nothing approach, and I didn't dare make the commitment.

None of the popular solutions solve my problems. They work great for their target communities -- mathematicians in the case of Sage, scientists not using clusters or GPLed code in the case of EPD -- but nobody has a system for "Ph. D. students who use a software stack that cluster admins have barely heard of, C/Fortran libraries that the Python community has never heard of, and who need to live on the bleeding edge of some components (Cython, NumPy) but care less about the bleeding edge of other components (matplotlib, ipython)".

Build-wise: Building Cython+Fortran used to be a pain with distutils. I switched to SCons, which was slightly better, but had its own problems. Finally, the current waf works nicely (thanks to the work of David Cournapeau and Kurt Smith), so I switched to that. Bento sounds nice, but I haven't used it yet, since I just use PYTHONPATH for my own code and haven't needed to distribute code to anyone other than co-workers yet.

My insights

  • The problem we face is not unique to Python (though it is perhaps made worse by people actually starting to reuse each other's code...). A solution should target scientific software in general.
  • Many Python libraries wrap C/Fortran code which must also be properly installed. Bundling C/Fortran libraries in a Python package (as some do) is a pain, because you can no longer freely upgrade or play with the compilation of the C/Fortran part.
  • Non-root use is absolutely mandatory. I certainly won't get root access on the clusters, and the sysadmins can't be bothered to keep the software stack I want to use up to date.
  • I think all popular solutions fall short of allowing me the flexibility that Git offers me with respect to branching and rollbacks. I want to develop different applications on top of different software stacks, to switch to a stack I used a year ago for comparison (reproducible research and all that), and to more easily hop between stacks compiled with different versions of BLAS.
  • I like to use my laptop, not just the cluster. Relying on a shared filesystem or hard-coded paths is not good.
  • I want it to be trivial to use the software distribution system to distribute my own code to my co-workers. I don't want to invent a system on top for moving around and keeping in sync the code I write myself.

The way I see forward

Before daring to spend another second working on any solutions, I want to see the problems faced and solutions tried by others.

Pointing towards a solution: I've become an admirer of Nix (http://nixos.org). It solves the problem of jumping between different branches and doing atomic rollbacks in seconds. On the other hand, there are a couple of significant challenges with Nix. I won't go into details here; they are written up at https://github.com/dagss/scidist/blob/master/ideas.rst.

On the one hand, I'm a Ph. D. student with 3 paid years ahead of me. On the other hand, I need to do research (and I'm a father of two and can't work around the clock). I wish I didn't have to spend any more time on this, but now that a new cluster is coming up and I need to edit Makefiles yet another time, I'm sufficiently frustrated that I might anyway.

Right now my vision is sufficiently many people with sufficient skills coming together for a week-long workshop to build something Nix-like (either based on Nix or just stealing ideas).


Every day, countless researcher hours are spent getting software to run. A significant number of scientific Python distributions are available, but none solve everybody's problems.

There's no lack of mailing list threads out there on the subject. This time it was on the mpi4py mailing list, where the feeling was that nobody has really catered to the "HPC" or "large cluster" segment. Rather than plunging ahead and developing yet another scientific Python distribution or set of standards for another special case, let's take a breath and make sure we understand the full problem first. Why aren't the current solutions working, and what do we really want?


  • Avoid redoing the mistakes of the past
  • Save time by pooling our efforts
  • Take a step towards more reproducible research
  • Make scientific Python more attractive

Round 1: Getting to know each other + surveying problems and solutions

Everybody can participate (HPC or desktop, Python or Fortran shouldn't matter at this point). Send me an email at d.s.seljebotn@astro.uio.no and I'll give you posting rights. Then write a post in which you:

  • Introduce yourself and the problems you work with
  • What have you tried? What are you currently using? Why does this cause problems?
  • What are the key insights you can draw from your experience?
  • What way do you see forward? Are you already working on something? How much time and other resources (money for workshops etc.) do you have to work towards a solution?

We hope many will participate so that we all get a broad collection of the problems faced and the solutions tried.