Status update week 5 GSoC

Continued work on tutorial material

Cross-platform work is always a bit tricky. Python does in general a great job of allowing programs to access the filesystem, network etc in an OS agnostic manner, but sometimes there is no way around having to write special code for each platform. In Python that often means that we will do runtime inspection:

>>> sys.platform
'linux'  # or e.g. darwin
>>> os.name
'posix'  # or e.g. nt

To add another dimension of complexity we sometimes deal with 32-bit and sometimes 64-bit systems (most headaches are caused by pointer sizes, size of int and long), in python we can get the size of int of the platform quite easily:

>>> sys.maxsize - 2**63 + 1
0

when you, in addition to this, are doing code-generation, you are most likely relying on a compiler. That is our third dimension of complexity, on linux that (usually) translates to gcc & llvm/clang (but Intel's icc/ifort & portland group compilers and not uncommon). This third "dimension" has been the number one struggle the past week (and in particular the combination mingw/windows). At this point I feel that it is not worthwhile to continue pursuing mingw support on Windows.

It is quite likely that a few of our attendees will have some technical difficulties with system installed compilers. Therefore, I'm particularly happy that we now have got our binder environment to work flawlessly. It took some work to setup a Dockerfile, but now it feels reassuring to have this as a fall-back.

Work on SymPy for version 1.1

The work on the PythonCodePrinter has progressed since last week and should hopefully soon be ready for merging. The purpose of this class is to be stepping stone moving away from having e.g. lambdify relying on the StrPrinter, and instead use the CodePrinter class in .printing.codeprinter. Changing the super class showed to be trickier than first anticipiated (I thought it would be an afternoons work at most). I originally planned to rename the original PythonPrinter to something more appropriate (SymPyCodePrinter) together with its convenience functions. However, since this would incur a deprecation for only name change, it was decided that we will have to live with the old naming (to avoid raising tons of deprecation warnings in downstream projects).

We had identified two extensions we wanted to introduce to the codegeneration/autowrap factilities before the v1.1 release:

  • Kenneth Lyons (@ixjlyons) spear-headed the effort to allow custom CodeGen subclasses to be passed to codegen
  • ...and Jason the changes to the CythonWrapper to allow compiler arguments to be passed to autowrap.

Plans for the upcoming week

With a release candidate of SymPy v1.1 out the door (an impressive effort led by @asmeurer) and most cross-platform issues out of the way, it is about time for the final iterations of the tutorial material (and hopefully iron out any related issues in the rc).

At the end of the week it's time for me to fly to the US to attend SciPy 2017. I have high expectations and believe that it will be a great catalyst for the rest of the summers work on code-generation.

Status update week 4 GSoC

Continued work on tutorial material

During the past week I got binder to work with our tutorial material. Using environment.yml did require that we had a conda package available. So I set up our instance of Drone IO to push to the sympy conda channel. The new base dockerimage used in the beta version of binder (v2) does not contain gcc by default. This prompted me to add gcc to our environment.yml, unfortunately this breaks the build process:

Attempting to roll back.
LinkError: post-link script failed for package defaults::gcc-4.8.5-7
running your command again with `-v` will provide additional information
location of failed script: /opt/conda/bin/.gcc-post-link.sh
==> script messages <==
<None>

I've reached out on their gitter channel, we'll see if anyone knows what's up. If we cannot work around this we will have two options:

  1. Accept that the binder version cannot compile any generated code
  2. Try to use binder with a Dockerimage based on the one used for CI tests (with the difference that it will include a prebuilt conda environment from environment.yml)

I hope that the second approach should work unless we hit a image size limitation (and perhaps we need a special base image, I haven't looked into those details yet).

Jason has been producing some really high quality tutorial notebooks and left lots of constructive feedback on my initial work. Based on his feedback I've started reworking the code-generation examples not to use classes as extensively. I've also added some introductory notebooks: numerical integration of ODEs in general & intro to SymPy's cryptically named lambdify (the latter notebook is still to be merged into master).

Work on SymPy for version 1.1

After last weeks mentor meeting we decided that we would try to revert a change to sympy.cse (a function which performs common subexpression elimination). I did spend some time profiling the new implementation. However, I was not able to find any obvious bottle-necks, and given that it is not the main scope of my project I did not pursue this any further at the moment.

I also just started looking at introducing a Python code printer. There is a PythonPrinter in SymPy right now, although it assumes that SymPy is available. The plan right now is to rename to old printer to SymPyCodePrinter and have the new printer primarily print expressions, which is what the CodePrinter class does best, even though it can be "coerced" into emitting statements.

Plans for the upcoming week

As the conference approaches the work intensifies both with respect to finishing up the tutorial material and fixing blocking issues for a SymPy 1.1 release. Working on the tutorial material helps finding things to improve code-generation wise (just today I opened a minor issue just to realize that my "work-in-progress" pull-request from an earlier week (which I intend to get back working on next week too) actually fixes said issue.

Status update week 3 GSoC

Fast callbacks from SymPy using SymEngine

My main focus the past week has been to get Lambdify in SymEngine to work with multiple output parameters. Last year Isuru Fernando lead the development to support jit-compiled callbacks using LLVM in SymEngine. I started work on leveraging this in the Python wrappers of SymEngine but my work stalled due to time constraints.

But since it is very much related to code generation in SymPy I did put it into my time-line (later in the summer) in my GSoC application. But with the upcoming SciPy conference, and the fact that it would make a nice addition to our tutorial, I have put in work to get this done earlier than first planned.

Another thing on my to-do-list from last week was to get numba working with lambdify. But for this to work we need to wait for a new upstream release of numba (which they are hoping to release before the SciPy conference).

Status of codegen-tutorial material

I have not added any new tutorial material this week, but have been working on making all notebooks work under all targeted operating systems. However, every change to the notebooks have to be checked on all operating systems using both Python 2 and Python 3. This becomes tedious very quickly so I decided to enable continuous integration on our repository. I followed conda-forges approach: Travis CI for OS X, CircleCI for Linux and AppVeyor for Windows (and a private CI server for another Linux setup). And last night I finally got green light on all 4 of our CI services.

Plans for the upcoming week

We have had a performance regression in sympy.cse which has bit me multiple times this week. I managed to craft a small test case indicating that the algorithmic complexity of the new function is considerably worse than before (effectively making it useless for many applications). In my weekly mentor-meeting (with Aaron) we discussed possibly reverting that change. I will first try to see if I can identify easy-to-fix bottlenecks by profiling the code. But the risk is that it is too much work to be done before the upcoming new release of SymPy, and then we will simply revert for now (choosing speed over extensiveness of the sub-expression elimination).

I still need to test the notebooks using not only msvc under Windows (which is currently used in the AppVeyor tests), but also mingw. I did manage to get it working locally but there is still some effort left in order to make this work on AppVeyor. It's extra tricky since there is a bug in distutils in Python 3 which causes the detection of mingw to fail. So we need to either:

  • Patch cygwincompiler.py in distutils (which I believe we can do if we create a conda package for our tutorial material).
  • ...or use something else than pyximport (I'm hesitant to do this before the conference).
  • ...or provide a gcc executable (not a .bat file) that simply spawns gcc.bat (but that executable would need to be compiled during build of our conda package).

Based on my work on making the CI services work, we will need to provide test scripts for the participants to run. We need to provide the organizers with these scripts by June 27th so this needs to be decided upon during next week. I am leaning towards providing an environment.yml file together with a simple instruction of activating said environment, e.g.:

$ conda env create -f environment.yml
$ source activate codegen17
$ python -c "import scipy2017codegen as cg; cg.test()"

This could even be tested on our CI services.

I also intend to add a (perhaps final) tutorial notebook for chemical kinetics where we also consider diffusion. We will solve the PDE using the method of lines. The addition of a spatial dimension in this way is simple in principle, things do tend to become tricky when handling boundary conditions though. I will try to use the simplest possible treatment in order to avoid taking focus from what we are teaching (code-generation).

It is also my hope that this combined diffusion-reaction model is a good candidate for ufuncify from sympy.utilities.autowrap.

Status update week 2 GSoC

I have spent the second week of Google Summer of Code on essentially two things:

  1. Continued work on type awareness in the code printers (CCodePrinter and FCodePrinter). The work is ongoing in gh-12693.
  2. Writing tutorial code on code-generation in the form of jupyter notebooks for the upcoming SciPy 2017 conference.

Precision aware code printers

After my weekly mentor meeting, we decided to take another approach to how we are going to represent Variable instances in the .codegen.ast module. Previously I had proposed to use quite a number of arguments (stored in .args since it inherits Basic). Aaron suggested we might want to represent that underlying information differently. After some discussion we came to the conclusion that we could introduce a Attribute class (inhereting from Symbol) to describe things such as value const-ness, and pointer const-ness (those two are available as value_const and pointer_const). Attributes will be stored in a FiniteSet (essentially SymPy version of set) and the instances we provide as "pre-made" in the .codegen.ast module will be supported by the printers by default. Here is some example code showing what the current proposed API looks like (for C99):

>>> u = symbols('u', real=True)
>>> ptr = Pointer.deduced(u, {pointer_const, restrict})
>>> ccode(Declaration(ptr))
'double * const restrict u;'

and for Fortran:

>>> vx = Variable(Symbol('x'), {value_const}, float32)
>>> fcode(Declaration(vx, 42))
'real(4), parameter :: x = 42'

The C code printer can now also print code using different math functions depending on the targeted precision (functions guaranteed to be present in the C99 stanard):

>>> ccode(x**3.7, standard='c99', precision=float32)
'powf(x, 3.7F)'
>>> ccode(exp(x/2), standard='c99', precision=float80)
'expl((1.0L/2.0L)*x)'

Tutorial material for code generation

Aaron, Jason and I have been discussing what examples to use for the tutorial on code generation with SymPy. Right now we are aiming to use quite a few examples from chemistry actually, and more specifically chemical kinetics. This is the precise application which got me started using SymPy for code-generation, so it lies close to my heart (I do extensive modeling of chemical kinetics in my PhD studies).

Working on the tutorial material has already been really helpful for getting insight in development needs for the exisiting classes and functions used for code-generation. I was hoping to use autowrap from the .utilities module. Unfortunately I found that it was not flexible enough to be useful for integration of systems of ODEs (where we need to evaluate a vector-valued function taking a vector as input). I did attempt to subclass the CodeWrapper class to allow me to do this. But personally I found those classes to be quite hard to extend (much unlike the printers which I've often found to be intuitive).

My current plan for the chemical kinetics case is to first solve it using sympy.lambdify. That allows for quick prototyping, and unless one has very high demands with respect to performance, it is usually good enough.

The next step is to generate a native callback (it was here I was hoping to use autowrap with the Cython backend). The current approach is to write the expressions as Cython code using a template. Cython conveniently follows Python syntax, and hence the string printer can be used for the code generation. Doing this speeds up the integration considerably. At this point the bottleneck is going back and forth through the Python layer.

So in order to speed up the integration further, we need to bypass Python during integration (and let the solver call the user provided callbacks without going through the interpreter). I did this by providing a C-code template which relies on CVode from the Sundials suite of non-linear solvers. It is a well established solver and is already available for Linux, Mac & Windows under conda from the conda-forge channel. I then provide a thin Cython wrapper, calling into the C-function, which:

  1. sets up the CVode solver
  2. runs the integration
  3. records some statistics

Using native code does come at a cost. One of the strengths of Python is that it is cross-platform. It (usually) does not matter if your Python application runs on Linux, OSX or Windows (or any other supported operating system). However, since we are doing code-generation, we are relying on compilers provided by the respective platform. Since we want to support both Python 2 & Python 3 on said three platforms, there are quite a few combinations to cover. That meant quite a few surprises (I now know for example that MS Visual C++ 2008 does not support C99), but thanks to the kind help of Isuru Fernando I think I will manage to have all platform/version combinations working during next week.

Also planned for next week:

  • Use numba together with lambdify
  • Use Lambdify from SymEngine (preferably with the LLVM backend).
  • Make the notebooks more tutorial like (right now they are more of a show-case).

and of course: continued work on the code printers. That's all for now feel free to get in touch with any feedback or questions.

A summer of code and mathematics

Google are generously funding work on selected open source projects each year through the Google Summer of Code project. The project allows under- and post-graduate students around the world to apply to mentoring organizations for a scholarship to work on a project during the summer. This spring I made the leap, I wrote a proposal which got accepted, and I am now working full time for the duration of this summer on one of these projects. In this blog post I'll give some background and tell you about the first project week.

Background

Since a few years I've been contributing code to the open-source project SymPy. SymPy is a so-called "computer algebra system", which lets you manipulate mathematical expressions symbolically. I've used this software package extensively in my own doctoral studies and it has been really useful.

My research involves formulating mathematical models to: rationalize experimental observations, fit parameters or aid in design of experiments. Traditionally one sits down and derive equations, often using pen & paper, then one writes computer code which implements said model, and finally one writes a paper with the same formulas as LaTeX code (or something similar). Note how this procedure involves writing the same equations essentially three times, during derivation, coding and finally the article.

By using SymPy I can, from a single source:

  1. Do the derivations (fewer hard-to-find mistakes)
  2. Generate the numerical code (a blazing fast computer program)
  3. Output LaTeX formatted equations (pretty formulas for the report)

A very attractive side-effect of this is that one truly get reproducible research (reproducibility is one of the pillars in science). Every step of the process is self-documented, and because SymPy is free software: anyone can redo them. I can't stress enough how big this truly is. It is also the main motivation why I haven't used proprietary software in place of SymPy, even though that software may be considerably more feature complete than SymPy, any code I wrote for it would be inaccessible to people without a license (possibly even including myself if I leave academia).

For this work-flow to work in practice the capabilities of the computer algebra system need to be quite extensive, and it is here my current project with SymPy comes in. I have had several ideas on how to improve capability number two listed above: generating the numerical code, and now I get the chance to realize some of them and work with the community to improve SymPy.

First week

The majority of the first week has been spent on introducing type-awareness into the code-printers. SymPy has printer-classes which specialize printing of e.g. strings, C code, Fortran code etc. Up to now there has been no way to indicate what precision the generated code should be for. The default floating point type in python is for example "double precision" (i.e. 64-bit binary IEEE 754 floating point). This is also the default precision targeted by the code printers.

However, there are occasions where one want to use another precision. For example, consumer class graphics cards which are ubiquitous often have excellent single precision performance, but are intentionally capped with respect to double precision arithmetic (due to marketing reasons). At other times, one want just a bit of extra precision and extended precision (80-bit floating point, usually the data type of C's long double) is just what's needed to compute some values with the required precision. In C, the corresponding math functions are standardized since C99.

I have started the work to enable the code printers to print this in a pull-request to the SymPy source repository. I have also started experimenting with a class representing arrays. Arrays

The first weekly meeting with Aaron Meurer went well and we also briefly discussed how to reach out to the SymPy community for wishes on what code-generation functions to provide, I've set up a wiki-page for it under the SymPy projects wiki:

https://github.com/sympy/sympy/wiki/codegen-gsoc17

I'll be sending out an email to the mailing list for SymPy asking for feedback.

We also discussed the upcoming SciPy 2017 conference where Aaron Meurer and Jason Moore will be giving a tutorial on code-generation with SymPy. They've asked me to join forces with them and I've happily accepted that offer and am looking forward to working on the tutorial material and teaching fellow developers and researchers in the scientific python community about how to leverage SymPy for code generation.

Next blog post will most likely be a bit more technical, but I thought it was important to give some background on what motivates this effort and what the goal is.

Writing tests for Python

A coworker asked me about "tests" from a programming practice perspective, "what's the big deal?". For me, writing tests is a very small price to pay to comfortably daring to change code without the fear of breaking it. Every bug I fix, even during the design phase, gives rise to at least one test.

No library or script is too small to have tests, here is a silly example (let's call it rootof.py):

#!/usr/bin/env python

import argh

def root_of(number=1.0):
    return number**0.5

def test_root_of():
    assert root_of(4) == 2
    assert abs(root_of(2) - 1.4142) < 1e-4

if __name__ == '__main__':
    argh.dispatch_command(root_of)

The example is so short it's silly, but the point is that each time you edit rootof.py you should run

$ python -m pytest rootof.py

which should exit without errors. That's pretty much the gist of it. Happy testing!

P.S. if you don't have argh just install it by typing:

$ python -m pip install --user argh

P.P.S if you don't have pip: see this.

Linear decay chains and the Bateman equation

Linear decay chains and the Bateman equation

This jupyter notebook aims to show how one may interactively work with SymPy to get help with integrating differential equations, checking correctness of solutions and generating numerical solutions with arbitrary precision. All of which is really handy when equations get complicated.

Consider the following decay chain

$$ u \overset{k}{\rightarrow} v \overset{l}{\rightarrow} w \overset{m}{\rightarrow} x $$

we will use SymPy to "manually" solve the system of diffential equations describing the time evolution of the amounts (concentrations) of respective species.

In [1]:
from IPython.display import display
from sympy.interactive import printing
printing.init_printing()

import sympy
from sympy import S, symbols, Function, Symbol, Eq, exp, Integral, refine, lambdify, latex
print(sympy.__version__)
1.0.1.dev

Let us define the variables, $t$ is time.

In [2]:
params = k, l, m = symbols('k l m'.split(), positive=True)
params.append(S('0'))  # x does not decay, S('0') gives us a "SymPy zero"
t = Symbol('t')
funcs = u, v, w, x = [Function(s)(t) for s in 'uvwx']
inits = u0, v0, w0, x0 = symbols('u0 v0 w0 x0'.split())  # initial values at t=0
display(params, t, funcs, inits)
$$\left [ k, \quad l, \quad m, \quad 0\right ]$$
$$t$$
$$\left [ u{\left (t \right )}, \quad v{\left (t \right )}, \quad w{\left (t \right )}, \quad x{\left (t \right )}\right ]$$
$$\left [ u_{0}, \quad v_{0}, \quad w_{0}, \quad x_{0}\right ]$$

Let us define our governing equations as eqs

In [3]:
eqs = [Eq(u.diff(t), -k*u), Eq(v.diff(t), k*u - l*v), Eq(w.diff(t), l*v - m*w), Eq(x.diff(t), m*w)]
eqs
Out[3]:
$$\left [ \frac{d}{d t} u{\left (t \right )} = - k u{\left (t \right )}, \quad \frac{d}{d t} v{\left (t \right )} = k u{\left (t \right )} - l v{\left (t \right )}, \quad \frac{d}{d t} w{\left (t \right )} = l v{\left (t \right )} - m w{\left (t \right )}, \quad \frac{d}{d t} x{\left (t \right )} = m w{\left (t \right )}\right ]$$

We will need som integration constants, let us call them C. We also create a list called sol to collect our analytic solutions as we go along

In [4]:
C = symbols('C:4')  # create C0, C1, C2, C3
solutions = [S('0')]*4  # sol is length 4 list of zeroes

Let's focus on the first equation and try an ansatz for $u(t)$

In [5]:
solutions[0] = inits[0]*exp(-k*t)  # our guess
eqs[0].subs({funcs[0]: solutions[0]})  # substitute our functions "u(t)" with our "u0*exp(-k*t)"
Out[5]:
$$\frac{\partial}{\partial t}\left(u_{0} e^{- k t}\right) = - k u_{0} e^{- k t}$$
In [6]:
# Last cells result is saved in the variable "_", let's evaluate that Equality
_.doit()  # if it returns True left hand side was equal to the right hand side
Out[6]:
$$\mathrm{True}$$

Let us verify that the initial value is reproduced for $t=0$:

In [7]:
solutions[0].subs({t: 0})
Out[7]:
$$u_{0}$$

and so we have, using SymPy, verified that our trial function was correct. This is convenient, let's make it a bit more general:

In [8]:
def discrepancy(eq):
    return (eq.lhs.doit() - eq.rhs.doit()).simplify()

def check_trial(trial, subsd=None):
    subsd = subsd or {}  # subsd is an optional dict of substitutions
    return [discrepancy(eq.subs(dict(zip(funcs, trial))).subs(subsd))==0 for eq in eqs]

Now let's see if our convience function check_trial returns True for the first element of sol:

In [9]:
check_trial(solutions)[:1]  # still only focusing on the first equation
Out[9]:
[True]

Great! Let's write a convenience function to check that each solution reproduces the correct inital value for $t=0$

In [10]:
def check_init(trial):
    return [tr.subs({t: 0}).simplify() == inits[idx] for idx, tr in enumerate(trial)]
check_init(solutions)[:1]
Out[10]:
[True]

no surprises there, we will use check_trial(solutions) and check_init(solutions) as we go along. Now, let us look at how the next differential equation is formulated now that we have an explicit expression for $u(t)$:

In [11]:
eqs[1].subs({funcs[0]: solutions[0]})
Out[11]:
$$\frac{d}{d t} v{\left (t \right )} = k u_{0} e^{- k t} - l v{\left (t \right )}$$

Integrating factor

The kind of differential equation above may be solved using an integrating factor. For an equation of type:

$ y' + P(t)y = Q(t) $

using an <a href=http://en.wikipedia.org/wiki/Integrating_factor>integrating factor</a>, M:

$ M = e^{\int_{0}^t P(s)ds} $

we have:

$ y = \frac{\int_0^t{MQ}}{M} + \frac{y(0)}{M} $

Identifying these terms in our differential equation for $v(t)$ leads to the following expressions:

In [12]:
M = exp(l*t)
Q = k*solutions[0]
v_integral = Integral(M*Q, (t, 0, t))/M + inits[1]/M
v_integral
Out[12]:
$$v_{0} e^{- l t} + e^{- l t} \int_{0}^{t} k u_{0} e^{- k t} e^{l t}\, dt$$

we can see that the expression for $y$ depends on whether $ k = l $ or $ k \neq l $, this is important as we will see quite soon. First let's solve the integral by hand for $k \neq l$

In [13]:
# for k != l, with pen and paper we integrate:
trial_neq = exp(-l*t)*(inits[1]+k*inits[0]*(exp(t*(l-k))/(l-k) - 1/(l-k)))
trial_neq
Out[13]:
$$\left(k u_{0} \left(\frac{e^{t \left(- k + l\right)}}{- k + l} - \frac{1}{- k + l}\right) + v_{0}\right) e^{- l t}$$
In [14]:
check_trial([solutions[0], trial_neq])[:2]
Out[14]:
[True, True]
In [15]:
check_init([solutions[0], trial_neq])
Out[15]:
[True, True]

Alright, trial_neq is a valid solution (True tells us with didn't mess up when doing this with pen and paper).

Now let's do the same for the (simpler) case $k = l$:

In [16]:
# for k == l, with pen and paper we integrate:
trial_eq = exp(-l*t)*(inits[1]+k*inits[0]*t)
trial_eq
Out[16]:
$$\left(k t u_{0} + v_{0}\right) e^{- l t}$$

For checking the case of $l=k$ we need to substitute $l$ with $k$ (or vise versa), the second input argument in our convenience function was added just for this reason:

In [17]:
check_trial([solutions[0], trial_eq], {l: k})[:2], check_init([solutions[0], trial_eq])
Out[17]:
([True, True], [True, True])

No surprises there: trial_eq is also a valid solution. We have now verified that our manual labour was done correctly. Next step is integrating $w$, we already realize that its closed form depends on wether $k = l$ holds, and we might (correctly) suspect that the same question will arise for $l = m$. We will in the following assume that $ k \neq l $, $ l \neq m $ and $ k \neq m $. This is is also an interesting case which has a general solution found by Bateman, which we will look closer at soon.

We found the integrals for our trial function by hand. We will now look into how we could have made SymPy do this for us.

In [18]:
vsol = v_integral.doit()
vsol
Out[18]:
$$v_{0} e^{- l t} + \left(- k u_{0} \left(\begin{cases} 0 & \text{for}\: l = k \\- \frac{1}{k - l} & \text{otherwise} \end{cases}\right) + k u_{0} \left(\begin{cases} t & \text{for}\: l = k \\- \frac{e^{l t}}{k e^{k t} - l e^{k t}} & \text{otherwise} \end{cases}\right)\right) e^{- l t}$$

SymPy correctly identifies the two solutions, now let's tell SymPy that we are interested in the solution for $k \neq l$.

In [19]:
eq_assumption = sympy.Q.is_true(Eq(l, k))
vsol_eq = refine(vsol, eq_assumption).simplify()
vsol_neq = refine(vsol, ~eq_assumption).simplify()
solutions[1] = vsol_eq
vsol_neq, vsol_eq
Out[19]:
$$\left ( \frac{e^{- t \left(k + l\right)}}{k - l} \left(k u_{0} \left(e^{k t} - e^{l t}\right) + v_{0} \left(k - l\right) e^{k t}\right), \quad \left(k t u_{0} + v_{0}\right) e^{- l t}\right )$$

Let us see if that is the same result as we got by hand:

In [20]:
(trial_neq - vsol_neq).simplify()
Out[20]:
$$0$$

Indeed it is.

Ok, now let's see if we can put all our steps for solving $v(t)$ into an algorithm and apply it for $w(t)$:

In [21]:
def integrate_using_integrating_factor(idx, **kwargs):
    # Note that this functions uses some global state (which is sub-optimal):
    #   inits, params, solutions, t
    M = exp(params[idx]*t)
    Q = params[idx-1]*solutions[idx-1]
    y_int = Integral(M*Q, (t, 0, t)).doit(**kwargs)/M + inits[idx]/M
    assumptions = None
    for idx2 in range(idx-1,-1,-1):
        # k != l != m != ...
        if assumptions == None:
            assumptions = ~sympy.Q.is_true(Eq(params[idx], params[idx2]))
        else:
            assumptions = assumptions & ~sympy.Q.is_true(Eq(params[idx], params[idx2]))
    return refine(y_int, assumptions)

Let's test this function it for $v(t)$:

In [22]:
integrate_using_integrating_factor(1).simplify()
Out[22]:
$$\frac{e^{- t \left(k + l\right)}}{k - l} \left(k u_{0} \left(e^{k t} - e^{l t}\right) + v_{0} \left(k - l\right) e^{k t}\right)$$

It turns out that the integration takes a very long time if we try to use this for $w(t)$, so let's see if we can help sympy by giving it a hint. We will use $v(t)$ as a benchmark

In [24]:
v_int_part = Integral(M*Q, (t, 0, t))
%timeit v_int_part.doit()
v_int_part.doit()
1 loop, best of 3: 217 ms per loop
Out[24]:
$$- k u_{0} \left(\begin{cases} 0 & \text{for}\: l = k \\- \frac{1}{k - l} & \text{otherwise} \end{cases}\right) + k u_{0} \left(\begin{cases} t & \text{for}\: l = k \\- \frac{e^{l t}}{k e^{k t} - l e^{k t}} & \text{otherwise} \end{cases}\right)$$

The sympy documentation tells us to look at the docstring of sympy.Integral._eval_integral. From there we learn that one algortihm "Meijerg" is supposed to be efficient for definite integrals so we give it a try:

In [25]:
v_int_part.doit(meijerg=True)
Out[25]:
$$- \frac{k u_{0} e^{l t}}{k e^{k t} - l e^{k t}} + \frac{k u_{0}}{k - l}$$

Note that meijerg=True caused sympy to miss the special case of $k=l$

Now let's see how fast it is:

In [26]:
%timeit v_int_part.doit(meijerg=True)
10 loops, best of 3: 67.4 ms per loop

More than twice as fast, let's try it for $w$

In [27]:
import time
time0 = time.time()
solutions[2] = integrate_using_integrating_factor(2, meijerg=True)  # Not passing meijerg = True takes too long..
print("integration took %3.1f seconds." % (time.time()-time0))
solutions[2]
integration took 1.8 seconds.
Out[27]:
$$w_{0} e^{- m t} + \left(- l \left(- \frac{k u_{0}}{l^{2} - 2 l m + m^{2}} - \frac{l v_{0}}{l^{2} - 2 l m + m^{2}} + \frac{m v_{0}}{l^{2} - 2 l m + m^{2}}\right) + l \left(- \frac{k l t u_{0} e^{- l t} e^{m t}}{l^{2} - 2 l m + m^{2}} + \frac{k m t u_{0} e^{- l t} e^{m t}}{l^{2} - 2 l m + m^{2}} - \frac{k u_{0} e^{- l t} e^{m t}}{l^{2} - 2 l m + m^{2}} - \frac{l v_{0} e^{- l t} e^{m t}}{l^{2} - 2 l m + m^{2}} + \frac{m v_{0} e^{- l t} e^{m t}}{l^{2} - 2 l m + m^{2}}\right)\right) e^{- m t}$$

We can try for $x(t)$ as well, it takes about 30s on my laptop:

In [28]:
import time
time0 = time.time()
solutions[3] = integrate_using_integrating_factor(3, meijerg=True)
print("integration took %3.1f seconds." % (time.time()-time0))
solutions[3]
integration took 773.5 seconds.
Out[28]:
$$- m \left(\frac{k l^{2} u_{0}}{l^{4} - 2 l^{3} m + l^{2} m^{2}} - \frac{k l m u_{0}}{l^{4} - 2 l^{3} m + l^{2} m^{2}} - \frac{k l u_{0}}{l^{2} m - 2 l m^{2} + m^{3}} + \frac{k l u_{0}}{l^{3} - 2 l^{2} m + l m^{2}} - \frac{l^{2} v_{0}}{l^{2} m - 2 l m^{2} + m^{3}} + \frac{l^{2} v_{0}}{l^{3} - 2 l^{2} m + l m^{2}} + \frac{l m v_{0}}{l^{2} m - 2 l m^{2} + m^{3}} - \frac{l m v_{0}}{l^{3} - 2 l^{2} m + l m^{2}} - \frac{w_{0}}{m}\right) + m \left(\frac{k l^{3} t u_{0}}{l^{4} e^{l t} - 2 l^{3} m e^{l t} + l^{2} m^{2} e^{l t}} - \frac{k l^{2} m t u_{0}}{l^{4} e^{l t} - 2 l^{3} m e^{l t} + l^{2} m^{2} e^{l t}} + \frac{k l^{2} u_{0}}{l^{4} e^{l t} - 2 l^{3} m e^{l t} + l^{2} m^{2} e^{l t}} - \frac{k l m u_{0}}{l^{4} e^{l t} - 2 l^{3} m e^{l t} + l^{2} m^{2} e^{l t}} - \frac{k l u_{0}}{l^{2} m e^{m t} - 2 l m^{2} e^{m t} + m^{3} e^{m t}} + \frac{k l u_{0}}{l^{3} e^{l t} - 2 l^{2} m e^{l t} + l m^{2} e^{l t}} - \frac{l^{2} v_{0}}{l^{2} m e^{m t} - 2 l m^{2} e^{m t} + m^{3} e^{m t}} + \frac{l^{2} v_{0}}{l^{3} e^{l t} - 2 l^{2} m e^{l t} + l m^{2} e^{l t}} + \frac{l m v_{0}}{l^{2} m e^{m t} - 2 l m^{2} e^{m t} + m^{3} e^{m t}} - \frac{l m v_{0}}{l^{3} e^{l t} - 2 l^{2} m e^{l t} + l m^{2} e^{l t}} - \frac{w_{0}}{m} e^{- m t}\right) + x_{0}$$

Ok, now the assumptions are getting tricky (or rather, SymPy's way to handle them here makes life hard for us). Let us therefore abandon this approach for a little while and look at the Bateman Equation.

Bateman Equation

In his paper from 1910 Bateman solves the system of differential equations by first taking the Laplace transform of the dependent variables (for u: $U(t) = \int_0^\infty e^{-ts}u(s)ds$). For the case of no source terms, and where there are no daughters at $t=0$ the solution is (indexing starts at 0):

$$ N_i = N_0(0) \left( \prod_{j=0}^{i-1} \lambda_j \right) \sum_{k=0}^{i} \frac{ e^{-\lambda_k t} }{ \prod_{l=0,l\neq k}^{i} \lambda_l - \lambda_k } $$

Let us impement that equation as bateman_parent() indicating it is valid for a system starting with no daughters:

In [29]:
def bateman_parent(lmbd, t, one=1, zero=0, exp=None):
    n = len(lmbd)
    N = [None]*n
    lmbd_prod = one
    if exp == None:
        import math
        exp = math.exp
    for i in range(n):
        if i > 0:
            lmbd_prod *= lmbd[i-1]
        sum_k = zero
        for k in range(i+1):
            prod_l = one
            for l in range(i+1):
                if l == k:
                    continue
                prod_l *= lmbd[l] - lmbd[k]
            sum_k += exp(-lmbd[k]*t)/prod_l
        N[i] = lmbd_prod*sum_k
    return N

For a single decay with unit initial number density we get:

In [30]:
[expr.simplify() for expr in bateman_parent([k, S('0')], t, S('1'), S('0'), sympy.exp)]
Out[30]:
$$\left [ e^{- k t}, \quad 1 - e^{- k t}\right ]$$

So that looks promising, let's write a wrapping function bateman_full() for the general solution with possibly finite initial daughter concentrations. One may quickly realize that the expression must be a linear combination of (shorter) decay chains where the indiviual chains represent systems with no daughters:

In [31]:
def bateman_full(y0s, lmbd, t, one=1, zero=0, exp=None):
    n = len(lmbd)
    if len(y0s) != n:
        raise ValueError("Please pass equal number of decay"
                         " constants as initial concentrations"
                         " (you may want to pad lmbd with zeroes)")
    N = [zero]*n
    for i, y0 in enumerate(y0s):
        if y0 == zero:
            continue
        Ni = bateman_parent(lmbd[i:], t, one, zero, exp)
        for j, yj in enumerate(Ni, i):
            N[j] += y0*yj
    return N

Applying bateman_full(...) on our initial problem gives us the following solutions:

In [32]:
bsol = [expr.simplify() for expr in bateman_full(inits, params, t, S('1'), S('0'), sympy.exp)]
for func, bs in zip(funcs, bsol):
    display(Eq(func, bs))
$$u{\left (t \right )} = u_{0} e^{- k t}$$
$$v{\left (t \right )} = \frac{e^{- t \left(k + l\right)}}{k - l} \left(k u_{0} e^{k t} - k u_{0} e^{l t} + k v_{0} e^{k t} - l v_{0} e^{k t}\right)$$
$$w{\left (t \right )} = k l u_{0} \left(\frac{e^{- m t}}{\left(k - m\right) \left(l - m\right)} - \frac{e^{- l t}}{\left(k - l\right) \left(l - m\right)} + \frac{e^{- k t}}{\left(k - l\right) \left(k - m\right)}\right) + l v_{0} \left(\frac{e^{- m t}}{l - m} - \frac{e^{- l t}}{l - m}\right) + w_{0} e^{- m t}$$
$$x{\left (t \right )} = - k l m u_{0} \left(\frac{e^{- m t}}{m \left(k - m\right) \left(l - m\right)} - \frac{e^{- l t}}{l \left(k - l\right) \left(l - m\right)} + \frac{e^{- k t}}{k \left(k - l\right) \left(k - m\right)} - \frac{1}{k l m}\right) + l m v_{0} \left(- \frac{e^{- m t}}{m \left(l - m\right)} + \frac{e^{- l t}}{l \left(l - m\right)} + \frac{1}{l m}\right) + m w_{0} \left(\frac{1}{m} - \frac{1}{m} e^{- m t}\right) + x_{0}$$

We note that the equations assume $k>0, l>0, m>0$ and all unique.

Let's verify that the solutions satisify our differential equations and our initial value problem:

In [33]:
check_trial(bsol), check_init(bsol)
Out[33]:
([True, True, True, True], [True, True, True, True])

Let's plot the solutions, first we need some fast callbacks which we generate with the SymPy function lambdify:

In [34]:
callbacks = [lambdify((t, params[:-1], inits), bs) for bs in bsol]

Let's assert that the callback for $u(t)$ is calculating the correct answer for k=1, t=1.0, u0=3.0:

In [35]:
callbacks[0](1.0, [1, 2, 3], [3, 2, 1, 0]), sympy.N(3*exp(-1), 12)
Out[35]:
$$\left ( 1.10363832351, \quad 1.10363832351\right )$$

Let's define some convenience functions for calculating arrays of solutions and plotting those:

In [36]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def solution_vectors(t0=0, tend=4, nt=100, param_vals=None, init_vals=None, modules=None):
    callbacks = [lambdify((t, params[:-1], inits), bs, modules) for bs in bsol]
    tarr_ = np.linspace(t0, tend, nt)
    params_ = param_vals or (1, 2, 3)
    inits_ = init_vals or (3, 2, 1, 0)
    ysols_ = [[cb(t_, params_, inits_) for t_ in tarr_] for cb in callbacks]
    return tarr_, np.asarray(ysols_)

def plot_solutions(t_vals, y_vals):
    plt.figure(figsize=(8,5))
    for idx, ysol_ in enumerate(y_vals):
        plt.plot(t_vals, ysol_, label='$%s$' % latex(funcs[idx]))
    plt.xlabel('t')
    plt.ylabel('N')
    plt.legend(loc='best', prop={'size': 16})

And let's put those helper functions to work:

In [37]:
t_vals, y_vals = solution_vectors()
plot_solutions(t_vals, y_vals)

So, this look great. Bateman's equation does however exhibit one unpleasent feature when implemented in algortihms using finite precision floating point arithmetics: if the decay constants are almost equal catastrophic cancellation may occur which induce major loss in precision (double precision numbers are 64 bit long). Let's see what that might look like:

In [38]:
double_precision = solution_vectors(param_vals=(3, 3-1e-7, 3+1e-7))
plot_solutions(double_precision[0], double_precision[1])

We see noise in the solution curves stemming from cancellation. SymPy can use arbitrary precision arithmetics, we will now look at how that affects the solution trajectories. First let's see if we can use "sympy" as the module for lambdify:

In [39]:
multi_precision = solution_vectors(modules='sympy', param_vals=(S(3), 3-S('10**-7'), 3+S('10**-7')))
plot_solutions(multi_precision[0], multi_precision[1])

No luck there, there seems as if there is an implicit conversion to double precision somewhere(?), we try to go the extra mile and write our own callbacks without using lambdify:

In [40]:
def callback_factory(bs):
    def callback(t_, params_, inits_):
        return bs.subs(dict([(t, t_)] + list(zip(params[:-1], params_)) + list(
                zip(inits, inits_))))
    return callback
    
def calc_solutions_mp(t0=0, tend=4, nt=100, param_vals=None, init_vals=None, modules=None):
    callbacks = []
    for bs in bsol:
        callbacks.append(callback_factory(bs))
    tarr_ = [t0 + i*(tend-t0)/S(nt-1) for i in range(nt)]
    params_ = param_vals or (1, 2, 3)
    inits_ = init_vals or (3, 2, 1, 0)
    ysols_ = [[cb(t_, params_, inits_) for t_ in tarr_] for cb in callbacks]
    return tarr_, np.asarray(ysols_)
In [41]:
multi_precision = calc_solutions_mp(param_vals=(S(3), 3-S('10**-7'), 3+S('10**-7')))

plot_solutions(multi_precision[0], multi_precision[1])

Success!

The generation of solution above is significantly slower due to the extra overhead of arbitrary precision arithmetics.

Another remedy to the problem is to integrate the system of ordinary differential equations numerically using an ODE solver. But that is a whole topic in its own right and this is about as much as I aspired to show in this notebook.

Creative Commons License
<span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/InteractiveResource" property="dct:title" rel="dct:type">Linear decay chains and the Bateman equation</span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Björn I. Dahlgren</span> is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Dealing with temporaries in bash.rst

It is quite common that I want to create a temporary file or directory during the exectuion of a bash script. For this we may use mktemp and mktemp -d respectively.

The problem that arises is that you are not sure that your clean up code gets executed if the user aborts the script by e.g. hitting Ctrl+c. An elegant solution was given here http://unix.stackexchange.com/a/149093/119480

to illustrate the solution let's consider the following script:

#!/bin/bash
TMPDIR=$(mktemp -d)
echo $TMPDIR
cleanup() {
    rm -r $TMPDIR
    exit
}
trap cleanup INT TERM
echo "A" | tee $TMPDIR/a.txt
sleep 3  # Try hitting Ctrl-C here (use Ctrl-Z to inspect files)
echo "B" | tee $TMPDIR/b.txt
cleanup
echo "C" | tee $TMPDIR/b.txt # will never be reached

The above code removes the temporary directory created at exit, no matter if the script is allowed to finnish or aborted by the user invoking kill on the PID (defaults to SIGTERM) or hits Ctrl+C (sends the SIGINT signal).

Reproducible research with Docker

When I analyse data in my research I often tend to write small scripts for each of the steps. Those scripts often rely on 3rd party packages being installed. It has happened more than once that a few years down the road those scripts no longer work since the 3rd party package I was depending on changed its API. Until now I have looked at it as a hopeless (and time consuming) battle...

Docker to the rescue

Docker is a tool which has gained a tremendous traction in the web community recently. However, it can also be tamed into a useful tool for scientific computing.

I will try to guide you through the steps necessary to set things up. I will assume familiarity with Linux, git, bash and basic idea of how Docker works.

Separate raw from generated data (input vs. output)

Here are a few things I found made my life much easier when trying to do reproducible research (based on scripts):

  1. Distinguish between input and output and make sure that generated data is built out-of-source.
  2. Use version control (from day one) for the input and the environment description.
  3. Rebuild from scratch frequently to avoid introducing hidden dependencies. Preferably: for each commit on a different host (continuous integration)

We will start simple with a contrived example. Say we have:

  1. Some raw data of relative humidity ("rh") that is believed to be normally distributed.
  2. A small Python script for calculating the mean and standard deviation (in reality the analysis might be something more novel).
  3. A script for generating a "report" of said analysis.

Let's set things up from the command line:

$ mkdir rh; cd rh
$ mkdir input output environment
$ echo output/ >.gitignore
$ cat <<EOF >input/data.txt
23.16
22.87
25.23
25.01
23.93
24.56
EOF

OK, the data is in place, let's write a script to do the analysis:

$ cat <<EOF >input/analysis.py
#!/usr/bin/env python
import numpy as np

def stddev(srcpath='data.txt', format='{0:12.5g} {1:12.5g}'):
    arr = np.loadtxt(srcpath)
    avg = np.mean(arr)
    s = (np.sum((arr-avg)**2/len(arr)))**0.5
    return format.format(avg, s)

if __name__ == '__main__':
    import argh
    argh.dispatch_command(stddev)

EOF
$ chmod +x ./input/analysis.py

Now let's write a script for generating the report.

$ cat <<'EOF' >input/generate_report.sh
#!/bin/bash
AVG=$(./analysis.py --format '{0:12.5g}')
DEV=$(./analysis.py --format '{1:12.5g}')
echo "The average of the data was $AVG and the standard deviation
was estimated to be $DEV">../output/report.txt
EOF
$ chmod +x ./input/generate_report.sh

Alright, now let's try to make make this procedure reproducible by locking down all versions used (i.e. versions of Python, NumPy, bash). We will do so by relying on a specific (long term support) Linux distribution.

$ cat <<'EOF' >environment/Dockerfile
FROM ubuntu:14.04.2
ENV DEBIAN_FRONTEND noninteractive

RUN apt-get update && \
    apt-get --quiet --assume-yes install \
    python-numpy python-argh && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
EOF

And now let's write a small script orchestrating the full process:

$ cat <<'EOF' >generate_output.sh
#!/bin/bash -x
MY_DOCKER_IMAGE="./environment"
MY_HASH=$(docker build $MY_DOCKER_IMAGE | tee /dev/tty | tail -1 | cut -d' ' -f3)
docker run --rm -v $(pwd)/input:/input -v $(pwd)/output:/output -w /input -i $MY_HASH ./generate_report.sh

EOF
$ chmod +x generate_output.sh

And that's it. Obviously the relative overhead of tracking all the dependencies for such a small example is ridiculously high but most of the above code is of boiler plate character and may easily be copied between projects. Now let's make sure the script works (note that docker requires root privileges so use sudo or add your user to "docker" group):

$ sudo ./generate_output.sh
$ cat ./output/report.txt
The average of the data was       24.127 and the standard deviation
was estimated to be      0.88861

Great, so this would be a good point to set up version control, e.g. by using git:

$ git init .
$ git add -A
$ git commit -m "Initial commit"

This should make the analysis reproducible for the foreseeable future (we are assuming that both Docker, the Ubuntu 14.04.2 image and the Ubuntu 14.04 package repositories will stay around indefinitely which probably isn't true).

Room for improvement

By using docker we can get more benefits for free. We can avoid involuntarily relying on internet access during the generation of our output by passing the flag --net='none' to docker run.

We can also force the ./input directory to be read-only during the build process to better enforce the distinction between raw and generated data.

docker run --rm -v $(pwd)/input:/input:ro -v $(pwd)/output:/output -w /input --net='none' -i $MY_DOCKER_IMAGE ./generate_report.sh

One thing you will soon notice is that docker runs as UID=0 (root), which means that files generated in output will not be owned by your current user. One way to circumvent this is to have a small script in input/ setting the appropriate ownership after having run the ./generate_report.sh script. We will need to provide our preferred UID and GID to the docker image through the use of environment variables:

docker run --rm -e HOST_UID=$(id -u) -e HOST_GID=$(id -g) -v $(pwd)/input:/input:ro -v $(pwd)/output:/output -w /input -i $MY_DOCKER_IMAGE ./entrypoint.sh
#!/bin/bash
# this is entrypoint.sh
./generate_report.sh
chown -R $HOST_UID:$HOST_GID ../output

Installing latest docker on Ubuntu 14.04

To install the latest version of Docker in Trusty you may proceed as follows:

$ wget -qO- https://get.docker.io/gpg | sudo apt-key add -
$ sudo sh -c "echo deb http://get.docker.io/ubuntu docker main > /etc/apt/sources.list.d/docker.list"
$ sudo apt-get update
$ sudo apt-get install lxc-docker apparmor
$ sudo docker -d &

Finally a new website!

I am now since a little more than two years doing a PhD in Chemistry with focus on radiation induced processes at the solid/water interface.

On this blog I hope to - if time permits - write about some of the problems of more general interest (and their solution) popping up as I am striving forward.

On a technical note, this blog is staticially generated using Nikola (EDIT 2016-07-27: It used to be Pelican). The source for this site is hosted at github.