Ask HN: Tools/approaches to automate complicated software pipelines?

3 points

16 years ago

What approaches and tools do you use to help organize and run complicated pipelines and batch jobs? For example, I'm making pipelines for bioinformatics analyses, which we want to be maintainable, self-documenting, reproducible, and easy for end-users to run. These would run a large number of different programs and scripts on some input data, potentially with different permutations of parameters.

Some of the kinds of features that would be nice are:

* Running jobs multiple times with different parameter settings
* Branches, so that later parts of a pipeline can be run with different parameters while sharing results from an earlier stage, or running the same analysis on inputs generated using different methods
* Running jobs on a cluster / integration with LSF
* Logging runs and keeping track of results
* Keeping track of default/standard parameters and recording documentation of options
* Tracking success of stages and automatically re-running if necessary
* Tracking program/data dependencies and versions
* Control / reporting through a web interface as well as the command line
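To make the first few items concrete, here is a minimal sketch in Python of what I mean by parameter sweeps with branching. It only illustrates the idea and is not taken from any existing tool. The stage names (`align`, `call_variants`), their parameters, and the `Pipeline` class are all hypothetical. Each run is identified by a hash of the stage name, its parameters, and the upstream run it depends on, so later stages can branch over different settings while reusing one earlier result.

```python
import hashlib
import itertools
import json


def run_key(stage, params, upstream_key=""):
    """Stable identifier for one run: stage name + parameters + upstream run."""
    payload = json.dumps([stage, params, upstream_key], sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()[:12]


class Pipeline:
    """Hypothetical in-memory runner that caches results and logs every request."""

    def __init__(self):
        self.results = {}  # run key -> stage output
        self.log = []      # (stage, params, key, status) records

    def run(self, stage, func, params, upstream_key="", upstream=None):
        key = run_key(stage, params, upstream_key)
        if key in self.results:
            # Same stage, parameters, and upstream run: reuse the stored result.
            self.log.append((stage, params, key, "cached"))
        else:
            self.results[key] = func(upstream, **params)
            self.log.append((stage, params, key, "ran"))
        return key


# Hypothetical stages; real ones would call external programs.
def align(_, gap_penalty):
    return f"alignment(gap={gap_penalty})"


def call_variants(alignment, min_quality):
    return f"variants({alignment}, q>={min_quality})"


def sweep(pipeline, gap_penalties, min_qualities):
    """Run every parameter combination, sharing each alignment across branches."""
    final_keys = []
    for gap, qual in itertools.product(gap_penalties, min_qualities):
        a = pipeline.run("align", align, {"gap_penalty": gap})
        v = pipeline.run(
            "call_variants", call_variants, {"min_quality": qual},
            upstream_key=a, upstream=pipeline.results[a],
        )
        final_keys.append(v)
    return final_keys
```

Calling `sweep(Pipeline(), [1, 2], [10, 20, 30])` requests six combinations but only executes `align` twice. The log records which requests ran and which were served from the cache. A real system would also need persistent storage, cluster submission, and failure handling, which this sketch leaves out.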

Obviously all of this can be done with shell/perl/python/etc. scripts, and with make/Ant (although I think there is more to this than managing dependencies in the way make is usually used). However, there's a tradeoff between the simplicity of a static script and the flexibility of a generic system, and writing a generic system with these features is not a trivial exercise. Making specialized scripts for each pipeline is an option, but it risks degenerating into a mess of special cases that gets harder to document and control.
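For reference, the dependency rule that make is usually used for is easy to write down, which is part of why I think it only covers a piece of the problem. A rough Python sketch, where `is_stale` and `build` are names I made up for illustration:

```python
import os


def is_stale(target, sources):
    """True if target is missing or older than any source (make's timestamp rule)."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def build(target, sources, action):
    """Run action(sources, target) only when target is stale; return whether it ran."""
    if is_stale(target, sources):
        action(sources, target)
        return True
    return False
```

This rebuilds when an input file changes, but it knows nothing about parameter settings, program versions, or whether the previous run actually succeeded, so those would have to be tracked some other way.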

One very lightweight tool I came across that looks interesting and seems to have some of this functionality is ruffus (http://code.google.com/p/ruffus/). I'm surprised that there haven't been more efforts like that.

What other approaches and tools do you like to use?