Some of the kinds of features that would be nice are:

* Running jobs multiple times with different parameter settings
* Branches, so that later parts of a pipeline can be run with different parameters while sharing results from an earlier stage, or running the same analysis on inputs generated using different methods
* Running jobs on a cluster / integration with LSF
* Logging runs and keeping track of results
* Keeping track of default/standard parameters and recording documentation of options
* Tracking success of stages and automatically re-running if necessary
* Tracking program/data dependencies and versions
* Control / reporting through a web interface as well as the command line
Obviously all of this can be done with shell/Perl/Python/etc. scripts, and with make/Ant (although I think there is more to this than managing dependencies in the way make is usually used). However, there's a tradeoff between the simplicity of a static script and flexibility, and writing a generic system with such features does not seem to be an entirely trivial exercise. Making specialized scripts for each pipeline is an option, but it risks degenerating into a mess of special cases that gets harder to document and control.
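To make the script-based option concrete, here is a minimal sketch of how a plain Python script might cover a few of the features listed earlier: running a stage with several parameter settings, branching from a shared earlier stage, logging what ran, and skipping stages whose results already exist. Everything in it (`run_stage`, `filter_reads`, `count_kmers`, `run_demo`, the toy reads, the JSON cache files) is hypothetical and uses only the standard library; it is not how ruffus or any other tool does it.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_stage(name, func, params, upstream, workdir, log):
    """Run one stage, or reuse its cached result if one exists.

    The cache key is built from the stage name, its parameters, and the
    key of the upstream result, so changing an earlier stage's
    parameters also invalidates everything downstream of it.
    """
    blob = json.dumps(
        {"stage": name, "params": params, "upstream": upstream["key"]},
        sort_keys=True,
    )
    key = hashlib.sha1(blob.encode()).hexdigest()[:12]
    out = Path(workdir) / f"{name}-{key}.json"
    if out.exists():
        value = json.loads(out.read_text())
        log.append((name, key, "cached"))
    else:
        value = func(upstream["value"], **params)
        out.write_text(json.dumps(value))
        log.append((name, key, "ran"))
    return {"key": key, "value": value}


def filter_reads(reads, min_len):
    """Toy first stage: keep reads of at least min_len characters."""
    return [r for r in reads if len(r) >= min_len]


def count_kmers(reads, k):
    """Toy second stage: count every substring of length k."""
    counts = {}
    for r in reads:
        for i in range(len(r) - k + 1):
            kmer = r[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def run_demo(workdir):
    """One shared filtering stage, then two branches with different k."""
    log = []
    raw = {"key": "raw-input", "value": ["ACGT", "AC", "ACGTAC"]}
    filtered = run_stage("filter", filter_reads, {"min_len": 4}, raw, workdir, log)
    results = {}
    for k in (2, 3):
        branch = run_stage("kmers", count_kmers, {"k": k}, filtered, workdir, log)
        results[k] = branch["value"]
    return results, log


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        for attempt in (1, 2):
            results, log = run_demo(workdir)
            print(attempt, [status for _, _, status in log])
```

On the second pass every stage should be reported as cached, because nothing about the inputs or parameters changed. Even this toy version leaves out cluster submission, program versions, and failure handling, which is where the special cases start to pile up.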
One very lightweight tool I came across that looks interesting and seems to have some of this functionality is ruffus (http://code.google.com/p/ruffus/). I'm surprised that there haven't been more efforts like it.
What other approaches and tools do you like to use?