Guest posting: Conda as a new standard for Galaxy tool dependencies

Björn Grüning

Björn Grüning modelling one of our t-shirts.

Biomedical research now generates massive amounts of diverse data. Managing these data and extracting useful information from them requires bioinformatic solutions, and that means developing software. Tool development should always follow a similar process. First, source code is written to answer a scientific question or meet a need; this code can be distributed as it is. To ease deployment and simplify installation, the code is then packaged in one or more package formats. The tool is then deployed and used by its target users. Ideally, documentation, training and support are also provided to help users adopt the tool and spread the word about it.

This process, from development to support, is the golden path to developing a tool for wider adoption and reuse. But many bioinformatic tools suffer from deployment and sustainability problems. What bioinformatician has not dealt with missing tool dependencies, or with an older version of a tool that could not be installed for one reason or another? Deployment and sustainability of tools are therefore major issues for productivity and reproducibility in science [see GigaScience’s “Ten recommendations for software engineering in research” for more on this topic].

To address deployment, we need a package manager that is agnostic to operating system and programming language, as bioinformatic tools are developed in a myriad of languages and for a myriad of operating systems, including ancient ones. Moreover, all available packages have to be permanently cached so that they are always reachable, which enables reproducibility.

One community effort to create a flexible, scalable and sustainable system to fix the tool deployment problem once and for all is Bioconda, and here I describe work over the last six months integrating Conda packages as a new standard for tool dependencies in Galaxy. Not only do Conda packages make tool dependencies in Galaxy more reliable and stable, they are also easier to test and faster to develop than previous processes.

Galaxy tools (also called wrappers) traditionally use Tool Shed package recipes to install their dependencies. At the tool’s installation time the recipe is downloaded and executed in order to provide the underlying software executables. Introducing these Galaxy-specific recipes was a necessary step at the time; however, there are now more mature and stable options for installing software in a similar manner. The Galaxy community has taken steps to improve the tool dependency system in order to enable new features and expand its reach.
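To make the idea of a tool dependency concrete: a wrapper lists what it needs in requirement tags, and each tag can be read as a Conda package spec. The following is only a minimal sketch under stated assumptions, not Galaxy's actual resolver code. The wrapper content, the tool id and the versions are invented for illustration, and `conda_specs` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal tool wrapper used only for illustration.
TOOL_XML = """
<tool id="example_tool" name="Example tool" version="0.1.0">
    <requirements>
        <requirement type="package" version="1.3.1">samtools</requirement>
        <requirement type="package" version="2.25.0">bedtools</requirement>
    </requirements>
</tool>
"""


def conda_specs(tool_xml):
    """Return 'name=version' specs for every package requirement in a wrapper."""
    root = ET.fromstring(tool_xml)
    specs = []
    for req in root.iter("requirement"):
        if req.get("type") != "package":
            continue  # only package requirements map onto Conda packages here
        name = (req.text or "").strip()
        version = req.get("version")
        # Without a version, fall back to the bare package name.
        specs.append(f"{name}={version}" if version else name)
    return specs


print(conda_specs(TOOL_XML))
```

Each resulting string has the `name=version` form that Conda accepts on the command line, which is why a versioned requirement in a wrapper is enough to pin an exact dependency.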

Bioconda is a community that packages bioinformatics software for Conda, an open source package manager. A single Conda package recipe can be used on a wide range of operating systems (Windows, Linux, OS X) with no restrictions regarding the tool’s programming language. Installation of Conda packages is fast and robust. No root privileges are required, and multiple versions of any software package can be installed and managed in parallel. Thanks to extensive documentation, writing a Conda package is very simple, which makes community contributions easier. The packages are also stored long-term in a public repository (Cargo Port), the distribution center of the Galaxy Project, which resolves the sustainability issue. Moreover, a technique called “layer donning” has recently been introduced to build Docker containers automatically and very efficiently for all Conda packages.
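Installing several versions side by side works because each version can live in its own Conda environment. As a rough sketch, the snippet below only builds the command lines and does not run them. The environment naming scheme and the `create_env_command` helper are assumptions made for this example, and the samtools versions are illustrative.

```python
import shlex


def create_env_command(package, version, channel="bioconda"):
    """Build a `conda create` command that puts one package version in its own environment."""
    env_name = f"{package}-{version}"  # hypothetical naming scheme, not Galaxy's
    return [
        "conda", "create", "--yes",
        "--name", env_name,
        "--channel", channel,
        f"{package}={version}",
    ]


# Two versions of the same tool, each isolated in a separate environment.
for version in ("1.2", "1.3.1"):
    print(shlex.join(create_env_command("samtools", version)))
```

Because neither environment touches system directories, the commands can be run by an unprivileged user, and removing one environment leaves the other untouched.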

Developing Conda packages through the Bioconda community eases the packaging and deployment of bioinformatic tools. The interface with Cargo Port provides sustainability by mirroring all sources. Building efficient Linux containers automatically adds a further layer of abstraction and isolation from the base system. Thanks to these projects and the communities collaborating on them, bioinformatics tools can be easily packaged and deployed, and will remain available to help biomedical research.

As a community, we have decided that Conda is the package manager that best fulfils our needs. The following are some of the crucial Conda features that led to this decision:

  • Installation of packages does not require root privileges (installation at any location the Galaxy user has write access to)
  • Multiple versions of software and corresponding dependencies can be installed in parallel
  • Ready for High Performance Computing
  • Faster and more robust package installations through pre-compiled packages (no build environment complications)
  • Independent of programming language (works with R, Perl, Python, Julia, Java, pre-compiled binaries, and more)
  • Easy to write package recipes (1 YAML description file + 1 install script)
  • An active, large and growing community (with more and more software authors managing their own recipes)
  • Extensive documentation, including the Conda documentation and the Conda quick-start guide
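The "1 YAML description file + 1 install script" point in the list above can be illustrated with a small generator for a recipe skeleton. This is a sketch under stated assumptions rather than a Bioconda template: the tool name `mytool`, its URLs, the zlib dependency and the configure/make build steps are all hypothetical, and a real recipe would normally also pin a checksum for the source archive.

```python
from pathlib import Path
import tempfile

# The YAML description file: what the package is, where its source lives,
# what it needs, and how to check that the installation works.
META_YAML = """\
package:
  name: {name}
  version: "{version}"

source:
  url: {url}

requirements:
  build:
    - zlib
  run:
    - zlib

test:
  commands:
    - {name} --help

about:
  home: {home}
  summary: {summary}
"""

# The install script: conda-build sets $PREFIX to the target environment.
BUILD_SH = """\
#!/bin/bash
./configure --prefix="$PREFIX"
make
make install
"""


def write_recipe(recipe_dir, name, version, url, home, summary):
    """Write meta.yaml and build.sh into recipe_dir and return both paths."""
    recipe_dir = Path(recipe_dir)
    recipe_dir.mkdir(parents=True, exist_ok=True)
    meta = recipe_dir / "meta.yaml"
    build = recipe_dir / "build.sh"
    meta.write_text(META_YAML.format(
        name=name, version=version, url=url, home=home, summary=summary))
    build.write_text(BUILD_SH)
    return meta, build


with tempfile.TemporaryDirectory() as tmp:
    meta_path, build_path = write_recipe(
        Path(tmp) / "mytool",
        name="mytool",  # hypothetical tool
        version="1.0",
        url="https://example.org/mytool-1.0.tar.gz",
        home="https://example.org/mytool",
        summary="Hypothetical example tool",
    )
    print(meta_path.read_text())
```

The two generated files are everything the recipe consists of, which is what keeps the barrier to contributing a new package low.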

For more technical details we have FAQ documentation on the Galaxy website, where Lance Parsons has collected some of the most common questions (including how to enable Conda for existing Galaxy installations, how Conda dependencies work, requirements, finding packages, and more). If this list does not solve your problem or you have any trouble following the instructions, please ask on the Galaxy mailing list or the IRC channel.

One last thing that is important to me: the entire Conda project is an amazing community effort. This is true not only within the Galaxy community; we team up with the Conda community as well, in particular with the Bioconda folks. Thanks to Johannes Köster, Ryan Dale and the entire community around this package manager, who have created around 1500 bioinformatics packages in the last year alone.

Thanks to John Chilton, who wrote the Galaxy Conda integration last December and is pushing tool development in Galaxy to a new level. Many thanks to Nicola Soranzo for his constant support and reviews in all the projects mentioned; he is everywhere to help, as you will see! Thanks also to Peter van Heusden, Marius van den Beek and Brad Langhorst, who have worked hard on the Conda-Galaxy integration to make it shine. I want to thank Lance Parsons for the questions that inspired me to write this down and for his constant support in making tools better in Galaxy. I hope that with Conda you will have less pain with tool dependencies.

Last but not least, I would like to thank the entire IUC and the more than 50 IUC contributors who have migrated most of the IUC tools to Conda packages over the last six months.

Enjoy Galaxy + Conda,

Björn Grüning, on behalf of The Intergalactic Utilities Commission.

For more on the efforts between GigaScience and the Galaxy community see this blog. We also have our GigaGalaxy.net server for presenting and hosting the computational outputs and methods of studies published in GigaScience, as well as our related Galaxy Series. Please contact us if you are interested in submitting your Galaxy related work to us.

http://www.gigasciencejournal.com/series/Galaxy