ioProc

ioProc is a lightweight workflow manager for Python that ensures robust, scalable and reproducible data pipelines. The tool is developed at the German Aerospace Center (DLR) for, and in the scientific context of, energy systems analysis. It is, however, widely applicable in other scientific fields.

What ioProc can do for you

ioProc is a workflow manager focused on reproducibility, transparency, reusability and maintainability. This focus on core principles of FAIR and on the values of scientific research distinguishes ioProc from other tools of its kind and helps explain many of its design decisions.

From a technical point of view, ioProc is a workflow manager operating on the library level. Whereas typical workflow managers such as Snakemake facilitate workflows on the application level, ioProc focuses on creating workflows on the level of a function library.

It allows Python functions that adhere to the ioProc specifications to be combined flexibly into workflows, which are declared via YAML files.

It thereby serves the same purpose as application-level workflow tools, but at the level of the functions inside a piece of software, and is especially useful for data processing pipelines.
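The idea of a workflow on the function library level can be sketched in a few lines of plain Python. This is a minimal illustration only and not ioProc's actual API: the `register` decorator, the `ACTIONS` registry, the `(data, params)` signature and the layout of the workflow description are all assumptions made for this example. To keep the sketch free of dependencies, the workflow is written as the nested structure a YAML parser would return rather than read from a YAML file.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping action names to Python functions.
# ioProc's real registration mechanism may look different.
ACTIONS: Dict[str, Callable[[Dict[str, Any], Dict[str, Any]], None]] = {}


def register(func):
    """Make a function available to workflows under its own name."""
    ACTIONS[func.__name__] = func
    return func


@register
def load_values(data, params):
    # Store the configured values under the configured key.
    data[params["target"]] = list(params["values"])


@register
def scale(data, params):
    # Read one entry, multiply each value, store the result under a new key.
    data[params["target"]] = [v * params["factor"] for v in data[params["source"]]]


# The structure a YAML parser would produce for a (hypothetical) workflow file:
#   workflow:
#     - action: load_values
#       params: {target: raw, values: [1, 2, 3]}
#     - action: scale
#       params: {source: raw, target: scaled, factor: 10}
WORKFLOW = {
    "workflow": [
        {"action": "load_values", "params": {"target": "raw", "values": [1, 2, 3]}},
        {"action": "scale", "params": {"source": "raw", "target": "scaled", "factor": 10}},
    ]
}


def run(workflow):
    """Process the listed actions one after another on a shared data store."""
    data: Dict[str, Any] = {}
    for step in workflow["workflow"]:
        ACTIONS[step["action"]](data, step["params"])
    return data


result = run(WORKFLOW)
print(result)
```

In this sketch the order and parameters of the steps live entirely in the declarative description, so a different pipeline can be built from the same functions without touching their source code.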

What is a workflow manager for science?

Scientific workflows, especially in the early stages, are traditionally either scripts or Jupyter notebooks. Both formats share a linear, sequential flow (unless special libraries for concurrency or parallelization are used). These scientific workflows usually focus on one specific goal, such as creating the data analysis for a publication. Such artifacts are usually shared between scientists by passing on the document in its entirety. Reuse of these scripts and notebooks frequently means that others modify and extend the original source code to adapt it to new applications.

These extensions are not backwards compatible and are especially difficult to consolidate or integrate with other changes made to the same file for different purposes.

ioProc tries to alleviate this issue by providing a minimal framework in which scientists write their source code. In exchange for some intellectual overhead, they gain benefits in reproducibility, maintainability, reusability and transparency.

The main benefits are:

  • ioProc encourages writing small, distinct functions with clear interfaces and purposes. This makes it easy to reuse individual parts and helps with maintenance.

  • ioProc workflows come with a static configuration, which defines the workflow in its entirety. This static configuration provides all the metadata for a given workflow and, since it is written in YAML, it is fully machine-readable. This greatly benefits the transparency and reproducibility of the source code.

  • ioProc workflows consist of a list of actions that are processed sequentially. Plugins like ioprocmeta add metadata processing support, which can be used to read, merge and collect metadata with minimal overhead during the workflow.

  • ioProc stores actions locally and has no lock-in effect: properly declared ioProc actions allow the same code to be used without any ioProc installation, side by side with an ioProc workflow.

  • Last but not least, ioProc workflows and their actions can be reproduced by any scientist with a ready Python environment, based solely on the action source code and the workflow specification. Working groups can hence share actions and workflows via systems like git and reproduce, collaborate on and enhance the work of their peers.
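The point about avoiding lock-in can be illustrated with a common Python pattern: a function is marked for a workflow framework by a decorator, and a do-nothing replacement is used when the framework is not installed. The sketch below is an assumption about how such a declaration could look, not a description of how ioProc implements it. The module name `hypothetical_workflow_framework`, the `action` decorator and the `normalise` function are invented for this example.

```python
try:
    # Hypothetical framework import; this module name is made up for the sketch.
    from hypothetical_workflow_framework import action
except ImportError:
    def action(func):
        """Fallback decorator: return the function unchanged."""
        return func


@action
def normalise(values):
    """Scale a list of numbers so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


# Without the framework installed, the action is still an ordinary function.
shares = normalise([1, 1, 2])
print(shares)
```

Written this way, the same source file can be imported by a plain script or notebook and, assuming the framework's decorator leaves the call signature intact, by a workflow as well.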

Background: scientific software development (RSE)

ioProc is written by research software engineers (RSEs) for RSEs and scientists who do data analysis and exploration in a cooperative environment with others.

In scientific software engineering, the goals of the development process and the most important quality aspects differ from those in other software engineering (SE) contexts. Most of the time, reproducibility, transparency and maintainability are valued highly, especially when FAIR principles are of primary concern.

Scientific software development is always also part of the investigation and exploration of scientific questions. By writing source code, we scientists learn how a problem can be structured, abstracted, simplified or expressed, and thus we learn about the nature of the problem. This newly gained knowledge in turn influences the short- and long-term goals for the software, the required features and the applied methods and algorithms.

Hence research software development is intrinsically an interactive process with an open end. It starts with the scientific questions of the researchers and ends with the last scientific question that is asked by the researchers.

Programming languages
  • Python 99%
  • Makefile 1%
  • Batchfile 1%
License
  • MIT
Packages
gitlab.com
pypi.org
github.com

Participating organisations

German Aerospace Center (DLR)
