DataFlow

A comprehensive framework for automated data processing and manual postprocessing, including quality control and flagging. Its modular, extensible design lets you build customizable pipelines and monitor their performance throughout the data lifecycle.

3 contributors · 208 commits · Last commit ≈ 1 week ago · 0 stars · 0 forks

Description

Features

  • Build customizable data processing pipelines by extending predefined modules for each workflow step (e.g., data ingestion, plausibility checks, transformations)
  • Link pipelines to designated directories to automatically process newly incoming data
  • Automatically execute the corresponding pipeline as soon as new data is detected
  • Monitor pipeline performance, status, and logs in real time
  • Access and inspect imported data stored in InfluxDB, with integrated tools for quality control, data mutation and flagging
  • Manage the entire process, from pipeline configuration to data quality assessment, via a UI
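
The first three features (modular pipeline steps, a pipeline linked to a directory, and detection of new data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Pipeline` class, the `plausibility_check` and `scale` step factories, and the record layout are hypothetical and are not taken from DataFlow's code base. New files are found here by simple polling; the real workflow trigger may work differently.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable

# Hypothetical types for illustration only; not DataFlow's actual API.
Record = dict  # one measurement, e.g. {"value": 21.5}
Step = Callable[[list[Record]], list[Record]]


def plausibility_check(low: float, high: float) -> Step:
    """Return a step that flags values outside [low, high] instead of dropping them."""
    def step(records: list[Record]) -> list[Record]:
        return [
            {**r, "flag": "ok" if low <= r["value"] <= high else "implausible"}
            for r in records
        ]
    return step


def scale(factor: float) -> Step:
    """Return a transformation step that multiplies every value by `factor`."""
    def step(records: list[Record]) -> list[Record]:
        return [{**r, "value": r["value"] * factor} for r in records]
    return step


@dataclass
class Pipeline:
    """A list of steps linked to one watched directory (hypothetical design)."""
    watch_dir: Path
    steps: list[Step]
    seen: set[Path] = field(default_factory=set)

    def run(self, records: list[Record]) -> list[Record]:
        # Apply each step in order; every step returns new records.
        for step in self.steps:
            records = step(records)
        return records

    def new_files(self) -> list[Path]:
        """Return files in watch_dir that were not reported by an earlier call."""
        current = {p for p in self.watch_dir.iterdir() if p.is_file()}
        fresh = sorted(current - self.seen)
        self.seen |= current
        return fresh
```

In this sketch, a caller would poll `new_files()` at an interval, parse each new file into records, and pass them to `run()`. Flagging implausible values rather than deleting them keeps the raw data available for the manual quality control described above.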

Technology Stack

  • Frontend: React, Next.js, TailwindCSS, shadcn/ui
  • Main API: Python, FastAPI
  • WebSocket API: Python, FastAPI
  • Migration API: Python, FastAPI
  • Workflow Trigger: Python

All components of the infrastructure are provided as Docker images for easy deployment and reproducibility.

Required Technical Infrastructure

In addition to the components listed above, DataFlow requires the following services:

  • InfluxDB
  • Grafana
  • PostgreSQL
  • AWS or MinIO
  • NGINX (recommended)

A version of DataFlow optimized for integration into existing infrastructures is available in the standalone branch of the repository.

In that branch, the .env file lets you specify which services are already running and which ones should be deployed together with DataFlow.
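
As a rough illustration of how such a selection could work, the sketch below reads env-style flags and derives the list of services to deploy. The `DEPLOY_<NAME>` variable names, the service identifiers, and the default of deploying everything are assumptions made for this example; consult the .env file in the standalone branch for the actual keys.

```python
# Hypothetical service identifiers; the real names may differ.
SERVICES = ["influxdb", "grafana", "postgres", "minio", "nginx"]


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def services_to_deploy(env: dict[str, str]) -> list[str]:
    """Return services whose (hypothetical) DEPLOY_<NAME> flag is not 'false'."""
    return [
        name
        for name in SERVICES
        if env.get(f"DEPLOY_{name.upper()}", "true").lower() != "false"
    ]


# Example: InfluxDB and PostgreSQL already run elsewhere, so they are skipped.
example = """
# Hypothetical flags, for illustration only
DEPLOY_INFLUXDB=false
DEPLOY_POSTGRES=false
"""
print(services_to_deploy(parse_env(example)))
```

Defaulting a missing flag to "deploy" means an empty .env yields the full stack, which matches the idea that the file only needs to name the services that already exist.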

Programming languages
  • TSX 55%
  • Python 38%
  • TypeScript 5%
  • C++ 2%
  • Dockerfile 0%

Participating organisations

Karlsruhe Institute of Technology (KIT)

Contributors

  • Jasper Schalla
  • Rainer Gasche (Karlsruhe Institute of Technology, IMK-IFU)
  • Ralf Kiese (Albert-Ludwigs-Universität Freiburg)

Related projects

DataHub

DataHub Initiative of the Research Field Earth and Environment

Updated 19 months ago