DataNord - SmartReps AI Hub

DataNord - SmartReps AI Hub is a centralised platform for in-house AI-powered solutions designed to enrich the PANGAEA ecosystem. This page highlights projects that use artificial intelligence to streamline workflows, deliver intelligent insights, and advance cutting-edge research.

Description

1. Chatbots for Interactive User Guidance on Data Input in the PANGAEA & Qualiservice Ecosystem

💬 ChatPANGAEA

ChatPANGAEA is an AI-driven chatbot developed within the SmartReps project and the DataNord Research Academy. It is tailored to support the PANGAEA data ecosystem by making its complex internal knowledge base, especially wiki pages and guidelines, easily searchable and interactive.

Using a combination of Retrieval-Augmented Generation (RAG), Information Retrieval, and Large Language Models (LLMs), ChatPANGAEA enables users to engage in real-time conversations that provide accurate, contextual answers. Each response is backed by references to the original sources, ensuring traceability and trust.
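The retrieve-then-cite pattern described here can be illustrated with a minimal sketch. It is not ChatPANGAEA's implementation: the documents, source labels, and function names below are hypothetical, and a simple bag-of-words cosine similarity stands in for the real retrieval component. The assembled prompt would then be passed to an LLM, which is left out.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base; sources and texts are illustrative only.
DOCS = [
    {"source": "wiki/data-submission",
     "text": "Submit datasets through the submission form and attach metadata."},
    {"source": "wiki/parameters",
     "text": "Each parameter needs a name, unit and method description."},
    {"source": "wiki/licences",
     "text": "Datasets are published under a Creative Commons licence."},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, docs, k=2):
    """Return up to k documents that share vocabulary with the question."""
    query = Counter(tokenize(question))
    scored = [(cosine(query, Counter(tokenize(d["text"]))), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question, passages):
    """Number each passage so the answer can cite its source as [n]."""
    context = "\n".join(
        f"[{i}] ({p['source']}) {p['text']}" for i, p in enumerate(passages, 1)
    )
    return (
        "Answer using only the numbered passages and cite them as [n].\n\n"
        f"{context}\n\nQuestion: {question}"
    )


question = "Which unit does a parameter need?"
passages = retrieve(question, DOCS)
print(build_prompt(question, passages))
```

Numbering the retrieved passages in the prompt is what lets a generated answer point back to its sources, which is the traceability property the paragraph above refers to.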

In addition to searching internal resources, users can upload their own documents or reference external websites, allowing for broader, personalized support during the data submission or curation process.

ChatPANGAEA is particularly valuable for data curators and researchers, as it reduces time spent navigating documentation, clarifies data workflows, and improves the quality and consistency of curated datasets within the PANGAEA infrastructure.

💬 ChatQualiservice

ChatQualiservice is a domain-specific AI assistant created to support researchers and data curators working with qualitative research data at Qualiservice. Developed under the SmartReps project and supported by the DataNord Research Academy, it aims to simplify access to submission guidelines, ethical protocols, data protection rules, and internal documentation.

By leveraging a RAG-based architecture and LLMs, ChatQualiservice delivers real-time, context-aware answers directly from trusted documentation sources. This ensures users receive responses that are both fast and reliable, minimizing the need to manually search through complex materials.

It supports interaction with internal documents and user-uploaded content, helping users clarify their responsibilities, understand repository standards, and streamline the process of preparing qualitative data for submission.

ChatQualiservice empowers social science researchers and data stewards by improving efficiency, reducing uncertainty, and fostering a more user-friendly approach to managing sensitive and complex datasets.

2. Semi-Automatic Generation of Abstracts for the PANGAEA Ecosystem

💡 PANscribe

PANscribe is a semi-automated abstract generation tool developed as part of the DataNord Research Academy. It supports the PANGAEA data infrastructure by helping generate structured dataset abstracts aligned with submission guidelines.

The tool uses Large Language Models (LLMs) together with prompt engineering to create draft abstracts from available metadata and uploaded documents, streamlining the authoring process for data curators and researchers. A built-in validation step checks whether key components such as "who, what, where, when, why, how" are present, helping abstracts meet quality expectations.
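To make the idea of the "who, what, where, when, why, how" check concrete, here is a deliberately naive sketch. How PANscribe actually performs its validation is not described on this page, so the cue patterns and the `missing_components` function are assumptions for illustration only; a production check would likely be far more robust than regular expressions.

```python
import re

# Illustrative cue patterns only; these are assumptions, not PANscribe's rules.
CUES = {
    "who": re.compile(r"\b(authors?|team|institute|collected by)\b", re.I),
    "what": re.compile(r"\b(dataset|measurements?|samples?|profiles?)\b", re.I),
    # Case-sensitive on purpose: a preposition followed by a capitalised name.
    "where": re.compile(r"\b(?:in|at|near|off) (?:the )?[A-Z]\w+"),
    "when": re.compile(r"\b(?:19|20)\d{2}\b"),
    "why": re.compile(r"\b(to (investigate|assess|study|understand)|in order to)\b", re.I),
    "how": re.compile(r"\b(using|measured with|by means of|methods?)\b", re.I),
}


def missing_components(abstract):
    """Return the components for which no cue was found in the abstract."""
    return [name for name, pattern in CUES.items() if not pattern.search(abstract)]


draft = (
    "This dataset contains temperature profiles collected by the AWI team "
    "in the Fram Strait during 2019, using a CTD probe, to investigate ocean warming."
)
print(missing_components(draft))
print(missing_components("Temperature profiles."))
```

Returning the list of missing components, rather than a plain pass or fail, gives the curator something actionable to add to the draft.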

PANscribe enhances submission workflows by reducing manual effort, supporting consistency, and encouraging best practices in metadata documentation across the PANGAEA ecosystem.

3. Keyword Extraction Pipeline using Prompt Engineering and Information Retrieval for Study Report Analysis

🧾 The LLM-enhanced Keyword Extraction Pipeline is a research-grade system for automated analysis and classification of academic study reports. Developed within the SmartReps project of the DataNord Research Academy, it supports structured extraction of keywords from unstructured PDF documents and their alignment with controlled vocabularies for research data standardization.

The system combines large language models with classical NLP techniques to extract theme-aware keywords from multilingual documents (German and English). Extracted terms are processed through a hybrid matching pipeline that integrates exact string matching and semantic similarity search using SentenceTransformers and FAISS indexing. Identified keywords are mapped to the GESIS Thesaurus for the Social Sciences (TheSoz) to ensure consistent and interoperable classification.
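The two-stage matching logic (exact match first, similarity search as a fallback) can be sketched as follows. To keep the example dependency-free, a character-trigram cosine similarity is swapped in for the SentenceTransformers embeddings and FAISS index named above, so it captures spelling closeness rather than true semantic similarity. The vocabulary entries, the threshold, and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical controlled vocabulary; not actual thesaurus content.
VOCAB = ["labour market", "social inequality", "migration", "education policy"]


def trigrams(text):
    """Character trigram counts; a crude stand-in for a sentence embedding."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram-count vectors."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def match_keyword(keyword, vocab=VOCAB, threshold=0.5):
    """Map a keyword to a vocabulary term: exact match first, then similarity."""
    normalized = keyword.strip().lower()
    for term in vocab:
        if term.lower() == normalized:
            return {"keyword": keyword, "term": term, "method": "exact", "score": 1.0}
    # Fallback: nearest vocabulary term by similarity (assumed threshold of 0.5).
    query = trigrams(normalized)
    best_term, best_score = max(
        ((term, cosine(query, trigrams(term))) for term in vocab),
        key=lambda pair: pair[1],
    )
    if best_score >= threshold:
        return {"keyword": keyword, "term": best_term,
                "method": "similarity", "score": round(best_score, 3)}
    return {"keyword": keyword, "term": None,
            "method": "unmatched", "score": round(best_score, 3)}


for kw in ["Migration", "labor markets", "quantum physics"]:
    print(match_keyword(kw))
```

Recording the matching method and score alongside each mapping keeps the result inspectable, in line with the transparency goal of the pipeline.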

The pipeline includes OCR-based PDF processing (Tesseract), text cleaning, language detection, and chunk-based processing for large documents. A semantic deduplication step reduces redundancy in extracted terms, while confidence scoring supports ranking and transparency of results. Outputs are provided in structured JSON and CSV formats, accompanied by a full audit trail for reproducibility.
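Three of the steps above (chunk-based processing, deduplication with confidence ranking, and structured output) can be illustrated with a short standard-library sketch. The chunk sizes, field names, and function names are assumptions, and case-insensitive string comparison is used here as a simple stand-in for the semantic deduplication the pipeline performs. OCR, language detection, and the audit trail are out of scope for this example.

```python
import csv
import io
import json


def chunk_words(text, size=200, overlap=20):
    """Split text into overlapping word windows (requires overlap < size)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def dedupe_and_rank(candidates):
    """Keep the highest-confidence variant of each term, sorted by confidence.

    Terms are compared case-insensitively; this is a stand-in for
    embedding-based semantic deduplication.
    """
    best = {}
    for term, confidence in candidates:
        key = term.strip().casefold()
        if key not in best or confidence > best[key]["confidence"]:
            best[key] = {"keyword": term.strip(), "confidence": confidence}
    return sorted(best.values(), key=lambda rec: rec["confidence"], reverse=True)


def to_json(records):
    """Serialize records as JSON, keeping non-ASCII characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records):
    """Serialize records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["keyword", "confidence"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical keyword candidates gathered from several chunks.
candidates = [("Migration", 0.7), ("migration ", 0.9), ("Bildung", 0.8)]
ranked = dedupe_and_rank(candidates)
print(to_json(ranked))
print(to_csv(ranked))
```

The overlap between chunks reduces the chance that a multi-word term is cut in half at a chunk boundary, at the cost of producing duplicate candidates, which the deduplication step then merges.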

The system is designed for transparent and reproducible research workflows, enabling efficient metadata generation and improving consistency in research data annotation within infrastructures such as PANGAEA and Qualiservice.

All of these projects are developed as part of the SmartReps project under the DataNord Research Academy. They empower data curators with faster, more accurate, and context-specific solutions, helping to ensure the quality and accessibility of curated data in the PANGAEA ecosystem and in Qualiservice data sharing.

Participating organisations

Alfred Wegener Institute for Polar and Marine Research (AWI)

Testimonials

The chat functionality is pretty awesome! Great work!
– PANGAEA Project & Data Management
The chatbot is working quite nice :)
– PANGAEA Editorial
I have just tested the Chatbot. And most answers look very good so far.
– PANGAEA Project & Data Management