Co-leads:

  • Emidio Capriotti (University of Bologna)
  • Nils Hoffmann (Research Center Jülich)
  • Foo Wei Ten (BIH Charité, Berlin)

Challenge:
A major bottleneck in developing ML models for the life sciences is the lack of structured, high-quality training datasets. While a large amount of data is published in the scientific literature, it remains locked in unstructured formats (tables, figures, supplementary files) and is rarely accompanied by full metadata and experimentally generated data in openly accessible repositories. Manual extraction is time-consuming and error-prone, limiting reproducibility and scalability in ML applications. The main challenges can be summarized as follows:

  • Heterogeneous Data Formats: Publications use diverse representations (text, PDFs, Excel files), requiring flexible extraction tools.
  • Standardization: Extracted data must comply with the FAIR principles (e.g., ontologies, metadata schemas) to ensure ML usability and data interoperability.
  • Validation: Ensuring the accuracy and biological relevance of extracted datasets.

Planned Activities:

  • Tool Development: Extend and adapt NLP/LLM pipelines to extract structured datasets from publications using suitable toolchains, and collaborate with EuropePMC to sync with their annotation pipeline. Develop parsers for common data formats; train extraction, named entity recognition, and concept annotation models on manually annotated reference data.
  • Data Harmonization: Map extracted data to a minimal set of standard identifiers (e.g., UniProt, ChEBI, HGVS) using ELIXIR resources such as OmicsDI and ontologies such as EDAM and MeSH. Integrate with ELIXIR’s FAIR data infrastructure (e.g., Bioschemas).
  • ML Benchmarking: Curate benchmark datasets for training ML models (e.g., variant pathogenicity predictors) or for benchmarking and evaluating LLMs.
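
To illustrate the harmonization step, a minimal sketch of mapping raw entity mentions to standard identifiers. The lookup table here is an illustrative, hand-written assumption; in practice such mappings would be sourced from resources like ChEBI, UniProt, or OmicsDI:

```python
# Sketch only: a tiny, hand-written mapping standing in for real identifier
# resources (ChEBI, UniProt, OmicsDI). Mentions are case-normalized before lookup.
EXAMPLE_ID_MAP = {
    "glucose": "CHEBI:17234",
    "atp": "CHEBI:15422",
    "p53": "UniProt:P04637",
}

def harmonize(mention: str):
    """Map a raw text mention to a standard identifier, or None if unknown."""
    return EXAMPLE_ID_MAP.get(mention.strip().lower())

harmonize("Glucose")   # -> "CHEBI:17234"
harmonize("unknown")   # -> None
```

Unmapped mentions returning None can then be routed to manual curation, which is where the annotated reference corpus mentioned above would grow.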

Study Case 1:
Extraction of Proteomics and Metabolomics Data
Goal: Automate the extraction of protein-metabolite interactions, quantitative profiles, and biomarker signatures from publications.
Approach: Identify publications (CC-BY) for an annotation/training corpus. Train NLP models to identify tables/figures containing MS data. Parse and normalize metabolite names using RefMet/Goslin, link metabolites to ChEBI and proteins to UniProt/SwissProt, and check for reaction/pathway links via Rhea.
Output: Structured datasets for training ML models in disease biomarker discovery.
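
A minimal sketch of the table-parsing and identifier-linking step. The table, its column names, and the ChEBI subset are illustrative assumptions; real input would come from supplementary files, with name normalization via RefMet/Goslin:

```python
# Sketch with assumed column names ("metabolite", "abundance") and an
# illustrative ChEBI subset; not a real parser for published supplements.
import csv
import io

TABLE = """metabolite,abundance
Glucose,12.5
ATP,3.1
"""

CHEBI = {"glucose": "CHEBI:17234", "atp": "CHEBI:15422"}  # illustrative subset

records = []
for row in csv.DictReader(io.StringIO(TABLE)):
    records.append({
        "name": row["metabolite"],
        "chebi": CHEBI.get(row["metabolite"].lower()),  # None if unmapped
        "abundance": float(row["abundance"]),
    })
```

Each record carries both the original mention and its linked identifier, so downstream ML datasets stay traceable to the source table.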

Study Case 2:
Extraction of Genetic Variants with Pathogenic Effects
Goal: Curate a dataset of missense variants with clinical annotations to improve pathogenicity prediction models.
Approach: Extract variant-phenotype associations from the literature and from public databases such as ClinVar and UniProt humsavar. Annotate variants with functional prediction scores.
Output: A benchmark dataset for evaluating variant effect predictors.
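
As a sketch of the literature-extraction step, a simple regular expression for HGVS-style protein missense mentions with three-letter amino acid codes (e.g., p.Arg175His). A real pipeline would also handle one-letter codes, attach gene context, and validate hits against ClinVar:

```python
# Sketch: match HGVS protein-level missense mentions of the form
# p.<ThreeLetterRef><Position><ThreeLetterAlt>, e.g. p.Arg175His.
# One-letter codes, frameshifts, and nonsense variants are out of scope here.
import re

HGVS_P = re.compile(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})")

text = "The variant p.Arg175His in TP53 was classified as pathogenic."
matches = [m.group(0) for m in HGVS_P.finditer(text)]
# matches -> ["p.Arg175His"]
```

Regex matching alone over-generates, which is why the Approach pairs extraction with annotation against curated databases.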

Study Case 3:
Extraction and Evaluation of Transcript Isoform Information Using LLMs
Goal: Leverage large language models (LLMs) to automatically extract, standardize, and evaluate transcript isoform information (e.g., splice variants, functions) from scientific articles as structured outputs.
Approach: Apply LLM-based pipelines to identify and extract isoform-relevant statements from the literature (e.g., RNA-seq studies, gene function papers). Map extracted entities to standard reference databases (Ensembl, UniProt, Gene Ontology) using established ontologies. LLMs will also be used for automated fact-checking, evaluating extracted content for biological consistency and plausibility.
Output: Structured, validated, and FAIR-compliant transcript isoform datasets, mapped to standard identifiers. The datasets can be used to benchmark ML models for isoform function or to evaluate LLM-based extraction methods. Potential further outputs include standards definitions and prototypes for Model Context Protocol (MCP) servers/clients.
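
A minimal sketch of a post-hoc validation gate for LLM-extracted isoform records. The field names are an illustrative assumption; a real pipeline might instead enforce a JSON Schema or Pydantic model:

```python
# Sketch: accept an LLM output only if it is valid JSON containing all
# required fields. Field names ("gene", "transcript_id", "claim") are
# assumptions for illustration, not a fixed schema from the project.
import json

REQUIRED = {"gene", "transcript_id", "claim"}

def validate_record(raw: str) -> bool:
    """True if raw parses to a JSON object with all required fields."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and REQUIRED <= record.keys()

validate_record('{"gene": "TP53", "transcript_id": "ENST00000269305", "claim": "splice variant"}')  # -> True
validate_record('{"gene": "TP53"}')  # -> False
```

Records passing this structural gate would then go to the LLM-based fact-checking step for biological consistency.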

The activities are designed to offer opportunities for participants interested in curation, machine learning, semantic enrichment and mapping, and FAIR data. Before the BioHackathon, we will hold at least one preparation meeting with all interested participants to onboard everyone and to organize and distribute the activities. After the BioHackathon, we plan to describe our approach and results in a manuscript.

Alignment with de.NBI & ELIXIR-DE:
This project aligns with de.NBI & ELIXIR-DE’s priorities in data-driven life sciences and machine learning, with special relevance to Germany's proteomics and metabolomics research communities. By leveraging tools like Identifiers.org and Bioschemas, we will enhance data integration and interoperability. The project will also bridge literature mining with ML applications to accelerate open science and computational biology across ELIXIR platforms, directly supporting national initiatives in precision medicine and de.NBI & ELIXIR-DE’s activities in MS-based bioinformatics research. It directly involves participants of the ELIXIR-ML implementation study and community.