Project Leads:
● Sebastian Beyvers <
● Jannis Schlegel <
The growing complexity and volume of life science research data present major challenges to managing, sharing, and reusing data across modern storage infrastructures. Collaborative research must be enabled without giving up data sovereignty and institutional control over datasets, both of which have become increasingly important. Moreover, adherence to the FAIR Principles (findable, accessible, interoperable, and reusable) has become a necessity for modern data storage. However, many current storage systems, including federated infrastructures, lack comprehensive support for standardized metadata descriptions, which creates barriers to effective data discovery and cross-platform interoperability.
Research Object Crates (RO-Crates) are a lightweight, community-driven method of packaging research data alongside rich, structured metadata using JSON-LD and Schema.org vocabularies. Currently, most RO-Crate implementations focus on the attached format, in which the metadata and data are packaged together; others are built only to parse the JSON-LD metadata itself. The detached format, in which the metadata describes data that is not packaged with it, has limited support: because not all datasets are present at the current location, data access and validation become more complicated. Additionally, existing tools lack pagination mechanisms for crates that contain many files and do not emphasize ingestion workflows for detached formats. When deployed in federated storage environments, they also struggle with scalability, large dataset handling, and distributed metadata management, which limits their practical application in modern research infrastructures. Improved library support is therefore necessary to address large-scale, distributed research scenarios.
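To make the distinction concrete, the following is a minimal sketch of the metadata for a detached crate, written in plain Python without any RO-Crate library. It assumes the RO-Crate 1.1 JSON-LD context and the convention that remote data are referenced by absolute URLs; the function name `build_detached_crate` and all example.org URLs are hypothetical and serve only as illustration, not as the project's implementation.

```python
import json


def build_detached_crate(root_url, name, file_urls):
    """Return RO-Crate metadata (as a dict) whose data live at remote URLs.

    Assumption: the root dataset and every file are identified by absolute
    URLs, because no payload is packaged next to the metadata.
    """
    # Metadata descriptor: points from the metadata file to the root dataset.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": root_url},
    }
    # Root dataset: lists its parts by reference only.
    root = {
        "@id": root_url,
        "@type": "Dataset",
        "name": name,
        "hasPart": [{"@id": url} for url in file_urls],
    }
    # One data entity per remote file.
    files = [{"@id": url, "@type": "File"} for url in file_urls]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *files],
    }


if __name__ == "__main__":
    crate = build_detached_crate(
        "https://example.org/crates/demo",  # hypothetical crate URL
        "Demo sequencing run",
        ["https://example.org/data/sample_a.fastq"],  # hypothetical file URL
    )
    print(json.dumps(crate, indent=2))
```

Because every entity is only referenced, a consumer of such a crate has to resolve each URL before it can validate sizes or checksums, which is the access and validation difficulty described above.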
This project will mitigate these limitations by improving RO-Crate tooling to better support federated storage architectures and ingestion processes. These processes will automatically extract metadata and populate search indices across distributed storage nodes. The improved tooling will support attached, detached, and paginated RO-Crates, enabling large-scale research objects that exceed traditional packaging limits. This approach allows researchers to maintain comprehensive metadata descriptions while accommodating the practical constraints of distributed storage systems and network transfer limitations in federated environments. The project will establish guidelines for handling very large datasets within RO-Crate frameworks and address scalability challenges specific to life science data, which may include genomic sequences, high-resolution imaging datasets, proteomics data, and longitudinal experimental collections. Implementation will target federated data storage solutions such as the Aruna platform, which is used by various NFDI consortia and will serve as the primary testbed for validating enhanced RO-Crate workflows within distributed storage environments.
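As a rough illustration of how pagination and ingestion could fit together, the sketch below splits a crate's data entities into fixed-size pages and feeds each page into a simple in-memory keyword index. The paging scheme, the function names `paginate` and `ingest`, and the use of the `name` property for indexing are all assumptions made for this example; the proposal does not define a concrete pagination format, and a real deployment would populate a search service rather than a Python dictionary.

```python
from collections import defaultdict


def paginate(entities, page_size):
    """Yield consecutive slices of `entities`, each with at most `page_size` items."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    for start in range(0, len(entities), page_size):
        yield entities[start:start + page_size]


def ingest(pages, index=None):
    """Add every entity on every page to a keyword index.

    The index maps each lowercase word of an entity's `name` to the set of
    entity identifiers (`@id`) whose name contains that word.
    """
    if index is None:
        index = defaultdict(set)
    for page in pages:
        for entity in page:
            for token in str(entity.get("name", "")).lower().split():
                index[token].add(entity["@id"])
    return index


if __name__ == "__main__":
    # Hypothetical file entities, as they might appear in a crate's @graph.
    entities = [
        {"@id": f"https://example.org/data/f{i}.fastq", "name": f"Sample {i} reads"}
        for i in range(5)
    ]
    index = ingest(paginate(entities, page_size=2))
    print(sorted(index["reads"]))
```

Processing one page at a time keeps memory use bounded by the page size rather than by the total number of files, which is the property that matters for crates too large to load or transfer in one piece.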
This project directly supports the strategic vision of ELIXIR-DE and de.NBI by advancing interoperable bioinformatics infrastructures that facilitate cross-border data sharing and collaborative research within the European life sciences community. The enhanced RO-Crate implementation aligns with ELIXIR's commitment to developing standardized data management practices and metadata frameworks that enable seamless integration across distributed computational resources. By focusing on federated storage architectures and FAIR data principles, the initiative contributes to de.NBI's mission of providing sustainable, scalable bioinformatics services that can accommodate the growing data volumes characteristic of modern life science research. Furthermore, the project's emphasis on interoperable metadata management and distributed data ecosystems directly supports the European Open Science Cloud (EOSC) objectives of creating a federated, cross-disciplinary research infrastructure. The implementation of paginated RO-Crates and improved ingestion workflows will enhance the technical foundation necessary for EOSC's vision of seamless data discovery and access across European research institutions, while maintaining the data sovereignty and institutional control requirements essential for sensitive biological datasets.