Leveraging Free Infrastructure and Open Practices to Empower the Creation of Specialized Databases by Non-Experts
Databases are of fundamental importance to research in the life sciences, and it is common for database teams to develop custom-made infrastructure, workflows, tools, and guidelines. The life science network de.NBI alone provides several databases. Aren’t there enough databases yet?
I argue: no, we need many more databases. Discipline-specific repositories fall into the same category; we need more of those, too! Every scientific discipline works in highly specialized subfields, each with its own language and interests. Each of those needs a database!
Setting up databases is, however, often costly and difficult. Some researchers instead opt for collecting data in Excel files and do not open up their data collections. These data collections are also poorly integrated into the existing database landscape, e.g. they provide no cross-references to other databases (and thus fail to benefit from division of labor). This should change.
I propose for the 2nd BioHackathon Germany to 1) work out ways of utilizing free infrastructure (preferably GitLab) to run a database, including a corresponding website, 2) document the identified possibilities, and 3) implement them with a concrete example.
The concrete example could be a dataset of collected apparent equilibrium constants of enzyme-catalyzed reactions, which spans a wide field: chemicals, enzymes, metabolism, literature references, and physics. The data is partly well curated and ready to be integrated with further databases; cross-references to Rhea and to PubMed / DOIs have already been created. However, the database still lives mainly in an Excel file.
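To make the kind of data concrete, a single entry of such a database might be modeled as follows. This is only an illustrative sketch: the field names, the Rhea identifier, and the DOI are hypothetical placeholders, not taken from the actual Excel file.

```python
from dataclasses import dataclass

@dataclass
class EquilibriumConstantRecord:
    """One apparent equilibrium constant (K') of an enzyme-catalyzed
    reaction. All field names are illustrative assumptions."""
    reaction_rhea_id: str   # cross-reference to Rhea (hypothetical ID below)
    k_prime: float          # apparent equilibrium constant
    temperature_k: float    # measurement temperature in kelvin
    ph: float               # pH at which K' was measured
    reference_doi: str      # DOI / PubMed cross-reference to the source

# A hypothetical record, showing how cross-references tie the entry
# to Rhea (the reaction) and to the literature (the DOI):
record = EquilibriumConstantRecord(
    reaction_rhea_id="RHEA:12345",
    k_prime=3.2,
    temperature_k=298.15,
    ph=7.0,
    reference_doi="10.1000/example-doi",
)
```

Keeping each record in such an explicit, typed shape is what makes the later steps (validation, persistent identifiers, integration with other databases) mechanical rather than manual.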
The full process of turning a file into a database includes enabling community curation, documenting curation guidelines and practices, making the code openly available, introducing CI/CD elements for quality control, sharing the actual data openly, assigning persistent identifiers, and more.
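As a sketch of what such a CI/CD quality-control element could look like, the following check could run automatically (e.g. in a GitLab pipeline) on every proposed change to the data file. The column names and plausibility ranges are assumptions for illustration, not the dataset's real schema.

```python
import csv
import io
import re

# Hypothetical format for Rhea cross-references, e.g. "RHEA:12345".
RHEA_PATTERN = re.compile(r"^RHEA:\d+$")

def validate_rows(csv_text):
    """Return a list of human-readable problems found in the CSV data."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for lineno, row in enumerate(reader, start=2):  # header is line 1
        if not RHEA_PATTERN.match(row.get("rhea_id", "")):
            problems.append(f"line {lineno}: malformed Rhea cross-reference")
        try:
            t = float(row["temperature_k"])
            if not 200.0 < t < 400.0:  # assumed plausibility range
                problems.append(f"line {lineno}: implausible temperature {t} K")
        except (KeyError, ValueError):
            problems.append(f"line {lineno}: missing or non-numeric temperature")
    return problems

# A well-formed row passes; a malformed one is reported:
good = "rhea_id,temperature_k\nRHEA:12345,298.15\n"
bad = "rhea_id,temperature_k\nnot-an-id,9999\n"
```

Wiring such a script into the CI/CD configuration means every community contribution is checked before it is merged, which is precisely what makes opening up curation to a community feasible.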
Taking up the identified practices allows future (and current) databases to run in a more sustainable, transparent, scalable, and collaborative manner, and thus encourages other researchers to use and contribute to them.
Concretely, it enables databases to more easily become 1) scalable (e.g. by opening up curation to communities instead of individual curators), 2) forkable (i.e. allowing separate databases to be spun off instead of accepting scope creep), 3) integrable (e.g. by allowing interested third parties to build applications on top of the then-open database), 4) sustainable (e.g. by reducing running costs), 5) transparent (i.e. allowing investigation of the workings behind the scenes), and 6) collaborative (e.g. through default-remote teams and by turning users into providers).
Project lead: Robert Giessmann <