Zenodo serves as an open repository for all research outputs, governed by clear policies.
Its scope covers all fields of research and all types of research artifacts, from any stage of the research lifecycle. Anyone may register as a user, and all users may deposit content for which they hold the appropriate rights. A key operational point is that uploading implies no change of ownership: all uploaded content remains the property of the parties who held it prior to submission. The repository accepts all data file formats, even those considered unfriendly to preservation, with a total size limit of 50GB per record.
Files deposited on Zenodo can have different levels of accessibility.
Users may specify a license for all publicly available files and can deposit files under open, embargoed, or closed access. For embargoed records, the repository restricts access to the data until the specified end date, after which the content automatically becomes publicly available. In all cases, the metadata for records is licensed under CC0 and is always publicly accessible.
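For illustration, these access settings map onto a small set of metadata fields on a deposit. The sketch below is a hypothetical example; the field names follow Zenodo's deposit API documentation, but they are included here as an assumption and should be checked against the current docs.

```python
# Hypothetical metadata for an embargoed Zenodo deposit.
# Field names are assumed from Zenodo's deposit API documentation.
embargoed_metadata = {
    "title": "Example survey dataset",
    "upload_type": "dataset",
    "publication_date": "2025-01-15",
    "creators": [{"name": "Doe, Jane", "affiliation": "Example University"}],
    "access_right": "embargoed",   # one of: open / embargoed / restricted / closed
    "embargo_date": "2026-01-15",  # files become publicly available after this date
    "license": "cc-by-4.0",        # license that applies once the embargo lifts
}
# Note: record metadata itself is published under CC0 regardless of file access.
```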
The platform ensures long-term preservation through specific technical measures.
All data files are stored in CERN Data Centres, primarily in Geneva with replicas in Budapest, and are kept as multiple replicas in a distributed file system. Items will be retained for the lifetime of the repository, which is currently tied to its host laboratory, CERN, whose experimental programme is defined for at least the next 20 years. If the repository were to close, best efforts would be made to integrate all content into suitable alternative institutional and/or subject-based repositories.
Uploading research to Zenodo is a structured process.
Users begin by creating a new upload and can add files by clicking an upload button or using drag and drop, with a limit of up to 100 files and a total of 50GB. They must then fill in minimal required metadata fields, including resource type, title, publication date, and creators. A critical step is managing the Digital Object Identifier (DOI); if the upload already has a DOI, it must be declared, but if not, one can be reserved through the platform. After setting the visibility (public or restricted) and optionally applying an embargo, the user can publish the record.
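The same workflow can also be scripted against Zenodo's REST API. The sketch below is a minimal example, assuming a personal access token is available in a ZENODO_TOKEN environment variable and a local file named results.csv; endpoint paths and metadata fields follow Zenodo's published API documentation but should be verified against the current docs before use.

```python
import os
import requests

BASE = "https://zenodo.org/api"
params = {"access_token": os.environ["ZENODO_TOKEN"]}  # personal access token (assumed)

# 1. Create an empty deposition (the "new upload").
dep = requests.post(f"{BASE}/deposit/depositions", params=params, json={}).json()

# 2. Add a file via the deposition's file bucket.
bucket = dep["links"]["bucket"]
with open("results.csv", "rb") as fh:
    requests.put(f"{bucket}/results.csv", data=fh, params=params)

# 3. Fill in the minimal required metadata.
metadata = {
    "metadata": {
        "upload_type": "dataset",
        "title": "Example field survey results",
        "publication_date": "2025-01-15",
        "creators": [{"name": "Doe, Jane"}],
        "description": "Survey results deposited for long-term preservation.",
    }
}
requests.put(f"{BASE}/deposit/depositions/{dep['id']}", params=params, json=metadata)

# 4. Publish the record; a DOI is minted if none was declared beforehand.
requests.post(f"{BASE}/deposit/depositions/{dep['id']}/actions/publish", params=params)
```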
Research communities, such as the SEARRP, implement specific curation policies on Zenodo.
Researchers are required to deposit their datasets and metadata within twelve months of data collection, or immediately post-publication. The metadata for submitted data will be publicly available, and there are several routes to publication: linking to already open access data, submitting as new open access, using Zenodo’s embargo mechanism for up to two years, or setting terms and conditions for restricted sharing. Users of open access data from such communities are expected to cite the relevant researchers and, for certain intensive datasets, include the data collectors as co-authors on manuscripts.
Extracting data from digital sources for repositories is a distinct challenge, often addressed with specialized tools.
Web scraping tools are designed to automate data extraction from websites. A key consideration when choosing a tool is its ability to handle JavaScript-heavy websites, CAPTCHAs, IP bans, and large-scale tasks. Solutions range from full APIs like ScraperAPI, which handles proxies and CAPTCHAs, to no-code browser extensions like Web Scraper, which uses a point-and-click interface and can export data in CSV, XLSX, and JSON formats. Other platforms, like Browse AI, offer AI-powered scraping and monitoring with features like human behavior emulation and geo-based data extraction.
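As a point of comparison with these hosted tools, the basic extraction step can be sketched directly in code. The example below is a generic illustration using the requests and BeautifulSoup libraries (not any of the products named above); the URL and CSS selector are placeholders.

```python
import csv
import requests
from bs4 import BeautifulSoup

# Placeholder target page and selector; substitute real values.
URL = "https://example.org/listings"
resp = requests.get(URL, headers={"User-Agent": "research-scraper/0.1"}, timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
rows = [
    {"title": link.get_text(strip=True), "url": link.get("href")}
    for link in soup.select("a.listing-title")  # hypothetical selector
]

# Export to CSV, mirroring the CSV/XLSX/JSON exports offered by no-code tools.
with open("listings.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
```

A script like this only handles static HTML; JavaScript-heavy pages, CAPTCHAs, IP bans, and large-scale jobs are precisely the cases where the hosted services above add value.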
For data trapped within documents, AI-powered extraction software provides a solution.
Tools like Parseur use AI to automatically convert PDFs, emails, and images into structured data, aiming to save up to 98% on manual data entry costs. Similarly, Amazon Textract is a machine learning service that goes beyond simple optical character recognition (OCR) to automatically extract text, handwriting, layout elements, and data from scanned documents.
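As an example of the second tool, Textract can be called through the AWS SDK for Python (boto3). The snippet below is a minimal sketch for a single-page scanned document, assuming AWS credentials are already configured; the input file name is a placeholder.

```python
import boto3

textract = boto3.client("textract")  # assumes AWS credentials are configured

# Read a single-page scanned document (placeholder file name).
with open("invoice.png", "rb") as fh:
    document_bytes = fh.read()

# detect_document_text performs OCR and returns the result as a list of "blocks".
response = textract.detect_document_text(Document={"Bytes": document_bytes})

# Print every detected line of text.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```

For forms and tables rather than plain text, Textract's analyze_document operation with the FORMS and TABLES feature types returns key-value pairs and table cells instead of raw lines.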
The automated identification of dataset references in literature is an active area of research.
Researchers note that datasets are critical for replication and reproducibility, but citing them is not yet a common or standard practice, which affects our ability to track their usage. Automated systems like Data Gatherer leverage large language models to identify and extract dataset references from scientific publications, aiming to reduce the time required for dataset discovery. Other research focuses on using specific neural network models, such as a Bi-LSTM-CRF architecture, to achieve automatic dataset mention extraction from scientific articles.
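As a simplified illustration of what such systems automate (and not the method of either cited paper), the sketch below flags candidate dataset references in a passage of text using regular expressions for DOIs and a couple of common repository accession patterns.

```python
import re

# Rule-based stand-in for learned dataset-mention extractors; LLM-based or
# Bi-LSTM-CRF systems handle far more varied phrasing than these patterns.
PATTERNS = {
    "doi": re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+"),
    "geo_accession": re.compile(r"\bGSE\d{3,}\b"),  # NCBI GEO series IDs
    "sra_accession": re.compile(r"\bSRR\d{5,}\b"),  # NCBI SRA run IDs
}

def find_dataset_mentions(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, match) pairs for candidate dataset references."""
    hits = []
    for name, pattern in PATTERNS.items():
        # Trim trailing punctuation that the DOI pattern may pick up.
        hits.extend((name, m.rstrip(".,;")) for m in pattern.findall(text))
    return hits

sample = ("Expression profiles were deposited in GEO under accession GSE12345 and "
          "processed tables are available at https://doi.org/10.5281/zenodo.123456.")
print(find_dataset_mentions(sample))
```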
📚 References
- Zenodo. Policies
- Marini, P., Santos, A., Contaxis, N., & Freire, J. (2025). Data Gatherer: LLM-Powered Dataset Reference Extraction from Scientific Literature. In Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025). Association for Computational Linguistics.
- ScraperAPI. 16 Best Web Scraping Tools In 2025 (Pros, Cons, Pricing)
- Zeng, T. (2024). Dataset Mention Extraction in Scientific Articles Using Bi-LSTM-CRF Model. arXiv.
- Web Scraper. Web Scraper – The #1 web scraping extension
- Zenodo. Create new upload
- Parseur. AI data extraction software | Parseur®
- Browse AI. Browse AI: Scrape and Monitor Data from Any Website with …
- Zenodo. Curation policy for the SEARRP Community
- Amazon Web Services. Amazon Textract