Data Discoverability
The FAIR Principles define findability, accessibility, interoperability, and reusability as the core requirements for research data. Making a dataset findable and interoperable depends on describing and locating it as an object in a discovery system. Three things make that possible: metadata standards that make its structure interpretable, persistent identifiers that make it citable and reliably locatable, and registries that make the landscape of standards and repositories itself discoverable. A separate layer describes what the data records about the world (anatomy, phenotypes, diseases) through ontologies and controlled vocabularies; it is covered in the Ontologies perspective.
Dataset and catalogue metadata
Metadata is the structured description of a dataset: what it contains, who produced it, when, in what format, and under what licence. Without it a deposited dataset is just a folder of files that a stranger, or a search engine, cannot interpret or reliably find. Metadata standards fix a shared set of fields and their meanings, so that description is consistent and machine-readable across repositories. They work at two levels: a generic base vocabulary, and dataset-specific profiles built on top of it.
Dublin Core is the base layer: 15 generic elements (title, creator, date, format, identifier, and related fields) that can describe any kind of resource. DCAT (Data Catalog Vocabulary) sits above it, reusing Dublin Core terms and adding classes specifically for datasets and the catalogues that list them, in a form designed to be harvested across systems. This is what enables federated discovery: a dataset described with DCAT in Recherche Data Gouv can be harvested by an aggregator such as the EOSC portal and become findable there without a second registration. DCAT-AP, its European application profile, is the form adopted for public sector data portals in the EU.
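As a rough illustration of this layering, the sketch below builds a minimal DCAT dataset description as JSON-LD in Python. The dataset, its identifier, the `build_dcat_record` helper, and the example.org URLs are all hypothetical. The property names are taken from the DCAT (`dcat:`) and Dublin Core Terms (`dct:`) namespaces, and the values are simplified to plain strings. A real catalogue would normally require further fields from its own application profile.

```python
import json


def build_dcat_record(title, creator, issued, identifier,
                      license_url, download_url, media_type):
    """Return a minimal DCAT dataset description as a JSON-LD dictionary.

    Hypothetical helper for illustration only; not part of any library.
    """
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",   # dataset and catalogue terms
            "dct": "http://purl.org/dc/terms/",     # generic Dublin Core terms
        },
        "@type": "dcat:Dataset",
        # Generic description, expressed with Dublin Core terms.
        "dct:title": title,
        "dct:creator": creator,
        "dct:issued": issued,
        "dct:identifier": identifier,
        # Dataset-specific part: how the files can actually be obtained.
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dct:license": {"@id": license_url},
                "dcat:downloadURL": {"@id": download_url},
                "dcat:mediaType": media_type,
            }
        ],
    }


# All values below are invented placeholders.
record = build_dcat_record(
    title="Example imaging dataset",
    creator="Example Laboratory",
    issued="2024-01-15",
    identifier="https://example.org/dataset/123",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/dataset/123/files.zip",
    media_type="application/zip",
)

print(json.dumps(record, indent=2))
```

The generic fields use the `dct:` prefix while the dataset and distribution classes use `dcat:`, which mirrors the base-vocabulary-plus-profile structure described above.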
These vocabularies describe the dataset as an object. The investigation that produced it is described by OBI (Ontology for Biomedical Investigations), which provides terms for study design, protocols, instruments, and assays, so the experimental context travels with the data.
Persistent identifiers
A persistent identifier is a stable, permanent reference to a specific entity (a researcher, an organisation, a resource, or a research output), so that it stays citable and machine-resolvable even if names, URLs, or affiliations change. Four schemes cover the main actors and outputs of research.
- ORCID identifies individual researchers, linking outputs to their producers across repositories, funders, and publishers.
- ROR (Research Organization Registry) identifies research organisations, enabling reliable and machine-readable affiliation metadata.
- RRID (Research Resource Identifiers), an initiative that grew out of the NIF (Neuroscience Information Framework), identifies tools, databases, antibodies, cell lines, and service platforms, enabling unambiguous citation of specific resources in methods sections. RRIDs are requested or required by hundreds of journals as of 2024, including Nature, Science, Cell, and eLife.
- DataCite is an international DOI registration agency for research data and software. Its DataCite Metadata Schema defines how datasets and code deposits are described when a DOI is minted.
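One practical consequence of identifiers being machine-readable is that some of them can be checked for typos before they are stored. As a minimal sketch, the Python below validates the check character of an ORCID iD, which ORCID documents as an ISO 7064 MOD 11-2 checksum over the first 15 digits. The function names are hypothetical, and passing the check only shows that the string is well formed, not that the iD is registered to anyone.

```python
import re

# Hyphenated ORCID iD: four groups of four, last character may be X.
_ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the string is a well-formed ORCID iD with a correct checksum."""
    if not _ORCID_PATTERN.fullmatch(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_character(compact[:15]) == compact[15]


def orcid_uri(orcid: str) -> str:
    """Return the resolvable URI form of a validated ORCID iD."""
    if not is_valid_orcid(orcid):
        raise ValueError(f"not a valid ORCID iD: {orcid!r}")
    return f"https://orcid.org/{orcid}"


# 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False: wrong check character
print(orcid_uri("0000-0002-1825-0097"))
```

Rejecting malformed identifiers at deposit time is cheaper than repairing broken author links after the metadata has been harvested elsewhere.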
Registries
The standards, repositories, and policies described above are themselves scattered, so registries index them. These catalogues make the infrastructure landscape discoverable, so that a researcher can find the right standard or repository without already knowing it exists. Three registries index different layers of that landscape.
- FAIRsharing provides curated, interlinked descriptions of data standards, repositories, and data policies from journals and funders. It is an ELIXIR Recommended Interoperability Resource and is integrated with data management planning tools.
- re3data indexes over 3,000 repositories across all research disciplines as of 2024, with filterable metadata covering subject area, access conditions, persistent identifier support, and certification status.
- OpenAIRE operates the OpenAIRE Research Graph, an open knowledge graph covering approximately 217 million publications and 98 million research data objects from over 160,000 sources as of 2025, linked where possible to funding information and ORCID identifiers. OpenAIRE also monitors Horizon Europe open access compliance. Zenodo, the general-purpose repository launched under the OpenAIRE programme, is hosted and operated by CERN.
For the ontologies and controlled vocabularies that describe what the data is about, see Ontologies. For the provenance layer, which records how data was produced and by which pipeline, see the Reproducibility perspective. For where to deposit data, including generalist repositories, see Sharing your data.

