Standards, Terminologies and Ontologies

This index covers all data format standards, metadata frameworks, terminologies, and ontologies in the graph.

Research data management

Metadata vocabularies, provenance standards, and persistent identifier schemes that enable FAIR data management across all research domains.

  • DCAT (Data Catalog Vocabulary) is the W3C standard that powers discoverability across EOSC and Recherche Data Gouv.
  • Dublin Core provides 15 basic metadata elements widely used as a base metadata layer across repositories including Zenodo and HAL.
  • OBI (Ontology for Biomedical Investigations) provides a formal vocabulary for describing study protocols and experimental designs.
  • PROV-O is the W3C Provenance Ontology and the formal foundation on which NIDM and DataLad provenance tracking are built.
  • RRID (Research Resource Identifiers) are persistent identifiers for reagents, software, and core facilities, governed by NIF.
  • ROR (Research Organization Registry) provides persistent identifiers for research institutions.
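
As an illustration of the 15-element Dublin Core base layer mentioned above, the following minimal Python sketch checks a metadata record against the element names of the Dublin Core Metadata Element Set. The function name `unknown_dc_keys` and the example record are hypothetical and not taken from Zenodo, HAL, or any other repository API.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})


def unknown_dc_keys(record: dict) -> set:
    """Return the keys of `record` that are not Dublin Core elements."""
    return set(record) - DC_ELEMENTS


# Hypothetical record: the values are placeholders, not a real deposit.
example = {
    "title": "Example dataset",
    "creator": "Doe, Jane",
    "date": "2024-01-01",
    "licence": "CC-BY-4.0",  # not a Dublin Core element ("rights" is)
}

print(unknown_dc_keys(example))
```

A check of this kind only validates element names. Repositories typically layer richer schemas on top of this base set.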

Neuroimaging

Data format standards, metadata frameworks, and annotation vocabularies for brain imaging data.

  • BIDS is the Brain Imaging Data Structure, the widely adopted community standard for organising neuroimaging datasets.
  • CIFTI is a combined surface and volume (greyordinate) format for cortical and subcortical data developed by the Human Connectome Project.
  • Cognitive Atlas is an ontology of cognitive processes and tasks used by NeuroVault and BIDS for task annotation.
  • DICOM is the standard clinical imaging format and the source format converted to NIfTI.
  • NIfTI (.nii/.nii.gz) is the widely adopted format for processed neuroimaging data.
  • NIDM is the Neuroimaging Data Model, a PROV-O-based standard for representing neuroimaging experiment provenance.
  • Open Brain Consent provides GDPR-compatible model informed consent forms for open sharing of neuroimaging and electrophysiology participant data, endorsed by INCF.
  • openMINDS is the metadata framework required for data deposited on EBRAINS.
  • UBERON is a cross-species anatomy ontology used for brain region annotation in EBRAINS, NWB, and the Allen Institute for Brain Science.

Bioimaging

File formats and metadata standards for biological microscopy and bioimaging data.

  • OME File Formats covers OME-TIFF for archival use and OME-Zarr for cloud-native large datasets.
  • REMBI (Recommended Metadata for Biological Images) is the community metadata framework for bioimaging datasets.
  • SWC is a widely adopted format for three-dimensional neuronal and glial morphology reconstructions, endorsed by INCF in 2024.

Neurophysiology

File formats and annotation standards for electrophysiology, EEG, and computational neuroscience data.

  • BrainVision is the Brain Products three-file EEG format (.vhdr/.vmrk/.eeg), one of the formats accepted by BIDS.
  • EDF (European Data Format) is a widely used format for clinical EEG, iEEG, and polysomnography.
  • HED (Hierarchical Event Descriptors) provides structured event annotation integrated into BIDS and NWB.
  • Neo is an open Python object model and I/O library for electrophysiology data.
  • NeuroML is a simulator-independent XML format for describing computational neuron and network models, endorsed by INCF.
  • NWB (Neurodata Without Borders) is a community standard for electrophysiology and calcium imaging data.
  • SPARC SDS is the SPARC Data Structure, the NIH SPARC programme standard for peripheral nervous system data.
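
The three-file layout of BrainVision recordings can be sketched in Python as follows. This is a minimal sketch that assumes the three files share one file stem. In practice the .vhdr header names its data and marker files explicitly, so a robust reader would parse the header instead. The function names and the example filename are hypothetical.

```python
from pathlib import Path

# The three BrainVision components: header, markers, binary data.
BRAINVISION_SUFFIXES = (".vhdr", ".vmrk", ".eeg")


def brainvision_triplet(header_path) -> dict:
    """Map each suffix to the expected companion path.

    Assumption: all three files share the stem of the .vhdr file.
    """
    header = Path(header_path)
    if header.suffix.lower() != ".vhdr":
        raise ValueError(f"expected a .vhdr file, got {header.name!r}")
    return {s: header.with_suffix(s) for s in BRAINVISION_SUFFIXES}


def missing_files(header_path) -> list:
    """List the expected companion files that do not exist on disk."""
    return [p for p in brainvision_triplet(header_path).values()
            if not p.exists()]


# Hypothetical BIDS-style filename, used only for illustration.
for suffix, path in brainvision_triplet("sub-01_task-rest_eeg.vhdr").items():
    print(suffix, path.name)
```

Calling `missing_files` before conversion or upload gives an early warning when one of the three components has been left behind.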

Genomics and single-cell

Sequencing file formats, variant standards, and single-cell data formats covering the pipeline from raw reads through to annotated expression matrices.

  • AnnData is the widely adopted standard format (h5ad) for single-cell genomics data in the Scanpy and scverse ecosystem.
  • Cell Ontology is the OBO Foundry ontology for cell types, required for single-cell data annotation in CELLxGENE and BICAN.
  • FASTQ is the standard format for raw sequencing reads and the primary output of most NGS platforms.
  • GO (Gene Ontology) covers biological process, molecular function, and cellular component and is used in transcriptomics workflows.
  • Phenopackets is the GA4GH standard (ISO 4454:2022) linking clinical phenotypes via HPO to genomic data, supporting both VCF and VRS as variant formats.
  • SAM-BAM-CRAM are the standard aligned sequencing read formats that form the pipeline backbone between FASTQ and VCF.
  • Seurat is the R-ecosystem counterpart to AnnData, providing the standard data object for single-cell RNA-seq analysis in R.
  • VCF (Variant Call Format) is the standard format for genomic variant data, with open-access variants deposited in EVA (Europe) or dbSNP (US).
  • VRS (Variant Representation Specification) is the GA4GH standard for computationally precise, globally unique variant identifiers that complement VCF notation across genome builds.
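
To make the first stage of that pipeline concrete, here is a minimal Python sketch of reading FASTQ records. It assumes the common four-line layout (header starting with "@", sequence, separator starting with "+", quality string) with no line-wrapped sequences. The names `FastqRecord` and `parse_fastq` are hypothetical, and established libraries are preferable for production use.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str       # read identifier, without the leading "@"
    sequence: str   # base calls
    quality: str    # per-base quality string, same length as sequence


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumed unwrapped)."""
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None:
            return  # end of input
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        rest = [next(it, None) for _ in range(3)]
        if None in rest:
            raise ValueError(f"truncated record: {header!r}")
        seq, plus, qual = (s.strip() for s in rest)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"length mismatch in record: {header!r}")
        yield FastqRecord(header[1:], seq, qual)


# Hypothetical single-read input, used only for illustration.
reads = ["@read1\n", "GATTACA\n", "+\n", "IIIIIII\n"]
print(list(parse_fastq(reads)))
```

Because `parse_fastq` accepts any iterable of lines, it can be fed an open text file handle as well as an in-memory list.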

Clinical data models and interoperability

Data models and exchange standards for structuring, querying, and sharing clinical and health data across systems and institutions.

  • CDISC provides clinical trial data standards (SDTM, ADaM, CDASH) for regulatory submissions.
  • HL7 FHIR (Fast Healthcare Interoperability Resources) is mandated by EHDS for EHR exchange.
  • OMOP CDM is the OHDSI Common Data Model for federated observational health research.
  • openEHR is a semantic EHR specification built around reusable archetypes and templates.

Clinical classification and coding

Terminologies and classification systems for diagnoses, procedures, observations, and research data coding in clinical and health settings.

  • CCAM is the French national procedure classification present in SNDS and AP-HP PMSI billing data.
  • ICD-10 is the WHO disease classification. The French version (CIM-10) is used throughout SNDS and AP-HP billing.
  • ICD-11 is the updated WHO classification in force since 2022. France is currently in transition from ICD-10.
  • ICD-O-3 is the WHO/IARC dual-axis tumour classification for cancer registries, coding both anatomical site and histological type. It is required by OSIRIS and all French cancer registries.
  • LOINC is the international standard for identifying lab tests, biomarkers, and clinical observations.
  • MeSH is the NLM controlled vocabulary (~30,000 descriptors as of 2024) used for PubMed indexing and ClinicalTrials.gov.
  • OSIRIS is the French national minimum dataset for oncology clinical and genomic data sharing, aligned with HL7 FHIR and ICD-O-3, funded by INCa.
  • SNOMED CT is a comprehensive clinical terminology and the core vocabulary in OMOP CDM and HL7 FHIR.

Drug and chemical terminologies

Controlled vocabularies for drugs, chemicals, and adverse events used in pharmacological research and clinical trials.

  • ATC is the WHO Anatomical Therapeutic Chemical classification, the international standard for drug utilisation and an OMOP CDM vocabulary.
  • ChEBI (Chemical Entities of Biological Interest) is the EMBL-EBI ontology covering drugs, metabolites, and neurotransmitters.
  • MedDRA is the international terminology for adverse event coding required in clinical trial regulatory submissions to the EMA and ANSM.
  • NCIT (NCI Thesaurus) is the NCI cancer and clinical research terminology used as a controlled terminology source in CDISC SDTM submissions.
  • RxNorm is the NLM standard for clinical drug names and identifiers and the primary drug vocabulary in OMOP CDM.

Disease, phenotype, and variant curation

Ontologies and reference resources for classifying diseases, annotating phenotypes, and curating the clinical significance of genomic variants.

  • ADO (Alzheimer’s Disease Ontology) covers biomarkers, staging, and genetics relevant to Alzheimer’s cohort data annotation.
  • ClinVar is the NCBI database of submitted clinical variant interpretations and pathogenicity classifications, including expert-reviewed assertions from ClinGen expert panels.
  • ERN Vocabularies are the ERN-RND and ERN-EpiCARE patient registry terminologies, combining ORDO, HPO, and OMOP CDM.
  • HPO (Human Phenotype Ontology) provides over 18,000 phenotypic abnormality terms (as of 2024) and is the primary vocabulary for rare disease genomics.
  • MONDO (Monarch Disease Ontology) harmonises ICD-10, OMIM, and ORDO into a single disease hierarchy.
  • NBO (Neurobehavior Ontology) describes behavioural phenotypes in both humans and model organisms.
  • OMIM (Online Mendelian Inheritance in Man) is a curated compendium of gene-disease relationships, identified by MIM numbers.
  • ORDO (Orphanet Rare Disease Ontology) is the European standard classification for rare diseases, including rare neurological diseases.