Genomics

Genomic sequencing produces data at successive stages (raw reads, aligned sequences, and processed variants or expression matrices), each with its own formats, repositories, and governance.

Standards

GA4GH standardises the core formats and schemas: SAM/BAM/CRAM for reads, VCF for variants, VRS for variant representation, and Phenopackets for phenotypic data. It also provides data access and interoperability APIs adopted by controlled-access repositories including EGA and dbGaP.
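To make the variant format concrete, the following is a minimal sketch of reading one VCF data line using only the Python standard library. The eight fixed columns come from the VCF specification; the `VcfRecord` class and `parse_vcf_line` function are hypothetical names introduced for illustration, and the sketch ignores header lines, genotype columns, and INFO type declarations.

```python
from dataclasses import dataclass

# Fixed, tab-separated columns that open every VCF data line.
VCF_FIXED_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


@dataclass(frozen=True)
class VcfRecord:
    """Hypothetical container for the fields used in this sketch."""
    chrom: str
    pos: int      # 1-based position, as in the VCF specification
    ref: str
    alts: tuple   # one entry per alternate allele
    info: dict    # key -> string value, or True for flag entries


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the fixed columns of a single VCF data line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(VCF_FIXED_COLUMNS):
        raise ValueError("expected at least 8 tab-separated columns")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]

    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # Entries without '=' are flags and are stored as True.
            info_dict[key] = value if sep else True

    alts = () if alt == "." else tuple(alt.split(","))
    return VcfRecord(chrom, int(pos), ref, alts, info_dict)


record = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB")
print(record)
```

For real files, a dedicated parser is the safer choice, since production VCFs rely on header metadata that this sketch does not read.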

Data archives

Genomic data are deposited according to their stage in the processing pipeline:

  • Raw reads (FASTQ) go to the INSDC (International Nucleotide Sequence Database Collaboration) partner archives: ENA (Europe), SRA (USA), and DDBJ (Japan), which synchronise their holdings daily. Human raw reads that cannot be openly released because of re-identification risk go to controlled-access archives instead: EGA (Europe) or dbGaP (USA).
  • Aligned reads (SAM/BAM/CRAM) follow the same open or controlled-access route as the raw reads they derive from.
  • Processed variants (VCF) can often be shared as anonymised summary statistics, with open-access submissions going to EVA (Europe) or dbSNP (USA).
  • Clinical significance classifications are curated in ClinVar, with expert panel assessments from ClinGen.
  • Expression count matrices from bulk RNA-seq studies are deposited in NCBI GEO (open) or EGA (controlled access).
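The routing described above can be summarised as a lookup from data type and access level to candidate archives. The sketch below is illustrative only: the key names (`raw_reads`, `aligned_reads`, and so on) and the `candidate_archives` function are hypothetical, and combinations the text does not cover raise an error rather than guess.

```python
# Archive routing as described in the text. Keys are (data_type, access_level).
ARCHIVES = {
    ("raw_reads", "open"): ["ENA", "SRA", "DDBJ"],
    ("raw_reads", "controlled"): ["EGA", "dbGaP"],
    # Aligned reads follow the same route as the raw reads they derive from.
    ("aligned_reads", "open"): ["ENA", "SRA", "DDBJ"],
    ("aligned_reads", "controlled"): ["EGA", "dbGaP"],
    ("variants", "open"): ["EVA", "dbSNP"],
    ("expression_matrix", "open"): ["NCBI GEO"],
    ("expression_matrix", "controlled"): ["EGA"],
}


def candidate_archives(data_type: str, access_level: str) -> list:
    """Return the archives named in the text for this combination."""
    try:
        return ARCHIVES[(data_type, access_level)]
    except KeyError:
        raise ValueError(
            f"no archive listed for {data_type!r} with {access_level!r} access"
        ) from None


print(candidate_archives("raw_reads", "controlled"))
print(candidate_archives("variants", "open"))
```

Which archive within a pair applies in practice depends on jurisdiction and funder policy, which the lookup deliberately leaves to the depositor.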

Federated search and analysis

The GA4GH Beacon specification defines a standardised query interface that allows institutions to expose whether their database contains a given genomic variant without sharing the underlying data. Participating institutions run a Beacon-compliant service locally and respond to queries from external researchers, enabling cross-institutional genomic discovery without moving sensitive data to a central location. BBMRI-ERIC operates a network of Beacon nodes across European biobanks, making population-scale variant frequencies queryable across the network.
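The discovery model can be illustrated with a toy in-memory example. This is not the Beacon API itself: `Variant`, `ToyBeacon`, and `query_network` are hypothetical names, and a real deployment would expose the query over HTTP with authentication and richer response levels. The sketch only shows the core idea that each site answers yes or no locally and returns nothing else.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    assembly: str   # reference assembly, e.g. "GRCh38"
    chrom: str
    pos: int
    ref: str
    alt: str


class ToyBeacon:
    """Hypothetical stand-in for one institution's Beacon service."""

    def __init__(self, name, variants):
        self.name = name
        # The variant set stays inside the institution and is never returned.
        self._variants = frozenset(variants)

    def exists(self, query: Variant) -> bool:
        """Answer only whether the queried variant is present."""
        return query in self._variants


def query_network(beacons, query: Variant) -> dict:
    """Ask every node the same question and collect the yes/no answers."""
    return {beacon.name: beacon.exists(query) for beacon in beacons}


# Illustrative data only; the coordinates are made up for the example.
shared = Variant("GRCh38", "17", 1000, "G", "A")
private = Variant("GRCh38", "2", 2000, "C", "T")
network = [
    ToyBeacon("biobank_a", [shared, private]),
    ToyBeacon("biobank_b", [shared]),
    ToyBeacon("biobank_c", []),
]

answers = query_network(network, private)
print(answers)
```

The caller learns which nodes hold the variant and can then apply for access through each institution's own governance process.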

The GDI (European Genomic Data Infrastructure) project implements this model at continental scale, deploying a federated infrastructure across 21 EU countries using GA4GH Beacon and DRS APIs, with EGA Federated national nodes as the controlled-access repository backbone. GDI implements the 1+MG Framework, the normative reference document produced by the 1+ Million Genomes (1+MG) initiative covering ELSI, data quality, technical standards, and healthcare integration for national genomics programmes. The Genome of Europe project (2024–2028) is building the primary reference dataset for GDI, generating whole-genome sequences from population cohorts across 20+ European countries coordinated through BBMRI-ERIC national nodes.

Single-cell data

Single-cell genomics differs from bulk sequencing in that each cell is profiled individually, producing a cells-by-features matrix as the primary data object rather than a sequence read file. The processing pipeline, tooling, and archives are therefore distinct from those above.
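As a rough sketch of what a cells-by-features matrix looks like, the following builds a dense count matrix from per-cell gene counts using plain Python. The function name `build_count_matrix`, the cell identifiers, and the counts are all made up for illustration; real datasets are far larger and are normally stored as sparse matrices.

```python
def build_count_matrix(observations):
    """Build a dense cells-by-genes matrix from (cell_id, gene, count) triples.

    Returns (cells, genes, matrix), where matrix[i][j] is the count of
    genes[j] in cells[i]. Missing combinations are zero; repeated
    combinations are summed.
    """
    observations = list(observations)
    cells = sorted({cell for cell, _, _ in observations})
    genes = sorted({gene for _, gene, _ in observations})
    row = {cell: i for i, cell in enumerate(cells)}
    col = {gene: j for j, gene in enumerate(genes)}

    matrix = [[0] * len(genes) for _ in cells]
    for cell, gene, count in observations:
        matrix[row[cell]][col[gene]] += count
    return cells, genes, matrix


# Illustrative counts only.
cells, genes, matrix = build_count_matrix([
    ("cell_a", "GAD1", 4),
    ("cell_a", "SNAP25", 7),
    ("cell_b", "SNAP25", 2),
    ("cell_b", "SLC17A7", 9),
])
print(genes)
for cell, counts in zip(cells, matrix):
    print(cell, counts)
```

Rows are cells and columns are features, so per-cell annotations attach to rows and per-gene annotations attach to columns.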

Single-cell data is analysed in two primary ecosystems: AnnData (Python/scverse) and Seurat (R), with the h5ad file format serving as the common exchange format between them. Cell type annotation uses Cell Ontology, and anatomical provenance of cell populations (brain region and tissue type) uses UBERON. Single-cell datasets are deposited in CELLxGENE, which provides interactive exploration and a programmatic Census API for large-scale cross-dataset analysis, or in NeMO Archive for BRAIN Initiative-funded data.

Notable open datasets

  • BICAN is assembling a multi-resolution mammalian brain cell type atlas from single-cell transcriptomic data, deposited in NeMO Archive and CELLxGENE.
  • UK Biobank includes whole-exome sequencing and array genotyping from approximately 500,000 participants, linked to neuroimaging and phenotypic data, available under controlled access.
  • ENIGMA Consortium coordinates genome-wide association studies integrated with neuroimaging data across hundreds of sites, with summary statistics openly shared.

For the phenotyping standards and variant curation infrastructure relevant to rare neurological disease, see Rare Disease and Phenotyping. For the health data access models and regulatory constraints governing sensitive genomic data, see Health.