Genomics and Single-Cell Data

Genomic sequencing produces data at three successive levels, each with its own formats and repositories. Raw reads (FASTQ) are the direct sequencer output and are deposited in open archives (ENA, DDBJ, SRA) or, for sensitive human data, in controlled-access repositories (EGA, dbGaP). Aligned reads (SAM/BAM/CRAM) are produced by mapping FASTQ reads to a reference genome and are the working format for most downstream analyses; human BAM/CRAM files are typically archived in EGA or dbGaP under controlled access. Variants (VCF) are derived from aligned reads by variant calling. Aggregate, site-level variant data can usually be shared openly and is deposited in EVA (Europe) or dbSNP (US), whereas VCF files that carry individual-level human genotypes remain identifiable and normally require controlled access. VRS (Variant Representation Specification) is the GA4GH standard for computationally precise, globally unique variant identifiers, which complement VCF notation across genome builds.

For expression studies, the endpoint is a count matrix rather than a VCF file, deposited in NCBI GEO (open) or EGA (controlled access). Across all of these levels, GA4GH standards govern data access and interoperability.

Single-cell data uses AnnData as the exchange format and Cell Ontology for cell type annotation, with BICAN (BRAIN Initiative Cell Atlas Network) providing the international cell atlas reference and CELLxGENE the primary open portal for exploration and download.
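The pairings of data level, format, and repository described above can be summarised as a small lookup table. The following Python sketch is illustrative only: the names `DataLevel`, `LEVELS`, and `candidate_repositories` and the dictionary keys are hypothetical labels, and the table lists only the repositories named in this section. An empty entry means that this section names no repository for that case, not that none exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataLevel:
    """Formats and repositories for one data level, as named in this section."""
    formats: tuple[str, ...]
    open_repos: tuple[str, ...]
    controlled_repos: tuple[str, ...]


# Keys are hypothetical labels; repository names are copied from the text.
LEVELS = {
    "raw_reads": DataLevel(("FASTQ",), ("ENA", "DDBJ", "SRA"), ("EGA", "dbGaP")),
    # The text names only controlled-access archives for human aligned reads.
    "aligned_reads": DataLevel(("SAM", "BAM", "CRAM"), (), ("EGA", "dbGaP")),
    # The text names only open archives for variants.
    "variants": DataLevel(("VCF",), ("EVA", "dbSNP"), ()),
    "expression_counts": DataLevel(("count matrix",), ("NCBI GEO",), ("EGA",)),
}


def candidate_repositories(level: str, controlled_access: bool) -> tuple[str, ...]:
    """Return the repositories this section names for a level and access mode."""
    entry = LEVELS[level]  # raises KeyError for an unknown level label
    return entry.controlled_repos if controlled_access else entry.open_repos


if __name__ == "__main__":
    # Sensitive human raw reads go to a controlled-access repository.
    print(candidate_repositories("raw_reads", controlled_access=True))
    # Openly shareable variant summaries go to an open archive.
    print(candidate_repositories("variants", controlled_access=False))
```

A table like this is a starting point for a data management plan rather than a decision procedure. The actual choice of repository also depends on consent terms, funder requirements, and jurisdiction.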