Methods
Data sources, transformations, scoring rules, pipeline architecture, and limitations for the GENARCH atlas.
Data Sources
- GWAS Catalog — variant-disease associations
- GTEx — tissue-specific eQTL and expression
- Literature curation — exposure modifiers, mechanism hypotheses
- EPA, NLCD — environmental exposure proxies
- State/BRFSS — health burden estimates
Transformations
Raw data are harmonized to common ontologies (ICD-11 for disease terms, exposure taxonomies for exposures). Gene symbols and pathway mappings are standardized. Evidence is then scored using the predefined rules described under Scoring Rules.
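As an illustration of the symbol standardization step, the sketch below maps gene aliases to a canonical symbol before records are scored. The alias table, the function names, and the record field "gene" are assumptions made for this example; they are not taken from the GENARCH code, and a real pipeline would more plausibly load aliases from a nomenclature resource than hard-code them.

```python
# Hypothetical alias table (illustration only). Keys are upper-cased aliases,
# values are the canonical symbols they should be rewritten to.
GENE_ALIASES = {
    "P53": "TP53",
    "HER2": "ERBB2",
}


def standardize_symbol(raw: str) -> str:
    """Return the canonical symbol for a known alias.

    The lookup is case-insensitive. Symbols that are not in the alias table
    are returned unchanged apart from whitespace trimming.
    """
    symbol = raw.strip()
    return GENE_ALIASES.get(symbol.upper(), symbol)


def harmonize(records):
    """Return copies of the records with the assumed 'gene' field standardized."""
    harmonized = []
    for record in records:
        clean = dict(record)  # copy so the input records are not mutated
        clean["gene"] = standardize_symbol(record["gene"])
        harmonized.append(clean)
    return harmonized
```

Unknown symbols are passed through rather than rejected here; a stricter pipeline might instead log or quarantine them for curation.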
Scoring Rules
Confidence tiers (low / medium / high) reflect evidence strength:
- High: replicated findings, multiple evidence types, multi-ancestry
- Medium: consistent findings in a single ancestry, or limited replication
- Low: suggestive associations, limited validation
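The tiers above could be encoded as a small rule function. The following is a minimal sketch, not the atlas's actual scoring code: the Evidence fields and the numeric thresholds (at least one replication, at least two evidence types, at least two ancestries) are assumptions chosen to mirror the wording of the tier definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    # All fields are hypothetical; the real schema is not described here.
    replications: int = 0                              # independent replications
    evidence_types: set = field(default_factory=set)   # e.g. {"gwas", "eqtl"}
    ancestries: set = field(default_factory=set)       # ancestry groups studied


def assign_tier(ev: Evidence) -> str:
    """Map an evidence summary to 'high', 'medium', or 'low'.

    Thresholds are assumed for illustration:
      high   -> replicated, two or more evidence types, two or more ancestries
      medium -> replicated at least once, but not meeting every 'high' criterion
      low    -> everything else (suggestive, unreplicated)
    """
    replicated = ev.replications >= 1
    if replicated and len(ev.evidence_types) >= 2 and len(ev.ancestries) >= 2:
        return "high"
    if replicated:
        return "medium"
    return "low"
```

For example, under these assumed thresholds a replicated association supported by GWAS alone in a single ancestry would land in the medium tier.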
Pipeline Architecture
The GENARCH pipeline ingests curated data, applies scoring rules, and produces structured JSON outputs for the web atlas. Pipeline runs are versioned and documented in the updates section.
Pipeline code and configuration are available in the project repository.
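The ingest, score, and export flow described above can be sketched roughly as follows. Everything named here is hypothetical: the version string is a placeholder, the record fields "replicated" and "multi_ancestry" are invented for the example, and the scoring function is a simplified stand-in for the rules listed under Scoring Rules. The actual output schema is defined in the project repository.

```python
import json

# Placeholder version label; real runs would carry the pipeline's own version.
PIPELINE_VERSION = "0.0.0-example"


def score(record: dict) -> str:
    """Simplified stand-in scoring rule based on two assumed boolean fields."""
    if record.get("replicated") and record.get("multi_ancestry"):
        return "high"
    if record.get("replicated"):
        return "medium"
    return "low"


def run_pipeline(records, version: str = PIPELINE_VERSION) -> str:
    """Score each curated record and return a versioned JSON document."""
    entries = [{**record, "confidence": score(record)} for record in records]
    payload = {"pipeline_version": version, "entries": entries}
    # sort_keys keeps the output stable between runs, which makes diffs
    # between pipeline versions easier to read.
    return json.dumps(payload, indent=2, sort_keys=True)
```

Stamping the version into the output itself is one way to keep each JSON file traceable to the run that produced it.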
Limitations
- GWAS and eQTL data remain predominantly European-ancestry
- Exposure proxies may not capture true biological exposure
- Mechanism briefs are hypothesis-driven, not validated causal models
- Community models have geographic and temporal limitations