Methods

This page describes the data sources, transformations, scoring rules, pipeline architecture, and limitations of the GENARCH atlas.

Data Sources

  • GWAS Catalog — variant-disease associations
  • GTEx — tissue-specific eQTL and expression
  • Literature curation — exposure modifiers, mechanism hypotheses
  • EPA, NLCD — environmental exposure proxies
  • State/BRFSS — health burden estimates

Transformations

Raw data are harmonized to common ontologies (ICD-11, exposure taxonomies). Gene symbols and pathway mappings are standardized. Evidence is then scored using the predefined rules described under Scoring Rules.
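
As an illustration of what gene-symbol standardization can look like, here is a minimal sketch. It is not the GENARCH implementation: the alias table, the `gene` field name, and both function names are hypothetical, and the rule (trim whitespace, then a case-insensitive alias lookup) is an assumption.

```python
# Hypothetical alias table for illustration only. A real pipeline would load
# aliases from a curated nomenclature resource instead of hard-coding them.
ALIAS_TO_SYMBOL = {
    "P53": "TP53",
    "MLL2": "KMT2D",
}


def standardize_symbol(raw: str) -> str:
    """Map a raw gene label to a canonical symbol.

    Assumed rule: strip whitespace, look the label up case-insensitively in
    the alias table, and return the stripped label unchanged if no alias
    matches (so mixed-case symbols are preserved).
    """
    label = raw.strip()
    return ALIAS_TO_SYMBOL.get(label.upper(), label)


def harmonize_records(records):
    """Return copies of the records with the 'gene' field standardized.

    The input records are not mutated.
    """
    return [{**r, "gene": standardize_symbol(r["gene"])} for r in records]


if __name__ == "__main__":
    rows = [{"gene": " p53 ", "trait": "example trait"}]
    print(harmonize_records(rows))
```

Unrecognized labels pass through unchanged rather than being dropped, so anything the alias table misses stays visible for later curation.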

Scoring Rules

Confidence tiers (low / medium / high) reflect evidence strength:

  • High: replicated findings, multiple evidence types, multi-ancestry
  • Medium: consistent single-ancestry or limited replication
  • Low: suggestive associations, limited validation
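
The tiers above can be sketched as a small rule function. The thresholds are assumptions made for illustration, since the text does not define them: "multiple" is read as two or more, and any replicated finding that falls short of the high tier is treated as medium. The `Evidence` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    """Hypothetical summary of the evidence behind one association."""
    replicated: bool
    evidence_types: frozenset = field(default_factory=frozenset)
    ancestries: frozenset = field(default_factory=frozenset)


def confidence_tier(ev: Evidence) -> str:
    """Assign a confidence tier under the assumed thresholds.

    high:   replicated, at least two evidence types, at least two ancestries
    medium: replicated, but missing one of the other high-tier criteria
    low:    not replicated (suggestive only)
    """
    if ev.replicated and len(ev.evidence_types) >= 2 and len(ev.ancestries) >= 2:
        return "high"
    if ev.replicated:
        return "medium"
    return "low"


if __name__ == "__main__":
    ev = Evidence(
        replicated=True,
        evidence_types=frozenset({"GWAS", "eQTL"}),
        ancestries=frozenset({"EUR"}),
    )
    print(confidence_tier(ev))
```

Keeping the rules in one pure function makes them easy to test and to change when the tier definitions are revised.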

Pipeline Architecture

The GENARCH pipeline ingests curated data, applies scoring rules, and produces structured JSON outputs for the web atlas. Pipeline runs are versioned and documented in the updates section.

Pipeline code and configuration are available in the project repository.
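
The repository is the authoritative source for the pipeline. Purely to illustrate the ingest, score, and export flow described above, here is a minimal sketch. The function names, the record fields, the output keys (`pipeline_version`, `associations`, `confidence`), and the version string are all hypothetical and do not reflect the actual GENARCH output schema.

```python
import json
from typing import Callable, Iterable


def run_pipeline(
    records: Iterable[dict],
    score: Callable[[dict], str],
    version: str,
) -> str:
    """Score each curated record and serialize the result as versioned JSON.

    The output layout here is an assumption made for illustration.
    """
    scored = [{**r, "confidence": score(r)} for r in records]
    payload = {"pipeline_version": version, "associations": scored}
    # sort_keys gives a stable key order, so outputs from different runs
    # can be compared with a plain text diff.
    return json.dumps(payload, indent=2, sort_keys=True)


def toy_score(record: dict) -> str:
    """Placeholder scoring rule, not the atlas rule set."""
    return "medium" if record.get("replicated") else "low"


if __name__ == "__main__":
    curated = [{"gene": "TP53", "trait": "example trait", "replicated": True}]
    print(run_pipeline(curated, toy_score, version="0.0.0-example"))
```

Passing the scoring rule in as a function keeps ingestion and export independent of how the confidence tiers are defined.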

Limitations

  • GWAS and eQTL data remain predominantly European-ancestry
  • Exposure proxies may not capture true biological exposure
  • Mechanism briefs are hypothesis-driven, not validated causal models
  • Community models have geographic and temporal limitations