Methods

This page documents the data sources, transformations, scoring rules, pipeline architecture, and limitations of the GENARCH atlas.

Data Sources

GWAS Catalog — variant-disease associations
GTEx — tissue-specific eQTL and expression
Literature curation — exposure modifiers, mechanism hypotheses
EPA, NLCD — environmental exposure proxies
State/BRFSS — health burden estimates

Transformations

Raw data are harmonized to common ontologies (ICD-11 for disease terms, exposure taxonomies for exposures). Gene symbols and pathway mappings are standardized. Evidence is then scored using the predefined rules described under Scoring Rules.
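
The harmonization step can be pictured with a minimal sketch. Everything named here (GENE_ALIASES, DISEASE_TO_ICD11, HarmonizedRecord, harmonize) is hypothetical and chosen for illustration; the atlas's actual mapping tables and code are not shown on this page. The sketch assumes that symbols are upper-cased before alias lookup and that unmapped disease labels are left unresolved rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative lookup tables (assumptions, not the atlas's real mappings).
# P53 and HER2 are common aliases of the approved symbols TP53 and ERBB2.
GENE_ALIASES = {"P53": "TP53", "HER2": "ERBB2"}
# ICD-11 codes for two example conditions.
DISEASE_TO_ICD11 = {"type 2 diabetes": "5A11", "asthma": "CA23"}


@dataclass(frozen=True)
class HarmonizedRecord:
    gene: str                # standardized gene symbol
    icd11: Optional[str]     # ICD-11 code, or None if the label is unmapped


def harmonize(raw_gene: str, raw_disease: str) -> HarmonizedRecord:
    """Normalize a raw gene symbol and disease label to common vocabularies."""
    # Assumption: upper-casing is safe for the symbols being handled.
    symbol = raw_gene.strip().upper()
    symbol = GENE_ALIASES.get(symbol, symbol)
    # Unknown disease labels map to None so they can be reviewed manually.
    icd11 = DISEASE_TO_ICD11.get(raw_disease.strip().lower())
    return HarmonizedRecord(gene=symbol, icd11=icd11)
```

Returning None for an unmapped label, rather than a best guess, keeps curation gaps visible downstream.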

Scoring Rules

Confidence tiers (high / medium / low) reflect evidence strength:

High: replicated findings, multiple evidence types, multi-ancestry
Medium: consistent single-ancestry or limited replication
Low: suggestive associations, limited validation
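
One way the tiers above could be encoded is sketched below. The field names and numeric thresholds are assumptions made for illustration; the page does not state how "replicated", "multiple evidence types", or "limited replication" are operationalized, so the real rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    replications: int     # number of independent replications (assumed field)
    evidence_types: int   # distinct evidence types, e.g. GWAS, eQTL, literature
    ancestries: int       # ancestry groups in which the finding is supported
    consistent: bool      # direction of effect agrees across available studies


def confidence_tier(e: Evidence) -> str:
    """Map an evidence summary to 'high', 'medium', or 'low'."""
    # High: replicated, more than one evidence type, more than one ancestry.
    if e.replications >= 1 and e.evidence_types >= 2 and e.ancestries >= 2:
        return "high"
    # Medium: consistent single-ancestry findings or limited replication.
    if e.consistent or e.replications >= 1:
        return "medium"
    # Low: suggestive associations with limited validation.
    return "low"
```

Checking the tiers from strongest to weakest means each record receives the highest tier whose criteria it fully meets.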

Pipeline Architecture

The GENARCH pipeline ingests curated data, applies scoring rules, and produces structured JSON outputs for the web atlas. Pipeline runs are versioned and documented in the updates section.
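
The ingest, score, and export flow can be sketched as follows. This is an illustration under stated assumptions, not the repository code: run_pipeline, PIPELINE_VERSION, and the output field names (version, run_date, entries, confidence) are hypothetical, and the actual JSON schema is not given on this page.

```python
import json
from datetime import date

# Hypothetical version string; real runs are versioned by the project.
PIPELINE_VERSION = "0.0.0-example"


def run_pipeline(records, score, version=PIPELINE_VERSION, run_date=None):
    """Score curated records and return a versioned JSON document.

    records: iterable of dicts (already curated and harmonized)
    score:   callable mapping one record to a confidence tier string
    """
    if run_date is None:
        run_date = date.today().isoformat()
    # Attach a confidence tier to each record without mutating the input.
    entries = [{**record, "confidence": score(record)} for record in records]
    payload = {"version": version, "run_date": run_date, "entries": entries}
    # sort_keys keeps the output stable so successive runs diff cleanly.
    return json.dumps(payload, indent=2, sort_keys=True)
```

Stamping each output with a version and run date is one simple way to tie a JSON file back to the run that produced it.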

Pipeline code and configuration are available in the project repository.

Limitations

GWAS and eQTL data remain predominantly European-ancestry
Exposure proxies may not capture true biological exposure
Mechanism briefs are hypothesis-driven, not validated causal models
Community models have geographic and temporal limitations