Methods
Data sources, transformations, scoring rules, pipeline architecture, and limitations for the GENARCH atlas.
Data Sources
GWAS Catalog
Variant-disease associations from the EBI GWAS Catalog, filtered by genome-wide significance (p < 5 × 10⁻⁸). Accessed via bulk download. Associations are curated per disease module, with ancestry composition and replication status annotated for each top locus.
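The significance filter described above can be sketched in a few lines. This is a minimal illustration, not the atlas's actual ingest code: the column names `SNPS` and `P-VALUE` are assumed to match the bulk download headers, and the sample rows are placeholders rather than real associations.

```python
import csv
import io

GENOME_WIDE_P = 5e-8  # significance threshold stated in the text


def significant_rows(tsv_text: str, p_column: str = "P-VALUE") -> list[dict[str, str]]:
    """Return rows whose p-value is strictly below the genome-wide threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        try:
            p_value = float(row[p_column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or unparsable p-value
        if p_value < GENOME_WIDE_P:
            kept.append(row)
    return kept


# Placeholder rows for illustration only; these are not real associations.
SAMPLE_TSV = "SNPS\tP-VALUE\nrs_demo_1\t3E-9\nrs_demo_2\t5E-8\nrs_demo_3\t1E-6\n"
print([row["SNPS"] for row in significant_rows(SAMPLE_TSV)])
```

The comparison is strict, so a row reported at exactly 5 × 10⁻⁸ is excluded, matching the "p <" wording above.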
GTEx Portal (v8)
Tissue-specific gene expression (TPM) and cis-eQTL data from the Genotype-Tissue Expression project, version 8. Used to annotate tissue context and regulatory variant effects. Multi-tissue expression profiles inform tissue relevance scoring for each gene–disease association.
CDC PLACES
County- and census-tract-level health prevalence estimates from the CDC PLACES dataset (2023 release). Used for community module health burden metrics, including asthma prevalence, cardiovascular disease estimates, and mental health indicators at sub-county resolution.
EPA AQS / AirNow
Air quality monitoring data, including annual mean PM2.5 concentrations from Federal Reference Method monitors within the EPA Air Quality System. Used for exposure characterization and community exposure layers. Monitor-level data are spatially interpolated to produce county-level estimates.
USDA Food Access Research Atlas
Census-tract level food access indicators (2023 release). Used for food desert mapping in the community module, including low-access tract percentages and distance-to-supermarket metrics.
KEGG / Reactome
Biological pathway databases used for gene–pathway membership annotation. KEGG pathway identifiers (e.g., hsa04064 for NF-κB signaling) and Reactome stable IDs provide canonical pathway definitions. Gene membership is verified against current database releases.
Literature Curation
Manual review of peer-reviewed publications for exposure modifiers, mechanism hypotheses, and gene–environment interaction evidence. Each curated claim is tagged with a citation ID traceable to the references section of the relevant entity page. Curation prioritizes systematic reviews, meta-analyses, and large cohort studies.
Pipeline Architecture
The GENARCH data pipeline is a six-stage ETL system implemented in Python 3.11+ with Pydantic validation at every stage. The pipeline is deterministic: identical inputs produce identical outputs.
Stage 1 — Ingest
Parse source files (CSV, TSV, JSON) from pipeline/sources/ into standardized Pydantic models. Every source file has a manifest entry recording origin URL, download date, license, and format.
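A manifest entry of the kind described above might look like the following sketch. The pipeline itself uses Pydantic; a standard-library dataclass stands in here so the example has no dependencies. The field names, file name, URL, date, and license string are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

ALLOWED_FORMATS = frozenset({"csv", "tsv", "json"})  # formats named in the text


@dataclass(frozen=True)
class ManifestEntry:
    """Provenance record for one file under pipeline/sources/ (field names assumed)."""

    filename: str
    origin_url: str
    download_date: date
    license: str
    format: str

    def __post_init__(self) -> None:
        # Reject entries that could not be traced back to a source.
        if self.format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {self.format!r}")
        if not self.origin_url.startswith(("http://", "https://")):
            raise ValueError(f"origin_url is not an http(s) URL: {self.origin_url!r}")
        if not self.license.strip():
            raise ValueError("license must not be empty")


entry = ManifestEntry(
    filename="associations.tsv",  # hypothetical file name
    origin_url="https://example.org/associations.tsv",  # placeholder URL
    download_date=date(2024, 1, 15),  # placeholder date
    license="placeholder-license",
    format="tsv",
)
```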
Stage 2 — Normalize
Map gene symbols to official HGNC nomenclature. Standardize variant IDs to rsID format. Slugify disease and exposure names for URL-safe identifiers. Generate citation IDs using an author-year-suffix convention.
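Two of these normalization steps lend themselves to a short sketch. The exact slug rules and the citation ID layout are not specified here, so both functions below are assumptions: `slugify` lowercases and hyphenates, and `citation_id` appends a letter suffix to `author-year` to keep IDs unique.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Lowercase, strip accents, and join alphanumeric runs with hyphens."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def citation_id(first_author: str, year: int, taken: set[str]) -> str:
    """Return the first unused author-year-suffix ID and record it in `taken`."""
    base = f"{slugify(first_author)}-{year}"
    for suffix in "abcdefghijklmnopqrstuvwxyz":
        candidate = base + suffix
        if candidate not in taken:
            taken.add(candidate)
            return candidate
    raise ValueError(f"no free suffix left for {base}")


def is_rsid(variant_id: str) -> bool:
    """Check that a variant ID has the rs-prefixed numeric form."""
    return re.fullmatch(r"rs\d+", variant_id) is not None


print(slugify("Type 2 Diabetes"))
print(slugify("PM2.5 (fine particulate matter)"))
```

Passing the set of already issued IDs makes the suffix assignment depend only on input order, which keeps the step deterministic.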
Stage 3 — Annotate
Enrich entities with curated context: variant-to-gene mapping, tissue annotations derived from GTEx expression data, and pathway membership from KEGG/Reactome. This curated annotation layer encodes biological reasoning that connects statistical signals to molecular mechanisms.
Stage 4 — Score
Compute strength (0.0–1.0) and confidence (low/medium/high) per edge using the scoring formula described below. Scores are deterministic functions of the evidence signals available for each association.
Stage 5 — Emit
Serialize validated entities to JSON in the data/ directory. Assemble the knowledge graph (graph.json) from all entity relationships, with full edge attribute metadata.
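One way to keep emitted files byte-identical across runs is to fix key order, list order, and formatting at serialization time. The sketch below assumes a simple `nodes`/`edges` layout for graph.json and a `slug` key on each node; neither is confirmed by the description above.

```python
import json


def emit_json(entity: dict) -> str:
    """Serialize with sorted keys and fixed formatting so output is byte-stable."""
    return json.dumps(entity, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


def build_graph(nodes: list[dict], edges: list[dict]) -> dict:
    """Assemble a graph document with nodes and edges in a fixed order."""
    return {
        "nodes": sorted(nodes, key=lambda n: n["slug"]),
        "edges": sorted(edges, key=lambda e: (e["source"], e["target"])),
    }


# Placeholder slugs; the two calls differ only in input order.
nodes = [{"slug": "gene-a"}, {"slug": "disease-x"}]
edges = [{"source": "gene-a", "target": "disease-x", "strength": 0.5}]
first = emit_json(build_graph(nodes, edges))
second = emit_json(build_graph(list(reversed(nodes)), edges))
print(first == second)
```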
Stage 6 — Validate
Blocking validation gate. Enforces JSON schema conformance, cross-link integrity (all referenced slugs resolve), citation existence, slug-filename consistency, completeness checks, and graph integrity (no orphan edges). The pipeline fails if any check does not pass.
All pipeline runs are versioned and documented in the Updates section.
Strength Scoring Algorithm
Every edge in the knowledge graph carries a strength score (0.0–1.0) computed as a weighted sum of evidence signals:
strength = 0.4 × gwas_signal + 0.25 × eqtl_signal + 0.2 × pathway_membership + 0.15 × literature_support
gwas_signal: Derived from −log₁₀(p-value), capped at 20 and divided by 20 to scale to [0, 1]. A GWAS association at p = 10⁻⁸ yields gwas_signal = 8/20 = 0.4; at p = 10⁻²⁰ or below, gwas_signal = 1.0.
eqtl_signal: Binary (0 or 1) based on presence of a significant cis-eQTL in the relevant tissue from GTEx v8. A gene with a confirmed regulatory variant in the disease-relevant tissue scores 1.
pathway_membership: Binary (0 or 1) based on curated pathway inclusion in KEGG or Reactome for a pathway linked to the disease mechanism.
literature_support: Manual rating (0, 0.5, or 1.0) assigned during curation based on quality and quantity of supporting publications. A score of 1.0 requires multiple independent publications with concordant findings.
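The formula and the four signal definitions above translate directly into code. The following is a sketch using the stated weights; the function and parameter names are illustrative, and the handling of invalid inputs (raising `ValueError`) is an assumption.

```python
import math


def gwas_signal(p_value: float) -> float:
    """Scale -log10(p) to [0, 1], capping at 20 as described in the text."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    return min(-math.log10(p_value), 20.0) / 20.0


def strength(p_value: float, has_eqtl: bool, in_pathway: bool, literature_support: float) -> float:
    """Weighted sum of the four evidence signals using the stated weights."""
    if literature_support not in (0.0, 0.5, 1.0):
        raise ValueError("literature_support must be 0, 0.5, or 1.0")
    return (
        0.4 * gwas_signal(p_value)
        + 0.25 * float(has_eqtl)
        + 0.2 * float(in_pathway)
        + 0.15 * literature_support
    )


# Worked example: p = 1e-8, eQTL present, no pathway link, literature rated 0.5.
# 0.4 * 0.4 + 0.25 * 1 + 0.2 * 0 + 0.15 * 0.5 = 0.485 (up to float rounding)
print(round(strength(1e-8, True, False, 0.5), 3))
```

Because the weights sum to 1.0, an association with every signal at its maximum scores exactly 1.0. P-values that underflow to 0.0 as floats are rejected in this sketch and would need separate handling.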
Confidence Rating Rules
Confidence is computed from evidence convergence, not subjective judgment:
HIGH
Supported by two or more independent evidence types (e.g., GWAS + eQTL, or GWAS + pathway + literature). Replicated in at least one independent cohort or cross-ancestry study. Tissue specificity confirmed by expression data.
MEDIUM
Supported by one primary evidence type with corroborating literature. Not yet replicated cross-ancestry. Tissue context inferred but not directly confirmed by expression data.
LOW
Single evidence source. Inferred from pathway membership or literature review without direct statistical support. No tissue confirmation. Flagged as hypothesis-generating.
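The three tiers can be read as a small decision function. The rules above do not cover every combination (for example, two evidence types without replication), so the sketch below fills those gaps with assumptions that are marked in the comments; it is one possible reading, not the atlas's rule set.

```python
ALLOWED_EVIDENCE = frozenset({"GWAS", "eQTL", "literature", "pathway", "inferred"})
STATISTICAL = frozenset({"GWAS", "eQTL"})  # assumed meaning of "direct statistical support"


def confidence(evidence_types: set[str], replicated: bool, tissue_confirmed: bool) -> str:
    """Map evidence convergence to low/medium/high under the stated assumptions."""
    unknown = evidence_types - ALLOWED_EVIDENCE
    if unknown:
        raise ValueError(f"unknown evidence types: {sorted(unknown)}")
    # LOW: a single source, or no direct statistical support.
    if len(evidence_types) < 2 or not (evidence_types & STATISTICAL):
        return "low"
    # HIGH: two or more evidence types, replicated, tissue confirmed.
    if replicated and tissue_confirmed:
        return "high"
    # Assumption: every remaining multi-evidence case falls back to MEDIUM.
    return "medium"


print(confidence({"GWAS", "eQTL"}, replicated=True, tissue_confirmed=True))
print(confidence({"GWAS", "literature"}, replicated=False, tissue_confirmed=False))
print(confidence({"pathway", "literature"}, replicated=False, tissue_confirmed=False))
```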
Edge Attributes
Every relationship in the knowledge graph carries mandatory attributes for traceability and reproducibility:
| Attribute | Values | Description |
|---|---|---|
| evidence_type | GWAS, eQTL, literature, pathway, inferred | Primary evidence source for the relationship |
| direction | amplify, buffer, unknown | Whether the exposure/gene amplifies or buffers disease risk at population level |
| tissue | tissue/cell type name(s) | Tissue or cell type(s) where the relationship is observed |
| strength | 0.0–1.0 | Composite score computed per the scoring algorithm above |
| confidence | low, medium, high | Evidence convergence rating per the confidence rules above |
| sources | citation IDs | Reference identifiers for traceability to primary literature |
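The attribute table maps naturally onto a validated record type. The pipeline uses Pydantic for this; the dataclass below is a dependency-free stand-in. The `source` and `target` fields, the use of tuples, and the requirement that `tissue` and `sources` be non-empty are assumptions drawn from the word "mandatory".

```python
from dataclasses import dataclass

EVIDENCE_TYPES = frozenset({"GWAS", "eQTL", "literature", "pathway", "inferred"})
DIRECTIONS = frozenset({"amplify", "buffer", "unknown"})
CONFIDENCE_LEVELS = frozenset({"low", "medium", "high"})


@dataclass(frozen=True)
class Edge:
    """One knowledge-graph relationship with the mandatory attributes."""

    source: str
    target: str
    evidence_type: str
    direction: str
    tissue: tuple[str, ...]
    strength: float
    confidence: str
    sources: tuple[str, ...]

    def __post_init__(self) -> None:
        if self.evidence_type not in EVIDENCE_TYPES:
            raise ValueError(f"invalid evidence_type: {self.evidence_type!r}")
        if self.direction not in DIRECTIONS:
            raise ValueError(f"invalid direction: {self.direction!r}")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"invalid confidence: {self.confidence!r}")
        if not 0.0 <= self.strength <= 1.0:
            raise ValueError("strength must lie in [0.0, 1.0]")
        if not self.tissue or not self.sources:
            raise ValueError("tissue and sources must not be empty")


# Placeholder slugs and citation ID, for illustration only.
edge = Edge("gene-a", "disease-x", "GWAS", "amplify", ("lung",), 0.485, "medium", ("smith-2020a",))
```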
Validation Checks
The pipeline validation stage (Stage 6) enforces the following checks as a blocking gate before data is published:
- Schema validation: All JSON outputs conform to defined schemas (Pydantic models with extra="forbid").
- Cross-link integrity: Every referenced slug (disease, exposure, gene, pathway) resolves to an existing entity file.
- Citation existence: Every citation ID referenced in evidence tables, edges, and briefs maps to a formatted reference entry.
- Completeness: Every disease requires at least one exposure modifier, one top locus, one tissue entry, and non-empty population equity notes.
- Graph integrity: No orphan edges (every edge source and target must exist as a node in the graph).
- Brief frontmatter validity: All mechanism brief files pass schema checks on required fields (slug, title, question, related_disease, related_exposure, references).
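Two of the checks listed above, graph integrity and citation existence, can be sketched as a blocking gate that collects every failure before raising. The edge layout (`source`, `target`, `sources` keys) and the function names are assumptions for illustration.

```python
def validate_graph(nodes: set[str], edges: list[dict], references: set[str]) -> list[str]:
    """Collect orphan-edge and missing-citation errors without stopping early."""
    errors = []
    for index, edge in enumerate(edges):
        for end in ("source", "target"):
            if edge[end] not in nodes:
                errors.append(f"edge {index}: {end} {edge[end]!r} is not a node")
        for cid in edge.get("sources", []):
            if cid not in references:
                errors.append(f"edge {index}: citation {cid!r} has no reference entry")
    return errors


def validation_gate(nodes: set[str], edges: list[dict], references: set[str]) -> None:
    """Raise if any check fails, so nothing is published from a bad run."""
    errors = validate_graph(nodes, edges, references)
    if errors:
        raise ValueError("validation failed:\n" + "\n".join(errors))


# Placeholder graph: the second edge points at a missing node and a missing citation.
nodes = {"gene-a", "disease-x"}
references = {"smith-2020a"}
edges = [
    {"source": "gene-a", "target": "disease-x", "sources": ["smith-2020a"]},
    {"source": "gene-a", "target": "disease-y", "sources": ["jones-2021a"]},
]
for message in validate_graph(nodes, edges, references):
    print(message)
```

Collecting all errors before failing lets one run report every broken link, which is usually more practical than stopping at the first.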
Limitations
GWAS and eQTL data remain predominantly European-ancestry; transferability to non-European populations is limited and explicitly noted per disease module in the Population Equity Notes section.
Exposure proxies (e.g., county-level PM2.5 annual means from fixed monitors) may not capture individual-level biological exposure. Spatial interpolation introduces uncertainty, and temporal averaging masks peak exposure events.
Mechanism briefs are hypothesis-driven syntheses of existing evidence, not experimentally validated causal models. They represent the current state of knowledge and are subject to revision.
Community module models are trained on state and regional data with geographic and temporal limitations. The ecological fallacy applies: area-level associations do not imply individual-level effects.
Strength scores are composite indices for relative comparison within the atlas, not absolute effect sizes suitable for clinical or regulatory decision-making.
Literature curation reflects available publications and may be subject to selection bias, publication bias, and temporal gaps between primary research and atlas incorporation.