Methods

Data sources, transformations, scoring rules, pipeline architecture, and limitations for the GENARCH atlas.

Data Sources

GWAS Catalog

Variant-disease associations from the EBI GWAS Catalog, filtered by genome-wide significance (p < 5 × 10⁻⁸). Accessed via bulk download. Associations are curated per disease module, with ancestry composition and replication status annotated for each top locus.
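The significance filter can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the bulk download has already been parsed into dictionaries, and the `p_value` and `variant` field names are hypothetical.

```python
# Genome-wide significance threshold stated in the text (p < 5e-8).
GENOME_WIDE_P = 5e-8


def filter_significant(rows, threshold=GENOME_WIDE_P):
    """Keep associations whose p-value is strictly below the threshold.

    `rows` is an iterable of dicts with a hypothetical "p_value" key.
    Rows with a missing or unparseable p-value are dropped.
    """
    kept = []
    for row in rows:
        try:
            p = float(row["p_value"])
        except (KeyError, TypeError, ValueError):
            continue
        if 0.0 <= p < threshold:
            kept.append(row)
    return kept


# Illustrative rows only; these are not real catalog entries.
example = [
    {"variant": "rs_example_1", "p_value": "3e-12"},
    {"variant": "rs_example_2", "p_value": "1e-6"},
    {"variant": "rs_example_3", "p_value": None},
]
significant = filter_significant(example)
```

Only the first illustrative row passes: the second is above the threshold and the third has no usable p-value.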

GTEx Portal (v8)

Tissue-specific gene expression (TPM) and cis-eQTL data from the Genotype-Tissue Expression project, version 8. Used to annotate tissue context and regulatory variant effects. Multi-tissue expression profiles inform tissue relevance scoring for each gene–disease association.

CDC PLACES

County and census-tract level health prevalence estimates from the CDC PLACES dataset (2023 release). Used for community module health burden metrics, including asthma prevalence, cardiovascular disease estimates, and mental health indicators at sub-county resolution.

EPA AQS / AirNow

Air quality monitoring data including annual mean PM2.5 concentrations from Federal Reference Method monitors within the EPA Air Quality System. Used for exposure characterization and community exposure layers. Monitor-level data are spatially interpolated to produce county-level estimates.
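The text does not say which interpolation method is used. Purely as an illustration of the idea, the sketch below uses inverse distance weighting (IDW), a common simple choice, to estimate a value at one point such as a county centroid. The method, the function name `idw_estimate`, and the planar coordinates are all assumptions.

```python
import math


def idw_estimate(target, monitors, power=2.0):
    """Inverse-distance-weighted mean of monitor values at `target`.

    target   -- (x, y) in a planar coordinate system (assumption; real
                longitude/latitude data would need projecting first)
    monitors -- list of ((x, y), value) pairs, e.g. annual mean PM2.5
    """
    if not monitors:
        raise ValueError("at least one monitor is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in monitors:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return float(value)  # target coincides with a monitor
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total
```

Nearby monitors dominate the estimate, so locations far from any monitor inherit values from distant sites, which is one source of the interpolation uncertainty.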

USDA Food Access Research Atlas

Census-tract level food access indicators (2023 release). Used for food desert mapping in the community module, including low-access tract percentages and distance-to-supermarket metrics.

KEGG / Reactome

Biological pathway databases used for gene–pathway membership annotation. KEGG pathway identifiers (e.g., hsa04064 for NF-κB signaling) and Reactome stable IDs provide canonical pathway definitions. Gene membership is verified against current database releases.

Literature Curation

Manual review of peer-reviewed publications for exposure modifiers, mechanism hypotheses, and gene–environment interaction evidence. Each curated claim is tagged with a citation ID traceable to the references section of the relevant entity page. Curation prioritizes systematic reviews, meta-analyses, and large cohort studies.

Pipeline Architecture

The GENARCH data pipeline is a 6-stage ETL system implemented in Python 3.11+ with Pydantic validation at every stage. The pipeline is deterministic: identical inputs produce identical outputs.

Stage 1 — Ingest

Parse source files (CSV, TSV, JSON) from pipeline/sources/ into standardized Pydantic models. Every source file has a manifest entry recording origin URL, download date, license, and format.
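A manifest entry of this kind might look like the sketch below. It uses a standard-library dataclass as a stand-in for the Pydantic models so that it runs without third-party packages. The class name, field names, and validation rules are hypothetical; only the four recorded facts (origin URL, download date, license, format) come from the description above.

```python
from dataclasses import dataclass
from datetime import date

ALLOWED_FORMATS = {"csv", "tsv", "json"}  # formats named in the text


@dataclass(frozen=True)
class ManifestEntry:
    """Provenance record for one source file (field names are hypothetical)."""

    path: str
    origin_url: str
    download_date: date
    license: str
    format: str

    def __post_init__(self):
        # Reject entries the ingest stage would not know how to parse.
        if self.format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {self.format!r}")
        if not self.origin_url.startswith(("http://", "https://")):
            raise ValueError(f"origin_url must be an http(s) URL: {self.origin_url!r}")


# Placeholder values throughout; not a real GENARCH source.
entry = ManifestEntry(
    path="pipeline/sources/example.tsv",
    origin_url="https://example.org/data.tsv",
    download_date=date(2024, 1, 1),
    license="placeholder-license",
    format="tsv",
)
```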

Stage 2 — Normalize

Map gene symbols to HGNC official nomenclature. Standardize variant IDs to rsID format. Slugify disease and exposure names for URL-safe identifiers. Generate citation IDs using author-year-suffix convention.
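Two of these normalization steps, slug generation and citation IDs, can be sketched with the standard library. The helper names are hypothetical, and the collision-suffix scheme ("smith2020", then "smith2020b", and so on) is an assumption about what the author-year-suffix convention means. HGNC and rsID mapping are not shown because they depend on external lookup tables.

```python
import re
import unicodedata


def slugify(name):
    """Lowercase, strip accents, and join alphanumeric runs with hyphens."""
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def citation_id(first_author_surname, year, existing_ids):
    """Build an author-year ID, adding a letter suffix on collision.

    The suffix scheme is an assumption, not the documented convention.
    """
    base = f"{slugify(first_author_surname).replace('-', '')}{year}"
    if base not in existing_ids:
        return base
    for suffix in "bcdefghijklmnopqrstuvwxyz":
        candidate = base + suffix
        if candidate not in existing_ids:
            return candidate
    raise ValueError(f"too many citations share the id {base!r}")
```

For example, `slugify("Type 2 Diabetes")` returns `"type-2-diabetes"`, and `citation_id("Smith", 2020, {"smith2020"})` returns `"smith2020b"`.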

Stage 3 — Annotate

Enrich entities with curated context: variant-to-gene mapping, tissue annotations derived from GTEx expression data, and pathway membership from KEGG/Reactome. This curated annotation layer encodes biological reasoning that connects statistical signals to molecular mechanisms.

Stage 4 — Score

Compute strength (0.0–1.0) and confidence (low/medium/high) per edge using the scoring formula and confidence rules described below. Scores are deterministic functions of the evidence signals available for each association.

Stage 5 — Emit

Serialize validated entities to JSON in the data/ directory. Assemble the knowledge graph (graph.json) from all entity relationships, with full edge attribute metadata.
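One way to keep emitted JSON byte-for-byte reproducible is to fix key order and element order before writing. The sketch below is an assumption about how that could be done, not the pipeline's actual serializer; the `nodes`/`edges` layout and the `id`, `source`, and `target` keys are hypothetical.

```python
import json


def to_canonical_json(entity):
    """Serialize with sorted keys and fixed formatting so output is stable."""
    return json.dumps(entity, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


def build_graph(nodes, edges):
    """Assemble a graph document with nodes and edges in a fixed order."""
    return {
        "nodes": sorted(nodes, key=lambda n: n["id"]),
        "edges": sorted(edges, key=lambda e: (e["source"], e["target"])),
    }
```

With this approach, two runs that produce the same entities in a different in-memory order still write identical files, which makes diffs between pipeline versions meaningful.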

Stage 6 — Validate

Blocking validation gate. Enforces JSON schema conformance, cross-link integrity (all referenced slugs resolve), citation existence, slug-filename consistency, completeness checks, and graph integrity (no orphan edges). Pipeline fails if any validation check does not pass.

All pipeline runs are versioned and documented in the Updates section.

Strength Scoring Algorithm

Every edge in the knowledge graph carries a strength score (0.0–1.0) computed as a weighted sum of evidence signals:

strength = 0.4 × gwas_signal + 0.25 × eqtl_signal + 0.2 × pathway_membership + 0.15 × literature_support

gwas_signal: Derived from −log₁₀(p-value), capped at 20 and divided by 20 to scale it to [0, 1]. A GWAS association at p = 10⁻⁸ yields gwas_signal = 8/20 = 0.4; at p = 10⁻²⁰ or below, gwas_signal = 1.0.

eqtl_signal: Binary (0 or 1) based on presence of a significant cis-eQTL in the relevant tissue from GTEx v8. A gene with a confirmed regulatory variant in the disease-relevant tissue scores 1.

pathway_membership: Binary (0 or 1) based on curated pathway inclusion in KEGG or Reactome for a pathway linked to the disease mechanism.

literature_support: Manual rating (0, 0.5, or 1.0) assigned during curation based on quality and quantity of supporting publications. A score of 1.0 requires multiple independent publications with concordant findings.
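The formula and the four signal definitions above can be written out as a short sketch. The weights, the cap of 20, and the allowed literature ratings come from the text. The function names, the treatment of a missing p-value as zero signal, and the handling of p = 0 are assumptions.

```python
import math

# Weights from the strength formula; they sum to 1.0.
WEIGHTS = {
    "gwas_signal": 0.40,
    "eqtl_signal": 0.25,
    "pathway_membership": 0.20,
    "literature_support": 0.15,
}
NEG_LOG10_P_CAP = 20.0


def gwas_signal(p_value):
    """Scale -log10(p) to [0, 1], capping at 20 (p = 1e-20 or smaller)."""
    if p_value is None:
        return 0.0  # assumption: no GWAS evidence contributes zero
    if p_value <= 0.0:
        return 1.0  # assumption: p reported as zero is beyond the cap
    neg_log10_p = -math.log10(p_value)
    return min(max(neg_log10_p, 0.0), NEG_LOG10_P_CAP) / NEG_LOG10_P_CAP


def strength(p_value, has_eqtl, in_pathway, literature_support):
    """Weighted sum of the four evidence signals, in [0, 1]."""
    if literature_support not in (0, 0.5, 1.0):
        raise ValueError("literature_support must be 0, 0.5, or 1.0")
    signals = {
        "gwas_signal": gwas_signal(p_value),
        "eqtl_signal": 1.0 if has_eqtl else 0.0,
        "pathway_membership": 1.0 if in_pathway else 0.0,
        "literature_support": float(literature_support),
    }
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
```

As a worked case, an association at p = 10⁻⁸ with a significant cis-eQTL, no pathway membership, and a literature rating of 0.5 gives 0.4 × 0.4 + 0.25 × 1 + 0.2 × 0 + 0.15 × 0.5 = 0.485.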

Confidence Rating Rules

Confidence is computed from evidence convergence, not subjective judgment:

HIGH

Supported by two or more independent evidence types (e.g., GWAS + eQTL, or GWAS + pathway + literature). Replicated in at least one independent cohort or cross-ancestry study. Tissue specificity confirmed by expression data.

MEDIUM

Supported by one primary evidence type with corroborating literature. Not yet replicated cross-ancestry. Tissue context inferred but not directly confirmed by expression data.

LOW

Single evidence source. Inferred from pathway membership or literature review without direct statistical support. No tissue confirmation. Flagged as hypothesis-generating.
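The three tiers can be read as a small decision function. The sketch below is one possible reading, not the documented implementation. It assumes that "direct statistical support" means GWAS or eQTL evidence, that `inferred` does not count toward convergence, and that associations meeting the evidence requirement but lacking replication or tissue confirmation fall to medium. The function and parameter names are hypothetical.

```python
EVIDENCE_TYPES = {"GWAS", "eQTL", "literature", "pathway", "inferred"}
STATISTICAL_TYPES = {"GWAS", "eQTL"}  # assumption: "direct statistical support"


def confidence(evidence_types, replicated, tissue_confirmed):
    """Return "high", "medium", or "low" under one reading of the rules."""
    types = set(evidence_types)
    unknown = types - EVIDENCE_TYPES
    if unknown:
        raise ValueError(f"unknown evidence types: {sorted(unknown)}")
    types.discard("inferred")  # assumption: inferred links add no convergence

    # Two or more evidence types, at least one of them statistical.
    converging = len(types) >= 2 and bool(types & STATISTICAL_TYPES)

    if converging and replicated and tissue_confirmed:
        return "high"
    if converging:
        return "medium"
    return "low"
```

Under this reading, GWAS plus eQTL evidence with replication and tissue confirmation rates high, GWAS plus literature without replication rates medium, and pathway membership alone rates low.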

Edge Attributes

Every relationship in the knowledge graph carries mandatory attributes for traceability and reproducibility:

Attribute       Values                                         Description
evidence_type   GWAS | eQTL | literature | pathway | inferred  Primary evidence source for the relationship
direction       amplify | buffer | unknown                     Whether the exposure/gene amplifies or buffers disease risk at population level
tissue          tissue/cell type name(s)                       Tissue or cell type(s) where the relationship is observed
strength        0.0–1.0                                        Composite score computed per scoring algorithm above
confidence      low | medium | high                            Evidence convergence rating per confidence rules above
sources         citation IDs                                   Reference identifiers for traceability to primary literature
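The edge attributes listed above map naturally onto a typed record. The sketch below again uses a standard-library dataclass as a stand-in for a Pydantic model. The allowed values come from the table; the class name, the use of tuples, and the requirement that tissue and sources be non-empty are assumptions.

```python
from dataclasses import dataclass

EVIDENCE_TYPES = {"GWAS", "eQTL", "literature", "pathway", "inferred"}
DIRECTIONS = {"amplify", "buffer", "unknown"}
CONFIDENCE_LEVELS = {"low", "medium", "high"}


@dataclass(frozen=True)
class EdgeAttributes:
    """Mandatory attributes carried by one knowledge-graph edge."""

    evidence_type: str
    direction: str
    tissue: tuple   # one or more tissue or cell type names
    strength: float
    confidence: str
    sources: tuple  # citation IDs

    def __post_init__(self):
        if self.evidence_type not in EVIDENCE_TYPES:
            raise ValueError(f"invalid evidence_type: {self.evidence_type!r}")
        if self.direction not in DIRECTIONS:
            raise ValueError(f"invalid direction: {self.direction!r}")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"invalid confidence: {self.confidence!r}")
        if not 0.0 <= self.strength <= 1.0:
            raise ValueError(f"strength out of range: {self.strength!r}")
        if not self.tissue or not self.sources:
            raise ValueError("tissue and sources must be non-empty")


# Placeholder values; not a real GENARCH edge.
edge = EdgeAttributes(
    evidence_type="GWAS",
    direction="amplify",
    tissue=("example tissue",),
    strength=0.485,
    confidence="medium",
    sources=("example2020",),
)
```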

Validation Checks

The pipeline validation stage (Stage 6) enforces the following checks as a blocking gate before data is published:

  • Schema validation: All JSON outputs conform to defined schemas (Pydantic models with extra="forbid").
  • Cross-link integrity: Every referenced slug (disease, exposure, gene, pathway) resolves to an existing entity file.
  • Citation existence: Every citation ID referenced in evidence tables, edges, and briefs maps to a formatted reference entry.
  • Completeness: Every disease requires at least one exposure modifier, one top locus, one tissue entry, and non-empty population equity notes.
  • Graph integrity: No orphan edges (every edge source and target must exist as a node in the graph).
  • Brief frontmatter validity: All mechanism brief files pass schema checks on required fields (slug, title, question, related_disease, related_exposure, references).
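Two of the checks above, graph integrity and citation existence, can be sketched as a blocking gate that collects every failure before raising. The function names and the graph layout (`nodes` with an `id`, `edges` with `source` and `target`) are hypothetical; the real checks presumably run against the emitted JSON files.

```python
def find_orphan_edges(graph):
    """Return edges whose source or target is not a node in the graph."""
    node_ids = {node["id"] for node in graph["nodes"]}
    return [
        edge
        for edge in graph["edges"]
        if edge["source"] not in node_ids or edge["target"] not in node_ids
    ]


def find_missing_citations(cited_ids, reference_ids):
    """Return cited IDs that have no formatted reference entry."""
    return sorted(set(cited_ids) - set(reference_ids))


def validate(graph, cited_ids, reference_ids):
    """Blocking gate: raise if any check fails, listing every failure."""
    errors = []
    for edge in find_orphan_edges(graph):
        errors.append(f"orphan edge: {edge['source']} -> {edge['target']}")
    for citation in find_missing_citations(cited_ids, reference_ids):
        errors.append(f"missing reference for citation: {citation}")
    if errors:
        raise ValueError("validation failed:\n" + "\n".join(errors))
```

Collecting all errors before failing means a single run reports every broken link, rather than stopping at the first one.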

Limitations

GWAS and eQTL data remain predominantly European-ancestry; transferability to non-European populations is limited and explicitly noted per disease module in the Population Equity Notes section.

Exposure proxies (e.g., county-level PM2.5 annual means from fixed monitors) may not capture individual-level biological exposure. Spatial interpolation introduces uncertainty, and temporal averaging masks peak exposure events.

Mechanism briefs are hypothesis-driven syntheses of existing evidence, not experimentally validated causal models. They represent the current state of knowledge and are subject to revision.

Community module models are trained on state and regional data with geographic and temporal limitations. The ecological fallacy applies: area-level associations do not imply individual-level effects.

Strength scores are composite indices for relative comparison within the atlas, not absolute effect sizes suitable for clinical or regulatory decision-making.

Literature curation reflects available publications and may be subject to selection bias, publication bias, and temporal gaps between primary research and atlas incorporation.