How to map NIH grant schemas to internal databases

On this page

Problem statement
Prerequisites
Step-by-step implementation
Step 1 — Define the validation model as the policy boundary
Step 2 — Declare the internal table and the audit fingerprint
Step 3 — Ingest with an idempotent upsert
Schema and field reference
Verification
Troubleshooting
Frequently asked questions
Related

Problem statement

You need a deterministic Python pipeline that takes a raw National Institutes of Health award record, aligns its fields to your institution’s grant table, enforces federal cost rules at validation time, and writes the row idempotently with a verifiable audit hash — so that re-running an NIH RePORTER sync never duplicates a grant or breaks a federal data-integrity audit.

This task sits under Grant Lifecycle Architecture Design, part of the broader Core Architecture & Policy Mapping for Research Grants practice. The mapping layer is intentionally narrow: it normalizes federal payloads, enforces policy at the schema boundary, and persists the result through an idempotent upsert. It does not adjudicate compliance state — it captures, fingerprints, and surfaces anomalies for the compliance officer, consistent with the separation of concerns the parent architecture establishes.

Prerequisites

Before deploying the mapper, confirm the following environment and policy configuration:

- Python 3.10+ (the code uses modern type hints and datetime.timezone.utc).
- Libraries: pydantic>=2.5 for strict schema validation and SQLAlchemy>=2.0 with psycopg[binary] for transactional PostgreSQL upserts. Install with pip install "pydantic>=2.5" "SQLAlchemy>=2.0" "psycopg[binary]".
- Environment variables (never hard-code credentials, per Security Boundary Configuration):
  - GRANT_DB_URL: the SQLAlchemy connection string, e.g. postgresql+psycopg://svc_grants:***@db.internal:5432/research.
- Policy config: a version-controlled institutional F&A (facilities & administrative) rate agreement and the field-mapping table below, kept alongside your University Policy Mapping Frameworks. Acquisition of the raw record itself is handled upstream by API Polling & Portal Integration; this page assumes a JSON payload is already in hand.
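
A minimal sketch of reading the connection string at startup, so a missing variable fails with a clear message instead of a late connection error. The helper name load_grant_db_url and the optional env parameter are illustrative assumptions, not part of the mapper.

```python
import os
from typing import Mapping, Optional


def load_grant_db_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return GRANT_DB_URL, or raise if it is unset or blank.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise in tests without touching the real environment.
    """
    source = os.environ if env is None else env
    url = source.get("GRANT_DB_URL", "").strip()
    if not url:
        raise RuntimeError("GRANT_DB_URL is not set; refusing to start the mapper.")
    return url
```

The returned string can be passed straight to ingest_nih_grant as db_url.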

Step-by-step implementation

The flow below is enforced by the mapper: a raw NIH payload is validated against the institutional schema (which embeds the 2 CFR 200 cost rules), fingerprinted for audit, and written through an idempotent upsert keyed on nih_project_id. The unique key is what makes a re-run safe.

Figure: federal payload fields are normalized into the internal table; the unique nih_project_id drives idempotent upserts.

Step 1 — Define the validation model as the policy boundary

The Pydantic model is where federal cost policy is enforced. Field aliases map NIH RePORTER’s verbose key names onto compact internal attributes, and the validators reject any record that violates the 2 CFR 200 indirect-cost cap or whose direct + indirect total does not reconcile. A malformed record is never allowed to reach the ORM layer.

python

import hashlib
import logging
from datetime import datetime, timezone
from typing import Optional, Dict, Any

from pydantic import BaseModel, Field, field_validator, ValidationError, ConfigDict

# ---------------------------------------------------------------------------
# POLICY BOUNDARY: validation rules enforce federal compliance thresholds.
# ---------------------------------------------------------------------------
class NIHGrantPayload(BaseModel):
    # populate_by_name=True lets callers pass either the field name or its alias.
    model_config = ConfigDict(populate_by_name=True)

    project_id: str = Field(..., alias="project_id", min_length=6, max_length=50)
    pi_name: str = Field(..., alias="principal_investigator", min_length=3, max_length=150)
    title: str = Field(..., alias="project_title", min_length=5, max_length=500)
    mechanism: str = Field(..., alias="mechanism_code", pattern=r"^[A-Z]\d{2}$")
    direct_cost: float = Field(..., ge=0)
    indirect_cost: float = Field(..., ge=0)
    # total_cost is declared after its components on purpose: Pydantic validates
    # fields in declaration order, and info.data only contains fields that have
    # already passed validation.
    total_cost: float = Field(..., ge=0)
    start_date: str = Field(..., alias="period_start", pattern=r"^\d{4}-\d{2}-\d{2}$")
    end_date: str = Field(..., alias="period_end", pattern=r"^\d{4}-\d{2}-\d{2}$")
    raw_payload: Optional[Dict[str, Any]] = None

    @field_validator("total_cost")
    @classmethod
    def validate_cost_policy(cls, v: float, info) -> float:
        direct = info.data.get("direct_cost")
        indirect = info.data.get("indirect_cost")
        if direct is None or indirect is None:
            # A component is missing or failed its own check; that error is
            # already reported, so skip the cross-field rules.
            return v
        if abs(v - (direct + indirect)) > 0.01:
            raise ValueError("Total cost mismatch: direct + indirect != total.")
        # Policy: indirect costs must not exceed 60% of total cost, the
        # institutional cap this page applies under 2 CFR 200.414.
        if indirect > v * 0.60:
            raise ValueError("Indirect cost exceeds 2 CFR 200 institutional cap.")
        return v
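
Cross-field rules such as the cost balance can also be expressed as a model-level validator, which runs once every field has been parsed and therefore does not depend on field declaration order. The following is a minimal sketch assuming pydantic v2; CostSketch and try_validate are hypothetical names, the model is a trimmed stand-in for NIHGrantPayload, and the sample figures are made up.

```python
from typing import List

from pydantic import BaseModel, ConfigDict, Field, ValidationError, model_validator


class CostSketch(BaseModel):
    """Hypothetical, trimmed stand-in for NIHGrantPayload (cost fields only)."""

    model_config = ConfigDict(populate_by_name=True)

    project_id: str = Field(..., min_length=6, max_length=50)
    mechanism: str = Field(..., alias="mechanism_code", pattern=r"^[A-Z]\d{2}$")
    total_cost: float = Field(..., ge=0)
    direct_cost: float = Field(..., ge=0)
    indirect_cost: float = Field(..., ge=0)

    @model_validator(mode="after")
    def check_costs(self) -> "CostSketch":
        # Runs only after every field above has passed its own constraints.
        if abs(self.total_cost - (self.direct_cost + self.indirect_cost)) > 0.01:
            raise ValueError("Total cost mismatch: direct + indirect != total.")
        if self.indirect_cost > self.total_cost * 0.60:
            raise ValueError("Indirect cost exceeds the configured 60% cap.")
        return self


def try_validate(payload: dict) -> List[str]:
    """Return validation messages; an empty list means the payload passed."""
    try:
        CostSketch.model_validate(payload)
    except ValidationError as exc:
        return [err["msg"] for err in exc.errors()]
    return []


# Made-up sample figures for illustration only.
good = {"project_id": "R01CA123456", "mechanism_code": "R01",
        "total_cost": 500000.0, "direct_cost": 350000.0, "indirect_cost": 150000.0}
over_cap = {**good, "direct_cost": 150000.0, "indirect_cost": 350000.0}
unbalanced = {**good, "total_cost": 510000.0}
```

Calling try_validate on each sample gives an empty list for good and one message each for over_cap and unbalanced, which is the shape of output a quarantine queue would record.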

Step 2 — Declare the internal table and the audit fingerprint

The ORM model is the persistence target. The nih_project_id column carries a unique constraint — the single guarantee that powers idempotency — and audit_hash stores a deterministic SHA-256 digest computed after normalization but before insertion, so the recorded fingerprint always matches what landed on disk.

python

from sqlalchemy import create_engine, Column, String, Float, DateTime, Integer
from sqlalchemy.orm import declarative_base, Session
from sqlalchemy.dialects.postgresql import insert as pg_insert

# ---------------------------------------------------------------------------
# IMPLEMENTATION BOUNDARY: database schema & idempotent ingestion logic.
# ---------------------------------------------------------------------------
Base = declarative_base()

class InternalGrantRecord(Base):
    __tablename__ = "grant_records"
    id = Column(Integer, primary_key=True, autoincrement=True)
    nih_project_id = Column(String(50), unique=True, nullable=False)  # idempotency key
    principal_investigator = Column(String(150), nullable=False)
    project_title = Column(String(500), nullable=False)
    mechanism_code = Column(String(20), nullable=False)
    total_cost = Column(Float, nullable=False)
    direct_cost = Column(Float, nullable=False)
    indirect_cost = Column(Float, nullable=False)
    period_start = Column(String(10), nullable=False)
    period_end = Column(String(10), nullable=False)
    compliance_status = Column(String(20), default="PENDING_REVIEW")
    audit_hash = Column(String(64), nullable=False)
    # timezone=True keeps the UTC offset of the timezone-aware default below.
    mapped_at = Column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc))

def generate_audit_hash(payload: Dict[str, Any]) -> str:
    """Deterministic SHA-256 hash for compliance audit trails."""
    canonical = "".join(f"{k}:{v}" for k, v in sorted(payload.items()) if v is not None)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
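
The function above joins key:value pairs with no delimiter and relies on str() for every value. One possible alternative, sketched here under the assumption that a canonical JSON encoding is acceptable for your audit trail, serializes the record with sorted keys so that nested dictionaries are ordered too. The name canonical_audit_hash and the sample records are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def canonical_audit_hash(payload: Dict[str, Any]) -> str:
    """Hypothetical variant: SHA-256 over a canonical JSON encoding."""
    # Drop None values, matching the behaviour of the original helper.
    cleaned = {k: v for k, v in payload.items() if v is not None}
    # sort_keys orders nested dicts as well; default=str covers dates and similar.
    canonical = json.dumps(cleaned, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up records: same content in a different key order, then a changed amount.
record_a = {"project_id": "R01CA123456", "total_cost": 500000.0, "raw_payload": None}
record_b = {"raw_payload": None, "total_cost": 500000.0, "project_id": "R01CA123456"}
record_c = {**record_a, "total_cost": 500000.01}

same_digest = canonical_audit_hash(record_a) == canonical_audit_hash(record_b)
changed_digest = canonical_audit_hash(record_a) != canonical_audit_hash(record_c)
```

Switching canonicalization changes every digest, so any such change would need to be versioned alongside previously stored audit_hash values.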

Step 3 — Ingest with an idempotent upsert

The driver validates, fingerprints, and writes. The PostgreSQL ON CONFLICT DO UPDATE clause keyed on nih_project_id means a second run with the same record updates only the mutable financial fields and flips compliance_status to UPDATED — it never inserts a duplicate row or orphans a foreign-key reference. Validation failures are logged and re-raised so the caller can route the payload to a quarantine queue rather than silently dropping it.

python

def ingest_nih_grant(db_url: str, raw_json: Dict[str, Any]) -> str:
    """Idempotent ingestion with conflict resolution and audit logging.

    raw_json keys may use the field aliases defined on NIHGrantPayload
    (e.g. 'principal_investigator', 'project_title', 'mechanism_code',
    'period_start', 'period_end') or the model field names, because
    populate_by_name=True is set.
    """
    engine = create_engine(db_url, pool_pre_ping=True)
    Base.metadata.create_all(engine)

    try:
        validated = NIHGrantPayload.model_validate(raw_json)
    except ValidationError as e:
        logging.error(f"Schema validation failed: {e}")
        raise  # caller routes the rejected payload to the quarantine queue

    audit_hash = generate_audit_hash(validated.model_dump())
    record = {
        "nih_project_id": validated.project_id,
        "principal_investigator": validated.pi_name,
        "project_title": validated.title,
        "mechanism_code": validated.mechanism,
        "total_cost": validated.total_cost,
        "direct_cost": validated.direct_cost,
        "indirect_cost": validated.indirect_cost,
        "period_start": validated.start_date,
        "period_end": validated.end_date,
        "audit_hash": audit_hash,
        "compliance_status": "PENDING_REVIEW",
        "mapped_at": datetime.now(timezone.utc),
    }

    with Session(engine) as session:
        stmt = pg_insert(InternalGrantRecord).values(record)
        # PostgreSQL idempotent upsert: ON CONFLICT requires the dialect insert.
        upsert = stmt.on_conflict_do_update(
            index_elements=["nih_project_id"],
            set_={
                "total_cost": stmt.excluded.total_cost,
                "direct_cost": stmt.excluded.direct_cost,
                "indirect_cost": stmt.excluded.indirect_cost,
                "compliance_status": "UPDATED",
                "audit_hash": audit_hash,
                "mapped_at": datetime.now(timezone.utc),
            },
        )
        session.execute(upsert)
        session.commit()
        logging.info(f"Ingested/updated grant {validated.project_id}")
        return audit_hash

This isolates validation, transformation, and persistence into discrete, testable units. Records arriving from RePORTER should be normalized to match the NIHGrantPayload alias mapping before ingestion; the same validated record can then feed the institutional rule engine described in Setting Up Automated Policy Compliance Checks for University Grants.
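
Because validation failures are re-raised, the caller decides where a rejected payload goes. A minimal sketch of such a caller follows; ingest_or_quarantine, the in-memory list used as a queue, and fake_ingest are all hypothetical stand-ins. Catching ValueError is assumed to cover pydantic v2's ValidationError, which derives from it; production code would more likely catch the specific class.

```python
import logging
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger("grant_mapper")


def ingest_or_quarantine(
    ingest: Callable[[Dict[str, Any]], str],
    payload: Dict[str, Any],
    quarantine: List[Dict[str, Any]],
) -> Optional[str]:
    """Run ingest(payload); on a validation failure, park the payload for review."""
    try:
        return ingest(payload)
    except ValueError as exc:  # assumed to include pydantic's ValidationError
        logger.warning("Quarantined payload: %s", exc)
        quarantine.append({"payload": payload, "reason": str(exc)})
        return None


def fake_ingest(payload: Dict[str, Any]) -> str:
    """Stand-in for ingest_nih_grant so the sketch runs without a database."""
    if payload.get("indirect_cost") is None:
        raise ValueError("indirect_cost is missing")
    return "stub-audit-hash"


queue: List[Dict[str, Any]] = []
accepted = ingest_or_quarantine(fake_ingest, {"indirect_cost": 150000.0}, queue)
rejected = ingest_or_quarantine(fake_ingest, {"indirect_cost": None}, queue)
```

In a real deployment the ingest argument could be a functools.partial of ingest_nih_grant with db_url bound, and the list would be replaced by whatever durable queue the institution uses.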

Schema and field reference

The mapper aligns these NIH source fields onto the internal table. Widen the set in your version-controlled policy config rather than in code.

Internal field	Type	Constraint	NIH source / rule
`nih_project_id`	string	Unique, 6–50 chars; idempotency key	RePORTER `project_num` (award identifier)
`principal_investigator`	string	3–150 chars	RePORTER `principal_investigators[].full_name`
`mechanism_code`	string	Matches `^[A-Z]\d{2}$` (e.g. R01, K99)	NIH activity code (Grants Policy Statement)
`total_cost`	float	≥ 0; must equal direct + indirect	2 CFR 200 cost-principle reporting
`indirect_cost`	float	≥ 0; ≤ 60% of total	2 CFR 200.414 F&A cap (institution rate agreement)
`period_start` / `period_end`	string	ISO-8601 date `YYYY-MM-DD`	RePORTER budget/project period
`audit_hash`	string	64-char SHA-256 hex	Non-repudiation audit trail (NIH data integrity)
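
The table can be turned into a small, data-driven normalizer that renames source keys to the aliases NIHGrantPayload expects. This is a sketch only: project_num and principal_investigators[].full_name come from the table above, while every other source key in FIELD_MAP, the choice of the first listed investigator, and the date truncation are assumptions to check against the payloads you actually receive.

```python
from typing import Any, Dict

# Model alias -> key in the source record. Only project_num is taken from the
# field reference; the remaining source keys are assumed names.
FIELD_MAP: Dict[str, str] = {
    "project_id": "project_num",
    "project_title": "project_title",
    "mechanism_code": "activity_code",
    "total_cost": "award_amount",
    "direct_cost": "direct_cost_amt",
    "indirect_cost": "indirect_cost_amt",
    "period_start": "project_start_date",
    "period_end": "project_end_date",
}


def normalize_record(source: Dict[str, Any]) -> Dict[str, Any]:
    """Rename source keys to model aliases; missing values stay None."""
    flat: Dict[str, Any] = {alias: source.get(key) for alias, key in FIELD_MAP.items()}

    # Assumption: the first listed investigator is the one to record.
    investigators = source.get("principal_investigators") or []
    flat["principal_investigator"] = (
        investigators[0].get("full_name") if investigators else None
    )

    # Assumption: dates may carry a time suffix; keep the YYYY-MM-DD prefix.
    for key in ("period_start", "period_end"):
        if isinstance(flat[key], str):
            flat[key] = flat[key][:10]
    return flat


# Made-up sample record for illustration.
sample = {
    "project_num": "5R01CA123456-03",
    "project_title": "Example study title",
    "activity_code": "R01",
    "award_amount": 500000,
    "direct_cost_amt": 350000,
    "indirect_cost_amt": 150000,
    "project_start_date": "2023-07-01T00:00:00Z",
    "project_end_date": "2028-06-30T00:00:00Z",
    "principal_investigators": [{"full_name": "Jane Q. Example"}],
}
normalized = normalize_record(sample)
```

Keeping FIELD_MAP as data means it can live in the version-controlled policy config and be widened without touching the function.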

Verification

Confirm a run behaved correctly before trusting its output:

- Validate in isolation: run NIHGrantPayload.model_validate(payload) in a REPL and capture any ValidationError to confirm that the field aliases and the cost-balance rule resolve against your sample record.
- Reproduce the hash: re-run generate_audit_hash on the validated model_dump() and compare it with the stored value from SELECT audit_hash FROM grant_records WHERE nih_project_id = :pid. Equal hashes show that the stored fingerprint was computed from that payload.
- Dry-run idempotency: execute ingest_nih_grant twice with the identical payload. Exactly one row must exist for that nih_project_id, and compliance_status must read UPDATED after the second pass.
- Cost reconciliation: confirm total_cost = direct_cost + indirect_cost for the persisted row and that indirect_cost / total_cost does not exceed your negotiated F&A rate.
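
The idempotency check can be rehearsed without a PostgreSQL server. The sketch below swaps in SQLite (an assumption made purely so the example is self-contained; SQLite 3.24+ accepts the same ON CONFLICT ... DO UPDATE form) and uses a cut-down grant_records table. The function name dry_run_idempotency is hypothetical.

```python
import sqlite3
from typing import Tuple

UPSERT_SQL = """
INSERT INTO grant_records (nih_project_id, total_cost, compliance_status)
VALUES (:pid, :total, 'PENDING_REVIEW')
ON CONFLICT(nih_project_id) DO UPDATE SET
    total_cost = excluded.total_cost,
    compliance_status = 'UPDATED'
"""


def dry_run_idempotency() -> Tuple[int, str]:
    """Upsert the same record twice; return (row count, final status)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE grant_records ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " nih_project_id TEXT NOT NULL UNIQUE,"
        " total_cost REAL NOT NULL,"
        " compliance_status TEXT NOT NULL)"
    )
    row = {"pid": "R01CA123456", "total": 500000.0}  # made-up sample values
    conn.execute(UPSERT_SQL, row)  # first pass inserts
    conn.execute(UPSERT_SQL, row)  # second pass hits the unique key and updates
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM grant_records").fetchone()[0]
    status = conn.execute(
        "SELECT compliance_status FROM grant_records WHERE nih_project_id = :pid",
        {"pid": row["pid"]},
    ).fetchone()[0]
    conn.close()
    return count, status


row_count, final_status = dry_run_idempotency()
```

The same two assertions (one row, status UPDATED) apply unchanged when the check is run against the real table through ingest_nih_grant.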

Troubleshooting

Three failure modes specific to mapping NIH schemas:

- Null indirect-cost field fails ingestion. RePORTER sometimes returns a nested budget object without flattened totals, so indirect_cost arrives as None. The model rejects it, and if validation is bypassed the None reaches the NOT NULL column and the upsert fails. Flatten and coerce budget fields before model_validate; never let None past the schema boundary. If the indirect figure is legitimately absent, route the record to the quarantine queue for a compliance officer rather than defaulting it to zero.
- Duplicate nih_project_id raises IntegrityError. The idempotency layer was bypassed, usually because a plain Session.add slipped in instead of the dialect pg_insert(...).on_conflict_do_update(...). Confirm the upsert path is active. Separately, keep pool_pre_ping=True so a stale pooled connection is detected and replaced at checkout instead of failing on first use.
- Audit hash differs on re-run. The payload mutated between validation and commit, or a volatile field (for example a re-indexed timestamp) entered the canonical string. Compute the hash from the normalized model_dump() immediately after validation. If RePORTER is degraded, divert acquisition through your Fallback Routing Protocols until the schema reconciles. Bulk reconciliation belongs on the Schema Validation Pipelines path, not a single synchronous map.
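
For the first failure mode, a flattening step ahead of model_validate might look like the sketch below. The nested key name budget, the helper names, and the sample record are hypothetical; the point is that an absent figure stays None so the record can be quarantined rather than silently zeroed.

```python
from typing import Any, Dict, List, Optional

COST_FIELDS = ("total_cost", "direct_cost", "indirect_cost")


def coerce_money(value: Any) -> Optional[float]:
    """Return a float for numeric input, or None when absent or unparseable."""
    if value is None or isinstance(value, bool):
        return None
    try:
        return float(str(value).replace(",", ""))
    except ValueError:
        return None


def flatten_budget(record: Dict[str, Any]) -> Dict[str, Any]:
    """Lift cost fields out of a nested 'budget' object (assumed key name)."""
    budget = record.get("budget") or {}
    flat = {k: v for k, v in record.items() if k != "budget"}
    for field in COST_FIELDS:
        value = flat.get(field)
        if value is None:
            value = budget.get(field)  # fall back to the nested figure
        flat[field] = coerce_money(value)
    return flat


def missing_costs(flat: Dict[str, Any]) -> List[str]:
    """List cost fields that are still None and should trigger quarantine."""
    return [field for field in COST_FIELDS if flat.get(field) is None]


# Made-up record with a nested budget and no indirect figure.
nested = {
    "project_id": "R01CA123456",
    "budget": {"total_cost": "500,000", "direct_cost": 350000, "indirect_cost": None},
}
flattened = flatten_budget(nested)
gaps = missing_costs(flattened)
```

A non-empty result from missing_costs is the signal to route the record to the quarantine queue instead of calling model_validate.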

Frequently asked questions

Why enforce the 2 CFR 200 indirect-cost cap inside the Pydantic model instead of in the database?

Putting the cap in the validation model makes policy a first-class, version-controlled artifact that fails fast with a human-readable error before any write occurs. A database CHECK constraint would reject the row too, but only after the round trip, and with an opaque error that is harder to route to a compliance officer. Validation at the schema boundary keeps policy enforcement decoupled from persistence.

Is it safe to re-run the mapper over the same NIH payload?

Yes. The ON CONFLICT DO UPDATE clause keyed on the unique nih_project_id means a repeat run updates only the mutable financial fields and sets compliance_status to UPDATED. No duplicate row is created and referential integrity is preserved, so an accidental double sync is harmless.

Should the mapper decide whether a grant is compliant?

No. The mapper validates structure and federal cost rules, then writes the record with compliance_status = PENDING_REVIEW. Adjudicating compliance is a human review step; the pipeline surfaces violations (cap breach, cost mismatch, missing fields) for an officer rather than silently approving or correcting them.
