Schema Validation Pipelines

A research record only becomes trustworthy at the moment it is admitted. Before a grant allocation, an equipment manifest, or a chemical inventory row is committed to a production store, something must prove that it conforms to the structure federal sponsors and institutional policy require, and prove it the same way every time the record is seen. That is the job of the schema validation layer: a deterministic gate that intercepts raw payloads from departmental spreadsheets, sponsor portals, and laboratory systems, routes anything non-conforming to a quarantine queue, and admits only records that satisfy a versioned schema. This guide covers that gate. It is one of the ingestion layers anchored to the parent guide on Automated Ingestion & Data Sync Workflows, it inherits the policy and idempotency contracts established in the Grant Lifecycle Architecture Design, and it shares its canonical field definitions with CSV and Excel Batch Parsing and API Polling & Portal Integration, so data quality is uniform regardless of how a record enters the platform.

University administrators, research compliance officers, Python automation developers, and laboratory managers rely on this subsystem to standardize validation at the point of entry. By enforcing structural integrity before commit, the layer eliminates downstream reconciliation work, guarantees that every grant and equipment record adheres to institutional compliance frameworks, and establishes a defensible foundation for federal reporting mandates.

Problem framing

Validation looks trivial until institutional reality accumulates. A department submits legacy internal grant codes where a sponsor expects an IC-YYYY-XXXXXX award identifier; a new NSF budget category appears in an export months before the schema knows about it; a credit memo arrives as a negative amount with no normalization; a schema file is hand-edited and silently mixes Draft 7 and Draft 2020-12 keywords. A naive “accept and reconcile later” approach lets all of these into production, where they corrupt indirect-cost reconciliation and break the federal chain of custody.

The job of this layer is to make validation a policy enforcement step, not a data-correction step. Three contracts, implemented in the rest of this page, hold the line:

Determinism. The same input yields the same valid/invalid partition on every run. Records are processed in hash-sorted order so execution paths never depend on arrival order.
Idempotency. Each record is fingerprinted with SHA-256; an already-processed fingerprint is skipped, so re-running the same batch produces no duplicate audit entries and no duplicate writes.
Quarantine over correction. A non-conforming record is routed to a dead-letter queue with a structured error manifest — path, message, validator — and never silently coerced or dropped.
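As a minimal sketch of the first two contracts, the snippet below fingerprints records with SHA-256 over a canonical JSON form and returns the digests still to be processed in sorted order. The helper names and the sample rows are hypothetical and exist only for illustration; the engine later on this page is the reference for how the guide applies the idea.

```python
import hashlib
import json


def content_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON form: sorted keys, compact separators."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def pending_in_hash_order(records: list, processed: set) -> list:
    """Digests not yet processed, sorted so arrival order is irrelevant."""
    return sorted({content_hash(r) for r in records} - processed)


# Hypothetical sample rows, for illustration only.
row_a = {"award_id": "AB-2024-000001", "amount": 1200.0}
row_b = {"amount": 300.0, "award_id": "CD-2024-000002"}

first_arrival = pending_in_hash_order([row_a, row_b], processed=set())
second_arrival = pending_in_hash_order([row_b, row_a], processed=set())
# Re-running with the first run's digests loaded leaves nothing to do.
rerun = pending_in_hash_order([row_a, row_b], processed=set(first_arrival))
```

Both arrival orders produce the same list, and the re-run list is empty, which is the behavior the determinism and idempotency contracts describe.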

Policy constraints

Compliance is the architectural constraint that bounds what this layer may admit, not a post-hoc check. The regulatory matrix codified in the University Policy Mapping Frameworks governs every validated record. Treat schema validation as a policy enforcement layer: it rejects non-conforming payloads and routes them to quarantine with a structured manifest, so institutional audit trails stay immutable and downstream reporting never processes unverified records.

Regulatory standard	Validation requirement	Enforcement mechanism
NIH Grants Policy	Award identifiers must conform to `IC-YYYY-XXXXXX` format; F&A rates capped at the negotiated institutional rate	Regex pattern matching + numeric threshold validation
NSF PAPPG / Proposal Guidelines	PI ORCID/NSF ID required; budget categories restricted to approved NSF line items	Cross-reference lookup + enum constraint validation
2 CFR 200 (Uniform Guidance)	Auditable indirect-cost and cost-share tracking; internal controls before commit	Field-level numeric validation; append-only validation ledger
OSHA Hazard Communication (29 CFR 1910.1200)	Chemical inventories must carry valid CAS numbers, GHS hazard codes, storage-compatibility flags	Format validation + lookup against OSHA-compliant chemical registries
EPA Facility Reporting (TSCA/EPCRA)	Threshold quantities must trigger mandatory reporting flags; disposal codes mapped to EPA categories	Conditional logic + range validation against EPA reporting thresholds

Operational boundary. Policy dictates what must be present, which formats are legal, and how long records are retained; implementation handles the mechanical schema evaluation and routing. The validator must never silently coerce a value that violates a regulatory schema — it enforces strict typing, mandatory-field presence, enum restriction, and cross-field dependency checks, then quarantines anything that fails. Credential scoping and network isolation for the validation workers are governed by the Security Boundary Configuration. No validation run may be bypassed to meet a reporting deadline; schema drift is escalated through formal change control instead.

Data schema & field mapping

A schema is a versioned policy artifact, not a convenience. Source payloads arrive with legacy aliases, mixed numeric formats, and sponsor-specific identifiers; before any record is admitted, those fields are mapped to a single canonical schema whose constraints encode the regulatory rules above. The mapping and the schema are both version-controlled, so a sponsor adding a budget category becomes a reviewable diff and a deliberate schema version bump rather than a silent ingestion break.

Canonical field	Type	Constraint	Source rule
`award_id`	`str`	required, `^[A-Z]{2}-\d{4}-\d{6}$`	NIH/NSF award identifier
`pi_orcid`	`str`	required, `^\d{4}-\d{4}-\d{4}-\d{3}[0-9X]$`	NSF PAPPG PI attribution
`budget_category`	`enum`	`{personnel, equipment, travel, materials}`	NSF approved line items
`amount`	`number`	required, `>= 0`	2 CFR 200 cost principles
`fiscal_year`	`int`	required, `2000–2100`	NIH Grants Policy
`cas_number`	`str | None`	required for chemical rows	OSHA 29 CFR 1910.1200
`content_hash`	`str`	system-generated, SHA-256	idempotency control
`schema_version`	`str`	system-stamped, semver	schema change control

The content_hash and schema_version are the only system-owned fields; everything else maps from the source payload. Stamping the resolved schema_version onto every admitted record is what lets an auditor later prove which version of policy a record was validated against — essential when a category was added mid-year.
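The mapping step can be sketched as a pure rename from source headers to canonical names. The alias table, the source headers, and the function name below are assumptions made for illustration, not the project's actual mapping; values are left untouched so that type and format problems still surface in validation.

```python
# Hypothetical alias map: source column header -> canonical field name.
FIELD_ALIASES = {
    "Award #": "award_id",
    "grant_no": "award_id",
    "PI ORCID": "pi_orcid",
    "Category": "budget_category",
    "Amount (USD)": "amount",
    "FY": "fiscal_year",
}
CANONICAL_FIELDS = set(FIELD_ALIASES.values())


def map_to_canonical(source_row: dict) -> tuple:
    """Rename known aliases to canonical names without touching values.

    Returns (mapped_row, unmapped_keys). Unknown keys are reported rather
    than dropped, so the caller can quarantine the row or extend the map.
    """
    mapped: dict = {}
    unmapped: list = []
    for key, value in source_row.items():
        target = key if key in CANONICAL_FIELDS else FIELD_ALIASES.get(key)
        if target is None:
            unmapped.append(key)
        elif target in mapped:
            # Two source columns claim the same canonical field: refuse to guess.
            raise ValueError(f"two source columns map to {target!r}")
        else:
            mapped[target] = value
    return mapped, sorted(unmapped)


row, unknown = map_to_canonical(
    {"Award #": "AB-2024-000123", "Category": "travel", "Amount (USD)": 5000, "Dept": "CHEM"}
)
```

Keeping the alias table in version control alongside the schema means a new source header shows up as a reviewable diff rather than a silent change in what gets admitted.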

Implementation

The validation layer has three composable parts: a pre-flight that proves the schema itself is structurally valid before any data flows, a deterministic and idempotent validation engine that partitions a batch into valid records and error manifests, and an upsert path that commits admitted records by a stable key while routing rejects to quarantine.

Figure: deterministic ordering plus a processed-hash set yield reproducible valid/invalid partitions on every run.

Idempotent validation engine

Production validation must be strictly idempotent: repeated execution against the same input yields identical output without side effects, duplicate processing, or state mutation. The engine below uses jsonschema for policy evaluation and SHA-256 for deduplication, and it reports every violation in a record rather than only the first — compliance officers need the complete error manifest, not a single line.

python

import hashlib
import json
import logging
from jsonschema import SchemaError, Draft7Validator

# Configure deterministic, non-mutating logger
logging.basicConfig(level=logging.INFO, format="%(asctime)s [VALIDATOR] %(message)s")
logger = logging.getLogger(__name__)


class IdempotentSchemaValidator:
    """Deterministic, idempotent JSON Schema validator for compliance pipelines.

    Guarantees identical output across repeated runs and prevents duplicate
    processing of records that have already been admitted or quarantined.
    """

    def __init__(self, schema: dict, schema_version: str) -> None:
        # Pre-flight: prove the schema itself is structurally valid before it
        # is allowed to gate any data. check_schema() raises SchemaError for a
        # malformed schema without needing a sample instance.
        Draft7Validator.check_schema(schema)
        self.schema = schema
        self.schema_version = schema_version
        self._validator = Draft7Validator(schema)

    @staticmethod
    def _content_hash(record: dict) -> str:
        """Deterministic SHA-256 over canonicalized content for dedup + provenance."""
        canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def validate_batch(
        self,
        records: list[dict],
        processed_hashes: set[str],
    ) -> tuple[list[dict], list[dict]]:
        """Validate a batch and return (valid_records, error_manifests) in
        deterministic order. Already-processed records are skipped to keep the
        run idempotent."""
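        valid_records: list[dict] = []
        error_manifests: list[dict] = []
        seen_in_batch: set[str] = set()  # local only; the caller's set is not mutated

        # Hash each record once, then sort by digest so processing order (and
        # therefore output order) is independent of how records arrived.
        hashed = sorted(
            ((self._content_hash(r), r) for r in records),
            key=lambda pair: pair[0],
        )
        for digest, record in hashed:
            if digest in processed_hashes or digest in seen_in_batch:
                continue  # Idempotent skip: prior run or duplicate within this batch
            seen_in_batch.add(digest)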

            errors = list(self._validator.iter_errors(record))
            if not errors:
                valid_records.append({
                    "status": "valid",
                    "content_hash": digest,
                    "schema_version": self.schema_version,
                    "payload": record,
                })
            else:
                # Report ALL violations, not just the first.
                error_manifests.append({
                    "status": "invalid",
                    "content_hash": digest,
                    "schema_version": self.schema_version,
                    "errors": [
                        {
                            "error_path": list(e.absolute_path),
                            "error_message": e.message,
                            "validator": e.validator,
                        }
                        for e in errors
                    ],
                    "payload": record,
                })

        return valid_records, error_manifests


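# Abbreviated grant schema: it declares four of the canonical fields. Because
# additionalProperties is False, a payload carrying fiscal_year or cas_number
# is rejected until those properties are declared here.
GRANT_SCHEMA = {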
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["award_id", "pi_orcid", "budget_category", "amount"],
    "properties": {
        "award_id": {"type": "string", "pattern": r"^[A-Z]{2}-\d{4}-\d{6}$"},
        "pi_orcid": {"type": "string", "pattern": r"^\d{4}-\d{4}-\d{4}-\d{3}[0-9X]$"},
        "budget_category": {
            "type": "string",
            "enum": ["personnel", "equipment", "travel", "materials"],
        },
        "amount": {"type": "number", "minimum": 0},
    },
    "additionalProperties": False,
}

The engine's determinism and idempotency rest on five concrete mechanisms:

Schema pre-flight — Draft7Validator.check_schema() validates the schema structure at startup, catching a misconfigured schema before any record is processed.
All-errors reporting — iter_errors() collects every violation, giving compliance officers a complete manifest.
Deterministic ordering — records are sorted by content hash, so execution paths never depend on arrival order.
Stateless validation — the validator performs pure schema evaluation without mutating inputs or external state.
Hash deduplication — a processed_hashes set prevents re-validation of identical payloads, so re-runs add no duplicate audit entries.

Committing admitted records and routing rejects

Valid records are upserted to a staging store on a stable key, so a re-validated record updates in place rather than inserting a second row; invalid records are routed to the quarantine queue with their manifest intact. Heavy batches are decoupled from this commit path through Async Processing & Queue Management, so a slow store write never stalls validation.

python

from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session


def commit_validated(
    valid_records: list[dict],
    error_manifests: list[dict],
    session: Session,
    quarantine,  # callable: (manifest: dict) -> None
) -> dict[str, int]:
    """Upsert admitted records by award_id; route rejects to quarantine.
    Safe to execute repeatedly against the same batch without side effects."""
    stats = {"admitted": 0, "skipped": 0, "quarantined": 0}

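    # StagedGrantRecord and ValidationLedger are the project's SQLAlchemy ORM
    # models (defined elsewhere); award_id must carry a unique constraint.
    for item in valid_records:
        payload = item["payload"]
        system_fields = {
            "content_hash": item["content_hash"],
            "schema_version": item["schema_version"],
        }
        # award_id is already a key of payload, so it is not passed separately
        # (a repeated keyword argument would raise TypeError).
        stmt = insert(StagedGrantRecord).values(**payload, **system_fields)
        # Update only when the stored content differs: a byte-identical record
        # hits the conflict but fails the WHERE, so nothing is written.
        stmt = stmt.on_conflict_do_update(
            index_elements=["award_id"],
            set_={**payload, **system_fields},
            where=(StagedGrantRecord.content_hash != item["content_hash"]),
        )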
        result = session.execute(stmt)
        if result.rowcount:
            stats["admitted"] += 1
            session.add(ValidationLedger(
                award_id=payload["award_id"],
                content_hash=item["content_hash"],
                schema_version=item["schema_version"],
            ))
        else:
            stats["skipped"] += 1

    for manifest in error_manifests:
        stats["quarantined"] += 1
        quarantine(manifest)

    session.commit()
    return stats

The on_conflict_do_update clause guarded by a content_hash comparison is what makes the write idempotent at the database layer: a re-validated batch whose records are unchanged produces zero writes, a corrected record updates in place, and a genuinely new record inserts once. For schema artifacts, apply semantic versioning and lint every schema against its meta-schema in CI before deployment.
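One gap a meta-schema lint leaves open: the Draft 7 meta-schema does not forbid keywords it does not define, so a keyword borrowed from a later draft passes the check and is then ignored during validation. The sketch below is a standard-library complement to that lint, not a replacement for it. The function name and the sample schema are hypothetical; the keyword set is the Draft 7 vocabulary.

```python
# Keywords defined by JSON Schema Draft 7 (core, validation, annotation).
DRAFT7_KEYWORDS = frozenset({
    "$schema", "$id", "$ref", "$comment", "title", "description", "default",
    "readOnly", "writeOnly", "examples", "multipleOf", "maximum",
    "exclusiveMaximum", "minimum", "exclusiveMinimum", "maxLength", "minLength",
    "pattern", "additionalItems", "items", "maxItems", "minItems", "uniqueItems",
    "contains", "maxProperties", "minProperties", "required",
    "additionalProperties", "definitions", "properties", "patternProperties",
    "dependencies", "propertyNames", "const", "enum", "type", "format",
    "contentMediaType", "contentEncoding", "if", "then", "else",
    "allOf", "anyOf", "oneOf", "not",
})
# Keywords whose value may be one subschema, a list of subschemas, or a
# mapping of names to subschemas ("items" can take either of the first two).
SINGLE = ("additionalItems", "additionalProperties", "contains",
          "propertyNames", "if", "then", "else", "not", "items")
LISTS = ("allOf", "anyOf", "oneOf", "items")
MAPS = ("properties", "patternProperties", "definitions", "dependencies")


def unknown_keywords(schema, path: str = "#") -> list:
    """Return pointer-style paths of keywords Draft 7 does not define."""
    if not isinstance(schema, dict):  # booleans and string lists carry no keywords
        return []
    found = [f"{path}/{key}" for key in schema if key not in DRAFT7_KEYWORDS]
    for key in SINGLE:
        if isinstance(schema.get(key), dict):
            found += unknown_keywords(schema[key], f"{path}/{key}")
    for key in LISTS:
        if isinstance(schema.get(key), list):
            for i, sub in enumerate(schema[key]):
                found += unknown_keywords(sub, f"{path}/{key}/{i}")
    for key in MAPS:
        if isinstance(schema.get(key), dict):
            # Names under these keywords are property names, not keywords.
            for name, sub in schema[key].items():
                found += unknown_keywords(sub, f"{path}/{key}/{name}")
    return sorted(found)


# Hypothetical schema mixing in two Draft 2020-12 keywords.
mixed = {
    "type": "object",
    "properties": {"tags": {"type": "array", "prefixItems": [{"type": "string"}]}},
    "unevaluatedProperties": False,
}
report = unknown_keywords(mixed)
```

A CI job could fail the build whenever the report is non-empty, which turns an ignored constraint into a visible review item.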

Integration points

Validation workers never write directly to production ERP or LIMS tables; they admit records to a staging schema that adjacent systems read from by key, and they emit error manifests that the quarantine subsystem owns. Each integration has an explicit contract:

ERP / financials. The ERP consumes admitted StagedGrantRecord rows by award_id, applying indirect-cost and cost-share reconciliation. Because validation already enforced format and range, the ERP never sees a malformed award identifier.
LIMS / lab inventory. Chemical and equipment records that pass CAS and GHS checks are forwarded to the equipment and lab inventory tracking systems with hazard tags intact for OSHA reporting.
Grant portals. Records that originate as portal exports share this exact schema with API Polling & Portal Integration, so a record validates identically whether it was polled or parsed.

An example error manifest published to the quarantine queue for a rejected record:

json

{
  "status": "invalid",
  "content_hash": "9f2c…e1",
  "schema_version": "2.3.0",
  "errors": [
    {"error_path": ["award_id"], "error_message": "'INVALID-123' does not match '^[A-Z]{2}-\\d{4}-\\d{6}$'", "validator": "pattern"},
    {"error_path": ["budget_category"], "error_message": "'subawards' is not one of ['personnel', 'equipment', 'travel', 'materials']", "validator": "enum"}
  ],
  "payload": {"award_id": "INVALID-123", "pi_orcid": "0000-0002-1825-0097", "budget_category": "subawards", "amount": 5000}
}
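Because every manifest has the same shape, quarantine contents can be summarized mechanically for triage. The helper below is a hypothetical sketch that counts violations per failing keyword and per field, using only the manifest keys shown above; the sample manifest repeats the errors from that example.

```python
from collections import Counter


def summarize_manifests(manifests: list) -> dict:
    """Count quarantined records and violations per keyword and per field."""
    by_validator: Counter = Counter()
    by_field: Counter = Counter()
    for manifest in manifests:
        for error in manifest["errors"]:
            by_validator[error["validator"]] += 1
            # An empty path means the violation is on the record itself,
            # for example a missing required property.
            field = ".".join(str(p) for p in error["error_path"]) or "<record>"
            by_field[field] += 1
    return {
        "quarantined": len(manifests),
        "by_validator": dict(by_validator),
        "by_field": dict(by_field),
    }


sample = [{
    "status": "invalid",
    "errors": [
        {"error_path": ["award_id"], "error_message": "pattern mismatch", "validator": "pattern"},
        {"error_path": ["budget_category"], "error_message": "not in enum", "validator": "enum"},
    ],
}]
summary = summarize_manifests(sample)
```

A spike in one keyword, such as enum violations on budget_category, usually points at schema drift rather than at individual bad rows.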

Verification & audit

Every admitted record appends an entry to an append-only ValidationLedger (award id, content_hash, schema_version, timestamp, operator context). This ledger is the artifact compliance officers reconstruct audits from, and it lets any validation run be verified or reproduced.

To confirm a run was correct:

Count parity. admitted + skipped + quarantined, plus any records the engine skipped because their hash was already in processed_hashes, must equal the total records read. A gap means a record was silently dropped, which is a defect, not an accepted state.
Reproduce the partition. Re-run the validator against the same batch with the prior processed_hashes loaded; admitted must be 0 and every content_hash must match the ledger. A non-zero second pass means validation is non-deterministic.
Quarantine reconciliation. Every dead-letter entry must carry a structured manifest; the count of unresolved quarantine items is a reportable compliance metric.

python

from sqlalchemy import select
from sqlalchemy.orm import Session


def verify_run(session: Session, schema_version: str) -> dict[str, int]:
    rows = session.execute(
        select(ValidationLedger).where(
            ValidationLedger.schema_version == schema_version
        )
    ).scalars().all()
    return {"ledger_rows": len(rows), "distinct_awards": len({r.award_id for r in rows})}
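The count-parity check can be expressed as a small guard over the stats dictionary that commit_validated returns. This is a sketch under one assumption: records that the engine skips via processed_hashes never reach the commit step, so the caller counts them separately and passes them in as already_processed (a hypothetical parameter).

```python
def check_count_parity(total_read: int, stats: dict, already_processed: int = 0) -> None:
    """Raise if any record read from the batch is unaccounted for."""
    accounted = (
        already_processed
        + stats["admitted"]
        + stats["skipped"]
        + stats["quarantined"]
    )
    if accounted != total_read:
        raise RuntimeError(
            f"count parity failed: read {total_read}, accounted for {accounted}"
        )
```

For example, a batch of 10 records with 6 admitted, 1 skipped at the database, 2 quarantined, and 1 skipped by the engine passes; the same stats against 11 records read raises.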

Because the ledger is append-only and hash-addressed, an auditor can pin any federal report back to the exact record, schema version, and moment it was admitted. Retain all validation logs for a minimum of seven years, the retention baseline this guide assumes; it exceeds the three-year floor in 2 CFR 200.334, and sponsor or institutional terms may require longer. Apply cryptographic checksums to the logs so that any tampering is detectable.
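One way to make tampering detectable is a hash chain, in which each ledger entry's checksum also covers the checksum before it. This is an assumed technique for illustration, not necessarily how the platform's ledger is sealed; the function names and sample entries (including the shortened content_hash values) are hypothetical.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # placeholder checksum preceding the first entry


def chain_hash(prev_hash: str, entry: dict) -> str:
    """Checksum of one entry, bound to the checksum of its predecessor."""
    body = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def seal(entries: list) -> list:
    """Return the running checksum after each entry, in ledger order."""
    sealed, prev = [], GENESIS
    for entry in entries:
        prev = chain_hash(prev, entry)
        sealed.append(prev)
    return sealed


def first_tampered_index(entries: list, sealed: list) -> Optional[int]:
    """Index of the first entry that no longer matches, or None if intact."""
    prev = GENESIS
    for i, (entry, expected) in enumerate(zip(entries, sealed)):
        prev = chain_hash(prev, entry)
        if prev != expected:
            return i
    if len(entries) != len(sealed):
        return min(len(entries), len(sealed))  # entries were added or removed
    return None


# Hypothetical ledger entries, for illustration only.
ledger = [
    {"award_id": "AB-2024-000001", "content_hash": "aa11", "schema_version": "2.3.0"},
    {"award_id": "CD-2024-000002", "content_hash": "bb22", "schema_version": "2.3.0"},
    {"award_id": "EF-2024-000003", "content_hash": "cc33", "schema_version": "2.3.0"},
]
sealed = seal(ledger)
```

Editing or deleting any entry changes every checksum from that point on, so storing only the final checksum in a separate system is enough to detect a later alteration.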

Failure modes & recovery

When validation pipelines encounter structural drift or legacy format incompatibilities, operators isolate failures without halting broader ingestion. Every recovery procedure is idempotent-safe: re-running it cannot create duplicates.

Symptom	Root cause	Idempotent-safe recovery
`ValidationError: 'award_id' does not match pattern`	Departmental spreadsheets use legacy internal grant codes instead of NIH/NSF formats	Update the mapping layer with a backward-compatible alias resolution before validation; re-run — quarantined records re-validate and upsert by key
`ValidationError: 'budget_category' is not one of [...]`	New NSF categorical code (e.g. `subawards`, `participant_support`) not yet in the schema	Bump the schema version, deploy to the validation registry, notify compliance officers; re-validate the quarantined batch against the new version
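`ValidationError: 'amount' is less than the minimum`	Negative depreciation adjustments or credit memos submitted without normalization	Add an explicit pre-validation transform in the version-controlled mapping layer that stores the magnitude in `amount` and records the sign in a separate `adjustment_type` field, so the credit is preserved rather than silently coerced; re-validate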
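`SchemaError: Invalid schema structure`	Corrupted schema file, or a keyword whose value violates the Draft 7 meta-schema (often introduced when Draft 7 and Draft 2020-12 conventions are mixed)	Lint the schema against its meta-schema with `Draft7Validator.check_schema()` in CI before deployment, and also flag keywords Draft 7 does not define, since those are ignored rather than rejected; roll back to the last valid `schema_version`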

Role boundaries. Compliance officers own policy definitions and approve schema version promotions; they do not modify validation code. Python automation developers own pipeline idempotency, error routing, and performance; they do not alter regulatory thresholds. Laboratory managers own data accuracy at the source and correct quarantined payloads before resubmission. University administrators own uptime, audit retention, and cross-departmental SLA enforcement. When legacy systems generate unpredictable payloads, buffer them in quarantine with automated retry rather than bypassing validation; when a downstream commit target is unreachable for an extended window, routing follows the Fallback Routing Protocols.

Frequently asked questions

Why validate the schema itself before validating any data?

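A hand-edited schema can become structurally invalid, for example by giving a keyword a value of the wrong type or by mixing Draft 7 and Draft 2020-12 keywords, and a broken schema either rejects everything or, worse, silently admits records it should reject. Draft7Validator.check_schema() checks the schema against the Draft 7 meta-schema at startup, before a single record is processed, so a misconfiguration fails fast in CI instead of corrupting an audit trail in production. Keywords that Draft 7 does not define pass this check and are then ignored during validation, so the CI lint should also flag unknown keywords.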

How is validation kept deterministic and idempotent?

Records are sorted by their SHA-256 content hash before processing, so the valid/invalid partition is independent of arrival order. A processed_hashes set skips records that were already admitted or quarantined, and the database upsert is guarded by a content-hash comparison, so re-running the same batch adds no duplicate ledger entries and writes nothing for unchanged records.

What happens to a record that fails validation?

It is routed to the quarantine (dead-letter) queue with a structured error manifest — error_path, error_message, and validator for every violation in the record, not just the first. It never reaches the staging store and is never silently coerced. Once the schema or mapping is corrected, quarantined records re-validate and upsert by key with no manual de-duplication.

How do we handle a new sponsor budget category mid-year?

Do not edit production data to fit the old schema. Bump the schema version (semver), deploy it to the validation registry, and stamp the new schema_version onto admitted records. Re-validate the quarantined batch against the new version. Because every ledger entry records the version it was validated against, an auditor can always reconstruct which policy applied to which record.

Related

Parent guide: Automated Ingestion & Data Sync Workflows
CSV and Excel Batch Parsing — the parsing layer that feeds rows into these gates
API Polling & Portal Integration — the canonical schema for polled sponsor records
Async Processing & Queue Management — decoupling validation from store commits
University Policy Mapping Frameworks — the regulatory matrix these schemas encode
