CSV and Excel Batch Parsing

Most institutional data still arrives as a delimited file or a workbook: a department emails a legacy equipment inventory, a sponsor exports an award ledger to .xlsx, a finance office drops a multi-tab budget template into a shared folder. These files are authored for human review, not machine ingestion, and re-sending a corrected copy is routine — so the parsing layer’s whole job is to turn that messy, repeatedly-resubmitted stream into standardized, deduplicated, audit-traceable records. This guide addresses that specific gap, and it is one of the ingestion layers anchored to the parent guide on Automated Ingestion & Data Sync Workflows. It inherits the policy and idempotency contracts established in the Grant Lifecycle Architecture Design and applies the same canonical field definitions used by API Polling & Portal Integration, so data quality is uniform regardless of how a record enters the platform.

University administrators, research compliance officers, Python automation developers, and lab managers rely on this subsystem to process heterogeneous submissions — from legacy equipment manifests to multi-institutional funding spreadsheets — without ever duplicating a financial entry or mutating an audit log.

End-to-end batch parsing path: a file is streamed into chunked rows, mapped to the canonical schema, policy-validated, and SHA-256 fingerprinted before valid rows upsert into the staging ledger by idempotency key and rejects divert to the quarantine queue.

Problem framing

File parsing looks trivial until institutional reality accumulates. The same award spreadsheet is resubmitted three times with one corrected cell; a workbook carries merged header cells and a hidden metadata row; a department’s “Indirect Cost” column is sometimes a percentage and sometimes a decimal; a 400 MB inventory export exhausts memory before the first row is validated. A naive pd.read_csv(...).to_sql(...) against that stream produces duplicate award rows, corrupts indirect-cost reconciliation, and breaks the federal chain of custody.

The job of this layer is to make re-parsing safe. Three contracts, implemented in the rest of this page, hold the line:

Idempotency. Every row is fingerprinted with SHA-256 over its canonicalized content, and an idempotency key built from the key columns (award_id, institution_id, budget_category) drives a database upsert, so re-running the same file, or a corrected re-sent copy, updates in place instead of inserting a second row.
Policy-bounded validation. No row enters the staging store until it satisfies the compliance fields its sponsor mandates, enforced with the same rule set as the Schema Validation Pipelines.
Quarantine over failure. A malformed row is routed to a dead-letter queue with a structured rejection reason; it never crashes the batch and is never silently dropped.

Policy constraints

Compliance is the architectural constraint that bounds what this layer may accept and how long it must keep it — not a post-hoc check. The regulatory matrix codified in the University Policy Mapping Frameworks governs every parsed record.

Standard	Compliance requirement	System control
NIH Data Management & Sharing Policy	Machine-readable metadata and explicit provenance for every grant dataset	Mandatory `award_id` / `fiscal_year` presence; SHA-256 provenance per row
NSF PAPPG & Award Terms	Accurate financial categorization, PI attribution, cost-share reconciliation within 90 days	Enum-constrained budget categories; `indirect_cost_rate` capped at the negotiated rate
2 CFR 200 (Uniform Guidance)	Auditable indirect-cost and cost-share tracking; internal controls	Field-level numeric validation before commit; append-only ingestion ledger
OSHA 29 CFR 1904 / 1910.1200	Immutable safety/chemical inventory logging without retroactive alteration	Upsert-by-key (no destructive overwrite); GHS hazard tagging on asset rows
EPA TRI & RCRA	Precise CAS mapping and waste-stream quantification	CAS-format validation; routing of regulated SKUs to EHS dashboards
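
The CAS-format control in the last row can be illustrated with a short check. This is a minimal sketch, not the platform's validator: the name `is_valid_cas` is hypothetical, and the code assumes only the public CAS Registry Number layout (two to seven digits, two digits, one check digit) and its checksum rule.

```python
import re

_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(value: str) -> bool:
    """Return True when `value` has CAS layout and a correct check digit."""
    match = _CAS_PATTERN.match(value.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit
```

A row whose identifier fails this check would be quarantined rather than corrected, in line with the rule that the parser never silently coerces a regulated value.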

Operational boundary. Policy dictates what must be captured, how long it is retained, and which roles may read it; implementation handles the mechanical parsing, transformation, and routing. The parser must never silently coerce a value that violates a regulatory schema — it enforces strict typing, mandatory-field presence, and cross-column dependency checks, then quarantines anything that fails. Credential scoping and network isolation for the parsing workers are governed by the Security Boundary Configuration. No parse job may bypass these gates, regardless of a reporting deadline.

Data schema & field mapping

Raw files exhibit wide structural variability: legacy header aliases, localized date formats, merged cells, currency symbols in numeric columns. Before any row is committed, source column names are mapped to a single canonical schema. The mapping is version-controlled, so a department renaming a column becomes a reviewable diff rather than a silent ingestion break.
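
A version-controlled mapping of this kind might look like the sketch below. The source header names and the `COLUMN_MAPPING_V1` and `map_row` names are hypothetical, chosen only for illustration; the canonical field names are the ones used on this page.

```python
# Hypothetical alias table: source header (lowercased, trimmed) -> canonical field.
COLUMN_MAPPING_V1 = {
    "award number": "award_id",
    "award no.": "award_id",
    "institution uei": "institution_id",
    "fy": "fiscal_year",
    "category": "budget_category",
    "direct cost ($)": "direct_costs",
    "idc rate": "indirect_cost_rate",
    "start date": "effective_date",
}


def map_row(
    raw: dict[str, object], mapping: dict[str, str]
) -> tuple[dict[str, object], list[str]]:
    """Rename known headers to canonical fields; report headers with no mapping."""
    mapped: dict[str, object] = {}
    unmapped: list[str] = []
    for header, value in raw.items():
        canonical = mapping.get(header.strip().lower())
        if canonical is None:
            unmapped.append(header)
        else:
            mapped[canonical] = value
    return mapped, unmapped
```

Returning the unmapped headers, instead of discarding them, is one way to surface a renamed column as a visible signal that the mapping needs a reviewed update.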

The canonical GrantRecord, keyed on award_id with system-owned content_hash and idempotency_key, has a one-to-many relationship to QuarantineEntry, where each rejected row keeps its source_row, rejection_reason, and correlation_id.

Canonical field	Type	Constraint	Source rule
`award_id`	`str`	required, key column, `^[A-Z0-9-]{6,}$`	NIH/NSF award identifier
`institution_id`	`str`	required, key column	2 CFR 200 recipient id (UEI)
`fiscal_year`	`int`	required, `2000–2100`	NIH Grants Policy
`budget_category`	`enum`	`{personnel, equipment, travel, materials, subaward}`	NSF PAPPG line items
`direct_costs`	`Decimal`	required, `>= 0`, 2 dp	2 CFR 200 cost principles
`indirect_cost_rate`	`Decimal`	optional, `0 ≤ r ≤ negotiated cap`	2 CFR 200 indirect cost
`effective_date`	`date`	required, ISO-8601, UTC-normalized	NSF budget period
`asset_serial`	`str | None`	required for equipment rows	OSHA 29 CFR 1910
`content_hash`	`str`	system-generated, SHA-256	idempotency control

The content_hash and a derived idempotency_key (f"{award_id}:{institution_id}:{budget_category}") are the only system-owned fields; everything else maps from the source file. Workbooks with multi-row headers, merged cells, and embedded formulas are normalized by the dedicated routine in Parsing complex university Excel grant templates with pandas before they reach the field mapping below.
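
The division of labour between the two system-owned fields can be seen in a standalone sketch. It works on plain dictionaries and uses sorted-key JSON as the canonical form, which is an assumption made for illustration and may differ from the serialization the platform uses.

```python
import hashlib
import json


def idempotency_key(row: dict) -> str:
    """Identity of a row: unchanged when a non-key cell is corrected."""
    return f"{row['award_id']}:{row['institution_id']}:{row['budget_category']}"


def content_hash(row: dict) -> str:
    """SHA-256 over a canonical JSON form (sorted keys, fixed separators)."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


original = {
    "award_id": "R01CA123456",
    "institution_id": "ABC123DEF456",
    "budget_category": "equipment",
    "direct_costs": "15000.00",
}
corrected = {**original, "direct_costs": "15500.00"}    # one cell fixed
reordered = dict(reversed(list(original.items())))      # same data, new column order
```

Here `original` and `corrected` share a key but differ in hash, so the corrected row replaces the stored one, while `reordered` hashes identically to `original` and would be skipped.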

Implementation

High-volume cycles — fiscal year-end, grant-renewal windows, safety audits — generate predictable submission surges. To stay responsive, the parser streams CSV files row-chunk by row-chunk and loads Excel sheets one at a time, mapping each chunk to the canonical schema, validating with Pydantic, then upserting on a stable key. There is one Excel-specific constraint worth stating up front: pd.read_excel does not accept a chunksize parameter, so workbooks are streamed sheet-by-sheet (or converted to CSV upstream) and pruned of unused columns before the DataFrame is materialized.

Streaming plus row hashing makes re-running the same file a no-op: known hashes skip, new rows append transactionally, and malformed chunks or integrity violations isolate into the quarantine queue.
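
One way to do the upstream CSV conversion without materializing the sheet is openpyxl's read-only mode, which yields rows lazily. The sketch below is an assumption about how such a converter could look, not the platform's own tool: the function names and the `keep` parameter are hypothetical, and the first row is assumed to be a single header row.

```python
import csv
from typing import Iterable, Sequence


def rows_to_csv(rows: Iterable[Sequence[object]], out_path: str, keep: set[str]) -> int:
    """Write only the `keep` columns of a header-first row stream to CSV.

    Returns the number of data rows written. Rows are consumed one at a time,
    so memory use does not grow with the size of the sheet.
    """
    iterator = iter(rows)
    header = next(iterator, None)
    if header is None:
        return 0
    # Positions of the columns worth keeping, in their original order.
    wanted = [i for i, name in enumerate(header) if name in keep]
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([header[i] for i in wanted])
        for row in iterator:
            writer.writerow([row[i] if i < len(row) else None for i in wanted])
            written += 1
    return written


def excel_sheet_to_csv(xlsx_path: str, sheet_name: str, out_path: str, keep: set[str]) -> int:
    """Stream one worksheet to CSV using openpyxl's read-only mode."""
    from openpyxl import load_workbook  # third-party; imported lazily

    workbook = load_workbook(xlsx_path, read_only=True, data_only=True)
    try:
        sheet = workbook[sheet_name]
        return rows_to_csv(sheet.iter_rows(values_only=True), out_path, keep)
    finally:
        workbook.close()
```

The resulting CSV can then go through the chunked `pd.read_csv` path. Workbooks with merged or multi-row headers would still need the dedicated template routine first.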

The implementation has three composable parts: a Pydantic model that enforces the canonical schema, an idempotent SQLAlchemy upsert keyed on the row identity, and an exception path that routes rejects to the quarantine queue rather than aborting the batch.

Canonical validation model

python

from datetime import date
from decimal import Decimal
from enum import StrEnum

from pydantic import BaseModel, Field, field_validator


class BudgetCategory(StrEnum):
    PERSONNEL = "personnel"
    EQUIPMENT = "equipment"
    TRAVEL = "travel"
    MATERIALS = "materials"
    SUBAWARD = "subaward"


class GrantRecord(BaseModel):
    """Source-agnostic grant row enforced before any staging write."""

    award_id: str = Field(pattern=r"^[A-Z0-9-]{6,}$")
    institution_id: str
    fiscal_year: int = Field(ge=2000, le=2100)
    budget_category: BudgetCategory
    direct_costs: Decimal = Field(ge=0, decimal_places=2)
    indirect_cost_rate: Decimal | None = Field(default=None, ge=0)
    effective_date: date
    asset_serial: str | None = None

    @field_validator("indirect_cost_rate")
    @classmethod
    def _cap_idc(cls, v: Decimal | None) -> Decimal | None:
        # 2 CFR 200: an institution's negotiated rate is the hard ceiling.
        if v is not None and v > Decimal("0.75"):
            raise ValueError("indirect_cost_rate exceeds negotiated cap")
        return v

Streaming reader + idempotent upsert with quarantine routing

python

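import hashlib
import logging
from typing import Any, Iterator

import pandas as pd
from pydantic import ValidationError
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session

# GrantRecord is the Pydantic model defined above. StagedGrantRecord and
# IngestionLedger are the project's SQLAlchemy ORM models; import all three
# from wherever your codebase defines them.

logger = logging.getLogger(__name__)

KEY_COLUMNS = ("award_id", "institution_id", "budget_category")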


def read_file(file_path: str, chunk_size: int = 5000) -> Iterator[pd.DataFrame]:
    """Stream a CSV in chunks. pd.read_excel has no chunksize, so an Excel
    sheet is loaded whole and yielded as a single chunk — prune unused columns
    upstream when a workbook approaches available RAM."""
    if file_path.lower().endswith(".csv"):
        yield from pd.read_csv(file_path, chunksize=chunk_size)
    else:
        yield pd.read_excel(file_path, sheet_name=0, engine="openpyxl")


def idempotency_key(record: GrantRecord) -> str:
    """Stable key so re-sent files cannot create duplicate rows."""
    return f"{record.award_id}:{record.institution_id}:{record.budget_category}"


def content_hash(record: GrantRecord) -> str:
    """Deterministic SHA-256 over canonicalized content for provenance."""
    canonical = record.model_dump_json()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def parse_and_upsert(
    file_path: str,
    session: Session,
    mapping: dict[str, str],   # source column -> canonical field
    quarantine,                # callable: (row: dict, reason: str) -> None
) -> dict[str, int]:
    """Idempotent batch parser: map, validate, upsert, or quarantine. Safe to
    execute repeatedly against the same file without side effects."""
    stats = {"processed": 0, "skipped": 0, "quarantined": 0}

    for chunk in read_file(file_path):
        # Policy gate: a chunk missing mandatory key columns never proceeds.
        renamed = chunk.rename(columns=mapping)
        if not set(KEY_COLUMNS).issubset(renamed.columns):
            stats["quarantined"] += len(renamed)
            logger.warning("Missing key columns; routing chunk to quarantine.")
            for row in renamed.to_dict("records"):
                quarantine(row, reason="missing_key_columns")
            continue

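        # pandas reads empty cells as NaN; convert them to None so optional
        # fields such as asset_serial validate instead of being quarantined.
        cleaned = renamed.astype(object).where(renamed.notna(), None)
        for raw in cleaned.to_dict("records"):
            # Validate against policy; non-conforming rows go to quarantine.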
            try:
                record = GrantRecord.model_validate(raw)
            except ValidationError as exc:
                stats["quarantined"] += 1
                quarantine(raw, reason=exc.json())
                continue

            key, digest = idempotency_key(record), content_hash(record)

            # Atomic upsert keyed on row identity (Postgres ON CONFLICT).
            stmt = insert(StagedGrantRecord).values(
                idempotency_key=key,
                content_hash=digest,
                **record.model_dump(mode="json"),
            )
            stmt = stmt.on_conflict_do_update(
                index_elements=["idempotency_key"],
                set_={"content_hash": digest, **record.model_dump(mode="json")},
                # Skip the write entirely when content is byte-identical.
                where=(StagedGrantRecord.content_hash != digest),
            )
            result = session.execute(stmt)
            if result.rowcount:
                stats["processed"] += 1
                session.add(IngestionLedger(source=file_path, content_hash=digest, key=key))
            else:
                stats["skipped"] += 1

        session.commit()

    logger.info("Parse complete: %s", stats)
    return stats

The on_conflict_do_update clause with a content_hash guard is what makes the write idempotent at the database layer: a re-sent file whose rows are unchanged produces zero writes, a corrected cell updates the existing row in place, and a genuinely new row inserts once. Heavy files are decoupled from this commit path through Async Processing & Queue Management, so a slow ERP write never stalls the parser.
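
The three outcomes described above (insert once, skip when unchanged, update when corrected) can be reproduced locally. The sketch below swaps in SQLite's in-memory engine for Postgres purely so it runs without a server; it assumes SQLite 3.24 or later for `ON CONFLICT`, and the table layout and the `hash-a` / `hash-b` digests are placeholders.

```python
import sqlite3

UPSERT_SQL = (
    "INSERT INTO staged (idempotency_key, content_hash, direct_costs) "
    "VALUES (?, ?, ?) "
    "ON CONFLICT(idempotency_key) DO UPDATE SET "
    "content_hash = excluded.content_hash, direct_costs = excluded.direct_costs "
    "WHERE staged.content_hash != excluded.content_hash"
)


def upsert(conn: sqlite3.Connection, key: str, digest: str, direct_costs: str) -> str:
    """Return 'written' when a row was inserted or changed, else 'skipped'."""
    before = conn.total_changes
    conn.execute(UPSERT_SQL, (key, digest, direct_costs))
    return "written" if conn.total_changes > before else "skipped"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staged ("
    "idempotency_key TEXT PRIMARY KEY, "
    "content_hash TEXT NOT NULL, "
    "direct_costs TEXT NOT NULL)"
)
key = "R01CA123456:ABC123DEF456:equipment"
outcomes = [
    upsert(conn, key, "hash-a", "15000.00"),  # first delivery
    upsert(conn, key, "hash-a", "15000.00"),  # unchanged re-send
    upsert(conn, key, "hash-b", "15500.00"),  # corrected re-send
]
```

The guard in the `WHERE` clause is what turns the unchanged re-send into a no-op; without it, every replay would rewrite the row even though nothing changed.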

Integration points

Parsing workers never write directly to production ERP or LIMS tables; they publish validated rows to a staging schema that adjacent systems read from by idempotency_key. Each integration has an explicit contract:

ERP / financials. The ERP consumes StagedGrantRecord rows by key, applying indirect-cost and cost-share reconciliation. Because the key is stable, replaying a day’s staging rows is safe.
LIMS / lab inventory. Rows carrying an asset_serial are forwarded to the equipment and lab inventory tracking systems with GHS hazard tags intact for OSHA reporting.
Grant portals. Files that originate as portal exports share the canonical schema with API Polling & Portal Integration, so a record looks identical whether it was polled or parsed.

An example staged payload published for downstream consumers:

json

{
  "idempotency_key": "R01CA123456:ABC123DEF456:equipment",
  "content_hash": "9f2c…e1",
  "award_id": "R01CA123456",
  "institution_id": "ABC123DEF456",
  "fiscal_year": 2026,
  "budget_category": "equipment",
  "direct_costs": "15000.00",
  "indirect_cost_rate": "0.55",
  "effective_date": "2026-04-01",
  "asset_serial": "LC-MS-00471"
}
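
A downstream consumer may want to confirm that a payload's key agrees with its own fields before acting on it. The following is a minimal sketch under that assumption; `key_matches_payload` is a hypothetical helper, and the sample payload repeats only the fields the check reads.

```python
import json


def key_matches_payload(payload: dict) -> bool:
    """True when idempotency_key equals award_id:institution_id:budget_category."""
    expected = ":".join(
        str(payload[field]) for field in ("award_id", "institution_id", "budget_category")
    )
    return payload.get("idempotency_key") == expected


payload = json.loads(
    """
    {
      "idempotency_key": "R01CA123456:ABC123DEF456:equipment",
      "award_id": "R01CA123456",
      "institution_id": "ABC123DEF456",
      "budget_category": "equipment"
    }
    """
)
```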

Verification & audit

Every successful row appends an entry to an append-only IngestionLedger (source file, content_hash, idempotency_key, timestamp, operator context). This ledger is the artifact compliance officers reconstruct audits from, and it lets any parse run be verified or reproduced.

To confirm a run was correct:

Count parity. processed + skipped + quarantined must equal the total rows read from the file. A gap means a row was silently dropped — a defect, not an accepted state.
Reproduce the hash. Re-run the parser against the same file; every content_hash must match the ledger and processed must be 0 on the second pass. A non-zero second pass means the parse is non-deterministic.
Quarantine reconciliation. Every dead-letter entry must carry a structured reason; the count of unresolved quarantine items is a reportable compliance metric.

python

from sqlalchemy import select


def verify_run(session: Session, source: str) -> dict[str, int]:
    """Summarize the ledger rows recorded for one source file."""
    rows = session.execute(
        select(IngestionLedger).where(IngestionLedger.source == source)
    ).scalars().all()
    return {"ledger_rows": len(rows), "distinct_keys": len({r.key for r in rows})}

Because the ledger is append-only and hash-addressed, an auditor can pin any federal report back to the exact file, row, and moment it was ingested.
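
The count-parity and second-pass checks listed above reduce to simple arithmetic on the stats dictionary the parser returns. A minimal sketch, assuming the total row count is obtained separately from the source file; both function names are hypothetical.

```python
def unaccounted_rows(stats: dict[str, int], total_rows: int) -> int:
    """Rows read but not processed, skipped, or quarantined (0 means parity)."""
    accounted = stats["processed"] + stats["skipped"] + stats["quarantined"]
    return total_rows - accounted


def second_pass_is_noop(second_pass_stats: dict[str, int]) -> bool:
    """A re-run of the same file should write nothing new."""
    return second_pass_stats["processed"] == 0
```

A non-zero result from `unaccounted_rows` points to a silently dropped row, which this page treats as a defect.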

Failure modes & recovery

When parsing anomalies occur, resolution follows a tiered diagnostic path. Every recovery procedure is idempotent-safe: re-running it cannot create duplicates.

Symptom	Root cause	Idempotent-safe recovery
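`ValidationError: direct_costs` not a number	Currency symbols / commas / `N/A` in cost columns	Add a pre-map cleaner that strips the symbols as literal characters, for example `.str.replace(r'[$,]', '', regex=True)` (a plain `replace({'$': '', ',': ''})` matches only whole cells), and maps `N/A` to null; re-parse, and quarantined rows re-validate and upsert by key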
Duplicate rows downstream	Missing `on_conflict_do_update` or a non-deterministic key	Verify `idempotency_key` derivation; collapse by key; replays then update in place
`MemoryError` on a large workbook	`pd.read_excel` loads the full sheet (no `chunksize`)	Prune unused columns before materializing; split the workbook or convert to CSV upstream and stream it
Header alias mismatch	Department renamed or merged header cells	Update the version-controlled column `mapping`; for multi-row headers use the Excel template parser

Quarantined rows retain their source payload, correlation id, and timestamped validation report. Compliance officers must be notified within 15 minutes of any validation failure that affects a federal reporting deadline, and every resolution attaches post-mortem notes to the corresponding ledger entry. When a downstream commit target is unreachable for an extended window, routing decisions follow the Fallback Routing Protocols.
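
A quarantine sink that keeps the source payload, a correlation id, and a timestamp could be sketched as follows. `InMemoryQuarantine` is a hypothetical stand-in for a durable dead-letter store; it matches the `(row, reason)` call shape the parser uses, and the `quarantined_at` field name is an assumption.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class QuarantineEntry:
    source_row: dict
    rejection_reason: str
    correlation_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    quarantined_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class InMemoryQuarantine:
    """Hypothetical sink with the (row, reason) call shape the parser expects."""

    def __init__(self) -> None:
        self.entries: list[QuarantineEntry] = []

    def __call__(self, row: dict, reason: str) -> None:
        # Copy the row so later mutation of the caller's dict cannot alter the record.
        self.entries.append(QuarantineEntry(source_row=dict(row), rejection_reason=reason))

    def dump(self) -> str:
        """Serialize entries as JSON lines, one entry per line."""
        return "\n".join(json.dumps(asdict(e), default=str) for e in self.entries)
```

An instance can be passed directly as the `quarantine` argument of `parse_and_upsert`, and `len(sink.entries)` then gives the unresolved-item count used as a compliance metric.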

Frequently asked questions

How is idempotency guaranteed when the same file is re-sent?

Two layers cooperate. A SHA-256 content_hash over the canonicalized row skips byte-identical content, and an idempotency_key (award_id:institution_id:budget_category) drives a database ON CONFLICT DO UPDATE. A re-sent file with a corrected cell updates the existing row in place; an unchanged re-send produces zero writes; a genuinely new row inserts once.

Why can't I pass chunksize to pd.read_excel for large workbooks?

pd.read_excel has no chunksize parameter — it materializes the whole sheet. Stream Excel by loading one sheet at a time and pruning unused columns before processing, or convert the workbook to CSV upstream and stream that with pd.read_csv(..., chunksize=...).

What happens to a row that fails validation?

It is routed to the quarantine (dead-letter) queue with a structured rejection reason and never reaches the staging store. The batch continues. Once the Pydantic model or column mapping is corrected, quarantined rows re-validate and upsert by their idempotency key — no manual de-duplication required.

Can the parser write directly to the production ERP?

No. Workers write only to a staging schema; the ERP and LIMS read validated rows by idempotency key. This preserves the audit boundary and lets a day of staging rows be safely replayed without side effects.

Related

Parent guide: Automated Ingestion & Data Sync Workflows
Schema Validation Pipelines — the validation gates this layer shares
API Polling & Portal Integration — the canonical schema for polled records
Async Processing & Queue Management — decoupling parsing from ERP commits
Parsing complex university Excel grant templates with pandas — the task-level how-to for merged headers
