Parsing complex university Excel grant templates with pandas

On this page

Problem statement
Prerequisites
Step-by-step implementation
Step 1 — Pin the canonical schema and audit logging
Step 2 — Fingerprint the file for idempotency
Step 3 — Resolve merged, multi-row headers
Step 4 — Validate and cast against the canonical schema
Step 5 — Orchestrate the idempotent batch run
Schema and field reference
Verification
Troubleshooting
Frequently asked questions
Related

Problem statement

You need to batch-parse university grant .xlsx templates whose multi-row merged headers, hidden metadata rows, and currency-formatted cost columns break a plain pd.read_excel() call. The parser must reconcile each workbook to a single canonical compliance schema and quarantine anything malformed without ever mutating the source file, so that award setup, indirect-cost allocation, and 2 CFR 200 audit trails stay correct.

This task sits under CSV and Excel Batch Parsing, part of the broader Automated Ingestion & Data Sync Workflows practice. It implements the header-normalization step the parent layer defers to before its canonical field mapping runs: a workbook is reduced to a flat, predictable header, validated against the same policy fields enforced by the Schema Validation Pipelines, and only then handed downstream. Like every layer here, it inherits the separation of concerns codified in the Grant Lifecycle Architecture Design: it captures and structures data, it does not interpret compliance state.

University templates are engineered for human review, not machine ingestion. A sponsor exports a budget justification with a two-row header where “Personnel” spans four merged columns; a finance office leaves a hidden instructions row above the data; a department types $15,000.00 and N/A into a column the schema expects as a float. Each of these silently breaks naive parsing, and a single failure can delay award setup or misallocate an indirect cost rate.

Prerequisites

Before running the parser, confirm the following environment and policy configuration:

Python 3.10+ (the code uses built-in generics such as list[str] and dict[str, Any], which need 3.9 or later; 3.10 additionally allows X | None unions in any annotations you add).
Libraries: pandas>=2.0, openpyxl>=3.1 (the read engine and the structural-inspection layer for merged cells). Install with pip install "pandas>=2.0" "openpyxl>=3.1".
Environment / paths: a read-only source/ directory of incoming .xlsx files, a writable output/ directory for deterministic CSVs, and a quarantine/ directory for rejects. Worker credential scope and network isolation follow the Security Boundary Configuration.
Policy config: the canonical compliance schema below, version-controlled alongside your University Policy Mapping Frameworks, so a sponsor renaming a column becomes a reviewable diff rather than a silent ingestion break.
Scheduler: cron or a systemd timer. The pipeline is idempotent — a checksum guard makes an accidental double-run a no-op.

Step-by-step implementation

The flow below resolves merged header regions across variable row depths, maps them to the canonical schema, and guarantees idempotent execution. It uses openpyxl for structural inspection and pandas for typed coercion. Structural failures are copied (not moved) to quarantine, preserving the original for review.

Figure: the checksum guard makes re-parsing safe; structural failures are copied (not moved) to the quarantine queue for review.

Step 1 — Pin the canonical schema and audit logging

Structured logging to a dedicated audit file is the basis of non-repudiation. The schema is the policy contract every parsed row must satisfy; pin it so drift is detected, not silently absorbed.

python

import pandas as pd
import openpyxl
from pathlib import Path
import hashlib
import logging
import shutil
from datetime import datetime
from typing import Any

# Audit logging: every run appends to a dedicated log file AND streams to the
# scheduler's job log with timestamps and severity levels. FileHandler only
# appends; protecting the file from edits is a filesystem-permissions concern.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    handlers=[
        logging.FileHandler("grant_parsing_audit.log"),
        logging.StreamHandler(),
    ],
)

# Canonical compliance schema enforced across all university grant templates.
# Covers NIH/NSF award identity, cost, and period-of-performance fields.
# OSHA/EPA auxiliary columns are validated in a separate layer.
CANONICAL_SCHEMA: dict[str, Any] = {
    "Award Number": str,
    "PI Last Name": str,
    "Department Code": str,
    "Direct Costs": float,
    "Indirect Cost Rate": float,
    "Start Date": "datetime64[ns]",
    "End Date": "datetime64[ns]",
}

Step 2 — Fingerprint the file for idempotency

A SHA-256 checksum over the raw bytes is the idempotency anchor: its first eight hex characters are embedded in the output filename, tying each export to the exact bytes it was parsed from.

python

def compute_sha256(filepath: Path) -> str:
    """Generate a SHA-256 checksum for immutable file tracking."""
    sha256 = hashlib.sha256()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()
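As a quick check of the idea above, the sketch below fingerprints a temporary file, derives an output name using the stem-plus-eight-hex-characters convention, and shows that changing a single byte yields a different name. The file name and contents are made up for illustration, and the hashing helper is restated so the snippet runs on its own.

```python
import hashlib
import tempfile
from pathlib import Path


def compute_sha256(filepath: Path) -> str:
    """Chunked SHA-256, restated here so the snippet is self-contained."""
    sha256 = hashlib.sha256()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


def output_name(source: Path) -> str:
    """Output filename convention: <stem>_<first 8 hex chars of the checksum>.csv."""
    return f"{source.stem}_{compute_sha256(source)[:8]}.csv"


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "budget_fy25.xlsx"   # hypothetical file name
    src.write_bytes(b"original bytes")
    first = output_name(src)
    repeat = output_name(src)              # unchanged file: same name, so a re-run skips it
    src.write_bytes(b"original bytes!")    # one extra byte simulates a corrected re-send
    corrected = output_name(src)

print(first == repeat, first == corrected)
```

An unchanged file maps to the same output name on every run, while a corrected file maps to a new one, which is what lets the existence check act as the skip guard.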

Step 3 — Resolve merged, multi-row headers

The hard part. Open the workbook in openpyxl's default mode rather than read_only=True: read-only worksheets do not expose merged_cells, and the file on disk is never modified as long as the workbook is not saved. Collapse the candidate header rows so each column keeps its deepest non-empty label, then propagate each merged range's anchor value across the spanned columns that still have no label of their own. Empty positions get a stable Column_N placeholder so downstream validation reports a precise KeyError rather than crashing.

python

def resolve_merged_headers(workbook_path: Path, max_header_rows: int = 4) -> list[str]:
    """
    Flatten merged, multi-row header cells into a single list of labels.
    Handles the multi-row headers common in NIH/NSF budget justifications.
    max_header_rows must match the template's real header depth: any row
    scanned here is treated as a header row.
    """
    # Default (non-read-only) mode: read-only worksheets have no merged_cells.
    # Nothing is written back to disk because wb.save() is never called.
    wb = openpyxl.load_workbook(workbook_path, data_only=True)
    try:
        ws = wb.active
        if ws is None:
            raise ValueError("No active worksheet found in grant template.")
        raw_rows = [
            list(row)
            for row in ws.iter_rows(min_row=1, max_row=max_header_rows, values_only=True)
        ]
        merged_ranges = list(ws.merged_cells.ranges)
    finally:
        wb.close()

    col_count = max((len(r) for r in raw_rows), default=0)
    flat_header: list[Any] = [None] * col_count

    # Collapse the stacked header rows: the deepest non-empty cell per column wins.
    for row in raw_rows:
        for i, val in enumerate(row):
            if val is not None:
                flat_header[i] = val

    # Propagate each merged header range's anchor (top-left) value across the
    # columns it spans, but only into columns that have no label of their own.
    for merged_range in merged_ranges:
        if merged_range.min_row > len(raw_rows):
            continue  # merged region lies in the data area, not the header
        min_c, max_c = merged_range.min_col - 1, merged_range.max_col - 1
        if min_c >= col_count:
            continue
        anchor_val = raw_rows[merged_range.min_row - 1][min_c]
        if anchor_val is None:
            continue
        for c in range(min_c, min(max_c, col_count - 1) + 1):
            if flat_header[c] is None:
                flat_header[c] = anchor_val

    return [
        str(h).strip() if h is not None and str(h).strip() else f"Column_{i}"
        for i, h in enumerate(flat_header)
    ]
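To make the propagation step easier to follow without a workbook on disk, here is a minimal pure-Python sketch. It assumes the header rows have already been read into lists and that merged ranges arrive as hypothetical (header_row, first_col, last_col) tuples with zero-based indices; flatten_header and the sample labels are illustrative names, not part of openpyxl. In this sketch a group label fills only the columns that lack a sub-label, which is one reasonable policy rather than the only one.

```python
from typing import Any, Optional

MergedSpan = tuple[int, int, int]  # (header_row, first_col, last_col), zero-based


def flatten_header(rows: list[list[Optional[Any]]], spans: list[MergedSpan]) -> list[str]:
    width = max((len(r) for r in rows), default=0)
    flat: list[Optional[Any]] = [None] * width

    # The deepest non-empty cell in each column wins.
    for row in rows:
        for i, val in enumerate(row):
            if val is not None:
                flat[i] = val

    # A merged group's anchor fills only the columns that are still unlabeled.
    for row_idx, first, last in spans:
        anchor = rows[row_idx][first]
        if anchor is None:
            continue
        for c in range(first, min(last, width - 1) + 1):
            if flat[c] is None:
                flat[c] = anchor

    return [str(h).strip() if h is not None else f"Column_{i}" for i, h in enumerate(flat)]


rows = [
    ["Award Number", "Personnel", None, None, None, None],
    [None, "Salary", "Fringe", None, None, None],
]
spans = [(0, 1, 4)]  # "Personnel" spans columns 1-4 in header row 0
labels = flatten_header(rows, spans)
print(labels)
```

Columns with their own sub-label keep it, unlabeled columns under the group inherit "Personnel", and the trailing empty column falls back to a Column_N placeholder.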

Step 4 — Validate and cast against the canonical schema

No row is exported until it satisfies every mandatory field with the correct type. Dates are coerced with errors="coerce"; rows that fail period-of-performance parsing are dropped (and counted), never silently kept as NaT.

python

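def validate_and_cast(df: pd.DataFrame, schema: dict[str, Any]) -> pd.DataFrame:
    """Enforce the canonical schema with strict type coercion."""
    missing = set(schema.keys()) - set(df.columns)
    if missing:
        raise KeyError(f"Missing mandatory compliance fields: {sorted(missing)}")

    df = df.copy()  # never mutate the caller's frame
    for col, dtype in schema.items():
        if dtype == "datetime64[ns]":
            df[col] = pd.to_datetime(df[col], errors="coerce")
        else:
            # Raises ValueError on uncastable values (e.g. "$15,000.00"),
            # which the batch driver turns into a quarantine event.
            df[col] = df[col].astype(dtype)

    # 2 CFR 200 / NSF period-of-performance: a row without valid dates is invalid.
    before = len(df)
    df = df.dropna(subset=["Start Date", "End Date"])
    dropped = before - len(df)
    if dropped:
        logging.warning("Dropped %d row(s) with unparseable Start/End Date", dropped)
    return df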
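The float cast above raises on the currency-formatted strings described in the problem statement. One way to handle them before validation is a small cleaner like the sketch below. clean_currency is a hypothetical helper, and treating N/A and blank cells as missing (None) is an assumed policy that your compliance office may want to tighten.

```python
import re
from typing import Any, Optional

_MISSING = {"", "n/a", "na", "-"}         # assumed markers for "no value"
_CURRENCY_NOISE = re.compile(r"[$,\s]")   # dollar signs, thousands separators, spaces


def clean_currency(value: Any) -> Optional[float]:
    """Convert '$15,000.00'-style input to float; return None for missing markers."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    text = str(value).strip()
    if text.lower() in _MISSING:
        return None
    # Anything else that is not numeric still raises ValueError, by design.
    return float(_CURRENCY_NOISE.sub("", text))


print(clean_currency("$15,000.00"), clean_currency("N/A"), clean_currency(1200))
```

With pandas this could be applied as df["Direct Costs"] = df["Direct Costs"].map(clean_currency) before the validation step, so genuinely malformed values still fail loudly while formatting noise does not.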

Step 5 — Orchestrate the idempotent batch run

The driver ties it together: checksum, skip-if-already-parsed, resolve headers, validate, and write a deterministic CSV — or copy the offending workbook to quarantine with a structured log entry. A failure never aborts the batch.

python

def process_grant_templates(
    source_dir: Path,
    output_dir: Path,
    quarantine_dir: Path,
    schema: dict[str, Any] = CANONICAL_SCHEMA,
    header_rows: int = 4,
) -> None:
    """Idempotent batch parser; safe to re-run: skips already-parsed files by checksum.

    header_rows is the number of stacked header rows in the template. It is
    passed to resolve_merged_headers() and the same rows are dropped from the data.
    """
    for dir_path in (source_dir, output_dir, quarantine_dir):
        dir_path.mkdir(parents=True, exist_ok=True)

    for excel_file in source_dir.glob("*.xlsx"):
        file_hash = compute_sha256(excel_file)
        output_file = output_dir / f"{excel_file.stem}_{file_hash[:8]}.csv"

        if output_file.exists():  # idempotency guard
            logging.info(f"Skipping {excel_file.name}: output exists (hash {file_hash[:8]})")
            continue

        try:
            logging.info(f"Processing {excel_file.name} (sha256 {file_hash})...")
            headers = resolve_merged_headers(excel_file, max_header_rows=header_rows)

            # Read with no header assumption, then apply the resolved header.
            df = pd.read_excel(excel_file, header=None, engine="openpyxl")
            # pandas may read fewer columns than the worksheet reports if the
            # trailing ones are empty, so trim the label list to match.
            df.columns = headers[: df.shape[1]]
            # Drop every header row, not just the first one.
            df = df.iloc[header_rows:].reset_index(drop=True)

            # Trim whitespace from string cells only. Calling .str.strip() on a
            # mixed-type object column would turn numbers and dates into NaN.
            df = df.apply(lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v))

            df_validated = validate_and_cast(df, schema)
            df_validated.to_csv(output_file, index=False, date_format="%Y-%m-%d")
            logging.info(f"Parsed and exported {output_file.name}")

        except Exception as exc:
            logging.error(f"Failed to parse {excel_file.name}: {exc}")
            stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            quarantine_path = quarantine_dir / f"QUARANTINE_{excel_file.stem}_{stamp}.xlsx"
            shutil.copy2(excel_file, quarantine_path)  # copy, never move
            logging.warning(f"Routed to quarantine: {quarantine_path}")

Heavy or slow workbooks can be decoupled from this commit path through Async Processing & Queue Management so a long parse never stalls the batch. Once a file produces a clean header, the canonical field-mapping and upsert logic in the parent CSV and Excel Batch Parsing layer takes over.

Schema and field reference

The parser enforces this canonical set. Widen it in your version-controlled policy config rather than in code; OSHA/EPA auxiliary columns (hazard codes, CAS numbers) map to a separate validation layer rather than being forced into the core financial schema.

Field	Type	Constraint	Source rule
`Award Number`	`str`	required, key column	NIH/NSF award identifier
`PI Last Name`	`str`	required, non-empty	NIH eRA Commons / NSF Research.gov PI of record
`Department Code`	`str`	required	Institutional org/department mapping
`Direct Costs`	`float`	required, `>= 0`	2 CFR 200 cost principles
`Indirect Cost Rate`	`float`	`0 ≤ r ≤ negotiated cap`	2 CFR 200 indirect cost agreement
`Start Date`	`datetime64[ns]`	required, ISO-8601 on export	NSF/NIH period of performance
`End Date`	`datetime64[ns]`	required, `>= Start Date`	NSF/NIH period of performance
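The Constraint column goes beyond what a dtype cast can check. A minimal sketch of row-level checks for those rules follows. check_row, the sample row, and the 0.60 rate cap are hypothetical values chosen for illustration, since the negotiated cap comes from your institution's indirect cost agreement.

```python
from datetime import date
from typing import Any

NEGOTIATED_RATE_CAP = 0.60  # hypothetical; use the rate from your agreement


def check_row(row: dict[str, Any], rate_cap: float = NEGOTIATED_RATE_CAP) -> list[str]:
    """Return a list of human-readable constraint violations for one parsed row."""
    problems: list[str] = []
    for field in ("Award Number", "PI Last Name", "Department Code"):
        if not str(row.get(field) or "").strip():
            problems.append(f"{field}: required")
    direct = row.get("Direct Costs")
    if direct is None or direct < 0:
        problems.append("Direct Costs: must be >= 0")
    rate = row.get("Indirect Cost Rate")
    if rate is None or not (0 <= rate <= rate_cap):
        problems.append(f"Indirect Cost Rate: must be within 0..{rate_cap}")
    start, end = row.get("Start Date"), row.get("End Date")
    if start is None or end is None:
        problems.append("Start Date / End Date: required")
    elif end < start:
        problems.append("End Date: must be >= Start Date")
    return problems


sample = {
    "Award Number": "AWD-0001",       # made-up identifier
    "PI Last Name": "Rivera",
    "Department Code": "CHEM",
    "Direct Costs": 15000.0,
    "Indirect Cost Rate": 0.55,
    "Start Date": date(2025, 7, 1),
    "End Date": date(2025, 6, 30),    # deliberately before the start date
}
print(check_row(sample))
```

Returning a list of violations instead of raising on the first one lets a reviewer see everything wrong with a row in a single pass.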

Verification

Confirm a run behaved correctly before trusting its output:

Checksum parity. The 8-character hash suffix on each output CSV must equal the first eight hex characters of compute_sha256() recomputed over the source file. This pins every export back to the exact byte stream it came from.
Re-run idempotency. Run the pipeline twice back-to-back. The second pass must log Skipping … output exists for every already-parsed file and write nothing, which confirms the checksum guard is working.
Row-count reconciliation. Compare the data-row count of each output CSV against ws.max_row minus header and any rows dropped for invalid dates. A gap that is not explained by a logged drop means a row vanished silently — a defect, not an accepted state.
Quarantine reconciliation. Every workbook in quarantine/ must have a matching ERROR/WARNING pair in the audit log with the original filename and timestamp.
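The quarantine reconciliation check can be automated with a few regular expressions. The sketch below assumes the log line format and the QUARANTINE_<stem>_<timestamp>.xlsx naming used by the driver above; unreconciled is a hypothetical helper and the log excerpt is a made-up sample, not real output.

```python
import re
from pathlib import PurePath

ERROR_RE = re.compile(r"\| ERROR \| Failed to parse (?P<name>.+?\.xlsx): ")
WARN_RE = re.compile(r"\| WARNING \| Routed to quarantine: (?P<path>.+\.xlsx)$")
QUARANTINE_RE = re.compile(r"^QUARANTINE_(?P<stem>.+)_\d{8}_\d{6}\.xlsx$")


def unreconciled(quarantine_names: list[str], log_text: str) -> list[str]:
    """Return quarantined file names that lack a matching ERROR/WARNING pair."""
    failed_sources: set[str] = set()
    routed: set[str] = set()
    for line in log_text.splitlines():
        if m := ERROR_RE.search(line):
            failed_sources.add(m["name"])
        elif m := WARN_RE.search(line):
            routed.add(PurePath(m["path"]).name)

    missing = []
    for name in quarantine_names:
        m = QUARANTINE_RE.match(name)
        source = f"{m['stem']}.xlsx" if m else None
        if name not in routed or source not in failed_sources:
            missing.append(name)
    return missing


# Made-up log excerpt in the "%(asctime)s | %(levelname)s | %(message)s" format.
log_text = """\
2025-01-06 02:00:01,120 | INFO | Processing nsf_budget.xlsx...
2025-01-06 02:00:01,480 | ERROR | Failed to parse nsf_budget.xlsx: missing fields
2025-01-06 02:00:01,481 | WARNING | Routed to quarantine: quarantine/QUARANTINE_nsf_budget_20250106_020001.xlsx
"""
names = [
    "QUARANTINE_nsf_budget_20250106_020001.xlsx",
    "QUARANTINE_orphan_20250106_020500.xlsx",   # no log entries: should be flagged
]
flagged = unreconciled(names, log_text)
print(flagged)
```

Any name this returns is a quarantined workbook with no audit trail behind it and deserves manual review.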

Troubleshooting

Three gotchas specific to parsing university Excel templates:

KeyError: Missing mandatory compliance fields. A sponsor renamed a column or nested it under a new merged group, so resolve_merged_headers() produced a label the schema does not recognize. Inspect the function’s output, then either update CANONICAL_SCHEMA or add an alias-resolution map ({"PI Surname": "PI Last Name"}) before validation. For the canonical mapping rules, see How to Map NIH Grant Schemas to Internal Databases.
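ValueError: could not convert string to float. Currency symbols, thousands separators, or N/A sit in a cost column. Add a pre-cast cleaner, for example df["Direct Costs"] = pd.to_numeric(df["Direct Costs"].replace({r"[$,]": ""}, regex=True), errors="coerce"), and re-run. Note that errors="coerce" turns N/A into NaN, so decide whether a missing cost is acceptable for your schema. Because a quarantined file never produced an output CSV, the next pass picks it up again and re-parses it with the fix in place.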
NaT in Start/End Date, or silent row drops. Non-standard date formats (DD-Mon-YY vs MM/DD/YYYY) coerce to NaT and are dropped, or a hidden row / merged region spanning the data area shifts alignment. Pass dayfirst=True or an explicit format to pd.to_datetime, and verify ws.max_row against the visible data before adjusting max_header_rows. When the primary template source is structurally degraded, divert acquisition through your Fallback Routing Protocols until it is reconciled.
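The alias-resolution map mentioned in the first item can be sketched as below. apply_aliases and the second alias entry are hypothetical; in practice the map would live in your version-controlled policy config so that each new alias is a reviewable change.

```python
HEADER_ALIASES: dict[str, str] = {
    "PI Surname": "PI Last Name",      # example taken from the text above
    "Dept Code": "Department Code",    # hypothetical sponsor variant
}


def apply_aliases(headers: list[str], aliases: dict[str, str]) -> list[str]:
    """Map known sponsor variants to canonical names; leave other labels untouched."""
    normalized = {key.casefold(): canonical for key, canonical in aliases.items()}
    return [normalized.get(h.strip().casefold(), h.strip()) for h in headers]


resolved = apply_aliases(
    ["Award Number", " pi surname ", "Dept Code", "Column_3"],
    HEADER_ALIASES,
)
print(resolved)
```

The function would be called on the list returned by resolve_merged_headers() before the labels are assigned to the DataFrame. Matching is case-insensitive here, which is an assumption you can drop if your policy requires exact labels.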

Frequently asked questions

Why use openpyxl for headers instead of pandas' header argument?

pd.read_excel(header=[0, 1]) can stack rows into a MultiIndex, but it works from cell values alone and never consults the worksheet's merged-cell ranges, where only the top-left anchor cell of a merged group holds a value. openpyxl exposes ws.merged_cells.ranges, which lets you propagate each group's anchor value across the columns it covers. That makes it possible to flatten a "Personnel" header that merges four budget columns into a predictable, validatable label set.
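A small in-memory workbook makes this visible. The sketch below assumes openpyxl is installed; the sheet contents are invented for the demonstration and nothing is written to disk.

```python
import openpyxl

wb = openpyxl.Workbook()        # in-memory workbook, never saved
ws = wb.active
ws["A1"] = "Personnel"
ws.merge_cells("A1:D1")         # one group header spanning four budget columns

ranges = [r.coord for r in ws.merged_cells.ranges]   # A1-style references
span = next(iter(ws.merged_cells.ranges))
bounds = (span.min_col, span.max_col)                # 1-based first and last column
row_values = [cell.value for cell in ws[1]]          # cell values across row 1

print(ranges, bounds, row_values)
```

Only the anchor cell carries the label; the other three cells in the range are empty, which is why the range metadata is needed to recover the group.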

Is it safe if the batch runs twice on the same directory?

Yes. Each output filename embeds the first eight characters of the source file's SHA-256 checksum. On the second run the pipeline sees the output already exists, logs an idempotent skip, and writes nothing. A re-sent file with even one changed byte produces a new checksum and is parsed fresh, so corrections are never missed and duplicates are never created.

Why copy failed files to quarantine instead of moving them?

The source directory is treated as immutable for audit purposes — moving a file would alter the evidentiary record. shutil.copy2 preserves the original in place (with its metadata) while placing a timestamped copy in the quarantine queue for compliance review, so the chain of custody under 2 CFR 200.303 internal controls stays intact.

Related

Up to the parent topic: CSV and Excel Batch Parsing
Automating Daily Grant Portal Polling with Python requests — the polled-record sibling that shares this canonical schema
Building Async Batch Processors for Inventory Updates — decoupling heavy parses from the commit path
Schema Validation Pipelines — the validation gates this parser feeds
How to Map NIH Grant Schemas to Internal Databases — canonical field-mapping rules
