Building Async Batch Processors for Inventory Updates

Problem statement

You need an unattended asyncio worker that takes a university inventory manifest (a CSV of assets, chemicals, and equipment tied to federal grant codes) and validates every row against an institutional schema. Valid rows are committed to the central ledger concurrently without double-posting on a network retry, and every rejected row is routed to a reviewable quarantine. The goal is that NIH/NSF asset reporting and OSHA/EPA chemical manifests stay deterministic regardless of submission timing.

This task sits under Async Processing & Queue Management, part of the broader Automated Ingestion & Data Sync Workflows practice. The batch processor is the execution stage: it consumes manifests that have already been acquired by upstream pollers, applies the validation contract maintained in the Schema Validation Pipelines, and commits only clean records — preserving the policy-versus-implementation separation established in the Grant Lifecycle Architecture Design.

Figure: chunks are validated and committed concurrently; duplicates are skipped and failures fall through to quarantine.

Prerequisites

Before running the processor, confirm the environment and the policy configuration it depends on:

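Python 3.10+. The snippets use built-in generics such as set[str] and list[dict[str, Any]] (available from Python 3.9) and timezone-aware datetime.now(timezone.utc). X | Y unions, which need 3.10, do not appear in the code shown here, so 3.10 matters only if your own helpers use them.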
Libraries: pip install "pydantic>=2.5" "httpx>=0.27" "pandas>=2.0". Pydantic enforces the schema, httpx provides the async client, and pandas reads the manifest.
Environment variables (never hard-code endpoints or tokens, per the Security Boundary Configuration):
- LEDGER_API_BASE_URL — the inventory ledger API root, e.g. https://ledger.university.example/v1.
- LEDGER_API_TOKEN — a least-privilege token scoped to the inventory/upsert endpoint only.
Policy config: the allowed grant-code prefixes and field constraints must mirror your version-controlled University Policy Mapping Frameworks. They are embedded as Pydantic validators below so they fail closed.
Idempotency store: an in-memory set is fine for a single worker. For a distributed worker pool, back it with Redis or a unique database constraint so duplicates are caught across processes.
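
A fail-fast check for the two environment variables can be sketched as follows. This is a minimal illustration, not part of the module: LedgerSettings and load_ledger_settings are hypothetical names. Note that the processor as shown only reads LEDGER_API_BASE_URL; how LEDGER_API_TOKEN is attached to requests depends on your ledger's authentication scheme and is left out here on purpose.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerSettings:
    """Hypothetical container for the two variables listed above."""

    base_url: str
    token: str


def load_ledger_settings(env: Mapping[str, str] = os.environ) -> LedgerSettings:
    """Read the ledger settings, raising early if either variable is missing or blank."""
    missing = [
        name
        for name in ("LEDGER_API_BASE_URL", "LEDGER_API_TOKEN")
        if not env.get(name, "").strip()
    ]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return LedgerSettings(
        base_url=env["LEDGER_API_BASE_URL"].rstrip("/"),  # same normalization as the processor
        token=env["LEDGER_API_TOKEN"],
    )
```

Calling load_ledger_settings() at startup turns a missing variable into one clear error before any manifest is read, instead of a KeyError partway through a run.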

Step-by-step implementation

The module is assembled in five runnable steps. Pasted in order, they form a single inventory_processor.py.

Step 1 — Structured audit logging

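Compliance review starts with a durable, timestamped log. Configure logging before anything else so that every validation and commit decision is timestamped and persisted to disk for federal audit retention. A plain log file is appended to, not tamper-proof, so ship it to protected storage if your retention policy requires immutability.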

python

import asyncio
import hashlib
import json
import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import httpx
import pandas as pd
from pydantic import BaseModel, ConfigDict, ValidationError, field_validator

# Dual handler: human-readable console + durable audit file kept per retention schedule
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    handlers=[logging.FileHandler("inventory_audit.log"), logging.StreamHandler()],
)
logger = logging.getLogger("compliance.inventory_processor")
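
By default %(asctime)s is rendered in the machine's local time, while BatchResult.processed_at in Step 3 is UTC. If you want the audit log on the same clock, one option is a formatter whose converter is time.gmtime. The sketch below assumes that preference; build_utc_formatter is a hypothetical helper name.

```python
import logging
import time


def build_utc_formatter() -> logging.Formatter:
    """Same layout as the basicConfig format above, with timestamps rendered in UTC."""
    formatter = logging.Formatter(
        fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%SZ",
    )
    formatter.converter = time.gmtime  # asctime now uses UTC instead of local time
    return formatter


# Attach it to any handler; the audit FileHandler would be configured the same way.
handler = logging.StreamHandler()
handler.setFormatter(build_utc_formatter())
```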

Step 2 — The institutional schema as a Pydantic contract

Encode policy as a validation model. extra="forbid" rejects unexpected columns so upstream schema drift fails loudly instead of silently corrupting the ledger. The validators enforce the grant-prefix and non-negative-quantity rules that satisfy NIH/NSF reporting and OSHA/EPA manifest standards.

python

class InventoryItem(BaseModel):
    model_config = ConfigDict(extra="forbid")  # unknown columns are a policy violation
    asset_id: str
    grant_code: str
    category: str
    quantity: int
    location: str
    acquisition_date: str

    @field_validator("grant_code")
    @classmethod
    def validate_grant_prefix(cls, v: str) -> str:
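        allowed = ("NIH-", "NSF-", "DOE-", "DOD-", "EPA-")
        # The prefix check is case-sensitive: "nih-123" is rejected, "NIH-abc" passes
        if not v.startswith(allowed):
            raise ValueError(f"Grant code must begin with institutional prefix: {allowed}")
        return v.upper()  # normalize the remainder of the code to upper case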

    @field_validator("quantity")
    @classmethod
    def validate_non_negative(cls, v: int) -> int:
        if v < 0:  # negative stock breaks OSHA/EPA hazardous-material manifests
            raise ValueError("Quantity must be non-negative per OSHA/EPA manifest standards")
        return v

Step 3 — The batch result record

Each chunk returns a structured outcome that separates committed rows from quarantined rows and carries a cryptographic fingerprint, so a reviewer can reconcile exactly what was posted.

python

@dataclass
class BatchResult:
    committed: list[dict[str, Any]] = field(default_factory=list)
    quarantined: list[dict[str, Any]] = field(default_factory=list)
    batch_hash: str = ""
    processed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

Step 4 — Idempotent chunk commit with backoff

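The processor must never double-post a chunk. Every attempt for a chunk carries the same Idempotency-Key header, so the ledger can collapse duplicate submissions, and transient 5xx/network faults are retried with exponential backoff (2, 4, then 8 seconds with the default max_retries=3) rather than dropped. A 409 Conflict confirms the chunk already landed, so _commit_chunk logs it and returns the chunk as committed. Any other 4xx is re-raised immediately, and Step 5 routes those rows to quarantine.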

python

class AsyncInventoryProcessor:
    def __init__(self, api_base_url: str, chunk_size: int = 500, max_retries: int = 3):
        self.api_base_url = api_base_url.rstrip("/")
        self.chunk_size = chunk_size
        self.max_retries = max_retries
        self._idempotency_store: set[str] = set()

    def _compute_batch_hash(self, records: list[dict[str, Any]]) -> str:
        # Deterministic hash: sort keys so identical payloads always fingerprint identically
        payload = json.dumps(records, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def _generate_idempotency_key(self, record: dict[str, Any]) -> str:
        return f"{record.get('asset_id')}:{record.get('grant_code')}"

    async def _commit_chunk(
        self, chunk: list[dict[str, Any]], client: httpx.AsyncClient
    ) -> list[dict[str, Any]]:
        idem_key = self._compute_batch_hash(chunk)
        headers = {"Idempotency-Key": idem_key, "X-Compliance-Standard": "NIH-NSF-OSHA-EPA"}

        for attempt in range(1, self.max_retries + 1):
            try:
                resp = await client.post(
                    f"{self.api_base_url}/inventory/upsert", json=chunk, headers=headers
                )
                resp.raise_for_status()
                return chunk
            except httpx.HTTPStatusError as e:
                if e.response.status_code == 409:
                    # Idempotency key already committed — safe, not a failure
                    logger.info(f"Chunk {idem_key[:12]} already committed (409); skipping re-post")
                    return chunk
                if 500 <= e.response.status_code < 600:
                    logger.warning(
                        f"Transient server error (attempt {attempt}/{self.max_retries}): "
                        f"{e.response.status_code}"
                    )
                    await asyncio.sleep(2 ** attempt)
                    continue
                raise  # 4xx other than 409 is a hard, non-retryable error
            except httpx.RequestError:
                logger.warning(f"Network drop (attempt {attempt}/{self.max_retries}). Retrying...")
                await asyncio.sleep(2 ** attempt)
                continue
        raise RuntimeError("Max retries exceeded for chunk commit")
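
The retry schedule can be checked without a network. The sketch below is a simplified stand-in for _commit_chunk, not the method itself: ConnectionError stands in for httpx.RequestError, flaky_post for client.post, and the sleep function is injected so the delays can be recorded instead of waited for. Unlike the method above, it skips the pointless sleep after the final attempt.

```python
import asyncio


async def retry_with_backoff(operation, max_retries=3, sleep=asyncio.sleep):
    """Call `operation` until it succeeds, sleeping 2**attempt seconds between failures."""
    for attempt in range(1, max_retries + 1):
        try:
            return await operation()
        except ConnectionError:
            if attempt == max_retries:
                break  # nothing left to retry, so do not sleep again
            await sleep(2 ** attempt)
    raise RuntimeError("Max retries exceeded")


async def demo():
    delays = []
    calls = {"n": 0}

    async def fake_sleep(seconds):  # records the delay instead of waiting
        delays.append(seconds)

    async def flaky_post():  # fails twice, then succeeds
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("simulated network drop")
        return "committed"

    outcome = await retry_with_backoff(flaky_post, sleep=fake_sleep)
    return outcome, delays


outcome, delays = asyncio.run(demo())
print(outcome, delays)  # committed [2, 4]
```

Two failures cost 2 + 4 = 6 seconds of waiting before the third attempt succeeds, which is worth keeping in mind when sizing the 30-second client timeout and the number of chunks in flight.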

Step 5 — Validate, route to quarantine, and orchestrate concurrently

_process_chunk deduplicates, validates each row, and routes failures to quarantine with field-level error traces. process_manifest splits the file into chunks and runs them concurrently with asyncio.gather, capturing exceptions so one bad chunk never aborts the run. This deterministic fallback routing is the same principle codified in the Fallback Routing Protocols.

python

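    # AsyncInventoryProcessor, continued: paste these methods into the class body from Step 4.
    # Re-declaring "class AsyncInventoryProcessor:" here would replace the class and drop __init__.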
    async def _process_chunk(
        self, chunk: list[dict[str, Any]], client: httpx.AsyncClient
    ) -> BatchResult:
        result = BatchResult()
        valid_records: list[dict[str, Any]] = []
        reserved_keys: list[str] = []  # keys claimed by this chunk; released if the commit fails

        for record in chunk:
            key = self._generate_idempotency_key(record)
            if key in self._idempotency_store:
                logger.debug(f"Skipping duplicate idempotency key: {key}")
                continue
            try:
                validated = InventoryItem(**record)
                valid_records.append(validated.model_dump())
                self._idempotency_store.add(key)
                reserved_keys.append(key)
            except ValidationError as ve:
                result.quarantined.append(
                    {"original": record, "errors": ve.errors(), "reason": "schema_violation"}
                )

        if valid_records:
            try:
                committed = await self._commit_chunk(valid_records, client)
                result.committed.extend(committed)
            except Exception as e:
                # Never silently drop valid rows: re-route them for manual reconciliation
                logger.error(f"Chunk commit failed. Routing to deterministic fallback: {e}")
                # Release the keys so a later re-run can retry these rows instead of skipping them
                self._idempotency_store.difference_update(reserved_keys)
                for rec in valid_records:
                    result.quarantined.append(
                        {
                            "original": rec,
                            "errors": [{"msg": "Network/commit failure during chunk submission"}],
                            "reason": "transient_failure",
                        }
                    )

        result.batch_hash = self._compute_batch_hash(valid_records)
        return result

    async def process_manifest(self, file_path: Path) -> list[BatchResult]:
        logger.info(f"Starting async batch processing for: {file_path}")
        df = pd.read_csv(file_path, dtype=str)  # read as strings; Pydantic coerces types
        df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
        records = df.to_dict(orient="records")
        chunks = [records[i : i + self.chunk_size] for i in range(0, len(records), self.chunk_size)]

        async with httpx.AsyncClient(timeout=30.0) as client:
            tasks = [self._process_chunk(chunk, client) for chunk in chunks]
            raw_results = await asyncio.gather(*tasks, return_exceptions=True)

        final_results: list[BatchResult] = []
        for r in raw_results:
            if isinstance(r, Exception):
                logger.error(f"Chunk processing failed with exception: {r}")
                final_results.append(BatchResult(quarantined=[{"error": str(r)}]))
            else:
                final_results.append(r)

        committed = sum(len(r.committed) for r in final_results)
        quarantined = sum(len(r.quarantined) for r in final_results)
        logger.info(f"Completed. Committed: {committed} | Quarantined: {quarantined}")
        return final_results
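
The effect of return_exceptions=True is easy to see in isolation. In this sketch process_chunk is a stand-in for _process_chunk in which chunk 2 always fails; the other names are illustrative only.

```python
import asyncio


async def process_chunk(chunk_id: int) -> str:
    """Stand-in for _process_chunk: chunk 2 fails, the rest succeed."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    if chunk_id == 2:
        raise ValueError(f"chunk {chunk_id} blew up")
    return f"chunk {chunk_id} ok"


async def run_all() -> list:
    tasks = [process_chunk(i) for i in range(1, 4)]
    # return_exceptions=True puts the exception object in the result list,
    # in the position of the task that raised it, instead of propagating it.
    return await asyncio.gather(*tasks, return_exceptions=True)


results = asyncio.run(run_all())
succeeded = [r for r in results if not isinstance(r, Exception)]
failed = [r for r in results if isinstance(r, Exception)]
print(succeeded)  # ['chunk 1 ok', 'chunk 3 ok']
print(failed)     # [ValueError('chunk 2 blew up')]
```

Results come back in task order, which is why process_manifest can walk raw_results and convert each exception into a quarantine-only BatchResult.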

Run it against a manifest:

python

if __name__ == "__main__":
    import os

    processor = AsyncInventoryProcessor(api_base_url=os.environ["LEDGER_API_BASE_URL"])
    asyncio.run(processor.process_manifest(Path("manifest_2026Q2.csv")))
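
The processor as shown returns quarantined rows in memory only. To make them reviewable after the process exits, one option is to append them to a JSON Lines file. This is a sketch under that assumption; write_quarantine and the quarantine.jsonl path are hypothetical, not part of the module above.

```python
import json
from pathlib import Path
from typing import Any


def write_quarantine(quarantined: list[dict[str, Any]], path: Path) -> int:
    """Append each quarantined entry as one JSON line; return the number written."""
    with path.open("a", encoding="utf-8") as fh:
        for entry in quarantined:
            # default=str stringifies values json cannot encode natively,
            # for example exception objects inside validation error details.
            fh.write(json.dumps(entry, default=str, sort_keys=True) + "\n")
    return len(quarantined)
```

After a run, loop over the returned list and call write_quarantine(result.quarantined, Path("quarantine.jsonl")) for each BatchResult. Rows that failed numeric coercion may contain NaN, which Python's json module writes as the bare token NaN; strict JSON parsers reject that token, so replace it with null first if other tools will read the file.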

Schema reference

Every column the processor accepts, its constraint, and the governing source rule:

Field	Type	Constraint	Source rule
`asset_id`	`str`	Required, unique within manifest	Institutional asset register
`grant_code`	`str`	Must start with `NIH-`/`NSF-`/`DOE-`/`DOD-`/`EPA-`; upper-cased	2 CFR 200.302 (financial tracking)
`category`	`str`	Required free text	Institutional inventory taxonomy
`quantity`	`int`	`>= 0`	OSHA 29 CFR 1910 / EPA RCRA manifest integrity
`location`	`str`	Required, standardized building/room code	Institutional space inventory
`acquisition_date`	`str`	Required; ISO `YYYY-MM-DD` expected (the Step 2 model types it as `str` and does not check the format)	NIH/NSF capital-asset reporting
(any other column)	—	Rejected (`extra="forbid"`)	Schema-drift protection
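
The table lists ISO YYYY-MM-DD for acquisition_date, but the Step 2 model declares the field as a plain str, so any string passes. If your policy requires the format to be enforced, a check along these lines could be called from an additional field validator on the model. That wiring, and the helper name is_iso_date, are assumptions rather than part of the original module.

```python
import re
from datetime import date

_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value: object) -> bool:
    """Return True only for a real calendar date written exactly as YYYY-MM-DD."""
    if not isinstance(value, str) or not _ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2026-02-30
    except ValueError:
        return False
    return True


print(is_iso_date("2026-04-01"))  # True
print(is_iso_date("04/01/2026"))  # False
print(is_iso_date("2026-02-30"))  # False
```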

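The composite idempotency key is asset_id:grant_code, computed from the raw row before validation. The per-chunk Idempotency-Key header is the SHA-256 of the chunk's validated records serialized as JSON with sort_keys=True: key order inside a record does not affect it, but the order and grouping of records in the chunk do.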
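
What the chunk fingerprint is and is not sensitive to can be shown with a standalone copy of the hashing recipe. The sample records are made up for illustration.

```python
import hashlib
import json


def compute_batch_hash(records: list[dict]) -> str:
    """Same recipe as _compute_batch_hash: SHA-256 over JSON with sorted keys."""
    payload = json.dumps(records, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


a = {"asset_id": "A-100", "grant_code": "NIH-R01", "quantity": 2}
b = {"asset_id": "B-200", "grant_code": "NSF-2301", "quantity": 1}
a_reordered = {"quantity": 2, "grant_code": "NIH-R01", "asset_id": "A-100"}

same_keys_shuffled = compute_batch_hash([a, b]) == compute_batch_hash([a_reordered, b])
rows_swapped = compute_batch_hash([a, b]) == compute_batch_hash([b, a])
print(same_keys_shuffled)  # True: key order inside a record does not matter
print(rows_swapped)        # False: record order within the chunk does
```

In practice this means a replay only reuses the same Idempotency-Key values if the manifest rows arrive in the same order and chunk_size is unchanged.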

Verification

Confirm a run behaved correctly before trusting the ledger:

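Reconcile the counts. Sum committed and quarantined across the returned BatchResult list. Together they must equal the manifest row count minus skipped duplicates (duplicates are logged at DEBUG, so lower the log level if you need to count them). A chunk that raised an unhandled exception appears as a single quarantined entry carrying only an error string, so the totals fall short; check inventory_audit.log.
Replay for idempotency. Run the same manifest twice. With the same processor instance, the second run should commit zero rows because every key is already in the in-memory store. In a fresh process the store is empty, so the chunks are re-posted and the ledger is expected to answer 409; _commit_chunk treats that as success, so the committed count matches the first run and the audit log shows "already committed (409)" for each chunk. New rows appearing in the ledger after the replay mean the idempotency key is not being honored, for example because chunk_size or row order changed between runs and altered the chunk hashes.
Reproduce the audit hash. Re-compute hashlib.sha256(json.dumps(records, sort_keys=True).encode()).hexdigest() over a chunk's validated records (the model_dump() output, in the same order) and match it against the batch_hash on the returned BatchResult. The code as shown does not write batch_hash to the log, so persist it yourself if auditors need it. A match shows that the payload you are reviewing is the one that was fingerprinted for submission.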
Inspect quarantine. Open the quarantined payloads; each carries a reason (schema_violation or transient_failure) and field-level errors for the reviewer.
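
The count reconciliation above can be sketched as a small helper. BatchResult here is a trimmed copy of the Step 3 record so the example runs on its own, and reconcile is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class BatchResult:  # trimmed copy of the Step 3 record, for a standalone example
    committed: list[dict[str, Any]] = field(default_factory=list)
    quarantined: list[dict[str, Any]] = field(default_factory=list)


def reconcile(results: list[BatchResult], manifest_rows: int) -> dict[str, int]:
    """Summarise a run; `unaccounted` covers skipped duplicates and any lost rows."""
    committed = sum(len(r.committed) for r in results)
    quarantined = sum(len(r.quarantined) for r in results)
    return {
        "committed": committed,
        "quarantined": quarantined,
        "unaccounted": manifest_rows - committed - quarantined,
    }


summary = reconcile(
    [BatchResult(committed=[{}, {}, {}], quarantined=[{}]), BatchResult(committed=[{}, {}])],
    manifest_rows=7,
)
print(summary)  # {'committed': 5, 'quarantined': 1, 'unaccounted': 1}
```

An unaccounted value that is larger than the number of duplicates you expect is the signal to go back to the audit log.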

Troubleshooting

Three gotchas specific to this processor:

quantity arrives as a string and fails coercion. Reading the CSV with dtype=str keeps leading zeros and grant codes intact, but it also leaves quantity as text. The pd.to_numeric(..., errors="coerce") line converts the column up front and turns malformed values into NaN, which Pydantic rejects into quarantine as a schema_violation rather than crashing the chunk. That is the behavior you want for an audit trail.
Duplicate asset_id across different grant codes is not a duplicate. The key is asset_id:grant_code, so the same physical asset charged to two awards is committed twice by design. If your policy treats asset_id as globally unique, narrow _generate_idempotency_key to the asset alone, but be aware that a second, legitimately distinct grant attribution for the same asset will then be skipped as a duplicate.
In-memory idempotency does not survive a restart. A crashed worker loses self._idempotency_store, so a re-run leans entirely on the ledger’s 409 handling to avoid double-posting. For a multi-process worker pool, move the store to Redis or a unique DB constraint before scaling out, or duplicates will slip through between workers.
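
As an illustration of the "unique database constraint" option, the sketch below backs the store with SQLite from the standard library. It is a stand-in chosen to keep the example self-contained, not a recommendation over Redis or your ledger database; the class name, table name, and file path are all hypothetical.

```python
import sqlite3


class SqliteIdempotencyStore:
    """Hypothetical replacement for the in-memory set, backed by a PRIMARY KEY constraint."""

    def __init__(self, path: str = "idempotency.db") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS committed_keys (key TEXT PRIMARY KEY)"
        )
        self._conn.commit()

    def claim(self, key: str) -> bool:
        """Return True if this call recorded the key, False if it already existed."""
        cur = self._conn.execute(
            "INSERT OR IGNORE INTO committed_keys (key) VALUES (?)", (key,)
        )
        self._conn.commit()
        return cur.rowcount == 1

    def release(self, key: str) -> None:
        """Forget a key, for example after its chunk failed to commit."""
        self._conn.execute("DELETE FROM committed_keys WHERE key = ?", (key,))
        self._conn.commit()


store = SqliteIdempotencyStore(":memory:")
print(store.claim("A-100:NIH-R01"))  # True, first time the key is seen
print(store.claim("A-100:NIH-R01"))  # False, the constraint rejects the repeat
```

Because the check and the insert happen in one statement, two workers racing on the same key cannot both get True. A file-backed SQLite database survives restarts on a single host; workers spread across hosts still need a shared service such as Redis or the ledger database itself.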

Related

Up to the parent: Async Processing & Queue Management
Sibling guide: Parsing Complex University Excel Grant Templates with pandas
Sibling guide: Automating Daily Grant Portal Polling with Python requests
Validation contract: Schema Validation Pipelines
Failure routing: Fallback Routing Protocols
