OCR Confidence Scoring & Fallbacks

Stage Boundaries & Pipeline Positioning

The OCR Confidence Scoring & Fallbacks stage operates as the deterministic control layer between raw document digitization and structured data normalization. Upstream, it receives raw OCR payloads and page-level metadata from the PDF & Scanned Log OCR Processing stage. Downstream, validated and routed outputs feed into field-level parsing, schema enforcement, and parts-traceability indexing. This stage does not perform extraction; it evaluates extraction reliability, enforces routing policies, and triggers fallback protocols when probabilistic outputs fall below airworthiness-grade thresholds. All state transitions, confidence vectors, and routing decisions are logged to immutable audit trails to satisfy FAA AC 120-78B and EASA Part-145 recordkeeping requirements.

Confidence Metric Architecture

Aviation maintenance records contain high-stakes identifiers: part numbers, serial numbers, AD/SB references, and life-limited component hours. Treating OCR output as deterministic introduces unacceptable compliance risk. Confidence scoring must operate across three granularities:

Character-level: taken from the finest-grained scores the OCR engine exposes. Tesseract reports per-word confidence (0–100, with -1 on non-text rows) in its TSV data output; AWS Textract returns a confidence percentage (0–100) on each block; Azure AI Document Intelligence returns confidence on a 0–1 scale, which is rescaled to 0–100 before aggregation. Where an engine exposes only word-level or block-level values, that value is the finest granularity available for its characters.
Field-level: aggregates character scores using a weighted harmonic mean. Characters in critical zones (ATA chapter blocks, release-to-service signatures) carry higher weights, so a low-confidence character there pulls the field score down further. Overlapping bounding boxes go through spatial conflict resolution before aggregation.
Document-level: computed as the harmonic mean of all field-level scores. This formulation prevents high-confidence boilerplate text from masking degraded critical fields, so localized scan artifacts or ink bleed-through still trigger appropriate routing.
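The spatial conflict resolution step is not specified further in this section. One plausible reading is a non-maximum-suppression pass: when two character boxes overlap heavily, keep the higher-confidence detection. The sketch below assumes that policy, an `x`/`y`/`w`/`h` bounding-box layout, and an overlap cutoff of 0.5; all three are illustrative assumptions rather than documented pipeline behavior.

```python
from typing import Any, Dict, List

Box = Dict[str, float]  # assumed keys: x, y, w, h (top-left origin, pixels)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1 = max(a["x"], b["x"])
    y1 = max(a["y"], b["y"])
    x2 = min(a["x"] + a["w"], b["x"] + b["w"])
    y2 = min(a["y"] + a["h"], b["y"] + b["h"])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a["w"] * a["h"] + b["w"] * b["h"] - inter
    return inter / union if union > 0 else 0.0


def resolve_overlaps(
    chars: List[Dict[str, Any]], iou_threshold: float = 0.5
) -> List[Dict[str, Any]]:
    """Keep the highest-confidence detection among heavily overlapping boxes."""
    kept: List[Dict[str, Any]] = []
    # Visit candidates from most to least confident so winners are kept first.
    for cand in sorted(chars, key=lambda c: c["confidence"], reverse=True):
        if all(iou(cand["bbox"], k["bbox"]) < iou_threshold for k in kept):
            kept.append(cand)
    # Restore left-to-right reading order for downstream aggregation.
    return sorted(kept, key=lambda c: c["bbox"]["x"])


detections = [
    {"char": "8", "confidence": 62.0, "bbox": {"x": 10, "y": 0, "w": 10, "h": 10}},
    {"char": "B", "confidence": 91.0, "bbox": {"x": 11, "y": 0, "w": 10, "h": 10}},
    {"char": "7", "confidence": 97.0, "bbox": {"x": 30, "y": 0, "w": 10, "h": 10}},
]
resolved = resolve_overlaps(detections)
```

Here the "8" and "B" boxes overlap almost completely, so only the higher-confidence "B" survives alongside the "7". A production pipeline may prefer to escalate such conflicts rather than silently drop the losing candidate.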

The weighted harmonic mean used for field aggregation is:

$C_\text{field} = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \dfrac{w_i}{c_i}}$

where $c_i$ is the per-character confidence (0–100) and $w_i$ is the criticality weight for that character position (higher for ATA-chapter and release-to-service zones). The harmonic mean is dominated by the lowest terms, which is precisely the desired bias for airworthiness data where a single ambiguous digit in a serial number can void traceability.
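To make the bias concrete, the following sketch evaluates the formula on a made-up six-character serial number with one smudged digit. The confidence values and weights are invented for illustration only.

```python
from typing import Sequence


def weighted_harmonic_mean(conf: Sequence[float], weights: Sequence[float]) -> float:
    """C_field = sum(w_i) / sum(w_i / c_i); returns 0.0 if any c_i is 0."""
    if not conf or len(conf) != len(weights):
        raise ValueError("conf and weights must be non-empty and of equal length")
    if any(c <= 0 for c in conf):
        return 0.0
    return sum(weights) / sum(w / c for w, c in zip(weights, conf))


# Hypothetical per-character confidences; the fourth character is smudged.
conf = [99.0, 98.0, 99.0, 40.0, 99.0, 98.0]
uniform = [1.0] * len(conf)
# Assume the fourth position sits in a critical zone and carries triple weight.
critical = [1.0, 1.0, 1.0, 3.0, 1.0, 1.0]

arithmetic = sum(conf) / len(conf)
harmonic = weighted_harmonic_mean(conf, uniform)
weighted = weighted_harmonic_mean(conf, critical)
print(f"arithmetic={arithmetic:.1f} harmonic={harmonic:.1f} weighted={weighted:.1f}")
```

The arithmetic mean comes out near 88.8, the unweighted harmonic mean near 79.2, and the weighted harmonic mean near 63.6, so the single weak character moves the field from the medium band to the low band once its position is weighted. Note that a weight shared by every position cancels out of the ratio; weights only change the result when they differ between positions.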

Intermediate confidence payloads are serialized alongside raw OCR JSON to enable deterministic replay during compliance audits or model retraining cycles.
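A minimal sketch of what such a replay could look like, assuming the per-character confidences are stored next to the raw OCR JSON and the field scores are recomputed from them on demand. The record layout and function names are hypothetical, and an unweighted harmonic mean stands in for the full aggregation.

```python
import json
from typing import Any, Dict, List


def harmonic_mean(values: List[float]) -> float:
    """Unweighted harmonic mean; 0.0 for empty input or any non-positive value."""
    if not values or any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def serialize(raw_ocr: Dict[str, Any], char_conf: Dict[str, List[float]]) -> str:
    """Store raw OCR output, per-character confidences, and derived field scores."""
    record = {
        "raw_ocr": raw_ocr,
        "char_confidence": char_conf,
        "field_scores": {name: harmonic_mean(c) for name, c in char_conf.items()},
    }
    # sort_keys keeps the serialized form stable across runs.
    return json.dumps(record, sort_keys=True)


def replay(serialized: str) -> bool:
    """Recompute field scores from stored inputs and compare with stored scores."""
    record = json.loads(serialized)
    recomputed = {n: harmonic_mean(c) for n, c in record["char_confidence"].items()}
    return recomputed == record["field_scores"]


stored = serialize(
    {"text": "SN 12345"},
    {"serial_number": [99.0, 98.0, 40.0, 99.0, 98.0]},
)
```

Because JSON round-trips Python floats exactly, `replay(stored)` reproduces the original scores bit for bit on the same scoring code. A mismatch indicates that either the stored record or the scoring logic has changed since the record was written.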

Threshold Configuration & Routing Logic

Thresholds are not static constants. They are dynamically calibrated per OEM form type, scan resolution (DPI), and regulatory criticality. The routing matrix enforces strict tiered behavior:

High confidence (≥ 92%): bypasses manual review and routes directly to the downstream normalization pipeline unless structural schema validation fails.
Medium confidence (≥ 75% and < 92%): triggers field-level validation gates and cross-reference checks against maintenance history databases, part master records, and aircraft configuration baselines; ambiguous fields are flagged for secondary validation.
Low confidence (< 75%): diverted immediately to fallback protocols; automated extraction is suspended to prevent false-positive traceability records.

Adaptive calibration uses a rolling 30-day confidence distribution per form template. Every threshold adjustment requires operator authentication, a documented justification, and an effective-from timestamp, so that threshold drift cannot mask systemic scanner degradation or the side effects of an OCR engine update.
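The governance requirements on threshold changes can be sketched as an immutable record that refuses to exist without an operator, a justification, and a timezone-aware effective time. The class and field names below are hypothetical, and operator authentication itself is assumed to happen upstream of this code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class ThresholdChange:
    template_id: str
    high: float
    medium: float
    operator_id: str      # assumed to be an already-authenticated identity
    justification: str
    effective_at: datetime

    def __post_init__(self) -> None:
        if not self.operator_id.strip():
            raise ValueError("operator_id is required")
        if not self.justification.strip():
            raise ValueError("justification is required")
        if self.effective_at.tzinfo is None:
            raise ValueError("effective_at must be timezone-aware")
        if not (0.0 < self.medium < self.high <= 100.0):
            raise ValueError("expected 0 < medium < high <= 100")


def active_thresholds(
    changes: Iterable[ThresholdChange], template_id: str, at: datetime
) -> Optional[ThresholdChange]:
    """Return the latest change for the template effective at or before `at`."""
    eligible = [
        c for c in changes if c.template_id == template_id and c.effective_at <= at
    ]
    return max(eligible, key=lambda c: c.effective_at) if eligible else None


baseline = ThresholdChange(
    "form-a", 92.0, 75.0, "op-17", "initial calibration",
    datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Keeping every change as a separate dated record, rather than overwriting a single value, lets an auditor reconstruct which thresholds applied to any document at its routing time.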

Fallback Execution Protocols

Fallback execution follows a strict, state-preserving hierarchy to maintain pipeline throughput while guaranteeing data integrity:

Secondary engine pass: low-confidence pages are re-processed using an alternative OCR engine after an alternate preprocessing profile has been applied (aggressive deskew, contrast normalization, adaptive thresholding). Field-level outputs from the two engines are compared; divergence exceeding 15% triggers immediate escalation.
Template-assisted parsing: coordinate-anchored extraction applies known OEM layout schemas. Fixed fields (logbook headers, stamp locations, signature blocks) bypass probabilistic OCR entirely, relying on deterministic bounding-box mapping.
Human-in-the-loop (HITL) queue: unresolved records route to a secure, RBAC-controlled review interface. Ambiguous regions are visually highlighted, highest-confidence candidates are pre-populated, and airworthiness-critical entries require dual approval before release.
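The section does not define how divergence between the two engine passes is measured. As one possible interpretation, the sketch below treats divergence as one minus the `difflib.SequenceMatcher` similarity ratio of the two field strings; that metric and the helper names are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List

DIVERGENCE_LIMIT = 0.15  # the 15% figure from the fallback hierarchy


def divergence(primary: str, secondary: str) -> float:
    """1 - similarity ratio; 0.0 means the two readings are identical."""
    if not primary and not secondary:
        return 0.0
    return 1.0 - SequenceMatcher(None, primary, secondary).ratio()


def fields_to_escalate(primary: Dict[str, str], secondary: Dict[str, str]) -> List[str]:
    """Names of fields whose two engine readings diverge beyond the limit."""
    names = sorted(set(primary) | set(secondary))
    # A field missing from one engine's output is compared against "".
    return [
        n for n in names
        if divergence(primary.get(n, ""), secondary.get(n, "")) > DIVERGENCE_LIMIT
    ]


first_pass = {"serial_number": "SN-48813", "part_number": "PN-100200-3"}
second_pass = {"serial_number": "SN-4BB13", "part_number": "PN-100200-3"}
escalate = fields_to_escalate(first_pass, second_pass)
```

With this metric the two serial-number readings diverge by 25% and are escalated, while the matching part numbers pass. One caveat of a ratio-based measure: a single differing character in an eight-character identifier yields only 12.5% divergence, so identifier fields may warrant an exact-match rule instead.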

State continuity is preserved across all fallback stages. Original scan hashes, engine metadata, and confidence vectors remain attached to the record payload to prevent data loss during escalation.
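One way to picture this state continuity is an immutable record that is copied forward at each escalation, accumulating engine and confidence history instead of replacing it. The record type below is a simplified, hypothetical stand-in for the pipeline's payload.

```python
from dataclasses import dataclass, replace
from typing import Tuple

FALLBACK_ORDER: Tuple[str, ...] = (
    "SECONDARY_ENGINE",
    "TEMPLATE_ASSISTED",
    "HITL_QUEUE",
)


@dataclass(frozen=True)
class FallbackRecord:
    document_id: str
    scan_hash: str                         # hash of the original scan, never changed
    engine_history: Tuple[str, ...]        # one entry per extraction attempt
    confidence_history: Tuple[float, ...]  # document score per attempt
    stage: str = "NONE"


def escalate(record: FallbackRecord, engine: str, score: float) -> FallbackRecord:
    """Return a new record at the next stage; earlier state is carried forward."""
    if record.stage == FALLBACK_ORDER[-1]:
        raise ValueError("already at the final fallback stage")
    idx = -1 if record.stage == "NONE" else FALLBACK_ORDER.index(record.stage)
    return replace(
        record,
        stage=FALLBACK_ORDER[idx + 1],
        engine_history=record.engine_history + (engine,),
        confidence_history=record.confidence_history + (score,),
    )


initial = FallbackRecord("doc-1", "abc123", ("engine-a",), (61.0,))
after_second_pass = escalate(initial, "engine-b", 70.5)
```

Because the dataclass is frozen and `escalate` returns a copy, the pre-escalation record remains available unchanged for audit comparison.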

Schema Validation & Error Handling Integration

All OCR outputs, regardless of confidence tier, must pass strict schema validation before entering the Regex & NLP Field Extraction stage. Validation enforces:

Data type constraints (serial numbers must match alphanumeric patterns; dates must be ISO 8601 compliant)
Referential integrity checks against approved vendor lists and type certificate data sheets
Mandatory field presence for release-to-service documentation

Validation failures generate structured error payloads containing field paths, expected vs. actual values, and recommended remediation steps. These payloads are routed to the error-handling subsystem without blocking the broader pipeline.
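A rough sketch of how such structured error payloads might be produced is shown below. The field names, the serial-number pattern, and the remediation wording are illustrative assumptions; real patterns depend on the OEM and form type, and the referential integrity checks are omitted.

```python
import re
from datetime import date
from typing import Any, Dict, List

# Illustrative pattern only; actual serial formats vary by manufacturer.
SERIAL_PATTERN = re.compile(r"[A-Z0-9][A-Z0-9-]{3,19}")
# Hypothetical mandatory fields for a release-to-service document.
REQUIRED_FIELDS = ("serial_number", "release_date", "release_signature")


def _error(path: str, expected: str, actual: Any, remediation: str) -> Dict[str, str]:
    return {
        "field_path": path,
        "expected": expected,
        "actual": repr(actual),
        "remediation": remediation,
    }


def validate_record(record: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return a list of structured errors; an empty list means the record passed."""
    errors: List[Dict[str, str]] = []
    for name in REQUIRED_FIELDS:
        if not record.get(name):
            errors.append(_error(name, "non-empty value", record.get(name),
                                 "re-extract the field or route to review"))
    serial = record.get("serial_number")
    if serial and not SERIAL_PATTERN.fullmatch(str(serial)):
        errors.append(_error("serial_number", SERIAL_PATTERN.pattern, serial,
                             "compare against the part master record"))
    released = record.get("release_date")
    if released:
        try:
            date.fromisoformat(str(released))
        except ValueError:
            errors.append(_error("release_date", "ISO 8601 date (YYYY-MM-DD)", released,
                                 "normalize the date format before resubmission"))
    return errors


problems = validate_record({"serial_number": "sn 12", "release_date": "03/04/2024"})
```

Returning a list instead of raising on the first failure lets the error-handling subsystem receive every problem in one payload while other documents continue through the pipeline.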

Production-Ready Python Implementation

The following module implements confidence aggregation, the routing matrix, and the first fallback transition using only the Python standard library, with type annotations throughout.

import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, List


logger = logging.getLogger(__name__)


class ConfidenceTier(str, Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


class FallbackStage(str, Enum):
    NONE = "NONE"
    SECONDARY_ENGINE = "SECONDARY_ENGINE"
    TEMPLATE_ASSISTED = "TEMPLATE_ASSISTED"
    HITL_QUEUE = "HITL_QUEUE"


@dataclass
class CharConfidence:
    char: str
    confidence: float  # 0–100
    bbox: Dict[str, float]
    weight: float = 1.0  # per-position criticality weight w_i


def _weighted_harmonic_mean(scores: List[float], weights: List[float]) -> float:
    """Return sum(w_i) / sum(w_i / c_i), or 0.0 for empty input or any c_i <= 0."""
    # A zero-confidence term drives the harmonic mean to 0; returning early
    # also avoids division by zero.
    if not scores or any(s <= 0 for s in scores):
        return 0.0
    denominator = sum(w / s for w, s in zip(weights, scores))
    return sum(weights) / denominator if denominator > 0 else 0.0


@dataclass
class FieldConfidence:
    """
    Aggregates per-character confidence scores using a weighted harmonic mean
    over each character's positional weight.

    critical_weight: criticality of the field as a whole (> 1.0 for critical
    zones). A factor shared by every character cancels out of the field-level
    ratio, so it is applied at document level instead, where it increases the
    field's influence on the document score.
    """

    field_name: str
    char_scores: List[CharConfidence]
    critical_weight: float = 1.0
    aggregated_score: float = field(init=False)

    def __post_init__(self) -> None:
        self.aggregated_score = _weighted_harmonic_mean(
            [c.confidence for c in self.char_scores],
            [c.weight for c in self.char_scores],
        )


@dataclass
class OCRConfidencePayload:
    document_id: str
    ocr_engine: str
    scan_hash: str
    field_confidences: List[FieldConfidence]
    document_score: float = field(init=False)
    routing_tier: ConfidenceTier = field(init=False)
    fallback_stage: FallbackStage = FallbackStage.NONE
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self) -> None:
        # Document-level harmonic mean, weighted by field criticality.
        # With the default critical_weight of 1.0 this is the plain harmonic mean.
        self.document_score = _weighted_harmonic_mean(
            [f.aggregated_score for f in self.field_confidences],
            [f.critical_weight for f in self.field_confidences],
        )
        self.routing_tier = self._determine_tier()

    def _determine_tier(self) -> ConfidenceTier:
        if self.document_score >= 92.0:
            return ConfidenceTier.HIGH
        elif self.document_score >= 75.0:
            return ConfidenceTier.MEDIUM
        return ConfidenceTier.LOW


class ConfidenceRouter:
    def __init__(self, adaptive_thresholds: Dict[str, float]):
        self.adaptive_thresholds = adaptive_thresholds
        self._log = logging.getLogger(f"{__name__}.Router")

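    def evaluate_and_route(self, payload: OCRConfidencePayload) -> Dict[str, Any]:
        # Re-derive the tier from the calibrated thresholds. The "high" and
        # "medium" keys are an assumed convention; the defaults match the
        # static tiers used by OCRConfidencePayload.
        high = self.adaptive_thresholds.get("high", 92.0)
        medium = self.adaptive_thresholds.get("medium", 75.0)
        if payload.document_score >= high:
            tier = ConfidenceTier.HIGH
        elif payload.document_score >= medium:
            tier = ConfidenceTier.MEDIUM
        else:
            tier = ConfidenceTier.LOW
        payload.routing_tier = tier

        self._log.info(
            "Routing document %s | tier=%s score=%.2f engine=%s",
            payload.document_id, tier.value, payload.document_score, payload.ocr_engine,
        )

        routing_result: Dict[str, Any] = {
            "document_id": payload.document_id,
            "tier": tier,
            "fallback_stage": payload.fallback_stage,
            "next_stage": "schema_validation",
            "requires_human_review": False,
        }

        if tier == ConfidenceTier.HIGH:
            routing_result["next_stage"] = "normalization_pipeline"
        elif tier == ConfidenceTier.MEDIUM:
            routing_result["next_stage"] = "cross_reference_validation"
        else:  # LOW
            routing_result["next_stage"] = "fallback_execution"
            routing_result["requires_human_review"] = True
            payload.fallback_stage = self._initiate_fallback(payload)
            # Report the stage that was just initiated, not the stale value
            # captured when routing_result was built.
            routing_result["fallback_stage"] = payload.fallback_stage

        return routing_result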

    def _initiate_fallback(self, payload: OCRConfidencePayload) -> FallbackStage:
        self._log.warning(
            "Low confidence for %s (score=%.2f). Initiating secondary engine pass.",
            payload.document_id, payload.document_score,
        )
        # In production, dispatch an async job to the secondary OCR engine queue.
        return FallbackStage.SECONDARY_ENGINE

Compliance & Audit Traceability

Aviation MRO pipelines must maintain an unbroken chain of custody for maintenance records. Confidence scoring outputs are treated as regulatory artifacts. Every routing decision, threshold override, and fallback escalation is written to an append-only audit log containing:

Original scan SHA-256 hash
OCR engine version and preprocessing parameters
Confidence vector snapshots per field
Operator ID for manual threshold adjustments
Timezone-aware timestamp and routing state

These logs integrate directly with fleet maintenance tracking systems to satisfy FAA and EASA traceability mandates. When confidence scores fall below operational thresholds, the system automatically generates discrepancy reports linked to the affected component serial numbers, preventing unverified data from propagating into airworthiness release documentation.
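The storage mechanism behind the append-only log is not described here. As a sketch of one common approach, the example below chains entries by hash so that any later modification of an earlier entry becomes detectable. The hash chaining, the in-memory list, and the event field names are assumptions for illustration; a real deployment would need durable, access-controlled storage.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first entry


class AuditLog:
    """In-memory append-only log; each entry commits to its predecessor's hash."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    @staticmethod
    def _digest(entry: Dict[str, Any]) -> str:
        # Hash a canonical JSON form of everything except the hash itself.
        payload = {k: entry[k] for k in ("recorded_at", "prev_hash", "event")}
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def append(self, event: Dict[str, Any]) -> Dict[str, Any]:
        prev_hash = self._entries[-1]["entry_hash"] if self._entries else GENESIS_HASH
        entry: Dict[str, Any] = {
            "recorded_at": datetime.now(timezone.utc).isoformat(),
            "prev_hash": prev_hash,
            "event": event,
        }
        entry["entry_hash"] = self._digest(entry)
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Walk the chain and confirm every link and every entry hash."""
        prev = GENESIS_HASH
        for entry in self._entries:
            if entry["prev_hash"] != prev or entry["entry_hash"] != self._digest(entry):
                return False
            prev = entry["entry_hash"]
        return True


log = AuditLog()
first = log.append({"document_id": "doc-1", "scan_sha256": "abc123", "tier": "LOW"})
log.append({"document_id": "doc-1", "fallback_stage": "SECONDARY_ENGINE"})
```

Calling `log.verify()` returns `True` for an untouched log and `False` once any stored event, timestamp, or link has been altered after the fact.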
