Fallback Routing for Offline MRO Sites

Network degradation at remote hangars, line stations, and forward operating bases requires deterministic fallback routing to preserve parts traceability and maintenance continuity. When primary connectivity to the central MRO data lake is interrupted, edge ingestion nodes must transition to autonomous local processing without compromising data integrity or regulatory compliance. This workflow defines the procedural architecture for the offline fallback routing stage, emphasizing strict schema validation, deterministic error handling, and cryptographically verifiable audit trails.

Pipeline Stage Boundaries & Dependencies

This stage operates strictly at the network edge. It does not perform cloud reconciliation, master data synchronization, or predictive maintenance analytics. Its sole responsibility is to intercept, validate, persist, and cryptographically sign maintenance and parts payloads during connectivity outages.

Upstream dependencies:

Technician mobile workstations (ruggedized iOS/Android devices)
IoT component readers (RFID/barcode scanners, torque-wrench telemetry)
Primary API gateway heartbeat probes
Local network switch/router health metrics

Downstream dependencies:

Central MRO data lake ingestion endpoints
Compliance audit ledger synchronization services
Fleet reliability analytics pipelines
Reconciliation workers (see Offline Sync Strategies for Remote Hangars)

The boundary is enforced through explicit state transitions: CONNECTED → DEGRADED → ISOLATED → RECONCILING → CONNECTED. While the node is DEGRADED or ISOLATED, no payload crosses the edge boundary; upstream transfer resumes only in RECONCILING, after the ISOLATED state has stabilized and the local queue has reached a consistent state.
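The transition cycle can be sketched as a small state machine. This is a minimal illustration, not the stage's actual implementation: the names LinkState, EdgeStateMachine, and may_send_upstream are hypothetical, and the rule that upstream traffic is allowed only in CONNECTED and RECONCILING is an assumption drawn from the boundary description.

```python
from enum import Enum


class LinkState(Enum):
    CONNECTED = "CONNECTED"
    DEGRADED = "DEGRADED"
    ISOLATED = "ISOLATED"
    RECONCILING = "RECONCILING"


# Only the cycle named in the text is allowed. A real deployment would likely
# need extra edges (for example RECONCILING -> ISOLATED if the link drops again).
ALLOWED = {
    LinkState.CONNECTED: {LinkState.DEGRADED},
    LinkState.DEGRADED: {LinkState.ISOLATED},
    LinkState.ISOLATED: {LinkState.RECONCILING},
    LinkState.RECONCILING: {LinkState.CONNECTED},
}


class EdgeStateMachine:
    """Hypothetical holder for the edge node's connectivity state."""

    def __init__(self) -> None:
        self.state = LinkState.CONNECTED

    def transition(self, target: LinkState) -> None:
        # Reject any jump that skips a stage of the documented cycle.
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {target.name}")
        self.state = target

    def may_send_upstream(self) -> bool:
        # Assumption: payloads leave the edge only in these two states.
        return self.state in (LinkState.CONNECTED, LinkState.RECONCILING)
```

Raising on an illegal transition keeps the node from silently skipping the ISOLATED stage, which is where local queuing takes over.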

Procedural Architecture

1. Connectivity Threshold Monitoring & Edge Isolation

Fallback routing activates when heartbeat probes to the primary API gateway exceed configurable latency thresholds (default: 3 consecutive probes > 2,000 ms) or return consecutive HTTP 5xx/timeout responses. The edge node must immediately isolate the local message bus, disable synchronous external calls, and switch to asynchronous local persistence. This isolation prevents partial commits and ensures that all subsequent maintenance actions are queued locally. The routing topology aligns with the broader Aviation MRO Logbook Architecture & Standards Mapping framework, which dictates how distributed edge nodes maintain state consistency during network partitions.
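A rough sketch of the activation rule follows. HeartbeatMonitor and its record method are hypothetical names. The sketch assumes that slow probes, HTTP 5xx responses, and timeouts all count toward the same consecutive streak, and that the same limit of 3 applies to each; the text gives that default only for the latency case.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HeartbeatMonitor:
    latency_threshold_ms: float = 2000.0  # default stated in the text
    consecutive_limit: int = 3            # default stated in the text
    _bad_streak: int = field(default=0, init=False, repr=False)

    def record(self, latency_ms: Optional[float], status: Optional[int]) -> bool:
        """Record one probe result; return True when fallback should activate.

        Passing None for latency_ms or status models a probe timeout.
        """
        timed_out = latency_ms is None or status is None
        bad = timed_out or status >= 500 or latency_ms > self.latency_threshold_ms
        # Any healthy probe resets the streak, so only consecutive failures count.
        self._bad_streak = self._bad_streak + 1 if bad else 0
        return self._bad_streak >= self.consecutive_limit
```

Calling record once per probe keeps the decision deterministic: the same sequence of probe results always yields the same activation point.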

2. Local Schema Validation & Payload Enforcement

Before queuing any maintenance record, parts receipt, or component removal log, the edge processor must enforce strict schema validation against the canonical MRO data contract. Mandatory fields — including ATA chapter codes, component serial numbers, work order identifiers, technician digital signatures, and UTC timestamps — must pass type checking and range validation. Reject payloads that fail validation immediately; do not queue malformed records. Instead, route them to a local dead-letter queue (DLQ) and generate a structured error report containing the validation path, expected type, and received value. This prevents schema drift from propagating to the central repository upon reconnection.
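The dead-letter path described above might look like the following sketch. It uses only the standard library, so the hand-rolled collect_errors check is a reduced stand-in for real schema validation. The FIELD_TYPES rules, the dead_letter_queue table, and both function names are hypothetical.

```python
import json
import sqlite3
from typing import Any, Dict, List

# Hypothetical, reduced field rules standing in for the canonical data contract.
FIELD_TYPES = {"ata_chapter": str, "serial_number": str, "work_order_id": str}


def collect_errors(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return one entry per failing field: validation path, expected type, received value."""
    errors = []
    for name, expected in FIELD_TYPES.items():
        value = payload.get(name)  # a missing field shows up as None
        if not isinstance(value, expected):
            errors.append({
                "path": f"$.{name}",
                "expected": expected.__name__,
                "received": repr(value),
            })
    return errors


def route_to_dlq(db: sqlite3.Connection, payload: Dict[str, Any],
                 errors: List[Dict[str, str]]) -> None:
    """Persist the rejected payload with its error report; the main queue is never touched."""
    with db:
        db.execute(
            "CREATE TABLE IF NOT EXISTS dead_letter_queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "payload TEXT NOT NULL, error_report TEXT NOT NULL)"
        )
        db.execute(
            "INSERT INTO dead_letter_queue (payload, error_report) VALUES (?, ?)",
            (json.dumps(payload, sort_keys=True, default=repr), json.dumps(errors)),
        )
```

Storing the report next to the rejected payload lets a technician correct and resubmit the record without the malformed version ever entering the sync queue.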

3. Error Handling & Queue Persistence

The local fallback queue must use an ACID-compliant embedded database with write-ahead logging (WAL) to guarantee durability during power fluctuations. Implement exponential backoff with jitter for retry attempts on transient local failures. Each queued payload must be assigned a UUIDv4 idempotency key to prevent duplicate processing during sync reconciliation. When validation or persistence errors occur, capture the full stack trace, sanitize any PII, and log the event at ERROR severity. Implement a circuit breaker for downstream local services; if the local schema validator or queue writer fails three consecutive times, trigger a hard freeze and escalate to the maintenance control center via secondary telemetry channels.
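To illustrate how an idempotency key prevents duplicate processing, the sketch below models the receiving side of reconciliation. The in-memory ingested table is a hypothetical stand-in for the central ingestion endpoint, and make_central_store and ingest are illustrative names.

```python
import sqlite3
import uuid


def make_central_store() -> sqlite3.Connection:
    """In-memory stand-in for the central ingestion endpoint (hypothetical)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE ingested ("
        "idempotency_key TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return db


def ingest(db: sqlite3.Connection, idempotency_key: str, payload: str) -> bool:
    """Write the record once per key; return True only if a new row was stored."""
    with db:
        cur = db.execute(
            "INSERT OR IGNORE INTO ingested (idempotency_key, payload) VALUES (?, ?)",
            (idempotency_key, payload),
        )
    # rowcount is 0 when the primary-key conflict caused the insert to be ignored.
    return cur.rowcount == 1


if __name__ == "__main__":
    store = make_central_store()
    key = str(uuid.uuid4())  # assigned once, when the payload is first queued
    print(ingest(store, key, '{"work_order_id": "WO-9912"}'))  # first delivery
    print(ingest(store, key, '{"work_order_id": "WO-9912"}'))  # retried delivery
```

Because the key is assigned once at enqueue time and travels with the payload, a retry after an ambiguous timeout becomes a no-op rather than a second maintenance record.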

4. Cryptographic Audit Trail Generation

Regulatory frameworks mandate immutable proof of maintenance actions, regardless of network state. Upon successful local persistence, each payload must be hashed using SHA-256 and the digest signed with the Ed25519 private key that corresponds to the edge node's certificate. The resulting signature bundle is stored alongside the payload in the local queue. This cryptographic chain enables auditors to verify that no records were altered, backdated, or injected during the offline period. Compliance alignment with FAA Part 145 Recordkeeping Standards and EASA Part-M Compliance Mapping requires that edge-generated signatures remain cryptographically bound to the original technician identity and UTC timestamp.
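The hashing half of this step can be sketched with the standard library alone. The function names are hypothetical, and the compact separators are an assumed canonical form. Whatever serialization is chosen, the signer and the verifier must use the same one, because the signature covers the exact bytes that were hashed. Signing the digest with the edge key is deliberately left out of this sketch.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def canonical_bytes(payload: Dict[str, Any]) -> bytes:
    """Serialize with sorted keys and fixed separators so equal payloads give equal bytes."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def payload_digest(payload: Dict[str, Any]) -> str:
    """SHA-256 digest (hex) of the canonical form; this is the value that would be signed."""
    return hashlib.sha256(canonical_bytes(payload)).hexdigest()


def is_unaltered(payload: Dict[str, Any], recorded_digest: str) -> bool:
    """Recompute the digest and compare it with the one recorded at persistence time."""
    return hmac.compare_digest(payload_digest(payload), recorded_digest)
```

Because the technician signature and UTC timestamp are fields of the payload, they are covered by the digest, so changing either one after the fact invalidates the stored value.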

Production-Ready Python Implementation

The following reference implementation demonstrates a fallback routing pipeline stage using asyncio, sqlite3 with WAL mode, JSON schema validation, exponential backoff, and Ed25519 cryptographic signing. The upstream POST is simulated with a short sleep, and the signing key is generated in memory for demonstration only. It is designed for deployment on resource-constrained edge hardware (e.g., Raspberry Pi CM4, Intel NUC, or ruggedized industrial controllers).

import asyncio
import hashlib
import json
import logging
import os
import sqlite3
import time
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

import jsonschema
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
)
# The datefmt ends in "Z", so render log timestamps in UTC instead of local time.
logging.Formatter.converter = time.gmtime
logger = logging.getLogger("mro_fallback_router")

# Canonical MRO Schema Contract (simplified for demonstration)
MRO_SCHEMA = {
    "type": "object",
    "required": ["ata_chapter", "serial_number", "work_order_id", "technician_sig", "timestamp_utc"],
    "properties": {
        "ata_chapter":     {"type": "string", "pattern": "^[0-9]{2}$"},
        "serial_number":   {"type": "string", "minLength": 5},
        "work_order_id":   {"type": "string"},
        "technician_sig":  {"type": "string", "minLength": 10},
        "timestamp_utc":   {"type": "string", "format": "date-time"},
        "action_type":     {"type": "string", "enum": ["INSTALL", "REMOVE", "INSPECT", "REPAIR"]},
    },
    "additionalProperties": False,
}


@dataclass
class FallbackQueue:
    db_path: str = "mro_fallback.db"
    max_retries: int = 3
    circuit_breaker_threshold: int = 3
    _consecutive_failures: int = field(default=0, init=False, repr=False)
    _circuit_open: bool = field(default=False, init=False, repr=False)

    def __post_init__(self) -> None:
        self.db = sqlite3.connect(self.db_path, check_same_thread=False)
        self.db.execute("PRAGMA journal_mode=WAL;")
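        # In WAL mode, synchronous=NORMAL can lose the most recent commits on
        # power loss; FULL syncs the WAL on every commit so they stay durable.
        self.db.execute("PRAGMA synchronous=FULL;")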
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS fallback_queue (
                id TEXT PRIMARY KEY,
                payload TEXT NOT NULL,
                signature_hex TEXT NOT NULL,
                status TEXT DEFAULT 'PENDING',
                created_utc TEXT NOT NULL,
                retry_count INTEGER DEFAULT 0
            )
        """)
        self.db.commit()

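    def _validate_payload(self, payload: Dict[str, Any]) -> bool:
        try:
            # "format" keywords such as date-time are only enforced when a
            # FormatChecker is passed; without one they are silently ignored.
            # Some formats also require optional jsonschema extras to be installed.
            jsonschema.validate(
                instance=payload,
                schema=MRO_SCHEMA,
                format_checker=jsonschema.FormatChecker(),
            )
            return True
        except jsonschema.ValidationError as e:
            logger.error("Schema validation failed: %s | Path: %s", e.message, list(e.absolute_path))
            return False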

    def _sign_payload(self, payload_str: str, private_key: Ed25519PrivateKey) -> str:
        """
        Sign the SHA-256 digest of the payload with an Ed25519 private key.
        In production, the private key must be stored in an HSM or secure enclave.
        """
        digest = hashlib.sha256(payload_str.encode("utf-8")).digest()
        return private_key.sign(digest).hex()

    async def enqueue(
        self, payload: Dict[str, Any], private_key: Ed25519PrivateKey
    ) -> Optional[str]:
        if self._circuit_open:
            logger.critical("Circuit breaker OPEN. Hard freeze active. Escalate to MCC.")
            return None

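        if not self._validate_payload(payload):
            # NOTE: this simplified example only logs the rejection. A real
            # deployment would persist the payload and its error report to a
            # local DLQ table at this point.
            logger.warning("Payload rejected by schema validation; not queued.")
            return None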

        payload_str = json.dumps(payload, sort_keys=True)
        signature_hex = self._sign_payload(payload_str, private_key)
        record_id = str(uuid.uuid4())
        utc_now = datetime.now(timezone.utc).isoformat()

        try:
            with self.db:
                self.db.execute(
                    "INSERT INTO fallback_queue (id, payload, signature_hex, created_utc) "
                    "VALUES (?, ?, ?, ?)",
                    (record_id, payload_str, signature_hex, utc_now),
                )
            self._consecutive_failures = 0
            logger.info("Payload enqueued: %s", record_id)
            return record_id
        except sqlite3.Error as e:
            self._consecutive_failures += 1
            logger.error("Database write failed: %s", e)
            if self._consecutive_failures >= self.circuit_breaker_threshold:
                self._circuit_open = True
            return None

    async def process_pending(self, private_key: Ed25519PrivateKey) -> List[str]:
        if self._circuit_open:
            return []

        cursor = self.db.execute(
            "SELECT id, payload FROM fallback_queue "
            "WHERE status='PENDING' ORDER BY created_utc ASC LIMIT 50"
        )
        pending = cursor.fetchall()
        if not pending:
            return []

        processed_ids: List[str] = []
        for rec_id, payload_str in pending:
            try:
                await asyncio.sleep(0.1)  # Simulate async API POST in production
                with self.db:
                    self.db.execute(
                        "UPDATE fallback_queue SET status='SYNCED' WHERE id=?", (rec_id,)
                    )
                processed_ids.append(rec_id)
                logger.info("Synced record: %s", rec_id)
            except Exception as e:
                logger.warning("Sync failed for %s: %s", rec_id, e)
                retry_row = self.db.execute(
                    "SELECT retry_count FROM fallback_queue WHERE id=?", (rec_id,)
                ).fetchone()
                retry_count = retry_row[0] if retry_row else 0
                # Exponential backoff with jitter; cap at 30 seconds
                backoff = min(2 ** retry_count + (os.urandom(1)[0] / 255.0), 30.0)
                await asyncio.sleep(backoff)
                with self.db:
                    self.db.execute(
                        "UPDATE fallback_queue SET retry_count = retry_count + 1 WHERE id=?",
                        (rec_id,),
                    )
        return processed_ids


async def main() -> None:
    private_key = Ed25519PrivateKey.generate()
    queue = FallbackQueue()

    payloads = [
        {
            "ata_chapter": "32", "serial_number": "SN-8842A",
            "work_order_id": "WO-9912", "technician_sig": "CERT-TECH-01",
            "timestamp_utc": "2024-05-20T14:30:00Z", "action_type": "REMOVE",
        },
        {
            "ata_chapter": "71", "serial_number": "SN-1102",
            "work_order_id": "WO-9913", "technician_sig": "CERT-TECH-02",
            "timestamp_utc": "2024-05-20T14:35:00Z", "action_type": "INSPECT",
        },
        # Will fail schema validation (INVALID ata_chapter, short serial_number)
        {
            "ata_chapter": "INVALID", "serial_number": "SN9",
            "work_order_id": "WO-9914", "technician_sig": "CERT-TECH-03",
            "timestamp_utc": "2024-05-20T14:40:00Z",
        },
    ]

    for p in payloads:
        await queue.enqueue(p, private_key)

    synced = await queue.process_pending(private_key)
    pending_count = queue.db.execute(
        "SELECT COUNT(*) FROM fallback_queue WHERE status='PENDING'"
    ).fetchone()[0]
    logger.info("Synced %d records. Remaining PENDING: %d", len(synced), pending_count)


if __name__ == "__main__":
    asyncio.run(main())

Operational & Compliance Integration

The fallback routing stage must be treated as a regulated boundary condition. Fleet managers should configure heartbeat thresholds based on historical RF propagation data at each line station, typically ranging from 1,500 ms to 3,000 ms depending on terrain and interference profiles. MRO engineers must validate that the embedded database remains within storage quotas and implement automated log rotation to prevent disk exhaustion during extended outages.
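One way to check the storage quota is to sum the database file and its WAL sidecar files, as in the sketch below. SQLite in WAL mode keeps a "-wal" and a "-shm" file beside the database. The function names and the idea of a single byte quota are assumptions for illustration.

```python
import os


def queue_disk_usage(db_path: str) -> int:
    """Total bytes used by the database file plus its -wal and -shm sidecar files."""
    total = 0
    for suffix in ("", "-wal", "-shm"):
        candidate = db_path + suffix
        if os.path.exists(candidate):
            total += os.path.getsize(candidate)
    return total


def over_quota(db_path: str, quota_bytes: int) -> bool:
    """True when the queue's on-disk footprint exceeds the configured quota."""
    return queue_disk_usage(db_path) > quota_bytes
```

Counting the WAL file matters during long outages, since it can grow well beyond the main database file between checkpoints.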

Compliance teams should verify that cryptographic edge certificates are rotated quarterly and that private keys never leave the hardware security module (HSM) or secure enclave. The idempotency keys assigned at enqueue time ensure that duplicate submissions during reconnection windows do not violate audit trails. All edge-generated records must satisfy the immutability and traceability requirements outlined in FAA Part 145 Recordkeeping Standards and align with EASA Part-M Compliance Mapping for continuing airworthiness documentation.

Python automation builders should leverage the sqlite3 module’s native WAL support for crash-safe persistence, as documented in the official Python SQLite documentation. When deploying to containerized edge environments, mount the database directory to persistent block storage and configure the orchestrator to preserve the mro_fallback.db file across pod restarts. Circuit breaker thresholds and retry backoff parameters must be exposed via environment variables to allow dynamic tuning without container rebuilds.
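Reading the tunables from the environment could look like the sketch below. The variable names (MRO_CB_THRESHOLD, MRO_MAX_RETRIES, MRO_BACKOFF_CAP_SECONDS) and the RouterConfig type are hypothetical. The defaults mirror the values used in the implementation above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RouterConfig:
    circuit_breaker_threshold: int = 3
    max_retries: int = 3
    backoff_cap_seconds: float = 30.0


def load_config(env: Mapping[str, str] = os.environ) -> RouterConfig:
    """Build the config from environment variables, falling back to the defaults."""
    defaults = RouterConfig()
    return RouterConfig(
        circuit_breaker_threshold=int(
            env.get("MRO_CB_THRESHOLD", defaults.circuit_breaker_threshold)
        ),
        max_retries=int(env.get("MRO_MAX_RETRIES", defaults.max_retries)),
        backoff_cap_seconds=float(
            env.get("MRO_BACKOFF_CAP_SECONDS", defaults.backoff_cap_seconds)
        ),
    )
```

Passing the mapping as a parameter keeps the loader testable without mutating the process environment. A malformed value raises ValueError at startup instead of failing later during an outage.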
