ncx-infra-controller-core Deep-Dive Analysis

Design Decisions

1. JSONB-Backed Versioned State Machines

What it is: Every stateful entity (machine, rack, switch, network segment, DPA interface, power shelf) stores its controller state as a JSONB column (controller_state) paired with a version string (controller_state_version). State transitions are recorded in append-only *_state_history tables.

Where it appears: crates/api-db/migrations/ (304 migration files), crates/api/src/state_controller/ (20 modules), all entity tables in PostgreSQL.

What problem it solves: Complex state machines with deeply nested sub-states (e.g., ManagedHostState has 18+ top-level variants, each with sub-states like DpuDiscoveringStates, ValidationState, HostReprovisionState) cannot be modeled as simple SQL enums. JSONB allows rich, evolving state structures without ALTER TABLE migrations for every new sub-state.

What would break without it: Adding new states or sub-states would require database migrations and downtime. The version string enables optimistic concurrency control — without it, two controllers processing the same machine could overwrite each other's state transitions.

Alternatives: Dedicated state tables per entity type (explosion of tables), external state stores like Redis (loses transactional consistency with domain data), event sourcing (much higher complexity for the same guarantees).
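
The version-guarded write amounts to a compare-and-swap against the stored version. The following in-memory sketch illustrates the pattern; MachineState, StateStore, and the integer version (standing in for the real controller_state_version string) are all illustrative, not the crate's actual types:

```rust
// Optimistic-concurrency sketch: a HashMap stands in for the PostgreSQL
// table, and an integer version stands in for controller_state_version.
use std::collections::HashMap;

/// Simplified stand-in for the nested JSONB controller state.
#[derive(Clone, Debug, PartialEq)]
enum MachineState {
    Created,
    DpuDiscovering { dpus_found: u32 },
    Ready,
}

struct StateStore {
    rows: HashMap<u64, (u64, MachineState)>, // machine_id -> (version, state)
}

impl StateStore {
    fn new() -> Self {
        Self { rows: HashMap::new() }
    }

    fn insert(&mut self, id: u64, state: MachineState) {
        self.rows.insert(id, (0, state));
    }

    fn read(&self, id: u64) -> Option<(u64, MachineState)> {
        self.rows.get(&id).cloned()
    }

    /// The write succeeds only if the caller still holds the version it read,
    /// mirroring a `WHERE controller_state_version = $n` guard in SQL.
    fn try_transition(&mut self, id: u64, expected: u64, next: MachineState) -> bool {
        match self.rows.get_mut(&id) {
            Some((version, state)) if *version == expected => {
                *version += 1;
                *state = next;
                true
            }
            _ => false, // another controller advanced the state first
        }
    }
}
```

The stale writer simply loses and must re-read, which is exactly the behavior that prevents two controllers from overwriting each other's transitions.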

2. Distributed State Controller with Queue Tables

What it is: Each entity type has a *_controller_queued_objects table with processed_by and processing_started_at columns. Controllers claim work items by setting these fields, and named lock tables (*_controller_lock) prevent concurrent processing of the same entity.

Where it appears: crates/api/src/state_controller/controller.rs, queue tables for machine, rack, switch, network segment, IB partition, DPA interface, power shelf, attestation.

What problem it solves: Multiple API server replicas must coordinate state machine processing without an external message broker. PostgreSQL-based queuing gives exactly-once processing semantics with the same transactional guarantees as the state updates.

What would break without it: Race conditions where two controllers advance the same machine through incompatible states. State corruption when a controller crashes mid-transition (the lock/timeout mechanism detects stale claims).

Alternatives: External message queue (RabbitMQ, NATS) — adds operational dependency and two-phase commit complexity. Advisory locks — less visible for debugging. The chosen approach keeps everything in PostgreSQL, simplifying operations.
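
The claim-and-timeout mechanics can be sketched as follows. The field names mirror the columns described above (processed_by, processing_started_at); the timeout policy and everything else here is an assumed simplification, not the controller's actual code:

```rust
// In-memory sketch of claiming a row from a *_controller_queued_objects table.
use std::time::{Duration, Instant};

struct QueuedObject {
    processed_by: Option<String>,
    processing_started_at: Option<Instant>,
}

/// A claim is considered stale once processing_started_at is older than the
/// timeout; this is how work abandoned by a crashed controller becomes
/// claimable again.
fn try_claim(item: &mut QueuedObject, controller: &str, now: Instant, timeout: Duration) -> bool {
    let claimable = match (item.processed_by.as_deref(), item.processing_started_at) {
        (None, _) => true, // unclaimed
        (Some(_), Some(started)) => now.duration_since(started) > timeout, // stale claim
        (Some(_), None) => false,
    };
    if claimable {
        item.processed_by = Some(controller.to_string());
        item.processing_started_at = Some(now);
    }
    claimable
}
```

In PostgreSQL the claim would be a single guarded UPDATE inside a transaction, so claiming and state processing share one atomicity domain.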

3. DPU-Enforced Zero-Trust Isolation

What it is: All network isolation and security enforcement happens on the BlueField DPU (Data Processing Unit), not on the host machine. The host is explicitly treated as untrustworthy. The DPU agent manages VPC networking, firewall rules, and data plane forwarding.

Where it appears: crates/agent/ (DPU agent), crates/dpf/ (data plane forwarding), crates/api/src/state_controller/dpa_interface/, design principles in book/src/README.md.

What problem it solves: In multi-tenant bare-metal environments, the host OS can be compromised or misconfigured. By enforcing isolation at the DPU level (separate ARM64 processor with its own OS), tenant workloads cannot bypass network security even with root access on the host.

What would break without it: A compromised host could sniff other tenants' traffic, bypass VPC isolation, or attack the management plane. The entire security model depends on the DPU being the trust anchor.

Alternatives: Software-defined networking on the host (vulnerable to root compromise), hardware switches only (insufficient per-machine granularity), hypervisor-based isolation (not applicable to bare-metal).

4. Monorepo with 65 Specialized Crates

What it is: The entire system (20 binaries, all libraries) lives in a single Cargo workspace with 65 crates. Each crate has a focused responsibility: api-model for domain types, api-db for database access, rpc for protobuf definitions, etc.

Where it appears: Cargo.toml (workspace root), crates/ directory.

What problem it solves: Compile-time type safety across all service boundaries (gRPC types, DB models, domain types shared via crate dependencies). A single CI pipeline validates the entire system. Refactoring domain types propagates errors everywhere they need fixing.

What would break without it: Version skew between services, runtime serialization errors instead of compile-time type errors, duplicated type definitions that drift apart.

Alternatives: Polyrepo (version management nightmare for 20 services), fewer larger crates (longer compile times, less clear boundaries). The 65-crate approach optimizes for both compile granularity and clear separation of concerns.
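
A workspace root along these lines would produce the described layout. This excerpt is hypothetical: the dependency entries and versions are invented for illustration, and only the crate roles named above (api-model, api-db, rpc) come from the source:

```toml
# Hypothetical workspace root Cargo.toml.
[workspace]
members = ["crates/*"]
resolver = "2"

[workspace.dependencies]
# Centralizing versions here keeps all 65 crates on identical dependency
# versions, so binaries built from one commit cannot skew against each other.
serde = { version = "1", features = ["derive"] }
sqlx = { version = "0.7", features = ["postgres", "json"] }
```

Member crates then declare `serde = { workspace = true }`, and a change to a shared domain type in api-model fails compilation in every dependent crate at once.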

5. Append-Only History with Auto-Cleanup

What it is: Every state transition is logged to *_state_history tables. Database triggers automatically trim entries to keep the most recent 250 per entity. Health observations similarly use append-only machine_health_history.

Where it appears: Migration files creating history tables and cleanup triggers, crates/api-db/src/sql/machine_snapshots.sql.template.

What problem it solves: Full audit trail for debugging machine lifecycle issues (why did this machine go to Failed state?) without unbounded storage growth. The 250-entry limit prevents history tables from dominating database size in fleets with thousands of machines.

What would break without it: Without history, it would be impossible to debug intermittent failures or understand why a machine is in a particular state. Without cleanup, history tables would grow unbounded, slow down queries, and consume storage.

Alternatives: External log aggregation (Loki/ELK) — less queryable, separate system. Time-based retention — unpredictable storage. The count-based trigger approach is simple and predictable.
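
The count-based retention is simple enough to express in a few lines. The real logic runs as a PostgreSQL trigger; this Rust sketch over an in-memory history (MAX_HISTORY and HistoryEntry are illustrative names) just shows the invariant it maintains:

```rust
// Keep only the most recent MAX_HISTORY entries per entity, mirroring the
// cleanup trigger's behavior on a *_state_history table.
const MAX_HISTORY: usize = 250;

#[derive(Clone, Debug, PartialEq)]
struct HistoryEntry {
    sequence: u64, // monotonically increasing per entity
    state: String,
}

/// Append a new transition, then drop the oldest rows past the cap.
fn append_trimmed(history: &mut Vec<HistoryEntry>, entry: HistoryEntry) {
    history.push(entry);
    if history.len() > MAX_HISTORY {
        let excess = history.len() - MAX_HISTORY;
        history.drain(..excess); // oldest entries go first
    }
}
```

Because the cap is per entity and enforced on every write, history storage scales linearly with fleet size regardless of how often machines transition.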

6. SQLx Compile-Time Checked Queries

What it is: All database queries use SQLx macros that validate SQL against the actual PostgreSQL schema at compile time. Query parameters and return types are statically verified.

Where it appears: crates/api-db/src/ (96 files), sqlx-data.json for offline verification.

What problem it solves: Eliminates an entire class of runtime SQL errors (wrong column names, type mismatches, missing tables). With 304 migrations and complex JSONB queries, this is critical for correctness.

What would break without it: Schema changes that break queries would only be discovered at runtime. In a system managing thousands of machines, a SQL error during state transition could leave machines in inconsistent states.


Invariants

State Machine Constraints

Database Constraints

Authorization Invariants

Concurrency & Consistency

Idempotency


Failure Modes

Machine Provisioning Flow

| Step | What Can Fail | Current Handling | Correctness |
| --- | --- | --- | --- |
| DHCP Discovery | No DHCP response, duplicate MAC | Machine stays unpowered; DHCP server retries. Duplicate MACs detected via DB unique constraint. | Correct |
| BMC Exploration | BMC unreachable, wrong credentials | Exploration retries with backoff. Credential rotation via UpdateMachineCredentials. Machine stays in Created. | Correct |
| DPU Discovery | No DPUs found, DPU firmware incompatible | Transitions to Failed with FailureCause::Discovery. Admin can trigger DPU reprovisioning. | Correct |
| DPU OS Install (PXE) | PXE server down, image corrupt | DPU init retries. Timeout triggers Failed state. PXE health monitored separately. | Partial — no automatic PXE failover |
| Host Init | BIOS config failure, OS boot loop | Retry with BMC reset. HostReprovisionState tracks firmware upgrade attempts with retry_count. | Correct |
| Validation | GPU test failure, NVMe errors | Failed with FailureCause::MachineValidation. Machine quarantined. Specific test results stored. | Correct |
| TPM Attestation | Measurements don't match profile | Failed with MeasurementsFailedSignatureCheck, MeasurementsRevoked, or MeasurementsCAValidationFailed. | Correct — explicit failure variants |
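
The failure variants named above can be grouped by how operators respond to them. A minimal sketch, with the variant names taken from the table but the retry/quarantine policy assumed rather than read from the controller code:

```rust
// Hypothetical routing of provisioning failure causes to an operator action.
#[derive(Debug, PartialEq)]
enum FailureCause {
    Discovery,
    MachineValidation,
    MeasurementsFailedSignatureCheck,
    MeasurementsRevoked,
    MeasurementsCAValidationFailed,
}

#[derive(Debug, PartialEq)]
enum Disposition {
    /// Admin may re-trigger the step (e.g. DPU reprovisioning after a
    /// discovery failure).
    AdminRetriable,
    /// Machine is quarantined pending investigation.
    Quarantine,
}

fn disposition(cause: &FailureCause) -> Disposition {
    match cause {
        FailureCause::Discovery => Disposition::AdminRetriable,
        // Validation and attestation failures quarantine the machine rather
        // than risk handing a bad or untrusted host to a tenant.
        _ => Disposition::Quarantine,
    }
}
```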

Instance Allocation Flow

| Step | What Can Fail | Current Handling | Correctness |
| --- | --- | --- | --- |
| Machine Selection | No Ready machines available | AllocateInstance returns error. Admin must provision more machines or wait for cleanup. | Correct |
| VPC Network Config | VNI exhaustion, DPU agent unreachable | Instance stays in provisioning state. DPU agent reconnection retries. VNI pool managed by resource_pool table. | Correct |
| DHCP Lease Update | Kea DHCP server down | Provisioning retries. DHCP server has its own health monitoring. | Partial — extended outage blocks provisioning |
| IB Partition Setup | UFM unreachable, partition key conflict | IB partition state machine handles retries. Conflicts reported as provisioning failure. | Correct |
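
The VNI-exhaustion failure mode above comes down to bounded allocation from a finite pool. A sketch under assumptions: the real pool lives in the resource_pool table, and the VniPool type and range here are invented for illustration:

```rust
// Bounded VNI allocation with an explicit exhaustion signal.
use std::collections::BTreeSet;

struct VniPool {
    range: std::ops::RangeInclusive<u32>,
    allocated: BTreeSet<u32>,
}

impl VniPool {
    fn new(range: std::ops::RangeInclusive<u32>) -> Self {
        Self { range, allocated: BTreeSet::new() }
    }

    /// Hand out the lowest free VNI, or None when the pool is exhausted,
    /// the condition that stalls instance provisioning.
    fn allocate(&mut self) -> Option<u32> {
        let vni = self.range.clone().find(|v| !self.allocated.contains(v))?;
        self.allocated.insert(vni);
        Some(vni)
    }

    /// Returning a VNI on instance teardown makes exhaustion recoverable.
    fn release(&mut self, vni: u32) {
        self.allocated.remove(&vni);
    }
}
```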

Cascading Failures


Architecture Strengths

  1. Compile-time safety across boundaries: Rust's type system + SQLx compile-time checks + protobuf codegen means entire classes of runtime errors are impossible. Schema changes are caught at build time.
  2. PostgreSQL as single source of truth: No distributed state to reconcile. JSONB flexibility with relational guarantees. Transactions ensure atomicity of complex state transitions.
  3. Zero-trust by design: DPU-enforced isolation is architecturally sound. The security model doesn't depend on host OS integrity, which is the correct assumption for multi-tenant bare-metal.
  4. Comprehensive audit trail: State history tables + health history provide complete observability into machine lifecycle. Critical for debugging fleet-wide issues.
  5. Modular crate structure: 65 crates with clear dependencies enable focused development. Teams can own specific crates without understanding the entire codebase.
  6. Operational simplicity: Despite the complexity, the deployment is just PostgreSQL + N Rust binaries + Kubernetes. No external message brokers, no Redis, no distributed consensus systems.

Architecture Risks

  1. PostgreSQL as bottleneck: All state transitions, queue polling, and history writes go through a single PostgreSQL instance. At fleet scale (10K+ machines with frequent state transitions), this could become a bottleneck. Mitigation: read replicas, connection pooling, queue polling optimization.
  2. State machine complexity: 18+ top-level machine states with deeply nested sub-states (e.g., HostReprovisionState has 8 variants, each with sub-states). Understanding all valid transitions requires reading multiple source files. Risk of unreachable states or unhandled transitions.
  3. Admin CLI size: 816 files mirroring the API surface. Changes to gRPC service definitions require updating both server handlers and CLI commands. High maintenance burden.
  4. JSONB schema evolution: While JSONB avoids ALTER TABLE, the state structs in Rust code must handle deserialization of old state formats. Backward-compatible state evolution requires careful engineering.
  5. Single-point failures in boot chain: PXE server, DHCP server, and DNS server are each single services. If PXE fails, no new machines can be provisioned. Mitigation: Kubernetes restarts, but no active-active redundancy.