ncx-infra-controller-core Deep-Dive Analysis

Design Decisions

1. JSONB-Backed Versioned State Machines

What it is: Every stateful entity (machine, rack, switch, network segment, DPA interface, power shelf) stores its controller state as a JSONB column (controller_state) paired with a version string (controller_state_version). State transitions are recorded in append-only *_state_history tables.

Where it appears: crates/api-db/migrations/ (304 migration files), crates/api/src/state_controller/ (20 modules), all entity tables in PostgreSQL.

What problem it solves: Complex state machines with deeply nested sub-states (e.g., ManagedHostState has 18+ top-level variants, each with sub-states like DpuDiscoveringStates, ValidationState, HostReprovisionState) cannot be modeled as simple SQL enums. JSONB allows rich, evolving state structures without ALTER TABLE migrations for every new sub-state.

What would break without it: Adding new states or sub-states would require database migrations and downtime. The version string enables optimistic concurrency control — without it, two controllers processing the same machine could overwrite each other's state transitions.

Alternatives: Dedicated state tables per entity type (explosion of tables), external state stores like Redis (loses transactional consistency with domain data), event sourcing (much higher complexity for the same guarantees).
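
The version-guarded write amounts to a compare-and-swap against the stored version. The following in-memory sketch illustrates the pattern; MachineState, StateStore, and the integer version (standing in for the real controller_state_version string) are all illustrative, not the crate's actual types:

```rust
// Optimistic-concurrency sketch: a HashMap stands in for the PostgreSQL
// table, and an integer version stands in for controller_state_version.
use std::collections::HashMap;

/// Simplified stand-in for the nested JSONB controller state.
#[derive(Clone, Debug, PartialEq)]
enum MachineState {
    Created,
    DpuDiscovering { dpus_found: u32 },
    Ready,
}

struct StateStore {
    rows: HashMap<u64, (u64, MachineState)>, // machine_id -> (version, state)
}

impl StateStore {
    fn new() -> Self {
        Self { rows: HashMap::new() }
    }

    fn insert(&mut self, id: u64, state: MachineState) {
        self.rows.insert(id, (0, state));
    }

    fn read(&self, id: u64) -> Option<(u64, MachineState)> {
        self.rows.get(&id).cloned()
    }

    /// The write succeeds only if the caller still holds the version it read,
    /// mirroring a `WHERE controller_state_version = $n` guard in SQL.
    fn try_transition(&mut self, id: u64, expected: u64, next: MachineState) -> bool {
        match self.rows.get_mut(&id) {
            Some((version, state)) if *version == expected => {
                *version += 1;
                *state = next;
                true
            }
            _ => false, // another controller advanced the state first
        }
    }
}
```

The stale writer simply loses and must re-read, which is exactly the behavior that prevents two controllers from overwriting each other's transitions.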

2. Distributed State Controller with Queue Tables

What it is: Each entity type has a *_controller_queued_objects table with processed_by and processing_started_at columns. Controllers claim work items by setting these fields, and named lock tables (*_controller_lock) prevent concurrent processing of the same entity.

Where it appears: crates/api/src/state_controller/controller.rs, queue tables for machine, rack, switch, network segment, IB partition, DPA interface, power shelf, attestation.

What problem it solves: Multiple API server replicas must coordinate state machine processing without an external message broker. PostgreSQL-based queuing gives exactly-once processing semantics with the same transactional guarantees as the state updates.

What would break without it: Race conditions where two controllers advance the same machine through incompatible states. State corruption when a controller crashes mid-transition (the lock/timeout mechanism detects stale claims).

Alternatives: External message queue (RabbitMQ, NATS) — adds operational dependency and two-phase commit complexity. Advisory locks — less visible for debugging. The chosen approach keeps everything in PostgreSQL, simplifying operations.
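
The claim-and-timeout mechanics can be sketched as follows. The field names mirror the columns described above (processed_by, processing_started_at); the timeout policy and everything else here is an assumed simplification, not the controller's actual code:

```rust
// In-memory sketch of claiming a row from a *_controller_queued_objects table.
use std::time::{Duration, Instant};

struct QueuedObject {
    processed_by: Option<String>,
    processing_started_at: Option<Instant>,
}

/// A claim is considered stale once processing_started_at is older than the
/// timeout; this is how work abandoned by a crashed controller becomes
/// claimable again.
fn try_claim(item: &mut QueuedObject, controller: &str, now: Instant, timeout: Duration) -> bool {
    let claimable = match (item.processed_by.as_deref(), item.processing_started_at) {
        (None, _) => true, // unclaimed
        (Some(_), Some(started)) => now.duration_since(started) > timeout, // stale claim
        (Some(_), None) => false,
    };
    if claimable {
        item.processed_by = Some(controller.to_string());
        item.processing_started_at = Some(now);
    }
    claimable
}
```

In PostgreSQL the claim would be a single guarded UPDATE inside a transaction, so claiming and state processing share one atomicity domain.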

3. DPU-Enforced Zero-Trust Isolation

What it is: All network isolation and security enforcement happens on the BlueField DPU (Data Processing Unit), not on the host machine. The host is explicitly treated as untrustworthy. The DPU agent manages VPC networking, firewall rules, and data plane forwarding.

Where it appears: crates/agent/ (DPU agent), crates/dpf/ (data plane forwarding), crates/api/src/state_controller/dpa_interface/, design principles in book/src/README.md.

What problem it solves: In multi-tenant bare-metal environments, the host OS can be compromised or misconfigured. By enforcing isolation at the DPU level (separate ARM64 processor with its own OS), tenant workloads cannot bypass network security even with root access on the host.

What would break without it: A compromised host could sniff other tenants' traffic, bypass VPC isolation, or attack the management plane. The entire security model depends on the DPU being the trust anchor.

Alternatives: Software-defined networking on the host (vulnerable to root compromise), hardware switches only (insufficient per-machine granularity), hypervisor-based isolation (not applicable to bare-metal).

4. Monorepo with 65 Specialized Crates

What it is: The entire system (20 binaries, all libraries) lives in a single Cargo workspace with 65 crates. Each crate has a focused responsibility: api-model for domain types, api-db for database access, rpc for protobuf definitions, etc.

Where it appears: Cargo.toml (workspace root), crates/ directory.

What problem it solves: Compile-time type safety across all service boundaries (gRPC types, DB models, domain types shared via crate dependencies). A single CI pipeline validates the entire system. Refactoring domain types propagates errors everywhere they need fixing.

What would break without it: Version skew between services, runtime serialization errors instead of compile-time type errors, duplicated type definitions that drift apart.

Alternatives: Polyrepo (version management nightmare for 20 services), fewer larger crates (longer compile times, less clear boundaries). The 65-crate approach optimizes for both compile granularity and clear separation of concerns.
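
A workspace root along these lines would produce the described layout. This excerpt is hypothetical: the dependency entries and versions are invented for illustration, and only the crate roles named above (api-model, api-db, rpc) come from the source:

```toml
# Hypothetical workspace root Cargo.toml.
[workspace]
members = ["crates/*"]
resolver = "2"

[workspace.dependencies]
# Centralizing versions here keeps all 65 crates on identical dependency
# versions, so binaries built from one commit cannot skew against each other.
serde = { version = "1", features = ["derive"] }
sqlx = { version = "0.7", features = ["postgres", "json"] }
```

Member crates then declare `serde = { workspace = true }`, and a change to a shared domain type in api-model fails compilation in every dependent crate at once.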

5. Append-Only History with Auto-Cleanup

What it is: Every state transition is logged to *_state_history tables. Database triggers automatically trim entries to keep the most recent 250 per entity. Health observations similarly use append-only machine_health_history.

Where it appears: Migration files creating history tables and cleanup triggers, crates/api-db/src/sql/machine_snapshots.sql.template.

What problem it solves: Full audit trail for debugging machine lifecycle issues (why did this machine go to Failed state?) without unbounded storage growth. The 250-entry limit prevents history tables from dominating database size in fleets with thousands of machines.

What would break without it: Without history, it would be impossible to debug intermittent failures or understand why a machine is in a particular state. Without cleanup, history tables would grow unbounded, slow down queries, and consume storage.

Alternatives: External log aggregation (Loki/ELK) — less queryable, separate system. Time-based retention — unpredictable storage. The count-based trigger approach is simple and predictable.
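
The count-based retention is simple enough to express in a few lines. The real logic runs as a PostgreSQL trigger; this Rust sketch over an in-memory history (MAX_HISTORY and HistoryEntry are illustrative names) just shows the invariant it maintains:

```rust
// Keep only the most recent MAX_HISTORY entries per entity, mirroring the
// cleanup trigger's behavior on a *_state_history table.
const MAX_HISTORY: usize = 250;

#[derive(Clone, Debug, PartialEq)]
struct HistoryEntry {
    sequence: u64, // monotonically increasing per entity
    state: String,
}

/// Append a new transition, then drop the oldest rows past the cap.
fn append_trimmed(history: &mut Vec<HistoryEntry>, entry: HistoryEntry) {
    history.push(entry);
    if history.len() > MAX_HISTORY {
        let excess = history.len() - MAX_HISTORY;
        history.drain(..excess); // oldest entries go first
    }
}
```

Because the cap is per entity and enforced on every write, history storage scales linearly with fleet size regardless of how often machines transition.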

6. SQLx Compile-Time Checked Queries

What it is: All database queries use SQLx macros that validate SQL against the actual PostgreSQL schema at compile time. Query parameters and return types are statically verified.

Where it appears: crates/api-db/src/ (96 files), sqlx-data.json for offline verification.

What problem it solves: Eliminates an entire class of runtime SQL errors (wrong column names, type mismatches, missing tables). With 304 migrations and complex JSONB queries, this is critical for correctness.

What would break without it: Schema changes that break queries would only be discovered at runtime. In a system managing thousands of machines, a SQL error during state transition could leave machines in inconsistent states.


Invariants

State Machine Constraints

Database Constraints

Authorization Invariants

Concurrency & Consistency

Idempotency


Failure Modes

Machine Provisioning Flow

| Step | What Can Fail | Current Handling | Correctness |
| --- | --- | --- | --- |
| DHCP Discovery | No DHCP response, duplicate MAC | Machine stays unpowered; DHCP server retries. Duplicate MACs detected via DB unique constraint. | Correct |
| BMC Exploration | BMC unreachable, wrong credentials | Exploration retries with backoff. Credential rotation via UpdateMachineCredentials. Machine stays in Created. | Correct |
| DPU Discovery | No DPUs found, DPU firmware incompatible | Transitions to Failed with FailureCause::Discovery. Admin can trigger DPU reprovisioning. | Correct |
| DPU OS Install (PXE) | PXE server down, image corrupt | DPU init retries. Timeout triggers Failed state. PXE health monitored separately. | Partial — no automatic PXE failover |
| Host Init | BIOS config failure, OS boot loop | Retry with BMC reset. HostReprovisionState tracks firmware upgrade attempts with retry_count. | Correct |
| Validation | GPU test failure, NVMe errors | Failed with FailureCause::MachineValidation. Machine quarantined. Specific test results stored. | Correct |
| TPM Attestation | Measurements don't match profile | Failed with MeasurementsFailedSignatureCheck, MeasurementsRevoked, or MeasurementsCAValidationFailed. | Correct — explicit failure variants |
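
The failure variants named above can be grouped by how operators respond to them. A minimal sketch, with the variant names taken from the table but the retry/quarantine policy assumed rather than read from the controller code:

```rust
// Hypothetical routing of provisioning failure causes to an operator action.
#[derive(Debug, PartialEq)]
enum FailureCause {
    Discovery,
    MachineValidation,
    MeasurementsFailedSignatureCheck,
    MeasurementsRevoked,
    MeasurementsCAValidationFailed,
}

#[derive(Debug, PartialEq)]
enum Disposition {
    /// Admin may re-trigger the step (e.g. DPU reprovisioning after a
    /// discovery failure).
    AdminRetriable,
    /// Machine is quarantined pending investigation.
    Quarantine,
}

fn disposition(cause: &FailureCause) -> Disposition {
    match cause {
        FailureCause::Discovery => Disposition::AdminRetriable,
        // Validation and attestation failures quarantine the machine rather
        // than risk handing a bad or untrusted host to a tenant.
        _ => Disposition::Quarantine,
    }
}
```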

Instance Allocation Flow

| Step | What Can Fail | Current Handling | Correctness |
| --- | --- | --- | --- |
| Machine Selection | No Ready machines available | AllocateInstance returns error. Admin must provision more machines or wait for cleanup. | Correct |
| VPC Network Config | VNI exhaustion, DPU agent unreachable | Instance stays in provisioning state. DPU agent reconnection retries. VNI pool managed by resource_pool table. | Correct |
| DHCP Lease Update | Kea DHCP server down | Provisioning retries. DHCP server has its own health monitoring. | Partial — extended outage blocks provisioning |
| IB Partition Setup | UFM unreachable, partition key conflict | IB partition state machine handles retries. Conflicts reported as provisioning failure. | Correct |
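
The VNI-exhaustion failure mode above comes down to bounded allocation from a finite pool. A sketch under assumptions: the real pool lives in the resource_pool table, and the VniPool type and range here are invented for illustration:

```rust
// Bounded VNI allocation with an explicit exhaustion signal.
use std::collections::BTreeSet;

struct VniPool {
    range: std::ops::RangeInclusive<u32>,
    allocated: BTreeSet<u32>,
}

impl VniPool {
    fn new(range: std::ops::RangeInclusive<u32>) -> Self {
        Self { range, allocated: BTreeSet::new() }
    }

    /// Hand out the lowest free VNI, or None when the pool is exhausted,
    /// the condition that stalls instance provisioning.
    fn allocate(&mut self) -> Option<u32> {
        let vni = self.range.clone().find(|v| !self.allocated.contains(v))?;
        self.allocated.insert(vni);
        Some(vni)
    }

    /// Returning a VNI on instance teardown makes exhaustion recoverable.
    fn release(&mut self, vni: u32) {
        self.allocated.remove(&vni);
    }
}
```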

Cascading Failures


Architecture Strengths

  1. Compile-time safety across boundaries: Rust's type system + SQLx compile-time checks + protobuf codegen means entire classes of runtime errors are impossible. Schema changes are caught at build time.
  2. PostgreSQL as single source of truth: No distributed state to reconcile. JSONB flexibility with relational guarantees. Transactions ensure atomicity of complex state transitions.
  3. Zero-trust by design: DPU-enforced isolation is architecturally sound. The security model doesn't depend on host OS integrity, which is the correct assumption for multi-tenant bare-metal.
  4. Comprehensive audit trail: State history tables + health history provide complete observability into machine lifecycle. Critical for debugging fleet-wide issues.
  5. Modular crate structure: 65 crates with clear dependencies enable focused development. Teams can own specific crates without understanding the entire codebase.
  6. Operational simplicity: Despite the complexity, the deployment is just PostgreSQL + N Rust binaries + Kubernetes. No external message brokers, no Redis, no distributed consensus systems.

Architecture Risks

  1. PostgreSQL as bottleneck: All state transitions, queue polling, and history writes go through a single PostgreSQL instance. At fleet scale (10K+ machines with frequent state transitions), this could become a bottleneck. Mitigation: read replicas, connection pooling, queue polling optimization.
  2. State machine complexity: 18+ top-level machine states with deeply nested sub-states (e.g., HostReprovisionState has 8 variants, each with sub-states). Understanding all valid transitions requires reading multiple source files. Risk of unreachable states or unhandled transitions.
  3. Admin CLI size: 816 files mirroring the API surface. Changes to gRPC service definitions require updating both server handlers and CLI commands. High maintenance burden.
  4. JSONB schema evolution: While JSONB avoids ALTER TABLE, the state structs in Rust code must handle deserialization of old state formats. Backward-compatible state evolution requires careful engineering.
  5. Single-point failures in boot chain: PXE server, DHCP server, and DNS server are each single services. If PXE fails, no new machines can be provisioned. Mitigation: Kubernetes restarts, but no active-active redundancy.