ncx-infra-controller-core Deep-Dive Analysis
Design Decisions
1. JSONB-Backed Versioned State Machines
What it is: Every stateful entity (machine, rack, switch, network segment, DPA interface, power shelf) stores its controller state as a JSONB column (controller_state) paired with a version string (controller_state_version). State transitions are recorded in append-only *_state_history tables.
Where it appears: crates/api-db/migrations/ (304 migration files), crates/api/src/state_controller/ (20 modules), all entity tables in PostgreSQL.
What problem it solves: Complex state machines with deeply nested sub-states (e.g., ManagedHostState has 18+ top-level variants, each with sub-states like DpuDiscoveringStates, ValidationState, HostReprovisionState) cannot be modeled as simple SQL enums. JSONB allows rich, evolving state structures without ALTER TABLE migrations for every new sub-state.
What would break without it: Adding new states or sub-states would require database migrations and downtime. The version string enables optimistic concurrency control — without it, two controllers processing the same machine could overwrite each other's state transitions.
Alternatives: Dedicated state tables per entity type (explosion of tables), external state stores like Redis (loses transactional consistency with domain data), event sourcing (much higher complexity for the same guarantees).
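The optimistic-concurrency mechanism described above can be sketched in miniature. This is a hypothetical, in-memory model (a `HashMap` standing in for the PostgreSQL table, a `u64` standing in for the version string); the struct and function names are illustrative, not taken from the codebase.

```rust
use std::collections::HashMap;

// Simplified model of a row: a serialized state blob (the JSONB column)
// plus a version token that every write must present.
#[derive(Clone, Debug, PartialEq)]
struct StateRow {
    controller_state: String,      // stands in for the JSONB controller_state column
    controller_state_version: u64, // stands in for controller_state_version
}

#[derive(Debug, PartialEq)]
enum CasError {
    VersionConflict,
}

// Compare-and-swap: the write succeeds only if the caller still holds the
// version it originally read; otherwise another controller won the race.
fn try_transition(
    db: &mut HashMap<&'static str, StateRow>,
    id: &str,
    read_version: u64,
    new_state: &str,
) -> Result<u64, CasError> {
    let row = db.get_mut(id).expect("entity exists");
    if row.controller_state_version != read_version {
        return Err(CasError::VersionConflict);
    }
    row.controller_state = new_state.to_string();
    row.controller_state_version += 1;
    Ok(row.controller_state_version)
}

fn main() {
    let mut db = HashMap::new();
    db.insert("machine-1", StateRow {
        controller_state: r#"{"state":"Created"}"#.into(),
        controller_state_version: 1,
    });

    // Controller A read version 1 and commits first.
    assert_eq!(
        try_transition(&mut db, "machine-1", 1, r#"{"state":"Discovering"}"#),
        Ok(2)
    );
    // Controller B also read version 1; its write is rejected instead of
    // silently overwriting A's transition.
    assert_eq!(
        try_transition(&mut db, "machine-1", 1, r#"{"state":"Init"}"#),
        Err(CasError::VersionConflict)
    );
}
```

In the real system the check-and-bump happens inside a single SQL `UPDATE ... WHERE controller_state_version = $read_version`, so the compare and the swap are atomic.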
2. Distributed State Controller with Queue Tables
What it is: Each entity type has a *_controller_queued_objects table with processed_by and processing_started_at columns. Controllers claim work items by setting these fields, and named lock tables (*_controller_lock) prevent concurrent processing of the same entity.
Where it appears: crates/api/src/state_controller/controller.rs, queue tables for machine, rack, switch, network segment, IB partition, DPA interface, power shelf, attestation.
What problem it solves: Multiple API server replicas must coordinate state machine processing without an external message broker. PostgreSQL-based queuing gives exactly-once processing semantics with the same transactional guarantees as the state updates.
What would break without it: Race conditions where two controllers advance the same machine through incompatible states. State corruption when a controller crashes mid-transition (the lock/timeout mechanism detects stale claims).
Alternatives: External message queue (RabbitMQ, NATS) — adds operational dependency and two-phase commit complexity. Advisory locks — less visible for debugging. The chosen approach keeps everything in PostgreSQL, simplifying operations.
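The claim-and-reclaim rule implied by the `processed_by` / `processing_started_at` columns can be sketched as pure logic. The timeout constant and all names here are assumptions for illustration, not values from the source.

```rust
// Hypothetical model of a *_controller_queued_objects row.
#[derive(Clone, Debug)]
struct QueuedObject {
    processed_by: Option<String>,
    processing_started_at: Option<u64>, // epoch seconds, standing in for TIMESTAMPTZ
}

const STALE_CLAIM_TIMEOUT_SECS: u64 = 300; // assumed value, not from the source

// A work item is claimable if nobody holds it, or if the holder's claim is
// older than the stale-claim timeout (i.e., the controller likely crashed).
fn try_claim(item: &mut QueuedObject, worker: &str, now: u64) -> bool {
    let claimable = match (&item.processed_by, item.processing_started_at) {
        (None, _) => true,                                                       // unclaimed
        (Some(_), Some(t)) => now.saturating_sub(t) > STALE_CLAIM_TIMEOUT_SECS, // stale claim
        (Some(_), None) => false,
    };
    if claimable {
        item.processed_by = Some(worker.to_string());
        item.processing_started_at = Some(now);
    }
    claimable
}

fn main() {
    let mut item = QueuedObject { processed_by: None, processing_started_at: None };
    assert!(try_claim(&mut item, "api-0", 1_000));       // fresh claim succeeds
    assert!(!try_claim(&mut item, "api-1", 1_010));      // held and not stale: rejected
    assert!(try_claim(&mut item, "api-1", 1_000 + 301)); // stale claim is reclaimed
}
```

In PostgreSQL the whole rule collapses into one atomic statement, roughly `UPDATE ... SET processed_by = $1, processing_started_at = now() WHERE processed_by IS NULL OR processing_started_at < now() - $timeout RETURNING id`, which is what makes the claim race-free.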
3. DPU-Enforced Zero-Trust Isolation
What it is: All network isolation and security enforcement happens on the Bluefield DPU (Data Processing Unit), not on the host machine. The host is explicitly treated as untrustworthy. The DPU agent manages VPC networking, firewall rules, and data plane forwarding.
Where it appears: crates/agent/ (DPU agent), crates/dpf/ (data plane forwarding), crates/api/src/state_controller/dpa_interface/, design principles in book/src/README.md.
What problem it solves: In multi-tenant bare-metal environments, the host OS can be compromised or misconfigured. By enforcing isolation at the DPU level (separate ARM64 processor with its own OS), tenant workloads cannot bypass network security even with root access on the host.
What would break without it: A compromised host could sniff other tenants' traffic, bypass VPC isolation, or attack the management plane. The entire security model depends on the DPU being the trust anchor.
Alternatives: Software-defined networking on the host (vulnerable to root compromise), hardware switches only (insufficient per-machine granularity), hypervisor-based isolation (not applicable to bare-metal).
4. Monorepo with 65 Specialized Crates
What it is: The entire system (20 binaries, all libraries) lives in a single Cargo workspace with 65 crates. Each crate has a focused responsibility: api-model for domain types, api-db for database access, rpc for protobuf definitions, etc.
Where it appears: Cargo.toml (workspace root), crates/ directory.
What problem it solves: Compile-time type safety across all service boundaries (gRPC types, DB models, domain types shared via crate dependencies). A single CI pipeline validates the entire system. Refactoring domain types propagates errors everywhere they need fixing.
What would break without it: Version skew between services, runtime serialization errors instead of compile-time type errors, duplicated type definitions that drift apart.
Alternatives: Polyrepo (version management nightmare for 20 services), fewer larger crates (longer compile times, less clear boundaries). The 65-crate approach optimizes for both compile granularity and clear separation of concerns.
5. Append-Only History with Auto-Cleanup
What it is: Every state transition is logged to *_state_history tables. Database triggers automatically trim entries to keep the most recent 250 per entity. Health observations similarly use append-only machine_health_history.
Where it appears: Migration files creating history tables and cleanup triggers, crates/api-db/src/sql/machine_snapshots.sql.template.
What problem it solves: Full audit trail for debugging machine lifecycle issues (why did this machine go to Failed state?) without unbounded storage growth. The 250-entry limit prevents history tables from dominating database size in fleets with thousands of machines.
What would break without it: Without the history, it would be impossible to debug intermittent failures or to understand why a machine is in a particular state. Without the cleanup, history tables would grow unbounded, slow down queries, and consume storage.
Alternatives: External log aggregation (Loki/ELK) — less queryable, separate system. Time-based retention — unpredictable storage. The count-based trigger approach is simple and predictable.
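A count-based trim trigger of the kind described above might look roughly like the following. This is a hedged sketch only: the table, column, and function names are illustrative, not copied from the actual migrations.

```sql
-- Hypothetical trim trigger: after each insert, delete everything beyond the
-- 250 most recent history rows for that machine.
CREATE OR REPLACE FUNCTION trim_machine_state_history() RETURNS trigger AS $$
BEGIN
    DELETE FROM machine_state_history
    WHERE id IN (
        SELECT id FROM machine_state_history
        WHERE machine_id = NEW.machine_id
        ORDER BY created DESC
        OFFSET 250  -- keep the 250 most recent entries per machine
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER machine_state_history_trim
AFTER INSERT ON machine_state_history
FOR EACH ROW EXECUTE FUNCTION trim_machine_state_history();
```

Because the trigger fires per insert, the cap is enforced continuously rather than by a periodic batch job, which keeps storage predictable.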
6. SQLx Compile-Time Checked Queries
What it is: All database queries use SQLx macros that validate SQL against the actual PostgreSQL schema at compile time. Query parameters and return types are statically verified.
Where it appears: crates/api-db/src/ (96 files), sqlx-data.json for offline verification.
What problem it solves: Eliminates an entire class of runtime SQL errors (wrong column names, type mismatches, missing tables). With 304 migrations and complex JSONB queries, this is critical for correctness.
What would break without it: Schema changes that break queries would only be discovered at runtime. In a system managing thousands of machines, a SQL error during state transition could leave machines in inconsistent states.
Invariants
State Machine Constraints
- Single active state: Each entity has exactly one `controller_state` value at any time. The version string prevents concurrent updates (optimistic locking).
- Valid transitions only: State controllers enforce transition graphs. For example, a machine cannot go from `Created` directly to `Ready` — it must pass through discovery, init, and validation states.
- At most one in-progress controller: Queue tables ensure only one controller instance processes a given entity at a time (via `processed_by` + lock tables).
- Retry bounds: Failed states track `retry_count`. Reprovisioning sub-machines have maximum retry limits to prevent infinite loops.
- Sub-state consistency: Nested states (e.g., `DPUInit.DpfStates.Provisioning`) must be internally consistent — a DPU cannot be in `DpfStates` if DPF is disabled for that machine.
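The "valid transitions only" invariant can be made concrete with a heavily reduced transition graph. The real `ManagedHostState` has 18+ variants; this six-state enum and the function name are illustrative assumptions, showing only the shape of the check.

```rust
// Hypothetical, heavily reduced transition graph. The point is that
// controllers consult an explicit validity check, so Created -> Ready
// (skipping discovery, init, and validation) is rejected.
#[derive(Clone, Copy, Debug, PartialEq)]
enum MachineState {
    Created,
    Discovering,
    Init,
    Validating,
    Ready,
    Failed,
}

fn is_valid_transition(from: MachineState, to: MachineState) -> bool {
    use MachineState::*;
    matches!(
        (from, to),
        (Created, Discovering)
            | (Discovering, Init)
            | (Init, Validating)
            | (Validating, Ready)
            // any state may transition to Failed
            | (_, Failed)
    )
}

fn main() {
    use MachineState::*;
    assert!(is_valid_transition(Created, Discovering));
    assert!(!is_valid_transition(Created, Ready)); // must pass through intermediate states
    assert!(is_valid_transition(Init, Failed));
}
```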
Database Constraints
- UUID primary keys: All entities use UUID PKs, preventing sequential ID enumeration attacks.
- Foreign key integrity: `instances.machine_id` references `machines.id`, `network_segments.vpc_id` references `vpcs.id`, etc.
- Soft deletes: Most entities use a `deleted` TIMESTAMPTZ column rather than hard deletes, preserving referential integrity and the audit trail.
- State history limit: Trigger-enforced 250-entry cap per entity on all `*_state_history` tables.
- Version uniqueness: `controller_state_version` is unique per entity, preventing lost updates from concurrent state transitions.
Authorization Invariants
- Authentication required: All API endpoints require either Basic Auth or OAuth2 (MS Entra). No anonymous access.
- Casbin RBAC: Role-based access control policies are enforced at the handler level. Policy changes require API restart.
- mTLS between services: Inter-service gRPC communication requires mutual TLS. Certificate rotation is managed via Vault.
- Tenant isolation at VPC level: VPC resources are scoped to `tenant_organization_id`. Cross-tenant access is impossible by design.
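The handler-level RBAC check can be sketched in the spirit of Casbin policies as `(role, resource, action)` tuples evaluated before the handler runs. This is an illustration only, not the actual Casbin model or policy file used by the API.

```rust
// Minimal RBAC sketch: a request is allowed only if some policy tuple
// matches the caller's role, the target resource, and the action.
fn is_allowed(policies: &[(&str, &str, &str)], role: &str, obj: &str, act: &str) -> bool {
    policies
        .iter()
        .any(|&(r, o, a)| r == role && o == obj && a == act)
}

fn main() {
    // Hypothetical policy table; real policies live in Casbin configuration.
    let policies = [
        ("admin", "machines", "write"),
        ("viewer", "machines", "read"),
    ];
    assert!(is_allowed(&policies, "admin", "machines", "write"));
    assert!(!is_allowed(&policies, "viewer", "machines", "write"));
}
```

Because (per the invariant above) policy changes require an API restart, the policy set can be treated as immutable for the lifetime of the process.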
Concurrency & Consistency
- Optimistic concurrency on state: State transitions check `controller_state_version`. If the version has changed since the controller read it, the transition fails and retries.
- Single-writer per entity type: Named lock tables (`*_controller_lock`) ensure only one controller processes a given entity type's queue at a time.
- Stale claim detection: `processing_started_at` timestamps allow the system to reclaim work items from crashed controllers (timeout-based).
- PostgreSQL transaction isolation: State transitions run within transactions, ensuring atomicity of state + history + side-effect updates.
Idempotency
- State transitions are idempotent: If a controller processes a machine already in the target state, it's a no-op. This is critical for crash recovery.
- DPU agent operations are retry-safe: Network configuration, firmware checks, and health reports can be re-applied without side effects.
- Redfish BMC operations are NOT always idempotent: Firmware uploads and power operations have real physical effects. These are guarded by state checks before execution.
- DHCP lease operations are idempotent: Updating an existing lease with the same parameters is safe.
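The idempotent-transition rule above reduces to a simple check: applying a transition whose target equals the current state is a no-op. The names here are illustrative, not from the codebase.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Outcome {
    Applied,
    NoOp,
}

// Sketch: a replayed work item whose target state is already the current
// state changes nothing, which is what makes crash recovery safe.
fn apply_target(current: &mut String, target: &str) -> Outcome {
    if current.as_str() == target {
        return Outcome::NoOp; // replay after a crash: nothing to do
    }
    *current = target.to_string();
    Outcome::Applied
}

fn main() {
    let mut state = String::from("Discovering");
    assert_eq!(apply_target(&mut state, "Init"), Outcome::Applied);
    // The controller crashes after committing, restarts, and replays the
    // same work item; the second application is a no-op.
    assert_eq!(apply_target(&mut state, "Init"), Outcome::NoOp);
}
```

Non-idempotent operations (like the Redfish BMC firmware uploads noted above) cannot rely on this and instead guard execution with an explicit state check first.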
Failure Modes
Machine Provisioning Flow
| Step | What Can Fail | Current Handling | Correctness |
|---|---|---|---|
| DHCP Discovery | No DHCP response, duplicate MAC | Machine stays unpowered; DHCP server retries. Duplicate MACs detected via DB unique constraint. | Correct |
| BMC Exploration | BMC unreachable, wrong credentials | Exploration retries with backoff. Credential rotation via UpdateMachineCredentials. Machine stays in Created. | Correct |
| DPU Discovery | No DPUs found, DPU firmware incompatible | Transitions to Failed with FailureCause::Discovery. Admin can trigger DPU reprovisioning. | Correct |
| DPU OS Install (PXE) | PXE server down, image corrupt | DPU init retries. Timeout triggers Failed state. PXE health monitored separately. | Partial — no automatic PXE failover |
| Host Init | BIOS config failure, OS boot loop | Retry with BMC reset. HostReprovisionState tracks firmware upgrade attempts with retry_count. | Correct |
| Validation | GPU test failure, NVMe errors | Failed with FailureCause::MachineValidation. Machine quarantined. Specific test results stored. | Correct |
| TPM Attestation | Measurements don't match profile | Failed with MeasurementsFailedSignatureCheck, MeasurementsRevoked, or MeasurementsCAValidationFailed. | Correct — explicit failure variants |
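The retry handling in the table combines two mechanisms: a bounded `retry_count` and backoff between attempts. A minimal sketch of such a policy follows; the constants and the exponential shape are assumed values for illustration, not taken from the source.

```rust
// Illustrative retry policy: exponential backoff up to a bounded number of
// attempts, after which the entity transitions to Failed for admin attention.
const MAX_RETRIES: u32 = 5;       // assumed bound, not from the source
const BASE_DELAY_SECS: u64 = 30;  // assumed base delay

fn next_retry_delay(retry_count: u32) -> Option<u64> {
    if retry_count >= MAX_RETRIES {
        return None; // give up: stop retrying and surface the failure
    }
    // 30s, 60s, 120s, ...; the shift is capped so a large count cannot overflow
    Some(BASE_DELAY_SECS.saturating_mul(1u64 << retry_count.min(10)))
}

fn main() {
    assert_eq!(next_retry_delay(0), Some(30));
    assert_eq!(next_retry_delay(2), Some(120));
    assert_eq!(next_retry_delay(5), None); // retry bound reached
}
```

Returning `None` rather than retrying forever is what upholds the "retry bounds" invariant: reprovisioning loops terminate in an explicit Failed state instead of spinning.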
Instance Allocation Flow
| Step | What Can Fail | Current Handling | Correctness |
|---|---|---|---|
| Machine Selection | No Ready machines available | AllocateInstance returns error. Admin must provision more machines or wait for cleanup. | Correct |
| VPC Network Config | VNI exhaustion, DPU agent unreachable | Instance stays in provisioning state. DPU agent reconnection retries. VNI pool managed by resource_pool table. | Correct |
| DHCP Lease Update | Kea DHCP server down | Provisioning retries. DHCP server has its own health monitoring. | Partial — extended outage blocks provisioning |
| IB Partition Setup | UFM unreachable, partition key conflict | IB partition state machine handles retries. Conflicts reported as provisioning failure. | Correct |
Cascading Failures
- If PostgreSQL is down: All state transitions halt. No new machines can be provisioned. Existing running instances are unaffected (DPU agent continues operating). API returns 503 on all state-mutating operations.
- If Carbide API is down: DPU agents lose heartbeat but continue enforcing current network config. No new instances can be allocated. DHCP continues serving existing leases. Admin CLI cannot operate.
- If DHCP is down: New machines cannot get IP addresses. Existing leases continue until expiry. No cascade to running instances.
- If a DPU agent crashes: Network config persists on DPU hardware. Agent restarts and re-syncs state from API. Brief monitoring gap but no data plane disruption.
- If BMC is unreachable: Cannot power cycle or update firmware on that specific machine. Machine state controller retries with backoff. Other machines unaffected.
Architecture Strengths
- Compile-time safety across boundaries: Rust's type system + SQLx compile-time checks + protobuf codegen means entire classes of runtime errors are impossible. Schema changes are caught at build time.
- PostgreSQL as single source of truth: No distributed state to reconcile. JSONB flexibility with relational guarantees. Transactions ensure atomicity of complex state transitions.
- Zero-trust by design: DPU-enforced isolation is architecturally sound. The security model doesn't depend on host OS integrity, which is the correct assumption for multi-tenant bare-metal.
- Comprehensive audit trail: State history tables + health history provide complete observability into machine lifecycle. Critical for debugging fleet-wide issues.
- Modular crate structure: 65 crates with clear dependencies enable focused development. Teams can own specific crates without understanding the entire codebase.
- Operational simplicity: Despite the complexity, the deployment is just PostgreSQL + N Rust binaries + Kubernetes. No external message brokers, no Redis, no distributed consensus systems.
Architecture Risks
- PostgreSQL as bottleneck: All state transitions, queue polling, and history writes go through a single PostgreSQL instance. At fleet scale (10K+ machines with frequent state transitions), this could become a bottleneck. Mitigation: read replicas, connection pooling, queue polling optimization.
- State machine complexity: 18+ top-level machine states with deeply nested sub-states (e.g., `HostReprovisionState` has 8 variants, each with sub-states). Understanding all valid transitions requires reading multiple source files. Risk of unreachable states or unhandled transitions.
- Admin CLI size: 816 files mirroring the API surface. Changes to gRPC service definitions require updating both server handlers and CLI commands. High maintenance burden.
- JSONB schema evolution: While JSONB avoids ALTER TABLE, the state structs in Rust code must handle deserialization of old state formats. Backward-compatible state evolution requires careful engineering.
- Single-point failures in boot chain: PXE server, DHCP server, and DNS server are each single services. If PXE fails, no new machines can be provisioned. Mitigation: Kubernetes restarts, but no active-active redundancy.