ncx-infra-controller-rest Deep-Dive Analysis

Design Decisions

1. Declarative Conflict Rules for Task Scheduling

What it is: A hardcoded set of conflict rules that declare which operation types cannot coexist on the same rack, with scoping at both rack-level and component-level granularity.

Where it appears: rla/internal/task/conflict/conflict.go — the conflict engine that evaluates incoming tasks against active tasks.

What problem it solves: Prevents dangerous hardware state conflicts. For example, powering on a rack while firmware is being updated could brick NVSwitches. PowerShelf power operations block ALL rack operations because de-energizing the power shelf affects every component.

What would break without it: Concurrent power and firmware operations could damage hardware, corrupt firmware, or leave racks in unrecoverable states. The $200K+ cost per GPU rack makes this a critical safety mechanism.

Alternatives: Database-level pessimistic locking (too coarse), optimistic concurrency (too risky for hardware), or distributed locking (unnecessary complexity since RLA is the single coordinator per rack). The declarative approach is the right balance: explicit, auditable, and easy to extend.
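The rule table described above can be sketched as a small Go structure. This is illustrative only — the type names, rule contents, and scope values are assumptions, not copied from conflict.go:

```go
package main

import "fmt"

// Scope controls whether a rule blocks the whole rack or a single component.
type Scope int

const (
	RackScope Scope = iota
	ComponentScope
)

// Rule declares that an incoming operation type cannot start while any of
// the listed active operation types hold the given scope.
type Rule struct {
	Incoming string
	Blockers []string
	Scope    Scope
}

// A small, hardcoded rule table in the spirit of conflict.go
// (operation names hypothetical).
var rules = []Rule{
	{Incoming: "power_on", Blockers: []string{"firmware_update", "power_shelf_power"}, Scope: RackScope},
	{Incoming: "firmware_update", Blockers: []string{"power_on", "power_off"}, Scope: RackScope},
}

// Conflicts reports whether an incoming operation conflicts with an
// already-active operation on the same rack.
func Conflicts(incoming, active string) bool {
	for _, r := range rules {
		if r.Incoming != incoming {
			continue
		}
		for _, b := range r.Blockers {
			if b == active {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(Conflicts("power_on", "firmware_update")) // true
	fmt.Println(Conflicts("power_on", "inventory_sync"))  // false
}
```

Because the table is plain data, adding a new conflict pair is a one-line change that shows up clearly in code review — the auditability property noted above.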

2. Task Queuing with Promotion

What it is: When a task conflicts with an active task, it's stored with WAITING status and a TTL. A background promoter goroutine watches for completions and promotes the oldest waiting task.

Where it appears: rla/internal/task/conflict/promoter.go with a buffered notification channel (64 items) and a 5-minute sweep interval for recovery.

What problem it solves: Operators need to queue firmware upgrades across many racks without babysitting each one. Without queuing, they'd need to manually retry rejected tasks.

What would break without it: Every conflicting task would be rejected, forcing operators to build their own retry/scheduling logic. A cap of five waiting tasks per rack prevents unbounded queue growth.

Alternatives: External job scheduler (adds operational complexity), Temporal's built-in scheduling (doesn't understand rack-level conflicts), or a priority queue (premature — current FIFO with expiry is sufficient).

3. Operation Rules Engine

What it is: Database-stored JSON rules that define the sequence of steps for each operation type. Each step specifies a component type, batch size, pre/main/post actions, and timeouts.

Where it appears: rla/internal/task/operationrules/ — resolver loads rules from DB with fallback to hardcoded defaults.

What problem it solves: Different rack configurations require different power-on sequences. Some racks need PowerShelf first, others need NVSwitch first. Rules make this configurable without code changes.

What would break without it: Every rack configuration change would require a code deployment. The rack-specific override pattern (rack_rule_associations) allows per-rack customization.

Alternatives: Hardcoded sequences (inflexible), Temporal workflow composition (too complex for simple ordering changes), or a full workflow DSL (over-engineering for the current need).

4. Component Manager Abstraction

What it is: A registry of component-type-specific managers that implement a common interface (PowerControl, FirmwareControl, etc.), each backed by a different external API.

Where it appears: rla/internal/task/componentmanager/ with providers in providers/carbide/, providers/psm/, providers/nvswitchmanager/.

What problem it solves: The same logical operation (power on) requires completely different API calls depending on whether the target is a compute node (Carbide), power shelf (PSM), or NVSwitch (NSM).

What would break without it: Temporal workflows would need component-type-specific branching throughout, making them unmaintainable. Adding a new component type (e.g., CDU) would require modifying every workflow.

Alternatives: Unified hardware API (not realistic — each device type has fundamentally different management protocols), or type-switch in workflows (fragile and hard to test).

5. NVSwitch Firmware State Machine with Worker Pool

What it is: A multi-phase state machine (QUEUED→UPLOAD→INSTALL→VERIFY→CLEANUP→COMPLETED) with strategy-specific paths (Redfish, SSH, Script) and async polling via exec_context.

Where it appears: nvswitch-manager/pkg/firmwaremanager/types.go — state definitions, and the worker pool that drives transitions.

What problem it solves: Firmware updates are inherently asynchronous (Redfish returns task URIs, SSH commands take minutes). The state machine persists progress so updates survive service restarts.

What would break without it: Any service restart during firmware update would lose track of in-progress updates, potentially leaving switches in half-updated states.

Alternatives: Temporal for each switch update (too many workflows for hundreds of switches), synchronous blocking (wastes resources), or fire-and-forget (loses visibility).
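The happy-path transitions can be sketched as a transition map. This is a simplification: the real machine also has strategy-specific paths (Redfish, SSH, Script) and intermediate WAIT states, and persists each transition to the DB:

```go
package main

import "fmt"

type State string

const (
	Queued    State = "QUEUED"
	Upload    State = "UPLOAD"
	Install   State = "INSTALL"
	Verify    State = "VERIFY"
	Cleanup   State = "CLEANUP"
	Completed State = "COMPLETED"
	Failed    State = "FAILED"
)

// next encodes the happy-path chain from the text; any state may also
// transition to FAILED on error.
var next = map[State]State{
	Queued:  Upload,
	Upload:  Install,
	Install: Verify,
	Verify:  Cleanup,
	Cleanup: Completed,
}

// advance returns the next state; a real worker would persist the new state
// before acting on it, which is what makes restarts safe.
func advance(s State, ok bool) State {
	if !ok {
		return Failed
	}
	n, exists := next[s]
	if !exists {
		return s // terminal state
	}
	return n
}

func main() {
	s := Queued
	for s != Completed && s != Failed {
		s = advance(s, true)
		fmt.Println(s)
	}
}
```

Because the current state lives in the DB rather than in goroutine stack frames, a restarted worker can reload it and call advance from wherever the previous process stopped.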

6. Echo Framework with Route Struct Pattern

What it is: Each handler is a struct implementing RequestHandler interface, with routes defined as Route{Path, Method, Handler} structs registered via e.Add().

Where it appears: api/pkg/api/util.go (interface), api/internal/server/server.go (registration), api/pkg/api/handler/ (43 implementations).

What problem it solves: Separates HTTP routing from business logic. Each handler struct holds its dependencies (DB session, Temporal client, site client pool), making them independently testable.

What would break without it: Dependencies would need to be global or passed through context, making testing harder and creating implicit coupling.

Alternatives: Functional handlers with closure-captured deps (works but less structured), DI framework (Go community prefers explicit wiring), or generated server from OpenAPI (the project has a spec but generates SDK clients, not server code).


Invariants

State Machine Constraints

Database Constraints

Authorization Invariants

Concurrency & Consistency

Idempotency


Failure Modes

Rack Power-On Flow

| Step | What Can Fail | Current Handling | Correctness |
|---|---|---|---|
| JWT validation | Expired/invalid token, Keycloak unavailable | 401 Unauthorized returned immediately | Correct |
| Conflict detection | DB connection failure during active task query | Task creation fails, 500 returned to client | Correct — fails safe (no task created) |
| Rule resolution | No matching rule and no default | Falls back to hardcoded defaults in resolver_defaults.go | Correct — always has a fallback |
| Temporal workflow start | Temporal server unavailable | Task stays PENDING, executor returns error. No automatic retry. | Partial — task is stuck PENDING with no recovery mechanism |
| PowerShelf power-on | PSM unreachable or shelf hardware fault | Temporal retries activity 3x with backoff. Fails workflow on exhaustion. | Correct — task marked FAILED, promoter unblocks queue |
| Compute power-on | Carbide API timeout, partial node failure | Activity retried. On partial failure, entire batch fails. | Partial — no partial success handling; all-or-nothing per batch |
| Task status update | DB unavailable when workflow completes | Temporal will retry the final activity that updates task status | Correct — Temporal durability guarantees completion |

NVSwitch Firmware Update Flow

| Step | What Can Fail | Current Handling | Correctness |
|---|---|---|---|
| Firmware upload (Redfish) | BMC rejects image, network timeout | Worker marks update FAILED, cancels dependent updates (predecessor chain) | Correct — cascade cancellation prevents partial upgrade |
| Firmware install | Installation fails, switch becomes unresponsive | Deadline-based timeout in exec_context. Marked FAILED after deadline. | Correct |
| Version verification | Version doesn't match target after install | Update marked FAILED with version mismatch error | Correct — prevents silent failures |
| Worker pool crash | NSM service restarts during update | State machine persists in DB. Worker resumes from last persisted state. | Correct — this is the key advantage of the state machine pattern |
| SSH reachability (NVOS) | Switch unreachable after reboot for NVOS update | WAIT_REACHABLE state with BecameUnreachableAt tracking and deadline | Correct — handles expected post-reboot unreachability |

Cascading Failures


Architecture Strengths

  1. Temporal for durability: Long-running hardware operations (firmware, bringup) survive service restarts. This is critical for operations that take 10+ minutes and cannot be simply retried from scratch.
  2. Component manager abstraction: Clean separation between orchestration logic (what to do) and hardware specifics (how to do it). Adding a new component type (e.g., CDU cooling) requires only a new provider implementation.
  3. Conflict detection prevents hardware damage: The declarative rules make dangerous concurrent operations impossible. This is a hard safety requirement for infrastructure at this cost level.
  4. Drift detection provides continuous validation: The inventory sync loop catches configuration drift before it causes incidents, enabling proactive remediation.
  5. Multi-service isolation: Each hardware domain (compute, NVSwitch, power) has its own service with independent databases and deployment lifecycle. A bug in NSM cannot take down the core API.
  6. OpenAPI spec as source of truth: The 727KB OpenAPI 3.1.0 spec enables SDK generation, documentation, and contract testing against the actual implementation.

Architecture Risks

  1. Stuck PENDING tasks: If Temporal is unreachable when a task is created, it stays PENDING with no automatic recovery. The promoter only watches for completed tasks, not stuck pending ones. A health check or timeout on PENDING tasks would close this gap.
  2. All-or-nothing batch operations: If 7 of 8 compute nodes power on but 1 fails, the entire batch fails. Partial success handling would improve resilience for large racks.
  3. Hardcoded conflict rules: While the operation rules are database-configurable, the conflict rules are hardcoded in Go. If a new component type needs different conflict semantics, it requires a code change and redeployment.
  4. Single-writer RLA per rack: The conflict detection assumes a single RLA instance manages each rack. If two RLA instances process the same rack (e.g., during rolling update), the conflict check has a race window between read and write.
  5. gRPC fan-out latency: The API server fans out to RLA, which fans out to Carbide/PSM/NSM. Three hops of gRPC add latency and failure points. For read-heavy operations (list racks with status), this could become a bottleneck.