ncx-infra-controller-rest Deep-Dive Analysis

Design Decisions

1. Declarative Conflict Rules for Task Scheduling

What it is: A hardcoded set of conflict rules that declare which operation types cannot coexist on the same rack, with scoping at both rack-level and component-level granularity.

Where it appears: rla/internal/task/conflict/conflict.go — the conflict engine that evaluates incoming tasks against active tasks.

What problem it solves: Prevents dangerous hardware state conflicts. For example, powering on a rack while firmware is being updated could brick NVSwitches. PowerShelf power operations block ALL rack operations because de-energizing the power shelf affects every component.

What would break without it: Concurrent power and firmware operations could damage hardware, corrupt firmware, or leave racks in unrecoverable states. The $200K+ cost per GPU rack makes this a critical safety mechanism.

Alternatives: Database-level pessimistic locking (too coarse), optimistic concurrency (too risky for hardware), or distributed locking (unnecessary complexity since RLA is the single coordinator per rack). The declarative approach is the right balance: explicit, auditable, and easy to extend.
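The rule table described above can be sketched as a small Go structure. This is illustrative only — the type names, rule contents, and scope values are assumptions, not copied from conflict.go:

```go
package main

import "fmt"

// Scope controls whether a rule blocks the whole rack or a single component.
type Scope int

const (
	RackScope Scope = iota
	ComponentScope
)

// Rule declares that an incoming operation type cannot start while any of
// the listed active operation types hold the given scope.
type Rule struct {
	Incoming string
	Blockers []string
	Scope    Scope
}

// A small, hardcoded rule table in the spirit of conflict.go
// (operation names hypothetical).
var rules = []Rule{
	{Incoming: "power_on", Blockers: []string{"firmware_update", "power_shelf_power"}, Scope: RackScope},
	{Incoming: "firmware_update", Blockers: []string{"power_on", "power_off"}, Scope: RackScope},
}

// Conflicts reports whether an incoming operation conflicts with an
// already-active operation on the same rack.
func Conflicts(incoming, active string) bool {
	for _, r := range rules {
		if r.Incoming != incoming {
			continue
		}
		for _, b := range r.Blockers {
			if b == active {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(Conflicts("power_on", "firmware_update")) // true
	fmt.Println(Conflicts("power_on", "inventory_sync"))  // false
}
```

Because the table is plain data, adding a new conflict pair is a one-line change that shows up clearly in code review — the auditability property noted above.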

2. Task Queuing with Promotion

What it is: When a task conflicts with an active task, it's stored with WAITING status and a TTL. A background promoter goroutine watches for completions and promotes the oldest waiting task.

Where it appears: rla/internal/task/conflict/promoter.go with a buffered notification channel (64 items) and a 5-minute sweep interval for recovery.

What problem it solves: Operators need to queue firmware upgrades across many racks without babysitting each one. Without queuing, they'd need to manually retry rejected tasks.

What would break without it: Every conflicting task would be rejected, forcing operators to build their own retry/scheduling logic. A cap of five waiting tasks per rack prevents unbounded queue growth.

Alternatives: External job scheduler (adds operational complexity), Temporal's built-in scheduling (doesn't understand rack-level conflicts), or a priority queue (premature — current FIFO with expiry is sufficient).

3. Operation Rules Engine

What it is: Database-stored JSON rules that define the sequence of steps for each operation type. Each step specifies a component type, batch size, pre/main/post actions, and timeouts.

Where it appears: rla/internal/task/operationrules/ — resolver loads rules from DB with fallback to hardcoded defaults.

What problem it solves: Different rack configurations require different power-on sequences. Some racks need PowerShelf first, others need NVSwitch first. Rules make this configurable without code changes.

What would break without it: Every rack configuration change would require a code deployment. The rack-specific override pattern (rack_rule_associations) allows per-rack customization.

Alternatives: Hardcoded sequences (inflexible), Temporal workflow composition (too complex for simple ordering changes), or a full workflow DSL (over-engineering for the current need).

4. Component Manager Abstraction

What it is: A registry of component-type-specific managers that implement a common interface (PowerControl, FirmwareControl, etc.), each backed by a different external API.

Where it appears: rla/internal/task/componentmanager/ with providers in providers/carbide/, providers/psm/, providers/nvswitchmanager/.

What problem it solves: The same logical operation (power on) requires completely different API calls depending on whether the target is a compute node (Carbide), power shelf (PSM), or NVSwitch (NSM).

What would break without it: Temporal workflows would need component-type-specific branching throughout, making them unmaintainable. Adding a new component type (e.g., CDU) would require modifying every workflow.

Alternatives: Unified hardware API (not realistic — each device type has fundamentally different management protocols), or type-switch in workflows (fragile and hard to test).

5. NVSwitch Firmware State Machine with Worker Pool

What it is: A multi-phase state machine (QUEUED→UPLOAD→INSTALL→VERIFY→CLEANUP→COMPLETED) with strategy-specific paths (Redfish, SSH, Script) and async polling via exec_context.

Where it appears: nvswitch-manager/pkg/firmwaremanager/types.go — state definitions, and the worker pool that drives transitions.

What problem it solves: Firmware updates are inherently asynchronous (Redfish returns task URIs, SSH commands take minutes). The state machine persists progress so updates survive service restarts.

What would break without it: Any service restart during firmware update would lose track of in-progress updates, potentially leaving switches in half-updated states.

Alternatives: Temporal for each switch update (too many workflows for hundreds of switches), synchronous blocking (wastes resources), or fire-and-forget (loses visibility).
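The happy-path transitions can be sketched as a transition map. This is a simplification: the real machine also has strategy-specific paths (Redfish, SSH, Script) and intermediate WAIT states, and persists each transition to the DB:

```go
package main

import "fmt"

type State string

const (
	Queued    State = "QUEUED"
	Upload    State = "UPLOAD"
	Install   State = "INSTALL"
	Verify    State = "VERIFY"
	Cleanup   State = "CLEANUP"
	Completed State = "COMPLETED"
	Failed    State = "FAILED"
)

// next encodes the happy-path chain from the text; any state may also
// transition to FAILED on error.
var next = map[State]State{
	Queued:  Upload,
	Upload:  Install,
	Install: Verify,
	Verify:  Cleanup,
	Cleanup: Completed,
}

// advance returns the next state; a real worker would persist the new state
// before acting on it, which is what makes restarts safe.
func advance(s State, ok bool) State {
	if !ok {
		return Failed
	}
	n, exists := next[s]
	if !exists {
		return s // terminal state
	}
	return n
}

func main() {
	s := Queued
	for s != Completed && s != Failed {
		s = advance(s, true)
		fmt.Println(s)
	}
}
```

Because the current state lives in the DB rather than in goroutine stack frames, a restarted worker can reload it and call advance from wherever the previous process stopped.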

6. Echo Framework with Route Struct Pattern

What it is: Each handler is a struct implementing RequestHandler interface, with routes defined as Route{Path, Method, Handler} structs registered via e.Add().

Where it appears: api/pkg/api/util.go (interface), api/internal/server/server.go (registration), api/pkg/api/handler/ (43 implementations).

What problem it solves: Separates HTTP routing from business logic. Each handler struct holds its dependencies (DB session, Temporal client, site client pool), making them independently testable.

What would break without it: Dependencies would need to be global or passed through context, making testing harder and creating implicit coupling.

Alternatives: Functional handlers with closure-captured deps (works but less structured), DI framework (Go community prefers explicit wiring), or generated server from OpenAPI (the project has a spec but generates SDK clients, not server code).


Invariants

State Machine Constraints

Database Constraints

Authorization Invariants

Concurrency & Consistency

Idempotency


Failure Modes

Rack Power-On Flow

| Step | What Can Fail | Current Handling | Correctness |
|---|---|---|---|
| JWT validation | Expired/invalid token, Keycloak unavailable | 401 Unauthorized returned immediately | Correct |
| Conflict detection | DB connection failure during active task query | Task creation fails, 500 returned to client | Correct — fails safe (no task created) |
| Rule resolution | No matching rule and no default | Falls back to hardcoded defaults in resolver_defaults.go | Correct — always has a fallback |
| Temporal workflow start | Temporal server unavailable | Task stays PENDING, executor returns error. No automatic retry. | Partial — task is stuck PENDING with no recovery mechanism |
| PowerShelf power-on | PSM unreachable or shelf hardware fault | Temporal retries activity 3x with backoff. Fails workflow on exhaustion. | Correct — task marked FAILED, promoter unblocks queue |
| Compute power-on | Carbide API timeout, partial node failure | Activity retried. On partial failure, entire batch fails. | Partial — no partial success handling; all-or-nothing per batch |
| Task status update | DB unavailable when workflow completes | Temporal will retry the final activity that updates task status | Correct — Temporal durability guarantees completion |

NVSwitch Firmware Update Flow

| Step | What Can Fail | Current Handling | Correctness |
|---|---|---|---|
| Firmware upload (Redfish) | BMC rejects image, network timeout | Worker marks update FAILED, cancels dependent updates (predecessor chain) | Correct — cascade cancellation prevents partial upgrade |
| Firmware install | Installation fails, switch becomes unresponsive | Deadline-based timeout in exec_context. Marked FAILED after deadline. | Correct |
| Version verification | Version doesn't match target after install | Update marked FAILED with version mismatch error | Correct — prevents silent failures |
| Worker pool crash | NSM service restarts during update | State machine persists in DB. Worker resumes from last persisted state. | Correct — this is the key advantage of the state machine pattern |
| SSH reachability (NVOS) | Switch unreachable after reboot for NVOS update | WAIT_REACHABLE state with BecameUnreachableAt tracking and deadline | Correct — handles expected post-reboot unreachability |

Cascading Failures


Architecture Strengths

  1. Temporal for durability: Long-running hardware operations (firmware, bringup) survive service restarts. This is critical for operations that take 10+ minutes and cannot be simply retried from scratch.
  2. Component manager abstraction: Clean separation between orchestration logic (what to do) and hardware specifics (how to do it). Adding a new component type (e.g., CDU cooling) requires only a new provider implementation.
  3. Conflict detection prevents hardware damage: The declarative rules make dangerous concurrent operations impossible. This is a hard safety requirement for infrastructure at this cost level.
  4. Drift detection provides continuous validation: The inventory sync loop catches configuration drift before it causes incidents, enabling proactive remediation.
  5. Multi-service isolation: Each hardware domain (compute, NVSwitch, power) has its own service with independent databases and deployment lifecycle. A bug in NSM cannot take down the core API.
  6. OpenAPI spec as source of truth: The 727KB OpenAPI 3.1.0 spec enables SDK generation, documentation, and contract testing against the actual implementation.

Architecture Risks

  1. Stuck PENDING tasks: If Temporal is unreachable when a task is created, it stays PENDING with no automatic recovery. The promoter only watches for completed tasks, not stuck pending ones. A health check or timeout on PENDING tasks would close this gap.
  2. All-or-nothing batch operations: If 7 of 8 compute nodes power on but 1 fails, the entire batch fails. Partial success handling would improve resilience for large racks.
  3. Hardcoded conflict rules: While the operation rules are database-configurable, the conflict rules are hardcoded in Go. If a new component type needs different conflict semantics, it requires a code change and redeployment.
  4. Single-writer RLA per rack: The conflict detection assumes a single RLA instance manages each rack. If two RLA instances process the same rack (e.g., during rolling update), the conflict check has a race window between read and write.
  5. gRPC fan-out latency: The API server fans out to RLA, which fans out to Carbide/PSM/NSM. Three hops of gRPC add latency and failure points. For read-heavy operations (list racks with status), this could become a bottleneck.