bare-metal-manager-rest Deep-Dive Analysis
Design Decisions
1. Microservices Monorepo with Shared go.mod
What it is: All 10 microservices (API, RLA, workflow worker, site-agent, site-manager, IPAM, NVSwitch Manager, PowerShelf Manager, cert-manager, CLI) live in a single Go module with one go.mod at the root.
Where it appears: The root go.mod declares the module path github.com/NVIDIA/ncx-infra-controller-rest. Each service has its own cmd/ entry point and internal/ package tree.
What problem it solves: Shared types and utilities (common/, auth/, db/) can be imported directly without versioning headaches. Refactoring a shared type updates all consumers atomically in one commit.
What would break without it: Splitting into separate modules would require publishing internal packages, managing version compatibility between services, and coordinating multi-repo PRs for cross-cutting changes.
Alternatives: Multi-module monorepo (Go workspace), separate repositories per service. The monorepo was chosen because the services share significant internal types (DB models, auth, common config) and must evolve together.
2. Temporal for Workflow Orchestration
What it is: All long-running operations (firmware upgrades, instance provisioning, power control, inventory sync) are implemented as Temporal workflows rather than synchronous API calls or simple async job queues.
Where it appears: workflow/ (cloud workflows), site-workflow/ (site workflows), and the RLA task executor in rla/internal/task/executor/.
What problem it solves: Hardware operations can take minutes to hours (firmware flash, OS install). They need retries, timeouts, fan-out/fan-in, and crash recovery. Temporal provides all of this declaratively.
What would break without it: Replacing Temporal with a simple job queue would lose: automatic activity retries with backoff, workflow state persistence across crashes, saga-style compensation for partial failures, and the ability to query running workflow state.
Alternatives: Redis-based job queues (Asynq), database-backed state machines, Kubernetes Jobs. Temporal was chosen for its durability guarantees and native Go SDK support.
3. gRPC Sidecar Pattern for Hardware Managers
What it is: NVSwitch Manager and PowerShelf Manager run as gRPC sidecars alongside the RLA service rather than being called directly from the REST API.
Where it appears: rla/internal/nsmapi/ and rla/internal/psmapi/ connect to localhost gRPC endpoints. The RLA service acts as a unified facade.
What problem it solves: Different hardware types (GPU switches, power shelves, compute nodes) have completely different management protocols and firmware update mechanisms. The sidecar pattern keeps each manager independent while RLA provides a unified task abstraction.
What would break without it: Embedding all hardware-specific logic in RLA would create a monolithic service that's hard to test, deploy, and upgrade independently. Different hardware vendors could block each other.
Alternatives: Direct REST-to-hardware calls, plugin architecture within RLA, separate top-level services with independent APIs. The sidecar approach was chosen for deployment simplicity (co-located with RLA) while maintaining code isolation.
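The "unified facade" role RLA plays can be sketched without gRPC. In the real system each implementation wraps a localhost gRPC client (nsmapi, psmapi); the interface and type names below are illustrative assumptions, not the codebase's:

```go
package main

import (
	"errors"
	"fmt"
)

// HardwareManager is the unified abstraction the facade exposes.
type HardwareManager interface {
	UpgradeFirmware(componentID string) error
}

type nvswitchManager struct{}

func (nvswitchManager) UpgradeFirmware(id string) error {
	// Would call the NVSwitch Manager sidecar over localhost gRPC.
	fmt.Println("nvswitch upgrade:", id)
	return nil
}

type powerShelfManager struct{}

func (powerShelfManager) UpgradeFirmware(id string) error {
	// Would call the PowerShelf Manager sidecar over localhost gRPC.
	fmt.Println("powershelf upgrade:", id)
	return nil
}

// facade routes a task to the manager for its component type, so the
// REST API never needs to know hardware-specific protocols.
type facade struct {
	managers map[string]HardwareManager
}

func (f facade) Upgrade(componentType, id string) error {
	m, ok := f.managers[componentType]
	if !ok {
		return errors.New("unsupported component type: " + componentType)
	}
	return m.UpgradeFirmware(id)
}

func main() {
	f := facade{managers: map[string]HardwareManager{
		"nvswitch":   nvswitchManager{},
		"powershelf": powerShelfManager{},
	}}
	_ = f.Upgrade("nvswitch", "sw-01")
}
```

Adding a new hardware type (CDU, UMS) means adding one map entry and one sidecar, leaving the facade and the API untouched.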
4. Advisory Locks for Task Conflict Resolution
What it is: PostgreSQL advisory locks (pg_advisory_lock) are used to serialize task submissions per rack, preventing concurrent conflicting operations on the same hardware.
Where it appears: rla/internal/task/conflict/ - the ConflictResolver acquires advisory locks before checking for conflicts, and the ConflictPromoter auto-promotes waiting tasks.
What problem it solves: Two admins might simultaneously submit a power-off and firmware-upgrade targeting the same rack. Without serialization, both could start, leading to hardware damage or corrupted firmware.
What would break without it: Race conditions between task submissions. Regular database transactions only prevent data conflicts, not logical conflicts between concurrent hardware operations.
Alternatives: Application-level mutexes (single-instance only), distributed locks (Redis/etcd), optimistic concurrency with version columns. Advisory locks were chosen because they're built into PostgreSQL, support per-rack granularity, and automatically release on connection close.
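Advisory locks take a 64-bit key, so a per-rack lock needs a stable mapping from rack identifier to int64. A minimal sketch, assuming an FNV-1a hash (the real code may derive the key differently):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// rackLockKey derives a stable 64-bit advisory-lock key from a rack
// identifier. The hash choice (FNV-1a) is illustrative.
func rackLockKey(rackID string) int64 {
	h := fnv.New64a()
	h.Write([]byte(rackID))
	return int64(h.Sum64())
}

func main() {
	key := rackLockKey("rack-42")
	// The conflict resolver would then serialize per-rack submissions with:
	//   SELECT pg_advisory_xact_lock($1)  -- released at commit/rollback
	fmt.Printf("SELECT pg_advisory_xact_lock(%d)\n", key)
}
```

Because the key is derived deterministically, every submitter targeting the same rack contends for the same lock, while submissions to different racks proceed in parallel.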
5. Handler Dependency Injection Pattern
What it is: Each HTTP handler is a struct with constructor injection: NewXxxHandler(dbSession, temporalClient, siteClientPool, config). No global state or service locator.
Where it appears: All 86 handler files in api/pkg/api/handler/. Route registration in api/pkg/api/routes.go instantiates handlers with shared dependencies.
What problem it solves: Each handler explicitly declares its dependencies, making it testable (mock any dependency) and clear what external systems a given endpoint touches.
What would break without it: Global state or implicit service locators make it impossible to run handlers in isolation, lead to hidden coupling, and make the dependency graph opaque.
Alternatives: Wire/dig dependency injection frameworks, global singleton services, context-based injection. The manual constructor pattern was chosen for explicitness and Go idiom compliance.
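The constructor pattern can be shown with a single toy dependency; real handlers take the DB session, Temporal client, site client pool, and config, but the shape is identical. All names below are illustrative:

```go
package main

import "fmt"

// Clock is one illustrative dependency behind an interface.
type Clock interface{ Now() string }

type realClock struct{}

func (realClock) Now() string { return "2025-01-01T00:00:00Z" }

// HealthHandler declares its dependencies via constructor injection;
// nothing is reached through globals or a service locator.
type HealthHandler struct {
	clock Clock
}

func NewHealthHandler(c Clock) *HealthHandler {
	return &HealthHandler{clock: c}
}

func (h *HealthHandler) Handle() string {
	return "ok at " + h.clock.Now()
}

// In tests, any dependency can be swapped for a stub.
type fakeClock struct{ t string }

func (f fakeClock) Now() string { return f.t }

func main() {
	h := NewHealthHandler(realClock{})
	fmt.Println(h.Handle())
}
```

Route registration then becomes the single place where concrete dependencies are wired into every handler, which is exactly what makes the dependency graph visible.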
6. Dual Database Layer (Bun ORM + Raw pgx)
What it is: The main API uses Bun ORM for type-safe queries and model mapping, while the connection pool uses raw pgx for performance-critical operations like advisory locks and bulk operations.
Where it appears: db/pkg/db/ provides the session (pgx pool), Bun wraps it for ORM queries. Advisory locks use raw SQL. RLA and hardware managers also use Bun for their own schemas.
What problem it solves: Bun provides type safety and query building for standard CRUD, while pgx gives full control for PostgreSQL-specific features (advisory locks, LISTEN/NOTIFY, custom types).
What would break without it: Using only an ORM would lose access to PostgreSQL-specific features. Using only raw SQL would require manual model mapping and lose type safety.
Invariants
State Machine Constraints
- Task terminal states are final: Once a task reaches `completed`, `failed`, or `terminated`, its status cannot change. The `IsFinished()` method enforces this check.
- Only one active task per rack per component overlap: The conflict resolver ensures no two running tasks target overlapping components on the same rack. Waiting tasks have a TTL (`queue_expires_at`).
- Rack status progression is monotonic: `new` → `ingesting` → `ingested`. A rack cannot go backward from `ingested` to `new`.
- Firmware update states are terminal-checked: `IsTerminal()` returns true only for Completed, Failed, Cancelled. Active states are defined as non-terminal and non-queued.
- Instance status must follow valid transitions: Pending → Provisioning → Configuring → Ready. Error can be reached from any non-terminal state. Terminating is only reachable from Ready.
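The instance-status transitions can be encoded as a small table. A sketch following the states named above (the codebase's actual identifiers may differ, and the terminal handling of Terminating/Error is assumed here):

```go
package main

import "fmt"

// validNext encodes the instance-status transitions described above:
// the happy path Pending → Provisioning → Configuring → Ready,
// Error reachable from any non-terminal state, and Terminating
// reachable only from Ready.
var validNext = map[string][]string{
	"Pending":      {"Provisioning", "Error"},
	"Provisioning": {"Configuring", "Error"},
	"Configuring":  {"Ready", "Error"},
	"Ready":        {"Terminating", "Error"},
	"Terminating":  {}, // treated as terminal in this sketch
	"Error":        {}, // treated as terminal in this sketch
}

func canTransition(from, to string) bool {
	for _, s := range validNext[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("Pending", "Provisioning")) // true
	fmt.Println(canTransition("Pending", "Ready"))        // false: no skipping
}
```

Centralizing the table like this lets a single guard reject invalid writes instead of scattering status checks across handlers.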
Database Constraints
- Rack uniqueness: `UNIQUE(name)` and `UNIQUE(manufacturer, serial_number)` prevent duplicate rack registrations.
- Component uniqueness: `UNIQUE(manufacturer, serial_number)` and a partial unique index on `(type, external_id)` where `external_id IS NOT NULL`.
- BMC primary key is MAC address: Each BMC is uniquely identified by its hardware MAC address, preventing duplicate registrations.
- Operation rule defaults are unique per operation: A partial unique index ensures at most one default rule per `(operation_type, operation_code)` combination.
- Foreign key cascading: Components reference racks, BMCs reference components, tasks reference racks. Soft deletes (`deleted_at`) preserve referential integrity for audit.
Authorization Invariants
- Every API request (except `/healthz` and `/readyz`) must carry a valid JWT token. Expired or malformed tokens are rejected with 401.
- The `orgName` in the URL path must match the organization claim in the JWT. Cross-org access is impossible.
- Provider Admin can manage infrastructure (sites, racks, machines). Tenant Admin can manage tenant resources (instances, VPCs, SSH keys). Neither can access resources outside their role scope.
- Service accounts (machine-to-machine) bypass Keycloak but still require a valid JWT with appropriate claims.
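The path-versus-claim check is simple but central to tenant isolation. A minimal sketch, assuming an illustrative path shape of `/v1/orgs/{orgName}/...` (the real route layout may differ):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// orgFromPath extracts the orgName segment from a URL like
// /v1/orgs/{orgName}/instances. The path shape is an assumption.
func orgFromPath(path string) (string, error) {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	for i, p := range parts {
		if p == "orgs" && i+1 < len(parts) {
			return parts[i+1], nil
		}
	}
	return "", errors.New("no org in path")
}

// authorizeOrg enforces the invariant: the path org must equal the
// JWT's organization claim, or the request is rejected.
func authorizeOrg(path, jwtOrgClaim string) error {
	org, err := orgFromPath(path)
	if err != nil {
		return err
	}
	if org != jwtOrgClaim {
		return errors.New("403: org mismatch")
	}
	return nil
}

func main() {
	fmt.Println(authorizeOrg("/v1/orgs/acme/instances", "acme")) // <nil>
	fmt.Println(authorizeOrg("/v1/orgs/acme/instances", "evil")) // 403
}
```

Because the check compares against a signed claim rather than anything client-supplied, a caller cannot widen their scope by editing the URL.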
Concurrency & Consistency
- Advisory locks per rack: Task submission acquires `pg_advisory_lock(rack_id_hash)` to serialize conflict detection. The lock is released automatically on transaction commit/rollback.
- Transaction isolation for allocations: IP block allocation and instance creation use database transactions with advisory locks to prevent double allocation.
- Temporal workflow uniqueness: Each task gets a unique workflow ID. Temporal rejects duplicate workflow starts, preventing accidental re-execution.
- Max 5 waiting tasks per rack: The `MaxWaitingTasksPerRack` constant prevents queue saturation. Beyond 5, new tasks are rejected even with the queue strategy.
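The queue cap can be sketched in a few lines. Function and variable names below are illustrative; only the constant's name and value come from the text:

```go
package main

import (
	"errors"
	"fmt"
)

// maxWaitingTasksPerRack mirrors the MaxWaitingTasksPerRack constant.
const maxWaitingTasksPerRack = 5

// enqueue applies the cap: once a rack already has 5 waiting tasks,
// new submissions are rejected even when the caller asked to queue.
func enqueue(waiting []string, task string) ([]string, error) {
	if len(waiting) >= maxWaitingTasksPerRack {
		return waiting, errors.New("rack queue saturated")
	}
	return append(waiting, task), nil
}

func main() {
	q := []string{"t1", "t2", "t3", "t4", "t5"}
	_, err := enqueue(q, "t6")
	fmt.Println(err) // sixth task rejected
}
```

Combined with the per-task TTL, the cap bounds both the depth and the age of the waiting queue.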
Idempotency
- Inventory PatchRack is idempotent: Re-syncing the same inventory data produces no changes. Components are matched by serial number and upserted.
- Task submission is NOT idempotent: Each submission creates a new task. The conflict resolver handles deduplication at the operational level (reject if duplicate operation running).
- Firmware update polling is idempotent: Checking firmware status multiple times returns the same state without side effects.
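The serial-number matching that makes PatchRack idempotent can be modeled with a map standing in for the components table. A map-based sketch of the upsert the text describes (field names are assumptions):

```go
package main

import "fmt"

type component struct {
	SerialNumber string
	Firmware     string
}

// upsertBySerial merges incoming inventory into the existing set,
// keyed by serial number, so re-syncing identical data is a no-op.
func upsertBySerial(existing map[string]component, incoming []component) (changed int) {
	for _, c := range incoming {
		if old, ok := existing[c.SerialNumber]; ok && old == c {
			continue // identical row: idempotent, nothing to write
		}
		existing[c.SerialNumber] = c
		changed++
	}
	return changed
}

func main() {
	inv := map[string]component{"SN1": {"SN1", "1.0"}}
	sync := []component{{"SN1", "1.0"}, {"SN2", "2.1"}}
	fmt.Println(upsertBySerial(inv, sync)) // 1: only SN2 is new
	fmt.Println(upsertBySerial(inv, sync)) // 0: re-sync changes nothing
}
```

The second call returning zero changes is the idempotency property: repeated syncs of the same inventory converge instead of churning rows.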
Failure Modes
Firmware Upgrade Flow
| Step | What Can Fail | Current Handling | Correctness |
|---|---|---|---|
| JWT Validation | Expired token, invalid signature, JWKS endpoint down | 401 Unauthorized returned immediately. JWKS is cached with TTL. | Correct |
| RLA gRPC Call | RLA service unavailable, network timeout | gRPC returns UNAVAILABLE. API returns 503 to client. | Correct |
| Identifier Resolution | Serial number not found in inventory | Task creation fails with descriptive error. No partial task created. | Correct |
| Advisory Lock Acquisition | Lock held by crashed process | Advisory locks auto-release on connection close. Timeout after configurable wait. | Correct |
| Temporal Workflow Start | Temporal cluster unavailable | Task remains in 'pending' state. Executor retries with backoff. | Partial - task may stay pending indefinitely if Temporal is down long-term |
| BMC Firmware Flash | BMC unreachable, firmware corrupt, power loss during flash | Activity retries (3 attempts). On final failure, task marked 'failed'. BMC may be in inconsistent state. | Partial - no automatic rollback of partial firmware writes |
| Conflict Promoter | Promoter goroutine crashes, DB connection lost | Waiting tasks have TTL (queue_expires_at). Expired tasks are not promoted. Next task submission re-triggers promotion check. | Correct - TTL prevents indefinite waiting |
Instance Provisioning Flow
| Step | What Can Fail | Current Handling | Correctness |
|---|---|---|---|
| IP Allocation (IPAM) | Subnet exhausted, IPAM service down | Transaction rolled back. Instance not created. Client gets 500 or specific error. | Correct - atomic with DB transaction |
| Machine Selection | No available machines matching instance type | Handler returns 409 Conflict with descriptive message. | Correct |
| OS Installation | PXE boot failure, image download timeout, disk error | Temporal activity retries. After max retries, instance status set to Error. | Correct - but machine may need manual recovery |
| Phone-Home Callback | Network misconfiguration, callback endpoint unreachable | Temporal workflow has activity timeout. Instance stuck in Configuring, eventually times out to Error. | Partial - timeout may be too long, leaving instance in limbo |
Cascading Failures
- If PostgreSQL is down: All services fail. API returns 503. Tasks cannot be created or updated. Temporal workflows stall on DB activities. The system is fully dependent on PostgreSQL availability.
- If Temporal is down: New workflow submissions fail, but existing workflows resume when Temporal recovers (durable execution). API can still serve read requests. Task creation succeeds but execution stalls.
- If Keycloak is down: All authenticated API requests fail (401). JWKS cache provides a grace period (minutes). Service-account tokens with long TTL continue to work until expiry.
- If a gRPC sidecar (NSM/PSM) is down: Only that hardware type is affected. Firmware upgrades for the affected type fail; other types proceed normally. RLA handles partial failures per component type.
Architecture Strengths
- Temporal-based durability: Hardware operations survive process crashes, restarts, and network partitions. Workflows resume exactly where they left off.
- Clean conflict resolution: The advisory lock + queue + TTL pattern prevents dangerous concurrent operations on hardware while allowing controlled queuing.
- Multi-tenant isolation: Organization-scoped URLs, JWT-enforced access, and DB-level tenant isolation prevent cross-tenant data leaks.
- Extensible hardware support: The sidecar pattern makes it straightforward to add new hardware types (CDU, UMS) without modifying the core RLA or API services.
- Comprehensive observability: OpenTelemetry tracing, Prometheus metrics, structured logging (Zerolog), Sentry error tracking, and audit trails provide full visibility.
- Strong CI/CD pipeline: 100+ linters, automated security scanning (TruffleHog, Trivy), integration tests with real PostgreSQL, and Docker image builds ensure code quality.
Architecture Risks
- Single PostgreSQL dependency: All services share one PostgreSQL instance (different schemas but same cluster). A database outage is a total system failure. Consider read replicas or per-service databases for critical paths.
- No firmware rollback automation: If a firmware flash fails mid-write, the BMC may be in an inconsistent state. There's no automated recovery path; manual intervention is required.
- Temporal as single point of failure for operations: While Temporal itself is designed for HA, the deployment depends on a single Temporal cluster. If it's misconfigured or under-provisioned, all async operations stall.
- Advisory lock scope is per-rack only: If a component moves between racks (physically relocated), the lock granularity doesn't protect against conflicts during the transition period.
- 110+ migrations in the main DB: The main infrastructure database has grown to 110+ migrations. Schema evolution is becoming complex; consider migration squashing or versioned schema snapshots.