bare-metal-manager-rest Deep-Dive Analysis

Design Decisions

1. Microservices Monorepo with Shared go.mod

What it is: All 10 microservices (API, RLA, workflow worker, site-agent, site-manager, IPAM, NVSwitch Manager, PowerShelf Manager, cert-manager, CLI) live in a single Go module with one go.mod at the root.

Where it appears: Root go.mod with github.com/NVIDIA/ncx-infra-controller-rest module path. Each service has its own cmd/ entry point and internal/ package tree.

What problem it solves: Shared types and utilities (common/, auth/, db/) can be imported directly without versioning headaches. Refactoring a shared type updates all consumers atomically in one commit.

What would break without it: Splitting into separate modules would require publishing internal packages, managing version compatibility between services, and coordinating multi-repo PRs for cross-cutting changes.

Alternatives: Multi-module monorepo (Go workspace), separate repositories per service. The monorepo was chosen because the services share significant internal types (DB models, auth, common config) and must evolve together.
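Concretely, the single-module layout described above looks roughly like this (a sketch assembled from the paths mentioned in this document, not an exact listing of the repository):

```
.
├── go.mod          # module github.com/NVIDIA/ncx-infra-controller-rest
├── common/         # shared utilities, imported directly by every service
├── auth/           # shared auth
├── db/             # shared DB session + models
├── api/
│   └── cmd/        # API service entry point
├── rla/
│   └── cmd/        # RLA service entry point
└── workflow/       # cloud workflow worker
```

Because everything resolves within one module path, a change to a type in `db/` is picked up by every `cmd/` binary at the next build, with no version bump or publish step.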

2. Temporal for Workflow Orchestration

What it is: All long-running operations (firmware upgrades, instance provisioning, power control, inventory sync) are implemented as Temporal workflows rather than synchronous API calls or simple async job queues.

Where it appears: workflow/ (cloud workflows), site-workflow/ (site workflows), and the RLA task executor in rla/internal/task/executor/.

What problem it solves: Hardware operations can take minutes to hours (firmware flash, OS install). They need retries, timeouts, fan-out/fan-in, and crash recovery. Temporal provides all of this declaratively.

What would break without it: Replacing Temporal with a simple job queue would lose: automatic activity retries with backoff, workflow state persistence across crashes, saga-style compensation for partial failures, and the ability to query running workflow state.

Alternatives: Redis-based job queues (Asynq), database-backed state machines, Kubernetes Jobs. Temporal was chosen for its durability guarantees and native Go SDK support.

3. gRPC Sidecar Pattern for Hardware Managers

What it is: NVSwitch Manager and PowerShelf Manager run as gRPC sidecars alongside the RLA service rather than being called directly from the REST API.

Where it appears: rla/internal/nsmapi/ and rla/internal/psmapi/ connect to localhost gRPC endpoints. The RLA service acts as a unified facade.

What problem it solves: Different hardware types (GPU switches, power shelves, compute nodes) have completely different management protocols and firmware update mechanisms. The sidecar pattern keeps each manager independent while RLA provides a unified task abstraction.

What would break without it: Embedding all hardware-specific logic in RLA would create a monolithic service that is hard to test, deploy, and upgrade independently, and a change for one hardware vendor's integration could block releases for every other vendor's.

Alternatives: Direct REST-to-hardware calls, plugin architecture within RLA, separate top-level services with independent APIs. The sidecar approach was chosen for deployment simplicity (co-located with RLA) while maintaining code isolation.
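The facade shape described above can be sketched as follows. All names here are hypothetical; in the real system the managers sit behind localhost gRPC connections (rla/internal/nsmapi/, rla/internal/psmapi/), whereas this sketch models them as plain interfaces to show the routing idea only:

```go
package main

import "fmt"

// HardwareManager is the unified abstraction RLA dispatches through;
// each sidecar (NVSwitch Manager, PowerShelf Manager) implements it.
type HardwareManager interface {
	UpgradeFirmware(componentID, version string) (string, error)
}

type nvswitchManager struct{} // stands in for the NVSwitch Manager gRPC client

func (nvswitchManager) UpgradeFirmware(id, v string) (string, error) {
	return fmt.Sprintf("nvswitch %s -> %s", id, v), nil
}

type powerShelfManager struct{} // stands in for the PowerShelf Manager gRPC client

func (powerShelfManager) UpgradeFirmware(id, v string) (string, error) {
	return fmt.Sprintf("powershelf %s -> %s", id, v), nil
}

// rla is the facade: one task entry point, routing keyed by component type.
type rla struct {
	managers map[string]HardwareManager
}

func (r rla) SubmitUpgrade(componentType, id, version string) (string, error) {
	m, ok := r.managers[componentType]
	if !ok {
		return "", fmt.Errorf("no manager for component type %q", componentType)
	}
	return m.UpgradeFirmware(id, version)
}

func main() {
	r := rla{managers: map[string]HardwareManager{
		"nvswitch":   nvswitchManager{},
		"powershelf": powerShelfManager{},
	}}
	out, _ := r.SubmitUpgrade("nvswitch", "sw-01", "2.4.1")
	fmt.Println(out) // nvswitch sw-01 -> 2.4.1
}
```

Adding a new hardware type (say, a CDU) means registering one more entry in the map, not touching the submission path.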

4. Advisory Locks for Task Conflict Resolution

What it is: PostgreSQL advisory locks (pg_advisory_lock) are used to serialize task submissions per rack, preventing concurrent conflicting operations on the same hardware.

Where it appears: rla/internal/task/conflict/ - the ConflictResolver acquires advisory locks before checking for conflicts, and the ConflictPromoter auto-promotes waiting tasks.

What problem it solves: Two admins might simultaneously submit a power-off and firmware-upgrade targeting the same rack. Without serialization, both could start, leading to hardware damage or corrupted firmware.

What would break without it: Race conditions between task submissions. Regular database transactions only prevent data conflicts, not logical conflicts between concurrent hardware operations.

Alternatives: Application-level mutexes (single-instance only), distributed locks (Redis/etcd), optimistic concurrency with version columns. Advisory locks were chosen because they're built into PostgreSQL, support per-rack granularity, and automatically release on connection close.
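One detail worth noting: pg_advisory_lock takes a bigint key, so a string rack identifier must first be mapped to a stable 64-bit value. A common way to do this (not necessarily the exact scheme the repository uses) is to hash the rack ID:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// rackLockKey maps a rack identifier to a stable signed 64-bit key suitable
// for PostgreSQL advisory locks. The lock itself would then be taken with
//   SELECT pg_advisory_lock($1)
// and is released automatically if the session's connection closes.
func rackLockKey(rackID string) int64 {
	h := fnv.New64a()
	h.Write([]byte(rackID))
	return int64(h.Sum64()) // reinterpret as signed for PostgreSQL bigint
}

func main() {
	// Same rack always yields the same key; distinct racks almost never collide.
	fmt.Println(rackLockKey("rack-a17") == rackLockKey("rack-a17")) // true
	fmt.Println(rackLockKey("rack-a17") == rackLockKey("rack-b03")) // false
}
```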

5. Handler Dependency Injection Pattern

What it is: Each HTTP handler is a struct with constructor injection: NewXxxHandler(dbSession, temporalClient, siteClientPool, config). No global state or service locator.

Where it appears: All 86 handler files in api/pkg/api/handler/. Route registration in api/pkg/api/routes.go instantiates handlers with shared dependencies.

What problem it solves: Each handler explicitly declares its dependencies, making it testable (mock any dependency) and clear what external systems a given endpoint touches.

What would break without it: Global state or implicit service locators make it impossible to run handlers in isolation, lead to hidden coupling, and make the dependency graph opaque.

Alternatives: Wire/dig dependency injection frameworks, global singleton services, context-based injection. The manual constructor pattern was chosen for explicitness and Go idiom compliance.

6. Dual Database Layer (Bun ORM + Raw pgx)

What it is: The main API uses Bun ORM for type-safe queries and model mapping, while performance-critical paths (advisory locks, bulk operations) drop down to raw pgx on the same underlying connection pool.

Where it appears: db/pkg/db/ provides the session (pgx pool), Bun wraps it for ORM queries. Advisory locks use raw SQL. RLA and hardware managers also use Bun for their own schemas.

What problem it solves: Bun provides type safety and query building for standard CRUD, while pgx gives full control for PostgreSQL-specific features (advisory locks, LISTEN/NOTIFY, custom types).

What would break without it: Using only an ORM would lose access to PostgreSQL-specific features. Using only raw SQL would require manual model mapping and lose type safety.
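The statements that motivate the raw-pgx escape hatch are exactly the PostgreSQL-specific ones an ORM query builder does not model. Illustrative examples (not taken verbatim from the repository):

```sql
-- Per-rack serialization; a transaction-scoped advisory lock is released
-- automatically at COMMIT or ROLLBACK.
SELECT pg_advisory_xact_lock($1);

-- PostgreSQL pub/sub; LISTEN/NOTIFY has no ORM equivalent.
LISTEN task_events;
NOTIFY task_events, 'task promoted';
```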


Invariants

State Machine Constraints

Database Constraints

Authorization Invariants

Concurrency & Consistency

Idempotency


Failure Modes

Firmware Upgrade Flow

| Step | What Can Fail | Current Handling | Correctness |
|---|---|---|---|
| JWT Validation | Expired token, invalid signature, JWKS endpoint down | 401 Unauthorized returned immediately; JWKS is cached with a TTL. | Correct |
| RLA gRPC Call | RLA service unavailable, network timeout | gRPC returns UNAVAILABLE; API returns 503 to the client. | Correct |
| Identifier Resolution | Serial number not found in inventory | Task creation fails with a descriptive error; no partial task created. | Correct |
| Advisory Lock Acquisition | Lock held by a crashed process | Advisory locks auto-release on connection close; timeout after a configurable wait. | Correct |
| Temporal Workflow Start | Temporal cluster unavailable | Task remains in 'pending' state; executor retries with backoff. | Partial: task may stay pending indefinitely if Temporal is down long-term |
| BMC Firmware Flash | BMC unreachable, firmware corrupt, power loss during flash | Activity retries (3 attempts); on final failure the task is marked 'failed', but the BMC may be left in an inconsistent state. | Partial: no automatic rollback of partial firmware writes |
| Conflict Promoter | Promoter goroutine crashes, DB connection lost | Waiting tasks have a TTL (queue_expires_at); expired tasks are not promoted, and the next task submission re-triggers the promotion check. | Correct: TTL prevents indefinite waiting |

Instance Provisioning Flow

| Step | What Can Fail | Current Handling | Correctness |
|---|---|---|---|
| IP Allocation (IPAM) | Subnet exhausted, IPAM service down | Transaction rolled back; instance not created; client gets a 500 or a specific error. | Correct: atomic with the DB transaction |
| Machine Selection | No available machines matching the instance type | Handler returns 409 Conflict with a descriptive message. | Correct |
| OS Installation | PXE boot failure, image download timeout, disk error | Temporal activity retries; after max retries, instance status is set to Error. | Correct, but the machine may need manual recovery |
| Phone-Home Callback | Network misconfiguration, callback endpoint unreachable | Temporal workflow has an activity timeout; an instance stuck in Configuring eventually times out to Error. | Partial: the timeout may be too long, leaving the instance in limbo |

Cascading Failures


Architecture Strengths

  1. Temporal-based durability: Hardware operations survive process crashes, restarts, and network partitions. Workflows resume exactly where they left off.
  2. Clean conflict resolution: The advisory lock + queue + TTL pattern prevents dangerous concurrent operations on hardware while allowing controlled queuing.
  3. Multi-tenant isolation: Organization-scoped URLs, JWT-enforced access, and DB-level tenant isolation prevent cross-tenant data leaks.
  4. Extensible hardware support: The sidecar pattern makes it straightforward to add new hardware types (CDU, UMS) without modifying the core RLA or API services.
  5. Comprehensive observability: OpenTelemetry tracing, Prometheus metrics, structured logging (Zerolog), Sentry error tracking, and audit trails provide full visibility.
  6. Strong CI/CD pipeline: 100+ linters, automated security scanning (TruffleHog, Trivy), integration tests with real PostgreSQL, and Docker image builds ensure code quality.

Architecture Risks

  1. Single PostgreSQL dependency: All services share one PostgreSQL instance (different schemas but same cluster). A database outage is a total system failure. Consider read replicas or per-service databases for critical paths.
  2. No firmware rollback automation: If a firmware flash fails mid-write, the BMC may be in an inconsistent state. There's no automated recovery path; manual intervention is required.
  3. Temporal as single point of failure for operations: While Temporal itself is designed for HA, the deployment depends on a single Temporal cluster. If it's misconfigured or under-provisioned, all async operations stall.
  4. Advisory lock scope is per-rack only: If a component moves between racks (physically relocated), the lock granularity doesn't protect against conflicts during the transition period.
  5. 110+ migrations in the main DB: The main infrastructure database has grown to 110+ migrations. Schema evolution is becoming complex; consider migration squashing or versioned schema snapshots.