Power-Resilience by Design: Making a Stateful Kubernetes Fleet Survive Losing Its GPU Nodes
The Setup
In a heterogeneous cluster, not every node is always on. Ours has an always-on storage and control tier and a GPU compute tier that is deliberately powered down most of the time — compute is expensive to idle, and the workloads that need it are bursty. That design has a sharp edge: any service whose persistent state, or whose scheduling, depends on a GPU node becomes silently unusable the moment that tier sleeps. A green dashboard at 2pm tells you nothing about what happens at 2am when the GPU nodes are dark.
Power-resilience, in other words, has to be engineered in. It cannot be assumed, and — crucially — it cannot be confirmed by looking. This post is about finding every hidden dependency on the compute tier, removing it, and then proving the result by physically powering the tier off.
1. Audit Before You Touch Anything
The dependency inventory came from a read-only, multi-agent audit run against the live cluster. Eight discovery lenses swept every Deployment, StatefulSet, PVC, StorageClass, and scheduling constraint, then a completeness critic whose only job was to ask "what did the eight lenses miss?" reviewed the result, and finally gap-fill verifiers followed. A companion drain-resilience audit covered all 79 live workloads across 42 namespaces, classifying each by what would happen if its node drained. Most were fine, but a meaningful tail carried a read-write-once volume detach risk, a hard node pin, or a missing anti-affinity rule.
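The three drain-risk classes named above can be sketched as a small classifier over workload manifests. This is a minimal illustration, not the audit tooling itself: the function name, the risk labels, and the `pvc_access_modes` lookup are all hypothetical, and the input is assumed to be the plain-dict form of a Deployment or StatefulSet (for example, parsed from `kubectl get -o json`). StatefulSet `volumeClaimTemplates` are not handled here.

```python
from typing import Any

HOSTNAME_LABEL = "kubernetes.io/hostname"


def drain_risks(workload: dict[str, Any],
                pvc_access_modes: dict[str, list[str]]) -> list[str]:
    """Return drain-risk labels for one Deployment/StatefulSet manifest.

    pvc_access_modes maps a claim name to its access modes (hypothetical
    lookup, built separately from the namespace's PVC objects).
    """
    spec = workload.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})
    risks: list[str] = []

    # Hard node pin: spec.nodeName or a hostname nodeSelector ties the pod
    # to exactly one machine.
    selector = pod_spec.get("nodeSelector") or {}
    if pod_spec.get("nodeName") or HOSTNAME_LABEL in selector:
        risks.append("hard-node-pin")

    # RWO detach risk: any mounted claim that is ReadWriteOnce.
    for volume in pod_spec.get("volumes") or []:
        claim = (volume.get("persistentVolumeClaim") or {}).get("claimName")
        if claim and "ReadWriteOnce" in pvc_access_modes.get(claim, []):
            risks.append("rwo-detach")
            break

    # Missing anti-affinity: more than one replica, nothing spreading them.
    affinity = pod_spec.get("affinity") or {}
    if spec.get("replicas", 1) > 1 and not affinity.get("podAntiAffinity"):
        risks.append("no-anti-affinity")

    return risks
```

A workload that returns an empty list is in the "most were fine" bucket; anything else belongs to the tail that needs a closer look.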
The total movable persistent state turned out to be under 150 GiB — trivially absorbed by the storage tiers, which had terabytes of headroom across replicated-NVMe, replicated-SSD, and bulk HDD-backed CephFS pools. The problem was never capacity. It was placement.
2. The Footguns the Critic Caught
Two findings justify the entire "audit first" discipline, because a naive remediation pass would have walked straight into both.
The default-StorageClass landmine. The cluster's *default* StorageClass was a node-local provisioner. Any chart that omitted an explicit storageClassName silently provisioned its volume node-local — and if that pod happened to schedule on a GPU node, its state was now stranded on a machine that is off most of the day. This is exactly how a cache replica and a vulnerability scanner had quietly pinned themselves to the compute tier. Nobody chose that placement; the default chose it for them.
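A check for this landmine might look like the sketch below. It relies on the `storageclass.kubernetes.io/is-default-class` annotation that Kubernetes uses to mark a default StorageClass; the function names are hypothetical. Note that the second function is meant for the PVC manifests as written in charts or in the repo: on live PVC objects the default is normally already filled in at admission, so the omission is no longer visible there.

```python
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_storage_classes(storage_classes):
    """Names of StorageClass objects annotated as the cluster default."""
    return [
        sc["metadata"]["name"]
        for sc in storage_classes
        if (sc["metadata"].get("annotations") or {}).get(DEFAULT_ANNOTATION) == "true"
    ]


def claims_relying_on_default(pvc_manifests):
    """Source PVC manifests that omit storageClassName entirely.

    An explicit empty string is NOT an omission: it asks for no class at
    all, so only a missing key is reported.
    """
    return [
        f'{pvc["metadata"].get("namespace", "default")}/{pvc["metadata"]["name"]}'
        for pvc in pvc_manifests
        if "storageClassName" not in pvc.get("spec", {})
    ]
```

Every claim the second function reports inherits whatever the first function returns, which is exactly the placement nobody chose.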
The triple-anchored cronjob. A reporting cronjob was pinned to a GPU node three different ways at once: a nodeSelector *plus* two hostPath mounts. The obvious fix — delete the nodeSelector — would have been a silent-data-loss trap: the job would have rescheduled onto a node where those hostPaths were empty directories, and quietly lost its accumulated report history. (The completeness critic also noticed the job had already been failing unnoticed for two weeks, and that its upstream data writer had been dead far longer.) The right fix was to move the state onto a shared volume *first*, seeding it from the original node, and only then unpin.
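The "count every anchor before removing any" rule can be expressed in a few lines. This is a sketch under assumptions: the helper names are made up for illustration, and the input is a CronJob manifest as a plain dict, whose pod spec lives at `spec.jobTemplate.spec.template.spec`.

```python
def cronjob_pod_spec(cronjob):
    """Pod spec nested inside a CronJob manifest."""
    return cronjob["spec"]["jobTemplate"]["spec"]["template"]["spec"]


def node_anchors(pod_spec):
    """List everything in a pod spec that ties it to a particular node."""
    anchors = []
    for key, value in (pod_spec.get("nodeSelector") or {}).items():
        anchors.append(f"nodeSelector:{key}={value}")
    if pod_spec.get("nodeName"):
        anchors.append(f"nodeName:{pod_spec['nodeName']}")
    for volume in pod_spec.get("volumes") or []:
        if "hostPath" in volume:
            anchors.append(f"hostPath:{volume['hostPath']['path']}")
    return anchors


def safe_to_unpin(pod_spec):
    """Scheduling pins may go only once no hostPath state remains."""
    return not any(a.startswith("hostPath:") for a in node_anchors(pod_spec))
```

For the cronjob described above, `node_anchors` would report three entries and `safe_to_unpin` would return `False` until both hostPath mounts had been replaced by a shared volume.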
A third correction was almost comic: a workload everyone assumed was GPU-bound actually ran a CPU-only image and was needlessly stranded on the sleeping tier. In fact, no node in the fleet even advertised a GPU to Kubernetes, because the device plugin was not installed, so *every* Kubernetes workload on the compute tier was relocatable, and only the bare-metal serving processes were inherently node-bound.
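Whether a workload is genuinely GPU-bound can be checked rather than assumed. The sketch below reads node and pod manifests as plain dicts; the function names are hypothetical, and the resource name `nvidia.com/gpu` is an assumption, since the vendor is not stated here. Substitute whatever extended resource name the relevant device plugin would register.

```python
def nodes_advertising(nodes, resource="nvidia.com/gpu"):
    """Names of nodes whose status.allocatable lists a non-zero `resource`."""
    found = []
    for node in nodes:
        allocatable = node.get("status", {}).get("allocatable", {})
        # Extended resources are reported as integer strings, e.g. "2".
        if int(allocatable.get(resource, "0")) > 0:
            found.append(node["metadata"]["name"])
    return found


def requests_resource(pod_spec, resource="nvidia.com/gpu"):
    """True if any container in the pod spec sets a limit on `resource`."""
    for container in pod_spec.get("containers") or []:
        limits = (container.get("resources") or {}).get("limits") or {}
        if resource in limits:
            return True
    return False
```

If `nodes_advertising` returns an empty list, no pod can be holding a GPU through the scheduler, whatever its name or namespace suggests.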
3. The Remediation
- Widen scheduling, don't just unpin. Roughly a dozen workloads had their required node affinity broadened from "compute tier only" to "compute *or* storage tier," with hostname pins removed, so the scheduler can place them wherever there is room.
- Move stateful volumes to Ceph. Node-local volumes were migrated onto replicated-SSD and CephFS-backed StorageClasses, with the existing data copied across first so nothing was lost; one redundant cache replica was simply retired in favour of its authoritative master on the always-on tier.
- Migrate the cronjobs onto shared storage with their history preserved, breaking the triple anchor for good.
- Disarm the landmine permanently. The cluster default StorageClass was flipped from the node-local provisioner to a Ceph-backed replicated class, and the change was written into the in-repo manifests so that a future chart that forgets to name a StorageClass lands on durable shared storage rather than on a sleeping node.
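The first step in the list, widening rather than merely unpinning, can be sketched as a pure function over a pod spec. Everything named here is illustrative: the function name, the tier label key `example.com/tier`, and its values are hypothetical stand-ins for whatever labels the cluster actually uses. The sketch removes only a hostname `nodeSelector`; hostname pins expressed inside `matchExpressions` would need separate handling.

```python
import copy

HOSTNAME_LABEL = "kubernetes.io/hostname"
REQUIRED = "requiredDuringSchedulingIgnoredDuringExecution"


def widen_node_affinity(pod_spec, label_key, extra_value):
    """Return a copy of pod_spec with hostname pins removed and every
    required `label_key In [...]` expression extended by `extra_value`."""
    spec = copy.deepcopy(pod_spec)

    # Drop the hostname pin, and the nodeSelector itself if nothing is left.
    selector = spec.get("nodeSelector") or {}
    selector.pop(HOSTNAME_LABEL, None)
    if selector:
        spec["nodeSelector"] = selector
    else:
        spec.pop("nodeSelector", None)

    # Broaden "compute only" to "compute or <extra_value>".
    node_affinity = (spec.get("affinity") or {}).get("nodeAffinity") or {}
    required = node_affinity.get(REQUIRED) or {}
    for term in required.get("nodeSelectorTerms") or []:
        for expr in term.get("matchExpressions") or []:
            if expr.get("key") == label_key and expr.get("operator") == "In":
                if extra_value not in expr["values"]:
                    expr["values"].append(extra_value)
    return spec
```

Calling `widen_node_affinity(spec, "example.com/tier", "storage")` leaves the input untouched and returns a spec the scheduler can place on either tier, which is easier to review than an in-place patch.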
Every change was both applied live *and* written back into the canonical in-repo manifests — the only honest way to fix drift is to edit source to match reality, not to blindly re-apply manifests that may be behind what is running.
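Finding where source lags behind reality is a diffing problem. The sketch below is one naive way to do it, assuming both the in-repo manifest and the live object are available as plain dicts; the `drift` helper is hypothetical. It compares lists as whole values and does not filter out server-populated fields such as `status` or `metadata.resourceVersion`, so in practice it would be pointed at a sub-tree like the pod template rather than the whole object.

```python
def drift(repo, live, path=""):
    """Yield (path, repo_value, live_value) wherever the two trees differ.

    A key missing on one side is reported with None on that side.
    """
    if isinstance(repo, dict) and isinstance(live, dict):
        for key in sorted(set(repo) | set(live)):
            child = f"{path}.{key}" if path else key
            yield from drift(repo.get(key), live.get(key), child)
    elif repo != live:
        yield (path, repo, live)
```

Each tuple it yields is a decision to make: either the repo value is stale and should be edited to match, or the live value is an accident and should be reverted.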
4. Proof by Power-Off
Acceptance was not a green dashboard. Both GPU nodes were cordoned, drained, and then physically powered off via their baseboard controllers until the chassis reported power off. With the compute tier dark, the cluster held: zero Pending, Unschedulable, or Error pods; storage HEALTH_OK (the compute nodes host no storage daemons, so there was no cascade); and monitoring intact, with no node exporter wedging on a now-dead network mount. The nodes were then powered back on, rejoined Ready, and the GPU serving stack auto-restarted.
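The pod half of that acceptance check is easy to automate. The following sketch scans pod objects, as plain dicts in the shape of `kubectl get pods -A -o json` items, for the three states listed above. The function name is hypothetical, and "Error" is approximated here as a `Failed` phase or a container terminated with reason `Error`, which is an assumption about how that state was judged. It says nothing about storage or monitoring health, which need their own checks.

```python
def blocking_pods(pods):
    """Return (namespace/name, state) for pods that fail the acceptance bar."""
    problems = []
    for pod in pods:
        meta, status = pod["metadata"], pod.get("status", {})
        name = f'{meta.get("namespace", "default")}/{meta["name"]}'

        # The scheduler records an unplaceable pod on the PodScheduled condition.
        unschedulable = any(
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
            for cond in status.get("conditions") or []
        )
        errored = any(
            ((cs.get("state") or {}).get("terminated") or {}).get("reason") == "Error"
            for cs in status.get("containerStatuses") or []
        )

        if unschedulable:
            problems.append((name, "Unschedulable"))
        elif status.get("phase") == "Pending":
            problems.append((name, "Pending"))
        elif status.get("phase") == "Failed" or errored:
            problems.append((name, "Error"))
    return problems
```

With the compute tier powered off, the pass condition is simply that `blocking_pods` returns an empty list.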
That is the whole point. You do not get to *claim* power-resilience. You prove it by cutting the power and watching nothing break.
The Lessons
A default StorageClass is a silent placement policy — make it a durable one, not a node-local accident. Read-write-once volumes detach on reschedule and carry a real stale-lock hazard, so relocation order matters. A single one-line "fix" can be a data-loss trap when state is anchored in more than one place; a completeness critic that asks "what did we miss?" earns its keep. And the only acceptance test for graceful degradation is the ungraceful event itself.