Director of GPU Fleet Operations
Gruve
About Gruve
Gruve is an innovative software services startup dedicated to transforming enterprises into AI powerhouses. We specialize in cybersecurity, customer experience, cloud infrastructure, and advanced technologies such as Large Language Models (LLMs). Our mission is to help customers use their data to make smarter business decisions. As a well-funded early-stage startup, Gruve offers a dynamic environment with strong customer and partner networks.
About the Role
Gruve is a rapidly growing company enabling NEO Clouds to deliver GPU-as-a-Service and AI infrastructure to AI-native startups, enterprises, and research organizations. Our distributed fleet of GPU clusters spans colocation facilities, edge sites, and modular data centers globally, operating at the intersection of high-performance computing, AI infrastructure, and cloud-scale automation.
We are seeking a Director of GPU Fleet Operations to own the end-to-end lifecycle, reliability, and performance of our global GPU fleet, along with adjacent CPU and high-performance storage clusters. This leader will drive strategy, execution, and scaling of hardware and infrastructure operations for thousands of GPUs across distributed environments, building remote operations teams, advancing automation, and leveraging AI to create a highly reliable, self-healing GPU cloud platform.
Key Responsibilities
Fleet Strategy & Operations
- Own operational readiness, uptime, and performance of the global GPU fleet.
- Define and implement operational standards across OEM platforms (NVIDIA, Cisco, Dell, Supermicro, and others), GPU servers (NVIDIA, AMD, XPUs), and high-speed networking (InfiniBand/RoCE).
- Standardize operations across liquid- and air-cooled environments, colocation sites, and modular data centers.
- Establish global processes for provisioning, monitoring, maintenance, incident response, and lifecycle management.
Hardware Lifecycle & Reliability
- Build and manage the full hardware lifecycle from deployment through retirement, leveraging outsourced resources for remote site operations.
- Develop scalable processes for diagnostics, RMA coordination, spare-parts forecasting, and reliability engineering.
- Define and track fleet SLOs/SLAs including availability, MTTR, MTBF, and utilization.
Remote Operations & NOC Leadership
- Build and lead a 24×7 global remote operations organization.
- Develop a remote-first model to manage distributed clusters.
- Implement standardized runbooks, escalation paths, and observability across hardware, performance, power, cooling, and environmental telemetry.
Software & Platform Maintenance
- Partner with Platform/DevOps teams to maintain cluster software stacks (Kubernetes, Slurm, Kubeflow).
- Oversee GPU drivers, firmware, CUDA stack, and configuration automation.
- Own patching, upgrades, change management, and low-impact maintenance practices.
- Manage platform layers operating above Kubernetes, including agent infrastructure.
AI-Driven Operations (AIOps)
- Lead adoption of AI/ML for predictive failure detection, anomaly detection, alert triage, and automated remediation.
- Build toward an autonomous, self-healing GPU fleet through data-driven automation.
Vendor & Field Coordination
- Manage OEM and repair vendor relationships and enforce SLAs.
- Coordinate global field technicians and remote hands support.
Capacity & Customer Operations
- Partner with Customer Success and Capacity Planning teams to ensure GPU availability and performance.
- Support large-scale deployments, escalations, and on-premises customer installations.
Team Leadership & Scaling
- Hire and lead teams across hardware operations, reliability engineering, NOC, and automation engineering.
- Establish KPIs, dashboards, and operational reporting to support rapid growth.
Basic Qualifications
- 10+ years of experience in infrastructure, data center, or cloud operations.
- 5+ years managing distributed hardware fleets or large-scale compute environments.
- Experience operating GPU, HPC, or high-performance compute clusters.
- Proven experience leading 24×7 operations teams.
- Strong technical understanding of:
  - GPU servers and accelerator infrastructure
  - High-speed networking (InfiniBand/RoCE)
  - Linux systems and hardware troubleshooting
  - Cluster orchestration (Kubernetes, Slurm, or similar)
  - Monitoring and observability platforms
  - Hardware lifecycle management and RMA processes
  - Incident response and SRE practices
- Experience applying automation or AI-driven approaches such as AIOps, telemetry analytics, predictive maintenance, and self-healing workflows.
Preferred Qualifications
- Experience working in GPU Cloud, Neo Cloud, or AI infrastructure environments.
- Familiarity with liquid-cooled data centers.
- Experience managing distributed edge or modular data center deployments.
- Background in Site Reliability Engineering, Reliability Engineering, or HPC operations.
- Experience building automation in hyperscale or large-fleet environments.
- Demonstrated ability to scale global technical teams and operate effectively in fast-growth startup settings.
Salary Range
$245,000 - $250,000 + Performance Bonus + Equity
Why Gruve
At Gruve, we foster a culture of innovation, collaboration, and continuous learning. We are committed to building a diverse and inclusive workplace where everyone can thrive and contribute their best work. If you’re passionate about technology and eager to make an impact, we’d love to hear from you.
Gruve is an equal opportunity employer. We welcome applicants from all backgrounds and thank all who apply; however, only those selected for an interview will be contacted.