Cloud Infrastructure and Service Orchestrator Architect
Gruve
About Gruve
Gruve is an innovative software services startup dedicated to transforming enterprises into AI powerhouses. We specialize in cybersecurity, customer experience, cloud infrastructure, and advanced technologies such as Large Language Models (LLMs). Our mission is to help our customers use their data to make more intelligent business decisions. As a well-funded early-stage startup, Gruve offers a dynamic environment with strong customer and partner networks.
About the Role
We are seeking an expert Data Center & Cloud Infrastructure and Service Orchestrator Architect to design and implement the service orchestration layer that will deploy and manage diverse workloads on top of our multi-region cloud infrastructure. This role focuses on creating the intelligent orchestration systems that automate the deployment, scaling, and management of applications, databases, AI/ML services, and other cloud services. This position is being hired to support a Gruve customer and may require on-site work at the customer’s location.
Key Responsibilities
Service Orchestration Platform Design
- Design comprehensive service orchestration platforms for automated workload deployment and management
- Architect API-driven service provisioning systems with self-service capabilities
- Design multi-tenant service isolation and resource allocation frameworks
- Create service lifecycle management systems including deployment, scaling, updates, and decommissioning
Workload Orchestration Architecture
- Design orchestration systems for diverse workload types:
  - Virtual machine provisioning and management
  - Container orchestration using Kubernetes
  - Database service deployment (SQL, NoSQL, distributed databases)
  - Message queue services (Kafka, RabbitMQ, Apache Pulsar)
  - GPU-accelerated AI/ML services and model inference platforms
  - Large Language Model (LLM) fine-tuning and inference services, similar to AWS Bedrock
AI/ML Service Orchestration
- Architect AI/ML pipeline orchestration for model training, validation, and deployment
- Design GPU resource scheduling and allocation systems for distributed training
- Create model serving infrastructure with auto-scaling and load balancing
- Design MLOps platforms for continuous integration and deployment of ML models
- Architect LLM inference services with dynamic scaling and cost optimization
Service Discovery and Integration
- Design service mesh architectures for microservices communication
- Architect API gateway and service proxy solutions
- Create service discovery, configuration management, and secrets management systems
- Design inter-service communication patterns and protocols
Automation and DevOps Integration
- Design CI/CD pipelines integrated with service orchestration platforms
- Architect GitOps workflows for declarative service management
- Create policy-based governance and compliance automation
- Design cost management and resource optimization automation
Basic Qualifications
- 12+ years of experience building and managing distributed systems and service orchestration architectures
- 8+ years of hands-on expertise in Kubernetes and container orchestration
- Experience designing and deploying AI/ML infrastructure using platforms like Kubeflow and model serving tools such as NVIDIA Triton or TorchServe
- Proficiency in Go and Python, including automation scripting in Bash or Python
- Bachelor's degree in Computer Science or a related field, with strong system design and architecture capabilities
Preferred Qualifications
- You have experience building PaaS or IaaS offerings for internal or external developer platforms.
- You are familiar with edge computing paradigms and serverless technologies, including function-as-a-service frameworks.
- You hold certifications such as CKA, CKAD, CKS, or equivalent credentials from major cloud providers.
- You’ve worked on cloud cost optimization initiatives and have hands-on experience with FinOps practices.
- You hold a Master’s degree in distributed systems, computer science, or a closely related discipline.
Why Gruve
At Gruve, we foster a culture of innovation, collaboration, and continuous learning. We are committed to building a diverse and inclusive workplace where everyone can thrive and contribute their best work. If you’re passionate about technology and eager to make an impact, we’d love to hear from you.
Gruve is an equal opportunity employer. We welcome applicants from all backgrounds and thank all who apply; however, only those selected for an interview will be contacted.