Machine Learning Systems Engineer

Inception Labs

Software Engineering

San Francisco, CA, USA

Posted 6+ months ago

Bay Area

Engineering

In office

Full-time

About Us:

Inception is a generative AI startup. Leveraging breakthrough AI research, we are training next-generation large language models (LLMs) powered by diffusion. Unlike existing autoregressive models, which output only one token at a time, diffusion LLMs can output many tokens in parallel. This makes them several times faster and lets them use additional test-time compute to improve quality. They also enable fine-grained control over their outputs, so that outputs adhere to specific schemas and semantic constraints, and they provide a unified paradigm for combining language with other data modalities, including audio, images, and video.

Our team is led by Stefano Ermon (co-inventor of diffusion models, FlashAttention, and DPO; faculty at Stanford), Aditya Grover (co-inventor of node2vec and decision transformers; faculty at UCLA), and Volodymyr Kuleshov (prev. co-founder and CTO at Afresh Technologies; faculty at Cornell), and includes engineers from Google DeepMind, Meta AI, Microsoft AI, and OpenAI. We are in the process of deploying our models at Fortune 500 companies.

Role Overview:

We are looking for ML Systems Engineers with a strong background in distributed systems, infrastructure engineering, and machine learning operations. In this role, you will work on designing and implementing the infrastructure that powers our ML training and inference systems. You will collaborate with ML researchers and engineers to build efficient, reliable, and scalable systems that enable the development and deployment of state-of-the-art LLMs.

Key Responsibilities:

Design and implement distributed training infrastructure for large-scale machine learning models
Build and optimize high-performance model serving systems for low-latency inference
Develop automated pipelines for data preprocessing, model training, and deployment
Create monitoring and observability solutions for ML systems in production
Optimize infrastructure costs and resource utilization across GPU clusters
Design and implement efficient data storage and retrieval systems for ML workloads
Collaborate with ML researchers to translate theoretical requirements into practical system designs

Qualifications:

BS/MS/PhD in Computer Science, Engineering, or a related field (or equivalent experience)
Strong software engineering fundamentals and systems design principles
Extensive experience with distributed systems and cloud computing platforms (AWS/GCP/Azure)
Proficiency in Python and at least one systems programming language (C++/Rust/Go)
Experience with containerization (Docker), orchestration (Kubernetes), and CI/CD pipelines
Understanding of ML frameworks (PyTorch, TensorFlow) from a systems perspective
Familiarity with high-performance computing and GPU programming (CUDA)

Preferred Skills:

Experience building and maintaining large-scale ML training clusters
Knowledge of ML serving frameworks (vLLM, TensorRT, ONNX Runtime)
Familiarity with distributed training techniques (data parallel, model parallel, pipeline parallel)
Experience with ML workflow orchestration tools (Kubeflow, Airflow)
Background in performance optimization and profiling of ML systems
Knowledge of ML-specific infrastructure challenges (checkpointing, resource scheduling, etc.)
Experience with MLOps practices and tooling

Why Join Us:

Impact: Build the infrastructure that enables next-generation AI development
Innovation: Solve complex distributed systems challenges in the ML domain
Growth: Shape the architecture of our ML platform from the ground up

Perks & Benefits:

Competitive salary and equity in a rapidly growing startup
Flexible vacation and paid time off (PTO)
Health, dental, and vision insurance
Professional development budget for conferences and courses
Access to the latest GPU hardware and cloud resources
A collaborative and inclusive culture where your voice matters

This is an exciting opportunity to join a startup at the forefront of LLM development! If you're ready to build the systems that power the future of AI, apply today.

We are an equal opportunity employer and encourage candidates of all backgrounds to apply.

Req ID: R3
