Senior Software Engineer, Cluster Orchestration
As a Senior Software Engineer in Cluster Orchestration at CoreWeave, you will be instrumental in advancing the company's orchestration platform, including SUNK (Slurm on Kubernetes), which underpins large-scale AI training and inference workloads. This role offers the opportunity to shape a critical layer of the AI cloud, ensuring seamless, reliable, and efficient operation across extensive GPU clusters. By developing systems that eliminate infrastructure bottlenecks and introduce new orchestration capabilities, you will directly empower customers to innovate more rapidly and expand the possibilities of AI applications.
In this position, you will own multiple services within the orchestration platform, lead design and code reviews, break projects down into actionable milestones, and drive measurable improvements in system reliability and performance. Your responsibilities will include defining Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for your services, strengthening operational practices, and mentoring junior engineers. Your work will deliver sustained improvements in throughput, latency, and overall system resilience for CoreWeave's customers.
The ideal candidate will have approximately 3 to 5 years of professional experience in software engineering, specifically in building distributed systems or cloud services. Proficiency in Go is essential, with additional experience in Python or C++ being advantageous. A solid foundation in computer science principles is required, along with hands-on experience managing Kubernetes at production scale. Familiarity with observability tools such as Prometheus, Grafana, and OpenTelemetry is important, as is a proven ability to improve service reliability and performance using metrics like P95/P99 latency, throughput, and error budgets.
Preferred qualifications include experience with orchestration and workflow technologies such as Ray, Kubeflow, Kueue, Istio, Knative, or Argo Workflows. Knowledge of distributed workloads, GPU-based applications, or machine learning pipelines is beneficial, as is an understanding of scheduling concepts like quota enforcement, preemption, and scaling strategies. Exposure to reliability practices, including setting SLOs, configuring alarms, and conducting post-incident reviews, is also desirable.
CoreWeave offers a competitive compensation package, with a base salary range of $139,000 to $204,000, determined by job-related knowledge, skills, experience, and market location. The total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program. Benefits encompass medical, dental, and vision insurance fully paid by CoreWeave, company-paid life insurance, short- and long-term disability insurance, flexible spending and health savings accounts, tuition reimbursement, participation in the Employee Stock Purchase Program (ESPP), mental wellness benefits, family-forming support, paid parental leave, flexible childcare support, a 401(k) plan with a generous employer match, flexible paid time off, and a casual work environment focused on innovative disruption.
At CoreWeave, we work hard, have fun, and move fast. We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values: Be Curious at Your Core, Act Like an Owner, Empower Employees, Deliver Best-in-Class Client Experiences, and Achieve More Together. We support an entrepreneurial outlook and independent thinking, fostering a collaborative environment that provides the opportunity to develop innovative solutions to complex problems. As we get set for takeoff, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!