Staff ML Performance Engineer (Training Efficiency)
Wayve is seeking a Staff ML Performance Engineer to join its Training Tech team in Sunnyvale, California. This role focuses on optimizing large-scale machine learning jobs to improve training efficiency so that the company can scale its models effectively. Founded in 2017, Wayve is a leader in Embodied AI technology. It develops AI software and foundation models that enable vehicles to perceive, understand, and navigate complex environments, improving the usability and safety of automated driving systems.
The primary responsibilities of this position include profiling machine learning workloads to identify bottlenecks using tools like NVIDIA Nsight Systems, and designing and implementing efficiency improvements that maximize machine utilization and throughput through methods such as parallelism, model compilation, and mixed precision. The role also involves developing observability tools to track metrics like machine utilization, throughput, and latency, as well as creating benchmarking tools to monitor efficiency gains or regressions. Collaboration with research teams is essential to integrate training efficiency improvements and foster a culture of performance optimization.
Candidates should possess over 10 years of industry experience in performance engineering across machine learning systems, GPU compute infrastructure, distributed platforms, or similar fields. Experience in optimizing large-scale jobs on GPU compute clusters and working within platform teams alongside research teams is crucial. Proficiency in writing, reporting, and tracking performance benchmarks in an accessible manner is required, along with the ability to write high-quality, well-structured, and tested Python code. A Bachelor’s or Master’s degree in Machine Learning, Computer Science, Engineering, or a related technical discipline, or equivalent experience, is necessary.
Preferred qualifications include experience with concurrent, parallel, and distributed computing, familiarity with system profilers like NVIDIA Nsight Systems, and expertise in implementing GPU kernels using CUDA, Triton, or similar technologies. A solid understanding of computing fundamentals, including factors that contribute to code efficiency, security, and reliability, is also advantageous.
Wayve offers a dynamic and inclusive work environment that values diversity and encourages continuous learning and innovation. Employees have the opportunity to work on cutting-edge AI technologies that are shaping the future of autonomous driving. The company fosters a culture of collaboration and performance optimization, providing growth opportunities for individuals passionate about making a positive impact in the field of self-driving technology.